Week 1 discussion: Overview
Adam Omidpanah
Biost / Epi 518
January 3, 2011
In a nutshell, regression is a systematic way of identifying trends in experimental data in the presence of errors. Variables which are measured or experimentally controlled and modify the outcome variable are called independent variables, regressors, predictors, etc. Trends refer to patterns in the outcome variable observed as a function of the regressors. Errors arise from an effectively infinite number of unobservable variables which mediate the outcome of interest: instrument calibration, weather patterns, moods, etc. Linear regression is the process of determining a linear trend: recall that a line is defined by its slope and intercept.
Some notation

When we conduct an experiment, we have an outcome variable Y and a regressor X which are related in some way. Linear regression has an intercept α and a slope β to describe the trend. The mean model for the outcome variable is:

E[Y | X = x] = α + βx

read: the expected value of Y given X = x is alpha plus beta times x. Alternately:

Y = α + βX + ε

where ε is an error term.
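The decomposition Y = α + βX + ε can be made concrete with a tiny simulation. This is a sketch in Python for illustration only; the values of α and β and the standard normal errors are made up, not from the slides:

```python
import random

random.seed(42)

alpha, beta = 2.0, 0.5                  # illustrative intercept and slope
x = [i / 10 for i in range(100)]        # regressor values
eps = [random.gauss(0, 1) for _ in x]   # unobservable error terms

# Each observation is the mean model E[Y | X = x] plus its error term.
y = [alpha + beta * xi + ei for xi, ei in zip(x, eps)]

# The error is exactly the gap between Y and E[Y | X = x]:
gap = y[3] - (alpha + beta * x[3])
print(abs(gap - eps[3]) < 1e-12)  # True
```

The point of the simulation: ε is not something we measure, it is simply whatever is left over after the linear trend is removed.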
Essential to the statistician’s toolkit:
Regression has two important applications: as a data summary measure and as an inferential tool. When summarizing data, the slope parameter is interpreted as the expected change in the outcome variable for a unit increase in the regressor. Said differently:

β = E[Y | X = x + 1] − E[Y | X = x]

In inference, regression can be used to test for the association of two random variables. If they are not associated, then we would expect that β = 0. Hence large values of |β| are evidence of a strong association.
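The identity above is just arithmetic on the mean model, and it holds at every value of x. A minimal check in Python (the α and β values are illustrative):

```python
alpha, beta = 2.0, 0.5  # illustrative intercept and slope

def mean_model(x):
    """E[Y | X = x] under the linear mean model."""
    return alpha + beta * x

# A unit increase in x shifts the expected outcome by exactly beta,
# no matter where on the line we start.
for x in [-3.0, 0.0, 1.5, 10.0]:
    diff = mean_model(x + 1) - mean_model(x)
    assert abs(diff - beta) < 1e-9
print("unit increase in x changes E[Y|X] by beta =", beta)
```

This is why the slope is such a convenient one-number data summary: the interpretation does not depend on the starting value of x.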
Straight line assumption versus population averaged trend
Many people believe, incorrectly, that if the true trend isn't linear then a linear regression model is useless. The population-averaged trend is still a meaningful quantity even in this case. The distribution of X, however, is important and may influence your estimated trend. What types of biases are you familiar with that arise from different sampling strategies?
Sampling strategies

[Figure: two scatter plots of f(x) versus x under a nonlinear trend. Sampling higher X values yields a larger fitted slope; sampling lower X values yields a smaller fitted slope. Below each panel, a histogram (frequency versus x) shows the distribution of X.]
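The figure's point can be reproduced numerically. Below is a sketch with hypothetical data, using f(x) = x² to stand in for a nonlinear trend: sampling low X values and high X values from the same curve gives very different fitted slopes, yet each is a valid population-averaged trend for its own sampling region.

```python
def ols_slope(xs, ys):
    """Least squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def f(x):
    return x ** 2  # a nonlinear "true" trend, for illustration

low_x = [0, 1, 2, 3, 4]    # sampling strategy favoring low X
high_x = [6, 7, 8, 9, 10]  # sampling strategy favoring high X

print(ols_slope(low_x, [f(x) for x in low_x]))    # 4.0
print(ols_slope(high_x, [f(x) for x in high_x]))  # 16.0
```

Neither slope is "wrong"; each describes the average trend over the region of X actually sampled, which is exactly why the distribution of X matters.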
The theory
Linear regression is one of the oldest forms of regression: basically, it’s finding the line of best fit. Gauss (1821) nailed down a mathematical approach called “least squares” where the line of best fit minimizes squared vertical distances: errors, or differences between predicted values and observed values. Gauss also showed least squares was, in some sense, the best way to do this. Biostatisticians: see Gauss-Markov theorem.
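The closed-form least squares solution is short enough to sketch from scratch. A minimal illustration in Python on toy data (the data values are made up):

```python
def least_squares(xs, ys):
    """Closed-form least squares fit; returns (intercept, slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    return alpha, beta

def sse(xs, ys, alpha, beta):
    """Sum of squared vertical distances (errors) from the line."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

xs, ys = [0, 1, 2, 3], [1, 3, 2, 5]  # toy data
a, b = least_squares(xs, ys)          # both come out to about 1.1 here

# Gauss's criterion: any other line has a larger sum of squared errors.
best = sse(xs, ys, a, b)
assert best <= sse(xs, ys, a + 0.1, b)
assert best <= sse(xs, ys, a, b + 0.1)
```

The two assertions at the end are the defining property of the fit: perturbing either the intercept or the slope can only increase the sum of squared errors.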
The least squares fit

[Figure: scatter plot with the least squares line. Sum of squared errors: 12.05]
Some other fit

[Figure: the same scatter plot with a different line. Sum of squared errors: 17.64]
More theory
The parameter β is never known for sure, so we denote estimates of β with little hats: β̂. These estimates have associated degrees of uncertainty, or standard errors. When:

β̂ > Z_{1−α/2} · √(var(β̂))

we conclude that there is an association.

Exercise: if large values of the slope β̂ suggest strong associations, what kind of evidence does the standard error √(var(β̂)) give us?
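The classical normal-theory version of this comparison can be sketched end to end. This is an illustration on toy, made-up data, using the usual estimate var(β̂) = σ̂²/Sxx with σ̂² = SSE/(n − 2):

```python
import math

# Toy data (illustrative, not from the slides)
xs, ys = [0, 1, 2, 3], [1, 3, 2, 5]
n = len(xs)

xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
beta_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
alpha_hat = ybar - beta_hat * xbar

# Residual variance estimate uses n - 2 degrees of freedom
# (two parameters estimated: intercept and slope).
sse = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = sse / (n - 2)

# var(beta_hat) = sigma^2 / Sxx, so the standard error is:
se_beta = math.sqrt(sigma2_hat / sxx)

z = beta_hat / se_beta
print(round(z, 3), z > 1.96)  # compare to Z_{0.975} = 1.96
```

With only four points one would use a t quantile rather than 1.96 in practice; the normal quantile here follows the slide's notation.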
The technology

This course primarily uses Stata (cheap), though many users may prefer R (free). Other software worth knowing, at least by name: SAS (expensive) and SPSS (bad). R is not for the faint of heart!

Some tips for burgeoning programmers:
- Save all your work in .do files.
- Use many, many comments in your work; lines beginning with * are comments in Stata, similarly # in R.
- Become intimate with help.
- Lots and lots of Google.
- http://www.ats.ucla.edu/stat/stata/webbooks/reg/ is a good resource for regression and Stata help!
Reading the data

Stata users can read in .dta files with ease, given the versions are consistent!

> * STATA: read in the data
> use http://students.washington.edu/omida/orange.dta

R users can read .dta files with the help of the foreign package:

> # R: importing stata datasets
> library(foreign)
> read.dta('http://students.washington.edu/omida/orange.dta')

Note: for those who tunnel or batch run Stata over the Linux cluster, the Linux version is Stata 10 while HSL uses Stata 11. You can read .dta files in R and resave them in a format consistent with Stata 10. (A bit difficult for young hackers, but I learned this to avoid paying for Stata.)
Looking at the data

Information on the growth of orange trees (circumference, mm) against age (days).

> * examine a ``head'' of the data
> list in 1/10
> * create a scatter plot
> scatter circumference age
Doing the regression

> * The regression command
> regress circumference age

Exercise: interpret the output following age. What does this say about the relationship between age and growth in circumference of orange trees? What is _cons?
Residual plots

Linear models give rise to predicted (fitted) values Ŷ and residuals r. Examining plots of fitted values versus residuals can be informative. They should appear completely independent: a structureless point cloud.

> * do resid vs fitted plot
> rvfplot, yline(0)
What do you observe?
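Two algebraic facts explain why a healthy residual plot looks like a cloud centered on zero: when the model includes an intercept, least squares residuals always sum to zero and are exactly uncorrelated with the fitted values. A quick check on toy, made-up data (Python for illustration):

```python
# Toy data (illustrative)
xs, ys = [0, 1, 2, 3], [1, 3, 2, 5]
n = len(xs)

xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
beta = sxy / sxx
alpha = ybar - beta * xbar

fitted = [alpha + beta * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# Residuals sum to zero ...
assert abs(sum(resid)) < 1e-9
# ... and have zero sample covariance with the fitted values,
# so any visible pattern in the plot signals a model violation.
assert abs(sum(r * f for r, f in zip(resid, fitted))) < 1e-9
print("residuals:", [round(r, 3) for r in resid])
```

Since these two identities hold for every least squares fit, any trend or funnel shape you see in rvfplot must come from a violated assumption (nonlinearity, unequal variance), not from the fitting procedure.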
Summary

Linear regression is one of the most important scientific tools of the century. When researchers say "I want to know if X and Y are correlated", linear regression is a knee-jerk reflex. Be wary of linear regression skeptics!

Linear regression can be used to analyze many types of data, including ordered categorical data (ordinal regression) like Likert responses, count data, and even truncated data: interpretations are precarious but inference is valid!

Ideal sailing conditions for regression include symmetric, evenly distributed error terms, but these are not necessary!
Stata Lab

1. Download or directly source the orange trees dataset from http://students.washington.edu/omida/orange.dta
2. Perform the regression above and obtain the same output. Use the following syntax and obtain different results: > regress circumference age, robust. Search the web for "robust standard errors" or read about the robust option in > help regress. Recheck the residual plot and explain any differences between the models.
3. Try > predict dfb, dfbeta(age) and then produce a histogram of this variable using > hist dfb. Type > help dfbeta or search the web to learn what dfbetas are.
4. Try > table dfb age to get a cross tabulation of dfbetas against the main regressor, age. What does this tell you about the trend of the data? Is the linear model a good fit? How reliably can we predict growth in older orange trees?
R Lab: page 1/2

1. Load the datasets package. Type >ls('package:datasets') to see all the contents of this package. We will use Orange.
2. Type >with(Orange, plot(age, circumference, col=Tree)) and then >by(Orange, Orange$Tree, function(Oran) with(Oran, lines(age, circumference, col=Tree))). What do you observe about the trend?
3. Use a similar by loop to calculate the regression coefficient and its standard error for each level of Tree.
4. Install the sandwich package. Type >vignette('sandwich') to learn about this package. Fit and store the population linear model: >fit <- lm(circumference ~ age, data=Orange)

R Lab: page 2/2

1. Add >require(sandwich) to the top of your script so that R automatically loads the sandwich package if you haven't.
2. Save this code in a special place, so that you can do robust standard error regression from here on!
3. Call >dfbetas(fit) and also >dfbeta(fit). What is the difference between these two?
4. Compare the sum of squared dfbeta for the age coefficient to its robust and non-robust standard errors. Google "jackknife statistics". Did you just create a new standard error estimator?