Chapter 16: Simple Linear Regression and Correlation
- Regression analysis: used to predict the value of one variable on the basis of other variables
- Dependent variable: the variable to be forecast
- Independent variables: variables the practitioner believes are related to the dependent variable (x1, x2, ..., xk)
- Correlation analysis: determines whether a relationship exists
Model
- Deterministic model: an equation that allows us to determine the value of the dependent variable from the values of the independent variables
- Probabilistic model: a method to represent the randomness in the relationship
- e is the error variable: accounts for all variables, measurable and immeasurable, that are not part of the model
  o Its value varies from one observation to the next even if x remains constant
- First-order linear model (simple linear regression model): y = B0 + B1x + e, where y is the dependent variable, B0 the y-intercept, B1 the slope coefficient of the independent variable x, and e the error variable
- x and y must be interval data
Estimating the Coefficients
- Draw a random sample from the population of interest
- Calculate sample statistics to estimate B0 and B1
- The estimators are based on drawing a straight line through the sample data; the least squares line comes closest to the sample data points
  o y-hat (the predicted/fitted value of y) = b0 + b1x
  o b0 and b1 are calculated so that the sum of squared deviations is minimized
  o y-hat on average comes closest to the observed values of y
  o Least squares method: produces the straight line that minimizes the sum of the squared differences between the points and the line
  o b0 and b1 are unbiased estimators of B0 and B1
  o Residuals: the deviations between the actual data points and the line, ei = yi - yi-hat
  o The residuals are observations of the error variable: the differences between the observed values yi and the fitted values yi-hat
- The minimized sum of squared deviations is called the sum of squares for error (SSE)
- Note: we can't determine the value of y-hat for a value of x that is far outside the range of the sample values of x
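The least squares calculations above can be sketched in Python; the x and y values below are made-up sample data for illustration, not from the notes:

```python
import numpy as np

# Hypothetical sample data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = y_fitted = b0 + b1 * x   # fitted values
residuals = y - y_hat            # ei = yi - yi-hat
SSE = np.sum(residuals ** 2)     # minimized sum of squared deviations
```

A property worth noting: the least squares residuals always sum to zero, which is why the fitted line "on average comes closest" to the observed y values.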
Error Variable: Required Conditions
- Required conditions for the error variable e:
  1. The probability distribution of e is normal
  2. The mean of the distribution is 0: E(e) = 0
  3. The standard deviation of e is sigma-e, which is constant regardless of the value of x
  4. The value of e associated with any particular value of y is independent of the e associated with any other value of y
- From conditions 1, 2, and 3: for each value of x, y is a normally distributed random variable with mean E(y) = B0 + B1x and standard deviation sigma-e
  o The mean depends on x; the standard deviation is constant for all values of x
  o For each x, y is normally distributed with the same standard deviation
Observational and Experimental Data
- The objective is to see how the independent variable is related to the dependent variable
- When the data are observational, both variables are random variables (we don't need to specify which is dependent and which is not)
- In that case the two variables must be bivariate normally distributed
Assessing the Model
- The least squares method produces the best straight line
  o There may still be no relationship, or a nonlinear relationship, between the two variables
- Three ways to assess the model: the standard error of estimate, the t-test of the slope, and the coefficient of determination
Sum of Squares for Error
- The least squares method determines the coefficients that minimize the sum of squared deviations between the points and the line defined by the coefficients
Standard Error of Estimate
- se = sqrt(SSE / (n - 2)) estimates sigma-e
- If sigma-e is large, some of the errors will be large
  o This means the model's fit is poor
- If sigma-e is small, the errors tend to be close to the mean of 0
  o The model is a good fit
- When SSE = 0, all the points fall on the regression line
- Judge the size of se by comparing it to the values of the dependent variable y, or to the sample mean y-bar
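A minimal sketch of the standard error of estimate, using made-up sample data (assumed for illustration) and numpy's built-in least squares fit:

```python
import numpy as np

# Hypothetical sample (assumed data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)            # least squares slope and intercept
SSE = np.sum((y - (b0 + b1 * x)) ** 2)  # sum of squares for error

# Standard error of estimate: n - 2 degrees of freedom
# (two coefficients, b0 and b1, were estimated from the sample)
s_e = np.sqrt(SSE / (n - 2))

# Judge the fit by comparing s_e with the scale of y, e.g. the sample mean
ratio = s_e / y.mean()
```

Here s_e is small relative to y-bar, which by the comparison rule in the notes indicates a good fit.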
Testing the Slope
- When the line is horizontal, we estimate the same value of y-hat no matter what value of x is used
- Then y is not linearly related to x: B1 = 0
- The null hypothesis states that no linear relationship exists: H0: B1 = 0
- We normally perform a two-tail test to see whether there is sufficient evidence to infer that a linear relationship exists
- H1: B1 ≠ 0
- Test statistic: t = b1 / s(b1), with n - 2 degrees of freedom
- Failing to reject the null hypothesis does not necessarily mean that no relationship exists
  o It may be that a nonlinear relationship exists
  o The standard error of b1 decreases when the sample size increases or when the variance of the independent variable increases
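The two-tail t-test of the slope can be sketched as follows; the data are assumed for illustration, and scipy is used for the t distribution:

```python
import numpy as np
from scipy import stats

# Hypothetical sample (assumed data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
SSE = np.sum((y - (b0 + b1 * x)) ** 2)
s_e = np.sqrt(SSE / (n - 2))

# Standard error of the slope: s(b1) = s_e / sqrt(sum((x - x_bar)^2))
s_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))

# Two-tail test of H0: B1 = 0 vs H1: B1 != 0, d.f. = n - 2
t = b1 / s_b1
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
```

Note how s_b1 shrinks as n grows or as the x values spread out, which is the point made above about the standard error of b1.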
One-Tail Tests
- Used when testing specifically for a positive or a negative linear relationship
- To test for a positive relationship: H0: B1 = 0 and H1: B1 > 0 (for a negative relationship, H1: B1 < 0)
Coefficient of Determination
- Useful for measuring the strength of the linear relationship
- Denoted R^2: the proportion of the variation in the dependent variable explained by the variation in the independent variable
- Some of the variation in y is explained by changes in x
- The difference between yi and yi-hat is the residual, the part unexplained by the variation in x
- Variation in y = SSE + SSR
  o SSE measures the variation in y that remains unexplained
  o SSR measures the variation in y that is explained by the variation in the independent variable x
- R^2 = SSR / (SSE + SSR) measures the proportion of the variation in y that can be explained by the variation in x
- The coefficient of determination has no critical value; there is no formal test on it
- The higher the value, the better the model fits the data
- It provides a measure of the strength of the relationship
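The SSE/SSR partition and R^2 can be computed directly; the sample data are assumed for illustration:

```python
import numpy as np

# Hypothetical sample (assumed data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)          # unexplained variation
SSR = np.sum((y_hat - y.mean()) ** 2)   # explained variation
SST = np.sum((y - y.mean()) ** 2)       # total variation = SSE + SSR

R2 = SSR / SST  # proportion of variation in y explained by x
```

In simple (one-x) regression, R^2 also equals the square of the Pearson coefficient of correlation between x and y, which connects this section to the correlation test below.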
Developing an Understanding of Statistical Concepts
- Partition the variation of the dependent variable into two sources: variation explained by variation in the independent variable, and unexplained variation
Cause-and-Effect Relationship
- A causal relationship cannot be inferred from statistics alone
- Example: the more one smokes, the greater the chance of getting lung cancer
  o This does not prove that smoking causes cancer
  o It shows only that smoking and lung cancer are somehow related
  o Medical investigation established the causal connection
- The coefficient of determination measures the amount of variation in y that is explained by x (not caused by x)
- Statistical analysis can show that a relationship exists; it cannot show that one variable causes another
Testing the Coefficient of Correlation
- Called the Pearson coefficient of correlation
- Used to measure the strength of association between two variables
- Can be used to test for a linear relationship between two variables
  o When the data are observational and the two variables are bivariate normally distributed, calculate the coefficient of correlation to test for linear association
- The population parameter is denoted rho
- No linear relationship: rho = 0
  o H0: rho = 0
  o H1: rho ≠ 0
Violation of the Required Condition
- If the bivariate normality requirement is not met, the t-test of rho is invalid
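A sketch of the test of rho, with assumed observational data; scipy's pearsonr reports both r and the two-tail p-value, and the equivalent t statistic is shown for comparison:

```python
import numpy as np
from scipy import stats

# Hypothetical observational sample (assumed data): both variables random
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Pearson coefficient of correlation and two-tail test of H0: rho = 0
r, p_value = stats.pearsonr(x, y)

# Equivalent t statistic: t = r * sqrt((n - 2) / (1 - r^2)), d.f. = n - 2
t = r * np.sqrt((n - 2) / (1 - r ** 2))
```

With the same data, this t equals the t from the test of the slope B1; the two tests always agree in simple regression.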
When to use certain formulas
- Use the t-test of B1 when:
  o We are interested in discovering the relationship between two variables
  o We conducted an experiment in which we controlled the values of the independent variable
- Use the t-test of rho when:
  o We are interested only in determining whether two random variables are linearly related
Using the Regression Equation
- Point prediction: y-hat is the point estimate/predicted value of y for a given value of x
Predicting a Particular Value of y for a Given x
- Used when we want to predict a one-time occurrence of a particular value of the dependent variable when the independent variable takes a given value
- Interval: point estimator +/- a bound on the error of estimation
Estimating the Expected Value of y for a Given x
- The confidence interval estimate of the expected value of y is narrower than the prediction interval for the same given value of x and confidence level
- There is less error in estimating a mean than in predicting an individual value
Effect of the Given Value of x on the Intervals
- The farther the given value xg is from x-bar, the greater the estimation error becomes
- Both interval widths grow with the term (xg - x-bar)^2 / ((n - 1) * s_x^2)
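Both intervals can be sketched together; the data and the given value x_g are assumptions for illustration. The only difference between the two half-widths is the extra "1 +" inside the prediction interval's square root, which is why it is always wider:

```python
import numpy as np
from scipy import stats

# Hypothetical sample (assumed data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x_g = 4.0                       # given value of x, inside the sample range
y_hat_g = b0 + b1 * x_g         # point prediction / point estimate
t_crit = stats.t.ppf(0.975, df=n - 2)   # for 95% intervals

# Distance term: grows as x_g moves away from x_bar
dist = (x_g - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))

# Prediction interval half-width for a particular value of y (wider)
pi_half = t_crit * s_e * np.sqrt(1 + 1 / n + dist)
# Confidence interval half-width for the expected value E(y) (narrower)
ci_half = t_crit * s_e * np.sqrt(1 / n + dist)
```

Trying larger values of x_g shows the shared dist term inflating both intervals, as the notes describe.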
Regression Diagnostics I: Residual Analysis
- Standardize the residuals: subtract their mean and divide by their standard deviation
- The mean of the residuals is 0
- The simplest estimate of their standard deviation is se
  o Standardized residual for point i: ei / se
- Analysis of the residuals helps determine:
  o Whether the error variable is non-normal
  o Whether the error variance is constant
  o Whether the errors are independent
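The standardization step above, using the simple ei/se form and assumed sample data:

```python
import numpy as np

# Hypothetical sample (assumed data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)   # least squares residuals sum to 0

# Simplest standardization: divide each residual by s_e
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))
standardized = residuals / s_e

# Standardized residuals far beyond about +/-2 would be flagged for inspection
```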
Heteroscedasticity
- Occurs when the variance of the error variable, sigma-e^2, is not constant
- Plot the residuals against the predicted values of y and look for a change in the spread of the plotted points
Non-independence of the Error Variable
- Cross-sectional data: observations made at approximately the same time
- Time series: a set of observations taken at successive points in time
  o With time-series data, the errors are often correlated; such errors are called autocorrelated or serially correlated
  o Graph the residuals against the time periods; if a pattern emerges, the independence requirement is violated
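The notes suggest a visual check; as a numerical companion (not named in the notes), the lag-1 autocorrelation and the Durbin-Watson statistic can quantify the same pattern. The residual sequence below is made up to show positive autocorrelation:

```python
import numpy as np

# Hypothetical time-series residuals (assumed values, in time order)
residuals = np.array([0.5, 0.4, 0.1, -0.2, -0.4, -0.3, 0.1, 0.4, 0.5, 0.2])

# Lag-1 sample autocorrelation: values well above 0 suggest serially
# correlated (autocorrelated) errors
r1 = np.sum(residuals[1:] * residuals[:-1]) / np.sum(residuals ** 2)

# Durbin-Watson statistic: near 2 suggests independent errors,
# near 0 suggests positive first-order autocorrelation
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
```

For this smooth, wave-like residual sequence both measures point the same way: r1 is clearly positive and dw is well below 2, so the independence requirement would be suspect.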
Outliers
- Observations that are unusually small or large; possible causes:
  o An error was made in recording the value
  o The point should not have been included in the sample
  o The observation is unusually large or small but belongs to the sample and was recorded properly
Procedure for Regression Diagnostics
- Fit the line with the least squares method and assess the model's fit
- If the fit is good, use the regression equation to predict and estimate