Chapter 16: Simple Linear Regression and Correlation


- Regression analysis: used to predict the value of one variable on the basis of other variables
- Dependent variable: the variable to be forecast
- Independent variables: the variables the practitioner believes are related to the dependent variable (x1, x2, ..., xk)
- Correlation analysis: determines whether a relationship exists

Model

- Deterministic model: an equation that allows us to determine the value of the dependent variable exactly from the values of the independent variables
- Probabilistic model: a method to represent the randomness in the relationship
- e is the error variable: it accounts for all variables, measurable and immeasurable, that are not part of the model
  o Its value varies from one 'sale' to the next even if 'x' remains constant
- First-Order Linear Model (simple linear regression model): y = B0 + B1x + e, where y is the dependent variable, B0 is the y-intercept, B1 is the slope, x is the independent variable, and e is the error variable
- x and y must be interval (quantitative) data

Estimating the Coefficients

- Draw a random sample from the population of interest
- Calculate sample statistics to estimate B0 and B1
- The estimators are based on drawing a straight line through the sample data; the least squares line comes closest to the sample data points
  o y-hat (the predicted/fitted value of y) = b0 + b1x
  o b0 and b1 are calculated so that the sum of squared deviations is minimized
  o y-hat on average comes closest to the observed values of y
  o Least squares method: produces the straight line that minimizes the sum of the squared differences between the points and the line (see the sketch after this list)
  o b0 and b1 are unbiased estimators of B0 and B1
  o Residuals: the deviations between the actual data points and the line, ei
  o ei = yi - yi-hat
  o The residuals are observations of the error variable
  o The minimized sum of squared deviations is called the sum of squares for error (SSE)
  o The residuals are the differences between the observed values yi and the fitted values yi-hat
- Note: we can't determine the value of y-hat for a value of x that is far outside the range of the sample values of x
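A minimal Python sketch of the least squares calculations above; the data values for x and y are made up purely for illustration:

    # Least squares estimates of B0 and B1 from a sample.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (illustrative)
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])   # dependent variable (illustrative)

    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    b0 = y_bar - b1 * x_bar                                            # y-intercept

    y_hat = b0 + b1 * x           # fitted values
    residuals = y - y_hat         # ei = yi - yi-hat
    SSE = np.sum(residuals ** 2)  # the minimized sum of squares for error
    print(b0, b1, SSE)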

Error Variable: Required Conditions

- Required conditions for the error variable:
  1. The probability distribution of e is normal
  2. The mean of the distribution is 0: E(e) = 0
  3. The standard deviation of e is sigma-e, which is constant regardless of the value of x
  4. The value of e associated with any particular value of y is independent of the e associated with any other value of y
- From conditions 1, 2, and 3: for each value of x, y is a normally distributed random variable with mean E(y) = B0 + B1x and standard deviation sigma-e
  o The mean depends on x; the standard deviation is constant for all values of x
  o For each x, y is normally distributed with the same standard deviation (simulated in the sketch below)
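A small simulation showing what the four conditions mean in practice; the parameter values (B0 = 2, B1 = 1.5, sigma-e = 1) are assumed purely for illustration:

    # For each x, y is normal with mean B0 + B1*x and constant sd sigma_e.
    import numpy as np

    rng = np.random.default_rng(1)
    beta0, beta1, sigma_e = 2.0, 1.5, 1.0  # assumed illustrative parameters

    x = np.linspace(0, 10, 50)
    e = rng.normal(0.0, sigma_e, size=x.size)  # normal, mean 0, constant sd, independent draws
    y = beta0 + beta1 * x + e                  # the first-order linear model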

Observational and Experimental Data

- The objective is to see how the independent variable is related to the dependent variable
- When the data are observational, both variables are random variables (we don't need to specify which is dependent and which is not)
- The two variables must be bivariate normally distributed

Assessing the Model

- The least squares method produces the best straight line
  o Even so, there may be no relationship, or a nonlinear relationship, between the two variables
- Three tools are used to assess the model: the standard error of the estimate, the t-test of the slope, and the coefficient of determination

Sum of Squares for Error

- The least squares method determines the coefficients that minimize the sum of squared deviations between the points and the line defined by the coefficients

Standard Error of Estimate

- If sigma-e is large, some of the errors will be large
  o This means the model's fit is poor
- If sigma-e is small, the errors tend to be close to the mean of 0
  o The model is a good fit
- When SSE = 0, all the points fall on the regression line
- Judge the size of se by comparing it to the values of the dependent variable y, or to the sample mean y-bar (see the sketch below)
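A sketch of the standard error of estimate, using the usual formula se = sqrt(SSE / (n - 2)); the data are the same illustrative values as before:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    n = x.size

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = s_xy / s_x^2
    b0 = y.mean() - b1 * x.mean()
    SSE = np.sum((y - (b0 + b1 * x)) ** 2)

    s_e = np.sqrt(SSE / (n - 2))  # standard error of estimate
    print(s_e, y.mean())          # judge s_e against the scale of y-bar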

Testing the Slope

- When the line is horizontal, no matter what value of x is used, we estimate the same value for y-hat
- y is not related to x when B1 = 0
- The null hypothesis says that no linear relationship exists: H0: B1 = 0
- We normally perform a two-tail test to see whether there is sufficient evidence to infer that a linear relationship exists
- H1: B1 ≠ 0
- If the null hypothesis is true, it does not necessarily mean that no relationship exists
  o The standard error of b1 decreases when the sample size increases or the variance of the independent variable increases
  o If the alternative hypothesis is true, a linear or a nonlinear relationship may exist

One-Tail Tests

- Used when trying to test for a positive or a negative linear relationship
- To test for a positive linear relationship: H0: B1 = 0 and H1: B1 > 0 (use H1: B1 < 0 to test for a negative one); the sketch below shows both the two-tail and one-tail versions
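A sketch of the t-test of the slope, using the standard formula s_b1 = se / sqrt((n - 1)s_x^2) for the standard error of b1; the data are illustrative:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
    n = x.size

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

    s_b1 = s_e / np.sqrt((n - 1) * np.var(x, ddof=1))  # standard error of b1
    t = (b1 - 0) / s_b1                                # test statistic under H0: B1 = 0

    p_two_tail = 2 * stats.t.sf(abs(t), df=n - 2)      # H1: B1 != 0
    p_positive = stats.t.sf(t, df=n - 2)               # one-tail H1: B1 > 0
    print(t, p_two_tail, p_positive)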

Coefficient of Determination

- Useful for measuring the strength of the linear relationship
- Denoted R^2: the amount of variation in the dependent variable that is explained by variation in the independent variable
- Some of the variation in y is explained by changes in x
- The difference between yi and yi-hat is the residual, the part unexplained by variation in x
- Variation in y = SSE + SSR
  o SSE measures the variation in y that remains unexplained
  o SSR measures the variation in y that is explained by variation in the independent variable x
- R-square measures the proportion of the variation in y that can be explained by the variation in x (see the sketch below)
- The coefficient of determination does not have a critical value
- The higher the value, the better the model fits the data
- It tells us the strength of the relationship
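A sketch of the variation partition and R^2; the data are illustrative:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    SSE = np.sum((y - y_hat) ** 2)          # unexplained variation
    SSR = np.sum((y_hat - y.mean()) ** 2)   # variation explained by x
    SS_total = np.sum((y - y.mean()) ** 2)  # total variation = SSE + SSR

    r_squared = SSR / SS_total              # equivalently 1 - SSE / SS_total
    print(r_squared)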

Developing Understanding of Statistical Concepts

- We partition the variation of the dependent variable into two sources: variation explained by variation in the independent variable, and unexplained variation

Cause-and-Effect Relationship

- We can't infer a causal relationship from statistics alone
- Example: the more one smokes, the greater the chance of getting lung cancer
  o This does not prove that smoking causes cancer
  o It shows only that smoking and lung cancer are somehow related
  o Medical investigations established the connection
- The coefficient of determination measures the amount of variation in y that is explained by x (not caused by x)
- Statistical analysis can show that a relationship exists; it can't show that one variable causes another

Testing the Coefficient of Correlation

- Called the Pearson coefficient of correlation
- Used to measure the strength of association between two variables
- Can be used to test for a linear relationship between two variables
  o When the data are observational and the two variables are bivariate normally distributed, calculate the coefficient of correlation to test for linear association
- The population parameter is denoted rho
- No linear relationship means rho = 0
  o H0: rho = 0
  o H1: rho ≠ 0 (the test statistic is sketched below)
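A sketch of this test, using the standard statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom; the data are illustrative, and scipy.stats.pearsonr(x, y) performs the same test directly:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([1.8, 4.2, 5.7, 8.4, 9.6, 12.3])
    n = x.size

    r = np.corrcoef(x, y)[0, 1]                 # sample coefficient of correlation
    t = r * np.sqrt((n - 2) / (1 - r ** 2))     # test statistic under H0: rho = 0
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tail p-value for H1: rho != 0
    print(r, t, p_value)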

When to Use Certain Formulas

- Use the t-test of B1 when:
  o We are interested in discovering the relationship between two variables
  o We conducted an experiment in which we controlled the values of the independent variable
- Use the t-test of rho when:
  o We are interested only in determining whether two random variables are linearly related

Using the Regression Equation

- Point prediction: y-hat is the point estimate/predicted value of y when x equals a given value

Predicting a Particular Value of y for a Given x

- Used when we want to predict a one-time occurrence of a particular value of the dependent variable when the independent variable equals a given value
- The prediction interval is the point estimator +/- a bound on the error of estimation

Estimating the Expected Value of y for a Given x

- The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level
- There is less error in estimating a mean than in predicting an individual value

Effect of a Given Value of x on the Intervals

- The farther the given value xg is from x-bar, the greater the estimated error becomes
- Both intervals widen through the term (xg - x-bar)^2 / ((n - 1)s_x^2), as the sketch below shows
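A sketch contrasting the two intervals; the given value x_g = 3.5 and the 95% level are assumed for illustration, and the half-widths use the standard formulas built from the term above:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
    n = x.size

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

    x_g = 3.5                              # assumed given value of x
    y_hat = b0 + b1 * x_g                  # point prediction
    t_crit = stats.t.ppf(0.975, df=n - 2)  # 95% two-sided critical value
    d = 1 / n + (x_g - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))

    pi_half = t_crit * s_e * np.sqrt(1 + d)  # prediction interval half-width
    ci_half = t_crit * s_e * np.sqrt(d)      # narrower CI for E(y)
    print(y_hat - pi_half, y_hat + pi_half)  # prediction interval
    print(y_hat - ci_half, y_hat + ci_half)  # confidence interval for E(y)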

Regression Diagnostics - I

Residual Analysis

- Standardize the residuals: subtract the mean and divide by the standard deviation
- The mean of the residuals is 0
- The simplest estimate of their standard deviation is se
  o Standardized residual for point i: ei / se (see the sketch below)
- Analysis of the residuals helps determine whether the error variable is non-normal
  o Whether the error variance is constant
  o Whether the errors are independent
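A sketch of the simplest standardization, ei / se; the data are illustrative:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 9.9])
    n = x.size

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

    standardized = residuals / s_e  # values far beyond about +/-2 deserve a closer look
    print(standardized)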

Heteroscedasticity

- Occurs when the variance of the error variable, sigma-e squared, is not constant
- Plot the residuals against the predicted values of y and look for a change in the spread of the plotted points (as in the sketch below)
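A plotting sketch, assuming matplotlib is available; the data are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    plt.scatter(y_hat, y - y_hat)  # a funnel shape suggests non-constant variance
    plt.axhline(0, color="grey")
    plt.xlabel("predicted y")
    plt.ylabel("residual")
    plt.show()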

Non-independence of the Error Variable

- Cross-sectional data: observations made at approximately the same time
- Time series: a set of observations taken at successive points in time
  o With time-series data, the errors are often correlated; this is called autocorrelation or serial correlation
  o Graph the residuals against the time periods; if a pattern emerges, the independence requirement is violated (a quick numeric check follows)
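A quick numeric check to complement the time plot; the residuals here are made-up values assumed to be in time order:

    import numpy as np

    residuals = np.array([0.4, 0.6, 0.3, -0.2, -0.5, -0.4, 0.1, 0.5])  # illustrative

    # First-order autocorrelation of the residuals: values far from 0
    # suggest the independence requirement is violated.
    r1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
    print(r1)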

Outliers

- Observations that are unusually small or large; possible causes:
  o An error was made in recording the value
  o The point should not have been included in the sample
  o The observation really is unusually small or large, belongs to the sample, and was recorded properly

Procedure for Regression Diagnostics

- Least squares method: used to assess the model's fit
- Regression equation: used to predict and estimate