ADMS 3330 WINTER 2014 MIDTERM #1

Chapter 16: Simple Linear Regression and Correlation

Regression Analysis: used to predict the value of one variable (the dependent variable) on the basis of another variable (the independent variable).

16.1 Model
Type 1: Deterministic Model: a set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.
Type 2: Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
- To create this model, we start with a deterministic model, since it approximates the relationship we want, and then add a random term that measures the error of the deterministic component.

Example: The cost of building a new house is about $75 per square foot and most lots sell for about $25,000. Hence the approximate selling price (y) would be:
y = $25,000 + ($75/ft²)(x)   (where x is the size of the house in square feet)
- House size (x) is the independent variable and house price (y) is the dependent variable.
- Probabilistic model: y = 25,000 + 75x + ɛ, where ɛ is the error variable. It is the difference between the actual selling price and the estimated price based on the size of the house. It will vary even if x is the same.

Simple Linear Regression Model (First-Order Linear Model): a straight-line model with one independent variable:
y = β0 + β1x + ɛ
β0 and β1 are population parameters, which are usually unknown and estimated from the data.

16.2 Estimating the Coefficients
We estimate β0 by b0 and β1 by b1, the y-intercept and slope of the least squares (regression) line:
b1 = Sxy / Sx²   and   b0 = ȳ − b1·x̄

b0 is the y-intercept and b1 is the slope.

Sum of Squares for Error (SSE) and Standard Error of Estimate (Sɛ)
- SSE = Σ(yi − ŷi)², used in the calculation of the standard error of estimate: Sɛ = √(SSE / (n − 2))
- If Sɛ is zero, all points fall on the regression line.
- Compare Sɛ with ȳ to judge how good the fit is.
Example: Sɛ = 4.5 and ȳ = 8.3. Interpretation: Sɛ appears large relative to ȳ, so the linear regression model fits poorly.
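A minimal sketch of these calculations in Python/numpy (the course output comes from Minitab or Excel; the house-price numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical data: house size x (square feet) and selling price y (dollars),
# loosely following the example above; the numbers are invented for illustration.
x = np.array([1000, 1500, 1800, 2200, 2600], dtype=float)
y = np.array([98000, 140000, 162000, 190000, 221000], dtype=float)

n = len(x)
s_xy = np.cov(x, y, ddof=1)[0, 1]      # sample covariance Sxy
s_x2 = np.var(x, ddof=1)               # sample variance Sx^2
b1 = s_xy / s_x2                       # slope
b0 = y.mean() - b1 * x.mean()          # y-intercept

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)         # Sum of Squares for Error
s_eps = np.sqrt(sse / (n - 2))         # standard error of estimate

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"SSE = {sse:.2f}, S_eps = {s_eps:.2f}, y-bar = {y.mean():.2f}")
```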

16.3 Error Variable: Required Conditions
To have a valid regression, four conditions for the error variable (ɛ) must hold:
• The probability distribution of ɛ is normal.
• The mean of the distribution is 0; that is, E(ɛ) = 0.
• The standard deviation of ɛ is σɛ, which is a constant regardless of the value of x.
• The value of ɛ associated with any particular value of y is independent of ɛ associated with any other value of y.

16.4 Assessing the Model

The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear. Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it "fits" the data. These evaluation methods are based on the sum of squares for error (SSE).

Testing the Slope
If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
1. We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. The hypotheses are:
H0: β1 = 0 and H1: β1 ≠ 0
t-test statistic: t = (b1 − β1) / Sb1, where Sb1 is the standard deviation (standard error) of b1. Degrees of freedom: n − 2.
We can also estimate (to some level of confidence) an interval for the slope parameter β1.
2. If we want to test for a positive or negative linear relationship, we conduct a one-tail test, i.e. the hypotheses become:
H0: β1 = 0, with H1: β1 < 0 for a negative slope or H1: β1 > 0 for a positive slope.

Coefficient of Determination
Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².
- The coefficient of determination is the square of the coefficient of correlation (r), hence R² = r².
Variation in y = SSE + SSR
SSE (Sum of Squares for Error) measures the amount of variation in y that remains unexplained (i.e. due to error).
SSR (Sum of Squares for Regression) measures the amount of variation in y explained by variation in the independent variable x.
R² = SSR / (SSR + SSE) = 1 − SSE / Σ(yi − ȳ)²
Interpretation: if R² has a value of .49, this means 49% of the variation in y is explained by the variation in x. The remaining 51% is unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions.
- The higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.

Coefficient of Correlation
We can use the coefficient of correlation to test for a linear relationship between two variables. The coefficient of correlation's range is between −1 and +1.
• If r = −1 (negative association) or r = +1 (positive association), every point falls on the regression line.
• If r = 0, there is no linear pattern.
The population coefficient of correlation is denoted ρ (rho). We estimate its value from sample data with the sample coefficient of correlation r = Sxy / (Sx·Sy).
The test statistic for testing whether ρ = 0 is: t = r·√((n − 2) / (1 − r²)). Degrees of freedom: n − 2.
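A quick sketch of the slope t-test and R² using scipy.stats.linregress (Python is used here only for illustration; the data values are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) sample, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

res = stats.linregress(x, y)        # returns b1 (slope), b0 (intercept), r, p-value, S_b1
t_stat = res.slope / res.stderr     # t = b1 / S_b1, testing H0: beta1 = 0
df = len(x) - 2
r_squared = res.rvalue ** 2         # coefficient of determination

print(f"b1 = {res.slope:.3f}, S_b1 = {res.stderr:.3f}")
print(f"t = {t_stat:.2f}, df = {df}, two-tail p-value = {res.pvalue:.4f}")
print(f"R^2 = {r_squared:.3f} (share of the variation in y explained by x)")
```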

The t-test of the coefficient of correlation is an alternate means to determine whether two variables are linearly related. The hypotheses are: H0: ρ = 0 (there is no linear relationship) and H1: ρ ≠ 0 (there is a linear relationship).

16.5 Using the Regression Equation
Point prediction: ŷ calculated for a given value of x. These predictions can also be expressed as intervals.

Prediction Interval: used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:
ŷ ± t(α/2, n−2) · Sɛ · √(1 + 1/n + (xg − x̄)² / ((n − 1)Sx²))
where xg is the given value of x we're interested in.
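A sketch of the prediction interval above and the confidence interval estimator that follows (Python for illustration; the data and function name are made up):

```python
import numpy as np
from scipy import stats

def intervals(x, y, x_g, alpha=0.05):
    """Prediction interval for one y value and confidence interval for E(y) at x = x_g.
    A sketch of the textbook-style formulas; x and y are 1-D numpy arrays."""
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x_g
    s_eps = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    spread = (x_g - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))
    half_pi = t_crit * s_eps * np.sqrt(1 + 1 / n + spread)   # predict one value
    half_ci = t_crit * s_eps * np.sqrt(1 / n + spread)       # estimate the mean
    return (y_hat - half_pi, y_hat + half_pi), (y_hat - half_ci, y_hat + half_ci)

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
pi, ci = intervals(x, y, x_g=3.5)
print("prediction interval:", pi)
print("confidence interval for E(y):", ci)   # always narrower than the prediction interval
```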

Confidence Interval Estimator of the expected value of y: we are estimating the mean of y for a given value of x:
ŷ ± t(α/2, n−2) · Sɛ · √(1/n + (xg − x̄)² / ((n − 1)Sx²))
- Used for infinitely large populations.
- The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level, because there is less error in estimating a mean value than in predicting an individual value.

16.6 Regression Diagnostics-1
Residual Analysis: examine the differences between the actual data points and those predicted by the linear equation.

Residual for point i: ei = yi − ŷi
Standardized residual for point i: ri = (yi − ŷi) / Sɛ
Standardized residual for point i (using Minitab): (yi − ŷi) / Sri
where the standard deviation of the ith residual is Sri = Sɛ·√(1 − hi), and
hi = 1/n + (xi − x̄)² / ((n − 1)Sx²)

Nonnormality: put the residuals into a histogram and look for a bell shape with a mean close to zero.
Heteroscedasticity: when the requirement of a constant variance is violated; the residual plot shows a funnel (tunnel) shape.
Nonindependence of the Error Variable: when the data are a time series, the errors are often correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
- We can detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated. Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
Outliers: an observation that is unusually small or unusually large. Possible reasons for the existence of outliers include:
• There was an error in recording the value.
• The point should not have been included in the sample.
• Perhaps the observation is indeed valid.
- Outliers can be identified from a scatter plot. If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further, since outliers can easily influence the least squares line.
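A sketch of the standardized-residual calculation and the |value| > 2 rule (Python for illustration; the data are invented, with one point placed far from the line):

```python
import numpy as np

def standardized_residuals(x, y):
    """Standardized residuals (y_i - yhat_i) / S_ri with S_ri = S_eps * sqrt(1 - h_i),
    following the simple-regression formulas above."""
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))
    h = 1 / n + (x - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))  # h_i
    return resid / (s_eps * np.sqrt(1 - h))

x = np.arange(1.0, 11.0)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 25.0, 16.1, 18.0, 19.8])  # 7th point is off the line
sr = standardized_residuals(x, y)
print("standardized residuals:", np.round(sr, 2))
print("suspected outliers (|standardized residual| > 2):", np.where(np.abs(sr) > 2)[0])
```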

Remember: Steps in Calculating the Least Squares Line
1. Calculate the sample covariance: Sxy
2. Calculate the sample variance of x: Sx²
3. Calculate the averages x̄ and ȳ
4. Calculate b1 and b0
5. Write the least squares (sample regression) line

Procedure for Regression Diagnostics

1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model's fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.

Chapter 17: Multiple Regression
Multiple Regression: allows for any number of independent variables.

17.1 Model and Required Conditions
- We now have k independent variables potentially related to the one dependent variable.
First-order linear equation: y = β0 + β1x1 + β2x2 + … + βkxk + ɛ
Required Conditions: for these regression methods to be valid, the following four conditions for the error variable (ɛ) must be met:
• The probability distribution of the error variable (ɛ) is normal.
• The mean of the error variable is 0.
• The standard deviation of ɛ is σɛ, which is a constant.
• The errors are independent.

17.2 Estimating the Coefficients and Assessing the Model
The sample regression equation is expressed as: ŷ = b0 + b1x1 + b2x2 + … + bkxk
We will use computer output to:
Assess the model: how well does it fit the data; is it useful; are any required conditions violated?
Employ the model: interpreting the coefficients; predictions using the prediction equation; estimating the expected value of the dependent variable.
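A minimal sketch of fitting a multiple regression by least squares (Python/numpy stands in for the Minitab/Excel output used in the course; the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2                                  # sample size and number of independent variables
X = rng.normal(size=(n, k))                   # simulated x1, x2
y = 5 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])         # add a column of 1s for b0
b, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least squares estimates b0..bk

y_hat = X1 @ b
sse = np.sum((y - y_hat) ** 2)
s_eps = np.sqrt(sse / (n - k - 1))            # standard error of estimate
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)    # coefficient of determination

print("b0..bk:", np.round(b, 3))
print(f"S_eps = {s_eps:.3f}, R^2 = {r2:.3f}")
```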

Regression Analysis Steps
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of the required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit: standard error of estimate, coefficient of determination, F-test of the analysis of variance. (Page 705)
4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.

Example: Assess the Model
1. Standard error of estimate, 2. Coefficient of determination, and 3. F-test of the analysis of variance.

Standard Error of Estimate: Sɛ = √(SSE / (n − k − 1)), where n = sample size and k = number of independent variables. Compare Sɛ to ȳ (the average of y).

Coefficient of Determination: R² = 1 − SSE / Σ(yi − ȳ)²

Interpretation: R² = .3374 means that 33.74% of the variation in income is explained by the eight independent variables, but 66.26% remains unexplained.
Adjusted R² value: the coefficient of determination adjusted for degrees of freedom. It takes into account the sample size n and k, the number of independent variables, and is given by:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
In this model the coefficient of determination adjusted for degrees of freedom is .3180.

Testing the Validity of the Model
In a multiple regression model (i.e. more than one independent variable), we utilize an analysis of variance technique to test the overall validity of the model. Here's the idea:
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid. If at least one βi is not equal to 0, the model does have some validity.
- ANOVA table for regression analysis: F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))
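The adjusted R² and the ANOVA F-statistic can be written directly in terms of R², n, and k. A sketch (the sample size below is a made-up placeholder, since these notes do not record n, so the printed values will not exactly match the .3180 quoted above):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2: float, n: int, k: int) -> float:
    """ANOVA F = (SSR / k) / (SSE / (n - k - 1)), rewritten in terms of R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# R^2 = .3374 and k = 8 come from the income example; n = 300 is a made-up placeholder.
print(round(adjusted_r2(0.3374, n=300, k=8), 4))
print(round(f_statistic(0.3374, n=300, k=8), 2))
```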

Rejection Region: reject H0 in favour of H1 if F > F(α, k, n−k−1).

Interpreting the Coefficients
Intercept: The intercept is b0 = −51,785. This is the average income when all the independent variables are zero. As we observed in Chapter 16, it is often misleading to try to interpret this value, particularly if 0 is outside the range of the values of the independent variables (as is the case here).
Age: The relationship between income and age is described by b1 = 461. From this number we learn that in this model, for each additional year of age, income increases on average by $461, assuming that the other independent variables in this model are held constant.
Education: The coefficient b2 = 4101 specifies that in this sample, for each additional year of education, income increases on average by $4101, assuming the constancy of the other independent variables.
Hours of work: The relationship between income and hours of work per week is expressed by b3 = 620. We interpret this number as the average increase in annual income for each additional hour of work per week, keeping the other independent variables fixed in this sample.
Spouse's hours of work: The relationship between annual income and a spouse's hours of work per week is described in this sample by b4 = −862, which we interpret to mean that for each additional hour a spouse works per week, income decreases on average by $862 when the other variables are constant.
Occupation prestige score: In this sample the relationship between annual income and occupation prestige score is described by b5 = 641. For each additional unit increase in the occupation prestige score, annual income increases on average by $641, holding all other variables constant.
Number of children: The relationship between annual income and number of children is expressed by b6 = −331, which tells us that in this sample, for each additional child, annual income decreases on average by $331.
Number of family members earning income: In this dataset the relationship between annual income and the number of family members who earn money is expressed by b7 = 687, which tells us that for each additional family member earner, annual income increases on average by $687, assuming that the other independent variables are constant.
Number of years with current job: The coefficient of the last independent variable in this model is b8 = 330. This number means that in this sample, for each additional year of job tenure with the current company, annual income increases on average by $330, keeping the other independent variables constant.

Once we're satisfied that the model fits the data as well as possible, and that the required conditions are satisfied, we can interpret and test the individual coefficients and use the model to predict and estimate.

Test of β1 (Coefficient of age)

Value of the test statistic: t = 1.95; p-value = .0527
Test of β2 (Coefficient of education)
Value of the test statistic: t = 4.84; p-value = 0
Test of β3 (Coefficient of number of hours of work per week)
Value of the test statistic: t = 3.59; p-value = .0004
Test of β4 (Coefficient of spouse's number of hours of work per week)
Value of the test statistic: t = −4.67; p-value = 0
Test of β5 (Coefficient of occupation prestige score)
Value of the test statistic: t = 3.64; p-value = .0003
Test of β6 (Coefficient of number of children)
Value of the test statistic: t = −.22; p-value = .8279
Test of β7 (Coefficient of number of earners in family)
Value of the test statistic: t = .23; p-value = .8147
Test of β8 (Coefficient of years with current employer)
Value of the test statistic: t = 1.37; p-value = .1649

Interpretation: Testing the Coefficients
There is sufficient evidence at the 5% significance level to infer that each of the following variables is linearly related to income:
- Education
- Number of hours of work per week
- Spouse's number of hours of work per week
- Occupation prestige score
There is weak evidence to infer that income and age are linearly related.
In this model there is not enough evidence to conclude that each of the following variables is linearly related to income:
- Number of children
- Number of earners in the family
- Number of years with current employer
Note that this may mean that there is no linear relationship between these three independent variables and income. However, it may also mean that a linear relationship exists, but because of a condition called multicollinearity, the t-test revealed no linear relationship.
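Each p-value above is the two-tail probability from a t distribution with n − k − 1 degrees of freedom. A small sketch of that calculation (the t values are from the example; the sample size is a made-up placeholder since it is not recorded in these notes):

```python
from scipy import stats

def two_tail_p(t_stat: float, n: int, k: int) -> float:
    """Two-tail p-value for testing H0: beta_i = 0, with df = n - k - 1."""
    df = n - k - 1
    return 2 * stats.t.sf(abs(t_stat), df)

# t statistics from the income example; n = 300 is a made-up placeholder.
print(round(two_tail_p(1.95, n=300, k=8), 4))   # age: weak evidence
print(round(two_tail_p(-0.22, n=300, k=8), 4))  # number of children: no evidence
```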


Prediction interval for a particular value of y: we can also produce the confidence interval estimate of the expected value of y.
Example: we'll predict the income of a 50-year-old with 12 years of education, who works 40 hours per week, whose spouse also works 40 hours per week, who has an occupation prestige score of 50, has 2 children, 2 earners in the family, and has worked for the same company for 5 years.
Prediction interval: (−20,719, 111,056). It is so wide as to be completely useless. To be useful in predicting values, the model must be considerably better.
The confidence interval estimate of the expected income of the population is LCL = 37,661 and UCL = 52,675.

17.3 Regression Diagnostics-2
Calculate the residuals and check the following:
- Is the error variable nonnormal? Draw the histogram of the residuals.
- Is the error variance constant? Plot the residuals versus the predicted values of y.

- Are the errors independent (time-series data)? Plot the residuals versus the time periods.
- Are there observations that are inaccurate or do not belong to the target population? Double-check the accuracy of outliers and influential observations.

Multicollinearity: multiple regression models have a problem that simple regressions do not, namely multicollinearity (independent variables are highly correlated). The adverse effect of multicollinearity is that the estimated regression coefficients of the independent variables that are correlated tend to have large sampling errors. There are two consequences of multicollinearity:
1. Because the variability of the coefficients is large, the sample coefficient may be far from the actual population parameter, including the possibility that the statistic and parameter may have opposite signs.
2. When the coefficients are tested, the t-statistics will be small, which leads to the inference that there is no linear relationship between the affected independent variables and the dependent variable. In some cases, this inference will be wrong.
Fortunately, multicollinearity does not affect the F-test of the analysis of variance.
- Another problem caused by multicollinearity is the interpretation of the coefficients. We interpret the coefficients as measuring the change in the dependent variable when the corresponding independent variable increases by one unit while all the other independent variables are held constant. This interpretation may be impossible when the independent variables are highly correlated, because when one independent variable increases by one unit, some or all of the other independent variables will change.
- Multicollinearity exists in virtually all multiple regression models. The problem becomes serious, however, only when two or more independent variables are highly correlated. Unfortunately, we do not have a critical value that indicates when the correlation between two independent variables is large enough to cause problems.
- Minimizing the effect of multicollinearity is often easier than correcting it. The statistics practitioner must try to include independent variables that are independent of each other. Another alternative is to use a stepwise regression package.

17.4 Regression Diagnostics-3 (Time Series)
Durbin-Watson Test: allows the statistics practitioner to determine whether there is evidence of first-order autocorrelation, a condition in which a relationship exists between consecutive residuals ei and e(i−1), where i is the time period. A sketch of the statistic is shown after the list below.

Remember:
1. Sɛ (error): the smaller the better; compare to ȳ.
2. R² (strength): the closer to 100% the better; can be inflated when k is large relative to n (hence the adjusted R²).
3. Adjusted R²
4. F-test: validity (macro)
5. t-test: individual independent variables (micro)
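A sketch of the Durbin-Watson statistic referenced above (the residuals are made up; in practice they come from the fitted regression):

```python
import numpy as np

def durbin_watson(residuals) -> float:
    """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values near 2 suggest no first-order autocorrelation; values well below 2
    suggest positive autocorrelation, values well above 2 negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical time-ordered residuals, for illustration only.
e = [1.2, 0.9, 0.7, 0.2, -0.1, -0.4, -0.8, -1.1, -0.9, -0.5]
print(round(durbin_watson(e), 3))   # far below 2: evidence of positive autocorrelation
```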


Regression equation = written with numbers (the estimated coefficients).
Regression model = written with notation (the population parameters).
The higher the F-statistic, the stronger the evidence that the model is valid.

t-Test on an individual coefficient:
- Positive relationship (H1: βi > 0): one-tail test; reject H0 if t > t(α, n−k−1)
- Negative relationship (H1: βi < 0): one-tail test; reject H0 if t < −t(α, n−k−1)
- Any linear relationship (H1: βi ≠ 0): two-tail test; reject H0 if t > t(α/2, n−k−1) or t < −t(α/2, n−k−1)

Chapter 18: Model Building
Regression analysis for non-linear (polynomial) models and models that include nominal independent variables.
Polynomial Models: independent variables may be functions of a smaller number of predictor variables, such as polynomial models.
1. One predictor variable (x):
y = β0 + β1x + β2x² + … + βpx^p + ɛ

First-Order Model: When p = 1, we have the simple linear regression model, a straight line between the dependent and independent variables: y = β0 + β1x + ɛ. The relationship is positive when β1 > 0.

Second-Order Model: When p = 2, the polynomial is a parabola: y = β0 + β1x + β2x² + ɛ.

Third-Order Model: When p = 3, the polynomial is a cubic: y = β0 + β1x + β2x² + β3x³ + ɛ.

2. Two predictor variables
Perhaps we suspect that there are two predictor variables (x1 and x2) which influence the dependent variable:
First-Order Model (no interaction): y = β0 + β1x1 + β2x2 + ɛ
First-Order Model (with interaction): y = β0 + β1x1 + β2x2 + β3x1x2 + ɛ
Second-Order Model with interaction: y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ɛ
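Polynomial and interaction models are still estimated by ordinary least squares once the extra columns (x1², x2², x1·x2) are added to the design matrix. A sketch with simulated data (the coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = (3 + 1.0 * x1 + 0.5 * x2 + 0.2 * x1**2 - 0.1 * x2**2 + 0.3 * x1 * x2
     + rng.normal(scale=1.0, size=n))

# Columns: 1, x1, x2, x1^2, x2^2, x1*x2 (the second-order model with interaction)
X = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0..b5:", np.round(b, 3))
```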

Interaction: x1 changes the relationship of x2 to y, and vice versa.

Selecting a Model
1. One predictor variable, or two (or more)?
2. First order? Second order? Higher order?
3. With interaction? Without?

Nominal Independent Variables
Indicator Variable: a variable that can assume either one of only two values (usually 0 and 1).
- A value of one usually indicates the existence of a certain condition, while a value of zero usually indicates that the condition does not hold.

Example (car prices): the price diminishes with additional mileage (x); a white car sells for $91.10 more than other colours (I1); a silver car fetches $330.40 more than other colours (I2).
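A sketch of how the colour indicator variables would be coded before fitting (the mileage and price numbers are invented; only the 0/1 coding scheme follows the notes):

```python
import numpy as np

colours = np.array(["white", "silver", "other", "white", "other", "silver"])
mileage = np.array([36.0, 40.0, 52.0, 28.0, 61.0, 45.0])   # x, thousands of miles
price = np.array([15.2, 15.6, 13.9, 15.9, 13.1, 15.1])     # y, thousands of dollars

i1 = (colours == "white").astype(float)    # I1 = 1 if white, 0 otherwise
i2 = (colours == "silver").astype(float)   # I2 = 1 if silver, 0 otherwise
# "Other colours" is the base case, with I1 = I2 = 0.

X = np.column_stack([np.ones(len(price)), mileage, i1, i2])
b, *_ = np.linalg.lstsq(X, price, rcond=None)
print("b0, b(mileage), b(I1), b(I2):", np.round(b, 3))
```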

Testing the Coefficient

Model Building Steps:
1. Identify the dependent variable; what is it we wish to predict? Don't forget the variable's unit of measure.
2. List potential predictors; how would changes in the predictors change the dependent variable? Be selective; go with the fewest independent variables required. Be aware of the effects of multicollinearity.
3. Gather the data; at least six observations for each independent variable used in the equation.
4. Identify several possible models; formulate first- and second-order models with and without interaction. Draw scatter diagrams.
5. Use statistical software to estimate the models.
6. Determine whether the required conditions are satisfied; if not, attempt to correct the problem.
7. Use your judgment and the statistical output to select the best model.

Remember:
1. Compare Sɛ and ȳ (the average of y).
2. R² (supporting evidence): explained vs. unexplained variation.
3. F-test of validity; hypotheses: H0: β1 = β2 = … = βk = 0 and H1: at least one βi is not equal to 0.