Multiple Regression

Multiple Regression Predicting Success in the International Baccalaureate Program

In This Module

In this module, you will learn:

• R2
• F-test
• Adjusted R2
• Partial R2 (squared semi-partial correlations)
• Regression coefficients
• t-tests
• Standardized regression coefficients
• Collinearity and multicollinearity


Introduction

Previously we learned about simple regression. In this module we will learn about multiple regression, which is used when we have more than one predictor. Typically, we use multiple regression for two reasons:
• To minimize errors of prediction (predictions are often more accurate using multiple sources of information)
• To statistically control for extraneous variables (for example, examining the relationship between hours of study and test performance while controlling for previous achievement)


Case 1: Purpose of Study

School districts in the state of Florida use a variety of procedures for determining which students will be admitted to International Baccalaureate (IB) programs. One district uses minimum requirements of a grade point average (GPA) of 2.5 and a Comprehensive Test of Basic Skills (CTBS) stanine of 6. Sometimes there are more applicants who meet these minimum requirements than can be accommodated by the IB program.


Case 1: Purpose of Study

To select among applicants meeting the minimum requirements, Pre-IB teachers recommended relying more heavily on the CTBS than GPA. The district wanted to check to see how well IB GPA could be predicted using CTBS stanine scores and GPA, and whether the best predictions could be made when CTBS was weighted more heavily than GPA.


Case 1: Method

In the first year of the IB program, all applicants meeting the minimum requirements in the county were admitted since the number of openings exceeded the number of applicants. In all, 100 students participated. The 8th grade CTBS stanine scores and the GPAs of the admitted students were obtained during the admission process. The students' IB GPAs were obtained after participation in the IB program. The data set is available for download from the Attachments tab.


Case 1: Data Set

CTBS  GPA   IB-GPA    CTBS  GPA   IB-GPA    CTBS  GPA   IB-GPA    CTBS  GPA   IB-GPA
6     2.90  2.33      6     2.50  2.78      7     3.23  3.35      8     3.38  4.00
6     3.16  2.86      6     3.13  2.89      7     3.07  3.42      8     2.90  1.91
6     2.65  2.40      6     3.16  2.77      7     3.60  3.38      8     3.08  3.35
6     3.39  3.49      6     4.00  4.00      7     3.61  3.61      8     4.00  3.94
6     3.14  2.79      6     2.80  3.20      7     3.88  3.82      8     3.69  3.34
6     3.28  2.94      6     3.65  2.38      7     3.28  3.71      8     2.97  2.80
6     2.65  2.56      6     2.70  2.88      7     2.82  2.74      8     2.86  3.47
6     3.18  2.95      6     4.00  3.25      7     3.65  3.32      8     3.60  2.92
6     2.53  2.14      6     2.93  2.32      7     3.45  2.45      8     3.19  2.83
6     2.90  2.36      6     2.77  3.68      7     3.40  3.59      8     3.31  3.63
6     4.00  4.00      6     3.41  3.55      7     3.67  3.86      8     3.22  3.12
6     3.35  3.02      6     3.28  2.72      7     3.42  3.74      8     4.00  4.00
6     2.67  2.53      6     3.04  2.77      7     3.55  3.43      8     3.81  4.00
6     2.65  2.79      6     3.09  3.81      7     2.82  3.00      9     3.61  3.90
6     2.98  3.41      7     3.47  2.77      7     2.87  2.88      9     3.46  3.14
6     3.58  4.00      7     3.50  4.00      7     3.57  3.26      9     4.00  4.00
6     3.50  2.97      7     3.85  3.54      8     3.43  3.94      9     3.33  2.91
6     3.08  2.98      7     3.73  3.33      8     3.37  2.63      9     4.00  4.00
6     3.76  4.00      7     2.75  2.50      8     3.61  3.44      9     3.81  3.93
6     2.55  2.13      7     3.57  3.77      8     3.77  3.74      9     3.58  3.07
6     2.87  2.35      7     2.84  2.61      8     3.15  3.70      9     4.00  3.60
6     3.06  3.31      7     3.51  3.51      8     3.50  3.80      9     3.59  3.28
6     2.78  1.92      7     2.85  2.29      8     3.34  2.83      9     3.95  3.31
6     2.87  2.69      7     2.91  2.55      8     3.45  3.19      9     3.96  4.00
6     2.52  2.82      7     3.62  4.00      8     3.97  2.92      9     2.91  3.01


Analysis for this Case

For this case we are trying to predict a continuous criterion variable (IBGPA) from multiple continuous predictor variables (CTBS and GPA). Thus we need to expand the regression framework you learned in the last case to accommodate multiple predictors. Note that we will now have statistics to help us understand the contribution of each of the individual predictors (CTBS and GPA) to our prediction, as well as overall statistics to help us understand how well we are predicting as a whole.


Analysis Plan

To thoroughly analyze these data, we will be learning to conduct a multiple regression using SAS and to interpret the following multiple regression concepts:
1. R2
2. F-test
3. Adjusted R2
4. Partial R2 (squared semi-partial correlations)
5. Regression coefficients
6. t-tests
7. Standardized regression coefficients
8. Collinearity and multicollinearity


R2

Recall that in simple linear regression the r2 value told us the proportion of the variance in the criterion that was associated with our predictor. In multiple regression, the same proportion is represented by R2. The capital "R" indicates more than one predictor is involved. This value tells us the proportion of the variance in the criterion that is associated with our predictors, which is shown on the next page.


R2

[Venn diagram: one circle represents the variance of the criterion variable, and the other two circles represent the variances of Predictor 1 and Predictor 2. The yellow shaded area, where the predictor circles overlap the criterion circle, represents the proportion of the variance in the criterion that is associated with the two predictors.]


R2

As with simple linear regression, we compute the R2 value by finding the various sums-of-squares. We will show you how to get SAS to do the computing for us later, but for now assume that we get the following values for this case:

SSreg = 14.71085
SSresid = 16.84763
SStotal = 31.55848

We see that about 47% of the variation in the students' IBGPA is associated with our predictors (CTBS and GPA).
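To make the arithmetic explicit (a quick check using the values above):

R² = SSreg / SStotal = 14.71085 / 31.55848 ≈ .466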


F-test

As with simple linear regression, we may wish to test whether the R2 value is statistically significant. In other words, we may wish to conclude that the corresponding population parameter, ρ2 (rho squared), differs from zero. Again we can test the null hypothesis H0: ρ2 = 0 using an F-test:

F = (R² / k) / ((1 − R²) / (N − k − 1))

where k is the number of predictors. The numerator df equals k and the denominator df equals N − k − 1.


F-test

For this case we find:
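Plugging in the sums-of-squares reported earlier (with N = 100 students and k = 2 predictors), the statistic works out to approximately:

F = (14.71085 / 2) / (16.84763 / 97) ≈ 42.3, with 2 and 97 degrees of freedom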

This is greater than the critical value, and the corresponding p-value is less than .05. Therefore, we conclude that there is some association between IBGPA and the set of predictors.


Adjusted R2

After concluding that in the population some of the variance in the criterion is associated with the predictors, we may want to provide an estimate of the amount. Although the R2 value does this, it is positively biased. In other words, if we took many samples from the population and for each sample computed the R2, we would find the average of all our R2 values was greater than the population value. Thus, there is a need to make a downward adjustment of the R2 value. We need to make bigger adjustments when the sample size, N, is small and when the number of predictors, k, is large.


Adjusted R2

The Ezekiel adjustment is:

R²E = 1 − (1 − R²)(N − 1) / (N − k − 1)

The subscript "E" in this equation represents "Ezekiel," the developer of this formula (we will learn more about him in future modules). In SAS, this adjustment is represented as "Adj R-Sq."


Adjusted R2

For our case, the adjusted R2 is just a little less than the unadjusted value.

If we consider the population, we estimate about 46% of the variance in the criterion (IBGPA) is associated with the predictors (CTBS & GPA).
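Working it through with the values from this case (R² ≈ .466, N = 100, k = 2):

adjusted R² = 1 − (1 − .466)(99 / 97) ≈ .455

which is the roughly 46% figure cited above.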


Partial R2

To this point we have concentrated on how the two predictors work as a set. In most settings, we also want to understand the unique contribution of each predictor to the prediction of the criterion. Conceptually, it would be nice if we could divide the R2 up and talk about how much is due to which predictor.


Partial R2

[Venn diagram: one circle represents the variance of the criterion variable, and the other two circles represent the variances of Predictor 1 and Predictor 2. Yellow shows the piece of the R2 that is unique to one predictor, green shows the piece that is unique to the other predictor, and blue shows the piece that is common to the two predictors.]


Partial R2

The unique pieces are referred to as the partial R2 for each predictor, or as the squared semipartial correlation. We compute the R2 for all the predictors (yellow + green + blue; R²Y.X1X2) and the R2 for the other predictor(s) (green + blue; R²Y.X2), then subtract the R2 for the other predictor(s) from the R2 for all the predictors.
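In symbols, the subtraction described above for predictor X1 is:

partial R² for X1 = R²Y.X1X2 − R²Y.X2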



Partial R2

[Venn diagram with circles for the criterion (Y), predictor 1 (X1), and predictor 2 (X2): yellow, blue, and green together represent the common area between the criterion (Y) and the two predictors (X1 and X2); blue and green represent the common area between the criterion (Y) and predictor 2 (X2); yellow represents the common area between the criterion (Y) and predictor 1 (X1), partialling out the area associated with predictor 2 (X2).]


Partial R2

We see in this case that GPA is contributing substantially to the R2, but that CTBS is contributing very little.

[Venn diagram with circles for IBGPA, GPA, and CTBS: yellow represents the common area between the criterion (IBGPA) and GPA, partialling out the area associated with CTBS; blue represents the common area between the criterion (IBGPA) and CTBS, partialling out the area associated with GPA.]


Prediction Equation

Although the partial R2 of each predictor tells us a good deal about the individual contributions, it doesn't provide the detail needed to make predictions. In order to do so, we will need to develop a regression equation. Recall that with simple linear regression we had a prediction equation. More specifically, we fit a line to the scatter plot so that it minimized the sum-of-squared residuals. Also recall that the line was defined by computing an intercept (a) and a slope (b), and was expressed in the general prediction equation as:

Ŷ = a + bX


Prediction Equation: Two Predictors

The general prediction equation for multiple regression will be similar in form, but will include a term for each predictor. In general:

Ŷ = a + b1X1 + b2X2 + … + bkXk

For our case, which has two predictors, we customize the equation and write:

predicted IBGPA = a + b1(GPA) + b2(CTBS)

The specific values for a, b1, and b2 are computed to minimize the sum-of-squared residuals, and how to get SAS to compute them is shown later.


Case 1: Prediction Equation

For this case, the SAS output shown later gives the regression coefficients (approximately 0.88164 for GPA and 0.02551 for CTBS; the intercept also appears in that output), so the fitted equation has the form:

predicted IBGPA = a + 0.88164(GPA) + 0.02551(CTBS)

We could use this equation to make predictions. Suppose we had an applicant with a GPA of 3.0 and a CTBS of 8. The predicted IBGPA for this applicant would be:
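A sketch of the arithmetic, using the slope estimates reported later in this module and writing a for the intercept from the SAS output (its value is not reproduced in this text):

predicted IBGPA = a + 0.88164(3.0) + 0.02551(8) ≈ a + 2.645 + 0.204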


Interpreting the Regression Coefficients

In general, the regression coefficients, or partial slopes, in the prediction equation are interpreted as the expected change in the criterion that is associated with a one-unit change in a predictor while holding all other predictors constant. To make this a bit more concrete, let us consider the regression coefficients in this case.


Case 1: Interpreting the Regression Coefficients

The regression coefficient for GPA was .88164. This implies that if we had two applicants who differed from each other by 1 unit in GPA and had the same CTBS, we would predict the one with the higher GPA to have an IBGPA that was approximately 0.882 higher. Or, holding CTBS constant, every 1-unit increase in GPA leads to a predicted IBGPA that is approximately 0.882 units higher. When we turn to CTBS, we see that for every one unit higher a student is on the standardized test, the predicted IBGPA is 0.026 higher, holding GPA constant.


t-tests for Regression Coefficients

We may also want to test whether each regression coefficient differs statistically from zero (H0: βi = 0). As with simple linear regression, we can conduct a t-test by dividing the coefficient by its standard error. In this case we get the following results from our significance tests.

We see that the coefficient for GPA is statistically significant, while the one for CTBS is not.
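For reference, the statistic for each predictor is simply:

t = bi / SE(bi), with N − k − 1 degrees of freedom (here, 100 − 2 − 1 = 97)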


Standardized Regression Coefficients

In addition to noting that one coefficient is statistically significant and the other is not, one may be tempted to compare the size of the regression coefficients. Although it may be tempting to say GPA's coefficient of 0.88164 is large compared to CTBS's of 0.02551, such a comparison is a bit dicey. The problem in making such comparisons is that the variables could be on very different scales. A one-unit change in GPA may not be comparable to a one-unit change in the standardized test score. Consequently, some researchers choose to standardize their regression coefficients to make the comparison a bit less dicey.


Standardized Regression Coefficients

We use β to symbolize the standardized regression coefficient. We also use β to represent the population coefficient, but these two quantities are not the same. However, they can usually be differentiated by the context in which they are used. For this case we get:
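One common way to convert an unstandardized slope to a standardized one (a sketch of the usual hand computation, not the SAS procedure itself) is to rescale it by the ratio of standard deviations:

βj = bj (sXj / sY), where sXj and sY are the standard deviations of the predictor and the criterion

For this case, this yields approximately .66 for GPA and .047 for CTBS, the values interpreted on the next pages.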


Standardized Regression Coefficients

We can interpret the standardized regression coefficients much like we interpreted the unstandardized regression coefficients, with the exception that we speak in standard deviation units. For every one standard deviation higher a student's GPA is, we predict her IBGPA will be approximately .66 standard deviation higher, holding CTBS constant. How would you interpret the standardized coefficient of .047?


Standardized Regression Coefficients

For every one standard deviation higher a student’s CTBS is, we predict her IBGPA will be .047 standard deviation higher, holding GPA constant.


Summary Table: Contributions of Individual Predictors

We have looked at the individual predictors in multiple ways and could summarize the results in a table.
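Collecting the values reported in this module (the full SAS summary table also includes the partial R2 values and the exact t and p values, which are not reproduced here):

Predictor   b         β       t-test result
GPA         0.88164   ≈ .66   statistically significant
CTBS        0.02551   ≈ .047  not statistically significant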

All of these ways of looking at the individual predictors seem to be suggesting that GPA is doing almost all the predictive work, and that CTBS is doing little to nothing. Note this is in contrast to the teacher recommendations that most of the weight be given to the CTBS score when making admissions decisions.


Data Screening

Recall that in the module on simple linear regression, we indicated that researchers need to do some additional things so they understand the degree to which the data are consistent with underlying assumptions and so they know whether there are any outlying observations that are having an undue influence on the results. Multiple regression has the same assumptions, so the same things should be done. In particular, you will want to graph residuals against predicted values to look for homoscedasticity and linearity, output the residuals and check them for normality, and compute studentized residuals and Cook's D values to screen for outliers.


Data Screening

In addition to these things, with multiple regression you should also consider the correlations among your predictors, or, more formally, collinearity and multicollinearity.


Collinearity

Collinearity refers to a context where one of the predictors is completely redundant with the other predictors. Put another way, if there were collinearity, you would find that if you ran a regression predicting the redundant predictor from the others you would get an R2 value of 1.0. In such a case the redundant variable should not be included in the multiple regression. Because it is completely redundant, it cannot possibly help you predict better.


Collinearity: Tolerance

Furthermore, its inclusion leads to mathematical difficulties. In the denominator of the formula for the standard error of a regression coefficient (one common form is sketched below) is the term 1 − R2, where the R2 is from a regression predicting that predictor from all the other predictors (this term has the special name tolerance).

When there is collinearity, that R2 becomes 1. Under these circumstances the term 1 − R2 (the tolerance) becomes 0, and not even SAS is comfortable with dividing something by 0.
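The exact formula from the original slide is not reproduced here, but one common form for the standard error of a regression coefficient makes the role of tolerance clear:

SE(bj) = sqrt( MSresid / [ (N − 1) s²Xj (1 − R²j) ] )

where R²j comes from regressing predictor Xj on all the other predictors. When 1 − R²j (the tolerance) is 0, the denominator is 0 and the standard error is undefined.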


Multicollinearity

Multicollinearity indicates that a predictor is somewhat redundant, but not completely redundant, with other predictors. Generally researchers are advised not to include predictors that are too redundant with those already included. To screen for multicollinearity one could:
• Examine the correlations among predictors.
• Compute a Variance Inflation Factor (VIF) for each predictor, where the R2 is obtained by predicting that predictor from all other predictors (see the formula below). Some suggest a VIF > 10 is problematic.
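The VIF is just the reciprocal of the tolerance discussed above:

VIFj = 1 / (1 − R²j)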


Predictor Correlations

The correlation matrix for our case is shown below.

We see that the correlation between GPA and CTBS is approximately .47.


Variance Inflation Factor (VIF)

In our case, the VIF for each variable would be the same, because we only have two predictors. Note that we don't have collinearity, and the VIF for each variable is low enough that we are also not worried about multicollinearity.
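With only two predictors, the R² for predicting one predictor from the other is simply the square of the correlation between them, so using the r ≈ .47 reported above:

VIF = 1 / (1 − .47²) = 1 / .779 ≈ 1.28

which is far below the VIF > 10 rule of thumb.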


Collinearity Example

To provide an example where collinearity would be an issue, consider a researcher who wanted to predict GPA using SAT-Quantitative scores, SAT-Verbal scores, and SAT-Total scores. Conceptually there is a problem: the SAT-Total score is simply the sum of the SAT-Quantitative and SAT-Verbal scores, and thus it cannot tell us anything that we haven't learned from the other two scores. If you predict SAT-Total from SAT-Quantitative and SAT-Verbal, you would obtain an R2 = 1.0, indicating collinearity.

If you ignore the collinearity and try to run the multiple regression predicting GPA from SAT-Quantitative, SAT-Verbal, and SAT-Total scores, you will get friendly error messages from SAS.

If you use just the SAT-Quantitative and SAT-Verbal scores as predictors, the regression should run fine.
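A minimal sketch of the SAS side of this (assuming a hypothetical data set sat with variables gpa, satq, satv, and sattot = satq + satv; none of these are part of this case's data):

* Collinear model: SAS notes that the model is not full rank and cannot estimate all coefficients;
proc reg data=sat;
  model gpa = satq satv sattot;
run;

* Dropping the redundant SAT-total predictor lets the regression run normally;
proc reg data=sat;
  model gpa = satq satv;
run;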


SAS Code for Multiple Regression

* Only the first four data lines are shown here (full data set: Attachments tab);
data j1;
  input ctbs gpa ibgpa;
  cards;
6 2.9 2.33
6 3.16 2.86
6 2.65 2.4
6 3.39 3.49
;

proc corr;
  var gpa ibgpa ctbs;

proc reg;
  model ibgpa = gpa ctbs / stb scorr2 p r vif;
  plot residual.*predicted.;
  output out=res residual=resid predicted=pred;

proc univariate plot;
  var resid;
run;

Notes on the options and statements:
• stb requests standardized regression coefficients.
• scorr2 requests uniquenesses (squared semipartial correlations).
• p requests predicted values.
• r requests residuals, studentized residuals, and Cook's D.
• vif requests variance inflation factors.
• The plot statement plots the residuals by predicted values so we can look at homoscedasticity and linearity.
• To look at normality, the output statement writes the residuals (as the variable resid) to a data set called res, and proc univariate is used to examine their distribution.

The SAS code is available for download from the Attachments tab.


SAS Output: Proc Corr


SAS Output: Proc Reg

Remember, SAS uses the terms “model” and “error” rather than “regression” and “residual.”

This is the p-value for the F-test of H0: ρ2 = 0.

Notice that the R2 and adjusted R2 are shown here too.


SAS Output: Proc Reg

Remember from simple regression, these parameter estimates are used to write our regression equation.

Note, in multiple regression, we asked SAS to compute the standardized regression weights, partial R2 values, and VIFs.


SAS Output: Checking Outliers Using Cook’s D


SAS Output: Proc Univariate of Residuals


SAS Output: Proc Univariate of Residuals


SAS Output: Checking Homoscedasticity and Linearity


Online Testing

Your next step is to apply your knowledge of the topics in this module to the online testing.