Chapter 3 Multiple Regression

3.1 Multiple Linear Regression Model

A fitted linear regression model always leaves some residual variation. There might be another systematic cause for the variability in the observations $y_i$. If we have data on other explanatory variables, we can ask whether they can be used to explain some of the residual variation in $Y$. If this is the case, we should take it into account in the model so that the errors are purely random. We could write
\[
Y_i = \beta_0 + \beta_1 x_i + \underbrace{\beta_2 z_i + \varepsilon_i^\star}_{\text{previously } \varepsilon_i}.
\]

Z is another explanatory variable. Usually we denote all explanatory variables (there may be more than two of them) by the letter X with an index to distinguish between them, i.e., $X_1, X_2, \ldots, X_{p-1}$.

Example 3.1. (Neter et al., 1996) Dwine Studios, Inc. The company operates portrait studios in 21 cities of medium size. These studios specialize in portraits of children. The company is considering an expansion into other cities of medium size and wishes to investigate whether sales (Y) in a community can be predicted from the number of persons aged 16 or younger in the community (X1) and the per capita disposable personal income in the community (X2). If we use just X2 (per capita disposable personal income in the community) to model Y (sales in the community), we obtain the following model fit.


The regression equation is
Y = - 352.5 + 31.17 X2

S = 20.3863   R-Sq = 69.9%   R-Sq(adj) = 68.3%

Analysis of Variance
Source       DF       SS       MS      F      P
Regression    1  18299.8  18299.8  44.03  0.000
Error        19   7896.4    415.6
Total        20  26196.2

Figure 3.1: (a) Fitted line plot for Dwine Studios versus per capita disposable personal income in the community. (b) Residual plots.

The regression is highly significant, but $R^2$ is rather small. This suggests that there could be some other factors that are also important for the sales. We have data on the number of persons aged 16 or younger in the community, so we will examine whether the residuals of the above fit are related to this variable. If so, then including it in the model may improve the fit.

Figure 3.2: The dependence of the residuals on X1 .


Indeed, the residuals show a possible relationship with the number of persons aged 16 or younger in the community. We will fit the model with both variables, X1 and X2, included, that is
\[
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \quad i = 1, \ldots, n.
\]

The model fit is as follows.

The regression equation is
Y = - 68.9 + 1.45 X1 + 9.37 X2

Predictor      Coef  SE Coef      T      P
Constant     -68.86    60.02  -1.15  0.266
X1           1.4546   0.2118   6.87  0.000
X2            9.366    4.064   2.30  0.033

S = 11.0074   R-Sq = 91.7%   R-Sq(adj) = 90.7%

Analysis of Variance
Source          DF     SS     MS      F      P
Regression       2  24015  12008  99.10  0.000
Residual Error  18   2181    121
Total           20  26196

Here we see that the intercept parameter is not significantly different from zero (p = 0.266), so the model without the intercept was fitted. $R^2$ is now close to 100% and both parameters are highly significant.

Regression Equation
Y = 1.62 X1 + 4.75 X2

Coefficients
Term     Coef   SE Coef        T      P
X1    1.62175  0.154948  10.4664  0.000
X2    4.75042  0.583246   8.1448  0.000

S = 11.0986   R-Sq = 99.68%   R-Sq(adj) = 99.64%

Analysis of Variance
Source      DF  Seq SS  Adj SS  Adj MS        F      P
Regression   2  718732  718732  359366  2917.42  0.000
Error       19    2340    2340     123
Total       21  721072




Figure 3.3: Fitted surface plot and the Dwine Studios observations.
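The fits above were produced in Minitab. As a rough illustration only, the following Python sketch computes the same kind of least squares fit; the data are synthetic (generated around the coefficients reported above), since the Dwine Studios observations are not listed in these notes.

import numpy as np

rng = np.random.default_rng(1)
n = 21
x1 = rng.uniform(30, 90, size=n)    # synthetic "persons aged 16 or younger" values
x2 = rng.uniform(15, 20, size=n)    # synthetic "per capita disposable income" values
y = -68.9 + 1.45 * x1 + 9.37 * x2 + rng.normal(0, 11, size=n)   # synthetic sales

X = np.column_stack([np.ones(n), x1, x2])        # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None) # least squares fit
residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - X.shape[1])    # residual mean square

print("beta_hat:", beta_hat)
print("S:", np.sqrt(s2))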

A Multiple Linear Regression (MLR) model for a response variable $Y$ and explanatory variables $X_1, X_2, \ldots, X_{p-1}$ is
\[
\begin{aligned}
E(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i}, \\
\operatorname{var}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i},\; Y \mid X_1 = x_{1j}, \ldots, X_{p-1} = x_{p-1,j}) &= 0, \quad i \neq j.
\end{aligned}
\]
As in the SLR model we denote $Y_i = (Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i})$ and we usually omit the condition on the $X$s and write
\[
\begin{aligned}
\mu_i = E(Y_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i}, \\
\operatorname{var}(Y_i) &= \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(Y_i, Y_j) &= 0, \quad i \neq j,
\end{aligned}
\]
or

\[
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i, \\
E(\varepsilon_i) &= 0, \quad \operatorname{var}(\varepsilon_i) = \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(\varepsilon_i, \varepsilon_j) &= 0, \quad i \neq j.
\end{aligned}
\]

For testing we need the assumption of Normality, i.e., we assume that $Y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)$


or $\varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma^2)$.

To simplify the notation we write the MLR model in matrix form
\[
Y = X\beta + \varepsilon, \tag{3.1}
\]

that is,
\[
\underbrace{\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}}_{:=\,Y}
=
\underbrace{\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{p-1,1} \\
1 & x_{1,2} & \cdots & x_{p-1,2} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,n} & \cdots & x_{p-1,n}
\end{pmatrix}}_{:=\,X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}}_{:=\,\beta}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{:=\,\varepsilon}
\]
Here $Y$ is the vector of responses, $X$ is often called the design matrix, $\beta$ is the vector of unknown, constant parameters and $\varepsilon$ is the vector of random errors. The $\varepsilon_i$ are independent and identically distributed, that is $\varepsilon \sim N_n(0_n, \sigma^2 I_n)$. Note that the properties of the errors give $Y \sim N_n(X\beta, \sigma^2 I_n)$.

3.2 Least squares estimation

To derive the least squares estimator (LSE) for the parameter vector $\beta$ we minimise the sum of squares of the errors, that is
\[
\begin{aligned}
S(\beta) &= \sum_{i=1}^{n} \left[ Y_i - \{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}\} \right]^2 \\
&= \sum \varepsilon_i^2 \\
&= \varepsilon^T \varepsilon \\
&= (Y - X\beta)^T (Y - X\beta) \\
&= (Y^T - \beta^T X^T)(Y - X\beta) \\
&= Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta \\
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta.
\end{aligned}
\]


Theorem 3.1. The LSE $\hat{\beta}$ of $\beta$ is given by
\[
\hat{\beta} = (X^T X)^{-1} X^T Y
\]
if $X^T X$ is non-singular. If $X^T X$ is singular there is no unique LSE of $\beta$.

Proof. Let $\beta_0$ be any solution of $X^T X \beta = X^T Y$. Then $X^T X \beta_0 = X^T Y$ and
\[
\begin{aligned}
S(\beta) - S(\beta_0) &= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta - Y^T Y + 2\beta_0^T X^T Y - \beta_0^T X^T X \beta_0 \\
&= -2\beta^T X^T X \beta_0 + \beta^T X^T X \beta + 2\beta_0^T X^T X \beta_0 - \beta_0^T X^T X \beta_0 \\
&= \beta^T X^T X \beta - 2\beta^T X^T X \beta_0 + \beta_0^T X^T X \beta_0 \\
&= \beta^T X^T X \beta - \beta^T X^T X \beta_0 - \beta^T X^T X \beta_0 + \beta_0^T X^T X \beta_0 \\
&= \beta^T X^T X \beta - \beta^T X^T X \beta_0 - \beta_0^T X^T X \beta + \beta_0^T X^T X \beta_0 \\
&= \beta^T (X^T X \beta - X^T X \beta_0) - \beta_0^T (X^T X \beta - X^T X \beta_0) \\
&= (\beta^T - \beta_0^T)(X^T X \beta - X^T X \beta_0) \\
&= (\beta^T - \beta_0^T) X^T X (\beta - \beta_0) \\
&= \{X(\beta - \beta_0)\}^T \{X(\beta - \beta_0)\} \geq 0,
\end{aligned}
\]
since it is a sum of squares of the elements of the vector $X(\beta - \beta_0)$. We have shown that $S(\beta) - S(\beta_0) \geq 0$. Hence $\beta_0$ minimises $S(\beta)$, i.e. any solution of $X^T X \beta = X^T Y$ minimises $S(\beta)$. If $X^T X$ is non-singular the unique solution is $\hat{\beta} = (X^T X)^{-1} X^T Y$. If $X^T X$ is singular there is no unique solution. $\square$


Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result by differentiating S(β) with respect to β and setting the derivative equal to 0.
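As an illustration only (not part of the Minitab-based examples), the following Python sketch computes $\hat{\beta}$ on synthetic data by solving the normal equations $X^T X \beta = X^T Y$; in practice a routine such as np.linalg.lstsq is numerically preferable to forming $(X^T X)^{-1}$ explicitly.

import numpy as np

def lse(X, y):
    """Least squares estimate solving the normal equations X^T X beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # assumes X^T X is non-singular

# Small synthetic example with p - 1 = 2 explanatory variables.
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

print(lse(X, y))                              # close to beta_true
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same estimate via lstsq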

3.2.1 Properties of the least squares estimator

Theorem 3.2. If
\[
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I),
\]
then
\[
\hat{\beta} \sim N_p(\beta, \sigma^2 (X^T X)^{-1}).
\]

Proof. Each element of $\hat{\beta}$ is a linear function of $Y_1, \ldots, Y_n$. We assume that the $Y_i$, $i = 1, \ldots, n$, are normally distributed. Hence $\hat{\beta}$ is also normally distributed. The expectation and variance-covariance matrix can be shown in the same way as in Theorem 2.7. $\square$

Remark 3.1. The vector of fitted values is given by
\[
\hat{\mu} = \hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY.
\]
The matrix $H = X(X^T X)^{-1} X^T$ is called the hat matrix. Note that $H^T = H$ and also
\[
HH = X(X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T = X(X^T X)^{-1} X^T = H.
\]
A matrix $A$ which satisfies the condition $AA = A$ is called an idempotent matrix. Note that if $A$ is idempotent, then $(I - A)$ is also idempotent.


We now prove some results about the residual vector
\[
e = Y - \hat{Y} = Y - HY = (I - H)Y.
\]
As in Theorem 2.8, here we have

Lemma 3.1. $E(e) = 0$.

Proof.
\[
E(e) = (I - H) E(Y) = (I - X(X^T X)^{-1} X^T) X\beta = X\beta - X\beta = 0. \qquad \square
\]

Lemma 3.2. $\operatorname{Var}(e) = \sigma^2 (I - H)$.

Proof.
\[
\operatorname{Var}(e) = (I - H) \operatorname{var}(Y) (I - H)^T = (I - H)\, \sigma^2 I\, (I - H) = \sigma^2 (I - H). \qquad \square
\]

Lemma 3.3. The sum of squares of the residuals is $Y^T (I - H) Y$.

Proof.
\[
\sum_{i=1}^{n} e_i^2 = e^T e = Y^T (I - H)^T (I - H) Y = Y^T (I - H) Y. \qquad \square
\]

Lemma 3.4. The elements of the residual vector $e$ sum to zero, i.e.
\[
\sum_{i=1}^{n} e_i = 0.
\]


Proof. We will prove this by contradiction. Assume that $\sum e_i = nc$ where $c \neq 0$. Then
\[
\begin{aligned}
\sum e_i^2 &= \sum \{(e_i - c) + c\}^2 \\
&= \sum (e_i - c)^2 + 2c \sum (e_i - c) + nc^2 \\
&= \sum (e_i - c)^2 + 2c(\underbrace{\textstyle\sum e_i}_{=nc} - nc) + nc^2 \\
&= \sum (e_i - c)^2 + nc^2 \\
&> \sum (e_i - c)^2.
\end{aligned}
\]
The quantities $e_i - c$ are the residuals obtained from the fitted values $\hat{Y}_i + c$, i.e. from the model with intercept $\hat{\beta}_0 + c$, and we know that $\sum e_i^2$ is the minimum value of $S(\beta)$, so there cannot exist parameter values with a smaller sum of squares. This gives the required contradiction, so $c = 0$. $\square$

Corollary 3.1.
\[
\frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i = \bar{Y}.
\]

Proof. The residuals are $e_i = Y_i - \hat{Y}_i$, so $\sum e_i = \sum (Y_i - \hat{Y}_i)$, but $\sum e_i = 0$. Hence $\sum \hat{Y}_i = \sum Y_i$ and the result follows. $\square$
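These properties are easy to verify numerically. The following Python sketch (synthetic data, for illustration only) builds the hat matrix and checks that it is symmetric and idempotent, that the residuals sum to zero when the model contains an intercept, and that the mean of the fitted values equals $\bar{Y}$.

import numpy as np

rng = np.random.default_rng(2)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X^T X)^{-1} X^T
e = (np.eye(n) - H) @ y                    # residual vector e = (I - H) y
y_hat = H @ y                              # fitted values

print(np.allclose(H.T, H))                 # True: H is symmetric
print(np.allclose(H @ H, H))               # True: H is idempotent
print(np.isclose(e.sum(), 0.0))            # True: residuals sum to zero (intercept in model)
print(np.isclose(y_hat.mean(), y.mean()))  # True: mean of fitted values equals Y-bar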

3.3 Analysis of Variance

We begin this section by proving the basic Analysis of Variance identity.

Theorem 3.3. The total sum of squares splits into the regression sum of squares and the residual sum of squares, that is
\[
SS_T = SS_R + SS_E.
\]

Proof.
\[
SS_T = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = Y^T Y - n\bar{Y}^2.
\]


\[
\begin{aligned}
SS_R &= \sum (\hat{Y}_i - \bar{Y})^2 \\
&= \sum \hat{Y}_i^2 - 2\bar{Y} \underbrace{\textstyle\sum \hat{Y}_i}_{=n\bar{Y}} + n\bar{Y}^2 \\
&= \sum \hat{Y}_i^2 - n\bar{Y}^2 \\
&= \hat{Y}^T \hat{Y} - n\bar{Y}^2 \\
&= \hat{\beta}^T X^T X \hat{\beta} - n\bar{Y}^2 \\
&= Y^T X(X^T X)^{-1} \underbrace{X^T X (X^T X)^{-1}}_{=I} X^T Y - n\bar{Y}^2 \\
&= Y^T H Y - n\bar{Y}^2.
\end{aligned}
\]
We have seen (Lemma 3.3) that $SS_E = Y^T (I - H) Y$ and so
\[
SS_R + SS_E = Y^T H Y - n\bar{Y}^2 + Y^T (I - H) Y = Y^T Y - n\bar{Y}^2 = SS_T. \qquad \square
\]

F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis
\[
H_0 : \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0,
\]
i.e. all coefficients except $\beta_0$ are zero, versus $H_1 : \neg H_0$, which means that at least one of the coefficients is non-zero. Under $H_0$ the model reduces to the null model
\[
Y = \mathbf{1}\beta_0 + \varepsilon,
\]


where $\mathbf{1}$ is a vector of ones. In testing $H_0$ we are asking if there is sufficient evidence to reject the null model. The Analysis of Variance table is given by

Source               d.f.    SS                       MS              VR
Overall regression   p - 1   $Y^T H Y - n\bar{Y}^2$   $SS_R/(p-1)$    $MS_R/MS_E$
Residual             n - p   $Y^T (I - H) Y$          $SS_E/(n-p)$
Total                n - 1   $Y^T Y - n\bar{Y}^2$

As in the SLM we have $n - 1$ total degrees of freedom. Fitting a linear model with $p$ parameters $(\beta_0, \beta_1, \ldots, \beta_{p-1})$ leaves $n - p$ residual d.f. The regression d.f. are then $n - 1 - (n - p) = p - 1$.

It can be shown that $E(SS_E) = (n - p)\sigma^2$, that is, $MS_E$ is an unbiased estimator of $\sigma^2$. Also,
\[
\frac{SS_E}{\sigma^2} \sim \chi^2_{n-p},
\]
and if $\beta_1 = \ldots = \beta_{p-1} = 0$, then
\[
\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.
\]
The two statistics are independent, hence
\[
\frac{MS_R}{MS_E} \overset{H_0}{\sim} F_{p-1, n-p}.
\]
This is a test statistic for the null hypothesis $H_0 : \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0$ versus $H_1 : \neg H_0$. We reject $H_0$ at the $100\alpha\%$ level of significance if $F_{obs} > F_{\alpha; p-1, n-p}$, where $F_{\alpha; p-1, n-p}$ is such that $P(F < F_{\alpha; p-1, n-p}) = 1 - \alpha$.
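The following Python sketch (synthetic data, for illustration only) computes the ANOVA decomposition and the overall F-test exactly as in the table above, using scipy for the $F_{p-1,n-p}$ tail probability.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 25, 3                                   # p parameters: intercept + 2 slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
SST = y @ y - n * y.mean() ** 2
SSR = y @ H @ y - n * y.mean() ** 2
SSE = y @ (np.eye(n) - H) @ y

MSR, MSE = SSR / (p - 1), SSE / (n - p)
F_obs = MSR / MSE
p_value = stats.f.sf(F_obs, p - 1, n - p)      # P(F_{p-1, n-p} > F_obs)

print(f"F = {F_obs:.2f}, p-value = {p_value:.4f}")
print(np.isclose(SST, SSR + SSE))              # ANOVA identity holds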

3.4 Inferences about the parameters

In Theorem 3.2 we have seen that
\[
\hat{\beta} \sim N_p(\beta, \sigma^2 (X^T X)^{-1}).
\]
Therefore
\[
\hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj}), \qquad j = 0, 1, 2, \ldots, p - 1,
\]
where $c_{jj}$ is the $j$th diagonal element of $(X^T X)^{-1}$ (counting from 0 to $p - 1$). Hence, it is straightforward to make inferences about $\beta_j$ in the usual way. A $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is
\[
\hat{\beta}_j \pm t_{\frac{\alpha}{2}, n-p} \sqrt{S^2 c_{jj}},
\]
where $S^2 = MS_E$.

The test statistic for $H_0 : \beta_j = 0$ versus $H_1 : \beta_j \neq 0$ is
\[
T = \frac{\hat{\beta}_j}{\sqrt{S^2 c_{jj}}} \sim t_{n-p} \quad \text{if } H_0 \text{ is true.}
\]
Care is needed in interpreting the confidence intervals and tests. They refer only to the model we are currently fitting. Thus, not rejecting $H_0 : \beta_j = 0$ does not mean that $X_j$ has no explanatory power; it means that, conditional on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{p-1}$ being in the model, $X_j$ has no additional explanatory power. It is often best to think of the test as comparing the models without and with $X_j$, i.e.
\[
H_0 : E(Y_i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{j-1} x_{j-1,i} + \beta_{j+1} x_{j+1,i} + \cdots + \beta_{p-1} x_{p-1,i}
\]
versus
\[
H_1 : E(Y_i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}.
\]
It does not tell us anything about the comparison between the models $E(Y_i) = \beta_0$ and $E(Y_i) = \beta_0 + \beta_j x_{j,i}$.
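A sketch of the confidence interval and t-test for a single coefficient, on synthetic data and for illustration only; $c_{jj}$ is read from the diagonal of $(X^T X)^{-1}$ as in the formula above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
S2 = e @ e / (n - p)                             # S^2 = MSE

j = 2                                            # inference about beta_2
se_j = np.sqrt(S2 * XtX_inv[j, j])               # sqrt(S^2 c_jj)
t_crit = stats.t.ppf(1 - 0.05 / 2, n - p)        # t_{alpha/2, n-p} for alpha = 0.05

ci = (beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j)
T = beta_hat[j] / se_j                           # test statistic for H0: beta_j = 0
p_value = 2 * stats.t.sf(abs(T), n - p)

print(f"95% CI for beta_{j}: {ci}")
print(f"T = {T:.2f}, p-value = {p_value:.4f}")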

3.5 Confidence interval for µ

We have
\[
\widehat{E(Y)} = \hat{\mu} = X\hat{\beta}.
\]


As with simple linear regression, we might want to estimate the expected response at a specific $x$, say $x_0 = (1, x_{1,0}, \ldots, x_{p-1,0})^T$, i.e.
\[
\mu_0 = E(Y \mid X_1 = x_{1,0}, \ldots, X_{p-1} = x_{p-1,0}).
\]
The point estimate will be
\[
\hat{\mu}_0 = x_0^T \hat{\beta}.
\]
Assuming normality, as usual, we can obtain a confidence interval for $\mu_0$.

Theorem 3.4.
\[
\hat{\mu}_0 \sim N(\mu_0, \sigma^2 x_0^T (X^T X)^{-1} x_0).
\]

Proof.
(i) $\hat{\mu}_0 = x_0^T \hat{\beta}$ is a linear combination of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1}$, each of which is normal. Hence $\hat{\mu}_0$ is also normal.
(ii)
\[
E(\hat{\mu}_0) = E(x_0^T \hat{\beta}) = x_0^T E(\hat{\beta}) = x_0^T \beta = \mu_0.
\]
(iii)
\[
\operatorname{Var}(\hat{\mu}_0) = \operatorname{Var}(x_0^T \hat{\beta}) = x_0^T \operatorname{Var}(\hat{\beta})\, x_0 = \sigma^2 x_0^T (X^T X)^{-1} x_0. \qquad \square
\]

The following corollary is a consequence of Theorem 3.4.

Corollary 3.2. A $100(1 - \alpha)\%$ confidence interval for $\mu_0$ is
\[
\hat{\mu}_0 \pm t_{\frac{\alpha}{2}, n-p} \sqrt{S^2\, x_0^T (X^T X)^{-1} x_0}.
\]
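For illustration only, a sketch of Corollary 3.2 on synthetic data; the half-width is $t_{\alpha/2,\,n-p}\sqrt{S^2\, x_0^T (X^T X)^{-1} x_0}$.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
S2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)

x0 = np.array([1.0, 0.5, -0.3])                  # (1, x_{1,0}, x_{2,0})
mu0_hat = x0 @ beta_hat
half_width = stats.t.ppf(0.975, n - p) * np.sqrt(S2 * x0 @ XtX_inv @ x0)

print(f"95% CI for mu_0: ({mu0_hat - half_width:.3f}, {mu0_hat + half_width:.3f})")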

3.6 Predicting a new observation

To predict a new observation we need to take into account not only its expectation, but also a possible new random error. The point estimator of a new observation
\[
Y_0 = \left( Y \mid X_1 = x_{1,0}, \ldots, X_{p-1} = x_{p-1,0} \right) = \mu_0 + \varepsilon_0
\]
is
\[
\hat{Y}_0 = x_0^T \hat{\beta} \; (= \hat{\mu}_0),
\]
which, assuming normality, is such that
\[
\hat{Y}_0 \sim N(\mu_0, \sigma^2 x_0^T (X^T X)^{-1} x_0).
\]
Then
\[
\hat{Y}_0 - \mu_0 \sim N(0, \sigma^2 x_0^T (X^T X)^{-1} x_0)
\]
and
\[
\hat{Y}_0 - (\mu_0 + \varepsilon_0) \sim N(0, \sigma^2 x_0^T (X^T X)^{-1} x_0 + \sigma^2).
\]
That is,
\[
\hat{Y}_0 - Y_0 \sim N(0, \sigma^2 \{1 + x_0^T (X^T X)^{-1} x_0\})
\]
and hence
\[
\frac{\hat{Y}_0 - Y_0}{\sqrt{\sigma^2 \{1 + x_0^T (X^T X)^{-1} x_0\}}} \sim N(0, 1).
\]
As usual we estimate $\sigma^2$ by $S^2$ and get
\[
\frac{\hat{Y}_0 - Y_0}{\sqrt{S^2 \{1 + x_0^T (X^T X)^{-1} x_0\}}} \sim t_{n-p}.
\]
Hence a $100(1 - \alpha)\%$ prediction interval for $Y_0$ is given by
\[
\hat{Y}_0 \pm t_{\frac{\alpha}{2}, n-p} \sqrt{S^2 \{1 + x_0^T (X^T X)^{-1} x_0\}}.
\]
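The corresponding sketch for a prediction interval (again synthetic data, illustration only) differs from the confidence interval for $\mu_0$ only by the extra "1 +" under the square root.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
S2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)

x0 = np.array([1.0, 0.5, -0.3])
y0_hat = x0 @ beta_hat
half_width = stats.t.ppf(0.975, n - p) * np.sqrt(S2 * (1.0 + x0 @ XtX_inv @ x0))

print(f"95% prediction interval for Y_0: ({y0_hat - half_width:.3f}, {y0_hat + half_width:.3f})")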

3.7 Model Building

We have already mentioned the principle of parsimony: we should use the simplest model that achieves our purpose. It is easy to specify a very simple model ($Y_i = \beta_0 + \varepsilon_i$) and it is easy to represent the response by the data themselves; however, the first is generally too simple and the second is not a useful model. Achieving a simple model that describes the data well is something of an art. Often there is more than one model which does a reasonable job.

Example 3.2. Sales. A company is interested in the dependence of sales on promotional expenditure (X1, in £1000), the number of active accounts (X2), the district potential (X3, coded), and the number of competing brands (X4). We will try to find a good multiple regression model for the response variable Y (sales). Data on last year's sales (Y, in £100,000) in 15 sales districts are given in the file Sales.txt on the course website.

Figure 3.4: The Matrix Plot indicates that Y is clearly related to X4 and also to X2 . The relation with other explanatory variables is not that obvious.


Let us start with fitting a simple regression model of Y as a function of X4 only.

The regression equation is
Y = 396 - 25.1 X4

Predictor     Coef  SE Coef      T      P
Constant    396.07    49.25   8.04  0.000
X4         -25.051    5.242  -4.78  0.000

S = 49.9868   R-Sq = 63.7%   R-Sq(adj) = 60.9%

Analysis of Variance
Source          DF     SS     MS      F      P
Regression       1  57064  57064  22.84  0.000
Residual Error  13  32483   2499
Total           14  89547

The plot of residuals versus fitted values indicates that there may be non-constant variance, and the linearity of the model is also questionable. We will add X2 to the model.

The regression equation is
Y = 190 - 22.3 X4 + 3.57 X2

Predictor      Coef  SE Coef      T      P
Constant     189.83    10.13  18.74  0.000
X4         -22.2744   0.7076 -31.48  0.000
X2           3.5692   0.1333  26.78  0.000

S = 6.67497   R-Sq = 99.4%   R-Sq(adj) = 99.3%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       2  89012  44506  998.90  0.000
Residual Error  12    535     45
Total           14  89547

Source  DF  Seq SS
X4       1   57064
X2       1   31948

Still, there is some evidence that the standardized residuals may not have constant variance. Will this change if we add X3 to the model?

The regression equation is
Y = 190 - 22.3 X4 + 3.56 X2 + 0.049 X3

Predictor      Coef  SE Coef      T      P
Constant     189.60    10.76  17.62  0.000
X4         -22.2679   0.7408 -30.06  0.000
X2           3.5633   0.1482  24.05  0.000
X3           0.0491   0.4290   0.11  0.911

S = 6.96763   R-Sq = 99.4%   R-Sq(adj) = 99.2%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       3  89013  29671  611.17  0.000
Residual Error  11    534     49
Total           14  89547

Source  DF  Seq SS
X4       1   57064
X2       1   31948
X3       1       1


Not much better than before. Now we add X1, the explanatory variable least related to Y.

The regression equation is
Y = 177 - 22.2 X4 + 3.54 X2 + 0.204 X3 + 2.17 X1

Predictor      Coef  SE Coef      T      P
Constant    177.229    8.787  20.17  0.000
X4         -22.1583   0.5454 -40.63  0.000
X2           3.5380   0.1092  32.41  0.000
X3           0.2035   0.3189   0.64  0.538
X1           2.1702   0.6737   3.22  0.009

S = 5.11930   R-Sq = 99.7%   R-Sq(adj) = 99.6%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       4  89285  22321  851.72  0.000
Residual Error  10    262     26
Total           14  89547

Source  DF  Seq SS
X4       1   57064
X2       1   31948
X3       1       1
X1       1     272


The residuals now do not contradict the model assumptions. Analysing the numerical output, we see that X3 may be a redundant variable, as we have no evidence to reject the hypothesis that $\beta_3 = 0$ given that all the other variables are in the model. Hence, we will fit a new model without X3.

The regression equation is
Y = 179 - 22.2 X4 + 3.56 X2 + 2.11 X1

Predictor      Coef  SE Coef      T      P
Constant    178.521    8.318  21.46  0.000
X4         -22.1880   0.5286 -41.98  0.000
X2          3.56240  0.09945  35.82  0.000
X1           2.1055   0.6479   3.25  0.008

S = 4.97952   R-Sq = 99.7%   R-Sq(adj) = 99.6%

Analysis of Variance
Source          DF     SS     MS        F      P
Regression       3  89274  29758  1200.14  0.000
Residual Error  11    273     25
Total           14  89547

Source  DF  Seq SS
X4       1   57064
X2       1   31948
X1       1     262


These residual plots also do not contradict the model assumptions. On its own, variable X1 explains only 1% of the variation, but once X2 and X4 are included in the model, X1 is significant and also seems to cure the problems with normality and non-constant variance.
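For illustration only, a sketch of how this final model could be fitted in Python; the column layout of Sales.txt assumed below (Y, X1, X2, X3, X4, whitespace-separated, no header line) is a guess and may not match the file on the course website.

import numpy as np

# Assumed layout: columns Y, X1, X2, X3, X4, one row per district.
# If the file has a header line, pass skiprows=1 to np.loadtxt.
data = np.loadtxt("Sales.txt")
y, x1, x2, x4 = data[:, 0], data[:, 1], data[:, 2], data[:, 4]

n = len(y)
X = np.column_stack([np.ones(n), x4, x2, x1])     # same predictor order as the output above
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat
S = np.sqrt(e @ e / (n - X.shape[1]))

print("Coefficients (Constant, X4, X2, X1):", beta_hat)
print("S:", S)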

3.7.1 F-test for the deletion of a subset of variables

Suppose the overall regression model as tested by the Analysis of Variance table is significant. We know that not all of the β parameters are zero, but we may still be able to delete several variables. We can carry out the Subset Test based on the extra sum of squares principle. We are asking if we can reduce the set of regressors X1 , X2 , . . . , Xp−1 to, say, X1 , X2 , . . . , Xq−1 (renumbering if necessary) where q < p, by omitting Xq , Xq+1 , . . . , Xp−1 . We are interested in whether the inclusion of Xq , Xq+1 , . . . , Xp−1 in the model provides a significant increase in the overall regression sum of squares or equivalently a significant decrease in residual sum of squares. The difference between the sums of squares is called the extra sum of squares due to Xq , . . . , Xp−1 given X1 , . . . , Xq−1 are already in the model and is defined by the equation


\[
\begin{aligned}
SS(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1})
&= \underbrace{SS(X_1, X_2, \ldots, X_{p-1})}_{\text{regression SS for full model}}
 - \underbrace{SS(X_1, X_2, \ldots, X_{q-1})}_{\text{regression SS for reduced model}} \\
&= \underbrace{SS_E^{(\text{red})}}_{\text{residual SS under reduced model}}
 - \underbrace{SS_E^{(\text{full})}}_{\text{residual SS under full model}}.
\end{aligned}
\]

Notation: Let
\[
\beta_1^T = (\beta_0, \beta_1, \ldots, \beta_{q-1}), \qquad \beta_2^T = (\beta_q, \beta_{q+1}, \ldots, \beta_{p-1}),
\]
so that
\[
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.
\]
Similarly divide $X$ into two submatrices $X_1$ and $X_2$ so that $X = (X_1, X_2)$, where
\[
X_1 = \begin{pmatrix}
1 & x_{1,1} & \cdots & x_{q-1,1} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,n} & \cdots & x_{q-1,n}
\end{pmatrix},
\qquad
X_2 = \begin{pmatrix}
x_{q,1} & \cdots & x_{p-1,1} \\
\vdots & & \vdots \\
x_{q,n} & \cdots & x_{p-1,n}
\end{pmatrix}.
\]
The full model
\[
Y = X\beta + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon
\]
has
\[
SS_R^{(\text{full})} = Y^T H Y - n\bar{Y}^2 = \hat{\beta}^T X^T Y - n\bar{Y}^2, \qquad
SS_E^{(\text{full})} = Y^T (I - H) Y = Y^T Y - \hat{\beta}^T X^T Y.
\]
Similarly the reduced model
\[
Y = X_1 \beta_1 + \varepsilon^\star
\]
has
\[
SS_R^{(\text{red})} = \hat{\beta}_1^T X_1^T Y - n\bar{Y}^2, \qquad
SS_E^{(\text{red})} = Y^T Y - \hat{\beta}_1^T X_1^T Y.
\]
Hence the extra sum of squares is
\[
SS_{\text{extra}} = \hat{\beta}^T X^T Y - \hat{\beta}_1^T X_1^T Y.
\]


To determine whether the change in sum of squares is significant, we test the hypothesis
\[
H_0 : \beta_q = \beta_{q+1} = \ldots = \beta_{p-1} = 0 \quad \text{versus} \quad H_1 : \neg H_0.
\]
It can be shown that, if $H_0$ is true,
\[
F = \frac{SS_{\text{extra}}/(p - q)}{S^2} \sim F_{p-q,\, n-p}.
\]
So we reject $H_0$ at the $\alpha$ level if $F > F_{\alpha; p-q, n-p}$ and conclude that there is sufficient evidence that some (but not necessarily all) of the 'extra' variables $X_q, \ldots, X_{p-1}$ should be included in the model. The ANOVA table is given by

Source                                             d.f.    SS                      MS                        VR
Overall regression                                 p - 1   $SS_R^{(\text{full})}$
  $X_1, \ldots, X_{q-1}$                           q - 1   $SS_R^{(\text{red})}$
  $X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}$ p - q   $SS_{\text{extra}}$     $SS_{\text{extra}}/(p-q)$  $SS_{\text{extra}}/\{(p-q)\,MS_E\}$
Residual                                           n - p   $SS_E$                  $MS_E$
Total                                              n - 1   $SS_T$

In the ANOVA table we use the notation $X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}$ to denote that this is the effect of the variables $X_q, \ldots, X_{p-1}$ given that the variables $X_1, \ldots, X_{q-1}$ are already included in the model.

Note that, as the $F_{1,\nu}$ distribution is equivalent to $t^2_\nu$, the F-test for $H_0 : \beta_{p-1} = 0$, that is for the inclusion of a single variable $X_{p-1}$ (this is the case $q = p - 1$), can also be performed by an equivalent $T$-test, where
\[
T = \frac{\hat{\beta}_{p-1}}{\operatorname{se}(\hat{\beta}_{p-1})} \sim t_{n-p},
\]
where $\operatorname{se}(\hat{\beta}_{p-1})$ is the estimated standard error of $\hat{\beta}_{p-1}$.
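A sketch of the subset F-test on synthetic data, for illustration only; $SS_{\text{extra}}$ is obtained as $SS_E^{(\text{red})} - SS_E^{(\text{full})}$, exactly as in the definition above.

import numpy as np
from scipy import stats

def residual_ss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

rng = np.random.default_rng(7)
n, p, q = 40, 5, 3                        # full: intercept + 4 variables; reduced: intercept + 2
Z = rng.normal(size=(n, p - 1))
X_full = np.column_stack([np.ones(n), Z])
X_red = X_full[:, :q]                     # keep the intercept and the first q - 1 variables
y = X_full @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

SSE_full = residual_ss(X_full, y)
SSE_red = residual_ss(X_red, y)
SS_extra = SSE_red - SSE_full
S2 = SSE_full / (n - p)                   # MSE from the full model

F_obs = (SS_extra / (p - q)) / S2
p_value = stats.f.sf(F_obs, p - q, n - p)
print(f"F = {F_obs:.3f}, p-value = {p_value:.4f}")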

Also, note that we can repeatedly test individual parameters and we get the following Sums of Squares and degrees of freedom


Source of variation                 df      SS
Full model                          p - 1   $SS_R$
  $X_1$                             1       $SS(\beta_1)$
  $X_2 \mid X_1$                    1       $SS(\beta_2 \mid \beta_1)$
  $X_3 \mid X_1, X_2$               1       $SS(\beta_3 \mid \beta_1, \beta_2)$
  ...                               ...     ...
  $X_{p-1} \mid X_1, \ldots, X_{p-2}$  1    $SS(\beta_{p-1} \mid \beta_1, \ldots, \beta_{p-2})$
Residual                            n - p   $SS_E$
Total                               n - 1   $SS_T$

The output depends on the order in which the predictors are entered into the model. The sequential sum of squares is the unique portion of $SS_R$ explained by a predictor, given any previously entered predictors. If you have a model with three predictors, X1, X2, and X3, the sequential sum of squares for X3 shows how much of the remaining variation X3 explains given that X1 and X2 are already in the model.
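This order dependence can be seen by entering the predictors one at a time and recording the drop in the residual sum of squares at each step; the following sketch (synthetic data, illustration only) computes the sequential sums of squares in this way.

import numpy as np

def residual_ss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

rng = np.random.default_rng(8)
n = 30
Z = rng.normal(size=(n, 3))                          # columns play the roles of X1, X2, X3
y = 1.0 + Z @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

X = np.ones((n, 1))                                  # start from the null model E(Y) = beta_0
sse_prev = residual_ss(X, y)
for j in range(3):
    X = np.column_stack([X, Z[:, j]])                # enter X_{j+1} after X_1, ..., X_j
    sse_new = residual_ss(X, y)
    print(f"Seq SS for X{j + 1} given previous: {sse_prev - sse_new:.1f}")
    sse_prev = sse_new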