Some properties of Likelihood Ratio Tests in Linear Mixed Models

Ciprian M. Crainiceanu∗

David Ruppert†

Timothy J. Vogelsang‡

September 19, 2003

Abstract We calculate the finite sample probability mass at zero and the probability of underestimating the true ratio between the random effects variance and the error variance in an LMM with one variance component. The calculations are expedited by simple matrix diagonalization techniques. One possible application is to compute the probability that the log of the likelihood ratio (LRT), or of the residual likelihood ratio (RLRT), is zero. The large sample chi-square mixture approximation to the distribution of the log-likelihood ratio, using the usual asymptotic theory for when a parameter is on the boundary, has been shown to be poor in simulation studies. A large part of the problem is that the finite-sample probability that the LRT or RLRT statistic is zero is larger than 0.5, its value under the chi-square mixture approximation. Our calculations explain these empirical results. Another application is to show why standard asymptotic results can fail even when the parameter under the null is in the interior of the parameter space. This paper focuses on LMMs with one variance component because we have developed a very rapid algorithm for simulating finite-sample distributions of the LRT and RLRT statistics for this case. This allows us to compare finite-sample distributions with asymptotic approximations. The main result is that the asymptotic approximations are often poor, and this result suggests that asymptotics be used with caution, or avoided altogether, for any LMM regardless of whether it has one variance component or more. For computing the distribution of the test statistics we recommend our algorithm for the case of one variance component and the bootstrap in other cases.

Short title: Properties of (R)LRT
Keywords: Effects of dependence, Penalized splines, Testing polynomial regression.



∗ Department of Statistical Science, Cornell University, Malott Hall, NY 14853, USA. E-mail: [email protected]
† School of Operational Research and Industrial Engineering, Cornell University, Rhodes Hall, NY 14853, USA. E-mail: [email protected]
‡ Departments of Economics and Statistics, Cornell University, Uris Hall, NY 14853-7601, USA. E-mail: [email protected]

1 INTRODUCTION

This work was motivated by our research in testing parametric regression models versus nonparametric alternatives. It is becoming more widely appreciated that penalized splines and other penalized likelihood models can be viewed as LMMs and the fitted curves as BLUPs (e.g., Brumback, Ruppert, and Wand, 1999). In this framework the smoothing parameter is a ratio of variance components and can be estimated by ML or REML. REML is often called generalized maximum likelihood (GML) in the smoothing spline literature. Within the random effects framework, it is natural to consider likelihood ratio tests and residual likelihood ratio tests, (R)LRTs, about the smoothing parameter. In particular, testing whether the smoothing parameter is zero is equivalent to testing for polynomial regression versus a general alternative modeled by penalized splines. These null hypotheses are also equivalent to the hypothesis that a variance component is zero. LRTs for null variance components are non-standard for two reasons. First, the null value of the parameter is on the boundary of the parameter space. Second, the data are dependent, at least under the alternative hypothesis. The focus of our research is on the finite sample distributions of (R)LRT statistics, and the asymptotic distributions are derived as limits of the finite sample distributions in order to compare the accuracy of various types of asymptotics. For example, for balanced one-way ANOVA we compare asymptotics with a fixed number of samples and the number of observations per sample going to infinity with the opposite case of the number of samples tending to infinity with the number of observations per sample fixed. Our major results are:

a. The usual asymptotic theory for standard or boundary problems provides accurate approximations to finite-sample distributions only when the response vector can be partitioned into a large number of independent sub-vectors, for all values of the parameters.

b. The asymptotic approximations can be very poor when the number of independent sub-vectors is small or moderate.

c. Penalized spline models do not satisfy the condition described in point a., and standard asymptotic results fail rather dramatically.

The usual asymptotics for testing that a single parameter is at the boundary of its range is a 50 : 50 mixture of point mass at zero (called χ²_0) and a χ²_1 distribution. A major reason why the asymptotics fail to produce accurate finite-sample approximations is that the finite-sample probability mass at χ²_0 is substantially greater than 0.5, especially for the LRT but even for the RLRT. This paper studies the amount of probability mass at 0 for these test statistics. However, our methods are sufficiently powerful that the finite-sample distribution of the LRT and RLRT statistics, conditional on being zero, can also be derived. These distributions are studied in a later paper (Crainiceanu and Ruppert, 2003). Our work is applicable to most LMMs with a single random effects variance component, not only to penalized likelihood models. We study LMMs with only a single variance component for tractability. When there is only one variance component, the distributions of the LRT and RLRT statistics can be simplified in such a way that simulation of these distributions is extremely rapid. Although our results do not apply to LMMs with more than one variance component, they strongly suggest that asymptotic approximations be used with great caution for such models. Since asymptotic approximations are poor for LMMs with one variance component, it is unlikely that they are satisfactory in general when there are more variance components. Consider the following LMM
\[
Y = X\beta + Zb + \epsilon, \qquad
E\begin{bmatrix} b \\ \epsilon \end{bmatrix} = \begin{bmatrix} 0_K \\ 0_n \end{bmatrix}, \qquad
\mathrm{Cov}\begin{bmatrix} b \\ \epsilon \end{bmatrix} = \begin{bmatrix} \sigma_b^2 \Sigma & 0 \\ 0 & \sigma_\epsilon^2 I_n \end{bmatrix}, \tag{1}
\]

where Y is an n-dimensional response vector, 0_K is a K-dimensional column of zeros, Σ is a known K × K matrix, β is a p-dimensional vector of parameters corresponding to fixed effects, b is a K-dimensional vector of exchangeable random effects, and (b, ε) is a normally distributed random vector. Under these conditions it follows that E(Y) = Xβ and Cov(Y) = σ_ε² V_λ, where λ = σ_b²/σ_ε² is the ratio between the variance of the random effects b and the variance of the error variables ε, V_λ = I_n + λZΣZ^T, and n is the size of the response vector Y. Note that σ_b² = 0 if and only if λ = 0, and the parameter space for λ is [0, ∞). The LMM described by equation (1) contains standard regression fixed effects Xβ specifying the conditional response mean and random effects Zb that account for correlation. We are interested in testing
\[
H_0: \lambda = \lambda_0 \quad \text{vs.} \quad H_A: \lambda \in [0,\infty) \setminus \{\lambda_0\}, \tag{2}
\]

where λ0 ∈ [0, ∞). Consider the case λ0 = 0 (σ_b² = 0), when the parameter is on the boundary of the parameter space under the null. Using non-standard asymptotic theory developed by Self and Liang (1987)

for independent data, one may be tempted to conclude that the finite-sample distribution of the (R)LRT could be approximated by a 0.5χ²_0 + 0.5χ²_1 mixture. Here χ²_k is the chi-square distribution with k degrees of freedom and χ²_0 means point probability mass at 0. However, the results of Self and Liang (1987) require independence for all values of the parameter. Because the response variable Y in model (1) is not a vector of independent random variables, this theory does not apply. Stram and Lee (1994) showed that the Self and Liang result can still be applied to testing for zero variance of random effects in LMMs in which the response variable Y can be partitioned into independent sub-vectors and the number of independent sub-vectors tends to infinity. In a simulation study for a related model, Pinheiro and Bates (2000) found that a 0.5χ²_0 + 0.5χ²_1 mixture distribution approximates well the finite sample distribution of the RLRT, but that a 0.65χ²_0 + 0.35χ²_1 mixture better approximates the finite sample distribution of the LRT. A case where it has been shown that the asymptotic mixture probabilities differ from 0.5 : 0.5 is regression with a stochastic trend, analyzed by Shephard and Harvey (1990) and Shephard (1993). They consider the particular case of model (1) where the random effects b are modeled as a random walk and show that the asymptotic mass at zero can be as large as 0.96 for the LRT and 0.65 for the RLRT. For the case when λ0 > 0 we show that the distribution of the (RE)ML estimator of λ has mass at zero. Therefore, even when the parameter is in the interior of the parameter space under the null, the asymptotic distributions of the (R)LRT statistics are not χ²_1. We also calculate the probability of underestimating the true parameter λ0 and show that in penalized spline models this probability is larger than 0.5, showing that (RE)ML criteria tend to oversmooth the data. This effect is more severe for ML than for REML.
Section 6.2 of Khuri, Mathew, and Sinha (1998) studies mixed models with one variance component. (They consider the error variance as a variance component, so they call this case “two variance components.”) In their Theorem 6.2.2, they derive the LBI (locally best invariant) test of H_0: λ = 0, which rejects for large values of
\[
F_* = \frac{e^T Z \Sigma Z^T e}{e^T e}, \tag{3}
\]
where e = {I_n − X(X^T X)^{-1} X^T}Y is the residual vector when fitting the null model. A test is LBI if among all invariant tests it maximizes power in some neighborhood of the null hypothesis. Notice that the denominator of (3) is, except for a scale factor, the estimator of σ_ε² under the null hypothesis and will be inflated by deviations from the null. This suggests that the test might have low power at alternatives far from the null. Khuri (1994) studies the probability in an LMM that a linear combination of independent mean squares is negative. For certain balanced models, there is an estimator of σ_b² of this form such that this estimator is negative if and only if the (R)LRT statistic is zero; see, for example, Sections 3.7 and 3.8 of Searle, Casella, and McCulloch (1992). However, in general Khuri’s results do not apply to our problem.

2 SPECTRAL DECOMPOSITION OF (R)LRTn

Consider maximum likelihood estimation (MLE) for model (1). Twice the log-likelihood of Y given the parameters β, σ_ε², and λ is, up to a constant that does not depend on the parameters,
\[
L(\beta,\sigma_\epsilon^2,\lambda) = -n\log\sigma_\epsilon^2 - \log|V_\lambda| - \frac{(Y-X\beta)^T V_\lambda^{-1}(Y-X\beta)}{\sigma_\epsilon^2}. \tag{4}
\]

Residual or restricted maximum likelihood (REML) was introduced by Patterson and Thompson (1971) to take into account the loss in degrees of freedom due to the estimation of the β parameters and thereby to obtain unbiased variance components estimators. REML consists of maximizing the likelihood function associated with n − p linearly independent error contrasts. It makes no difference which n − p contrasts are used because the likelihood functions for any such sets differ by no more than an additive constant (Harville, 1977). For the LMM described in equation (1), twice the residual log-likelihood was derived by Harville (1974) and is
\[
REL(\sigma_\epsilon^2,\lambda) = -(n-p)\log\sigma_\epsilon^2 - \log|V_\lambda| - \log|X^T V_\lambda^{-1} X| - \frac{(Y-X\hat\beta_\lambda)^T V_\lambda^{-1}(Y-X\hat\beta_\lambda)}{\sigma_\epsilon^2}, \tag{5}
\]
where \(\hat\beta_\lambda = (X^T V_\lambda^{-1} X)^{-1} X^T V_\lambda^{-1} Y\) maximizes the likelihood as a function of β for a fixed value of λ. The (R)LRT statistics for testing the hypotheses described in (2) are
\[
LRT_n = \sup_{H_A \cup H_0} L(\beta,\sigma_\epsilon^2,\lambda) - \sup_{H_0} L(\beta,\sigma_\epsilon^2,\lambda), \qquad
RLRT_n = \sup_{H_A \cup H_0} REL(\sigma_\epsilon^2,\lambda) - \sup_{H_0} REL(\sigma_\epsilon^2,\lambda). \tag{6}
\]

Denote by μ_{s,n} and ξ_{s,n} the K eigenvalues of the K × K matrices Σ^{1/2}Z^T P_0 ZΣ^{1/2} and Σ^{1/2}Z^T ZΣ^{1/2} respectively, where P_0 = I_n − X(X^T X)^{-1}X^T. Crainiceanu and Ruppert (2003) showed that if λ0 is the true value of the parameter then
\[
LRT_n \stackrel{D}{=} \sup_{\lambda \in [0,\infty)} \left[ n \log\left\{1 + \frac{N_n(\lambda,\lambda_0)}{D_n(\lambda,\lambda_0)}\right\} - \sum_{s=1}^K \log\left(\frac{1+\lambda\xi_{s,n}}{1+\lambda_0\xi_{s,n}}\right) \right], \tag{7}
\]
\[
RLRT_n \stackrel{D}{=} \sup_{\lambda \in [0,\infty)} \left[ (n-p) \log\left\{1 + \frac{N_n(\lambda,\lambda_0)}{D_n(\lambda,\lambda_0)}\right\} - \sum_{s=1}^K \log\left(\frac{1+\lambda\mu_{s,n}}{1+\lambda_0\mu_{s,n}}\right) \right], \tag{8}
\]
where “\(\stackrel{D}{=}\)” denotes equality in distribution,
\[
N_n(\lambda,\lambda_0) = \sum_{s=1}^K \frac{(\lambda-\lambda_0)\mu_{s,n}}{1+\lambda\mu_{s,n}}\, w_s^2, \qquad
D_n(\lambda,\lambda_0) = \sum_{s=1}^K \frac{1+\lambda_0\mu_{s,n}}{1+\lambda\mu_{s,n}}\, w_s^2 + \sum_{s=K+1}^{n-p} w_s^2,
\]
and the w_s, for s = 1, ..., n − p, are independent N(0, 1). These null finite sample distributions are easy to simulate (Crainiceanu and Ruppert, 2003).
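To illustrate, the spectral form in equation (8) lends itself to a few lines of code. The sketch below is our own illustration, not the authors' implementation: it approximates the sup over λ on a log-spaced grid (the grid bounds and resolution are arbitrary choices), given the eigenvalues μ_{s,n}.

```python
import numpy as np

def rlrt_null_sim(mu, n, p, lam0=0.0, nsim=10_000, lam_grid=None, rng=None):
    """Simulate the finite-sample distribution of RLRT_n from its spectral
    form (equation (8)), given the K eigenvalues mu of
    Sigma^(1/2) Z' P_0 Z Sigma^(1/2).  The sup over lambda is approximated
    on a log-spaced grid, a simplification of exact optimization."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    K = mu.size
    if lam_grid is None:
        lam_grid = np.concatenate(([0.0], np.logspace(-4, 4, 200)))
    w2 = rng.chisquare(1, size=(nsim, K))          # w_s^2 for s = 1, ..., K
    w2_rest = rng.chisquare(n - p - K, size=nsim)  # sum of w_s^2, s = K+1, ..., n-p
    out = np.full(nsim, -np.inf)
    for lam in lam_grid:
        Nn = w2 @ ((lam - lam0) * mu / (1.0 + lam * mu))
        Dn = w2 @ ((1.0 + lam0 * mu) / (1.0 + lam * mu)) + w2_rest
        crit = (n - p) * np.log1p(Nn / Dn) - np.log((1.0 + lam * mu) / (1.0 + lam0 * mu)).sum()
        out = np.maximum(out, crit)
    return out

# Balanced one-way ANOVA with K = 5 levels and J = 5 observations per level:
# mu = (J, ..., J, 0), n = 25, p = 1.  Under lambda_0 = 0 the fraction of
# exact zeros estimates the probability mass at zero of RLRT_n.
sim = rlrt_null_sim(np.array([5.0, 5.0, 5.0, 5.0, 0.0]), n=25, p=1, nsim=4000, rng=1)
print(np.mean(sim == 0.0))  # roughly 0.57 (compare Table 1, REML, J = 5)
```

The grid always contains λ = 0, where the criterion is exactly zero, so the simulated statistic is nonnegative and the mass at zero can be read off by exact comparison with 0.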

3 PROBABILITY MASS AT ZERO OF (R)LRT

Denote by f(·) and g(·) the functions to be maximized in equations (7) and (8) respectively. Note that the probability mass at zero for LRTn or RLRTn equals the probability that the function f(·) or g(·) has a global maximum at λ = 0. For a given sample size we compute the exact probability of having a local maximum of f(·) or g(·) at λ = 0. This probability is an upper bound for the probability of having a global maximum at zero but, as we will show using simulations, it provides an excellent approximation. The first order condition for having a local maximum of f(·) at λ = 0 is f′(0) ≤ 0, where the derivative is taken from the right. The finite sample probability of a local maximum at λ = 0 for ML, when λ0 is the true value of the parameter, is
\[
P\left\{ \frac{\sum_{s=1}^K (1+\lambda_0\mu_{s,n})\mu_{s,n}\, w_s^2}{\sum_{s=1}^K (1+\lambda_0\mu_{s,n})\, w_s^2 + \sum_{s=K+1}^{n-p} w_s^2} \le \frac{1}{n}\sum_{s=1}^K \xi_{s,n} \right\}, \tag{9}
\]
where μ_{s,n} and ξ_{s,n} are the eigenvalues of the K × K matrices Σ^{1/2}Z^T P_0 ZΣ^{1/2} and Σ^{1/2}Z^T ZΣ^{1/2} respectively, and the w_s are i.i.d. N(0, 1) random variables. If λ0 = 0 then the probability of a local maximum at λ = 0 is
\[
P\left\{ \frac{\sum_{s=1}^K \mu_{s,n}\, w_s^2}{\sum_{s=1}^{n-p} w_s^2} \le \frac{1}{n}\sum_{s=1}^K \xi_{s,n} \right\}. \tag{10}
\]
Using similar derivations for REML, the probability mass at zero when λ = λ0 is
\[
P\left\{ \frac{\sum_{s=1}^K (1+\lambda_0\mu_{s,n})\mu_{s,n}\, w_s^2}{\sum_{s=1}^K (1+\lambda_0\mu_{s,n})\, w_s^2 + \sum_{s=K+1}^{n-p} w_s^2} \le \frac{1}{n-p}\sum_{s=1}^K \mu_{s,n} \right\}, \tag{11}
\]
and, in the particular case when λ0 = 0, the probability of a local maximum at λ = 0 is
\[
P\left\{ \frac{\sum_{s=1}^K \mu_{s,n}\, w_s^2}{\sum_{s=1}^{n-p} w_s^2} \le \frac{1}{n-p}\sum_{s=1}^K \mu_{s,n} \right\}. \tag{12}
\]

Once the eigenvalues μ_{s,n} and ξ_{s,n} are computed explicitly or numerically, these probabilities can be simulated. Algorithms for computing the distribution of a linear combination of χ²_1 random variables developed by Davies (1980) or Farebrother (1990) could also be used, but we used simulations because they are simple, accurate, and easier to program. For K = 20 we obtained 1 million simulations in 1 minute (2.66GHz, 1Mb RAM). The probabilities in equations (9) and (11) are the probabilities that λ = 0 is a local maximum and provide approximations of the probabilities that λ = 0 is a global maximum. The latter is equal to the finite sample probability mass at zero of the (R)LRT and of the (RE)ML estimator of λ when the true value of the parameter is λ0. For every value λ0 we can compute the probability of a local maximum at λ = 0 for (RE)ML using the corresponding equation (9) or (11). However, there is no closed form for the probability of a global maximum at λ = 0, and we use simulation of the finite sample distributions of the (R)LRTn statistics described in equations (7) or (8). In Sections 5 and 6 we show that there is close agreement between the probability of a local and a global maximum at λ = 0 for two examples: balanced one-way ANOVA and penalized spline models.
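As a concrete sketch (our own notation, not the paper's code), the probability in equation (12) can be estimated by direct Monte Carlo once the eigenvalues μ_{s,n} are available; note that the numerator and denominator share the first K chi-square draws.

```python
import numpy as np

def prob_local_max_at_zero_reml(mu, n, p, nsim=100_000, rng=0):
    """Monte Carlo estimate of probability (12), i.e. that the restricted
    likelihood has a local maximum at lambda = 0 when lambda_0 = 0:
    P{ sum_s mu_s w_s^2 / sum_{s=1}^{n-p} w_s^2 <= (1/(n-p)) sum_s mu_s }."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    w2 = rng.chisquare(1, size=(nsim, mu.size))      # shared w_s^2, s = 1, ..., K
    num = w2 @ mu
    den = w2.sum(axis=1) + rng.chisquare(n - p - mu.size, size=nsim)
    return np.mean(num / den <= mu.sum() / (n - p))

# Balanced one-way ANOVA, K = 5 levels, J = 5 (mu = (J, ..., J, 0), n = 25, p = 1):
# this should be close to P(F_{4,20} < 1), about 0.57 (compare Table 1, REML, J = 5).
print(prob_local_max_at_zero_reml(np.array([5.0, 5.0, 5.0, 5.0, 0.0]), n=25, p=1))
```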

4 PROBABILITY OF UNDERESTIMATING THE SIGNAL-TO-NOISE PARAMETER

Denote by λ̂_ML and λ̂¹_ML the global and the first local maximum of f(·), respectively. Define λ̂_REML and λ̂¹_REML similarly using g(·). When λ0 is the true value of the signal-to-noise parameter,
\[
P\left(\hat\lambda^1_{ML} < \lambda_0\right) \ge P\left\{ \left.\frac{\partial}{\partial\lambda} f(\lambda)\right|_{\lambda=\lambda_0} < 0 \right\}.
\]

The probability mass at zero of the (R)LRTn is equal to the probability of a global maximum at λ = 0. The probability mass at zero under the alternative (λ0 > 0) is larger for LRTn than for RLRTn, thus suggesting that RLRTn may have better power properties than LRTn. We focus now on the properties of the null asymptotic distributions of (R)LRTn for testing the null hypothesis of zero variance of the random effects. Because the response variable Y can be partitioned into K J-dimensional i.i.d. sub-vectors corresponding to each level, when the number of levels K increases to infinity the asymptotic distribution is 0.5χ²_0 + 0.5χ²_1. However, in applications both the number of levels K and the number of observations per level J are fixed. If K is small or moderate (< 100) then the 0.5 approximation of the probability mass at zero is far from the true value. To make the comparison simple we consider the case J → ∞ in equation (19) and obtain the null asymptotic probability mass at zero
\[
P_{ML}(K) = P\{X_{K-1} < K\} \quad \text{and} \quad P_{REML}(K) = P\{X_{K-1} < K-1\},
\]

Table 1: Probability of having a local (global) maximum at λ = 0 for ML and REML. The number of levels is K = 5.

                           ML                                    REML
J \ λ0      0        0.01     0.1      1          0        0.01     0.1      1
5         0.678    0.655    0.480    0.069      0.569    0.545    0.377    0.047
         (0.678)  (0.655)  (0.480)  (0.069)    (0.570)  (0.545)  (0.377)  (0.047)
10        0.696    0.648    0.353    0.023      0.583    0.533    0.264    0.015
         (0.696)  (0.648)  (0.353)  (0.023)    (0.582)  (0.532)  (0.264)  (0.015)
20        0.705    0.610    0.204    0.007      0.588    0.493    0.145    0.004
         (0.704)  (0.609)  (0.204)  (0.007)    (0.588)  (0.492)  (0.145)  (0.004)
40        0.709    0.531    0.091    0.002      0.591    0.417    0.062    0.001
         (0.709)  (0.529)  (0.090)  (0.002)    (0.591)  (0.417)  (0.062)  (0.001)

Notes: The finite sample probability of having a global maximum (probability mass at zero of LRTn and RLRTn respectively) is reported within parentheses. It represents the frequency of estimating λ = 0 for different true values λ0 in 1 million simulations of the distributions described in equations (7) or (8) for K = 5 levels and different numbers of observations J per level. The standard deviation of each of these estimated probabilities is at most 0.0005.

where X_r denotes a random variable with a χ² distribution with r degrees of freedom. Figure 1 shows P_ML(K) and P_REML(K) versus K. By the central limit theorem both P_ML(K) and P_REML(K) tend to 0.5, but for K < 100 these probabilities are much larger than 0.5. Indeed, P_ML(5) = 0.713, P_ML(10) = 0.650, P_ML(20) = 0.605 and P_ML(100) = 0.547.
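The limits P_ML(K) = P{X_{K−1} < K} and P_REML(K) = P{X_{K−1} < K − 1} need no simulation. A small self-contained check (our own illustration, using the closed-form chi-square CDF, which is elementary when K − 1 is even) reproduces the quoted value P_ML(5) = 0.713:

```python
import math

def chi2_cdf_even_df(x, df):
    """Chi-square CDF for even degrees of freedom:
    P(X_df <= x) = 1 - exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!."""
    assert df % 2 == 0 and df > 0
    s = sum((x / 2.0) ** j / math.factorial(j) for j in range(df // 2))
    return 1.0 - math.exp(-x / 2.0) * s

def mass_at_zero_anova(K):
    """Asymptotic (J -> infinity) null probability mass at zero for balanced
    one-way ANOVA with K levels: ML P(X_{K-1} < K), REML P(X_{K-1} < K-1).
    Restricted here to odd K so that K - 1 is even."""
    return chi2_cdf_even_df(K, K - 1), chi2_cdf_even_df(K - 1, K - 1)

p_ml, p_reml = mass_at_zero_anova(5)     # K - 1 = 4 is even
print(round(p_ml, 3), round(p_reml, 3))  # 0.713 0.594
```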

5.2 PROBABILITY OF UNDERESTIMATING THE SMOOTHING PARAMETER

We now investigate the probability of underestimating λ0 using ML and REML when λ0 is the true value and the design is balanced. It is easy to see that c_{s,n}(λ0) = 1/K in equation (13) and d_{s,n}(λ0) = 1/(K − 1), for s = 1, ..., K − 1, and c_{K,n}(λ0) = d_{K,n}(λ0) = 0. Therefore
\[
p_{ML}(\lambda_0) = P\{F_{K-1,\,n-K} < K/(K-1)\} \quad \text{and} \quad p_{REML}(\lambda_0) = P\{F_{K-1,\,n-K} < 1\}, \tag{19}
\]
which are the probabilities obtained from the first order conditions and do not depend on λ0. Table 2 displays these probabilities for K = 5 levels and several values of J and compares them with the exact probability of underestimating λ0 calculated using 1 million simulations of the distributions described in equations (7) or (8). The latter is reported in parentheses. We used λ0 = 1, but similar results were obtained for other values of λ0. There is close agreement between these probabilities, and ML underestimates λ0 much more frequently than REML.

Calculations for the balanced one-way ANOVA model can be done analytically because the eigenvalues µs,n and ξs,n can be calculated explicitly and have a particularly simple form. Standard asymptotic theory for a parameter on the boundary holds when K and J are large but fails when K is moderate and J is large. Crainiceanu and Ruppert (2003) suggest using the finite sample distributions described in equations (7) and (8), which are very easy to simulate.

Table 2: Probability of underestimating the true value of the signal-to-noise ratio parameter λ0 for ML and REML. The number of levels is K = 5.

J            5         10        20        40
ML         0.678     0.696     0.705     0.709
          (0.677)   (0.695)   (0.704)   (0.708)
REML       0.569     0.583     0.588     0.591
          (0.568)   (0.581)   (0.587)   (0.590)

Notes: The finite sample probability of underestimating λ0 is reported within parentheses. It represents the frequency of estimating λ < λ0 for different true values λ0 in 1 million simulations of the distributions described in equations (7) or (8) for K = 5 levels and different numbers of observations J per level. λ0 = 1 but other values give similar results. The standard deviation of each of these estimated probabilities is at most 0.0005.

5.3 THE UNBALANCED ONE-WAY DESIGN

For unbalanced data, Searle, Casella, and McCulloch (1992, p. 88) state that the probability mass at zero “cannot be easily specified.” Apparently, this quantity cannot be expressed simply using the F-distribution. However, it is a simple case of (9).

5.4 OTHER TESTS

Khuri, Mathew, and Sinha (1998) discuss several approaches to testing in LMMs besides (R)LRTs. Wald’s variance component test is simply the F-test assuming that all parameters are fixed effects. Under the null hypothesis the test statistic has an exact F-distribution even for unbalanced data. For the balanced one-way design, Wald’s test is UMPS (uniformly most powerful similar) and therefore UMPU (uniformly most powerful unbiased) and is also UMPI (uniformly most powerful invariant). In the case of the unbalanced one-way design, there are no UMPS, UMPU, or UMPI tests. However, there is a LBI (locally best invariant) test which was derived by Das and Sinha

(1987). The test statistic, which is given by equation (3) or (4.2.8) of Khuri, Mathew, and Sinha (1998), is a ratio of quadratic forms in Y, so percentiles of its distribution can be found by Davies’s (1980) algorithm.

6 TESTING POLYNOMIAL REGRESSION VERSUS A NONPARAMETRIC ALTERNATIVE

In this section we show that nonparametric regression using P-splines is equivalent to a particular LMM. In this context, the smoothing parameter is the ratio between the random effects and error variances, and testing assumptions about the shape of the regression function is equivalent to testing hypotheses about the smoothing parameter. We first focus on testing for a polynomial regression versus a general alternative modeled by penalized splines, which is equivalent to testing for a zero smoothing parameter (or zero random effects variance). For this hypothesis, we study the probability mass at zero of the (R)LRTn statistics under the null and alternative hypotheses. In particular, we show that the null probability mass at zero is much larger than 0.5. Because in penalized spline models the data vector cannot be partitioned into more than one i.i.d. sub-vector, the Self and Liang assumptions do not hold, but it was an open question whether the results themselves held. Our results show that the 0.5 : 0.5 mixture of χ² distributions cannot be extended to approximate the finite sample distributions of (R)LRTn regardless of the number of knots used. We also investigate the probability of underestimating the true smoothing parameter, which in the context of penalized spline smoothing is the probability of oversmoothing. We show that the first order results described in Section 4 provide excellent approximations of the exact probability of oversmoothing, and that the probability of oversmoothing with (RE)ML is generally larger than 0.5.

6.1 P-SPLINES REGRESSION AND LINEAR MIXED MODELS

Consider the following regression equation
\[
y_i = m(x_i) + \epsilon_i, \tag{20}
\]
where the ε_i are i.i.d. N(0, σ_ε²) and m(·) is the unknown mean function. Suppose that we are interested in testing whether m(·) is a p-th degree polynomial:
\[
H_0: m(x) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p.
\]

To define an alternative that is flexible enough to describe a large class of functions, we consider the class of regression splines
\[
H_A: m(x) = m(x,\Theta) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p + \sum_{k=1}^K b_k (x-\kappa_k)_+^p, \tag{21}
\]

where Θ = (β_0, ..., β_p, b_1, ..., b_K)^T is the vector of regression coefficients, β = (β_0, ..., β_p)^T is the vector of polynomial parameters, b = (b_1, ..., b_K)^T is the vector of spline coefficients, and κ_1 < κ_2 < ... < κ_K are fixed knots. Following Gray (1994) and Ruppert (2002), we consider a number of knots that is large enough (e.g. 20) to ensure the desired flexibility. The knots are taken to be sample quantiles of the x’s such that κ_k corresponds to probability k/(K + 1). To avoid overfitting, the criterion to be minimized is a penalized sum of squares
\[
\sum_{i=1}^n \{y_i - m(x_i;\Theta)\}^2 + \frac{1}{\lambda}\,\Theta^T W \Theta, \tag{22}
\]

where λ ≥ 0 is the smoothing parameter and W is a positive semi-definite matrix. Denote Y = (y_1, y_2, ..., y_n)^T, X the matrix having i-th row X_i = (1, x_i, ..., x_i^p), Z the matrix having i-th row Z_i = {(x_i − κ_1)_+^p, (x_i − κ_2)_+^p, ..., (x_i − κ_K)_+^p}, and X = [X|Z]. In this paper we focus on matrices W of the form
\[
W = \begin{bmatrix} 0_{(p+1)\times(p+1)} & 0_{(p+1)\times K} \\ 0_{K\times(p+1)} & \Sigma^{-1} \end{bmatrix},
\]
where Σ is a positive definite matrix and 0_{m×l} is an m × l matrix of zeros. This type of matrix W penalizes only the coefficients of the spline basis functions (x − κ_k)_+^p and will be used in the remainder of the paper. A standard choice is Σ = I_K, but other matrices can be used according to the specific application. If criterion (22) is divided by σ_ε² one obtains
\[
\frac{1}{\sigma_\epsilon^2}\,\|Y - X\beta - Zb\|^2 + \frac{1}{\lambda\sigma_\epsilon^2}\, b^T \Sigma^{-1} b. \tag{23}
\]

Define σ_b² = λσ_ε², consider the vector β as unknown fixed parameters, and treat the vector b as a set of random parameters with E(b) = 0 and cov(b) = σ_b² Σ. If (b^T, ε^T)^T is a normal random vector and b and ε are independent, then one obtains an “equivalent” model representation of the penalized spline in the form of an LMM (Brumback, Ruppert, and Wand, 1999; Ruppert, Wand, and Carroll, 2003):
\[
Y = X\beta + Zb + \epsilon, \qquad \mathrm{cov}\begin{bmatrix} b \\ \epsilon \end{bmatrix} = \begin{bmatrix} \sigma_b^2\Sigma & 0 \\ 0 & \sigma_\epsilon^2 I_n \end{bmatrix}. \tag{24}
\]

More specifically, the P-spline model is equivalent to the LMM in the following sense. Given a fixed value of λ = σ_b²/σ_ε², the P-spline is equal to the BLUP of the regression function in the LMM. The P-spline model and the LMM may differ in how λ is estimated. In the LMM it would be estimated by ML or REML. In the P-spline model, λ could be determined by cross-validation, generalized cross-validation, or some other method for selecting a smoothing parameter. However, using ML or REML to select the smoothing parameter is an effective method and we will use it. There is naturally some concern about modeling a regression function by assuming that (b^T, ε^T)^T is a normal random vector and cov(b) = σ_b² Σ. However, under the null hypotheses of interest, b = 0, so this assumption does hold with σ_b² = 0. Therefore, this concern is not relevant to the problem of testing the null hypothesis of a parametric model. One can also view the LMM interpretation of a P-spline model as a hierarchical Bayesian model and the assumption about (b^T, ε^T)^T as part of the prior. This is analogous to the Bayesian interpretation of smoothing splines pioneered by Wahba (1978, 1990). In this context, testing for a polynomial fit against a general alternative described by a P-spline is equivalent to testing
\[
H_0: \lambda = 0 \;(\sigma_b^2 = 0) \quad \text{vs.} \quad H_A: \lambda > 0 \;(\sigma_b^2 > 0).
\]
Given the LMM representation of a P-spline model we can define LRTn and RLRTn for testing these hypotheses as described in Section 2. Because the b_i’s have mean zero, σ_b² = 0 under H_0 is equivalent to the condition that all coefficients b_i of the truncated power functions are identically zero. These coefficients account for departures from a polynomial. For the P-spline model, Wald’s variance component test mentioned in Section 5.4 would be the F-test for testing polynomial regression versus a regression spline viewed as a fixed effects model. The fit under the alternative would be ordinary least squares.
Because there would be no smoothing, it seems unlikely that this test would be satisfactory and, perhaps for this reason, it has not been studied, at least as far as we are aware.
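The design matrices of the LMM representation are straightforward to build. The sketch below is our own illustration under hypothetical names (valid for degree p ≥ 1; the p = 0 piecewise constant case needs an indicator function rather than a power): it constructs X and Z for the truncated power basis with knots at the k/(K + 1) sample quantiles.

```python
import numpy as np

def pspline_design(x, p=1, K=20):
    """Fixed-effects matrix X (polynomial of degree p) and random-effects
    matrix Z (truncated power functions, equation (21)) with K knots placed
    at the k/(K+1) sample quantiles of x.  Assumes p >= 1."""
    x = np.asarray(x, dtype=float)
    knots = np.quantile(x, np.arange(1, K + 1) / (K + 1))
    X = np.vander(x, p + 1, increasing=True)                 # columns 1, x, ..., x^p
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** p    # (x - kappa_k)_+^p
    return X, Z, knots

# n = 100 equally spaced points on [0, 1], linear spline (p = 1), K = 20 knots
X, Z, knots = pspline_design(np.linspace(0.0, 1.0, 100), p=1, K=20)
print(X.shape, Z.shape)  # (100, 2) (100, 20)
```

From these matrices, the eigenvalues μ_{s,n} and ξ_{s,n} of Z^T P_0 Z and Z^T Z used in the next subsection can be computed numerically.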

6.2 PROBABILITY MASS AT ZERO OF (R)LRT

In this section we compute the probability that the (R)LRTn is 0 when testing for a polynomial regression versus a general alternative modeled by penalized splines. We consider testing for a constant mean, p = 0, versus the alternative of a piecewise constant spline, and for a linear polynomial,

p = 1, versus the alternative of a linear spline. For illustration we analyze the case when the x’s are equally spaced on [0, 1] and K = 20 knots are used, but the same procedure can be applied to the more general case. Once the eigenvalues μ_{s,n} and ξ_{s,n} of the matrices Z^T P_0 Z and Z^T Z are calculated numerically, the probability of having a local maximum at zero for (R)LRTn is computed using equations (9) or (11). Results are reported in Tables 3 and 4. We also report, in parentheses, the estimated probabilities of having a global maximum at zero. As for one-way ANOVA, we used 1 million simulations from the spectral form of the distributions described in equations (7) or (8). For RLRTn there is close agreement between the probability of a local and a global maximum at zero for every value of the parameter λ0. For LRTn the two probabilities are very close when λ0 = 0, but when λ0 > 0 the probability of a local maximum at zero is much larger than the probability of a global maximum. This happens because the likelihood function can be decreasing in a neighborhood of zero but have a global maximum in the interior of the parameter space. The restricted likelihood function can exhibit the same behavior but does so less often. An important observation is that the LRTn has almost all its mass at zero, approximately 0.92 for p = 0 and approximately 0.99 for p = 1. This makes the construction of an LRT very difficult, if not impossible, especially when we test for linearity against a general alternative. Estimating λ to be zero with high probability when the true value is λ0 = 0 is a desirable property of the likelihood function. However, continuing to estimate λ to be zero with high probability when the true value is λ0 > 0 (e.g. 0.78 for n = 100, λ0 = 1 and p = 1) suggests that the power of the LRTn can be poor. The RLRTn has less mass at zero, approximately 0.65 for p = 0 and approximately 0.67 for p = 1, thus allowing the construction of tests.
Also, the probability of estimating a zero smoothing parameter when the true parameter λ0 > 0 is much smaller (note the different scales in Tables 3 and 4), indicating that the RLRT is probably more powerful than the LRT. In a simulation study, Crainiceanu, Ruppert, Claeskens and Wand (2003) showed that this is indeed the case. The columns corresponding to λ0 = 0 in Tables 3 and 4 show that the 0.5 approximation of the probability mass at zero for (R)LRTn is very poor for K = 20 knots, regardless of the number of observations. Using an analogy with the balanced one-way ANOVA case, one may be tempted to believe that increasing the number of knots K will improve the 0.5 approximation. To address this question we calculate the asymptotic probability mass at zero when the number of observations n tends to infinity and the number of knots K is fixed.

Table 3: Probability of having a local (global) maximum at λ = 0 for LRTn

                        p = 0                                 p = 1
n        λ0 = 0    λ0 = 0.1   λ0 = 1       λ0 = 0     λ0 = 1    λ0 = 10
50        0.953     0.430     0.114       > 0.999    > 0.999    0.9941
         (0.917)   (0.312)   (0.021)      (0.993)    (0.891)   (0.422)
100       0.954     0.271     0.0365      > 0.999     0.999     0.773
         (0.919)   (0.158)   (0.002)      (0.994)    (0.778)   (0.262)
200       0.954     0.146     0.008       > 0.999     0.972     0.545
         (0.921)   (0.060)   (0)          (0.994)    (0.623)   (0.143)
400       0.954     0.064     0.001       > 0.999     0.874     0.381
         (0.923)   (0.015)   (0)          (0.995)    (0.456)   (0.067)

Notes: The standard deviation of each of these estimated probabilities is at most 0.0005.

Consider the example of testing for a constant mean versus a general alternative modeled by a piecewise constant spline with equally spaced observations and K knots. In Appendix A1 we show that µs,n/n → µs and ξs,n/n → ξs, where µs and ξs are the eigenvalues of two K × K matrices. Using these results in equations (10) and (12), it follows that the asymptotic probability masses at zero for LRTn and RLRTn are

P( Σ_{s=1}^K µs ws² ≤ Σ_{s=1}^K ξs )   and   P( Σ_{s=1}^K µs ws² ≤ Σ_{s=1}^K µs ) ,    (25)

respectively, where the ws are independent N(0, 1) random variables. Figure 2 shows these probabilities, calculated using 1 million simulations of equation (25), for numbers of knots 1 ≤ K ≤ 100. Over a wide range of numbers of knots the probabilities are practically constant, ≈ 0.95 for ML and ≈ 0.65 for REML. Approximating the null probability of estimating λ as 0 by 0.5 is very inaccurate, since 0.5 is not even correct asymptotically.
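Using the limiting K × K matrices from Appendix A1 (uk = 1 − k/(K + 1), U with entries u_{max(l,k)}, and V = U − uu^T), the two probabilities in (25) can be approximated by a short Monte Carlo. This is a sketch, with the seed and simulation size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

def asymptotic_mass_at_zero(K, nsim=200_000):
    """Monte Carlo for (25): P(sum mu_s w_s^2 <= sum xi_s) for LRT (ML)
    and P(sum mu_s w_s^2 <= sum mu_s) for RLRT (REML)."""
    u = 1.0 - np.arange(1, K + 1) / (K + 1)      # u_k from Appendix A1
    U = np.minimum.outer(u, u)                   # u_{max(l,k)} = min(u_l, u_k), u decreasing
    V = U - np.outer(u, u)
    xi = np.linalg.eigvalsh(U)                   # limits of xi_{s,n}/n
    mu = np.linalg.eigvalsh(V)                   # limits of mu_{s,n}/n
    w2 = rng.chisquare(1, size=(nsim, K))        # w_s^2 with w_s iid N(0, 1)
    q = w2 @ mu                                  # sum_s mu_s w_s^2
    return (q <= xi.sum()).mean(), (q <= mu.sum()).mean()

p_ml, p_reml = asymptotic_mass_at_zero(K=20)
# per the text, these should be roughly 0.95 (ML) and 0.65 (REML)
```

The near-constancy over K reported in Figure 2 can be checked by looping this function over a grid of knot counts.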

6.3

PROBABILITY OF UNDERESTIMATING THE SMOOTHING PARAMETER

We now investigate the probability of underestimating the true value of the smoothing parameter λ0 using the ML or REML criteria. In the penalized spline context this is the probability of oversmoothing. The REML bias towards oversmoothing of the regression function estimates has been discussed before in the smoothing spline literature (e.g., Efron, 2001; Kauermann,

Table 4: Probability of having a local and global maximum at λ = 0 for RLRTn

                       p = 0                                            p = 1
  n      λ0 = 0         λ0 = 0.01       λ0 = 0.1         λ0 = 0           λ0 = 0.01        λ0 = 0.1
  50     0.653 (0.642)  0.474 (0.463)   0.151 (0.139)    0.671 (0.660)    0.667 (0.656)    0.625 (0.616)
  100    0.654 (0.646)  0.376 (0.366)   0.074 (0.063)    0.674 (0.663)    0.664 (0.655)    0.589 (0.579)
  200    0.656 (0.647)  0.266 (0.258)   0.028 (0.021)    0.675 (0.666)    0.655 (0.646)    0.529 (0.520)
  400    0.656 (0.648)  0.166 (0.155)   0.008 (0.005)    0.676 (0.667)    0.638 (0.629)    0.447 (0.438)

Notes: The finite sample probability of having a global maximum at zero (the probability mass at zero of RLRTn) is reported within parentheses. It represents the frequency of estimating λ = 0 for different true values λ0 in 1 million simulations from the spectral decomposition of RLRTn in equation (8). The standard deviation of each of these estimated probabilities is at most 0.0005.

2003). Based on first order conditions, we provide a simple and accurate approximation of the finite sample probability of oversmoothing for penalized splines. The exact finite sample probability can be obtained by recording the ARGMAX at each simulation step of the distributions described in equations (7) and (8). As an illustration, we use the same examples as in Section 6.2: a piecewise constant spline, p = 0, and a linear spline, p = 1, with equally spaced observations and K = 20 knots, even though the results hold more generally for unequally spaced observations and any number of knots. Table 5 shows the approximate probabilities pML(λ0) and pREML(λ0) of underestimating λ0 obtained using first order conditions, as described in Section 4. These values are obtained using 1 million simulations from the expressions on the right hand side of equations (13) and (14), where µs,n and ξs,n are the eigenvalues of the matrices Z^T P0 Z and Z^T Z, respectively. The exact probability of underestimating λ0 is reported between parentheses and is obtained using 1 million simulations from the distributions described in (7) and (8), respectively. Results are reported for n = 100 and n = 400 observations. Table 5 shows very close agreement between the approximate and exact probabilities of underestimating λ0 for all values of λ0, both for the ML and REML criteria. For both criteria, the probabilities decrease as the true value of the smoothing parameter increases, but they remain larger than 0.5 in all cases considered. Differences are more severe for small to moderate values

Table 5: Approximate and exact probability of underestimating the smoothing parameter λ0

               λ0 = 0.01       λ0 = 0.1        λ0 = 1          λ0 = 10         λ0 = 10^5
  p = 0:
  100 (ML)     0.741 (0.754)   0.606 (0.609)   0.530 (0.530)   0.516 (0.515)   0.513 (0.513)
  400 (ML)     0.662 (0.669)   0.567 (0.567)   0.539 (0.539)   0.536 (0.536)   0.535 (0.536)
  100 (REML)   0.616 (0.617)   0.563 (0.563)   0.531 (0.530)   0.529 (0.528)   0.528 (0.529)
  400 (REML)   0.586 (0.588)   0.549 (0.548)   0.539 (0.539)   0.539 (0.539)   0.539 (0.539)
  p = 1:
  100 (ML)     >0.999 (0.993)  0.955 (0.980)   0.829 (0.845)   0.728 (0.734)   0.501 (0.501)
  400 (ML)     0.992 (0.990)   0.878 (0.913)   0.772 (0.783)   0.690 (0.691)   0.533 (0.533)
  100 (REML)   0.674 (0.663)   0.669 (0.661)   0.637 (0.638)   0.595 (0.597)   0.528 (0.528)
  400 (REML)   0.673 (0.665)   0.656 (0.655)   0.613 (0.615)   0.581 (0.582)   0.539 (0.539)

Notes: Probabilities are reported for penalized splines with n = 100 or n = 400 equally spaced observations in [0, 1] and K = 20 equally spaced knots, for the ML and REML criteria. The exact probability of underestimating λ0 is reported between parentheses and is obtained using 1 million simulations from the distributions described in (7) and (8), respectively. The approximate probability of underestimating λ0 is obtained using equations (13) and (14). The standard deviation of each of these estimated probabilities is at most 0.0005.

of λ0. The approximate probabilities depend essentially on the weights cs,n(λ0) and ds,n(λ0), and converge to the probabilities described in equation (15). For example, for n = 100 and K = 20 we have Kµ = Kξ = 20 and we obtain

lim_{λ0→∞} pML(λ0) = F20,79(79/80) = 0.514   and   lim_{λ0→∞} pREML(λ0) = F20,79(1) = 0.528 .

These values agree with the simulation results presented in Table 5 for λ0 = 10^5.
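The two limits quoted above can be checked with any F-distribution CDF. A quick numerical sketch (not part of the authors' code), using the degrees of freedom 20 and 79 stated in the text:

```python
from scipy.stats import f

# Limits of the underestimation probabilities as lambda_0 -> infinity,
# for n = 100 and K = 20, i.e. the F distribution with (20, 79) df.
p_ml_limit = f.cdf(79 / 80, 20, 79)    # F_{20,79}(79/80), ML criterion
p_reml_limit = f.cdf(1.0, 20, 79)      # F_{20,79}(1), REML criterion
print(p_ml_limit, p_reml_limit)        # roughly 0.514 and 0.528
```

Both limits exceed 0.5 because the numerator chi-square, having fewer degrees of freedom, is the more right-skewed of the two mean squares.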

7

LONGITUDINAL MODELS

Modeling longitudinal data is one of the most common application areas of LMMs. LRTs for null hypotheses that include zero variance components are routinely used in this context. We focus our discussion on (R)LRTs for a zero random effects variance. Asymptotic theory for longitudinal data models was developed (Stram and Lee, 1994) under the assumption that the data can be partitioned into a large number of independent subvectors. One might argue, as one referee did, that for longitudinal data it is generally possible to partition the response vectors into independent clusters corresponding to subjects, and that the appropriate


asymptotics is obtained by allowing the number of subjects to increase rather than the number of observations per subject. However, in our view, this argument has several drawbacks. The object of interest is the finite sample distribution of the test statistic. If this is available, then it should be used. If an asymptotic distribution is used, then the type of asymptotics used should be that which gives the best approximation to the finite-sample distribution. In Section 7.1, for the random intercept longitudinal model, we show that if the number of subjects is less than 100, then the 50 : 50 mixture of chi-squared distributions is a poor approximation of the (R)LRT finite sample distribution, regardless of the number of observations per subject. For this example we would use the finite sample distribution of the (R)LRT statistic, as derived by Crainiceanu and Ruppert (2003). Moreover, for many commonly used semiparametric models there are random effects in the submodel for the population mean, so the data cannot be divided into independent subvectors and the results of Stram and Lee (1994) do not apply to this type of model. For example, suppose that subjects in M groups are observed and their response curves are recorded over time. The mean response of a subject can be decomposed as the corresponding group mean and the deviation of the subject mean from the group mean. The group mean can further be decomposed as the overall mean and the deviation of the group mean from the overall mean. In Section 7.2 we show that if the overall mean is modeled nonparametrically in this way, then the response vector cannot be partitioned into more than one independent subvector. Moreover, if the overall mean is modeled parametrically and the group deviations are modeled nonparametrically, then the response vector cannot be partitioned into more than M independent subvectors, where M rarely exceeds 5 in applications.

7.1

RANDOM INTERCEPT LONGITUDINAL MODEL

Consider the random intercept longitudinal model with K subjects and J(k) observations per subject k:

Ykj = β0 + Σ_{s=1}^S βs x^(s)_kj + bk + εkj ,   k = 1, . . . , K and j = 1, . . . , J(k) ,    (26)

where the εkj are i.i.d. N(0, σε²) random variables, the bk are i.i.d. intercepts distributed N(0, σb²) independently of the εkj, and we denote λ = σb²/σε². In model (26), β0 and βs, s = 1, . . . , S, are fixed unknown parameters and x^(s)_kj is the j-th observation for the k-th subject on the s-th covariate.

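The eigenvalue structure of model (26) described in this subsection is easy to verify numerically for the balanced case J(k) = J with the single covariate x_kj = j. The following sketch (with K and J chosen arbitrarily; not the authors' code) builds the design matrices and checks that µs,n = J for s = 1, . . . , K − 1, µK,n = 0 and ξs,n = J:

```python
import numpy as np

K, J = 5, 4                                   # arbitrary illustrative sizes
n = K * J
Z = np.kron(np.eye(K), np.ones((J, 1)))       # one column of 1's per subject
t = np.tile(np.arange(1, J + 1), K)           # x_kj = j (time)
X = np.column_stack([np.ones(n), t])          # fixed effects: intercept and time
P0 = np.eye(n) - X @ np.linalg.pinv(X)        # projection off the fixed effects
xi = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]
mu = np.sort(np.linalg.eigvalsh(Z.T @ P0 @ Z))[::-1]
assert np.allclose(xi, J)                     # xi_{s,n} = J for all s
assert np.allclose(mu[:-1], J)                # mu_{s,n} = J for s = 1, ..., K-1
assert abs(mu[-1]) < 1e-8                     # mu_{K,n} = 0
```

The same few lines, with Z and X built from irregular observation times, reproduce the general numerical computation described next.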

The matrix X of fixed effects is an n × (S + 1) matrix whose first column contains only 1's, with the last S columns containing the corresponding values of the x's. The matrix Z is an n × K matrix in which every column contains only zeros, with the exception of a J(k)-dimensional vector of 1's corresponding to the random intercept for the k-th subject. For this model Σ = IK, p = S + 1 and n = Σ_{k=1}^K J(k). The eigenvalues of the matrices Z^T P0 Z and Z^T Z can be computed numerically, and the finite sample distributions of the (R)LRTn statistics can be simulated using the results in Section 2. In particular, properties such as the probability mass at zero and the probability of underestimating the true value of λ0 can be obtained in seconds.

In some cases these eigenvalues can be calculated explicitly. Consider the case of an equal number of observations per subject, J(k) = J for k = 1, . . . , K, one covariate, S = 1, and x^(1)_kj = xkj = j. This example is routinely found in practice when K subjects are observed at equally spaced intervals in time and time itself is used as a covariate. It is easy to show that for this example µs,n = J for s = 1, . . . , K − 1, µK,n = 0, and ξs,n = J for s = 1, . . . , K. These eigenvalues are identical to the corresponding eigenvalues for the balanced one-way ANOVA model, and all distributional results discussed in Section 5 apply to this example. If the covariates are not equally spaced and/or there are unequal numbers of observations per subject, then the eigenvalues of Z^T P0 Z cannot be computed explicitly. In particular, the F distribution cannot be used to compute the probability mass at zero or the probability of underestimating the true signal-to-noise ratio, but our methodology still applies.

7.2

NONPARAMETRIC LONGITUDINAL MODELS

Nested families of curves in longitudinal data analysis can be modeled nonparametrically using penalized splines. An important advantage of low order smoothers, such as penalized splines, is the reduction in the dimensionality of the problem. Moreover, smoothing can be done using standard mixed model software, because splines can be viewed as BLUPs in mixed models. In this context, likelihood and restricted likelihood ratio statistics can be used to test for effects of interest. For example, the hypothesis that a curve is linear is equivalent to a variance component being zero. While computing the test statistics is straightforward using standard software, the null distribution theory is complex. Modified asymptotic theory for testing whether a parameter is on the boundary of the parameter space does not apply because the data are correlated

under the alternative (usually the full model). Brumback and Rice (1998) study an important class of models for longitudinal data. In this paper, we consider a subclass of those models. In particular, we assume that repeated observations are taken on each of I subjects divided into M groups. Suppose that yij is the j-th observation on the i-th subject, recorded at time tij, where 1 ≤ i ≤ I, 1 ≤ j ≤ J(i), and n = Σ_{i=1}^I J(i) is the total number of observations. Consider the nonparametric model

yij = f(tij) + f_g(i)(tij) + f_i(tij) + εij ,    (27)

where the εij are independent N(0, σε²) errors and g(i) denotes the group corresponding to the i-th subject. The population curve f(·), the deviation of group g(i) from the population curve, f_g(i)(·), and the deviation of the i-th subject's curve from its group's curve, f_i(·), are modeled nonparametrically. Models similar to (27) have been studied by many other authors, e.g., Wang (1998). We model the population, group and subject curves as penalized splines with a relatively small number of knots:

f(t)  = Σ_{s=0}^p βs t^s  + Σ_{k=1}^{K1} bk (t − κ1,k)^p_+ ,
fg(t) = Σ_{s=0}^p βgs t^s + Σ_{k=1}^{K2} ugk (t − κ2,k)^p_+ ,
fi(t) = Σ_{s=0}^p ais t^s + Σ_{k=1}^{K3} vik (t − κ3,k)^p_+ .
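A truncated power basis of this kind can be assembled in a few lines. The following is an illustrative sketch (knot locations are arbitrary, and p ≥ 1 is assumed so the truncated terms are genuine powers rather than indicators):

```python
import numpy as np

def trunc_power_basis(t, knots, p):
    """Design columns [1, t, ..., t^p, (t - kappa_1)_+^p, ..., (t - kappa_K)_+^p]
    for a degree-p truncated power spline (p >= 1 assumed)."""
    poly = np.vander(t, p + 1, increasing=True)              # 1, t, ..., t^p
    trunc = np.maximum(t[:, None] - knots[None, :], 0.0) ** p
    return np.hstack([poly, trunc])

t = np.linspace(0, 1, 11)
knots = np.array([0.25, 0.5, 0.75])       # hypothetical knot placement
B = trunc_power_basis(t, knots, p=1)
assert B.shape == (11, 2 + 3)
# at t = 1 the truncated lines equal 1 - kappa_k
assert np.allclose(B[-1], [1.0, 1.0, 0.75, 0.5, 0.25])
```

The polynomial columns supply the fixed-effect part of each curve, and the truncated-power columns supply the coefficients that are treated as random below.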

For the population curve, β0, . . . , βp will be treated as fixed effects and b1, . . . , bK1 will be treated as independent random coefficients distributed N(0, σb²). In LMMs, parameters are usually treated as random effects because they are subject-specific and the subjects have been sampled randomly. Here, b1, . . . , bK1 are treated as random effects for an entirely different reason: modeling them as random specifies a Bayesian prior and allows for shrinkage that assures smoothness of the fit. For the group curves, βgs, g = 1, . . . , M, s = 1, . . . , p, will be treated as fixed effects and ugk, k = 1, . . . , K2, will be treated as random coefficients distributed N(0, σg²). To make the model identifiable we impose restrictions on the βgs:

Σ_{g=1}^M βgs = 0   or   βMs = 0   for s = 1, . . . , p .

This model assumes that the groups are fixed, e.g., are determined by fixed treatments. If the groups were chosen randomly, then the model would be changed somewhat: the parameters βgs, g = 1, . . . , M, s = 1, . . . , p, would be treated as random effects, and the σg² might be assumed to be equal or to be i.i.d. from some distribution such as an inverse gamma.

For the subject curves, the ais, 1 ≤ i ≤ I, will be treated as independent random coefficients distributed N(0, σ²a,s), which is a typical random effects assumption since the subjects are sampled randomly. Moreover, the vik will be treated as independent random coefficients distributed N(0, σv²). With the additional assumption that the bk, ais, vik and εij are independent, the nonparametric model (27) can be rewritten as a LMM with p + M + 3 variance components:

Y = X β + XG βG + Zb b + ZG uG + Za a + Zv v + ε ,    (28)

where β = (β0, . . . , βp)^T, βg = (βg0, . . . , βgp)^T, βG = (β1^T, . . . , βM^T)^T, ug = (ug1, . . . , ugK2)^T, uG = (u1^T, . . . , uM^T)^T, as = (a1s, . . . , aIs)^T, a = (a0^T, . . . , ap^T)^T and v = (v11, . . . , vIK3)^T. The X and Z matrices are defined accordingly.

Inspecting the form (28) of model (27) reveals some interesting features. If σb² > 0, then the vector Y cannot be partitioned into more than one independent subvector, due to the term Zb b. If σb² = 0 but at least one σg² > 0, then Y cannot be partitioned into more than M independent subvectors, due to the term ZG uG. In other words, if the overall mean or one of the group means is modeled nonparametrically, then the assumptions needed to obtain Self and Liang type results fail to hold.

When longitudinal data are modeled semiparametrically in this way and there is more than one variance component, deriving the finite sample distribution of the (R)LRT for testing that certain variance components are zero is not straightforward. The distribution of the LRT and RLRT statistics will depend on those variance components, if any, that are not completely specified by the null hypothesis. Therefore, exact tests are generally not possible. We recommend that the parametric bootstrap be used to approximate the finite sample distributions of the test statistics. However, the bootstrap will be much more computationally intensive than the simulations we have been able to develop.

8

DISCUSSION

We considered estimation of the ratio between the random effects and error variances in a LMM with one variance component, and we examined the probability mass at zero of the (RE)ML estimator, as well as the probability of underestimating the true parameter λ0. We provided simple formulae for calculating approximations of these probabilities based on first order conditions, and showed that these approximations are very good, at least for balanced one-way ANOVA and penalized spline models.

One important application of our results is to provide the exact finite sample and asymptotic probability mass at zero of (R)LRTn when testing that the random effects variance is zero in a LMM with one variance component. Our results explain why asymptotic theory, such as that derived by Self and Liang (1987), should be used cautiously. In particular, even in the simplest example of the balanced one-way ANOVA model, which satisfies the Self and Liang independence assumptions, the 0.5 : 0.5 mixture approximation holds only when the number of levels goes to infinity, but fails when the number of levels is moderate (< 100), regardless of the number of observations per level. For penalized spline models the 0.5 : 0.5 mixture approximation does not hold, regardless of the number of knots K and the number of observations n. As we mentioned in the introduction, another model where the Self and Liang theory does not apply is regression with a stochastic trend, analyzed by Shephard and Harvey (1990) and Shephard (1993). In contrast, the exact finite sample distribution of (R)LRTn can be easily obtained as described by Crainiceanu and Ruppert (2003).

We also derived the probability of underestimating the true ratio λ0 of the random effects and error variances using the ML and REML criteria. We showed that first order conditions provide an excellent approximation of the exact probability. Both for one-way ANOVA and penalized spline models this probability is larger than 0.5, showing that (RE)ML underestimates the true parameter more often than it overestimates it.

Acknowledgement

We thank the Associate Editor and two referees for their helpful comments.

REFERENCES

Brumback, B., and Rice, J. A. (1998), "Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion)," Journal of the American Statistical Association, 93, 961–994.

Brumback, B., Ruppert, D., and Wand, M. P. (1999), "Comment on 'Variable selection and function estimation in additive nonparametric regression using data-based prior' by Shively, Kohn, and Wood," Journal of the American Statistical Association, 94, 794–797.

Crainiceanu, C. M., and Ruppert, D. (2003), "Likelihood ratio tests in linear mixed models with one variance component," Journal of the Royal Statistical Society, Series B, to appear.

Crainiceanu, C. M., Ruppert, D., Claeskens, G., and Wand, M. P. (2003), "Exact likelihood ratio tests for penalized splines," under revision, available at www.orie.cornell.edu/~davidr/papers.

Das, R., and Sinha, B. K. (1987), "Robust optimum invariant unbiased tests for variance components," in Proceedings of the Second International Tampere Conference in Statistics, T. Pukkila and S. Puntanen, eds., University of Tampere, Finland, pp. 317–342.

Davies, R. B. (1980), "The distribution of a linear combination of χ² random variables," Applied Statistics, 29, 323–333.

Efron, B. (2001), "Selection criteria for scatterplot smoothers," The Annals of Statistics, 29, 470–504.

Farebrother, R. W. (1990), "The distribution of a quadratic form in normal variables," Applied Statistics, 39, 294–309.

Gray, R. J. (1994), "Spline-based tests in survival analysis," Biometrics, 50, 640–652.

Harville, D. A. (1974), "Bayesian inference for variance components using only error contrasts," Biometrika, 61, 383–385.

Harville, D. A. (1977), "Maximum likelihood approaches to variance component estimation and to related problems," Journal of the American Statistical Association, 72, 320–338.

Hoffman, K., and Kunze, R. (1971), Linear Algebra, Prentice-Hall, Englewood Cliffs, New Jersey.

Kauermann, G. (2003), "A note on bandwidth selection for penalised spline smoothing," manuscript.

Khuri, A. I. (1994), "The probability of a negative linear combination of independent mean squares," Biometrical Journal, 36, 899–910.

Khuri, A. I., Mathew, T., and Sinha, B. K. (1998), Statistical Tests for Mixed Linear Models, Wiley, New York.

Patterson, H. D., and Thompson, R. (1971), "Recovery of inter-block information when block sizes are unequal," Biometrika, 58, 545–554.

Pinheiro, J. C., and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, Springer-Verlag, New York.

Ruppert, D. (2002), "Selecting the number of knots for penalized splines," Journal of Computational and Graphical Statistics, 11, 205–224.

Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, Cambridge University Press, Cambridge.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992), Variance Components, Wiley, New York.

Self, S. G., and Liang, K. Y. (1987), "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions," Journal of the American Statistical Association, 82, 605–610.

Self, S. G., and Liang, K. Y. (1995), "On the asymptotic behaviour of the pseudolikelihood ratio test statistic," Journal of the Royal Statistical Society, Series B, 58, 785–796.

Shephard, N. G., and Harvey, A. C. (1990), "On the probability of estimating a deterministic component in the local level model," Journal of Time Series Analysis, 4, 339–347.

Shephard, N. G. (1993), "Maximum likelihood estimation of regression models with stochastic trend components," Journal of the American Statistical Association, 88, 590–595.

Stram, D. O., and Lee, J. W. (1994), "Variance components testing in the longitudinal mixed effects model," Biometrics, 50, 1171–1177.

Wahba, G. (1978), "Improper priors, spline smoothing, and the problem of guarding against model errors in regression," Journal of the Royal Statistical Society, Series B, 40, 364–372.

Wahba, G. (1990), Spline Models for Observational Data, SIAM, Philadelphia.

Yu, H., Searle, S. R., and McCulloch, C. E. (1994), "Properties of maximum likelihood estimators of variance components in the one-way classification model, balanced data," Communications in Statistics – Simulation and Computation, 23, 897–914.

Appendix A1

Consider the case of a piecewise constant spline with K knots,

yi = β0 + Σ_{k=1}^K bk I(xi > κk) + εi ,

where I(·) is the indicator function. Denote by nk(n) the number of covariate observations larger than the k-th knot and assume that lim_{n→∞} nk(n)/n = uk. In the case of equally spaced observations in [0, 1] this assumption holds, with uk = 1 − k/(K + 1). The (i, k)-th element of the Z matrix is I(xi > κk), showing that the (l, k)-th element of the Z^T Z matrix is

u_{l,k}(n) = Σ_{i=1}^n I(xi > κl) I(xi > κk) = n_{max(l,k)}(n) .

Therefore u_{l,k}(n)/n → u_{max(l,k)} or, if we denote by U the K × K matrix with u_{max(l,k)} as the (l, k)-th entry,

lim_{n→∞} Z^T Z / n = U .

Similarly we obtain

lim_{n→∞} Z^T P0 Z / n = V ,

where the (l, k)-th entry of the matrix V is v_{l,k} = u_{max(l,k)} − ul uk. If we denote by µs and ξs the eigenvalues of the V and U matrices, it follows that

lim_{n→∞} µs,n / n = µs   and   lim_{n→∞} ξs,n / n = ξs .
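These limits are easy to check numerically for the equally spaced design. In the sketch below the grid and knots are chosen (hypothetically) so that each κk·n is an integer, which makes the finite-sample matrices match the limits up to rounding error:

```python
import numpy as np

n, K = 2000, 4
x = (np.arange(1, n + 1) - 0.5) / n              # equally spaced design on [0, 1]
knots = np.arange(1, K + 1) / (K + 1)            # kappa_k = k / (K + 1)
Z = (x[:, None] > knots[None, :]).astype(float)  # Z_ik = I(x_i > kappa_k)
c = Z.sum(axis=0)                                # column sums n_k(n)
ZtZ = Z.T @ Z
ZtP0Z = ZtZ - np.outer(c, c) / n                 # P0 projects off the constant mean
u = 1.0 - knots                                  # u_k = 1 - k/(K+1)
U = np.minimum.outer(u, u)                       # u_{max(l,k)} = min(u_l, u_k)
V = U - np.outer(u, u)
assert np.allclose(np.linalg.eigvalsh(ZtZ / n), np.linalg.eigvalsh(U))
assert np.allclose(np.linalg.eigvalsh(ZtP0Z / n), np.linalg.eigvalsh(V))
```

Because U has the min-type structure of a covariance matrix, both U and V are positive semidefinite, so the limiting eigenvalues µs and ξs are nonnegative as required.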

Figure 1: One-way ANOVA model: asymptotic probability of having a local maximum at λ = 0 when the true value of λ is 0. The number of levels is constant and the number of observations per level goes to infinity.

[Line plot: probability (0.5 to 0.9) against number of levels (0 to 100), with one curve for ML and one for REML.]

Figure 2: Nonparametric testing for a constant mean vs. a general alternative modeled by a piecewise constant spline: asymptotic probability of having a local maximum at λ = 0 when the true value of λ is 0. The number of knots is constant and the number of observations goes to infinity.

[Line plot: probability (0.5 to 1) against number of knots (0 to 100), with one curve for ML and one for REML.]