Testing equality of covariance matrices when data ... - Semantic Scholar

Report 8 Downloads 115 Views
Computational Statistics & Data Analysis 51 (2007) 4227 – 4239 www.elsevier.com/locate/csda

Testing equality of covariance matrices when data are incomplete Mortaza Jamshidiana,∗ , James R. Schottb a Department of Mathematics, California State University, 800 N. State College Blvd., Fullerton, CA 92834, USA b Department of Statistics, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA

Received 13 July 2005; received in revised form 5 May 2006; accepted 9 May 2006 Available online 6 June 2006

Abstract In the statistics literature, a number of procedures have been proposed for testing equality of several groups’ covariance matrices when data are complete, but this problem has not been considered for incomplete data in a general setting. This paper proposes statistical tests for equality of covariance matrices when data are missing. A Wald test (denoted by T1 ), a likelihood ratio test (LRT) (denoted by R), based on the assumption of normal populations are developed. It is well-known that for the complete data case the classic LRT and the Wald test constructed under the normality assumption perform poorly in instances when data are not from multivariate normal distributions. As expected, this is also the case for the incomplete data case and therefore has led us to construct a robust Wald test (denoted by T2 ) that performs well for both normal and non-normal data. A re-scaled LRT (denoted by R ∗ ) is also proposed. A simulation study is carried out to assess the performance of T1 , T2 , R, and R ∗ in terms of closeness of their observed significance level to the nominal significance level as well as the power of these tests. It is found that T2 performs very well for both normal and non-normal data in both small and large samples. In addition to its usual applications, we have discussed the application of the proposed tests in testing whether a set of data are missing completely at random (MCAR). © 2006 Elsevier B.V. All rights reserved. Keywords: Likelihood ratio test; Missing completely at random; Missing data; Re-scaled likelihood ratio test; Robust tests; Test of homogeneity of covariance matrices; Wald test

1. Introduction Testing equality of covariance matrices between several groups comes up in various contexts. These tests are of primary interest in applications of some commonly used multivariate procedures that have an underlying assumption of common covariance matrices. For example, Fisher’s linear discriminant function and Hotelling’s T 2 generalization of the univariate t test assume a common covariance matrix. Zhang and Boos (1992) list several other applications and references to applications in the areas of covariance structure models, education, and biometrics. The most commonly cited method for testing equality of covariance matrices is an asymptotically 2 -test, known as Bartlett’s test. This test is a modified likelihood ratio test (LRT), constructed under the assumption of multivariate normality and modified as to yield an unbiased test (see, for example, Muirhead, 1982, Section 8.2). It is well known that this test is sensitive to violation of the normality assumption, and so other more robust procedures have been proposed. Tiku and Balakrishnan (1985) developed a robust test for equality of two covariance matrices based on a robust test ∗ Corresponding author. Tel.: +1 714 278 2398; fax: +1 714 278 1431.

E-mail addresses: [email protected] (M. Jamshidian), [email protected] (J.R. Schott). 0167-9473/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2006.05.005

4228

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

for two mean vectors. O’Brien (1992) gives two multivariate generalizations of the univariate Levene’s test of equality of variances. Zhang and Boos (1992) present a pooled bootstrap procedure that replaces the 2 -approximation used in Bartlett’s test to make the test robust to deviation from normality. More recently, Schott (2001) has developed four Wald tests for equality of covariance matrices, denoted by T1 , T2 , T3 , and T4 . T1 is asymptotically equivalent to the LRT and is for normal populations. T2 is a statistic constructed under the assumption that the populations have elliptical distributions, and T3 and T4 are appropriate for more general populations. Aslam and Rocke (2005), which seems to be the most recent work, uses an S-estimate of the covariance matrix in the LRT to obtain a robust test of equality of covariance matrices. All the methods mentioned above assume that the data are completely observed, and they do not directly apply to problems with incomplete observations. In practice, it often happens that data are not completely observed. This paper aims to develop methods for testing equality of covariance matrices when data are missing completely at random (MCAR). Data are MCAR when the mechanism that leads to a missing datum is independent of the observed or the missing values in the data set (see e.g., Little and Rubin, 2002, Section 1.3). Based on our simulation studies, the new tests also work well for data that are missing at random (MAR). Data are MAR when the mechanism that leads to a missing datum is independent of the missing values. The tests that we develop have utility in various multivariate procedures, including the applications that we mentioned in the first paragraph. Tests of homogeneity of covariance matrices in the context of missing data, based on LRT and generalized least squares, have been proposed by Little (1988) and Kim and Bentler (2002). These tests, however, apply to a special problem where a group is defined by and is comprised of all the cases that have observations on the same variables (i.e. a group consists of all the cases with the same pattern of missing data). Then, equality of covariance matrices between various patterns of missing data are examined. These tests were mainly developed to test the assumption of MCAR among various patterns of missing data. If the mechanism of missingness for every case with a given pattern is the same, then these tests provide a means to test for MCAR. On the other hand, it is plausible that two different mechanisms lead to the same pattern of missingness for a pair of cases. In such cases, the Little (1988) and Kim and Bentler (2002) tests are not useful for testing MCAR. As we explain in Section 4, the tests that we develop here can be used for testing MCAR in a different setting as that of Little (1988) and Kim and Bentler (2002). The remaining sections of the paper are organized as follows: in Section 2, we develop two Wald tests; one works well under the assumption of normality and the other is a robust test designed under elliptical distribution theory. We also describe the LRT and a re-scaled LRT for the incomplete data in that section. The LRT and re-scaled LRT are in the same spirit as the tests proposed in structural equation models for testing nested covariance structures (see e.g., Arbuckle, 1996; Jamshidian and Bentler, 1999; and Yuan and Bentler, 2000) for a set of data, however, they are different from these in that they test equality of covariances (saturated model) across known groups in a data set. In Section 3, we report the results of a simulation study that compares the estimated significance levels and the power for the LRT, re-scaled LRT, and the Wald tests for both MCAR and MAR data. As in the complete case situation, the LRT and the Wald test that are constructed under the assumption of normality are very sensitive and liberal when data deviate from multivariate normality. While the re-scaled LRT improves upon LRT, it too is liberal, with it being more effective for larger sample sizes. On the other hand, the robust Wald test that we develop performs very well when data are from the multivariate normal as well as non-normal distributions. In Section 4 we give a summary and discussion of our methods. 2. Tests for equality of covariance matrices (l)

(l)

Let x1 , . . . , xn(l) , l = 1, . . . , m, denote n(l) independent observations for the lth group and from a p-variate distri(l)

bution with mean μ(l) and covariance matrix (l) . Each xj is assumed to have at least one observed component. Our aim here is to develop methods to test the null hypothesis H0 : (1) = · · · = (m) = 

(1)

versus the alternative hypothesis that H0 is not true, where  is an unknown common covariance matrix. We develop two Wald tests, a LRT, and a re-scaled LRT. The Wald tests are equivalent to Schott’s (2001) T1 and T2 tests when data are complete, but extend Schott’s tests to handle incomplete data cases. The re-scaled LRT is in the spirit of that developed by Yuan and Bentler (2000) in the context of covariance structure models.

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4229

2.1. A Wald test for multivariate normal populations For a p×p matrix A, let (A) denote the 21 p(p+1)×1 column vector that is obtained from the p 2 ×1 vector vec(A) by     eliminating all superdiagonal elements of A. Furthermore, let (1) =  (1) − (m) , . . . , (m−1) =  (m−1) − (m) , T  and construct the 21 p(p + 1)(m − 1) × 1 vector  = (1)T , . . . , (m−1)T . Using this notation, the hypothesis in (1) can be tested via H0 :  = 0

versus

Ha :   = 0.

  ˆ (l) be a consistent and asymptotically normal estimate of (l) for l=1, . . . , m.Accordingly let ˆ (l) = ˆ (l) − ˆ (m) Let  T  and ˆ = ˆ (1)T , . . . , ˆ (m−1)T . It is easily seen that the variance–covariance matrix of ˆ is given by ⎛ (1)  + (m)   ⎜ (m) Cov ˆ ≡  = ⎜ .. ⎝ .

(m)  + (m) .. .

(m)

where (l)

(m)



(m) (m) .. .

··· ··· .. .

(2)

⎟ ⎟, ⎠

(2)

· · · (m−1) + (m)   ˆ 1 denote the estimate of  ˆ (l) for l = 1, . . . , m. Let  denotes the variance–covariance matrix of  

ˆ (l) that is consistent under the assumption of normality. Then obtained by replacing each (l) in (2) by an estimate  1 the Wald statistic in this setting is given by ˆ ˆ −1 , T1 = ˆ T  1

(3)

and under H0 has a central 2 -distribution with 21 p(p + 1)(m − 1) degrees of freedom for its asymptotic distribution. ˆ (l) in the complete case is fairly clear, namely, we might either use the maximum To compute (3), the choice of  likelihood estimate of the population covariance matrix or its usual unbiased estimate. For the incomplete case, however, there are a number of choices. For example, Little and Rubin (2002, Section 3.4) give various available-case methods for estimating the covariance matrix from incomplete data (see Section 4 for a discussion of other possible choices). Because it facilitates construction of the robust Wald test that we will describe shortly, we have opted to use an available case method that utilizes all the available data for a variable when estimating its mean and variance, and utilizes all the cases with data available for the pair when estimating the covariance between a pair of variables. More specifically, 

(l)  (l) (l) t = 1, . . . , n(l) , Ai denote the subset of 1, 2, . . . , n(l) indicating the let xti denote the ith component of xt (l) (l) (l) (l) indexes of the vectors x1 , . . . , xn(l) for which the ith variable is observed, and ni be the number of elements in Ai . Then the mean and variance of the ith variable for the lth group are respectively estimated by (l)

x¯ i =



1 (l)

ni

(l)

(l) t∈Ai

 

1

(l)

xti

and sii =

(l)

ni − 1

(l) t∈Ai

(l)

 (l) 2

xti − x¯i

.

(4)

(l)

(l)

(l)

(l)

Let Aij denote the index of cases for which both the ith and the jth variables in x1 , . . . , xn(l) are observed, and nij (l)

be the number of elements of Aij . Then, we estimate the covariance between the ith and the jth variables in the lth group by (l)

sij =

1 (l) nij

−1

  (l) t∈Aij

(l)

(l)

xti − x˜ i



(l)

(l)

xtj − x˜ j



(l)

where x˜ j =

1



(l) nij (l) t∈Aij

(l)

xtj .

(5)

ˆ 1 when the covariance To complete all the ingredients needed to compute (3), we need to specify a formula for  matrix (l) is estimated by using (4) and (5), andthis involves finding estimates for the elements of (l) , l = 1, . . . , m.  (l) (l) (l) (l) Note that each element of (l) is of the form cov shi , sj k for some choice of h, i, j , and k. If shi and sj k are computed

4230

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

using (4) and (5), then it can be shown that    (l) (l) (l) (l) (l) (l)    n(l)  − 1   +   n n hij k hij k hj ik hk ij hij k (l) (l) (l) (l) (l) (l)    . cov shi , sj k = (l) (l) hj ik + hk ij + (l) (l) (l) (l) nhi nj k nhi nj k nhi − 1 nj k − 1

(6)

(l)

Here nhij k denotes the number of cases in which the hth, ith, jth, and kth variables are all observed. This formula holds for all choices of h, i, j , and k if we use the convention that eliminates duplicate values as subscripts of n(l) . For (l) (l) (l) (l) instance, nii is the same as ni , and nij ik is the same as nij k , that is, the number of cases in which the ith, jth, and kth variables are all observed. The construction of our Wald statistic only requires estimates of asymptotic covariances so the second of the two terms in (6) can be omitted, although one might expect that the use of the exact covariance would yield a procedure that performed better for small sample sizes. This was not the case so the simulations reported in Section 3 used the shortened formula. ˆ (l) are obtained by using (6) after replacing the (l) ’s by estimates, and then the  ˆ (l) ’s are used to The elements of  1 1 ij ˆ 1 . Since under the null hypothesis, (1) = · · · = (m) , we use the pooled estimator compute  ij

ˆ ij =

m n(l) − 1  ij l=1

nij − m

ij

(l)

sij ,

(7)

 (l) (l) ˆ (l) where nij = m l=1 nij , and sij is as defined in (4) and (5). Note that the 1 ’s will not necessarily be equal, as the number of observed cases may vary from group to group. ˆ 1 . Eq. (2) has a special structure and so does its inverse. One of the major costs in computing (3) is the inversion of  In particular, we find that ⎛

W1 − W1 W −1 W1 ⎜ −W2 W −1 W1 ˆ −1 = ⎜  .. 1 ⎝ . −Wm−1 W −1 W1

−W1 W −1 W2 W2 − W2 W −1 W2 .. . −Wm−1 W −1 W2

··· ··· .. .

−W1 W −1 Wm−1 W2 W −1 Wm−1 .. .

⎞ ⎟ ⎟, ⎠

· · · Wm−1 − Wm−1 W −1 Wm−1

 −1  ˆ (l) where Wl =  and W = m l=1 Wl (see James, 1954). Using this and some simple matrix algebra, (3) can be 1 written in the following equivalent form: T1 =

m−1 



T

ˆ (i) − ¯

Wi ˆ (i) =

i=1

m  

ˆ (i) − ¯ ∗

T

  Wi ˆ (i) − ¯ ∗ ,

(8)

i=1

where ¯ = W −1

m−1  j =1

(j ) Wj ˆ ,

¯ ∗ = W −1

m 

(j ) Wj ˆ .

j =1

Eq. (8) is more computationally efficient than direct implementation of (3), and we have used this form for our computations. Note that the final expression for T1 reveals that this statistic does not depend on which group is identified as the mth group. While we have encountered very few instances in our simulation study, it is theoretically possible that because of the ˆ (l) be non-positive definite. In these cases, way we estimate the (l) ’s, the corresponding estimated covariance matrix  1 ˆ (l) , zero out the non-positive eigenvalues, and invert the positive eigenvalues we write the spectral decomposition of  1 ˆ (l) by the positive semidefinite matrix and reconstruct a pseudo-inverse to replace Wl . This corresponds to replacing  1 closest to it (see Schwertman and Allen, 1979).

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4231

2.2. Elliptical populations with common kurtosis parameter In this section, we construct a Wald statistic under the assumption of elliptical distributions, which performs better than T1 outside the class of normal distributions and, based on our simulations, performs as well as T1 for the normal population case. A p × 1 random vector x has an elliptical distribution with mean vector μ and covariance matrix    T if its characteristic function has the form (t) = eit μ tT t , where is a function scaled so that  (0) = 1/2 (see Fang et al., 1990). If this elliptical distribution has finite fourth moments, then each fourth-order moment depends on a single kurtosis parameter ; in particular, we have      E (x − μ)(x − μ)T ⊗ (x − μ)(x − μ)T = (1 + ) Ip2 + Kpp ( ⊗ ) + vec()vec()T , (9) where Kpp is a commutation matrix (see, for example, Schott, 2005, Section 8.6). If x has a multivariate normal distribution, then = 0. We will assume that each of our m groups has an elliptical distribution with finite fourth moments and common kurtosis parameter . The Wald test here differs from the normal population case only in the estimate of (l) . Eq. (9) can be used to show that  cov

(l) (l) shi , sj k



(l) nhij k 

=

(l) (l)

nhi nj k

  (l) (l) (l) (l) (l) (l)

hi j k + (1 + ) hj ik + hk ij

   (l) (l) (l) (l) (l) (l) nhij k nhij k − 1 hj ik + hk ij    , + (l) (l) (l) (l) nhi nj k nhi − 1 nj k − 1

(10)

which of course reduces to formula (6) when = 0. As with (6), we omitted the second term of the formula in our simulations since it did not improve the performance of the test. In some situations (Muirhead and Waternaux, 1980; Tyler, 1983), LRT’s and Wald statistics constructed under normality can be adjusted by a simple scalar multiple so as to retain their asymptotic null distributions over the class of elliptical distributions. The fact that the first term in (10) is not a scalar multiple of the first term in (6) indicates that such an adjustment is not possible in this setting. The choice of the estimate of can have a significant impact on the significance levels for small samples. We adapt the estimate used by Schott (2001) for our incomplete case. Define ⎫ ⎧ ⎪ ⎪ ⎬ ⎨      1 (l) (l) (l) 4 (l) 2 zi = (l) , − 6 sii xti − x¯i ⎪ ni − 4 ⎪ ⎭ ⎩ (l) t∈Ai

and (l) wi

(l)

=

ni

(l)

ni − 1

(l)





 (l) 2 sii

(l)



zi

(l)

ni

 ,

(l)

where x¯i and sii are as defined in (4). Then an estimate of the kurtosis parameter for the lth group, (l) , is given by

ˆ (l) =

p (l) 1  zi − 1, (l) 3p i=1 wi

while the common kurtosis parameter is estimated by

ˆ =

m  l=1

ˆ (l) /m.

(11)

4232

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

(l) ˆ (l) ˆ (l) We obtain an estimator of (l) , which we denote as  2 , by using (7) and (11) in (10). Substituting 2 for  in ˆ (2) yields an estimator which we denote by 2 . Following the development of T1 , we find that T −1 ˆ 2 ˆ T2 = ˆ  m  m−1 T   T   = ˆ (i) − ˜ Vi ˆ (i) = ˆ (i) − ˜ ∗ Vi ˆ (i) − ˜ ∗ , i=1

i=1

has an asymptotic 2 -distribution with  V = m l=1 Vl , ˜ = V −1

m−1 

(j )

Vj ˆ

and

1 2 p(p

˜ ∗ = V −1

j =1

 (l) −1 ˆ2 + 1)(m − 1) degrees of freedom under H0 , where Vl =  ,

m 

(j )

Vj ˆ

.

j =1

As with T1 , T2 does not depend on which group is taken as the mth group. 2.3. The LRT for normal populations Let x1 , . . . , xn be a set of observations from a p-variate normal distribution with mean μ and covariance matrix . We allow the possibility that xi (i = 1, . . . , n) be partially observed, and we denote the observed and the missing part of xi by xobs,i and xmis,i , respectively. Then the contribution of each case xi to the observed data log-likelihood is           + xobs,i − μo T oo −1 xobs,i − μo , (12) i (μ, ) = − 21 pi log(2 ) + log oo i i i i where pi is the number of observed components of xi , and μoi and oo i are the subvector and submatrix of μ and  that correspond to the observed components of xi . Using (12), the observed log-likelihood is given by (μ, ) =

n 

i (μ, ).

(13)

i=1

The LRT requires the maximum value of the log-likelihood under the restriction imposed by the null hypothesis, say L0 , and the unrestricted maximum of the log-likelihood, say L1 . The null hypothesis requires a common covariance for all the groups. Thus to obtain L0 , we maximize (13) with x1 , . . . , xn consisting of all the cases observed for all m groups with a common . The unrestricted log-likelihood is obtained by allowing a different covariance matrix for each group. Thus to obtain L1 , we maximize (13) using the data for each of the m groups and then aggregating the m maximum values obtained. Eq. (13) can be maximized, for example, via the EM algorithm. The LRT is then given by R = −2 (L0 − L1 ) ∼ 2(m−1)(p+p(p+1)/2) . If the mean is assumed to be zero, then one would set μ = 0 in all of the above cases and the degrees of freedom for the 2 -test will be (m − 1)p(p + 1)/2. This is what we have used in our simulation studies of the next section. 2.4. The re-scaled LRT for nonnormal populations As in Yuan and Bentler (2000), it can be shown that R does not follow a 2 -distribution for non-normal data. We construct a re-scaled version of R that has an approximate 2 -distributions for non-normal data. Proofs of these assertions are not given here, because they parallel those given in Yuan and Bentler (2000). For simplicity, we consider the case of μ = 0. Let (i) denote the p∗ = 21 p(p + 1) × 1 column vector that is obtained   T T T be a p˜ × 1 vector where p˜ = m × p∗ , containing from (i) for each i = 1, . . . , m, and let  = (1) , . . . , (m)

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4233

all the parameters. Our re-scaled LRT tests the null hypothesis H0 : (i) − (1) = 0 for i = 2, . . . , m; (1) is chosen without loss of generality. Consider the p˜ × p˜ matrices n 1  j2 i () H () = − lim n→∞ n jjT i=1

n 1  ji () ji () and G() = lim , n→∞ n j jT i=1

  where i () is as defined in (12) with μ = 0. Note that if case i belongs to group j, then i () = i (j ) . Let ˆ be     n ˆ = G ˆ are, respectively, the ML estimate of  obtained by maximizing () = i=1 i (). Then Hˆ = H ˆ and G ˆ = Hˆ −1 G ˆ Hˆ −1 , the estimates of the Fisher information and the empirical covariance of the score vector. Consider  −1  ˙ T Hˆ ,where ˙ = j/j(1) . well-known sandwich type estimate of covariance of , ˆ and let V = Hˆ − Hˆ ˙ ˙ T Hˆ ˙ Then following Yuan and Bentler (2000), the re-scaled LRT in our context will be   ˆ R ∗ = [(m − 1)p(p + 1)/2]R/trace V which has an approximate 2 -distribution with (m − 1)p(p + 1)/2  It turnsout that the quantity  degrees  of freedom. ˆ ˆ ˆ ii , where Hˆ ii and G ˆ ii trace Hˆ ii−1 G trace V has a fairly simple form in our context, namely trace V = m i=2 ˆ when the latter matrices are partitioned into p ∗ × p ∗ matrices. are, respectively, the ith block-diagonal of Hˆ and G 3. Simulation results In this section, we present the results of our simulation studies that aim to compare the test statistics introduced. We start by making a comparison of the observed significance level to the nominal significance level for each test, and then we conclude the section with a comparison of the power of the tests. Tables 1 and 2 compare the estimated significance levels for the Wald test T1 (normal populations), the Wald test T2 (elliptical populations), the normal-theory LRT R, and the re-scaled LRT R ∗ . In each case, data for three groups with equal sample sizes per group of ni = 30, 50, and 100 were generated. We choose equal sample size per group only for simplicity. Our results apply to unequal sample sizes per group, and the simulation results can be extrapolated to those cases. We have considered cases with p = 3, 5, and 8. In some of our simulations we have omitted p = 8 to save time, but in those cases we have ensured that our conclusions extend naturally. Each of the tests is performed on each generated data set under the conditions that 0%, 10% and 30% of data are missing. Missing data were generated artificially and completely at random. In all of our comparisons, we use the nominal significance level of 5%. We have used three population covariance matrices in order to examine the effect of correlation between variables on the tests. Specifically, the ijth element of the common population covariance matrix is ij = |i−j | with = 0, 0.5 and 0.8 to cover a wide range of correlation between variables. The results for the tests T1 and T2 are based on 2000 simulations per case and those for R and R ∗ are based on 1000 simulations per case. The reason for using a smaller number of simulations for R and R ∗ is to save computation time, because computation of these requires iterative methods and is more intensive than that of T1 and T2 . Furthermore, based on a few experiments, we have come to the conclusion that the messages that our simulations convey does not change when we use 1000 simulations or a larger number of simulations. Table 1 shows the results of our simulations for data generated from the multivariate normal with mean 0 and the given covariance matrix  in the table. It is interesting to note that when data are complete, R performs inferior to both T1 and T2 , and this is even more so as the number of variables p increases. All four tests reject more liberally as the percentage of missing data increases from 0% to 30%. R is superior to R ∗ in small samples for this case, but their differences reduce as the sample size grows. In fact, we have compared R and R ∗ on samples as large as ni = 2000, and they were virtually the same in this case. As expected, because the data were generated from the multivariate normal distribution, overall T1 performs better than T2 but the differences are not striking. Generally, the difference between T1 and T2 decreases as the sample size increases. For the larger sample sizes of 50 and 100 both T1 and T2 ’s performances are satisfactory and this is also true about the smaller sample size of 30 when the percentage of missing and the number of variables are smaller. While both R and R ∗ improve as the sample size increases, their performance continues to be less satisfactory than that of T1 and T2 for the sample size 100, p = 8 and the highly correlated case with = 0.8.

4234

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

Table 1 Estimated significance levels, in percents, when the nominal significance level is 5% p=3 % Missing

0

p=5

p=8

10

30

0

10

30

0

10

30

(a) Sample size per group = 30 =0 R 5.90 10.3 R∗ 4.55 T1 6.25 T2

6.90 11.5 5.55 6.80

9.80 19.3 4.50 6.80

9.60 16.7 5.15 5.85

12.10 21.0 5.55 7.30

36.90 49.0 6.15 8.45

23.7 – 4.85 7.15

34.4 – 5.20 7.50

[99.2] – 7.45 11.30

= 0.5 R R∗ T1 T2

5.90 13.0 4.45 5.80

7.40 14.5 4.95 5.80

10.6 18.0 6.55 8.10

11.4 17.8 5.15 7.70

14.0 21.0 5.25 6.60

39.2 54.8 6.55 9.80

22.9 – 3.85 6.86

34.4 54.7 5.21 8.06

99.7 – 9.95 13.20

= 0.8 R R∗ T1 T2

8.10 11.3 4.10 5.95

9.20 12.8 4.05 6.10

8.70 19.2 5.95 6.85

9.30 17.8 4.80 7.65

10.6 24.2 6.20 7.40

45.90 56.5 [11.05] [12.45]

25.00 – 4.65 8.95

36.60 – 8.25 11.35

[99.67] – [23.50] [23.30]

(b) Sample size per group = 50 =0 R 7.00 9.4 R∗ 4.90 T1 5.40 T2

6.10 9.2 4.55 5.35

7.60 11.1 4.70 6.10

7.10 14.1 4.60 5.15

8.50 20.7 5.45 6.05

13.3 18.0 5.30 6.20

12.0 – 4.60 5.75

16.9 – 5.55 7.00

64.8 – 6.00 7.75

= 0.5 R R∗ T1 T2

7.00 8.50 4.80 6.40

7.00 9.20 5.70 7.15

7.40 10.6 4.95 6.25

8.30 12.0 4.65 5.50

9.50 13.2 4.65 5.60

13.30 21.9 4.50 5.65

12.0 – 4.25 6.20

17.3 27.4 4.75 6.10

74.9 – 7.60 7.50

= 0.8 R R∗ T1 T2

6.60 8.70 4.95 5.90

6.10 9.40 4.90 5.50

6.70 12.6 5.90 6.05

8.20 13.0 5.45 6.45

9.10 14.6 4.65 6.80

15.5 22.2 7.20 7.30

10.7 – 4.50 6.60

16.0 – 5.90 6.85

75.8 – [10.30] [11.85]

(c) Sample size per group = 100 =0 R 6.20 7.20 R∗ 5.20 T1 5.55 T2

5.70 7.40 5.55 6.15

5.30 7.50 5.40 6.15

4.80 8.00 5.05 5.15

6.10 8.00 4.65 4.85

7.30 11.8 5.15 5.75

8.40 – 6.20 6.50

9.60 – 5.65 5.85

16.1 – 6.00 7.00

= 0.5 R R∗ T1 T2

6.20 6.80 5.40 5.70

6.20 6.80 5.20 5.55

4.70 9.40 5.15 5.70

5.90 7.80 5.35 6.30

5.50 7.70 5.85 6.30

9.30 10.8 4.70 5.80

7.40 – 4.25 5.50

9.90 12.3 4.35 4.45

16.7 – 5.55 6.05

= 0.8 R R∗ T1 T2

6.10 5.60 5.25 5.40

6.30 6.10 5.05 5.55

6.90 7.30 6.25 6.10

5.80 7.80 5.00 5.95

6.40 9.00 4.45 5.00

10.30 12.2 6.55 6.45

7.50 – 4.95 5.55

9.50 – 4.20 4.85

18.80 – 6.00 6.60

Data were generated using the multivariate normal with mean 0 and covariance matrix ij = |i−j | . Data are MCAR. “–” indicate cases that are not computed.

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4235

Table 2 Estimated significance levels, in percent, when the nominal significance level is 0.05 p=3 % Missing

0

p=5

p=8

10

30

0

10

30

0

10

30

(a) Sample size per group = 30 =0 R 50.7 28.0 R∗ 45.8 T1 7.30 T2

50.7 29.0 44.4 7.10

49.9 38.2 41.1 7.40

77.1 52.6 66.9 4.65

77.3 61.4 64.4 4.65

84.6 78.2 60.5 6.10

95.7 – 84.7 2.05

96.6 – 83.9 2.00

100.0 – 79.4 4.20

= 0.5 T1 T2

48.6 7.30

48.8 7.05

41.8 8.40

67.5 5.30

64.7 5.60

61.5 6.50

84.7 3.55

83.9 3.55

82.1 6.05

= 0.8 R R∗

53.7 29.6

52.9 33.8

52.1 39.4

76.9 48.6

79.4 58.2

87.1 81.8

94.9 –

97.0 –

100.0 –

T1 T2

49.1 11.35

47.1 8.90

43.0 8.00

66.0 9.85

63.8 7.65

[62.3] [9.20]

83.3 7.91

82.8 8.06

[83.6] [11.15]

(b) Sample size per group = 50 =0 R 59.4 57.7 51.2 49.7 T1 6.20 7.20 T2

54.3 49.0 6.55

78.7 73.8 4.80

78.2 73.2 5.15

76.3 69.5 6.35

96.0 93.2 2.90

95.7 92.0 3.05

98.4 88.8 3.85

= 0.5 T1 T2

55.1 8.55

54.5 8.70

53.1 7.40

76.2 6.40

74.3 6.55

70.8 6.20

93.1 3.25

91.0 3.90

90.8 3.90

= 0.8 R T1 T2

59.8 56.8 10.10

57.7 53.3 9.30

57.4 51.4 8.95

81.7 73.6 9.50

80.8 72.0 7.25

80.1 72.4 6.65

97.2 91.9 9.25

96.5 89.8 6.25

99.7 [89.4] [8.00]

(c) Sample size per group = 100 =0 R 65.1 62.8 16.4 17.4 R∗ 63.3 62.0 T1 6.70 6.65 T2

61.4 21.0 59.5 7.30

85.8 27.8 83.8 6.30

85.1 29.0 83.0 6.60

82.6 38.6 80.5 7.05

97.4 – 96.8 4.40

97.0 – 96.6 4.20

97.8 – 96.0 4.55

= 0.5 T1 T2

63.1 8.00

63.4 7.95

61.0 8.15

83.5 7.90

84.6 7.10

81.2 6.50

97.2 6.00

96.9 5.25

96.1 5.70

= 0.8 R R∗ T1 T2

64.4 16.2 61.6 11.2

64.2 17.8 58.3 8.50

62.2 20.0 59.5 8.05

86.2 28.2 84.1 13.6

85.3 29.6 80.3 10.9

84.8 34.0 80.1 6.65

98.3 – 97.1 11.5

98.1 – 96.3 8.05

97.9 – 95.0 4.85

Data were generated from the p variate t distribution with 5 degrees of freedom and covariance matrix ij = |i−j | . Data are MCAR. “–” indicate cases that are not computed.

The data for Table 2 were generated as in Table 1, except each case was generated according to a multivariate t with 5 degrees of freedom. To save time, and because we felt it does not add additional information, we did not perform simulations of R or R ∗ for ij = 0.5|i−j | , and for the same reason we omitted simulation of R ∗ for ni = 50. It is clear that R, R ∗ , and T1 perform poorly in this case, rejecting in too many instances far beyond the 5% nominal level. R ∗ performs slightly better than R in every case here, but its observed significance levels are still much larger than the

4236

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

Table 3 Estimated significance levels, in percent, when the nominal significance level is 0.05 % Missing

T1

T2

T2 (t-dist)

R

R∗

Sample size per group = 30 0 10 30

4.76 7.14 11.09

6.18 9.55 12.97

5.44 10.28 12.46

10.70 12.60 32.40

17.80 23.80 51.30

Sample size per group = 50 0 10 30

4.25 5.75 6.80

6.40 6.85 8.70

5.80 9.10 11.60

6.70 7.30 13.50

10.10 12.80 22.00

Sample size per group = 100 0 10 30

4.10 3.95 6.75

5.00 5.75 8.05

6.80 9.95 11.05

6.00 6.60 8.50

8.10 8.60 11.50

Data for all columns, except the column labeled “T2 (t-dist)” were generated from the p =5-variate normal distribution with covariance ij =0.5|i−j | . Data for column “T2 (t-dist)” was generated based on t with 5 degrees of freedom. Data are MAR.

nominal 5% level. On the other hand, the performance of T2 is quite acceptable for all the cases with = 0 and 0.5. The convergence to the nominal 5% level is slower for the highly correlated case of = 0.8, but it is acceptable. To ensure that the observed significance levels for T2 in fact converge to the nominal significance level of 5%, we have done simulations with larger sample sizes. For example, the worse case in Table 2 for T2 is when ni = 100, p = 5, = 0.8, and no datum is missing with the observed significance level of 13.6%. When we increase the sample size to 5000, the observed significance level decreases to 8.7%. We believe that the slow convergence in this case is mainly due to the estimation of . We know that the true value of is 2 for this case, and when we used this true value in a set of simulations, we got estimated significance levels that are much closer to the nominal ones for smaller sample sizes, as compared to the cases where we use the estimate of computed based on Eq. (11). We note that the percentages inside the bracket in rows corresponding to the R in Tables 1 and 2 indicate cases where over 10% of the simulations failed due to insufficient number of observed data. Similarly those percentages in ˆ (l) s were encountered. the bracket in rows T1 and T2 indicate cases where over 10% of non-positive definite  As mentioned, data for Tables 1 and 2 were generated according to the MCAR missing data mechanism. To see how our tests behave with data that are MAR we duplicated the simulations in Tables 1 and 2 with data that were MAR. To save space, and because the message is consistent with the other cases not reported, we have only presented the cases with = 0.5 and p = 5 in Table 3. To generate MAR data, if for a given case the absolute value of the observed value for variable 1 is larger than a prescribed constant c, then we would impose missingness for the remaining variables of that case, each with probability 50%. Note that in this case, variable 1 is completely observed, and the value of c is chosen so that we approximately achieve the desired percentage of missingness, namely 10% and 30%. Table 3 shows the observed significance levels for T1 , T2 , R, and R ∗ for normally distributed data, and the column titled “T2 (t-dist)” contains the observed significance levels for T2 when data are t distributed. In small samples, the observed significance levels are somewhat more inflated for MAR data as compared to MCAR data. Interestingly, however, as the sample size increases, the results for both MCAR and MAR are very comparable. This trend is not only true for the cases that we have considered in Table 3, but it is also true in all the other cases reported in Tables 1 and 2. Table 4 shows the result of our simulations to assess the power of the tests T1 , T2 , and R. For these simulations, we have selected those scenarios from our previous simulations that led to reasonable observed significance levels. In particular, we only consider the power of T2 with data from the multivariate t with 5 degrees of freedom because R and T1 reject the null hypothesis too often in this case. For normally distributed data we compare all three tests. The first column of Table 4 indicates the type of distribution used to generate the data. We set the population covariance matrices for two of the three groups to the identity matrix and the covariance matrix for the third group to three different settings of ij = |i−j | , for = 0, 0.3, 0.5. In Table 4, the columns under = 0 correspond to the observed significance levels and should be compared to the nominal level of 5%. The table shows some expected patterns, namely the power of the tests increases as increases,

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4237

Table 4 Comparison of the power of R, T1 , and T2 when the nominal significance level is 5%

=0 % Missing

0

= 0.3 10

30

= 0.5

0

10

30

0

10

30

(a) Number of variables = 3 Sample size per group = 30 R (normal) (%) 7.30 5.10 T1 (normal) (%) 6.40 T2 (normal) (%) 6.50 T2 (t, 5 d.f.) (%)

6.20 5.60 6.30 7.00

9.40 5.30 7.30 6.80

23.8 17.7 18.5 13.9

20.4 13.9 16.6 12.4

17.9 10.6 13.7 9.6

61.8 51.5 49.5 29.3

53.3 42.0 40.3 24.6

38.3 25.2 27.7 19.1

Sample size per group = 50 R (normal) (%) 6.70 4.80 T1 (normal) (%) 5.70 T2 (normal) (%) 6.50 T2 (t, 5 d.f.) (%)

7.50 5.50 5.90 7.20

7.50 4.90 7.00 8.30

31.2 28.0 28.2 18.9

26.1 22.4 23.9 16.3

18.3 15.4 16.1 13.2

88.4 80.7 81.9 49.7

79.9 67.2 69.1 39.5

53.8 42.3 41.7 25.5

Sample size per group = 100 R (normal) (%) 6.00 5.20 T1 (normal) (%) 5.30 T2 (normal) (%) 5.90 T2 (t, 5 d.f.) (%)

5.20 4.80 5.50 6.30

5.50 4.70 5.90 6.40

63.8 59.5 61.2 34.8

52.8 48.1 49.6 29.2

33.3 27.5 28.6 20.5

100 99.7 99.7 81.6

98.5 97.0 97.6 70.9

87.8 78.7 79.7 47.5

(b) Number of variables = 5 Sample size per group = 30 R (normal) (%) 12.6 5.00 T1 (normal) (%) 6.70 T2 (normal) (%) 5.40 T2 (t, 5 d.f.) (%)

12.2 4.40 6.70 6.00

35.6 6.70 8.90 7.00

32.7 18.9 22.8 12.0

31.4 16.6 21.3 10.7

46.3 12.0 15.9 9.50

84.5 66.4 65.6 31.2

75.5 54.5 54.8 26.8

70.8 32.8 36.0 17.3

Sample size per group = 50 R (normal) (%) 8.60 5.60 T1 (normal) (%) 6.50 T2 (normal) (%) 5.80 T2 (t, 5 d.f.) (%)

9.60 5.30 6.50 6.30

12.9 5.50 6.90 6.80

45.7 38.4 38.0 20.7

37.7 27.9 29.1 17.1

33.3 17.8 20.2 13.0

97.6 95.2 93.0 59.5

93.5 86.0 83.9 50.1

79.7 56.5 57.9 30.9

Sample size per group = 100 R (normal) (%) 5.80 4.00 T1 (normal) (%) 5.80 T2 (normal) (%) 5.30 T2 (t, 5 d.f.) (%)

6.80 4.30 5.10 5.50

7.20 5.70 5.70 6.70

82.3 79.9 78.4 42.2

70.1 66.2 65.0 34.5

48.2 38.4 39.9 21.2

100 100.0 99.9 86.4

98.0 94.3 94.3 62.6

100 100 100 92

sample size increases, or percentage of missing data decreases. The power of the tests for p = 5 in each case is larger than its p = 3 counterpart, although this power comparison is meaningful mainly when the observed significance levels are close for both cases. For example, for R, when ni = 30 and 30% of the data are missing, the observed significance level is 9.40% and the p = 3 case, as compared to the 35.6% for the p = 5 case, thus we would expect a higher rate of rejection (power) for the p = 5 case as compared to the p = 3 case. This, however, does not make the test necessarily a better one for the p = 5 case, as the observed significance levels are not close. Finally, the power of the tests T1 and T2 are fairly close when data are from a normal population and the power of T2 is lower for the multivariate t population as compared to multivariate normal population. 4. Summary and discussion This paper develops and discusses four statistics for testing equality of covariance matrices of several groups of multivariate populations when observed data are incomplete. Two Wald tests T1 and T2 , based on the assumptions of normally and elliptically distributed populations, are developed and the normal theory LRT, called R, and a re-scaled version of it are also discussed. It is well known that the normal theory based tests for this problem are sensitive to

4238

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

deviation from the normality assumption. Our simulations in Section 3 confirms the same phenomenon as T1 and R are too liberal in rejecting H0 when data comes from the t distribution with 5 degrees of freedom. R ∗ performs somewhat better than R, but overall its performance is inferior to T2 . The test T2 , based on our simulations, behaves very well for testing H0 when data come from normal or non-normal data, and thus we recommend this test. As we mentioned, both T1 and T2 are equivalent to the tests proposed by Schott (2001), and so it is no surprise that our simulation results for the complete case agree with those of Schott (2001). It should be pointed out that although the formulas given by Schott (2001) for T1 and T2 had the degrees of freedom as the multiplier, the simulations reported in that paper used T1 and T2 with the total sample size as the multiplier and these versions are equivalent to the T1 and T2 used in this paper. The latest published robust test of equality of covariances was given by Aslam and Rocke (2005), and we were curious as to how the test T2 compares to their test. We replicated the simulation study that they report in their Table 3 (the cases with = 0.05) using the T2 test. Based on this simulation T2 performs better than the Aslam and Rocke (2005) test in that the estimated significance levels for T2 are closer to the nominal 5% level, with T2 being specially superior for smaller sample sizes and larger p’s. Moreover, T2 is simpler to compute than the Aslam and Rocke (2005) test. A more detailed comparison of T2 with the Aslam and Rocke (2005) test, however, is warranted to make a fair comparison of the two tests. In the development of our Wald tests, we have used an available case method to compute the covariances from incomplete data. The main reason for this choice was that we were able to get closed form formulas for the estimates of (l) and (l) . This in turn allowed us to develop the robust test T2 . Other choices exists for estimating (l) and (l) . A choice that stands out is the normal-theory maximum likelihood estimate of (l) computed from incomplete data, say by using the EM algorithm (see e.g., Little and Rubin, 2002, Section 11.2). In this case (l) can, for example, be estimated by the observed information matrix (see, Jamshidian and Jennrich, 2000). While we have not experimented with this method, we think that this method will also be sensitive to non-normal data and because it involves iterative processes its modification to make it robust does not seem straightforward to us. As mentioned in the Introduction, one may use the tests that we have developed here to assess whether data are MCAR. If data are MCAR, then every subset of the cases should be a random sample from the same population. Thus, one may test for MCAR by dividing incomplete data into several groups and test for equality of covariance matrices. Rejection of equality of covariances may be interpreted as data not MCAR. The question of how the data should be partitioned into groups and how such a method might work in practice is an interesting subject for a future study. We would like to note that the Little (1988) and Kim and Bentler (2002) tests of MCAR are applicable in a different setting as the one we just described. Our tests allow for testing MCAR when two different missing data mechanisms lead to the same pattern of missing for two subjects in that two subjects with the same pattern of missing data can be placed in two different groups. On the other hand, Little (1988) and Kim and Bentler’s (2002) tests require that all subjects with the same pattern of missing be in the same group. The tests that we have proposed here cannot handle this situation, as they require sufficient amount of observed data in each group and for each variable so that the covariance matrices (l) and (l) can be estimated. For the tests T1 and T2 , we need to have at least two cases observed for each pair, although ˆ (l) . We only observed this problem in a few cases of this may not be sufficient to produce a positive definite ˆ (l) or  our simulations when the number of cases was small (ni = 30), the percentage of missing data was large (30%), the number of variables was large (p = 8) and variables were highly correlated. A similar problem arises for the LRT in that if there is not sufficient amount of observed data the EM algorithm’s E-step runs into problems. Acknowledgments We would like to thank the associate editor and the referees for their helpful comments. Mortaza Jamshidian’s research has been supported in part by the National Science Foundation Grant DMS-0437258. We would like to also thank Mojgan Khatoonabadi, an undergraduate student at CSUF, for her assistance in organizing the results of the simulation studies. References Arbuckle, J.L., 1996. Full information estimation in the presence of incomplete data. In: Marcoulides, G.A., Schumacker, R.E. (Eds.), Advanced Structural Equation Modeling: Issues and Techniques. Lawrence Erlbaum Associates, Mahwah, NJ, pp. 243–278. Aslam, S., Rocke, D.M., 2005. A robust testing procedure for equality of covariance matrices. Comput. Statist. Data Anal. 49, 863–874.

M. Jamshidian, J.R. Schott / Computational Statistics & Data Analysis 51 (2007) 4227 – 4239

4239

Fang, K.T., Kotz, S., Ng, K.W., 1990. Symmetric Multivariate and Related Distributions. Chapman & Hall, New York. James, G.S., 1954. Tests of linear hypotheses in univariate and multivariate analysis when the ratios of the population variances are unknown. Biometrika 41, 19–43. Jamshidian, M., Bentler, P.M., 1999. ML estimation of mean and covariance structures with missing data using complete data routines. J. Educational Behav. Statist. 24, 21–41. Jamshidian, M., Jennrich, R.I., 2000. Standard errors for EM estimation. J. Roy. Statist. Soc., Ser. B. 62, 257–270. Kim, K.H., Bentler, P.M., 2002. Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika 67, 609–624. Little, R.J.A., 1988. A test of missing completely at random for multivariate data with missing values. J. Amer. Statist. Assoc. 83, 1198–1202. Little, R.J.A., Rubin, D.B., 2002. Statistical Analysis With Missing Data. second ed. Wiley, New York. Muirhead, R.J., 1982. Aspects of Multivariate Statistical Theory. Wiley, New York. Muirhead, R.J., Waternaux, C.M., 1980. Asymptotic distributions in canonical correlation analysis and other multivariate procedures for nonnormal populations. Biometrika 67, 31–43. O’Brien, P.C., 1992. Robust procedures for testing equality of covariance matrices. Biometrics 48, 819–827. Schott, J.R., 2001. Some tests for the equality of covariance matrices. J. Statist. Plann. Inference 94, 25–36. Schott, J.R., 2005. Matrix Analysis for Statistics. second ed. Wiley, New York. Schwertman, N.C., Allen, D.M., 1979. Smoothing an indefinite variance-covariance matrix. J. Statist. Comput. Simul. 9, 183–194. Tiku, M.L., Balakrishnan, N., 1985. Testing equality of variance-covariance matrices the robust way. Commun. Statist. A 14, 3033–3051. Tyler, D.E., 1983. Robustness and efficiency properties of scatter matrices. Biometrika 79, 411–420. Yuan, K.H., Bentler, P.M., 2000. Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociol. Methodol. 2000 30, 165–200. Zhang, J., Boos, D.D., 1992. Bootstrap critical values for testing homogeneity of covariance matrices. J. Amer. Statist. Assoc. 87, 425–429.