IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 60, NO. 11, NOVEMBER 2012
Shrinkage-to-Tapering Estimation of Large Covariance Matrices

Xiaohui Chen, Z. Jane Wang, Senior Member, IEEE, and Martin J. McKeown
Abstract—In this paper, we introduce a shrinkage-to-tapering approach for estimating large covariance matrices when the number of samples $n$ is substantially smaller than the number of variables $p$ (i.e., $n \ll p$). The proposed estimator improves upon both shrinkage and tapering estimators by shrinking the sample covariance matrix toward its tapered version. We first show that, under both normalized Frobenius and spectral risks, the minimum mean-squared error (MMSE) shrinkage-to-identity estimator is inconsistent and is outperformed by a minimax tapering estimator for a class of high-dimensional and diagonally dominant covariance matrices. Motivated by this observation, we propose a shrinkage-to-tapering oracle (STO) estimator for efficient estimation of general, large covariance matrices. A closed-form formula for the optimal coefficient of the proposed STO estimator is derived under the minimum Frobenius risk. Since the true covariance matrix is the quantity to be estimated, we further propose an STO approximating (STOA) algorithm with a data-driven bandwidth selection procedure that iteratively estimates the coefficient and the covariance matrix. We study the finite sample performance of the different estimators, and our simulation results clearly show the improved performance of the proposed STO estimators. Finally, the proposed STOA method is applied to a real breast cancer gene expression data set.

Index Terms—Large covariance estimation, minimax risk, minimum mean-squared errors, shrinkage estimator, tapering operator.
I. INTRODUCTION
Precise estimation of covariance matrices from multivariate data is of great importance in many research fields, such as array signal processing [1], [2], eigen-image analysis [3], and principal component analysis [4]. In this paper, we assume that $n$ independently and identically distributed (i.i.d.) samples $x_1, \dots, x_n$ drawn from a zero-mean $p$-dimensional multivariate Gaussian distribution $N(0, \Sigma)$ are collected,^1 where $\Sigma$ is the $p \times p$ covariance matrix.
Manuscript received August 05, 2011; revised December 29, 2011 and June 20, 2012; accepted July 10, 2012. Date of publication July 27, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Arie Yeredor. X. Chen and Z. J. Wang are with the Department of Electrical and Computer Engineering, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]; [email protected]). M. J. McKeown is with the Department of Medicine (Neurology), the University of British Columbia, Vancouver, BC V6T 2B5, Canada (e-mail: [email protected]). Digital Object Identifier 10.1109/TSP.2012.2210546

^1Without loss of generality (w.l.o.g.), we assume that the diagonal entries of $\Sigma$ are all normalized to one.
The standard and most natural estimator of $\Sigma$ is the unstructured sample covariance matrix

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T,$$

where $x_i$ denotes the $i$th $p$-dimensional observation sample. For the classical case where $p$ is fixed and $n \to \infty$, $\hat{\Sigma}$ is a consistent estimator of $\Sigma$. Unfortunately, the covariance estimation problem becomes fundamentally different and more challenging in high-dimensional settings with a small number of samples, where $p \gg n$ (i.e., large $p$ and small $n$) and the concentration $c = p/n$ does not vanish. From the eigen-structure perspective, random matrix theory predicts that the spectrum of $\hat{\Sigma}$ is wider than the spectrum of $\Sigma$ if $p/n \to c > 0$ [5]. For example, the Marchenko-Pastur theorem [6] states that, when $\Sigma = I_p$ and $p/n \to c \in (0, 1]$, the eigenvalues of $\hat{\Sigma}$ have a deterministic limiting distribution supported on $[(1-\sqrt{c})^2, (1+\sqrt{c})^2]$, while the spectrum of $\Sigma$ is the Dirac mass at 1. This paper considers such high-dimensional settings and focuses on the corresponding problem of estimating large covariance matrices.

Significant recent progress has been made in both theory and methodology for estimating large covariance matrices [7]–[16]. To mitigate the difficulties of estimating large covariance matrices, regularization has been widely employed. In this paper, we focus on two major categories of regularized estimators of large covariance matrices. We note, however, that there are other regularized estimation procedures that do not fall into these two categories, such as the sparse matrix transform (SMT) [3] and the graphical Lasso [17].

The first category comprises shrinkage-type estimators that shrink the sample covariance matrix toward some well-conditioned target matrix under different performance measures. For instance, Ledoit and Wolf (LW) [8] proposed a shrinkage estimator based on a convex combination of $\hat{\Sigma}$ and a scaled identity matrix, and provided a distribution-free procedure for estimating the optimal combination weight that minimizes the mean-squared error. Chen et al. [12] further extended this idea and improved the LW estimator through two strategies: one based on the Rao-Blackwellization idea of conditioning on the sufficient statistic, and the other approximating the oracle by an iterative algorithm. Closed-form expressions for both estimators were given in [12]. More recently, Fisher and Sun [13] proposed using a diagonal target with possibly different variances. These shrinkage estimators are amenable to general covariance matrices of "moderate dimensionality". Here, by moderate dimensionality, we mean that $p$ grows nicely as $n$ increases, e.g., with the ratio $p/n$ converging to some finite constant.
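As a quick illustration of this spectrum-spreading phenomenon, the following minimal NumPy sketch (our illustration, not part of the original paper; the sample sizes are arbitrary) compares the empirical eigenvalues of $\hat{\Sigma}$ with the Marchenko-Pastur support when the true covariance is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 100                        # concentration c = p/n = 0.5
X = rng.standard_normal((n, p))        # rows are i.i.d. N(0, I_p) samples

S = X.T @ X / n                        # sample covariance (mean known to be zero)
eigvals = np.linalg.eigvalsh(S)

c = p / n
mp_support = ((1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2)
print("empirical spectrum range:", eigvals.min(), eigvals.max())
print("Marchenko-Pastur support:", mp_support)
# Although every eigenvalue of the true covariance equals 1, the sample
# eigenvalues spread over roughly [0.09, 2.91] for c = 0.5.
```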
Regularized estimators of large covariance matrices in the second category operate directly on the entries of the covariance matrix through operators such as thresholding [15], banding [14], and tapering [16]. Banding and tapering estimators are suitable for estimating covariance matrices where a natural ordering exists among the variables. Banding simply sets the entries far away from the main diagonal to zero and keeps the entries within the band unchanged. Tapering is similar to banding, with the difference that it also gradually shrinks the off-diagonal entries within the band toward 0. We can view banding as a hard-thresholding rule and tapering as a soft-thresholding rule, up to a certain unknown permutation [18]. In contrast, thresholding can deal with general permutation-invariant covariance matrices and introduces sparsity without requiring additional structure. These estimators are statistically consistent if certain sparsity is assumed, even when the dimensionality grows at a sub-exponential rate in $n$, which allows much larger covariance matrices to be estimable. In fact, it is further known that tapering and thresholding estimators are minimax [16], [19], [20]. Rate-optimality under the operator norm does not hold for the banding estimator in [14], and it was shown that tapering is generally preferred to banding [16]. However, it is worth mentioning that, when the assumed sparsity condition is invalid, all of the above estimators in the second category become sub-optimal.

Despite recent progress on large covariance matrix estimation, there has been relatively little fundamental theoretical work comparing the shrinkage-category and tapering-category estimators. To fill this gap, in this paper we first study the risk properties of shrinkage estimators and provide a comparison of risk bounds between shrinkage and tapering estimators. Further, motivated by the observed advantages and disadvantages of shrinkage and tapering estimators in different situations, we propose a shrinkage-to-tapering estimator for general, high-dimensional covariance matrices that combines the strengths of both. The proposed estimator has the form of a general shrinkage estimator, with the crucial difference that the shrinkage target matrix is a tapered version of the sample covariance matrix $\hat{\Sigma}$. By adaptively combining $\hat{\Sigma}$ and a tapered $\hat{\Sigma}$, the proposed shrinkage-to-tapering estimator inherits optimality in the minimax sense when sparsity is present and in the minimum mean-squared error (MMSE) sense when sparsity is absent. Therefore, the proposed estimator enjoys the benefits of both shrinkage and tapering estimators.

The rest of the paper is structured as follows. In Section II, we first introduce the tapering estimator and its minimax risk bounds under the Frobenius and spectral norms; we then derive the risk bounds under the same norms for the MMSE shrinkage oracle estimator proposed in [12]. Inconsistency of the shrinkage estimator is shown through a set of examples. In Section III, we propose a shrinkage-to-tapering oracle (STO) estimator and derive a closed-form expression for the optimal shrinkage weight under MMSE. An approximating algorithm for the STO estimator is further proposed for practical implementation. Section IV compares the numerical performance of the proposed STO estimators, the tapering estimator, and other shrinkage estimators. Section V applies the proposed method to a real breast cancer gene expression data set. The paper is concluded in Section VI. All proofs are relegated to the Appendix.
Before proceeding, we fix the notation used throughout the paper. $\lambda_i(A)$ denotes the $i$-th largest eigenvalue of a matrix $A$. We write $\|A\|_F$ for the Frobenius norm, $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$, and $\|A\|_2$ for the spectral norm. We use $a_n \asymp b_n$ to denote asymptotic equivalence between sequences $a_n$ and $b_n$, i.e., there are constants $C_1, C_2 > 0$ such that $C_1 b_n \le a_n \le C_2 b_n$. $\mathrm{tr}(A)$ is the trace of $A$. $C$ denotes a generic constant that is independent of $n$ and $p$ and may vary from place to place.

II. COMPARISON BETWEEN TAPERING AND SHRINKAGE ESTIMATORS

The main contribution of this section is a detailed analysis of the risk bounds of tapering and shrinkage estimators of large covariance matrices. Before formally presenting our analysis, we first recall some definitions and results for tapering and shrinkage estimators of covariance matrices.

A. Tapering Estimator

We consider a class of tapering estimators. Let $\mathcal{S}_p$ be the set of $p \times p$ symmetric matrices and $A \circ B$ be the Schur (entrywise) product of two matrices $A$ and $B$: $(A \circ B)_{ij} = a_{ij} b_{ij}$.

Definition II.1: A covariance matrix taper (CMT) is an element $W = (w_{ij})$ of $\mathcal{S}_p$ such that $0 \le w_{ij} \le 1$ for all $i, j$. In other words, Schur multiplication by any CMT decreases the average eigenvalue. Let $W$ be a CMT; a tapering estimator of the covariance matrix is defined as

$$\hat{\Sigma}_W = W \circ \hat{\Sigma}. \qquad (1)$$

For the class of covariance matrices, we consider the following set studied in [16]:

$$\mathcal{F}_\alpha = \Big\{ \Sigma : \max_j \sum_i \big\{ |\sigma_{ij}| : |i-j| > k \big\} \le M k^{-\alpha} \ \text{for all } k, \ \lambda_{\max}(\Sigma) \le M_0 \Big\}, \qquad (2)$$

where $\alpha > 0$ is a smoothing parameter specifying the rate of decay from the main diagonal, and $M, M_0 > 0$ are constants. We say that the matrices in $\mathcal{F}_\alpha$ are diagonally dominant. Note that our definition differs from the usual one in the literature: we use the term diagonally dominant as a measure of sparsity of covariance matrices when a natural ordering of the variables exists, e.g., in time-series models.

The remarkable theorems proved by Cai, Zhang, and Zhou [16, Theorems 1 and 4] show that a covariance tapering estimator based on data generated i.i.d. from $N(0, \Sigma)$ with $\Sigma \in \mathcal{F}_\alpha$ is minimax, and explicit convergence rates for the spectral and Frobenius risks are given there. It is then very interesting and important to ask how to construct CMTs that actually attain the minimax risks. Fortunately, it turns out that there exists such a CMT, with different bandwidths (to be defined shortly), that is rate-optimal for each of the two risks in the minimax sense. More specifically, let $W = (w_{ij})$ be defined as

$$w_{ij} = \begin{cases} 1, & |i-j| \le k_h, \\ 2 - \dfrac{|i-j|}{k_h}, & k_h < |i-j| < k, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$

where $k_h = k/2$. First, such a defined $W$ is a valid matrix taper according to Definition II.1, since $0 \le w_{ij} \le 1$ for all $i, j$. Second, it is clear that $w_{ij} = 0$ whenever $|i-j| \ge k$, and hence every entry of $\hat{\Sigma}_W$ vanishes off the stripe that is at most $k$ away from the main diagonal; therefore, $k$ is defined as the bandwidth of the CMT. Third, it should be noted that the Schur product does not necessarily preserve the positive definiteness of $\hat{\Sigma}$. This concern can be mitigated by first diagonalizing $W \circ \hat{\Sigma}$ and then replacing its negative eigenvalues by zeros; this modification preserves the minimax error bounds (up to a multiplicative factor of 2) as well as the positive semi-definiteness of the resulting estimate [19]. Finally, the optimal procedures are different under the Frobenius and spectral norms: the optimal bandwidths of $W$ under the normalized Frobenius and spectral norms are of orders $n^{1/(2(\alpha+1))}$ and $n^{1/(2\alpha+1)}$, respectively.
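For concreteness, the following short Python sketch (our illustration, not code from the paper; the function names are ours) builds the taper weights of (3) and applies them to a sample covariance matrix, including the optional eigenvalue truncation for positive semi-definiteness:

```python
import numpy as np

def taper_weights(p: int, k: int) -> np.ndarray:
    """Build the p x p CMT of (3) with bandwidth k (k_h = k/2)."""
    kh = k / 2.0
    dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    # 1 inside k_h, linear decay on (k_h, k), 0 beyond k
    return np.clip(2.0 - dist / kh, 0.0, 1.0)

def tapering_estimator(X: np.ndarray, k: int) -> np.ndarray:
    """Tapering estimator (1): Schur product of the CMT with the sample covariance."""
    n, p = X.shape
    S = X.T @ X / n                      # zero-mean sample covariance
    T = taper_weights(p, k) * S
    # Fix for positive semi-definiteness: zero out negative eigenvalues.
    vals, vecs = np.linalg.eigh(T)
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T
```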
B. Shrinkage Estimator

Since the discovery of Stein's effect on the inadmissibility of the usual sample mean estimator of a multivariate normal mean vector when $p \ge 3$ (see [21] for the original reference), extensive research has been devoted to a broad range of shrinkage estimators, such as the James-Stein estimator [22] and its truncated version (see, e.g., [23, p. 68]), among many others [24], that improve upon the usual estimator in terms of risks induced by a variety of loss functions. As in the estimation of the mean vector, the sample covariance estimator $\hat{\Sigma}$, as we have mentioned, is unsatisfactory for large (high-dimensional) covariance estimation problems. Steinian shrinkage has therefore been an attractive alternative. Estimators of this kind naturally have the form

$$\hat{\Sigma}_\rho = \rho T + (1 - \rho) \hat{\Sigma}, \qquad (4)$$

where $\rho \in [0, 1]$ is the shrinkage coefficient and $T$ is the shrinkage target matrix. In general, $T$ is supposed to have the properties of being: (i) well-conditioned; and (ii) consistent, or even optimal, in a subspace of symmetric matrices [13]. In other words, the shrinkage estimator is a convex combination of the sample covariance matrix and a "good" target matrix. There are several possible and intuitive choices of $T$. For instance, taking $T = \hat{\mu} I$ with $\hat{\mu} = \mathrm{tr}(\hat{\Sigma})/p$, the shrinkage estimator has the form

$$\hat{\Sigma}_\rho = \rho \hat{\mu} I + (1 - \rho) \hat{\Sigma}. \qquad (5)$$

Chen et al. [12] define an MMSE oracle estimator in which $\rho^*$ is the solution of the optimization problem

$$\rho^* = \arg\min_\rho \ E \, \| \hat{\Sigma}_\rho - \Sigma \|_F^2.$$

The MMSE oracle estimator seeks the best convex combination of the sample covariance matrix and a scaled identity matrix to approximate the true covariance matrix in terms of mean-squared error (MSE). This estimator is said to be an oracle because the optimal solution depends on $\Sigma$, which is unknown in practice and is the estimation goal. It is shown in [8] that $\rho^*$ can be given by a distribution-free formula (6). Under the additional Gaussian assumption, a closed form of $\rho^*$ is given in [12] by (7), in terms of the sphericity statistic $\gamma$ of (8), which measures the distribution of the off-diagonal entries of $\Sigma$ [8]. In particular, $1 \le \gamma \le p$, where the equalities are achieved if and only if $\Sigma$ is proportional to the identity matrix and $\Sigma$ is of rank one, for the first and second inequalities, respectively. So when $\gamma = p$, the matrix entries have the most spread-out support (dense), while when $\gamma = 1$, the energy of $\Sigma$ concentrates on the diagonal (sparse).

A second shrinkage estimator, proposed in [13], combines $\hat{\Sigma}$ and a diagonal target in the same manner as in (5). These two estimators share similar properties, since the optimal coefficient can be obtained in a single distribution-free framework. Therefore, in the rest of the paper, we focus on the identity target case in (5), which is easier and more expressive for our theoretical analysis.

First, we derive the Frobenius risk of the MMSE oracle estimator (5), assuming that the data are i.i.d. from $N(0, \Sigma)$.

Theorem II.1: Suppose $x_1, \dots, x_n$ are i.i.d. Gaussian $N(0, \Sigma)$. The Frobenius risk of the MMSE shrinkage oracle estimator (5) is given by (9).

From Theorem II.1, we can see that the Frobenius risk of the shrinkage oracle estimator primarily depends on the concentration $c = p/n$ and on the Frobenius norm and sphericity statistic of $\Sigma$. The second term in (9) contributes negligibly to the total risk when $\Sigma$ is bounded away from the identity matrix. Since it is difficult to derive an exact formula under the spectral norm, we instead derive a lower bound on the spectral risk.

Theorem II.2: Suppose $x_1, \dots, x_n$ are i.i.d. Gaussian $N(0, \Sigma)$. The spectral risk of the MMSE shrinkage oracle estimator (5) satisfies the lower bound (10).

Remark 1: We would like to mention that the lower bound (10) in Theorem II.2 is by no means sharp, and it therefore serves only as evidence of the inconsistency of the MMSE shrinkage-to-identity estimator. An illustrative example and related discussion can be found in Example II.1.

Theorems II.1 and II.2 are important in the sense that, by giving pointwise explicit risk bounds over the parameter space, they make it possible to analyze the theoretical properties, such as consistency and admissibility, of the MMSE shrinkage oracle estimator in (5).
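To make the shrinkage-to-identity construction in (4)–(5) concrete, here is a minimal Python sketch (ours, not the paper's code). It computes the estimator for a given weight, and an oracle weight obtained by minimizing the squared error for a single sample when the true covariance is known, as in a simulation; the population oracle of [12] replaces these inner products by expectations:

```python
import numpy as np

def shrink_to_identity(S: np.ndarray, rho: float) -> np.ndarray:
    """Shrinkage estimator (5): rho * mu * I + (1 - rho) * S."""
    p = S.shape[0]
    mu = np.trace(S) / p
    return rho * mu * np.eye(p) + (1.0 - rho) * S

def oracle_rho(S: np.ndarray, Sigma: np.ndarray) -> float:
    """Weight minimizing ||rho*mu*I + (1-rho)*S - Sigma||_F^2 for this sample."""
    p = S.shape[0]
    mu = np.trace(S) / p
    D = S - mu * np.eye(p)                     # direction we shrink away from
    rho = np.sum(D * (S - Sigma)) / np.sum(D * D)
    return float(np.clip(rho, 0.0, 1.0))
```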
Fig. 1. Normalized MSE curves of the MMSE shrinkage estimator for the large covariance matrices discussed in Examples II.1 and II.2. The MMSE estimator fails to be consistent when $p/n \to c > 0$, because the normalized Frobenius risks converge to asymptotic values, calculated from (13) and (16), that are bounded away from 0. (a) Example II.1; (b) Example II.2.
Indeed, we will shortly show that this shrinkage estimator is inconsistent for some high-dimensional covariance matrices that often arise in real-world applications, which suggests that we should seek alternative solutions. This is the main motivation for the proposed STO estimators, to be introduced in Section III.

C. Comparison of Risk Bounds Between Tapering and Shrinkage Estimators

Comparing the error bounds in Theorems II.1 and II.2 with those in [16, Theorems 1 and 4], we are now ready to compare the tapering and MMSE shrinkage oracle estimators. The comparison is carried out through several specific examples, from which interesting conclusions can be drawn. These examples represent important classes of covariance matrices, such as auto-regressive models and fractional Brownian motions. We describe the examples as follows.

Example II.1: Consider, for some $0 < \rho < 1$,

$$\sigma_{ij} = \begin{cases} 1, & i = j, \\ \rho^{|i-j|}, & i \ne j. \end{cases} \qquad (11)$$

In words, the entries of $\Sigma$ decay exponentially fast when moving away from the main diagonal. This example corresponds to the covariance structure of auto-regressive models of order 1, AR(1), and was considered in [12], [14]. Using the normalization assumption on the diagonal and summing the squared entries along all diagonals of $\Sigma$, we can compute the Frobenius norm and sphericity statistic of $\Sigma$ explicitly, and it follows that (12) holds. Then, in the super-linear high-dimensional situation, as $n \to \infty$, $p \to \infty$, and $p/n \to \infty$, the Frobenius risk of the MMSE shrinkage oracle estimator converges, and we conclude that the asymptotic normalized Frobenius risk is given by (13).

It is clear that this normalized MSE is lower bounded by a positive constant depending on $\rho$, and therefore the MMSE shrinkage oracle estimator cannot be a consistent estimator of $\Sigma$ unless the concentration $p/n$ vanishes. Fig. 1(a) plots the finite sample size behavior of the normalized MSE and its limit (13). We can see that the normalized MSE asymptotically approaches a non-zero value when the concentration is large enough. On the contrary, note that $\Sigma$ in this example belongs to $\mathcal{F}_\alpha$ for any smoothing parameter $\alpha > 0$; we hence deduce that, under the Frobenius risk, the convergence rate in [16, Theorem 4] of the tapering estimator (1) with the minimax CMT in (3) can be made arbitrarily close to $O(n^{-1})$. Comparing these two bounds, it is clear that, for this example, the tapering estimator is uniformly superior to the MMSE shrinkage oracle estimator proposed in [12]. This oracle estimator is thus in fact a weak oracle, being overly restrictive in the functional form of the shrinkage estimator. The reason that the tapering estimator outperforms the MMSE shrinkage oracle estimator is that the optimal estimate may not necessarily be decomposable as a simple convex combination of the sample covariance matrix
and the scaled identity matrix; Example II.1 is good evidence of this fact. Furthermore, combining (7) and (12), we obtain (14), and, from the eigen-structure perspective, the spectral risk of the MMSE shrinkage oracle estimator (5) obeys the bound (15), which implies that the spectral risk does not vanish, because the corresponding sequence is monotone decreasing as the concentration grows.

As a summary of Example II.1, we conclude that, although the bona fide covariance matrix in (11) has a diagonal-like structure, shrinking $\hat{\Sigma}$ toward an identity matrix is inconsistent; it is therefore not a good choice in high-dimensional situations, and the procedure should be improved by taking into account more refined structural information. On the contrary, tapering is minimax in this example [16].

We now study a second example, with a slower, polynomial decay rate than in Example II.1, as considered in [16].

Example II.2: Consider, for some $\alpha > 0$,

$$\sigma_{ij} = \begin{cases} 1, & i = j, \\ |i-j|^{-(\alpha+1)}, & i \ne j. \end{cases}$$

Based on the definition in (2), we can show that $\Sigma \in \mathcal{F}_\alpha$. It follows from an argument analogous to the one used to derive the Frobenius risk (13) in Example II.1 that the asymptotic normalized Frobenius risk is given by (16), as illustrated in Fig. 1(b). Similar to the analysis in Example II.1, we can see that the tapering estimator outperforms the MMSE shrinkage oracle estimator under both the Frobenius and spectral risks. Again, the tapering estimator is minimax, and the MMSE shrinkage-to-identity estimator is inconsistent in this example when the dimensionality is large.

Example II.3: In a third example, we consider the covariance structure of a long-range dependent fractional Brownian motion (FBM), studied in [14], with Hurst parameter $H \in [1/2, 1)$:

$$\sigma_{ij} = \frac{1}{2} \left[ (|i-j|+1)^{2H} - 2|i-j|^{2H} + \big||i-j|-1\big|^{2H} \right]. \qquad (17)$$

The FBM is a model for complex systems that exhibit long-range dependence for $H$ close to 1, such as internet traffic [25]; practical applications usually tune $H$ between 0.5 and 0.9. The covariance matrix in this model does not belong to $\mathcal{F}_\alpha$ unless $H = 1/2$, which is the case of the Brownian motion with white Gaussian noise process; therefore, the tapering estimator is not guaranteed to be consistent in this example. To see this, first observe that, by convexity of $x^{2H}$ in $x$ for $H > 1/2$ and a telescoping-sum argument, the sums of $|\sigma_{ij}|$ outside a band of width $k$ decay only polynomially in a way that is not summable at the rate required by definition (2) for any $\alpha > 0$; hence $\Sigma \notin \mathcal{F}_\alpha$. On the other hand, since the covariance matrix in Example II.3 is Toeplitz, its Frobenius norm can be computed directly from its first row.

For $H \in [1/2, 1)$, the sphericity statistic can be viewed as a continuous function of $H$. Because the summands defining it are asymptotically convex for fixed $H > 1/2$ as $p \to \infty$, it follows from Jensen's inequality that this function is eventually increasing in $H$ for sufficiently large $p$: it is smallest as $H \downarrow 1/2$ and diverges as $H \uparrow 1$.
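To experiment with these three covariance classes, the following helper functions (our illustrative code; the function and parameter names are ours) generate the matrices of (11), Example II.2, and (17):

```python
import numpy as np

def ar1_cov(p: int, rho: float) -> np.ndarray:
    """Example II.1 / (11): sigma_ij = rho^|i-j| (AR(1) structure)."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rho ** d

def polydecay_cov(p: int, alpha: float) -> np.ndarray:
    """Example II.2: unit diagonal, |i-j|^{-(alpha+1)} off the diagonal."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    S = np.ones((p, p))
    mask = d > 0
    S[mask] = d[mask] ** (-(alpha + 1.0))
    return S

def fbm_cov(p: int, H: float) -> np.ndarray:
    """Example II.3 / (17): FBM increment covariance with Hurst parameter H."""
    k = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))
```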
By Theorem II.1, for the MMSE shrinkage oracle estimator applied to this model, the asymptotic Frobenius risk is governed by the concentration $c = p/n$. Therefore, if $p/n \to 0$, then the Frobenius risk vanishes and the MMSE shrinkage oracle estimator is consistent; if $p/n \to c > 0$, the Frobenius risk is asymptotically bounded away from zero, meaning that the Frobenius risk of estimating this large covariance matrix depends on the concentration $c$.

III. A SHRINKAGE-TO-TAPERING ESTIMATOR

A. Problem Formulation

Motivated by the above discussion, we propose a Steinian shrinkage type estimator. With an important difference from the shrinkage estimator toward a scaled identity matrix, the proposed estimator shrinks the sample covariance matrix toward its tapered version. The proposed estimator essentially subsumes the estimator of [13] as one special case, where the tapered target retains only the diagonal of $\hat{\Sigma}$. Specifically, the proposed estimator $\hat{\Sigma}_{STO}$ has the form

$$\hat{\Sigma}_{STO} = \rho \, (W \circ \hat{\Sigma}) + (1 - \rho) \, \hat{\Sigma}, \qquad (18)$$

where $\rho \in [0, 1]$ is determined by the solution to the optimization problem

$$\rho^* = \arg\min_\rho \ E \, \| \hat{\Sigma}_{STO} - \Sigma \|^2.$$

Here $\|\cdot\|$ can be either the Frobenius or the spectral norm. By using the tapering estimator as our shrinkage target, we hope this estimator can inherit good properties from both tapering and shrinkage estimators. Throughout the rest of the paper, we refer to this proposed estimator as the shrinkage-to-tapering oracle (STO) estimator. On one hand, for $\rho = 1$, we can see from [16, Theorems 1 and 4] that the proposed STO estimator reduces to the tapering estimator, which is minimax for large $n$ and $p$. On the other hand, when the tapered target degenerates to the diagonal, the proposed estimator reduces to an analogue of the MMSE shrinkage oracle estimator. Therefore, we expect that, for an arbitrary large covariance matrix $\Sigma$, the proposed estimator can improve upon both the tapering and MMSE shrinkage oracle estimators. The optimal coefficient of the STO estimator is available in closed form.

Theorem III.1: The coefficient of the proposed STO estimator under the minimum Frobenius risk is

$$\rho^* = \frac{E \, \mathrm{tr} \big[ (\hat{\Sigma} - W \circ \hat{\Sigma}) (\hat{\Sigma} - \Sigma) \big]}{E \, \| \hat{\Sigma} - W \circ \hat{\Sigma} \|_F^2}, \qquad (19)$$

truncated to $[0, 1]$ if necessary. Under the further Gaussian assumption, (19) can be written in the closed form given by (20).
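For completeness, the optimal coefficient in (19) follows from a one-line quadratic minimization (our reconstruction from the definition (18); the paper's full proof is in Appendix C). Writing the risk as a quadratic in $\rho$,

$$R(\rho) = E \, \big\| (\hat{\Sigma} - \Sigma) - \rho \, (\hat{\Sigma} - W \circ \hat{\Sigma}) \big\|_F^2 = E \, \|\hat{\Sigma} - \Sigma\|_F^2 - 2\rho \, E \, \mathrm{tr}\big[(\hat{\Sigma} - W \circ \hat{\Sigma})(\hat{\Sigma} - \Sigma)\big] + \rho^2 \, E \, \|\hat{\Sigma} - W \circ \hat{\Sigma}\|_F^2,$$

and setting $R'(\rho) = 0$ yields (19).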
B. Approximating the Oracle

The proposed STO estimator is convenient in theory, since the optimal coefficient has a closed-form expression. In practice, however, the true $\Sigma$ is the goal of estimation and thus unknown, so the oracle estimator is not feasible. We therefore propose a practical algorithm that approximates the oracle. Following the idea of [12], we derive a shrinkage-to-tapering oracle approximating (STOA) iterative algorithm as follows. First, the STOA algorithm is initialized at the sample covariance matrix; it is then natural to replace $\Sigma$ by the current estimate in (20) to calculate the optimal coefficient, from which we obtain the next estimate. Formally, by induction, based on the current step $j$, the optimal coefficient $\rho^{(j+1)}$ for step $j+1$ is given by the plug-in version (21) of (20), and the estimate at step $j+1$ is computed as

$$\hat{\Sigma}^{(j+1)} = \rho^{(j+1)} (W \circ \hat{\Sigma}) + (1 - \rho^{(j+1)}) \, \hat{\Sigma}. \qquad (22)$$

These two steps are iterated until the sequence of coefficients converges, and the STOA estimator is defined using its limit.

At present, we are unable to derive a rigorous convergence theory for the proposed STOA estimator, as was done for the oracle approximating shrinkage (OAS) estimator in [12]. Our empirical simulations in Section IV, however, demonstrate that the STOA estimator can approximate the STO estimator reasonably well for a broad range of $\Sigma$, regardless of its sparsity, and our numerical experiments show that the MSEs converge after only a few iterations.
For the proposed estimators, an important practical issue is determining the bandwidth $k$ of $W$ in the tapering step used to compute the shrinkage target matrix. We adopt a data-driven random-split procedure (i.e., cross-validation) for estimating $k$ as follows. We randomly split the $n$ independent data samples into two subsets of sizes $n_1$ and $n_2$, respectively, and choose $k$ from a set of candidate values. For each candidate $k$, the covariance matrix is estimated from one subset (the training data set), and we calculate the distance, e.g., in the Frobenius or spectral norm, between this estimate and the sample covariance matrix computed from the other subset (the testing data set). Finally, the optimal $k$ is the candidate yielding the smallest distance. Due to this extra step of determining $k$, the proposed estimator is computationally more expensive than the LW, RBLW, OAS, and MMSE shrinkage oracle estimators. Nonetheless, the validation overhead is a minor computational issue in practice, because the proposed STO and STOA estimators are quite efficient for any pre-specified $k$, and the computational costs of all shrinkage estimators mentioned in this paper are comparable. The STOA pseudo-code is listed in Algorithm 1.

Algorithm 1: The STOA algorithm
Input: data samples, candidate bandwidths
Output: the STOA estimate
begin
  for each candidate bandwidth $k$ do
    Construct the CMT $W$ with bandwidth $k$ as in (3);
    Initialize from $\hat{\Sigma}$ and calculate the optimal shrinkage coefficient by iterating (21) and (22) until convergence;
  end
  Find the best bandwidth of $W$ by the minimum prediction error on the test data;
  Return the STOA estimate computed with the selected bandwidth.
end

In general, bandwidth selection is a difficult problem, which depends on the true covariance structure. Fortunately, we can show that the above random-split procedure chooses a bandwidth whose induced estimation error is asymptotically equal to that of the estimator with the true covariance matrix (i.e., an oracle) available. The STO oracle estimator is defined by the minimization problem (23), minimized over the tapering bandwidth of the STO estimator (18); it is clear that the STO oracle estimator corresponds to the STO estimator with the optimal prediction Frobenius risk. Let $\hat{k}$ be the bandwidth minimizing the empirical Frobenius loss

$$\hat{k} = \arg\min_k \big\| \hat{\Sigma}_{STO}^{(k)} - \hat{\Sigma}_2 \big\|_F^2, \qquad (24)$$

where $\hat{\Sigma}_2$ is the sample covariance matrix of the testing data set. We prove the following theorem to justify the use of the random-split procedure for bandwidth selection; it shows that the procedure achieves nearly the optimal prediction risk, as if the true $\Sigma$ were known.

Theorem III.2: Suppose $x_1, \dots, x_n$ are i.i.d. Gaussian $N(0, \Sigma)$. Let $R^*$ be the oracle estimation error defined through (23). Under a uniform boundedness condition on the candidate matrices, the risk (25) of the STO estimator with the bandwidth $\hat{k}$ determined by the random-split procedure (24) asymptotically matches $R^*$.

By Theorem III.2, we have the following corollary.

Corollary III.3: Suppose $x_1, \dots, x_n$ are i.i.d. Gaussian $N(0, \Sigma)$. If certain structural information is available, such as $\Sigma \in \mathcal{F}_\alpha$, for which the oracle is consistent, then the random split chooses a bandwidth of the tapering target such that the resulting STO estimator is also consistent for $\Sigma$.
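A compact end-to-end sketch of Algorithm 1 in Python follows (our illustration under stated assumptions; `taper_weights` is the helper sketched earlier). Because we do not reproduce the exact closed forms (20)/(21), this stand-in selects both the shrinkage weight and the bandwidth on held-out data, in the spirit of the split criterion (24), rather than iterating the Gaussian closed form:

```python
import numpy as np

def sto_cv(X_train, X_test, ks, rhos=np.linspace(0.0, 1.0, 21)):
    """Validation-based stand-in for Algorithm 1: pick (rho, k) by prediction loss."""
    n, p = X_train.shape
    S = X_train.T @ X_train / n
    S_test = X_test.T @ X_test / X_test.shape[0]
    best = None
    for k in ks:
        T = taper_weights(p, k) * S                   # tapered target (1)
        for rho in rhos:
            Sigma_hat = rho * T + (1.0 - rho) * S     # STO form (18)
            err = np.linalg.norm(Sigma_hat - S_test, "fro") ** 2   # loss as in (24)
            if best is None or err < best[0]:
                best = (err, Sigma_hat, rho, k)
    return best[1], best[2], best[3]                  # estimate, weight, bandwidth
```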
IV. SIMULATION

We simulate the three examples discussed earlier in this paper to study the finite sample numerical performance of the proposed estimators. We fix the dimension $p$ for all models and consider a range of sample sizes $n$. The STOA algorithm is initialized at the sample covariance matrix, and the maximum number of iterations in the STOA algorithm is set to 10. We compare the proposed STO estimator and its variant STOA with the tapering estimator [16] and several shrinkage estimators, including the LW [8], the Rao-Blackwellized LW (RBLW) [12], the MMSE shrinkage oracle (MMSEO) estimator, and its variant, the oracle approximating shrinkage (OAS) estimator [12].

A. Model 1: AR(1)

The $i$th-row and $j$th-column entry of the covariance matrix is

$$\sigma_{ij} = \rho^{|i-j|}, \qquad (26)$$

where $\rho$ is chosen from several values in $(0, 1)$. A smaller $\rho$ essentially makes $\Sigma$ more like the identity matrix. Note that $\Sigma \in \mathcal{F}_\alpha$ for any $\alpha > 0$. To specify the tapering bandwidth, we determine a proper $k$ by the described random-split approach; i.e., our tapering estimator is fit on a training data set over a pre-defined grid of bandwidths, and the optimal $k$ is selected by minimizing the Frobenius loss against the oracle.
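A minimal harness for reproducing this kind of comparison (our code; the replication count and parameter values are arbitrary) estimates the normalized Frobenius risk of any estimator by Monte Carlo:

```python
import numpy as np

def normalized_frobenius_risk(estimator, Sigma, n, reps=100, seed=0):
    """Monte-Carlo estimate of E||est - Sigma||_F^2 / ||Sigma||_F^2."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    losses = []
    for _ in range(reps):
        X = rng.standard_normal((n, p)) @ L.T    # i.i.d. rows from N(0, Sigma)
        err = estimator(X) - Sigma
        losses.append(np.sum(err * err))
    return np.mean(losses) / np.sum(Sigma * Sigma)

# Example usage with the AR(1) helper sketched earlier:
# Sigma = ar1_cov(100, 0.7)
# risk = normalized_frobenius_risk(lambda X: X.T @ X / X.shape[0], Sigma, n=50)
```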
Fig. 2. Model 1 in (26): the normalized MSEs as a function of $n$, averaged over 100 replications, for four values of $\rho$ (panels (a)–(d)). The tapering [16], LW [8], RBLW [12], MMSE shrinkage oracle (MMSEO) [12], and OAS [12] estimators are compared with the proposed STO and STOA estimators.
The estimated normalized MSEs (i.e., the Frobenius risk) and the spectral risk are plotted in Figs. 2 and 3 for the aforementioned estimators. Several interesting observations can be made. First, in terms of estimation risks, for large $n$, the STO, STOA, and tapering estimators generally improve upon the previous shrinkage-type estimators, including LW, RBLW, OAS, and MMSEO. For a small $\rho$ (e.g., in Fig. 3(a)), a large sample size is needed for STO and STOA to eventually outperform the other methods. A larger sample size is required for the smallest $\rho$, compared to the larger values; this is intuitive because, when $\rho$ gets smaller, the true covariance matrix is closer to the identity matrix, which better fits the model of the shrinkage-to-identity methods (LW, RBLW, OAS, MMSEO), and thus more data points are needed for the proposed STO and STOA to beat their performance. The trends of the Frobenius and spectral risks in Figs. 2 and 3 nevertheless validate our large-sample Theorems II.1 and II.2, and the improvement is clear even when $n$ is not as large as considered in the asymptotic setup of the theorems. Second, the proposed STO and STOA also outperform the tapering estimator, although the improvement is smaller than that over the shrinkage-type estimators; the improvement in the Frobenius risk is slightly more significant than that in the spectral risk. Third, it is clear from these two figures that the proposed STOA approximates the STO estimator well. Fourth, although the STO estimator minimizes the MSE, the results for the spectral risk are similar to those for the Frobenius risk, suggesting that STO and STOA are robust to the norm under which the risk is minimized. We also note that, for a very small sample size, the MMSE oracle estimator performs the best. Since the Frobenius and spectral risks of LW, RBLW, and OAS are very large at the smallest sample sizes, we truncate the plots at certain levels to keep the curves distinguishable. It is worth noting that the case of very small $n$ does not fit the high-dimensional setup well, which requires both $n$ and $p$ to be large. Similar observations hold for the Model-2 and Model-3 examples.

The estimated shrinkage coefficients are plotted in Fig. 4.
Fig. 3. Model 1 in (26): the spectral risks as a function of $n$, averaged over 100 replications, for four values of $\rho$ (panels (a)–(d)). The legends are the same as in Fig. 2.
It is observed that, in general, the coefficients of STO and STOA are closer to 1 than those of the other shrinkage estimators. This means that STO and STOA essentially use the tapered matrix $W \circ \hat{\Sigma}$ as the estimator, with a slight adjustment that incorporates information directly from $\hat{\Sigma}$. This agrees with the theory we have seen, since the tapering estimator is minimax. Moreover, the shrinkage coefficients of the LW, RBLW, MMSEO, and OAS estimators tend to decrease as $n$ increases. This makes sense because, the more data collected, the more information should be used from $\hat{\Sigma}$, in which case a smaller coefficient should be adaptively chosen. On the contrary, we do not observe this behavior for the STO and STOA estimators: their coefficients seem to converge to 1 in this empirical study, as seen in Fig. 4. This is due to the fact that the shrinkage target, $W \circ \hat{\Sigma}$, of these two estimators itself contains the data information. When the tapered target is truly optimal, it is sufficient for the STO and STOA estimators to use only the target component, and thus the optimal coefficients converge to 1 in this example.

We also empirically investigate the convergence of the proposed STOA estimator. The normalized MSEs are plotted as a function of the iteration number of STOA in Fig. 5; only the first 10 iterations, with initialization at the sample covariance matrix, are shown.
We can clearly see that the STOA estimator generally converges within the first 5 iterations.

B. Model 2: Polynomial Decay

We have

$$\sigma_{ij} = \begin{cases} 1, & i = j, \\ |i-j|^{-(\alpha+1)}, & i \ne j, \end{cases} \qquad (27)$$

where we choose the smoothing parameter $\alpha$ from a set of values. We simply set the tapering bandwidth to the order $n^{1/(2(\alpha+1))}$ in order to achieve the optimal convergence rate under the Frobenius norm. We remark that the numerical performance can be further improved by cross-validating over a set of bandwidths of this order. The risk results of the different estimators are shown in Figs. 6 and 7. Again, we observe the same patterns in the error curves, and essentially the same conclusions can be drawn as in Model 1.

C. Model 3: Fractional Brownian Motion

The numerical performance of the STO and STOA estimators when the sparsity assumption fails is studied in the third model, the FBM model.
Fig. 4. Model 1 in (26): the estimated shrinkage coefficients for the different estimators, averaged over 100 replications, for four values of $\rho$ (panels (a)–(d)). The legends are the same as in Fig. 2, except that the tapering estimator is excluded and STO/STOA and STO_2/STOA_2 denote the STO/STOA estimates under the Frobenius and spectral risks, respectively.
In our setup, we consider the FBM with the Hurst parameter $H$ selected from several values between 0.5 and 0.9. From Fig. 8, we can see that the normalized MSEs of the MMSE shrinkage estimators are smaller than that of the tapering estimator. This is not surprising because: (i) the assumption $\Sigma \in \mathcal{F}_\alpha$ is violated, and therefore no optimality under the Frobenius risk can be expected for the tapering estimator; and (ii) the MMSE estimators are designed to minimize the Frobenius risk. When looking at the spectral risk in Fig. 9, we observe that the risk of the tapering estimator is smaller than those of the MMSE family. The tapering estimator is thus quite robust in the sense that, although sub-optimal, it still provides better spectral risk performance than the MMSE shrinkage estimators. In contrast, the MMSE shrinkage estimators are sensitive to the norm under which the risk performance is measured; in particular, they are only optimal in the Frobenius norm. It is observed that the STO and STOA estimators uniformly outperform the other shrinkage estimators for the larger values of $H$. For the smallest $H$ considered, they are outperformed by the LW, RBLW, OAS, and MMSEO estimators but still yield smaller MSEs than the tapering estimator. One intermediate case appears to be non-uniform; however, the curve trends shown in Fig. 8(b) suggest that STO and STOA may eventually yield a smaller Frobenius risk as $n$ gets larger.

V. BREAST CANCER DATA CLASSIFICATION

The proposed STOA estimator is applied to a real breast cancer data set [26]. This data set contains 22283 gene expression measurements for a total of 133 subjects: 34 subjects with pathological complete response (pCR) and 99 subjects with residual disease (RD). pCR subjects are considered to have a high chance of surviving breast cancer; therefore, prediction of pCR status from gene expression profiles is of great interest. In our experiment, 150 differentially expressed genes, retained by a two-sample $t$-test, are used as predictors. A previous study [27] used the linear discriminant analysis (LDA) classifier, which assigns the label maximizing the LDA discriminant scores.
Fig. 5. The normalized MSEs of the STOA estimator as a function of the number of iterations. (a) Model 1 in (26); (b) Model 2 in (27); (c) Model 3 in (17).
For a test sample $x$, the LDA score of group $g$ takes the standard form $\delta_g(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_g - \frac{1}{2} \hat{\mu}_g^T \hat{\Sigma}^{-1} \hat{\mu}_g + \log \hat{\pi}_g$, where $\hat{\mu}_g$ is the sample mean of group $g$ and $\hat{\pi}_g = n_g/n$, with $n_g$ being the size of group $g$. It is clear that the performance of the LDA classifier crucially depends on the estimation quality of the (inverse) covariance matrix. Therefore, the goal of this experiment is to compare the classification performance of the LDA method with different covariance matrix estimators. Specifically, the data set is randomly split into 112 training and 21 test samples. We report the sensitivity, specificity, and Matthews correlation coefficient (MCC) of the LDA classifiers when using covariance matrix estimators from the LW, RBLW, OAS, and proposed STOA methods. MCC is a joint measure combining the specificity and the sensitivity; its values always lie between $-1$ and $1$, with 1 indicating
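For reference, sensitivity, specificity, and MCC can be computed from the confusion counts as in the short sketch below (our code; the formulas are the standard definitions):

```python
import numpy as np

def confusion_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Sensitivity, specificity, and Matthews correlation coefficient (MCC)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return sensitivity, specificity, mcc
```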
perfect prediction and $-1$ the worst performance. Threefold cross-validation is used to tune the bandwidth parameter in the STOA estimator on the training set. In this real application, the MMSEO and STO estimators are not applicable, since they require the true covariance matrix. Moreover, the tapering estimator is not included in the comparison, since it is not guaranteed to be positive-definite. The results, averaged over 100 random splits, are shown in Table I. Based on these results, we note that LW achieves perfect sensitivity among all methods, while its specificity is the lowest. Also, RBLW, OAS, and STOA yield similar performances, all having a high specificity and a low sensitivity. Among the methods, the proposed STOA yields the best overall performance in terms of MCC.

VI. CONCLUSIONS

The main contributions of the paper are summarized as follows: 1) For high-dimensional covariance estimation problems where $p \gg n$, we show that the MMSE shrinkage oracle
Fig. 6. Model 2 in (27): the normalized MSEs as a function of $n$, averaged over 100 replications, for four parameter settings (panels (a)–(d)).
estimator is inconsistent under both the Frobenius and spectral risks for some typical covariance matrices in $\mathcal{F}_\alpha$. Moreover, we show that the tapering estimator is uniformly superior to the MMSE shrinkage estimator in this case. 2) We propose an STO estimator that combines the advantages of both the MMSE shrinkage and tapering estimators. Our simulation study shows that the proposed method can significantly outperform one type of estimator and marginally outperform the other; notably, it combines the strengths of tapering (exploiting prior structural information) and shrinkage (requiring no prior information) in a fully automatic and data-dependent manner. Therefore, it is suitable for estimating general, high-dimensional covariance matrices. An oracle estimator in closed form is derived, and a practical algorithm to approximate the STO estimator is presented. Selection of the bandwidth of the tapering target is theoretically justified to achieve nearly optimal prediction error.

We have compared the numerical performance of various covariance estimators by simulation. It would be interesting to apply the proposed estimator to more real-world data sets in
future work; for instance, precise estimation of the covariance over a large number of channels of hyperspectral images from a smaller number of pixels [28] is also a natural application of the proposed framework.

APPENDIX

A. Proof of Theorem II.1

Proof: By definition, the left-hand side of (9) can be written as an expansion in the oracle coefficient. We analyze the first term in this expansion and evaluate it explicitly.
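Concretely, the decomposition underlying this step (our reconstruction from the definition of (5)) is

$$E \, \|\hat{\Sigma}_\rho - \Sigma\|_F^2 = (1-\rho)^2 \, E \, \|\hat{\Sigma} - \Sigma\|_F^2 + 2\rho(1-\rho) \, E \, \mathrm{tr}\big[(\hat{\Sigma} - \Sigma)(\hat{\mu} I - \Sigma)\big] + \rho^2 \, E \, \|\hat{\mu} I - \Sigma\|_F^2,$$

obtained by writing $\hat{\Sigma}_\rho - \Sigma = (1-\rho)(\hat{\Sigma} - \Sigma) + \rho(\hat{\mu} I - \Sigma)$ and expanding the squared Frobenius norm.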
Fig. 7. Model 2 in (27): the spectral risks as a function of $n$, averaged over 100 replications, for four parameter settings (panels (a)–(d)).
Denote the terms of the expansion accordingly. Note that, for any square matrix $A$, we have $\|A\|_F^2 = \mathrm{tr}(A^T A)$ and $\mathrm{tr}(\hat{\mu} I \cdot A) = \hat{\mu} \, \mathrm{tr}(A)$; therefore, expanding further by substituting $\rho^*$ from (6), we can evaluate each term. Now, by the Gaussian assumption, we have

$$E \, \|\hat{\Sigma} - \Sigma\|_F^2 = \frac{\mathrm{tr}(\Sigma^2) + \mathrm{tr}^2(\Sigma)}{n}.$$

Substituting this expression into the expansion, the theorem follows.
B. Proof of Theorem II.2

Proof: By the definition of the matrix spectral norm, and noting that $\|\Sigma\|_2 = \lambda_1(\Sigma)$ and $\|I\|_2 = 1$, we can obtain a chain of inequalities lower bounding the spectral risk.
Fig. 8. Model 3 (FBM) in (17): the normalized MSEs as a function of $n$, averaged over 100 replications, for four values of $H$ (panels (a)–(d)).
Here we used Jensen's inequality twice, at the second and third steps.
C. Proof of Theorem III.1

Proof: Let $T = W \circ \hat{\Sigma}$ and $D = \hat{\Sigma} - T$, and note that

$$\hat{\Sigma}_{STO} - \Sigma = (\hat{\Sigma} - \Sigma) - \rho D. \qquad (28)$$

Taking expectations of the squared Frobenius norm on both sides, we obtain the quadratic risk function (29). To calculate the optimal coefficient, we need to find the minimizer of this risk; expanding it using (28) and (29), differentiating with respect to $\rho$, and setting the derivative to zero, we immediately get (19). Further, if it is assumed that the $x_i$ follow i.i.d. $N(0, \Sigma)$, then, by Wick's theorem, the required fourth-order Gaussian moments factor into sums of products of pairwise covariances.
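The Wick (Isserlis) identity invoked here is the standard fourth-moment formula for zero-mean Gaussian vectors:

$$E[x_i x_j x_k x_l] = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}, \qquad x = (x_1, \dots, x_p)^T \sim N(0, \Sigma).$$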
Fig. 9. Model 3 (FBM) in (17): the spectral risks as a function of $n$, averaged over 100 replications, for four values of $H$ (panels (a)–(d)).
Now, it follows from direct calculation of the required second and fourth moments that the expectations in (19) admit explicit expressions,
from which, together with the analogous calculation for the remaining terms, it follows that substituting into (19) yields (20). If the resulting coefficient falls outside $[0, 1]$, we modify it to be either 0 or 1, whichever gives the smaller MSE: the risk is a quadratic function of $\rho$ and therefore attains its constrained minimum at one of the boundary points.

D. Proof of Theorem III.2

Proof: The proof is similar to the argument of [15, Theorem 3], with the necessary modifications to accommodate the STO estimation context; for completeness, however, we give the proof. By the definitions (23) and (24), the difference between the empirical and oracle losses can be controlled, and, by [15, Proposition 1], we obtain a concentration bound on the deviation terms.
TABLE I: Specificity, sensitivity, and MCC of the classification performance of LW [8], RBLW [12], OAS [12], and the proposed STOA for the breast cancer gene expression data.
Since the remaining term is of smaller order, we easily deduce the stated bound, and the theorem follows.

E. Proof of Corollary III.3

Proof: We check the conditions of Theorem III.2. Since $\Sigma \in \mathcal{F}_\alpha$, the tapering bias is uniformly controlled for any candidate bandwidth, up to some constant. Then, under the Gaussian assumption, it follows from [15, Lemma A.3] that there exists a constant $C$ such that the stochastic term is bounded accordingly. Now, since the oracle risk vanishes in this regime, it is clear that the conditions of Theorem III.2 hold. Hence, by Theorem III.2, we conclude that the STO estimator with the selected bandwidth is consistent.
ACKNOWLEDGMENT

We are grateful to the Associate Editor and three anonymous reviewers whose suggestions have greatly improved the paper. This work was partially funded by a Pacific Alzheimer's Research Foundation (PARF) Centre Grant Award.

REFERENCES

[1] R. Abrahamsson, Y. Selen, and P. Stoica, "Enhanced covariance matrix estimators in adaptive beamforming," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2007, pp. 969–972.
[2] J. R. Guerci, "Theory and application of covariance matrix tapers for robust adaptive beamforming," IEEE Trans. Signal Process., vol. 47, no. 4, pp. 977–985, 1999.
[3] G. Cao, L. R. Bachega, and C. A. Bouman, "The sparse matrix transform for covariance estimation and analysis of high dimensional signals," IEEE Trans. Image Process., vol. 20, no. 3, pp. 625–640, 2011.
[4] I. M. Johnstone and A. Y. Lu, "Sparse principal components analysis," 2004 [Online]. Available: http://www-stat.stanford.edu/~imj/WEBLIST/AsYetUnpub/sparse.pdf, arXiv:0901.4392v1
[5] Z. Bai and J. W. Silverstein, "No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices," Ann. Probab., vol. 26, no. 1, pp. 316–345, 1998.
[6] V. A. Marchenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Math. USSR-Sbornik, vol. 1, no. 4, pp. 507–536, 1967.
[7] O. Ledoit and M. Wolf, "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection," J. Empir. Fin., vol. 10, no. 5, pp. 603–621, 2003.
[8] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," J. Multivar. Anal., vol. 88, no. 2, pp. 365–411, 2004.
[9] O. Ledoit and M. Wolf, "Honey, I shrunk the sample covariance matrix," J. Portfolio Manage., vol. 30, no. 4, pp. 110–119, 2004.
[10] J. Schafer and K. Strimmer, "A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics," Stat. Appl. Genet. Mol. Biol., vol. 4, 2005.
[11] P. Stoica, J. Li, X. Zhu, and J. R. Guerci, "On using a priori knowledge in space-time adaptive processing," IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2598–2602, 2008.
[12] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, "Shrinkage algorithms for MMSE covariance estimation," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5016–5029, 2010.
[13] T. J. Fisher and X. Sun, "Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix," Comput. Stat. Data Anal., vol. 55, pp. 1909–1918, 2011.
[14] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Ann. Stat., vol. 36, no. 1, pp. 199–227, 2008.
[15] P. J. Bickel and E. Levina, "Covariance regularization by thresholding," Ann. Stat., vol. 36, no. 6, pp. 2577–2604, 2008.
[16] T. Cai, C.-H. Zhang, and H. Zhou, "Optimal rates of convergence for covariance matrix estimation," Ann. Stat., vol. 38, no. 4, pp. 2118–2144, 2010.
[17] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical Lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[18] P. J. Bickel and M. Lindner, "Approximating the inverse of banded matrices by banded matrices with applications to probability and statistics," Ann. Statist., submitted for publication [Online]. Available: http://www-stat.wharton.upenn.edu/~tcai/paper/Sparse-Covariance-Matrix.pdf
[19] T. Cai and H. Zhou, "Optimal rates of convergence for sparse covariance matrix estimation," 2011, preprint.
[20] T. Cai and H. Zhou, "Minimax estimation of large covariance matrices under $\ell_1$-norm," Statistica Sinica, 2011, to be published.
[21] C. Stein, "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution," in Proc. 3rd Berkeley Symp. Math. Stat. Probab., 1956, pp. 197–206.
[22] W. James and C. Stein, "Estimation with quadratic loss," in Proc. 4th Berkeley Symp. Math. Stat. Probab., 1961, pp. 361–379.
[23] C. P. Robert, The Bayesian Choice, ser. Springer Texts in Statistics, 2nd ed. New York: Springer-Verlag, 2007.
[24] J. Hoffbeck and D. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 7, pp. 763–767, 1996.
[25] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the self-similar nature of Ethernet traffic (extended version)," IEEE/ACM Trans. Netw., vol. 2, no. 1, pp. 1–15, 1994.
[26] K. R. Hess, K. Anderson, W. F. Symmans, V. Valero, N. Ibrahim, J. A. Mejia, D. Booser, R. L. Theriault, A. U. Buzdar, P. J. Dempsey, R. Rouzier, N. Sneige, J. S. Ross, T. Vidaurre, H. L. Gómez, G. N. Hortobagyi, and L. Pusztai, "Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer," J. Clin. Oncol., vol. 24, no. 26, pp. 4236–4244, 2006.
[27] J. Fan, Y. Feng, and Y. Wu, "Network exploration via the adaptive Lasso and SCAD penalties," Ann. Appl. Stat., vol. 3, no. 2, pp. 521–541, 2009.
[28] J. Theiler, G. Cao, L. R. Bachega, and C. A. Bouman, "Sparse matrix transform for hyperspectral image processing," IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 424–437, 2011.
Xiaohui Chen received the B.Sc. degree from Zhejiang University in 2006, and the M.Sc. degree and Ph.D. degree in electrical and computer engineering, both from the University of British Columbia (UBC), in 2008 and 2012, respectively. He is currently a Research Associate in the Department of Electrical and Computer Engineering at UBC. His current research interests include the broad areas of high-dimensional statistical inference and non-parametric methods for time series data.
Z. Jane Wang (SM'12) received the B.Sc. degree (with highest honors) from Tsinghua University, China, in 1996, and the M.Sc. and Ph.D. degrees from the University of Connecticut, Storrs, in 2000 and 2002, respectively, all in electrical engineering. She was a Research Associate at the Institute for Systems Research, University of Maryland, College Park. She joined the Department of Electrical and Computer Engineering at the University of British Columbia (UBC) in 2004, where she is an Associate Professor. Her research interests are in the broad areas of statistical signal processing theory and applications. Dr. Wang was a corecipient of the EURASIP Best Paper Award 2004 and the IEEE Signal Processing Society Best Paper Award 2005. She has been an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE SIGNAL PROCESSING LETTERS, and the IEEE TRANSACTIONS ON MULTIMEDIA.
Martin J. McKeown received the B.Eng. degree in engineering physics (summa cum laude) from McMaster University, Hamilton, ON, Canada, in 1986 and the M.D. degree from the University of Toronto, Toronto, ON, Canada, in 1990. From 1990 to 1994, he specialized in neurology at the University of Western Ontario, and later did a fellowship in clinical electrophysiology. He is Board Certified in neurology in Canada and the U.S., and has been an examiner for the U.S. Neurology Board certification. From 1995 to 1998, he was a Research Associate at the Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA. From 1998 to 2003, he was an Assistant Professor of Medicine at Duke University and core faculty at Duke's Brain Research and Analysis Center, Durham, NC, as well as an Adjunct Professor in Biomedical Engineering and Duke's Center for Neural Analysis. He is currently a movement disorder specialist and Clinical Director at the Pacific Parkinson's Research Centre, a Professor of Medicine (Neurology), Associate Member of the Department of Electrical and Computer Engineering, and faculty member of the Brain Research Centre at the University of British Columbia.