High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann
February 3, 2012
Problem definition
- Want to estimate the covariance matrix of a Gaussian distribution: e.g. gene expression
- Take a random sample of vectors X^(1), ..., X^(n) i.i.d. ∼ N_p(0, Σ0), where p may depend on n
- Let Θ0 = Σ0^{-1} denote the concentration matrix
- Sparsity: certain elements of Θ0 are zero
- Task: use the sample to estimate a set of zeros, and then an estimator of Θ0 (and Σ0) given the pattern of zeros
- Show consistency in predictive risk and in estimating Θ0 and Σ0 as n, p → ∞
Gaussian Graphical Model
- X = (X_1, ..., X_p) ∼ N(µ, Σ0)
- Given a graph G = (V, E0), a pair (i, j) is NOT contained in E0 (θ_{0,ij} = 0) iff X_i ⊥⊥ X_j | {X_k : k ∈ V \ {i, j}}
- Define the predictive risk at Θ as R(Θ) = tr(ΘΣ0) − log|Θ| ∝ −2 E_0[log f_Θ(X)], where the Gaussian log-likelihood is log f_Θ(X) = −(p/2) log 2π + (1/2) log|Θ| − (1/2) X^T Θ X
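As a quick sanity check of these formulas, here is a minimal Python sketch (the function name and toy matrices are mine, not from the paper) evaluating the predictive risk:

```python
import numpy as np

def predictive_risk(Theta, Sigma0):
    """R(Theta) = tr(Theta Sigma0) - log|Theta|; up to an additive constant
    this is -2 times the expected log-likelihood E0[log f_Theta(X)]."""
    return np.trace(Theta @ Sigma0) - np.linalg.slogdet(Theta)[1]

# The risk is minimized at the true concentration matrix Theta0:
Sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])
Theta0 = np.linalg.inv(Sigma0)
print(predictive_risk(Theta0, Sigma0))     # = p - log|Theta0| ~ 1.712
print(predictive_risk(np.eye(2), Sigma0))  # = tr(Sigma0) = 2 > 1.712
```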
Previous work
- Yuan and Lin (2007), Banerjee, El Ghaoui and d'Aspremont (2008), d'Aspremont et al. (2008), Friedman et al. (2008):
  min_Ω tr(Σ̂Ω) − log det(Ω) + λ Σ_{i≠j} |ω_{ij}|,
  where Σ̂ is the sample covariance matrix
- Meinshausen and Bühlmann (2006): regression approach
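For comparison, this l1-penalized objective is what scikit-learn's GraphicalLasso solves; a minimal usage sketch (the data and the penalty value alpha are arbitrary placeholders):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=200)

# alpha plays the role of lambda; the l1 penalty acts on the
# off-diagonal entries of the estimated concentration matrix.
model = GraphicalLasso(alpha=0.1).fit(X)
Omega_hat = model.precision_   # estimated concentration matrix
Sigma_hat = model.covariance_  # estimated covariance matrix
```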
A Regression Model
We assume a multivariate Gaussian model X = (X_1, ..., X_p) ∼ N_p(0, Σ0), where Σ_{0,ii} = 1.
Regression model:
- For all i = 1, ..., p: X_i = Σ_{j≠i} β_j^i X_j + V_i, where β_j^i = −θ_{0,ij}/θ_{0,ii} and V_i ∼ N(0, σ²_{V_i}) is independent of {X_j : j ≠ i} (a numerical check follows below)
- For all i, Var(V_i) = 1/θ_{0,ii}
- X_i ⊥⊥ X_j | {X_k : k ∈ V \ {i, j}} ⇔ θ_{0,ij} = 0 ⇔ β_i^j = 0 and β_j^i = 0
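The identity β_j^i = −θ_{0,ij}/θ_{0,ii} can be verified numerically: the population regression coefficients of X_i on the remaining variables are Σ_{−i,−i}^{−1} Σ_{−i,i}, which coincide with −Θ_{−i,i}/Θ_{ii}. A small numpy sketch (the example matrix is mine):

```python
import numpy as np

# Any symmetric positive-definite concentration matrix works for the check.
Theta0 = np.array([[2.0, 0.6, 0.0],
                   [0.6, 2.0, 0.5],
                   [0.0, 0.5, 2.0]])
Sigma0 = np.linalg.inv(Theta0)

i, rest = 0, [1, 2]
# Population regression of X_i on X_{-i}: Sigma_{-i,-i}^{-1} Sigma_{-i,i}
beta = np.linalg.solve(Sigma0[np.ix_(rest, rest)], Sigma0[rest, i])
# The same coefficients read off the concentration matrix
beta_from_theta = -Theta0[rest, i] / Theta0[i, i]
print(np.allclose(beta, beta_from_theta))  # True
# Note theta_{0,02} = 0, so the coefficient of X_2 is zero: no edge (0,2).
```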
Meinshausen and Bühlmann 2006
- Perform p Lasso regressions to obtain β̂^1, ..., β̂^p, where β̂^i = {β̂_j^i : j ∈ {1, ..., p} \ {i}}
- Then estimate the edge set by the OR rule (a sketch follows below)
- Under sparsity and neighborhood-stability conditions, they show P(Ê_n = E0) → 1 as n → ∞
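A minimal sketch of this scheme with scikit-learn's Lasso (the penalty lam is an untuned placeholder, and mb_edge_set is a name I made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_edge_set(X, lam):
    """Nodewise Lasso: regress each X_i on all other variables,
    then combine the estimated neighborhoods with the OR rule."""
    n, p = X.shape
    edges = set()
    for i in range(p):
        rest = [j for j in range(p) if j != i]
        beta_hat = Lasso(alpha=lam).fit(X[:, rest], X[:, i]).coef_
        for b, j in zip(beta_hat, rest):
            if b != 0.0:  # OR rule: an edge needs only one of the two
                edges.add((min(i, j), max(i, j)))  # regressions to select it
    return edges
```

With the AND rule one would instead require both β̂_j^i and β̂_i^j to be non-zero.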
Essential Sparsity
- For each i, define s^i_{0,n} as the smallest integer such that
  Σ_{j=1, j≠i}^p min{θ²_{0,ij}, λ² θ²_{0,ii}} ≤ s^i_{0,n} λ² θ²_{0,ii}, where λ = √(2 log(p)/n);
  s^i_{0,n} at row i describes the number of sufficiently large non-diagonal elements
- Define S_{0,n} = Σ_{i=1}^p s^i_{0,n} as the essential sparsity of the graph (computed in the sketch below)
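A direct translation of this definition into Python (Theta0 is a placeholder for the true concentration matrix; the ceiling implements "smallest integer such that"):

```python
import numpy as np

def essential_sparsity(Theta0, n):
    """S_{0,n} = sum_i s^i_{0,n}, with s^i_{0,n} the smallest integer s such
    that sum_{j!=i} min(theta_ij^2, lam^2 theta_ii^2) <= s lam^2 theta_ii^2."""
    p = Theta0.shape[0]
    lam2 = 2.0 * np.log(p) / n  # lambda^2
    S = 0
    for i in range(p):
        off2 = np.delete(Theta0[i], i) ** 2  # squared off-diagonals of row i
        cap = lam2 * Theta0[i, i] ** 2
        S += int(np.ceil(np.minimum(off2, cap).sum() / cap))
    return S
```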
The estimation procedure
- Nodewise regressions for inferring the graph
- Maximum likelihood estimation based on the graph (a toy sketch follows below)
- Choosing the regularization parameters by cross-validation
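The second step, maximizing the Gaussian likelihood subject to the inferred zero pattern, has no closed form for a general graph. A toy sketch via generic numerical optimization (entirely my own construction, practical only for small p; it is not the paper's algorithm):

```python
import numpy as np
from scipy.optimize import minimize

def constrained_mle(S, edges, p):
    """Minimize tr(S Theta) - log|Theta| over symmetric positive-definite
    Theta whose off-diagonal support is restricted to `edges`."""
    free = [(i, i) for i in range(p)] + sorted(edges)

    def unpack(x):
        Theta = np.zeros((p, p))
        for val, (i, j) in zip(x, free):
            Theta[i, j] = Theta[j, i] = val
        return Theta

    def neg_loglik(x):
        eigs = np.linalg.eigvalsh(unpack(x))
        if eigs.min() <= 1e-10:
            return 1e12  # outside the positive-definite cone
        return np.trace(S @ unpack(x)) - np.log(eigs).sum()

    x0 = np.concatenate([np.ones(p), np.zeros(len(free) - p)])  # identity start
    res = minimize(neg_loglik, x0, method="Nelder-Mead",
                   options={"maxiter": 50000, "fatol": 1e-10})
    return unpack(res.x)
```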
Nodewise regression for inferring the graph
Using the Lasso in combination with thresholding to infer the graph.
- For each node, run a Lasso regression with the same λ_n:
  β^i_init = argmin_{β^i} Σ_{r=1}^n (X_i^{(r)} − Σ_{j≠i} β_j^i X_j^{(r)})² + λ_n Σ_{j≠i} |β_j^i|, ∀i
- Threshold β^i_init with τ to get the zero set: D_i = {j : j ≠ i, |β^i_{j,init}| < τ} (a sketch follows below)
- Assumption (A2): there exists v > 0 such that, for all i, Var(V_i) = 1/θ_{0,ii} ≥ v²
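A sketch of this initialize-then-threshold step (lam_n and tau are untuned placeholders; combining the nodewise decisions into an edge set follows the OR rule as above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def gelato_step1(X, lam_n, tau):
    """Nodewise Lasso with a common penalty lam_n, then thresholding at tau:
    D_i collects the coefficients judged to be zero for node i."""
    n, p = X.shape
    keep = np.zeros((p, p), dtype=bool)  # keep[i, j]: j survives for node i
    for i in range(p):
        rest = [j for j in range(p) if j != i]
        beta_init = Lasso(alpha=lam_n).fit(X[:, rest], X[:, i]).coef_
        for b, j in zip(beta_init, rest):
            keep[i, j] = abs(b) >= tau   # j in D_i  <=>  |beta_j| < tau
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if keep[i, j] or keep[j, i]}
```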
Theorem 1: selection
Assume that (A0) and (A2) hold, and that Σ_{0,ii} = 1 for all i. Under appropriately chosen λ_n and τ, with high probability we obtain an edge set Ê_n such that |Ê_n| ≤ 4 S_{0,n} and |Ê_n \ E0| ≤ 2 S_{0,n}.
Theorem 1: estimation
Assume that, in addition, (A1) holds. Then

‖Θ̂_n − Θ0‖_2 ≤ ‖Θ̂_n − Θ0‖_F = O_P(√(S_{0,n} log max(n, p)/n)),
‖Σ̂_n − Σ0‖_2 ≤ ‖Σ̂_n − Σ0‖_F = O_P(√(S_{0,n} log max(n, p)/n)),
R(Θ̂_n) − R(Θ0) = O_P(S_{0,n} log max(n, p)/n).
Remark 3: Σ_{0,ii} are unknown
Under (A0), (A1) and (A2):
- ‖Θ̂_n − Θ0‖_2 and ‖Σ̂_n − Σ0‖_2 achieve the same rates as in Theorem 1.
- For the Frobenius norm and the risk to converge, (A1) is to be replaced by: (A3) p ≤ n^c for some constant 0 < c < 1, and p + S_{0,n} = o(n/log max(n, p)) as n → ∞. In this case we have

‖Θ̂_n − Θ0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)),
‖Σ̂_n − Σ0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)),
R(Θ̂_n) − R(Θ0) = O_P((p + S_{0,n}) log max(n, p)/n).
Proposition 4: Bounding the bias due to sparse approximation
Assume that (A0), (A1) and (A2) hold. With the same choices of λ_n and τ as in Theorem 1,

‖Θ̃_{n,D}‖_F = ‖Θ̃_0 − Θ0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)).
Theorem 5: the case E0 ⊆ E
Assume (A1) and (A2) hold, and define S = |E0|. Assume Σ_{0,ii} = 1 for all i. Suppose that with high probability we obtain an edge set E such that E0 ⊆ E and |E \ E0| = O(S). Then:

‖Θ̂_n(E) − Θ0‖_F = O_P(√(S log max(n, p)/n)).

This is the case in which all non-zero β_j^i are sufficiently large.
Simulation study
- Methods: Gelato, GLasso and space
- Models: AR(1)-block model, random concentration matrix model and exponential decay model
- p = 300 and n = 40, 80, 320
- Measure ‖Σ̂_n − Σ0‖_F, ‖Θ̂_n − Θ0‖_F and the Kullback-Leibler divergence
- Select tuning parameters using cross-validation
Simulation 1: AR(1)-Block Model, (Σ_Block)_{ij} = 0.9^{|i−j|}
Simulation 1: AR(1)-Block Model 2
Simulation 2: Random Concentration Matrix Θ = B + δI
Simulation 3: Exponential Decay Model θ0,ij = exp (−2 |i − j|)
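A sketch of how such test matrices can be generated (the block size 30, the sparsity level and entry value of B, and the choice of δ are assumptions of this illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 300

# Simulation 1: block-diagonal Sigma with AR(1) blocks, 0.9^|i-j| in a block
idx = np.arange(30)
block = 0.9 ** np.abs(np.subtract.outer(idx, idx))
Sigma_ar1 = np.kron(np.eye(p // 30), block)

# Simulation 2: Theta = B + delta*I with B sparse symmetric; delta is picked
# large enough that Theta is guaranteed positive definite.
B = np.triu(np.where(rng.random((p, p)) < 0.01, 0.5, 0.0), k=1)
B = B + B.T
delta = np.abs(np.linalg.eigvalsh(B)).max() + 0.1
Theta_rand = B + delta * np.eye(p)

# Simulation 3: exponential decay, theta_{0,ij} = exp(-2|i-j|)
j = np.arange(p)
Theta_exp = np.exp(-2.0 * np.abs(np.subtract.outer(j, j)))
```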
Real Data Application
Conclusion
- Gelato separates the tasks of model selection and covariance estimation
- Thresholding plays a key role in obtaining a sparse approximation of the graph from a small number of samples
- With stronger conditions on the sample size, convergence rates in terms of operator and Frobenius norms and KL divergence are established
- The method is feasible in high dimensions