High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann

February 3, 2012

Problem definition

- Goal: estimate the covariance matrix of a Gaussian distribution, e.g. for gene expression data.
- Take a random sample of vectors X^(1), ..., X^(n) i.i.d. ∼ N_p(0, Σ_0), where p may depend on n (see the sketch below).
- Let Θ_0 = Σ_0^{-1} denote the concentration matrix.
- Sparsity: certain elements of Θ_0 are zero.
- Task: use the sample to estimate a set of zeros, and then an estimator of Θ_0 (and of Σ_0) given the pattern of zeros.
- Show consistency in predictive risk and in estimating Θ_0 and Σ_0 as n, p → ∞.
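The second bullet is the whole difficulty: when p > n, the usual sample covariance cannot simply be inverted. A minimal numpy sketch, with hypothetical sizes and an AR(1) covariance chosen purely for illustration:

```python
import numpy as np

# A minimal sketch (hypothetical sizes) of why the problem is hard when
# p > n: the sample covariance matrix has rank at most n, so it is
# singular and cannot be inverted to estimate Theta_0 = Sigma_0^{-1}.
rng = np.random.default_rng(0)
n, p = 40, 300                        # more variables than observations
Sigma0 = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

S_hat = X.T @ X / n                   # sample covariance (mean known to be 0)
print(np.linalg.matrix_rank(S_hat))   # at most n = 40, far below p = 300
```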

Gaussian Graphical Model

- X = (X_1, ..., X_p) ∼ N(µ, Σ_0).
- Given a graph G = (V, E_0), a pair (i, j) is NOT contained in E_0 (θ_{0,ij} = 0) iff X_i ⊥⊥ X_j | {X_k; k ∈ V \ {i, j}}.
- Define the predictive risk, for Θ ≻ 0, as R(Θ) = tr(ΘΣ_0) − log|Θ| ∝ −2E_0(log f_Θ(X)), where the Gaussian log-likelihood with Σ = Θ^{-1} is log f_Θ(X) = −(p/2) log 2π + (1/2) log|Θ| − (1/2) XᵀΘX (derivation below).
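The proportionality in the last bullet follows by taking expectations under Σ_0 and using E_0[XᵀΘX] = tr(Θ E_0[XXᵀ]) = tr(ΘΣ_0):

−2E_0(log f_Θ(X)) = p log 2π − log|Θ| + tr(ΘΣ_0) = p log 2π + R(Θ),

so minimizing the predictive risk over Θ ≻ 0 is the same as minimizing the expected negative log-likelihood.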


Previous work

- Yuan and Lin (2007), Banerjee, El Ghaoui and d'Aspremont (2008), d'Aspremont et al. (2008), Friedman et al. (2008): ℓ_1-penalized likelihood (see the sketch below)
  min_{Ω≻0} tr(Σ̂Ω) − log det(Ω) + λ Σ_{i≠j} |ω_{ij}|,
  where Σ̂ is the sample covariance matrix.
- Meinshausen and Bühlmann (2006): regression approach.
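As a reference point, a minimal sketch of this ℓ_1-penalized likelihood approach using scikit-learn's GraphicalLasso (toy sizes and alpha are illustrative choices, not settings from the paper):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# l1-penalized Gaussian MLE (graphical lasso):
#   min_{Omega > 0} tr(Sigma_hat @ Omega) - log det(Omega)
#                   + lambda * sum_{i != j} |omega_ij|
rng = np.random.default_rng(0)
p, n = 30, 200                               # toy sizes for illustration
Sigma0 = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

model = GraphicalLasso(alpha=0.1).fit(X)     # alpha plays the role of lambda
Omega_hat = model.precision_                 # sparse estimate of Theta_0
Sigma_hat = model.covariance_                # corresponding estimate of Sigma_0
```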

A Regression Model

We assume a multivariate Gaussian model X = (X_1, ..., X_p) ∼ N_p(0, Σ_0), where Σ_{0,ii} = 1.

Regression model:

- For all i = 1, ..., p: X_i = Σ_{j≠i} β_j^i X_j + V_i, where β_j^i = −θ_{0,ij}/θ_{0,ii} and V_i ∼ N(0, σ²_{V_i}) ⊥⊥ {X_j : j ≠ i} (verified numerically below).
- For all i, Var(V_i) = 1/θ_{0,ii}.
- X_i and X_j are conditionally dependent given {X_k; k ∈ V \ {i, j}} ⇔ θ_{0,ij} ≠ 0 ⇔ β_i^j ≠ 0 and β_j^i ≠ 0.
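The identities β_j^i = −θ_{0,ij}/θ_{0,ii} and Var(V_i) = 1/θ_{0,ii} can be checked numerically on a toy Σ_0 (a 4 × 4 example chosen for illustration):

```python
import numpy as np

# Population regression coefficients of X_i on the rest are
# Sigma_{-i,-i}^{-1} Sigma_{-i,i}; check they equal -theta_{0,ij}/theta_{0,ii}.
Sigma0 = 0.9 ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
Theta0 = np.linalg.inv(Sigma0)

i = 1
rest = [j for j in range(4) if j != i]
beta_reg = np.linalg.solve(Sigma0[np.ix_(rest, rest)], Sigma0[rest, i])
beta_theta = -Theta0[rest, i] / Theta0[i, i]
assert np.allclose(beta_reg, beta_theta)

# Residual variance of the regression equals 1/theta_{0,ii}
resid_var = Sigma0[i, i] - Sigma0[i, rest] @ beta_reg
assert np.isclose(resid_var, 1.0 / Theta0[i, i])
```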

Meinshausen and Bühlmann (2006)

- Perform p nodewise regressions using the Lasso to obtain β̂^1, ..., β̂^p, where β̂^i = {β̂_j^i; j ∈ {1, ..., p} \ {i}}.
- Then estimate the edge set by the OR rule (see the sketch below).
- Under sparsity and neighborhood stability conditions, they show P(Ê_n = E_0) → 1 as n → ∞.
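In matrix form the OR rule is one line; a sketch, where B is assumed to hold the fitted coefficients with B[j, i] = β̂_j^i:

```python
import numpy as np

def edges_or_rule(B: np.ndarray) -> np.ndarray:
    """OR rule: edge (i, j) is kept if X_j is selected in the regression
    for X_i OR X_i is selected in the regression for X_j."""
    nz = B != 0
    E = nz | nz.T                  # symmetrize by the OR rule
    np.fill_diagonal(E, False)     # no self-loops
    return E
```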

Essential Sparsity

- For each i, define s^i_{0,n} as the smallest integer such that
  Σ_{j=1, j≠i}^p min{θ²_{0,ij}, λ²θ_{0,ii}} ≤ s^i_{0,n} λ²θ_{0,ii}, where λ = √(2 log(p)/n);
  thus s^i_{0,n} counts the sufficiently large off-diagonal elements of row i.
- Define S_{0,n} = Σ_{i=1}^p s^i_{0,n} as the essential sparsity of the graph (computed in the sketch below).
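A small sketch computing S_{0,n} directly from the definition, assuming the true Θ_0 is available (as it is in simulations):

```python
import numpy as np

def essential_sparsity(Theta0: np.ndarray, n: int) -> int:
    """S_{0,n} = sum_i s_{0,n}^i, with s_{0,n}^i the smallest integer s such
    that sum_{j!=i} min(theta_{0,ij}^2, lam^2 theta_{0,ii}) <= s lam^2 theta_{0,ii}."""
    p = Theta0.shape[0]
    lam2 = 2.0 * np.log(p) / n                  # lambda^2 = 2 log(p) / n
    S = 0
    for i in range(p):
        off = np.delete(Theta0[i], i)           # off-diagonal entries of row i
        cap = lam2 * Theta0[i, i]
        S += int(np.ceil(np.minimum(off**2, cap).sum() / cap))
    return S
```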

The estimation procedure

- Nodewise regressions for inferring the graph.
- Maximum likelihood estimation based on the graph (a sketch follows this list).
- Choosing the regularization parameters by cross-validation.
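Step 2 is a convex problem: maximize the Gaussian log-likelihood over positive definite matrices whose zero pattern matches the inferred graph. A minimal sketch using cvxpy (an illustrative assumption, not the authors' implementation):

```python
import numpy as np
import cvxpy as cp

def mle_given_graph(S_hat: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Gaussian MLE restricted to the zero pattern of the graph:
    maximize log det(Theta) - tr(S_hat Theta)
    s.t. theta_ij = 0 for every non-edge (i, j), i != j."""
    p = S_hat.shape[0]
    Theta = cp.Variable((p, p), symmetric=True)
    objective = cp.Maximize(cp.log_det(Theta) - cp.trace(S_hat @ Theta))
    constraints = [Theta[i, j] == 0
                   for i in range(p) for j in range(i + 1, p)
                   if not E[i, j]]
    cp.Problem(objective, constraints).solve()
    return Theta.value
```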

Nodewise regression for inferring the graph

Use the Lasso in combination with thresholding to infer the graph (see the sketch below).

- For each node i, run a Lasso regression with the same λ_n:
  β̂^i_init = argmin_{β^i} Σ_{r=1}^n (X_i^(r) − Σ_{j≠i} β_j^i X_j^(r))² + λ_n Σ_{j≠i} |β_j^i|, ∀i.
- Threshold β̂^i_init at τ to get the zero set D_i = {j : j ≠ i, |β̂^i_{j,init}| < τ}.
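A minimal sketch of step 1 (illustrative, not the authors' code). scikit-learn's Lasso minimizes (1/(2n))‖y − Xβ‖² + α‖β‖_1, so α corresponds to a rescaled λ_n; the thresholded nodewise estimates are combined here with the OR rule:

```python
import numpy as np
from sklearn.linear_model import Lasso

def infer_graph(X: np.ndarray, alpha: float, tau: float) -> np.ndarray:
    """Nodewise Lasso regressions, then thresholding at tau."""
    n, p = X.shape
    B = np.zeros((p, p))                       # B[j, i] = beta_j^i
    for i in range(p):
        rest = np.delete(np.arange(p), i)
        fit = Lasso(alpha=alpha).fit(X[:, rest], X[:, i])
        B[rest, i] = fit.coef_
    keep = np.abs(B) >= tau                    # complement of the zero set D_i
    E = keep | keep.T                          # combine nodewise estimates (OR)
    np.fill_diagonal(E, False)
    return E
```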

Assumptions

- (A2) There exists v > 0 such that for all i, Var(V_i) = 1/θ_{0,ii} ≥ v².

Theorem 1: selection

Assume that (A0) and (A2) hold, and Σ_{0,ii} = 1 for all i. Under appropriately chosen λ_n and τ, we obtain, with high probability, an edge set Ê_n such that |Ê_n| ≤ 4S_{0,n} and |Ê_n \ E_0| ≤ 2S_{0,n}.

Theorem 1: estimation

Assume that, in addition, (A1) holds. Then

- ‖Θ̂_n − Θ_0‖_2 ≤ ‖Θ̂_n − Θ_0‖_F = O_P(√(S_{0,n} log max(n, p)/n)),
- ‖Σ̂_n − Σ_0‖_2 ≤ ‖Σ̂_n − Σ_0‖_F = O_P(√(S_{0,n} log max(n, p)/n)),
- R(Θ̂_n) − R(Θ_0) = O_P(S_{0,n} log max(n, p)/n).

Remark 3: the Σ_{0,ii} are unknown

Under (A0), (A1) and (A2):

- ‖Θ̂_n − Θ_0‖_2 and ‖Σ̂_n − Σ_0‖_2 achieve the same rates as in Theorem 1.
- For the Frobenius norm and the risk to converge, (A1) is to be replaced by
  (A3) p ≪ n^c for some constant 0 < c < 1, and p + S_{0,n} = o(n/log max(n, p)) as n → ∞.
  In this case we have
  ‖Θ̂_n − Θ_0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)),
  ‖Σ̂_n − Σ_0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)),
  R(Θ̂_n) − R(Θ_0) = O_P((p + S_{0,n}) log max(n, p)/n).

Proposition 4: bounding the bias due to sparse approximation

Assume that (A0), (A1) and (A2) hold. With the same choices of λ_n and τ as in Theorem 1,

‖Θ̃_{n,D}‖_F = ‖Θ̃_0 − Θ_0‖_F = O_P(√((p + S_{0,n}) log max(n, p)/n)).

Theorem 5: the case E_0 ⊆ E

Assume that (A1) and (A2) hold, and define S = |E_0|. Assume Σ_{0,ii} = 1 for all i. Suppose that, with high probability, we obtain an edge set E such that E_0 ⊆ E and |E \ E_0| = O(S). Then

‖Θ̂_n(E) − Θ_0‖_F = O_P(√(S log max(n, p)/n)).

This is the case in which all non-zero β_j^i are sufficiently large.

Simulation study

- Methods: Gelato, GLasso and space.
- Models: AR(1)-block model, random concentration matrix model and exponential decay model.
- Dimensions: p = 300 and n = 40, 80, 320.
- Measures: ‖Σ̂_n − Σ_0‖_F, ‖Θ̂_n − Θ_0‖_F and the Kullback-Leibler divergence (computed as sketched below).
- Tuning parameters selected by cross-validation.
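For zero-mean Gaussians the KL divergence has a closed form in Θ̂ and Σ_0; a small sketch of the evaluation criterion:

```python
import numpy as np

def kl_gaussian(Sigma0: np.ndarray, Theta_hat: np.ndarray) -> float:
    """KL( N_p(0, Sigma0) || N_p(0, Theta_hat^{-1}) )
    = 0.5 * ( tr(Theta_hat Sigma0) - log det(Theta_hat Sigma0) - p )."""
    p = Sigma0.shape[0]
    M = Theta_hat @ Sigma0
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - logdet - p)
```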

Simulation 1: AR(1)-block model, Σ_{Block,ij} = 0.9^{|i−j|}

Simulation 2: random concentration matrix, Θ = B + δI

Simulation 3: exponential decay model, θ_{0,ij} = exp(−2|i − j|)
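A sketch generating the three model classes (block size, the sparsity pattern of B, and δ are illustrative assumptions; the paper's exact settings may differ):

```python
import numpy as np

p = 300
rng = np.random.default_rng(0)

# Simulation 1: block-diagonal AR(1) covariance, Sigma_ij = 0.9^{|i-j|} per block
def ar1_block(p: int, block: int, rho: float = 0.9) -> np.ndarray:
    Sigma = np.zeros((p, p))
    for s in range(0, p, block):
        idx = np.arange(s, min(s + block, p))
        Sigma[np.ix_(idx, idx)] = rho ** np.abs(np.subtract.outer(idx, idx))
    return Sigma

Sigma_block = ar1_block(p, block=30)           # block size is an assumption

# Simulation 2: random concentration matrix Theta = B + delta*I, with B a
# sparse symmetric random matrix and delta making Theta positive definite
U = np.triu(rng.uniform(-1, 1, (p, p)) * (rng.random((p, p)) < 0.01), k=1)
B = U + U.T
delta = np.abs(np.linalg.eigvalsh(B).min()) + 0.5
Theta_rand = B + delta * np.eye(p)

# Simulation 3: exponential decay, theta_{0,ij} = exp(-2|i-j|)
dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Theta_exp = np.exp(-2.0 * dist)
```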

Real Data Application

Conclusion

- Gelato separates the tasks of model selection and covariance estimation.
- Thresholding plays a key role in obtaining a sparse approximation of the graph from a small number of samples.
- Under stronger conditions on the sample size, convergence rates in terms of the operator and Frobenius norms and the KL divergence are established.
- The method is feasible in high dimensions.