Supplemental Materials for "Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model"

Mengjie Chen, Department of Biostatistics, University of North Carolina-Chapel Hill,
Zhao Ren, Department of Statistics, University of Pittsburgh,
Hongyu Zhao, Department of Biostatistics, Yale University,
Harrison Zhou*, Department of Statistics, Yale University, New Haven, CT 06520, USA. Email: [email protected].

January 14, 2015
Abstract

In this supplemental material, we collect the proofs of Theorem 1 and Lemmas 1-3 used to prove the main theorems. Moreover, a simulation study on homogeneous models, together with some visualization plots, is given at the end.
A. PROOF OF LEMMA 1
The bound $\left\|\mathbf{D}^T E/n\right\|_\infty \le C_1\lambda$ is an immediate consequence of conditions 1 and 2. The function $L_\lambda(d,\theta)$ is jointly convex in $(d,\theta)$. For fixed $\theta>0$, denote the minimizer of $L_\lambda(d,\theta)$ over all $d\in\mathbb{R}^{p_0}$ by $\hat d(\theta\lambda)$, a function of $\theta\lambda$, i.e.,
$$
\hat d(\theta\lambda) = \arg\min_{d\in\mathbb{R}^{p_0}} L_\lambda(d,\theta) = \arg\min_{d\in\mathbb{R}^{p_0}}\left\{\frac{\|R-Wd\|^2}{2n} + \lambda\theta\|d\|_1\right\}, \quad (A.39)
$$
so that, if we knew $\hat\theta$ in the solution of Equation (27), the solution of that equation would be $\left\{\hat d(\hat\theta\lambda), \hat\theta\right\}$. We recognize that $\hat d(\hat\theta\lambda)$ is just the standard Lasso with penalty $\hat\theta\lambda$; however, we do not know the estimator $\hat\theta$. The strategy of our analysis is to first show that $\hat\theta$ is very close to its oracle estimator $\theta^{ora}$; the standard Lasso analysis then implies the desired results, Equations (31)-(32), under the assumption that $\hat\theta/\theta^{ora} = 1 + O(\lambda^2 s)$. For the standard Lasso analysis, some regularity condition on the design matrix $W^TW/n$ is usually assumed in the regression literature. In this paper we use the lower-RE condition, which is one of the most general such conditions.

Let $\mu = \lambda\theta$. From the Karush-Kuhn-Tucker conditions, $\hat d(\mu)$ is the solution to Equation (A.39) if and only if
$$
W_k^T\left(R - W\hat d(\mu)\right)/n = \mu\cdot\mathrm{sgn}\left(\hat d_k(\mu)\right), \quad \text{if } \hat d_k(\mu)\neq 0,
$$
$$
W_k^T\left(R - W\hat d(\mu)\right)/n \in [-\mu,\mu], \quad \text{if } \hat d_k(\mu)= 0. \quad (A.40)
$$
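To make the structure of (A.39)-(A.40) concrete, here is a minimal numerical sketch, not the implementation used in the paper: a plain coordinate-descent Lasso for the inner problem and an alternating update of the noise level $\theta$; function names and tuning choices are ours.

```python
import numpy as np

def lasso_cd(W, R, mu, n_sweeps=500):
    """Coordinate-descent Lasso for argmin_d ||R - W d||^2/(2n) + mu*||d||_1.
    Assumes each column of W has norm sqrt(n), so W_k' W_k / n = 1."""
    n, p = W.shape
    d = np.zeros(p)
    r = R.copy()                                   # current residual R - W d
    for _ in range(n_sweeps):
        for k in range(p):
            rho = W[:, k] @ r / n + d[k]           # partial correlation at k
            d_new = np.sign(rho) * max(abs(rho) - mu, 0.0)  # soft-threshold
            r -= W[:, k] * (d_new - d[k])
            d[k] = d_new
    return d

def scaled_lasso(W, R, lam, n_outer=20):
    """Alternating minimization of the jointly convex scaled-Lasso objective:
    a Lasso step at penalty lam*theta, then theta = ||R - W d|| / sqrt(n)."""
    n = len(R)
    theta = np.linalg.norm(R) / np.sqrt(n)         # crude initial noise level
    for _ in range(n_outer):
        d = lasso_cd(W, R, lam * theta)
        theta = np.linalg.norm(R - W @ d) / np.sqrt(n)
    d = lasso_cd(W, R, lam * theta)                # refit at the converged theta
    return d, theta
```

At the returned pair, the inner solution satisfies the KKT conditions (A.40) at the penalty $\mu = \lambda\hat\theta$, which can be verified directly on the gradient $W^T(R - W\hat d)/n$.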
Let $C_2(a_1,a_2)$ and $C_1(a_{11},a_{12},a_2)$ be constants depending on $(a_1,a_2)$ and $(a_{11},a_{12},a_2)$, respectively. The constant $C_2$ is bounded above if $a_1$ is bounded above and $a_2$ is bounded below by constants. The constant $C_1$ is bounded above whenever $a_{11}$ and $a_{12}$ are bounded above and $a_2$ is bounded below by constants. The explicit formulas of $C_1$ and $C_2$ are as follows,
$$
C_2(a_1,a_2) = \frac{a_1}{a_2} + \left[\left(\frac{a_1}{a_2}\right)^2 + \frac{2(a_1+1)}{a_2}\right]^{1/2}, \qquad
C_1(a_{11},a_{12},a_2) = a_{12}\left(1 + C_2(a_{11}a_{12}, a_2)\right). \quad (A.41)
$$

* The research of Harrison Zhou was supported in part by NSF Career Award DMS-0645676, NSF FRG Grant DMS-0854975 and NSF Grant DMS-1209191.
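A direct transcription of (A.41) makes the stated monotonicity properties easy to check numerically; the helper names below are ours, not from the paper.

```python
import math

def C2(a1, a2):
    """C2(a1, a2) = a1/a2 + [(a1/a2)^2 + 2*(a1 + 1)/a2]^(1/2), as in (A.41)."""
    return a1 / a2 + math.sqrt((a1 / a2) ** 2 + 2.0 * (a1 + 1.0) / a2)

def C1(a11, a12, a2):
    """C1(a11, a12, a2) = a12 * (1 + C2(a11 * a12, a2)), as in (A.41)."""
    return a12 * (1.0 + C2(a11 * a12, a2))
```

For instance, $C_2(1,1) = 1+\sqrt 5$, and $C_2$ increases in its first argument and decreases in its second, matching the boundedness claims above.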
The following propositions are helpful in establishing our result. Their proofs are given in Sections A.1 and A.2.

Proposition 1. The sparsity $s$ is defined in Equation (26). For any $\xi>1$, assuming $\nu\le\mu\frac{\xi-1}{\xi+1}$ and conditions 2, 3 in Lemma 1 hold, we have
$$
\left\|\hat d(\mu) - d^{true}\right\|_1 \le C_1\left(\frac{\mu}{\lambda},\, 4\xi A_1,\, \alpha_1\right)\lambda s, \quad (A.42)
$$
$$
\left\|\hat d(\mu) - d^{true}\right\|_2 \le C_2\left(4\xi A_1\frac{\mu}{\lambda},\, \alpha_1\right)\lambda\sqrt{s}, \quad (A.43)
$$
$$
\left\|W\left(d^{true} - \hat d(\mu)\right)\right\|^2/n \le (\nu+\mu)\left\|\hat d(\mu) - d^{true}\right\|_1. \quad (A.44)
$$

Proposition 2. Let $\left\{\hat d, \hat\theta\right\}$ be the solution of the scaled Lasso (27). For any $\xi>1$, assuming conditions 1-3 in Lemma 1 hold, we have
$$
\left|\frac{\hat\theta}{\theta^{ora}} - 1\right| \le \tau. \quad (A.45)
$$
Now we finish our proof with these two propositions. According to conditions 1-3 in Lemma 1, Proposition 2 implies $\nu\le\mu\frac{\xi-1}{\xi+1}$ with $\mu=\lambda\hat\theta$, and Proposition 1 further implies that there exist constants $c_1$, $c_2$ and $c_3$ such that
$$
\left\|\hat d(\lambda\hat\theta) - d^{true}\right\|_1 \le c_1\lambda s, \qquad
\left\|\hat d(\lambda\hat\theta) - d^{true}\right\|_2 \le c_2\lambda\sqrt{s}, \qquad
\left\|W\left(d^{true} - \hat d(\lambda\hat\theta)\right)\right\|^2/n \le c_3\lambda^2 s.
$$
Note that $\mu/\lambda = \hat\theta \in \left[\theta^{ora}(1-\tau),\, \theta^{ora}(1+\tau)\right] \subset \left[\frac{1}{2A_2}, \frac{3A_2}{2}\right]$. Thus the constants $c_1$, $c_2$ and $c_3$ depend only on $A_1$, $A_2$, $\alpha_1$ and $\xi$. Now we transfer the above results on the standardized scaled Lasso (27) back to the general scaled Lasso (25) through the bounded scaling constants $\left\{\|D_k\|/\sqrt{n}\right\}$ and immediately obtain the desired results (31)-(32).
A.1 Proof of Proposition 1
Define $T = \left\{k : |d_k^{true}| \ge \lambda\right\}$, the index set of the large coefficients of $d^{true}$. Notice that
$$
\left\|W\left(d^{true}-\hat d(\mu)\right)\right\|^2/n
= \left(d^{true}-\hat d(\mu)\right)^T\left[W^T\left(R-W\hat d(\mu)\right) - W^T\left(R-Wd^{true}\right)\right]/n
$$
$$
\le \mu\left\|d^{true}-\hat d(\mu)\right\|_1 + \nu\left\|d^{true}-\hat d(\mu)\right\|_1 \quad (A.46)
$$
$$
\le (\mu+\nu)\left\|\left(d^{true}-\hat d(\mu)\right)_T\right\|_1 + 2\mu\left\|d^{true}_{T^c}\right\|_1 - (\mu-\nu)\left\|\left(d^{true}-\hat d(\mu)\right)_{T^c}\right\|_1, \quad (A.47)
$$
where the first inequality follows from the KKT conditions (A.40). Now define $\Delta = \hat d(\mu) - d^{true}$. Equation (A.46) also implies the desired inequality (A.44),
$$
\Delta^T\,\frac{W^TW}{n}\,\Delta \le (\mu+\nu)\|\Delta\|_1. \quad (A.48)
$$
We will first show that
$$
\|\Delta_{T^c}\|_1 \le \max\left\{2(1+\xi)\left\|d^{true}_{T^c}\right\|_1,\; 2\xi\|\Delta_T\|_1\right\}; \quad (A.49)
$$
we are then able to apply the lower-RE condition (28) to derive the desired results.

To show Equation (A.49), we note that our assumption $\nu\le\mu\frac{\xi-1}{\xi+1}$ together with Equation (A.47) implies
$$
\frac{1}{2\mu}\cdot\frac{\|W\Delta\|^2}{n} \le \frac{\xi}{\xi+1}\|\Delta_T\|_1 + \left\|d^{true}_{T^c}\right\|_1 - \frac{\|\Delta_{T^c}\|_1}{\xi+1}. \quad (A.50)
$$
Suppose that
$$
\|\Delta_{T^c}\|_1 \ge 2(1+\xi)\left\|d^{true}_{T^c}\right\|_1; \quad (A.51)
$$
then the inequality (A.50) becomes
$$
0 \le \frac{\|W\Delta\|^2}{n} \le \frac{\mu}{\xi+1}\left(2\xi\|\Delta_T\|_1 - \|\Delta_{T^c}\|_1\right),
$$
which implies
$$
\|\Delta_{T^c}\|_1 \le 2\xi\|\Delta_T\|_1. \quad (A.52)
$$
Therefore the complement of inequality (A.51) and Equation (A.52) together finish our proof of Equation (A.49).
Before proceeding, we point out two facts which will be used several times below. Note that the sparsity $s$ is defined in terms of the true coefficients $b^{true}$ in Equation (26), before standardization, while the index set $T$ is defined in terms of $d^{true}$, after standardization. Condition 2 implies that the standardization step changes the sparsity by at most a factor $A_1$. Hence it is not hard to see that $|T|\le A_1 s$ and $\left\|(d^{true})_{T^c}\right\|_1 \le A_1\lambda s$, where $|T|$ denotes the cardinality of $T$.

Now we are able to apply the lower-RE condition of $W^TW/n$ to Equation (A.48) and obtain
$$
\Delta^T\frac{W^TW}{n}\Delta \ge \alpha_1\|\Delta\|^2 - \zeta(n,p_0)\|\Delta\|_1^2
\ge \alpha_1\|\Delta\|^2 - \zeta(n,p_0)\,8(1+\xi)^2\left(\left\|d^{true}_{T^c}\right\|_1^2 + \|\Delta_T\|_1^2\right)
$$
$$
\ge \left(\alpha_1 - |T|\,\zeta(n,p_0)\,8(1+\xi)^2\right)\|\Delta\|^2 - \zeta(n,p_0)\,8(1+\xi)^2 A_1^2\lambda^2 s^2
\ge \frac{\alpha_1}{2}\|\Delta\|^2 - \lambda^2 s,
$$
where in the second, third and last inequalities we applied the facts (A.49), $\|\Delta_T\|_1^2\le |T|\,\|\Delta_T\|^2$, $|T|\le A_1 s$, $\left\|(d^{true})_{T^c}\right\|_1\le A_1\lambda s$ and $s\,\zeta(n,p_0)\,8(1+\xi)^2 A_1^2\le\min\left\{\frac{\alpha_1}{2},1\right\}$. Moreover, by applying those facts again, we have
$$
(\mu+\nu)\|\Delta\|_1 \le 4\xi\mu\left(\left\|d^{true}_{T^c}\right\|_1 + \|\Delta_T\|_1\right)
\le 4\xi\mu\left(A_1\lambda s + \sqrt{|T|}\,\|\Delta\|\right)
\le 4\xi A_1\frac{\mu}{\lambda}\left(\lambda^2 s + \sqrt{s}\,\lambda\|\Delta\|\right).
$$
The above two inequalities together with Equation (A.48) imply that
$$
4\xi A_1\frac{\mu}{\lambda}\left(\lambda^2 s + \sqrt{s}\,\lambda\|\Delta\|\right) \ge \frac{\alpha_1}{2}\|\Delta\|^2 - \lambda^2 s.
$$
Define $S_u = 4\xi A_1\mu/\lambda$. Some algebra with this quadratic inequality implies the bound on $\Delta$ under the $\ell_2$ norm,
$$
\|\Delta\| \le \left\{\frac{S_u}{\alpha_1} + \left[\left(\frac{S_u}{\alpha_1}\right)^2 + \frac{2(S_u+1)}{\alpha_1}\right]^{1/2}\right\}\lambda\sqrt{s}
\equiv C_2\left(4\xi A_1\frac{\mu}{\lambda},\, \alpha_1\right)\lambda\sqrt{s}.
$$
Combining this fact with Equation (A.49), we finally obtain the bound under the $\ell_1$ norm (A.42):
$$
\|\Delta\|_1 \le \|\Delta_T\|_1 + \|\Delta_{T^c}\|_1
\le 2(1+\xi)\left(\left\|d^{true}_{T^c}\right\|_1 + \|\Delta_T\|_1\right)
\le 2(1+\xi)\left(A_1 s\lambda + \sqrt{|T|}\,\|\Delta\|\right)
$$
$$
\le 4\xi A_1\left(1 + C_2\left(4\xi A_1\frac{\mu}{\lambda},\, \alpha_1\right)\right) s\lambda
\equiv C_1\left(\frac{\mu}{\lambda},\, 4\xi A_1,\, \alpha_1\right) s\lambda.
$$

A.2 Proof of Proposition 2
For $\tau$ defined in Equation (30), we need to show that $\hat\theta\ge\theta^{ora}(1-\tau)$ and $\hat\theta\le\theta^{ora}(1+\tau)$ on the event $\left\{\nu\le\theta^{ora}\lambda\frac{\xi-1}{\xi+1}(1-\tau)\right\}$. Let $\hat d(\theta\lambda)$ be the solution of (A.39) as a function of $\theta$; then
$$
\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right) = \frac{1}{2} - \frac{\left\|R-W\hat d(\theta\lambda)\right\|^2}{2n\theta^2}, \quad (A.53)
$$
which follows from the facts that $\frac{\partial}{\partial d_k}L_\lambda(d,\theta)\big|_{d=\hat d(\theta\lambda)} = 0$ for all $\hat d_k(\theta\lambda)\neq 0$, and $\frac{\partial}{\partial\theta}\hat d_k(\theta\lambda) = 0$ for all $\hat d_k(\theta\lambda) = 0$, since the set $\left\{k: \hat d_k(\theta\lambda) = 0\right\}$ is unchanged in a neighborhood of $\theta$ for almost all $\theta$. Equation (A.53) plays a key role in the proof.

(1). To show that $\hat\theta\ge\theta^{ora}(1-\tau)$, it is enough to show
$$
\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right)\Big|_{\theta=t_1} < 0,
$$
where $t_1=\theta^{ora}(1-\tau)$, due to the strict convexity of the objective function $L_\lambda(d,\theta)$ in $\theta$. Equation (A.53) implies that
$$
2t_1^2\,\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right)\Big|_{\theta=t_1}
= t_1^2 - \frac{\left\|R-W\hat d(t_1\lambda)\right\|^2}{n}
$$
$$
\le t_1^2 - \frac{\left\|R-Wd^{true}\right\|^2}{n} + 2\left\|d^{true}-\hat d(t_1\lambda)\right\|_1\left\|\frac{W^T\left(R-Wd^{true}\right)}{n}\right\|_\infty
$$
$$
\le 2t_1\left(t_1-\theta^{ora}\right) + 2\nu\left\|d^{true}-\hat d(t_1\lambda)\right\|_1. \quad (A.54)
$$
On the event $\left\{\nu\le t_1\lambda\frac{\xi-1}{\xi+1}\right\} = \left\{\nu/\theta^{ora}\le\lambda\frac{\xi-1}{\xi+1}(1-\tau)\right\}$ we have
$$
2t_1^2\,\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right)\Big|_{\theta=t_1}
\le 2t_1(t_1-\theta^{ora}) + 2t_1\lambda\frac{\xi-1}{\xi+1}\left\|d^{true}-\hat d(t_1\lambda)\right\|_1
\le 2t_1\left(-\tau\theta^{ora} + \lambda\frac{\xi-1}{\xi+1}\left\|d^{true}-\hat d(t_1\lambda)\right\|_1\right) < 0.
$$
The last inequality follows from the definition of $\tau$ and the $\ell_1$ error bound (A.42) of Proposition 1. Note that for $\lambda^2 s$ sufficiently small, we have $\tau<1/2$. In fact, although $\left\|d^{true}-\hat d(t_1\lambda)\right\|_1$ also depends on $\tau$, our choice of $\tau$ is well defined and is larger than $\frac{\lambda}{\theta^{ora}}\cdot\frac{3(\xi-1)}{2(\xi+1)}\left\|d^{true}-\hat d(t_1\lambda)\right\|_1$.
(2). Let $t_2=\theta^{ora}(1+\tau)$. To show the other side, $\hat\theta\le\theta^{ora}(1+\tau)$, it is enough to show
$$
\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right)\Big|_{\theta=t_2} > 0.
$$
Equation (A.53) implies that on the event $\left\{\nu\le t_2\lambda\frac{\xi-1}{\xi+1}\right\} = \left\{\nu/\theta^{ora}\le\lambda\frac{\xi-1}{\xi+1}(1+\tau)\right\}$ we have
$$
2t_2^2\,\frac{\partial}{\partial\theta}L_\lambda\left(\hat d(\theta\lambda),\theta\right)\Big|_{\theta=t_2}
= t_2^2 - \frac{\left\|R-W\hat d(t_2\lambda)\right\|^2}{n}
= t_2^2 - (\theta^{ora})^2 + \frac{\left\|R-Wd^{true}\right\|^2 - \left\|R-W\hat d(t_2\lambda)\right\|^2}{n}
$$
$$
= t_2^2 - (\theta^{ora})^2 + \left(\hat d(t_2\lambda)-d^{true}\right)^T\frac{W^T\left(R-Wd^{true}\right) + W^T\left(R-W\hat d(t_2\lambda)\right)}{n}
$$
$$
\ge t_2^2 - (\theta^{ora})^2 - \left\|\hat d(t_2\lambda)-d^{true}\right\|_1(\nu + t_2\lambda)
\ge \left(t_2+\theta^{ora}\right)\theta^{ora}\tau - \frac{2\xi}{\xi+1}\,t_2\lambda\left\|\hat d(t_2\lambda)-d^{true}\right\|_1
$$
$$
\ge 2\theta^{ora}\left(\tau\theta^{ora} - \frac{3\xi}{2(\xi+1)}\,\lambda\left\|\hat d(t_2\lambda)-d^{true}\right\|_1\right) > 0,
$$
where the second-to-last inequality is due to the fact $\tau\le 1/2$ and the last inequality follows from the definition of $\tau$ and the $\ell_1$ error bound (A.42) of Proposition 1. Again, our choice of $\tau$ is well defined and is larger than $\frac{\lambda}{\theta^{ora}}\cdot\frac{2\xi}{\xi+1}\left\|d^{true}-\hat d(t_2\lambda)\right\|_1$.
B. PROOF OF THEOREM 1

To emphasize the essence of the theorem and at the same time simplify its proof, we only prove the results for $\lambda_1 = (1+\varepsilon_1)\sqrt{2\delta_1(\log q)/n}$. The proof for $\lambda_1 = (1+\varepsilon_1)\sqrt{\delta_1}\,L_n(k_q/q)$ relies on a result similar to Lemma 1 but is much more involved; in particular, only a weaker version of condition 1 in Lemma 1 is required. The interested reader is referred to Sun & Zhang (2013) for further details. With the help of the deterministic result, Lemma 1, we are able to show the following proposition.

Proposition 3. There exist constants $C_k'$, $1\le k\le 3$, such that for each $1\le j\le p$,
$$
\|\gamma_j - \hat\gamma_j\|_1 \le C_1'\lambda_1 s_1, \quad (A.55)
$$
$$
\|\gamma_j - \hat\gamma_j\| \le C_2'\lambda_1\sqrt{s_1}, \quad (A.56)
$$
$$
\|X(\gamma_j - \hat\gamma_j)\|^2/n \le C_3'\lambda_1^2 s_1, \quad (A.57)
$$
with probability $1 - o\left(q^{-\delta_1+1}\right)$.
Given the results of Proposition 3, it is straightforward to finish our proof. Note that $\left\|\hat Z_j - Z_j\right\|^2 = \left\|X(\gamma_j-\hat\gamma_j)\right\|^2$, and hence Equation (14) immediately follows from result (A.57). Equations (15) and (16) are obtained from the union bound and Equations (A.55) and (A.56) because of the relationships
$$
\left\|\hat\Gamma - \Gamma\right\|_{\ell_\infty} = \max_j \|\gamma_j - \hat\gamma_j\|_1
\quad\text{and}\quad
\left\|\hat\Gamma - \Gamma\right\|_F^2 = \sum_{j=1}^p \|\gamma_j - \hat\gamma_j\|^2.
$$
It is then enough to prove Proposition 3 to complete the proof of Theorem 1. The proof is similar to that in Ren, Sun, Zhang & Zhou (2013), although a more delicate result is required to show the lower-RE condition of the Gram matrix under condition 3'.

B.1 Proof of Proposition 3

The Proposition is an immediate consequence of Lemma 1 applied to $Y_j = X\gamma_j + Z_j$ with tuning parameter $\lambda_1$ and sparsity $s_1$. To show that the union of Equations (A.55)-(A.57) holds with high probability $1-o\left(q^{-\delta_1+1}\right)$, we only need to check that the following conditions of Lemma 1 hold
with probability at least $1-o\left(q^{-\delta_1+1}\right)$:
$$
I_1 = \left\{\nu \le \theta^{ora}\lambda_1\frac{\xi-1}{\xi+1}(1-\tau) \text{ for some } \xi>1\right\},
$$
$$
I_2 = \left\{\frac{\|X_k\|}{\sqrt n} \in [1/A_1, A_1] \text{ for all } k\right\},
$$
$$
I_3 = \left\{\theta^{ora} \in [1/A_2, A_2]\right\}, \quad\text{where } \theta^{ora} = \frac{\|Z_j\|}{\sqrt n},
$$
$$
I_4 = \left\{\frac{W^TW}{n} \text{ satisfies lower-RE with } (\alpha_1, \zeta(n,q)) \text{ s.t. } s_1\,\zeta(n,q)\,8(1+\xi)^2A_1^2 \le \min\left\{\frac{\alpha_1}{2}, 1\right\}\right\},
$$
where we set $\xi = 3/\varepsilon_1+1$ for the current setting, $A_1 = C_{A_1}\max\{B,\sqrt{M_1}\}$ under Condition 3 and $A_1 = C_{A_1}\sqrt{M_1}$ under Condition 3' for some universal constant $C_{A_1}>0$, $A_2 = \sqrt{2M_2}$, $\alpha_1 = \frac{1}{2M_1}$ and $\zeta(n,q) = o(1/s_1)$. Let us still define $W = X\cdot\mathrm{diag}\left(\frac{\sqrt n}{\|X_k\|}\right)$ as the standardized $X$ in Lemma 1 of Section 6.1. We will show that
$$
P\{I_1^c\} = O\left(q^{-\delta_1+1}/\sqrt{\log q}\right), \qquad P\{I_i^c\} = o\left(q^{-\delta_1}\right) \text{ for } i=2,3,4,
$$
which implies $P\{E_j\} = 1-o\left(q^{-\delta_1+1}\right)$, where $E_j$ is the intersection of $I_1$ to $I_4$. We will first consider $P\{I_2^c\}$ and $P\{I_3^c\}$, then $P\{I_4^c\}$, and leave $P\{I_1^c\}$ to the last, since it relies on the bounds for $P\{I_i^c\}$, $2\le i\le 4$.

(1). To study $P\{I_2^c\}$ and $P\{I_3^c\}$, we need the following Bernstein-type inequality (see, e.g., Vershynin (2010), Section 5.2.4 for the inequality and the definitions of the norms $\|\cdot\|_{\varphi_2}$ and $\|\cdot\|_{\varphi_1}$) to control the tail of a sum of i.i.d. sub-exponential variables:
$$
P\left\{\left|\sum_{i=1}^n U_i\right| \ge tn\right\} \le 2\exp\left(-cn\min\left\{\frac{t^2}{K^2}, \frac{t}{K}\right\}\right), \quad (A.58)
$$
for $t>0$, where the $U_i$ are i.i.d. centered sub-exponential variables with parameter $\|U_i\|_{\varphi_1}\le K$ and $c$ is some universal constant. Notice that $\theta^{ora} = \|Z_j\|/\sqrt n$ with $Z_j^{(1)}\sim N(0,\sigma_{jj})$, and $X_k^{(1)}$ is
sub-Gaussian with parameter $\left\|X_k^{(1)}\right\|_{\varphi_2}\in\left[c_{\varphi_2}M_1^{-1/2},\, C_{\varphi_2}B\right]$ by Conditions 2 and 3, for some universal constants $c_{\varphi_2}$ and $C_{\varphi_2}$ (in all formulas involving the $\varphi_1$ or $\varphi_2$ parameter of $X_k^{(1)}$, replace "$B$" by "$\sqrt{M_1}$" under Condition 3' hereafter). The fact that a sub-exponential variable is a sub-Gaussian variable squared implies that there exists some universal constant $C_1>0$ such that
$$
n(\theta^{ora})^2 - n\sigma_{jj} \text{ is a sum of i.i.d. centered sub-exponential variables with } \varphi_1 \text{ parameter } C_1\sigma_{jj},
$$
$$
\|X_k\|^2 - nEx_k^2 \text{ is a sum of i.i.d. centered sub-exponential variables with } \varphi_1 \text{ parameter } \left\|\left(X_k^{(1)}\right)^2\right\|_{\varphi_1}.
$$
Note that $\sigma_{jj}\in[1/M_2, M_2]$ by Condition 2 and $\left\|\left(X_k^{(1)}\right)^2\right\|_{\varphi_1}\in\left[c'_{\varphi_1}M_1^{-1},\, C'_{\varphi_1}B^2\right]$ with some universal constants $c'_{\varphi_1}$ and $C'_{\varphi_1}$. Let $A_1 = C_{A_1}\max\{B,\sqrt{M_1}\}$ and $A_2 = \sqrt{2M_2}$ for some large constant $C_{A_1}$. Equation (A.58) with a sufficiently small constant $c'$ implies
$$
P\{I_3^c\} = P\left\{(\theta^{ora})^2 \notin \left[1/A_2^2, A_2^2\right]\right\} \le P\left\{\left|n(\theta^{ora})^2 - n\sigma_{jj}\right| \ge \frac{n\sigma_{jj}}{2}\right\} \le 2\exp(-C'n) = o\left(q^{-\delta_1}\right), \quad (A.59)
$$
and
$$
P\{I_2^c\} = P\left\{\frac{\|X_k\|^2}{n} \notin \left[1/A_1^2, A_1^2\right] \text{ for some } k\right\}
\le q\,P\left\{\left|\|X_k\|^2 - nEx_k^2\right| > c'n\left\|\left(X_k^{(1)}\right)^2\right\|_{\varphi_1}\right\}
\le 2q\exp(-C'n) = o\left(q^{-\delta_1}\right). \quad (A.60)
$$
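As a quick numerical illustration of how (A.58) operates on $\|X_k\|^2/n$, the following Monte Carlo sketch compares the empirical tail of a normalized chi-squared sum with a Bernstein-type bound. The constant $c = 1/8$ below is a conservative choice that is valid for squares of standard normals; it is our illustrative choice, not a constant from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
# ||X_k||^2 - n*E x^2 is a centered sum of n i.i.d. sub-exponential
# (chi-squared) variables; (A.58) bounds the probability that the
# normalized sum leaves [-t, t] by 2*exp(-c*n*min(t^2, t))
x = rng.standard_normal((reps, n))
dev = np.abs((x ** 2).sum(axis=1) / n - 1.0)   # | ||X_k||^2/n - E x^2 |
for t in (0.1, 0.2, 0.4):
    emp = (dev > t).mean()
    bern = 2 * np.exp(-n * min(t * t, t) / 8)  # c = 1/8 (conservative)
    print(f"t={t}: empirical tail {emp:.4f}, Bernstein-type bound {bern:.4f}")
```

The bound dominates the empirical tail at every threshold, and both decay exponentially in $n t^2$ for small $t$, which is exactly the regime used in (A.59)-(A.60).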
(2). To study the lower-RE condition of $\frac{W^TW}{n}$, we essentially need to study $\frac{X^TX}{n}$. On the event $I_2$, it is easy to see that if $\frac{X^TX}{n}$ satisfies the lower-RE condition with $(\alpha_x, \zeta_x(n,q))$, then $\frac{W^TW}{n}$ satisfies the lower-RE condition with $(\alpha_1,\zeta) = \left(\alpha_x A_1^{-2},\, \zeta_x A_1^2\right)$, since $W = X\,\mathrm{diag}\left(\frac{\sqrt n}{\|X_k\|}\right)$ and $\frac{\|X_k\|}{\sqrt n}\in[1/A_1, A_1]$ on event $I_2$. Moreover, to study the lower-RE condition, the following lemma implies that we only need to consider the behavior of $\frac{X^TX}{n}$ on sparse vectors.

Lemma 4. For any symmetric matrix $\Delta_{q\times q}$, suppose $\left|v^T\Delta v\right| \le \delta$ for any unit $2s$-sparse vector $v\in\mathbb{R}^q$, i.e. $\|v\|=1$ and $v\in B_0(2s) = \left\{a: \sum_{i=1}^q 1\{a_i\neq 0\}\le 2s\right\}$. Then we have
$$
\left|v^T\Delta v\right| \le 27\delta\left(\|v\|^2 + \frac{1}{s}\|v\|_1^2\right) \quad \text{for any } v\in\mathbb{R}^q.
$$
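In a tiny dimension, Lemma 4 can be checked by brute force, because the supremum over unit $2s$-sparse vectors reduces to a maximum of spectral norms of principal submatrices. The dimensions and the random matrix below are our own illustrative choices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
q, s = 8, 2
Delta = rng.standard_normal((q, q))
Delta = (Delta + Delta.T) / 2                  # a symmetric test matrix

# delta = max over 2s-sparse unit v of |v' Delta v|
#       = max over supports S with |S| <= 2s of the spectral norm of Delta[S, S]
delta = max(np.abs(np.linalg.eigvalsh(Delta[np.ix_(S, S)])).max()
            for r in range(1, 2 * s + 1)
            for S in combinations(range(q), r))

# Lemma 4: |v' Delta v| <= 27*delta*(||v||^2 + ||v||_1^2 / s) for every v
for _ in range(1000):
    v = rng.standard_normal(q)
    lhs = abs(v @ Delta @ v)
    rhs = 27 * delta * (v @ v + np.linalg.norm(v, 1) ** 2 / s)
    assert lhs <= rhs
```

This is exactly how the lemma is used below: control on sparse vectors (here computed exactly) transfers to all of $\mathbb{R}^q$ at the cost of the $\|v\|_1^2/s$ term.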
See Supplementary Lemma 12 in Loh & Wainwright (2012) for the proof. With a slight abuse of notation, we define $\Sigma_x = \mathrm{Cov}\left(X^{(1)}\right)$. By applying Lemma 4 to $\frac{X^TX}{n}-\Sigma_x$ and Condition 2, $v^T\Sigma_x v \ge \frac{1}{M_1}\|v\|^2$, we know that $\frac{X^TX}{n}$ satisfies the lower-RE condition $(\alpha_x, \zeta_x(n,q))$ with
$$
\alpha_x \ge \frac{1}{M_1}\left(1 - \frac{27}{L}\right) \quad\text{and}\quad \zeta_x(n,q) \le \frac{27}{s_1 M_1 L}, \quad (A.61)
$$
provided
$$
\left|v^T\left(\frac{X^TX}{n}-\Sigma_x\right)v\right| \le \frac{1}{LM_1} \quad \text{for all } v\in B_0(2s_1) \text{ with } \|v\|=1, \quad (A.62)
$$
which says that the population covariance matrix $\Sigma_x$ and its sample version $\frac{X^TX}{n}$ behave similarly on all $2s_1$-sparse vectors. Now we show Equation (A.62) holds under Conditions 3 and 3' respectively, for a sufficiently large constant $L$ such that the inequality in the event $I_4$ holds.

Under Condition 3', $X^{(1)}$ is jointly sub-Gaussian with parameter $(M_1)^{1/2}$. A routine one-step chaining (or $\delta$-net) argument implies that there exists some constant $c_{sg}>0$ such that
$$
P\left\{\sup_{v\in B_0(2s_1)}\left|v^T\left(\frac{X^TX}{n}-\Sigma_x\right)v\right| > t\|v\|^2\right\}
\le 2\exp\left(-c_{sg}\,n\min\left\{\frac{t^2}{M_1^2},\frac{t}{M_1}\right\} + 2s_1\log q\right). \quad (A.63)
$$
See, e.g., Supplementary Lemma 15 in Loh & Wainwright (2012) for the proof. Hence by picking a small $t = \frac{1}{LM_1}$ with any fixed but arbitrarily large $L$, the sparsity condition $s_1 = o\left(\frac{n}{\log q}\right)$ and Equation (A.63) imply that Equation (A.62) holds with probability $1-o\left(q^{-\delta_1}\right)$.

Under Condition 3, if $s_1 = o\left(\sqrt{\frac{n}{\log q}}\right)$, Hoeffding's inequality and a union bound imply that
$$
P\left\{\left\|\frac{X^TX}{n}-\Sigma_x\right\|_\infty > 2B^2\sqrt{\frac{(1+\delta_1)\log q}{n}}\right\} \le 2q^{-2\delta_1},
$$
where the norm $\|\cdot\|_\infty$ denotes the entry-wise sup norm and $\left\|X^{(1)}\right\|_\infty\le B$ (see, e.g., Massart & Picard (2007), Proposition 2.7). Thus with probability $1-o\left(q^{-\delta_1}\right)$, we have for any $v\in B_0(2s_1)$ with $\|v\|=1$,
$$
\left|v^T\left(\frac{X^TX}{n}-\Sigma_x\right)v\right| \le \left\|\frac{X^TX}{n}-\Sigma_x\right\|_\infty\|v\|_1^2 \le 2B^2\sqrt{\frac{(1+\delta_1)\log q}{n}}\cdot 2s_1 = o(1),
$$
where the last inequality follows from $\|v\|_1\le\sqrt{2s_1}\,\|v\|$ for any $v\in B_0(2s_1)$. Therefore Equation (A.62) holds with probability $1-o\left(q^{-\delta_1}\right)$ for any arbitrarily large $L$. If $s_1 = o\left(\frac{n}{\log^3 n\,\log q}\right)$, a more involved argument using Dudley's entropy integral and Talagrand's concentration theorem for empirical processes (see Rudelson & Zhou (2013), Theorem 23 and its proof) implies that Equation (A.62) holds with probability $1-o\left(q^{-\delta_1}\right)$ for any fixed but arbitrarily large $L$.

Therefore we have shown that under Condition 3 or 3', Equation (A.62) holds with probability $1-o\left(q^{-\delta_1}\right)$ for any arbitrarily large $L$. Consequently, Equation (A.61) with $(\alpha_1,\zeta) = \left(\alpha_x A_1^{-2},\, \zeta_x A_1^2\right)$ on event $I_2$ implies that we can pick $\alpha_1 = \frac{1}{2M_1}$ and sufficiently small $\zeta$ such that event $I_4$ holds with probability $1-o\left(q^{-\delta_1}\right)$.
(3). Finally we study the probability of event $I_1$. The following tail probability of the $t$ distribution is helpful in the analysis.

Lemma 5. Let $T_n$ follow a $t$ distribution with $n$ degrees of freedom. Then there exists $\epsilon_n\to 0$ as $n\to\infty$ such that for all $t>0$,
$$
P\left\{T_n^2 > n\left(e^{2t^2/(n-1)}-1\right)\right\} \le (1+\epsilon_n)\,\frac{e^{-t^2}}{\pi^{1/2}t}.
$$
Please refer to Sun & Zhang (2012), Lemma 1, for the proof. Recall that
$$
I_1 = \left\{\frac{\nu}{\theta^{ora}} \le \lambda_1\frac{\xi-1}{\xi+1}(1-\tau)\right\},
$$
where $\tau$ defined in Equation (30) satisfies $\tau = O\left(s_1\lambda_1^2\right) = o(1)$ on $\cap_{i=2}^4 I_i$. From the definition of $\nu$ in Equation (29) we have
$$
\frac{\nu}{\theta^{ora}} = \max_k |h_k|, \quad\text{with } h_k = \frac{W_k^T Z_j}{n\,\theta^{ora}},
$$
where each column of $W$ has norm $\|W_k\| = \sqrt n$. Given $X$, equivalently $W$, it is not hard to check that $\frac{\sqrt{n-1}\,h_k}{\sqrt{1-h_k^2}} \sim t_{(n-1)}$ by the normality of $Z_j$, where $t_{(n-1)}$ is the $t$ distribution with $(n-1)$ degrees of freedom. From Lemma 5 we have
$$
P\left\{|h_k| > \sqrt{\frac{2t^2}{n}}\right\}
= P\left\{\frac{(n-1)h_k^2}{1-h_k^2} > \frac{2(n-1)t^2/n}{1-2t^2/n}\right\}
\le P\left\{\frac{(n-1)h_k^2}{1-h_k^2} > \frac{2(n-1)t^2/(n-2)}{1-t^2/(n-2)}\right\}
$$
$$
\le P\left\{\frac{(n-1)h_k^2}{1-h_k^2} > (n-1)\left(e^{2t^2/(n-2)}-1\right)\right\}
\le (1+\epsilon_{n-1})\,\frac{e^{-t^2}}{\pi^{1/2}t},
$$
where the first inequality holds when $t^2\ge 2$, and the second inequality follows from the fact $e^x-1\le \frac{x}{1-x/2}$ for $0<x<2$. Now let $t^2 = \delta_1\log q > 2$ and $\lambda_1 = (1+\varepsilon_1)\sqrt{\frac{2\delta_1\log q}{n}}$ with $\xi = 3/\varepsilon_1+1$; then we have $\lambda_1\frac{\xi-1}{\xi+1}(1-\tau) > \sqrt{\frac{2\delta_1\log q}{n}}$ and
$$
P\left\{\cap_{i=1}^4 I_i\right\} \ge P\left\{\frac{\nu}{\theta^{ora}} \le \sqrt{\frac{2\delta_1\log q}{n}}\right\} - P\left\{\left(\cap_{i=2}^4 I_i\right)^c\right\}
$$
$$
\ge 1 - q\cdot P\left\{|h_k| > \sqrt{\frac{2\delta_1\log q}{n}}\right\} - P\left\{\left(\cap_{i=2}^4 I_i\right)^c\right\}
\ge 1 - \left(\frac{1}{\sqrt{\pi\delta_1}} + o(1)\right)\frac{q^{-\delta_1+1}}{\sqrt{\log q}},
$$
which immediately implies $P\{I_1^c\} \le O\left(q^{-\delta_1+1}/\sqrt{\log q}\right)$.
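The tail bound of Lemma 5, with the $(1+\epsilon_n)$ factor dropped, can be sanity-checked by simulation; the sample sizes and the choice $t = 1.5$ (so that $t^2 \ge 2$, as required above) are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, t, reps = 50, 1.5, 400_000
# T_n ~ t distribution with n d.f., simulated as N(0,1) / sqrt(chi2_n / n)
Tn = rng.standard_normal(reps) / np.sqrt(rng.chisquare(n, reps) / n)
thresh = n * (np.exp(2 * t * t / (n - 1)) - 1)      # threshold from Lemma 5
emp = (Tn ** 2 > thresh).mean()
bound = np.exp(-t * t) / (np.pi ** 0.5 * t)         # (1 + eps_n) factor omitted
print(f"empirical tail {emp:.4f} <= bound {bound:.4f}")
```

Already at $n=50$ the empirical tail sits below the limiting bound, consistent with $\epsilon_n\to 0$.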
C. PROOF OF LEMMA 2
We show in this section that $\left|\hat\epsilon_k^T\hat\epsilon_l/n - E\epsilon_k^T\epsilon_l/n\right| \le C\lambda_1^2 s_1$ with probability $1-o\left(q^{-\delta_1}\right)$. By Equation (37), we have
$$
\left|\hat\epsilon_k^T\hat\epsilon_l/n - E\epsilon_k^T\epsilon_l/n\right|
\le C_3 s_1\lambda_1^2 + \frac{\left|\epsilon_k^T\Delta_{Z_l} + \epsilon_l^T\Delta_{Z_k} + \Delta_{Z_l}^T\Delta_{Z_k}\right|}{n}. \quad (A.64)
$$
To bound the term $\epsilon_k^T\Delta_{Z_l}$, we note that by the definition of $\Delta_{Z_l}$ in Equation (35) there exists some constant $C_4$ such that
$$
\frac{\left|\epsilon_k^T\Delta_{Z_l}\right|}{n}
= \left|\epsilon_k^T\left[\hat Z_l - Z_l + \left(Z_{A^c} - \hat Z_{A^c}\right)\beta_m\right]\right|/n
= \left|\epsilon_k^T X\left[(\gamma_l-\hat\gamma_l) + (\hat\gamma_{A^c}-\gamma_{A^c})\beta_m\right]\right|/n
$$
$$
\le \left\|\epsilon_k^T X/n\right\|_\infty\left\|(\gamma_l-\hat\gamma_l) + (\hat\gamma_{A^c}-\gamma_{A^c})\beta_m\right\|_1
\le \left\|\epsilon_k^T X/n\right\|_\infty \max_i\|\gamma_i-\hat\gamma_i\|_1\left(1+\|\beta_m\|_1\right)
\le \left\|\epsilon_k^T X/n\right\|_\infty C_4 s_1\lambda_1, \quad (A.65)
$$
where the last inequality follows from Equations (34) and (36).

Since the coordinates of $\epsilon_k$ are i.i.d. $N(0,\psi_{kk})$ and independent of $X$, each coordinate of $\epsilon_k^T X$ is a sum of $n$ i.i.d. sub-exponential variables with bounded parameter under either Condition 3 or 3'. A union bound over the $q$ coordinates and an application of the Bernstein inequality (A.58) imply that $\left\|\epsilon_k^T X/n\right\|_\infty \le C_5\sqrt{\frac{\log q}{n}}$ with probability $1-o\left(q^{-\delta_1}\right)$ for some large constant $C_5>0$; see, e.g., Section 5.2.4 of Vershynin (2010) for details on sub-exponential variables and the Bernstein inequality. This fact, together with (A.65), implies that $\left|\epsilon_k^T\Delta_{Z_l}\right|/n = O\left(s_1\lambda_1^2\right)$ with probability $1-o\left(q^{-\delta_1}\right)$. A similar result holds for $\left|\epsilon_l^T\Delta_{Z_k}\right|/n$. Together with (A.64), this completes our claim on $\left|\hat\epsilon_k^T\hat\epsilon_l/n - E\epsilon_k^T\epsilon_l/n\right|$ and finishes the proof of Lemma 2.
D. PROOF OF LEMMA 3
Lemma 3 is an immediate consequence of Lemma 1 applied to $\hat Z_m = \hat Z_{A^c}\beta_m + E_m$ for each $m\in A = \{i,j\}$, with parameter $\lambda_2$ and sparsity $C_\beta s_2$, where $C_\beta = 2M_2$ follows from the definition of $\beta_m$ in (6). It suffices to check that the three assumptions of Lemma 1 hold with probability $1-o\left(p^{-\delta_2+1}\right)$ to finish our proof. Part of the proof is similar to that of Proposition 3; we thus directly apply facts already shown in the proof of Proposition 3 whenever possible.

We first study the second assumption, $\frac{\|\hat Z_k\|}{\sqrt n}\in[1/A_1, A_1]$ and $\theta^{ora} = \frac{\|E_m\|}{\sqrt n}\in[1/A_2, A_2]$, in Lemma 1. Note that
$$
\frac{\|\hat Z_k\|}{\sqrt n} \le \frac{\|Z_k\|}{\sqrt n} + \frac{\left\|\hat Z_k - Z_k\right\|}{\sqrt n}
\quad\text{and}\quad
\frac{\|E_m\|}{\sqrt n} \le \frac{\|\epsilon_m\|}{\sqrt n} + \frac{\|\Delta_{Z_m}\|}{\sqrt n},
$$
where $Z_k\sim N(0,\sigma_{kk}I)$ and $\epsilon_m\sim N(0,\psi_{mm}I)$. The latter terms are negligible: $\frac{\|\hat Z_k - Z_k\|}{\sqrt n} = O\left(\sqrt{s_1\lambda_1^2}\right) = o(1)$ according to Equation (33), and $\frac{\|\Delta_{Z_m}\|}{\sqrt n} = o(1)$ from Equation (37). Moreover, noting that $\psi_{mm}, \sigma_{kk}\in\left[M_2^{-1}, M_2\right]$, we use the same argument as that for $P\{I_3^c\}$, with an application of the Bernstein inequality for sub-exponential variables as in the proof of Proposition 3, to obtain
$$
P\left\{\frac{\|\epsilon_m\|}{\sqrt n} \notin \left[1/\sqrt{2M_2}, \sqrt{2M_2}\right]\right\} = o\left(p^{-\delta_2}\right)
\quad\text{and}\quad
P\left\{\frac{\|Z_k\|}{\sqrt n} \notin \left[1/\sqrt{2M_2}, \sqrt{2M_2}\right] \text{ for some } k\right\} = o\left(p^{-\delta_2}\right).
$$
Given the above results, the second assumption holds with probability $1-o\left(p^{-\delta_2}\right)$ by choosing $A_1 = A_2 = \sqrt{3M_2}$.

We now study the third assumption in Lemma 1, which involves the lower-RE condition of $\frac{W^TW}{n}$, where $W = \hat Z_{A^c}\,\mathrm{diag}\left(\frac{\sqrt n}{\|\hat Z_k\|}\right)$. Following the arguments in Proposition 3 and Lemma 4, given that the second assumption holds and the fact $\lambda_{\min}(\Sigma_{A^c,A^c})\ge 1/M_2$, we essentially need to show that with probability $1-o\left(p^{-\delta_2}\right)$ the following fact holds:
$$
\left|v^T\left(\hat Z_{A^c}^T\hat Z_{A^c}/n - \Sigma_{A^c,A^c}\right)v\right| \le (LM_2)^{-1} \quad\text{for all } v\in B_0(2C_\beta s_2) \text{ with } \|v\|=1, \quad (A.66)
$$
where $L$ is some large constant and $B_0(2C_\beta s_2)$ is the set of all vectors with sparsity no more than $2C_\beta s_2$. Consequently, the third assumption is valid with $\alpha_1 = \frac{1}{2M_2}$ and $\zeta(n,p) = o(1/s_2)$. Now we prove (A.66). Following the same line as the proof in Proposition 3 for the lower-RE condition of $X^TX/n$, with the normality assumption on $Z$ and the sparsity assumption $s_2 = o\left(\sqrt{n/\log p}\right)$, we can obtain that with probability $1-o\left(p^{-\delta_2}\right)$,
$$
\sup_{v\in B_0(2C_\beta s_2)}\left|v^T\left(Z_{A^c}^T Z_{A^c}/n - \Sigma_{A^c,A^c}\right)v\right| \le \frac{\|v\|^2}{2LM_2}. \quad (A.67)
$$
Therefore it is sufficient to show that in the current setting, with probability $1-o\left(p^{-\delta_2}\right)$,
$$
\left|v^T\left(\hat Z_{A^c}^T\hat Z_{A^c}/n - Z_{A^c}^T Z_{A^c}/n\right)v\right| \le (2LM_2)^{-1} \quad\text{for all } v\in B_0(2C_\beta s_2) \text{ with } \|v\|=1. \quad (A.68)
$$
To show Equation (A.68), we notice
$$
\left|v^T\left(\frac{\hat Z_{A^c}^T\hat Z_{A^c}}{n} - \frac{Z_{A^c}^T Z_{A^c}}{n}\right)v\right|
\le 2\left|v^T\frac{Z_{A^c}^T\left(\hat Z_{A^c}-Z_{A^c}\right)}{n}v\right| + \frac{\left\|\left(\hat Z_{A^c}-Z_{A^c}\right)v\right\|^2}{n}
\equiv D_1 + D_2.
$$
To control $D_2$, we find
$$
\sqrt{D_2} \le \|v\|_1\max_k\frac{\left\|\hat Z_k - Z_k\right\|}{\sqrt n} \le O\left(\sqrt{s_1}\,\lambda_1\right)\sqrt{2C_\beta s_2} = o(1), \quad (A.69)
$$
by Equation (33), $\|v\|_1\le\sqrt{2C_\beta s_2}\,\|v\|$ and the sparsity assumptions. To control $D_1$, we find $D_1 \le 2\sqrt{D_2}\,\|Z_{A^c}v\|/\sqrt n$, which is $o(1)$ with probability $1-o\left(p^{-\delta_2}\right)$ by Equation (A.69) and the following result, $\|Z_{A^c}v\|/\sqrt n = O(1)$. Indeed, Equation (A.67) implies that with probability $1-o\left(p^{-\delta_2}\right)$,
$$
\frac{\|Z_{A^c}v\|}{\sqrt n} \le \frac{\|v\|}{\sqrt{2LM_2}} + \left\|\Sigma_{A^c,A^c}^{1/2}v\right\| = O(1) \quad\text{for all } \|v\|=1.
$$
Therefore we have shown (A.68) and finished the proof of the third assumption.

The following proposition establishes the verification of the first assumption in Lemma 1 and thus completes our proof of Lemma 3.

Proposition 4. Set $\xi = 3/\varepsilon_2+1$. Given that the second and third assumptions in Lemma 1 hold with $A_1 = A_2 = \sqrt{3M_2}$ and $\alpha_1 = \frac{1}{2M_2}$, we have $\nu \le \theta^{ora}\lambda_2(\xi-1)(\xi+1)^{-1}(1-\tau)$ with probability at least $1 - O\left(p^{-\delta_2+1}/\sqrt{\log p}\right)$, where $\nu$ and $\tau$ are defined by (29) and (30).
D.1 Proof of Proposition 4

Recall that in the setting of Lemma 3 we have $A_1 = A_2 = \sqrt{3M_2}$, $\alpha_1 = \frac{1}{2M_2}$ and $\zeta(n,p) = o(1/s_2)$. Moreover,
$$
\nu = \left\|W^T E_m/n\right\|_\infty = \left\|W^T\epsilon_m/n\right\|_\infty + O\left(\left\|W^T\Delta_{Z_m}/n\right\|_\infty\right),
$$
$$
\theta^{ora} = \frac{\|E_m\|}{\sqrt n} = \frac{\|\epsilon_m\|}{\sqrt n} + O\left(\frac{\|\Delta_{Z_m}\|}{\sqrt n}\right).
$$
Following the same line as the proof for event $I_1$ in Proposition 3, we obtain that $\left\|W^T\epsilon_m/n\right\|_\infty < \frac{\|\epsilon_m\|}{\sqrt n}\,\lambda_2\frac{\xi-1}{\xi+1}(1-\tau)$ with probability $1-o\left(p^{-\delta_2+1}\right)$. Thus, to finish our proof, it is sufficient to show that with probability $1-o\left(p^{-\delta_2+1}\right)$,
$$
\left\|W^T\Delta_{Z_m}/n\right\|_\infty = o(\lambda_2) \quad\text{and}\quad \frac{\|\Delta_{Z_m}\|}{\sqrt n} = o(1). \quad (A.70)
$$
Equation (37) in the main body of the paper immediately implies $\frac{\|\Delta_{Z_m}\|}{\sqrt n} = O\left(\sqrt{s_1\lambda_1^2}\right) = o(1)$ by the sparsity assumptions. To study $\left\|W^T\Delta_{Z_m}/n\right\|_\infty$, we note there exists some constant $C_6>0$ such that
$$
\left\|W^T\Delta_{Z_m}/n\right\|_\infty
= \left\|\mathrm{diag}\left(\frac{\sqrt n}{\|\hat Z_k\|}\right)\hat Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty
\le A_1\left\|\left(Z_{A^c} + \left(\hat Z_{A^c} - Z_{A^c}\right)\right)^T\Delta_{Z_m}/n\right\|_\infty
$$
$$
\le A_1\left\|Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty + A_1\max_i\frac{\left\|\hat Z_i - Z_i\right\|}{\sqrt n}\cdot\frac{\|\Delta_{Z_m}\|}{\sqrt n}
\le A_1\left\|Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty + C_6 s_1\lambda_1^2, \quad (A.71)
$$
where we used $\|\hat Z_k\|/\sqrt n\in[1/A_1, A_1]$ for all $k$ in the first inequality and Equations (33) and (37) in the last inequality. By the sparsity assumptions, we have $s_1\lambda_1^2 = o(\lambda_2)$. Thus it is sufficient to show $\left\|Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty = o(\lambda_2)$ with probability $1-o\left(p^{-\delta_2+1}\right)$. In fact,
$$
\left\|Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty
\le \left\|Z_{A^c}^T X\left[(\gamma_m-\hat\gamma_m) + (\hat\gamma_{A^c}-\gamma_{A^c})\beta_m\right]/n\right\|_\infty
\le \left\|Z_{A^c}^T X/n\right\|_\infty\left\|(\gamma_m-\hat\gamma_m) + (\hat\gamma_{A^c}-\gamma_{A^c})\beta_m\right\|_1
$$
$$
\le \left\|Z_{A^c}^T X/n\right\|_\infty\max_i\|\hat\gamma_i-\gamma_i\|_1\left(1+\|\beta_m\|_1\right)
\le \left\|Z_{A^c}^T X/n\right\|_\infty C s_1\lambda_1
= o\left(\left\|Z_{A^c}^T X/n\right\|_\infty\right), \quad (A.72)
$$
where the last inequality follows from Equations (34) and (36).

Since the coordinates of each $Z_k$ are i.i.d. $N(0,\sigma_{kk})$ and independent of $X$, each entry of $Z_{A^c}^T X$ is a sum of $n$ i.i.d. sub-exponential variables with finite parameter under either Condition 3 or 3'. A union bound over the $pq$ entries and an application of the Bernstein inequality (A.58) imply that $\left\|Z_{A^c}^T X/n\right\|_\infty \le C_7\sqrt{\frac{\log(pq)}{n}}$ with probability $1-o\left((pq)^{-\delta_1}\right)$ for some large constant $C_7>0$. This result, together with Equation (A.72), implies that $\left\|Z_{A^c}^T\Delta_{Z_m}/n\right\|_\infty = o(\lambda_2)$ with probability $1-o\left(p^{-\delta_2+1}\right)$. This finishes the proof of $\left\|W^T\Delta_{Z_m}/n\right\|_\infty = o(\lambda_2)$ by Equation (A.71), and hence the proof of Equation (A.70), with probability $1-o\left(p^{-\delta_2+1}\right)$. This completes our proof of Proposition 4.
E. SIMULATION STUDIES: SUPPORT RECOVERY

In this section, we evaluate the performance of the proposed ANTAC method and competing methods on support recovery for three models, with the corresponding $(p,q,n)$ listed in Table 4. Since every model has identical values along the diagonal, we call them "homogeneous models". We generate the $p\times q$ coefficient matrix $\Gamma$ in the same way as in Section 4.1,
$$
\Gamma_{ij} \overset{i.i.d.}{\sim} N(0,1)\cdot\mathrm{Bernoulli}(0.025).
$$
The off-diagonal entries of the $p\times p$ precision matrix $\Omega$ are generated as follows,
$$
\omega_{ij} \overset{i.i.d.}{\sim} N(0,1)\cdot\mathrm{Bernoulli}(\pi),
$$
where the probability of being nonzero, $\pi = P(\omega_{ij}\neq 0)$, is shown in Table 4 for the three models respectively. We generate 50 replicates of $X$ and $Y$ for each of the three models. We compare our method with the graphical Lasso (GLASSO) (Friedman, Hastie & Tibshirani 2008), CAPME (Cai, Li, Liu & Xie 2013) and cGLASSO in the same way as described in Section 4.2. We then evaluate the performance of the estimators for the support recovery problem in terms of the misspecification rate, specificity, sensitivity, precision and Matthews correlation coefficient,
Table 4: Model parameters used in the simulation of support recovery.

          (p, q, n)        P(Γij ≠ 0)   π = P(ωij ≠ 0), i ≠ j
Model 1   (200, 200, 200)  0.025        0.025
Model 2   (200, 100, 300)  0.025        0.025
Model 3   (800, 200, 200)  0.025        0.010
which are defined as
$$
\mathrm{MISR}(\hat\Omega,\Omega) = \frac{\mathrm{FN}+\mathrm{FP}}{p(p-1)}, \qquad
\mathrm{SPE} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \qquad
\mathrm{SEN} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
\mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},
$$
$$
\mathrm{MCC} = \frac{\mathrm{TP}\times\mathrm{TN} - \mathrm{FP}\times\mathrm{FN}}{\left[(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})\right]^{1/2}}.
$$
Here, TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. True positives are defined as the correctly identified nonzero off-diagonal entries of $\Omega$. For GLASSO and CAPME, nonzero entries of $\hat\Omega$ are selected as edges with no extra thresholding applied. For ANTAC, edges are selected by the theoretical bound with $\xi_0 = 2$. The results are summarized in Table 5. It can be seen that ANTAC achieves superior specificity and precision under all model settings. Besides, ANTAC has the best overall performance in terms of the Matthews correlation coefficient.

We further construct ROC curves to check how this result varies with the tuning parameters. For GLASSO or cGLASSO, the ROC curve is obtained by varying the tuning parameter. For CAPME, $\lambda_1$ is fixed at the value selected by cross validation and the ROC curve is obtained by varying $\lambda_2$. For the proposed ANTAC method, the ROC curve is obtained by varying the thresholding level $\xi_0$. When $p$ is small, CAPME, cGLASSO and ANTAC have comparable performance. As $p$ grows, both ANTAC and cGLASSO outperform CAPME.

The purpose of simulating the homogeneous models is to compare the performance of ANTAC and the other procedures under models with settings similar to those used in Cai et al. (2013). Overall, the performance of all procedures is not satisfactory due to the difficulty of the support recovery problem. All nonzero entries are sampled from a standard normal. Hence, most signals are very weak
and hard to recover by any method, although ANTAC performs among the best in all three models.

Figure 5: The ROC curves (sensitivity versus 1-specificity) for the different methods under the three models, (200, 200, 200), (200, 100, 300) and (800, 200, 200). For GLASSO or cGLASSO, the ROC curve is obtained by varying its tuning parameter. For CAPME, λ1 is fixed at the value selected by cross validation and the ROC curve is obtained by varying λ2. For ANTAC, the ROC curve is obtained by varying the cut-off on P-values.
F. SIMULATION STUDIES: MORE VISUALIZATION PLOTS

In this section, we include the Q-Q plot for the estimators in Section 4.1, corresponding to Figure 1. Moreover, a visualization of the hub-network structure generated in Section 4.2 is also given.
Figure 6: Q-Q plots (observed versus theoretical quantiles) for the estimators in Section 4.1, with panels indexed by the parameter value ω ∈ {0, 0.3, 0.6, 1} and by p = 200, 400, 1000 with q = 600.
Figure 7: Visualization of the hub-network structure generated in Section 4.2.
Table 5: Simulation results of the support recovery for homogeneous models based on 50 replications. The performance is measured by the misspecification rate (MISR), specificity (SPE), sensitivity (SEN, recall rate), precision (PRE) and the Matthews correlation coefficient (MCC), with all values multiplied by 100. Numbers in parentheses are the simulation standard deviations.

          (p, q, n)        Method    MISR   SPE     SEN    PRE     MCC
Model 1   (200, 200, 200)  GLASSO    35(1)  65(1)   37(2)  2(0)    1(1)
                           cGLASSO   25(6)  76(6)   64(7)  6(1)    8(1)
                           CAPME     2(0)   100(0)  4(1)   96(1)   21(1)
                           ANTAC     2(0)   100(0)  6(1)   100(1)  23(2)
Model 2   (200, 100, 300)  GLASSO    43(0)  57(0)   51(2)  3(0)    3(1)
                           cGLASSO   5(0)   97(0)   47(1)  25(1)   32(1)
                           CAPME     4(0)   97(0)   56(1)  29(1)   39(1)
                           ANTAC     2(0)   100(0)  19(1)  100(0)  44(1)
Model 3   (800, 200, 200)  GLASSO    19(1)  81(1)   19(1)  1(0)    0(0)
                           cGLASSO   1(0)   100(0)  0(0)   100(0)  2(0)
                           CAPME     1(0)   100(0)  0(0)   0(0)    0(0)
                           ANTAC     1(0)   100(0)  7(0)   71(2)   22(1)
REFERENCES

Cai, T. T., Li, H., Liu, W., & Xie, J. (2013), "Covariate-adjusted precision matrix estimation with an application in genetical genomics," Biometrika, 100(1), 139-156.

Friedman, J., Hastie, T., & Tibshirani, R. (2008), "Sparse inverse covariance estimation with the graphical Lasso," Biostatistics, 9(3), 432-441.

Loh, P.-L., & Wainwright, M. J. (2012), "High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity," The Annals of Statistics, 40(3), 1637-1664.

Massart, P., & Picard, J. (2007), Concentration Inequalities and Model Selection, Vol. 1896, Springer.

Ren, Z., Sun, T., Zhang, C.-H., & Zhou, H. H. (2013), "Asymptotic normality and optimalities in estimation of large Gaussian graphical model," arXiv preprint arXiv:1309.6024.

Rudelson, M., & Zhou, S. (2013), "Reconstruction from anisotropic random measurements," IEEE Transactions on Information Theory, 59(6), 3434-3447.

Sun, T., & Zhang, C.-H. (2012), "Scaled sparse linear regression," Biometrika, 99, 879-898.

Sun, T., & Zhang, C.-H. (2013), "Sparse matrix inversion with scaled Lasso," The Journal of Machine Learning Research, 14(1), 3385-3418.

Vershynin, R. (2010), "Introduction to the non-asymptotic analysis of random matrices," arXiv preprint arXiv:1011.3027.