Structured Sparse Regression via Greedy Hard-Thresholding

Prateek Jain
Microsoft Research, India

Nikhil Rao
Technicolor R & I, Los Altos, CA
arXiv:1602.06042v1 [stat.ML] 19 Feb 2016
Inderjit Dhillon
University of Texas at Austin

February 22, 2016

Abstract

Several learning applications require solving high-dimensional regression problems where the relevant features belong to a small number of (overlapping) groups. For very large datasets, hard thresholding methods have proven to be extremely efficient under standard sparsity assumptions, but such methods require NP-hard projections when dealing with overlapping groups. In this paper, we propose a simple and efficient method that avoids NP-hard projections by using greedy approaches. Our proposed methods come with strong theoretical guarantees even in the presence of poorly conditioned data, exhibit an interesting computation-accuracy trade-off, and can be extended to significantly harder problems such as sparse overlapping groups. Experiments on both real and synthetic data validate our claims and demonstrate that the proposed methods are significantly faster than the best known greedy and convex relaxation techniques for learning with structured sparsity.
1 Introduction
High dimensional problems where the regressor belongs to a small number of groups play a fundamental role in many machine learning and signal processing applications, ranging from computational biology [15] and image processing [29] to wideband spectrum sensing [23] and multitask learning [20]. In most of these cases, the groups overlap: they correspond to overlapping gene pathways, parent-child pairs of wavelet transform coefficients, or overlapping frequency bands. Such problems are typically solved using convex algorithms; for example, [24, 15] solve a regularized convex program using proximal point methods, while [36, 28] solve a constrained version using greedy selection schemes. There are two main drawbacks to using convex methods for overlapping group sparse recovery programs. First, they require solving a hard non-smooth optimization problem that typically does not scale to large problems. Second, because of the complicated and delicate sub-gradient structure of such problems, they are hard to extend to other interesting structured sparsity settings, for example sparse overlapping groups, where the active set of overlapping groups themselves have only a few non-zeros. Such structures have found applications in computational biology [33], fMRI [27], and climatology [9]. Indeed, this can be generalized further to include a hierarchy of overlapping groups; hierarchical structure within non-overlapping groups has found use in topic modeling [18], for example. Finally, theoretical guarantees for existing convex (as well as non-convex) approaches require strict conditions on the data. Most existing results hold under restrictive assumptions such as the Restricted Isometry Property (RIP) [2, 14, 30, 22], or variants thereof ([31, 21]). While some applications allow for the design of measurement operators that satisfy such conditions, several other scenarios do not allow for this liberty, as a few of the features can be highly correlated.
Indeed, in most applications of interest, such conditions do not hold. In fact, in such settings, the gains obtained from using group sparsity structure are typically just logarithmic factors of the dimension and the number of groups (see for example Section 3.1 in [30]), which makes the results less appealing.
From a scalability point of view, Iterative Hard Thresholding (IHT) techniques have led to state-of-the-art methods for the standard sparsity or non-overlapping group sparsity problems [39, 11, 17]. IHT methods directly solve a non-convex problem using a projected gradient descent style technique. A key aspect of the computational simplicity of IHT methods is the ease of performing projections onto the non-convex sets involved in such problems. For example, when the constraint is the ℓ0 pseudo-norm, projections can be performed by retaining the top entries of a vector. Hence, a natural question to ask is whether similar gains can be obtained by applying IHT-style methods to the overlapping group problems. In particular, we study the following two questions: a) can IHT style methods be designed to efficiently solve the overlapping group sparsity problem and its extensions, and b) can we obtain solid theoretical justification for such methods, especially in more realistic settings compared to the RIP settings considered in the existing literature.
1.1 Our Contributions
IHT for Overlapping Group Sparsity: In this work, we introduce a novel IHT algorithm for the overlapping group sparsity problem. The key challenge in designing an IHT algorithm for this problem is that performing projections onto the set of group sparse vectors is in general NP-hard [3]. To alleviate this concern, we appeal to submodular function optimization and obtain a greedy hard thresholding operator for the projection problem. Moreover, we show that by increasing the computational and sample complexity, we can significantly improve the accuracy of the obtained solution.

Statistical Guarantees: We show that our proposed approach obtains a nearly optimal solution as long as the function to be optimized satisfies the restricted strong convexity and smoothness properties (see Definition 2.1). Moreover, for the high-dimensional regression problem with stochastic data, we show that our approach provides a consistent estimator of the optimal regressor as long as n ≥ (κ² k* log M + s) · log(κ/ε), where ε is the desired accuracy, k* is the group sparsity of the optimal regressor, M is the number of groups, κ is the condition number of the data matrix, and s is the maximum support size of a union of k* groups. Interestingly, using just the standard sparsity structure, the sample complexity (n ≥ κ² s log p) is significantly larger. Note that n depends only logarithmically on the desired accuracy ε.

Generalization to Other Cases: Due to the simplicity of our hard thresholding operator, we can easily extend it to several more complicated sparsity structures. In particular, we show that our methods generalize to the sparse overlapping group setting, as well as to hierarchies of overlapping groups.

Experimental Evidence: We provide extensive experiments on both real and synthetic datasets showing that our methods are not only faster than several other approaches, but also accurate despite performing approximate projections.
Indeed, even for poorly-conditioned data, IHT methods are an order of magnitude faster than other greedy and convex methods. We also observe a similar phenomenon when dealing with sparse overlapping groups.
1.2 Related Work
IHT schemes for the sparse regression setting have been extensively studied in the literature. A series of papers ([8, 7, 5] and references therein) studied convergence properties of these methods under standard RIP conditions. [17] generalized the method to settings where RIP does not hold, and also to the low rank matrix recovery setting. [39] used a similar analysis to obtain results for nonlinear models. However, most of these techniques apply only to non-overlapping groups. Forward greedy selection schemes for sparse [37, 16] and group sparse [35] constrained programs, where a single group is added at each iteration, have also been considered previously. The authors of [2] propose a
variant of CoSaMP [25] to solve problems that are of interest to us, but a key distinction from our work is that, for the problems considered, those methods can perform exact projections at each iteration. Several works have studied approximate projections in the context of IHT [4, 32, 12, 13, 19]. However, these results require that the data satisfy RIP-style conditions, which typically do not hold in real-world regression problems. Moreover, these analyses do not guarantee a consistent estimate of the optimal regressor when the measurements have zero-mean random noise. In contrast, we provide results under the general RSC/RSS condition and give crisp rates for the error bounds when the measurement noise is random.
2 Group Iterative Hard Thresholding for Overlapping Groups
In this section we present our algorithm for solving the overlapping group sparsity problem. We first introduce some notation. Suppose we are given a set of M groups that can arbitrarily overlap, G = {G_1, ..., G_M}, where G_i ⊆ [p] and ∪_{i=1}^M G_i = {1, 2, ..., p}. We let ‖w‖ denote the Euclidean norm of w, and supp(w) the support of w. For any vector w ∈ R^p, [15] defined the overlapping group lasso norm¹ as

‖w‖_G := inf Σ_{i=1}^M ‖a_{G_i}‖  s.t.  Σ_{i=1}^M a_{G_i} = w,  supp(a_{G_i}) ⊆ G_i.   (1)
We also introduce the notion of "group-support" of a vector and its group-ℓ0 pseudo-norm:

G-supp(w) := {i s.t. ‖a_{G_i}‖ > 0},   ‖w‖_{G,0} := inf Σ_{i=1}^M 1{‖a_{G_i}‖ > 0},   (2)

where the a_{G_i} satisfy the constraints of (1), and 1{·} is the indicator function, taking the value 1 if the condition is satisfied and 0 otherwise. For a set of groups G̃, supp(G̃) = ∪_{i∈G̃} G_i. Similarly, G-supp(S) = G-supp(w_S).

Problem Definition: Suppose we are given a function f : R^p → R and M groups G = {G_1, ..., G_M}. The goal is to solve the following problem:

min_w f(w)  s.t.  ‖w‖_{G,0} ≤ k.   (3)
Unfortunately, (3) is NP-hard in general. Instead, we focus on problems where f satisfies the restricted strong convexity and smoothness conditions:

Definition 2.1 (RSC/RSS). A function f : R^p → R satisfies restricted strong convexity (RSC) and restricted strong smoothness (RSS) if the following holds:

α_k · I ⪯ H(w) ⪯ L_k · I,

where H(w) is the Hessian of f at any w ∈ R^p s.t. ‖w‖_{G,0} ≤ k.

The following instantiation of the above problem is of particular interest: suppose we are given n noisy measurements of the form y_i = x_i^T w* + β_i, where β_i ~ N(0, λ²) and w* is the "true" parameter vector with ‖w*‖_{G,0} = k*. Letting X ∈ R^{n×p} be the matrix with the x_i as its rows, we have y = Xw* + β. We are interested in recovering w* consistently (i.e., recovering w* exactly as n → ∞), and to this end we look to solve the following (non-convex) constrained least squares problem:

ŵ = arg min_w f(w) := (1/2n) ‖y − Xw‖²  s.t.  ‖w‖_{G,0} ≤ k,   (4)

¹It can be shown that ‖·‖_G is in fact a norm.
with k ≥ k* being a positive, user-defined integer. In this paper, we propose to solve (4) using an Iterative Hard Thresholding (IHT) scheme. IHT methods iteratively take a gradient descent step, and then project the iterate onto the (non-convex) constraint set, in this case the set of group-sparse vectors {w : ‖w‖_{G,0} ≤ k}. Computing the gradient is easy, and hence the complexity of the overall algorithm depends heavily on the complexity of performing the aforementioned projection onto the set of group-sparse vectors. Algorithm 1 details the IHT procedure for the group lasso. Throughout the paper we consider the same high-level procedure, but with different projection operators P̂_k^G(g) for different settings of the problem.

Algorithm 1 IHT for the Overlapping Group Lasso
1: Input: data y, X, parameter k, iterations T, step size η
2: Initialize: t = 0, w^0 ∈ R^p a k-group-sparse vector
3: for t = 1, 2, ..., T do
4:   g^t = w^t − η ∇f(w^t)
5:   w^t = P̂_k^G(g^t), where P̂_k^G(g^t) performs (approximate) projections
6: end for
7: Output: w^T

Note that in the special case when G contains only singleton groups, ‖w‖_{G,0} = ‖w‖_0, and the projection in step 5 can be performed very efficiently by retaining the largest k entries of g^t by magnitude [5]. When G contains only non-overlapping groups, this again can be done efficiently: compute ‖(g^t)_{G_i}‖ for all i, and retain the entries of g^t that correspond to the k largest-normed groups.

Suppose we are given a vector g ∈ R^p that needs to be projected onto the constraint set {u : ‖u‖_{G,0} ≤ k}. The projection operator P_k^G(·) is defined as:

u* = P_k^G(g) = arg min_u ‖u − g‖²  s.t.  ‖u‖_{G,0} ≤ k.   (5)

Solving (5) is hard when G contains overlapping groups. Indeed, [3] show that the problem is in general NP-hard, as it is equivalent to the weighted maximum set cover problem. To overcome this, we replace P_k^G(·) by an approximate operator P̂_k^G(·) (step 5 of Algorithm 1). In Section 2.1, we present an efficiently computable operator P̂_k^G(·) that achieves "good" approximation error, regardless of the properties of G.
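To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 for the least squares objective (4); the projection P̂_k^G is passed in as a callable, mirroring the fact that different projection operators are plugged in for different settings. The names `iht` and `project` are illustrative, not from the paper's code.

```python
import numpy as np

def iht(y, X, project, k, eta, T=100):
    """Algorithm 1 sketch: projected gradient descent for
    f(w) = (1/2n) * ||y - X w||^2, where `project(g, k)` is an
    (approximate) projection onto k-group-sparse vectors."""
    n, p = X.shape
    w = np.zeros(p)                      # any k-group-sparse initializer
    for _ in range(T):
        grad = X.T @ (X @ w - y) / n     # gradient of the LS objective
        w = project(w - eta * grad, k)   # hard-thresholding step
    return w
```

With singleton groups, `project` reduces to keeping the top-k entries by magnitude, recovering standard IHT [5].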
2.1 Submodular Optimization for General G
We now introduce a simple greedy method to approximately solve the projection problem (5). Our method greedily and sequentially selects k groups, and outputs a vector P̂_k^G(g) that acts as a proxy for P_k^G(g). We detail this procedure in Algorithm 2.

Algorithm 2 Greedy Projections
Require: g ∈ R^p, parameter k, groups G
1: û = 0, v = g, Ĝ = ∅
2: for t = 1, 2, ..., k do
3:   Find G* = arg max_{G ∈ G\Ĝ} ‖v_G‖
4:   Ĝ = Ĝ ∪ G*
5:   u = u + v_{G*}
6:   v = v − v_{G*}
7: end for
8: Output û := P̂_k^G(g), Ĝ = supp(û)

We now show that Algorithm 2 yields sufficiently good projections. First, from [3], (5) can be cast in the following equivalent way:

Ĝ = arg max_{G̃ ⊆ G, |G̃| ≤ k} { Σ_{i∈I} g_i² : I = ∪_{G∈G̃} G }.   (6)

Note that once we have Ĝ, û can be recovered by simply setting û_I = g_I and 0 everywhere else, where I = ∪_{G∈Ĝ} G. We first define a few terms that will be useful in the sequel.

Definition 2.2. Let Q be a finite set, and let z(·) be a real-valued function defined on Ω_Q, the power set of Q. The function z(·) is said to be submodular if

z(S) + z(T) ≥ z(S ∪ T) + z(S ∩ T)  for all S, T ⊆ Q.

Lemma 2.3. Suppose we are given a set of groups G. Given a subset of groups G̃ ⊆ G, let S := ∪_{G∈G̃} G be the union of the groups. Then, for any vector x ∈ R^p, the following function is submodular: z(S) = Σ_{i∈S} x_i².

This is proved in Appendix A. From Lemma 2.3, we see that (6) can be cast as a submodular optimization program of the form

max_{S ⊆ Q} z(S)  s.t.  |S| ≤ k.   (7)
Furthermore, Algorithm 2 exactly corresponds to the greedy algorithm for submodular optimization [1], and this gives us a means to assess the quality of our projections:

Lemma 2.4. Given an arbitrary vector g ∈ R^p, suppose we obtain û, Ĝ as the output of Algorithm 2 with input g and target group sparsity k̃. Let u* = P_k^G(g) be as defined in (5). Then

‖û − g‖² ≤ e^{−k̃/k} ‖g_{supp(u*)}‖² + ‖u* − g‖²,

where e is the base of the natural logarithm.

A proof is provided in Appendix B. Note that the exponential term in Lemma 2.4 approaches 0 as k̃ increases, though increasing k̃ requires more samples for the recovery of w*. Hence, this lemma hints at the possibility of trading off sample complexity for better accuracy, despite the approximation in the projection operator; see Section 3 for more details. Algorithm 2 can be applied to any G, and is extremely efficient.
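A direct transcription of Algorithm 2 in Python (a sketch; groups are given as lists of coordinate indices, and `greedy_project` is an illustrative name):

```python
import numpy as np

def greedy_project(g, groups, k):
    """Algorithm 2 sketch: greedily select k groups, each time taking
    the group with the largest residual norm ||v_G||, moving its
    residual mass into u and zeroing it in v."""
    v = g.astype(float).copy()
    u = np.zeros_like(v)
    chosen = []
    for _ in range(k):
        best = max((i for i in range(len(groups)) if i not in chosen),
                   key=lambda i: np.linalg.norm(v[groups[i]]))
        idx = groups[best]
        u[idx] += v[idx]   # u = u + v_{G*}
        v[idx] = 0.0       # v = v - v_{G*}
        chosen.append(best)
    return u, chosen
```

Each iteration costs one pass over the remaining groups, so the projection runs in O(k · M · max_i |G_i|) time overall.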
2.2 Exact Projections for Certain G
When G is a set of non-overlapping groups, û = u*. This follows from the fact that for non-overlapping groups, the function z(S) = Σ_{i∈S} x_i² is actually modular, and the greedy method finds the true optimal solution. Furthermore, when the set of groups G satisfies certain properties, [3] provide a dynamic programming algorithm to compute P_k^G(·) as defined in (5). For the solution to be accurate, G has to be loopless pairwise overlapping, which requires each element of {1, 2, ..., p} to lie in at most two groups, among other restrictions. While this allows for efficient projections in polynomial time, the approach does not generalize to the cases we are interested in. For example, when groups correspond to parent-child coefficients of a 1-D wavelet transform, most variables occur in three groups (five in the case of the 2-D transform for images), and genes often lie in multiple pathways.
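For non-overlapping groups, the exact projection described above reduces to keeping the k groups of largest Euclidean norm; a minimal sketch (illustrative function name):

```python
import numpy as np

def project_nonoverlapping(g, groups, k):
    """Exact projection when the groups partition {1, ..., p}:
    retain the k largest-normed groups, zero out the rest."""
    norms = [np.linalg.norm(g[G]) for G in groups]
    keep = np.argsort(norms)[-k:]         # indices of the k top groups
    u = np.zeros_like(g, dtype=float)
    for i in keep:
        u[groups[i]] = g[groups[i]]
    return u
```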
2.3 Incorporating Full Corrections
IHT methods can be significantly improved by incorporating "corrections" after each projection step. This merely entails adding the following step to Algorithm 1 after the projection in step 5:

w^t = arg min_{w̃} f(w̃)  s.t.  supp(w̃) = supp(P̂_k^G(g^t)).

When f(·) is the least squares loss, as we consider here, this step can be solved efficiently using Cholesky decompositions (via the backslash operator in MATLAB). We refer to this procedure as IHT-FC. Fully corrective methods in greedy algorithms typically yield significant improvements, both theoretically and in practice [10, 17].
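A sketch of the fully corrective step for the least squares loss, using NumPy's least squares solver in place of MATLAB's backslash (`fully_corrective` is an illustrative name):

```python
import numpy as np

def fully_corrective(y, X, support):
    """Re-fit the least squares objective restricted to the support
    returned by the projection step (the IHT-FC correction)."""
    support = np.asarray(sorted(support))
    w = np.zeros(X.shape[1])
    coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    w[support] = coef
    return w
```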
3 Theoretical Performance Bounds
We now provide performance guarantees for Algorithm 1 when used to minimize a general f over the set of k-sparse combinations of M groups G = {G_1, G_2, ..., G_M}. We then instantiate our result to provide error bounds for estimating the k*-group-sparse vector w* from (y, X), where y = Xw* + β and each row of X is sampled i.i.d. as x_i ~ N(0, Σ). Let σ_max (σ_min) be the maximum (minimum) singular value of Σ, and let κ := σ_max/σ_min be the condition number.

Theorem 3.1. Let w* = arg min_{w : ‖w‖_{G,0} ≤ k*} f(w). Let f satisfy RSC/RSS with constants α_{2k+k*}, L_{2k+k*}, respectively (see Definition 2.1). Suppose we run Algorithm 1 with η = 1/L_{2k+k*} and projections computed according to Algorithm 2. Then, the following holds for the (t+1)-th iterate:

‖w^{t+1} − w*‖_2 ≤ (1 − α_{k′}/(10 · L_{k′})) · ‖w^t − w*‖_2 + γ + ε,

where k′ = 2k + k*, k = O((L_{k′}/α_{k′})² · k* · log((L_{k′}/α_{k′}) · ‖w*‖_2/(k* ε²))), and γ = (2/L_{k′}) · max_{S : |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖_2. Specifically, the output of the T = O((L_{k′}/α_{k′}) · log(‖w*‖_2/ε))-th iteration of Algorithm 1 satisfies:

‖w^T − w*‖_2 ≤ 2ε + (10 · L_{k′}/α_{k′}) · γ.

We prove this result in Appendix D.

Remarks: The above theorem shows that Algorithm 1 approximates w* up to an error of γ. In the sequel, we formally bound γ for the least squares problem. Algorithm 2 avoids NP-hard projections by optimizing f over a larger number of groups k; however, k depends only logarithmically on the accuracy parameter ε. Our proof implicitly uses the fact that Algorithm 2 performs an approximate group-sparse projection (Lemma 2.4). However, we stress that our results do not hold for arbitrary approximate projection operators: the proof critically uses the greedy structure of the algorithm. Moreover, as discussed in Section 4, the proof easily extends to other structured sparsity sets that allow such greedy selection steps. We obtain a result similar to [17] for the standard sparsity case, i.e., when the groups are singletons; however, our proof is significantly simpler and allows for a significantly easier setting of η.
3.1 Linear Regression Guarantees
Theorem 3.2. Let y = Xw* + β where β ~ N(0, σ²I) and each row of X is given by x_i ~ N(0, Σ). Let w* be k*-group sparse and f(w) := ‖Xw − y‖_2². Let the number of samples satisfy:

n ≥ Ω((s + κ² · k* · log M) · log(κ/ε)),

where s = max_{w : ‖w‖_{G,0} ≤ k} |supp(w)| and κ = σ_max/σ_min. Then, applying the IHT algorithm (Algorithm 1) with group sparsity parameter k = 8κ² k* · log(κ/ε) and η = 1/(4σ_max) guarantees the following after T = Ω(κ · log(κ · ‖w*‖_2/ε)) iterations (w.p. ≥ 1 − 1/n⁸):

‖w^T − w*‖ ≤ σ · √(s log n / n) + 2ε,
where w^T is the output of the T-th iteration of Algorithm 1.

Remarks: Note that one can ignore the group sparsity constraint, and instead recover the s-sparse vector w* using IHT methods for ℓ0 optimization [17]. However, the corresponding sample complexity is n ≥ κ² s log(p). Hence, for an ill-conditioned Σ, using group sparsity based methods provides a significantly stronger result, especially when the groups overlap significantly. To the best of our knowledge, this is the first result for overlapping group sparsity constrained methods when the data can be arbitrarily conditioned. Note that the number of samples required increases logarithmically with the accuracy ε. Theorem 3.2 thus displays an interesting phenomenon: by obtaining more samples, one can achieve a smaller recovery error while incurring a larger approximation error (since we choose more groups).

Our proof critically requires that, when restricted to group sparse vectors, the least squares objective function f(w) = (1/2)‖y − Xw‖_2² is strongly convex as well as strongly smooth:

Lemma 3.3 (RSC/RSS Condition). Let X ∈ R^{n×p} be such that each x_i ~ N(0, Σ). Let w ∈ R^p be k-group sparse over groups G = {G_1, ..., G_M}, i.e., ‖w‖_{G,0} ≤ k, and let s = max_{w : ‖w‖_{G,0} ≤ k} |supp(w)|. Let the number of samples satisfy n ≥ Ω(C(k log M + s)). Then, the following holds with probability ≥ 1 − 1/n¹⁰:

(1 − 4/√C) · σ_min · ‖w‖_2² ≤ (1/n) ‖Xw‖_2² ≤ (1 + 4/√C) · σ_max · ‖w‖_2².
We prove Lemma 3.3 in Appendix C. Theorem 3.2 then follows by combining Lemma 3.3 with Theorem 3.1.
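The flavor of Lemma 3.3 is easy to check numerically for a single fixed group-sparse w (the lemma itself is uniform over all such w): with Σ = I, so that σ_min = σ_max = 1, the quantity (1/n)‖Xw‖² concentrates around ‖w‖².

```python
import numpy as np

# Empirical illustration of the RSC/RSS bounds of Lemma 3.3 for one
# fixed group-sparse w; here Sigma = I, so sigma_min = sigma_max = 1.
rng = np.random.default_rng(0)
n, p = 2000, 100
X = rng.standard_normal((n, p))          # rows x_i ~ N(0, I)
w = np.zeros(p)
w[10:35] = rng.standard_normal(25)       # one "group" of 25 coordinates
ratio = np.linalg.norm(X @ w) ** 2 / (n * np.linalg.norm(w) ** 2)
print(ratio)                             # close to 1 for large n
```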
3.2 IHT with Exact Projections P_k^G(·)
We now consider the setting where P_k^G(·) can be computed exactly and efficiently for any k. One such example is the dynamic programming based method of [3]. Note that unlike the greedy method used for approximate projection (Algorithm 2), the exact projection operator can be arbitrary. Hence, our proof of Theorem 3.1 does not apply directly in this case. However, we show that by exploiting the structure of hard thresholding, we can still obtain a similar result:

Theorem 3.4. Let w* = arg min_{w : ‖w‖_{G,0} ≤ k*} f(w). Let f satisfy RSC/RSS with constants α_{2k+k*}, L_{2k+k*}, respectively (see Definition 2.1). Then, the following holds for the T = O((L_{k′}/α_{k′}) · log(‖w*‖_2/ε))-th iterate of Algorithm 1 (with η = 1/L_{2k+k*}) when P̂_k^G(·) = P_k^G(·) is the exact projection:

‖w^T − w*‖_2 ≤ ε + (10 · L_{k′}/α_{k′}) · γ,

where k′ = 2k + k*, k = O((L_{k′}/α_{k′})² · k*), and γ = (2/L_{k′}) · max_{S : |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖_2.
See Appendix E for a detailed proof. Note that unlike for the greedy projection method (see Theorem 3.1), here k is independent of the accuracy parameter ε.
4 Extension to Sparse Overlapping Groups
In this section, we extend our proposed method in Section 2 to signal recovery in Sparse Overlapping Groups (SoG) as well as several other general sparsity structures.
The SoG model generalizes the group lasso by allowing the selected groups themselves to be sparse. Given positive integers k_1, k_2 and a set of groups G, IHT for SoG performs projections onto the following set:

C_0^{sog} := { w = Σ_{i=1}^M a_{G_i} : ‖w‖_{G,0} ≤ k_1, ‖a_{G_i}‖_0 ≤ k_2 }.   (8)
As in the case of the overlapping group lasso, projection onto the above SoG set is NP-hard in general. Motivated by our greedy approach for the overlapping group lasso, we propose a similar method for SoG (see Algorithm 3). The algorithm greedily selects the groups with large residual norm, retaining only the top-k_2 elements by magnitude within each selected group. Below, we show that the IHT algorithm (Algorithm 1) combined with the greedy projection (Algorithm 3) indeed converges to the optimal solution. Moreover, our experiments (Section 5) reveal that this method, when combined with full corrections, yields highly accurate results that are significantly better than the state-of-the-art methods.

Algorithm 3 Greedy Projections for SoG
Require: g ∈ R^p, parameters k_1, k_2, groups G
1: û = 0, v = g, Ĝ = ∅, Ŝ = ∅
2: for t = 1, 2, ..., k_1 do
3:   Find G* = arg max_{G ∈ G\Ĝ} ‖v_G‖
4:   Ĝ = Ĝ ∪ G*
5:   Let S correspond to the indices of the top k_2 entries of v_{G*} by magnitude
6:   Define v̄ ∈ R^p with v̄_S = (v_{G*})_S and v̄_i = 0 for i ∉ S
7:   Ŝ = Ŝ ∪ S
8:   v = v − v̄
9:   u = u + v̄
10: end for
11: Output û := u, Ĝ, Ŝ

First, we provide a general result that extracts the key property of the projection operator required by our proof; we then show that Algorithm 3 satisfies this property. In particular, we assume that there is a set of supports S_{k*} such that supp(w*) ∈ S_{k*}. Also, let S_k ⊆ {0,1}^p be such that S_{k*} ⊆ S_k. Moreover, for any given z ∈ R^p, suppose there exists an efficient procedure A_k that finds S ∈ S_k such that the following holds for all S* ∈ S_{k*}:

‖z_{S\S*}‖_2² ≤ (k* · β / k) · ‖z_{S*\S}‖_2² + ε,   (9)

where ε > 0 and β is a function of ε. We then obtain the following general result for such structures:

Theorem 4.1. Let w* = arg min_{w : supp(w) ∈ S_{k*}} f(w), where S_{k*} ⊆ {0,1}^p is a fixed set parameterized by k*. Let f satisfy RSC/RSS with constants α_k, L_k, respectively (see Definition 2.1). Also, let S_k ⊆ {0,1}^p be such that S_{k*} ⊆ S_k, and suppose there exists a projection procedure A_k satisfying (9). Then, the following holds for the T = O((L_{2k+k*}/α_{2k+k*}) · log(‖w*‖_2/ε))-th iterate of Algorithm 1 when A_k is used as the projection operator:

‖w^T − w*‖_2 ≤ 2ε + (10 · L_{2k+k*}/α_{2k+k*}) · γ,

where k = O((L_{2k+k*}/α_{2k+k*})² · k* · β(ε α_{2k+k*}/L_{2k+k*})), and γ = (2/L_{2k+k*}) · max_{S ∈ S_k} ‖(∇f(w*))_S‖_2.
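Algorithm 3 admits the same compact transcription as Algorithm 2, with the extra within-group hard threshold (a sketch; `greedy_project_sog` is an illustrative name):

```python
import numpy as np

def greedy_project_sog(g, groups, k1, k2):
    """Algorithm 3 sketch: greedily pick k1 groups by residual norm,
    keeping only the top-k2 entries (by magnitude) inside each
    selected group."""
    v = g.astype(float).copy()
    u = np.zeros_like(v)
    chosen = []
    for _ in range(k1):
        best = max((i for i in range(len(groups)) if i not in chosen),
                   key=lambda i: np.linalg.norm(v[groups[i]]))
        idx = np.asarray(groups[best])
        top = idx[np.argsort(-np.abs(v[idx]))[:k2]]  # top-k2 inside G*
        u[top] += v[top]
        v[top] = 0.0
        chosen.append(best)
    return u, chosen
```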
[Figure 1 here: panels (a) Well Conditioned and (b) Poorly Conditioned plot log(objective) vs. time (seconds) for IHT, IHT+FC, CoGEnT, FW, and GOMP; panels (c) and (d) are phase transition plots (measurements vs. condition number, poorly conditioned) for IHT and GOMP, respectively.]
Figure 1: Performance comparisons on synthetic data. IHT and IHT-FC are our proposed methods. (Best seen in color.)

We prove this result in Appendix F. Consequently, a result similar to Theorem 3.2 can be obtained for the case when f is the least squares loss function. We now show that (9) holds for the SoG case. For simplicity, we provide the result for the non-overlapping case; for overlapping groups a similar result can be obtained by combining the following lemma, proved in Appendix G, with Lemma D.2.

Lemma 4.2. Let G = {G_1, ..., G_M} be M non-overlapping groups. Let G-supp(w*) = {i*_1, ..., i*_{k*}}. Let Ĝ be the groups selected by Algorithm 3 applied to z, and let S_i be the selected set of coordinates from group G_i, i ∈ Ĝ. Let S = ∪_i S_i, and let S* = ∪_i (S*)_i = supp(w*). Also, let G* be the set of groups containing S*. Then, the following holds:

‖z_{S\S*}‖_2² ≤ max(k*_1/k_1, k*_2/k_2) · ‖z_{S*\S}‖_2².
The above lemma along with Theorem 4.1 shows that for the SoG case, we need to project onto a larger number of groups (than k*_1) and a larger number of elements within each group (than k*_2). In particular, we select k_i ≈ (L_{2k+k*}/α_{2k+k*})² · k*_i for both i = 1, 2. Combining the above lemma with Theorem 4.1 and an analogue of Lemma 3.3 also provides a sample complexity bound for estimating w* from (y, X) s.t. y = Xw* + β. Specifically, the sample complexity evaluates to n ≥ κ² k*_1 log(M) + κ² k*_1 k*_2 log(max_i |G_i|). An interesting extension of the SoG case is that of a hierarchy of overlapping, sparsely activated groups. When the groups at each level do not overlap, this reduces to the case considered in [18]. However, our theory shows that when a corresponding approximate projection operator is defined for the hierarchical overlapping case (extending Algorithm 3), IHT methods can be used to obtain the solution efficiently.
5 Experiments and Results
In this section, we empirically compare and contrast our proposed group IHT methods against existing approaches for the overlapping group sparsity problem. Greedy methods for the group lasso problem have been shown to outperform proximal point schemes, and hence we restrict our comparison to greedy procedures. We compared the following methods: our algorithm with (IHT-FC) and without (IHT) the fully corrective step, the Frank-Wolfe (FW) method [36], CoGEnT [28], and Group OMP (GOMP) [35]. All relevant hyper-parameters were chosen via a grid search, and experiments were run on a MacBook laptop with a 2.5 GHz processor and 16 GB of memory.

Synthetic Data, well conditioned: We first compared various greedy schemes for solving the overlapping group sparsity problem on synthetic data. We generated M = 1000 groups of contiguous indices of size 25, with the last 5 entries of each group overlapping the first 5 of the next. We randomly set 50 of these groups to be active, populated with uniform [−1, 1] entries. This yields w* ∈ R^p with p ≈ 22000. X ∈ R^{n×p} with n = 5000 and X_ij ~ N(0, 1) i.i.d. Each measurement is corrupted with Additive White Gaussian Noise (AWGN) with standard deviation σ = 0.1.
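The synthetic setup above can be reproduced schematically as follows (a smaller M than in the experiments, for speed; the group size, 5-entry overlap, and noise level follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)
M, size, overlap = 50, 25, 5
# contiguous groups; consecutive groups share `overlap` indices
groups = [list(range(i * (size - overlap), i * (size - overlap) + size))
          for i in range(M)]
p = groups[-1][-1] + 1
active = rng.choice(M, 5, replace=False)       # a few active groups
w_star = np.zeros(p)
for i in active:
    w_star[groups[i]] = rng.uniform(-1, 1, size)
n = 500
X = rng.standard_normal((n, p))
y = X @ w_star + 0.1 * rng.standard_normal(n)  # AWGN, sigma = 0.1
```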
Figure 1(a) shows that the hard thresholding methods achieve orders-of-magnitude speedups compared to the competing schemes, and achieve almost the same (final) objective function value despite approximate projections.

Synthetic Data, poorly conditioned: Next, we consider the exact same setup, but with each row of X given by x_i ~ N(0, Σ) where κ = σ_max(Σ)/σ_min(Σ) = 10. Figure 1(b) again shows the advantages of using IHT methods; IHT-FC is about 10 times faster than the next best method, CoGEnT. We next generate phase transition plots for recovery by our method (IHT) as well as the state-of-the-art GOMP method. We generate vectors in the same vein as the above experiment, with M = 500, B = 15, k = 25, p ≈ 5000, and vary the condition number of the data covariance Σ as well as the number of measurements n. Figures 1(c) and 1(d) show the phase transition plots as the measurements and condition number are varied for IHT and GOMP, respectively; results are averaged over 10 independent runs. It can be seen that even for condition numbers as high as 200, n ≈ 1500 measurements suffice for IHT to exactly recover w*, whereas GOMP in the same setting is never able to recover w*.

Noisy Compressed Sensing: Here, we apply our proposed methods in the compressed sensing framework studied by [29] to recover sparse wavelet coefficients of signals. We used the standard test signals (Table 1) of length 2048, obtained 512 Gaussian measurements, and set k = 100 for IHT and GOMP.

Table 1: MSE on standard test signals using IHT with full corrections.
Signal            | IHT    | GOMP  | CoGEnT
Blocks            | .00029 | .0011 | .00066
HeaviSine         | .0026  | .0029 | .0021
Piece-Polynomial  | .0016  | .0017 | .0022
Piece-Regular     | .0025  | .0039 | .0015

Figure 2: (a) The blocks signal (top) and its recovery (bottom) using IHT; (b) SoG: error vs. time comparison for various methods; (c) SoG: reconstruction of the true signal (top) by IHT-FC (middle) and CoGEnT (bottom).
IHT is competitive (in terms of accuracy) with the state of the art in convex methods, while being significantly faster. Figure 2(a) shows the recovered blocks signal using IHT.

Tumor Classification, Breast Cancer Dataset: We next compare the aforementioned methods on a gene selection problem for breast cancer tumor classification, using the data in [38, 34].² We ran a 5-fold cross validation scheme to choose parameters, varying η ∈ {2⁻⁵, 2⁻⁴, ..., 2³}, k ∈ {2, 5, 10, 15, 20, 50, 100}, and τ ∈ {2³, 2⁴, ..., 2¹³}, and randomly chose 15% of the data to test on. Table 2 shows that the vanilla hard thresholding method is competitive despite performing approximate projections, and the method with full corrections obtains the best performance among the methods considered.

Table 2: Tumor classification: misclassification rate of various methods. IHT-FC is about as fast as CoGEnT but has about 2% higher accuracy.
Method  | Misclassification % | Time (sec)
FW      | 29.41               | 6.4538
IHT     | 27.94               | 0.0400
GOMP    | 25.01               | 0.2891
CoGEnT  | 23.53               | 0.1414
IHT-FC  | 21.65               | 0.1601

Sparse Overlapping Group Lasso: Finally, we study the sparse overlapping group (SoG) problem that was introduced and analyzed in [27]. We perform projections as detailed in Algorithm 3. We generated a toy vector with 100 groups of size 50 and randomly selected 5 groups to be active; among the active groups, only 30 coefficients per group were set to be non-zero. The groups themselves were overlapping, with the last 10 entries of each group shared with the first 10 of the next, yielding p ≈ 4000. We chose the best parameters from a grid, and set k = 2k* for the IHT methods. Figure 2(b) shows that IHT-FC obtains a significantly more accurate solution than the other methods; Figure 2(c) shows the true and recovered vectors.

²Downloaded from http://cbio.ensmp.fr/~ljacob/
6  Conclusions and Discussion

We proposed a greedy IHT method that can be applied to regression problems over sets of group-sparse vectors. Our proposed solution is efficient and scalable, and provides fast convergence guarantees under general RSC/RSS-style conditions. We extended our analysis to handle even more challenging structures such as sparse overlapping groups. Our experiments show that IHT methods achieve fast, accurate results even with greedy and approximate projections.
References

[1] Francis Bach. Convex analysis and optimization with submodular functions: a tutorial. arXiv preprint arXiv:1010.4207, 2010.
[2] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982–2001, 2010.
[3] Nirav Bhan, Luca Baldassarre, and Volkan Cevher. Tractability of interpretability via selection of group-sparse models. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages 1037–1041. IEEE, 2013.
[4] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. Information Theory, IEEE Transactions on, 57(7):4660–4671, 2011.
[5] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[6] Thomas Blumensath and Mike E Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. Information Theory, IEEE Transactions on, 55(4):1872–1882, 2009.
[7] Thomas Blumensath and Mike E Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. Selected Topics in Signal Processing, IEEE Journal of, 4(2):298–309, 2010.
[8] Thomas Blumensath, Mehrdad Yaghoobi, and Mike E Davies. Iterative hard thresholding and l0 regularisation. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 3, pages III-877. IEEE, 2007.
[9] Soumyadeep Chatterjee, Arindam Banerjee, Snigdhanshu Chatterjee, and Auroop R Ganguly. Sparse group lasso for regression on land climate variables. In 2011 11th IEEE International Conference on Data Mining Workshops, pages 1–8. IEEE, 2011.
[10] Joseph C Dunn. Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM Journal on Control and Optimization, 19(3):368–400, 1981.
[11] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. on Num. Anal., 49(6):2543–2563, 2011.
[12] Raja Giryes and Deanna Needell. Greedy signal space methods for incoherence and beyond. Applied and Computational Harmonic Analysis, 39(1):1–20, 2015.
[13] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. arXiv preprint arXiv:1406.1579, 2014.
[14] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371–3412, 2011.
[15] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.
[16] Prateek Jain, Ambuj Tewari, and Inderjit S Dhillon. Orthogonal matching pursuit with replacement. In Advances in Neural Information Processing Systems, pages 1215–1223, 2011.
[17] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
[18] Rodolphe Jenatton, Julien Mairal, Francis R Bach, and Guillaume R Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487–494, 2010.
[19] Anastasios Kyrillidis and Volkan Cevher. Combinatorial selection and least absolute shrinkage via the clash algorithm. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2216–2220. IEEE, 2012.
[20] Karim Lounici, Massimiliano Pontil, Alexandre B Tsybakov, and Sara Van De Geer. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468, 2009.
[21] Karim Lounici, Massimiliano Pontil, Alexandre B Tsybakov, and Sara Van De Geer. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468, 2009.
[22] Lukas Meier, Sara Van De Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
[23] Moshe Mishali and Yonina C Eldar. Blind multiband signal reconstruction: Compressed sensing for analog signals. Signal Processing, IEEE Transactions on, 57(3):993–1009, 2009.
[24] Sofia Mosci, Silvia Villa, Alessandro Verri, and Lorenzo Rosasco. A primal-dual algorithm for group sparse regularization with overlapping groups. In Advances in Neural Information Processing Systems, pages 2604–2612, 2010.
[25] Deanna Needell and Joel A Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[26] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978.
[27] Nikhil Rao, Christopher Cox, Rob Nowak, and Timothy T Rogers. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Advances in Neural Information Processing Systems, pages 2202–2210, 2013.
[28] Nikhil Rao, Parikshit Shah, and Stephen Wright. Conditional gradient with enhancement and truncation for atomic-norm regularization. In NIPS Workshop on Greedy Algorithms, 2013.
[29] Nikhil S Rao, Robert D Nowak, Stephen J Wright, and Nick G Kingsbury. Convex approaches to model wavelet sparsity patterns. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 1917–1920. IEEE, 2011.
[30] Nikhil S Rao, Ben Recht, and Robert D Nowak. Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics, pages 942–950, 2012.
[31] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.
[32] Parikshit Shah and Venkat Chandrasekaran. Iterative projections for signal identification on manifolds: Global recovery guarantees. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 760–767. IEEE, 2011.
[33] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[34] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.
[35] Grzegorz Swirszcz, Naoki Abe, and Aurelie C Lozano. Grouped orthogonal matching pursuit for variable selection and prediction. In Advances in Neural Information Processing Systems, pages 1150–1158, 2009.
[36] Ambuj Tewari, Pradeep K Ravikumar, and Inderjit S Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, pages 882–890, 2011.
[37] Joel Tropp, Anna C Gilbert, et al. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655–4666, 2007.
[38] Marc J Van De Vijver, Yudong D He, Laura J van't Veer, Hongyue Dai, Augustinus AM Hart, Dorien W Voskuil, George J Schreiber, Johannes L Peterse, Chris Roberts, Matthew J Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
[39] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 127–135, 2014.
A  Proof of Lemma 2.3

Proof. Let $S$ and $T$ be two sets of groups such that $S \subseteq T$. Let $SS = \mathrm{supp}(\cup_{j \in S} G_j)$ and $TT = \mathrm{supp}(\cup_{j \in T} G_j)$; then $SS \subseteq TT$. Hence,
$$z(S \cup i) - z(S) = \sum_{\ell \in SS \cup \mathrm{supp}(G_i)} x_\ell^2 - \sum_{\ell \in SS} x_\ell^2 = \sum_{\ell \in \mathrm{supp}(G_i) \setminus SS} x_\ell^2 \;\overset{\zeta_1}{\geq}\; \sum_{\ell \in \mathrm{supp}(G_i) \setminus TT} x_\ell^2 = z(T \cup i) - z(T),$$
where $\zeta_1$ follows from $SS \subseteq TT$. Hence proved.
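Lemma 2.3 is a diminishing-returns (submodularity) statement about the coverage objective $z(\cdot)$. A quick numeric sanity check on random toy groups (our own construction, not the paper's) is:

```python
import numpy as np

# z(S) = sum of x_l^2 over the support covered by the groups indexed by S.
rng = np.random.default_rng(1)
x = rng.standard_normal(20)
groups = [rng.choice(20, size=6, replace=False) for _ in range(8)]

def z(S):
    covered = set().union(*(set(groups[j]) for j in S)) if S else set()
    return sum(x[l] ** 2 for l in covered)

S, T, i = {0, 1}, {0, 1, 2, 3}, 5          # S is a subset of T; i is a new group
gain_S = z(S | {i}) - z(S)
gain_T = z(T | {i}) - z(T)
# Marginal gains can only shrink as the base set grows (Lemma 2.3):
assert gain_S >= gain_T - 1e-9
```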
B  Proof of Lemma 2.4

Proof. First, from the approximation property of the greedy algorithm [26],
$$\|\hat{u}\|^2 \geq \left(1 - e^{-k_0/k}\right)\|u^*\|^2. \tag{10}$$
Also, $\|g - \hat{u}\|^2 = \|g\|^2 - \|\hat{u}\|^2$, because $(\hat{u})_{\mathrm{supp}(\hat{u})} = (g)_{\mathrm{supp}(\hat{u})}$ and $\hat{u}$ is $0$ otherwise. Using the above two equations, we have:
$$\|g - \hat{u}\|^2 \leq \|g\|^2 - \|u^*\|^2 + e^{-k_0/k}\|u^*\|^2 = \|g - u^*\|^2 + e^{-k_0/k}\|u^*\|^2 = \|g - u^*\|^2 + e^{-k_0/k}\|(g)_{\mathrm{supp}(u^*)}\|^2, \tag{11}$$
where both equalities follow from the fact that, by optimality, $(u^*)_{\mathrm{supp}(u^*)} = (g)_{\mathrm{supp}(u^*)}$.
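The approximation property (10) is the classical guarantee for greedy maximization of a monotone submodular function [26]: after $k_0$ greedy picks, the captured energy is at least $(1 - e^{-k_0/k})$ times that of the best $k$ groups. A toy check against a brute-force optimum (sizes illustrative, our own setup):

```python
import numpy as np
from itertools import combinations

# Coverage-style objective: squared mass covered by a set of groups
# (monotone and submodular, which is what Lemma 2.3 establishes for z).
rng = np.random.default_rng(2)
x = rng.standard_normal(15)
groups = [rng.choice(15, size=4, replace=False) for _ in range(6)]

def mass(cover):
    return sum(x[l] ** 2 for l in cover)

def greedy(k0):
    cover = set()
    for _ in range(k0):
        j = max(range(len(groups)), key=lambda g: mass(cover | set(groups[g])))
        cover |= set(groups[j])
    return mass(cover)

k, k0 = 2, 4
best_k = max(mass(set().union(*(set(groups[j]) for j in c)))
             for c in combinations(range(len(groups)), k))
# Greedy with k0 picks captures at least (1 - e^{-k0/k}) of the best k groups:
assert greedy(k0) >= (1 - np.exp(-k0 / k)) * best_k - 1e-9
```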
C  Proof of Lemma 3.3

Proof. Note that
$$\|Xw\|_2^2 = \sum_i (x_i^T w)^2 = \sum_i (z_i^T \Sigma^{1/2} w)^2 = \|Z\Sigma^{1/2} w\|_2^2,$$
where $Z \in \mathbb{R}^{n \times p}$ is such that each row $z_i \sim N(0, I)$ is a standard multivariate Gaussian. Now, using Theorem 1 of [6], and using the fact that $\Sigma^{1/2}w$ lies in a union of $M^k$ subspaces, each of dimension at most $s$, we have (w.p. $\geq 1 - 1/(M^k \cdot 2^s)$):
$$\left(1 - \frac{4}{\sqrt{C}}\right)\|\Sigma^{1/2}w\|_2^2 \;\leq\; \frac{1}{n}\|Z\Sigma^{1/2}w\|_2^2 \;\leq\; \left(1 + \frac{4}{\sqrt{C}}\right)\|\Sigma^{1/2}w\|_2^2.$$
The result follows by using the definitions of $\sigma_{\min}$ and $\sigma_{\max}$.
D  Proof of Theorem 3.1

Proof. Recall that $g_t = w_t - \eta\nabla f(w_t)$ and $w_{t+1} = \hat{P}^G_k(g_t)$. Let $\mathrm{supp}(w_{t+1}) = S_{t+1}$, $\mathrm{supp}(w^*) = S_*$, $I = S_{t+1} \cup S_*$, and $M = S_* \setminus S_{t+1}$. Also, note that $|\text{G-supp}(I)| \leq k + k^*$. Moreover, $(w_{t+1})_{S_{t+1}} = (g_t)_{S_{t+1}}$. Hence, $\|(w_{t+1} - g_t)_{S_{t+1} \cup S_*}\|_2^2 = \|(g_t)_M\|_2^2$. Now, using Lemma D.2 with $z = (g_t)_I$, we have:
$$\|(w_{t+1} - g_t)_I\|_2^2 = \|(g_t)_M\|_2^2 \leq \frac{\|(g_t)_M\|_0^G}{k - \tilde{k}}\,\|(g_t)_{S_{t+1} \setminus S_*}\|_2^2 + \frac{\epsilon\,\|(g_t)_M\|_0^G}{k - \tilde{k}} \overset{\zeta_1}{\leq} \frac{k^*}{k - \tilde{k}}\,\|(g_t)_{S_{t+1} \setminus S_*}\|_2^2 + \frac{\epsilon k^*}{k - \tilde{k}} \overset{\zeta_2}{\leq} \frac{k^*}{k - \tilde{k}}\,\|(w^* - g_t)_I\|_2^2 + \frac{\epsilon k^*}{k - \tilde{k}}, \tag{12}$$
where $\zeta_1$ follows from $M \subset S_*$ and hence $|\text{G-supp}(M)| \leq |\text{G-supp}(S_*)| = k^*$; $\zeta_2$ follows from $(w^*)_{S_{t+1} \setminus S_*} = 0$.

Now, using the fact that $\|(w_{t+1} - w^*)_I\|_2 = \|w_{t+1} - w^*\|_2$ along with the triangle inequality, we have:
$$\|w_{t+1} - w^*\|_2 \leq \left(1 + \sqrt{\tfrac{k^*}{k - \tilde{k}}}\right)\|(w^* - g_t)_I\|_2 + \sqrt{\tfrac{\epsilon k^*}{k - \tilde{k}}} \tag{13}$$
$$\overset{\zeta_1}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k - \tilde{k}}}\right)\|(w^* - w_t - \eta(\nabla f(w^*) - \nabla f(w_t)))_I\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2 + \sqrt{\tfrac{\epsilon k^*}{k - \tilde{k}}}$$
$$\overset{\zeta_2}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k - \tilde{k}}}\right)\|(I - \eta H_{(I \cup S_t)(I \cup S_t)}(\alpha))(w_t - w^*)_{I \cup S_t}\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2 + \sqrt{\tfrac{\epsilon k^*}{k - \tilde{k}}}$$
$$\overset{\zeta_3}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k - \tilde{k}}}\right)\left(1 - \frac{\alpha_{2k+k^*}}{L_{2k+k^*}}\right)\|w_t - w^*\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2 + \sqrt{\tfrac{\epsilon k^*}{k - \tilde{k}}}, \tag{14}$$
where $\alpha = c\,w_t + (1-c)\,w^*$ for some $c > 0$ and $H(\alpha)$ is the Hessian of $f$ evaluated at $\alpha$. Here $\zeta_1$ follows from the triangle inequality, $\zeta_2$ follows from the mean value theorem, and $\zeta_3$ follows from the RSC/RSS condition and by setting $\eta = 1/L_{2k+k^*}$. The theorem now follows by setting $k = \left(2\,L_{2k+k^*}^2/\alpha_{2k+k^*}^2 + 1\right) k^* \log(\|w^*\|_2/\epsilon)$ and choosing $\epsilon$ appropriately.

Lemma D.1. Let $w = \hat{P}^G_k(g)$ and let $S = \mathrm{supp}(w)$. Then, for every $I$ such that $S \subseteq I$, the following holds: $w_I = \hat{P}^G_k(g_I)$.

Proof. Let $Q = \{i_1, i_2, \dots, i_k\}$ be the $k$ groups selected when the greedy procedure (Algorithm 1) is applied to $g$. Then,
$$\|g_{G_{i_j} \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2 \;\geq\; \|g_{G_i \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2, \qquad \forall\, 1 \leq j \leq k,\ \forall\, i \notin Q.$$
Moreover, the greedy selection procedure is deterministic. Hence, even if the candidate groups $G_i$ are restricted to lie in a subset of $G$ containing the selected groups, the output of the procedure remains exactly the same.

Lemma D.2. Let $z \in \mathbb{R}^p$ be any vector. Let $\hat{w} = \hat{P}^G_k(z)$ and let $w^* \in \mathbb{R}^p$ be such that $|\text{G-supp}(w^*)| \leq k^*$. Let $S = \mathrm{supp}(\hat{w})$, $S_* = \mathrm{supp}(w^*)$, $I = S \cup S_*$, and $M = S_* \setminus S$. Then, the following holds:
$$\frac{\|z_M\|_2^2}{k^*} - \frac{\|z_{S \setminus S_*}\|_2^2}{k - \tilde{k}} \leq \frac{\epsilon}{k - \tilde{k}},$$
where $\tilde{k} = O(k^* \log(\|w^*\|_2/\epsilon))$.

Proof. Recall that the $k$ groups are added greedily to form $S = \mathrm{supp}(\hat{w})$. Let $Q = \{i_1, i_2, \dots, i_k\}$ be the $k$ groups selected when the greedy procedure (Algorithm 1) is applied to $z$. Then,
$$\|z_{G_{i_j} \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2 \geq \|z_{G_i \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2, \qquad \forall\, 1 \leq j \leq k,\ \forall\, i \notin Q.$$
Now, as $\cup_{1 \leq \ell \leq j-1} G_{i_\ell} \subseteq S$ for all $1 \leq j \leq k$, we have:
$$\|z_{G_{i_j} \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2 \geq \|z_{G_i \setminus S}\|_2^2, \qquad \forall\, 1 \leq j \leq k,\ \forall\, i \notin Q.$$
Let $\text{G-supp}(w^*) = \{\ell_1, \dots, \ell_{k^*}\}$. Then, adding the above inequalities for each $\ell_j$ such that $\ell_j \notin Q$, we get:
$$\|z_{G_{i_j} \setminus (\cup_{1 \leq \ell \leq j-1} G_{i_\ell})}\|_2^2 \geq \frac{\|z_{S_* \setminus S}\|_2^2}{k^*}, \tag{15}$$
where the above inequality also uses the fact that $\sum_{\ell_j \in \text{G-supp}(w^*),\, \ell_j \notin Q} \|z_{G_{\ell_j} \setminus S}\|_2^2 \geq \|z_{S_* \setminus S}\|_2^2$. Adding (15) for all $\tilde{k}+1 \leq j \leq k$, we get:
$$\|z_S\|_2^2 - \|z_B\|_2^2 \geq \frac{k - \tilde{k}}{k^*}\,\|z_{S_* \setminus S}\|_2^2, \tag{16}$$
where $B = \cup_{1 \leq j \leq \tilde{k}} G_{i_j}$. Moreover, using Lemma 2.4 and the fact that $|\text{G-supp}(z_{S_*})| \leq k^*$, we get $\|z_B\|_2^2 \geq \|z_{S_*}\|_2^2 - \epsilon$. Hence,
$$\frac{\|z_M\|_2^2}{k^*} \leq \frac{\|z_S\|_2^2 - \|z_B\|_2^2}{k - \tilde{k}} \leq \frac{\|z_S\|_2^2 - \|z_{S_*}\|_2^2 + \epsilon}{k - \tilde{k}} \leq \frac{\|z_{S \setminus S_*}\|_2^2 + \epsilon}{k - \tilde{k}}. \tag{17}$$
The lemma now follows by a simple manipulation of the above chain of inequalities.
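The greedy projection $\hat{P}^G_k$ analyzed in Lemmas D.1 and D.2 can be sketched as follows; this is our reading of the procedure (Algorithm 1): repeatedly add the group contributing the largest incremental squared mass, then keep the input on the covered support and zero it elsewhere.

```python
import numpy as np

# Greedy group "projection": repeatedly add the group with the largest new
# squared mass; then (w)_S = (g)_S on the covered support, 0 elsewhere.
def greedy_group_project(g, groups, k):
    covered = np.zeros(g.shape, dtype=bool)
    for _ in range(k):
        gains = [np.sum(g[G[~covered[G]]] ** 2) for G in groups]
        covered[groups[int(np.argmax(gains))]] = True
    w = np.zeros_like(g)
    w[covered] = g[covered]
    return w

g = np.array([3.0, 0.1, 2.0, 0.0, 1.0, 0.5])
groups = [np.array([0, 1]), np.array([1, 2]), np.array([3, 4]), np.array([4, 5])]
w = greedy_group_project(g, groups, k=2)   # picks group [0,1] first, then [1,2]
```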
E  Proof of Theorem 3.4

Proof. Recall that $g_t = w_t - \eta\nabla f(w_t)$ and $w_{t+1} = P^G_k(g_t)$. Similar to the proof of Theorem 3.1, we define $S_{t+1} = \mathrm{supp}(w_{t+1})$, $S_t = \mathrm{supp}(w_t)$, $S_* = \mathrm{supp}(w^*)$, $I = S_{t+1} \cup S_*$, $J = I \cup S_t$, and $M = S_* \setminus S_{t+1}$. Also, note that $|\text{G-supp}(I)| \leq k + k^*$ and $|\text{G-supp}(J)| \leq 2k + k^*$.

Now, using Lemma E.1 with $z = (g_t)_I$, we have $\|(w_{t+1} - g_t)_I\|_2^2 \leq \frac{k^*}{k}\,\|(w^* - g_t)_I\|_2^2$. The remaining proof follows that of Theorem 3.1 closely. That is, using the above inequality with the triangle inequality, we have:
$$\|w_{t+1} - w^*\|_2 \leq \left(1 + \sqrt{\tfrac{k^*}{k}}\right)\|(w^* - g_t)_I\|_2 \overset{\zeta_1}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k}}\right)\|(w^* - w_t - \eta(\nabla f(w^*) - \nabla f(w_t)))_I\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2$$
$$\overset{\zeta_2}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k}}\right)\|(I - \eta H_{J,J}(\alpha))(w_t - w^*)_J\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2$$
$$\overset{\zeta_3}{\leq} \left(1 + \sqrt{\tfrac{k^*}{k}}\right)\left(1 - \frac{\alpha_{2k+k^*}}{L_{2k+k^*}}\right)\|w_t - w^*\|_2 + 2\eta\|(\nabla f(w^*))_{S_{t+1}}\|_2, \tag{18}$$
where $\alpha = c\,w_t + (1-c)\,w^*$ for some $c > 0$ and $H(\alpha)$ is the Hessian of $f$ evaluated at $\alpha$. Here $\zeta_1$ follows from the triangle inequality, $\zeta_2$ follows from the mean value theorem, and $\zeta_3$ follows from the RSC/RSS condition and by setting $\eta = 1/L_{2k+k^*}$. The theorem now follows by setting $k = 2\,L_{2k+k^*}^2/\alpha_{2k+k^*}^2 \cdot k^*$.

Lemma E.1. Let $z \in \mathbb{R}^p$ be such that it is spanned by $M$ groups, and let $\hat{w} = P^G_k(z)$ and $w^* = P^G_{k^*}(z)$, where $k \geq k^*$ and $G = \{G_1, \dots, G_M\}$. Then, the following holds:
$$\|\hat{w} - z\|_2^2 \leq \frac{M - k}{M - k^*}\,\|w^* - z\|_2^2.$$

Proof. Let $S = \mathrm{supp}(\hat{w})$ and $S_* = \mathrm{supp}(w^*)$. Since $\hat{w}$ is a projection of $z$, $\hat{w}_S = z_S$ and $\hat{w}$ is $0$ otherwise; similarly, $w^*_{S_*} = z_{S_*}$. So, to prove the lemma we need to show that:
$$\|z_{\bar{S}}\|_2^2 \leq \frac{M - k}{M - k^*}\,\|z_{\bar{S}_*}\|_2^2. \tag{19}$$
We first construct a group-support set $A$: we initialize $A = B$, where $B$ is the group support of $w^*$, and then iteratively add $k - k^*$ groups greedily to form $A$. That is, $A = A \cup A_i$, where $A_i = \mathrm{supp}(P^G_1(z_{\bar{A}}))$.

Let $\tilde{w} \in \mathbb{R}^p$ be such that $\tilde{w}_A = z_A$ and $\tilde{w}_{\bar{A}} = 0$; note that $\|\tilde{w}\|_0^G \leq |A| = k$. Then, using the optimality of $\hat{w}$, we have:
$$\|z_{\bar{S}}\|_2^2 \leq \|z_{\bar{A}}\|_2^2. \tag{20}$$
Now,
$$\frac{\|z_{\bar{B}}\|_2^2}{M - k^*} - \frac{\|z_{\bar{A}}\|_2^2}{M - k} = \frac{\|z_{\bar{B} \setminus \bar{A}}\|_2^2}{M - k^*} - \frac{k - k^*}{(M - k^*)(M - k)}\,\|z_{\bar{A}}\|_2^2. \tag{21}$$
By construction, $\bar{B} \setminus \bar{A} = \cup_{i=1}^{k - k^*} A_i$. Moreover, $\bar{A}$ is spanned by at most $M - k$ groups. Since the $A_i$'s are constructed greedily, we have $\|z_{A_i}\|_2^2 \geq \frac{\|z_{\bar{A}}\|_2^2}{M - k}$. Adding this for all $1 \leq i \leq k - k^*$, we get:
$$\|z_{\bar{B} \setminus \bar{A}}\|_2^2 = \sum_{i=1}^{k - k^*} \|z_{A_i}\|_2^2 \geq \frac{k - k^*}{M - k}\,\|z_{\bar{A}}\|_2^2. \tag{22}$$
Using (20), (21), and (22), we get $\frac{\|z_{\bar{B}}\|_2^2}{M - k^*} - \frac{\|z_{\bar{S}}\|_2^2}{M - k} \geq 0$; since $\bar{B} = \bar{S}_*$, this is exactly (19). Hence proved.

F  Proof of Theorem 4.1

Proof. The theorem follows directly from the proof of Theorem 3.1, but with (12) replaced by the following equation:
$$\|(w_{t+1} - g_t)_I\|_2^2 = \|(g_t)_M\|_2^2 \overset{\zeta_1}{\leq} \frac{k^*}{k}\cdot\beta\,\|(g_t)_{S_{t+1} \setminus S_*}\|_2^2 + \epsilon \overset{\zeta_2}{\leq} \frac{k^*}{k}\cdot\beta\cdot\|(w^* - g_t)_I\|_2^2 + \epsilon, \tag{23}$$
where $\zeta_1$ follows from the assumption given in the theorem statement, and $\zeta_2$ follows from $(w^*)_{S_{t+1} \setminus S_*} = 0$.
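The iteration analyzed in Theorems 3.1, 3.4, and 4.1 is projected gradient descent with a group projection. A minimal sketch for least squares with non-overlapping groups, where the exact projection of Lemma E.1 reduces to keeping the $k$ largest-norm groups (the step size and problem sizes below are our illustrative choices, not the paper's tuned values):

```python
import numpy as np

# IHT: w_{t+1} = P_k^G(w_t - eta * grad f(w_t)), here for f(w) = 0.5||Xw - y||^2
# with disjoint groups, so the projection keeps the k largest-norm groups.
def iht(X, y, groups, k, eta, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = w - eta * X.T @ (X @ w - y)              # gradient step g_t
        norms = [np.linalg.norm(g[G]) for G in groups]
        keep = np.argsort(norms)[-k:]                # top-k groups: P_k^G(g_t)
        w = np.zeros_like(g)
        for j in keep:
            w[groups[j]] = g[groups[j]]
    return w

rng = np.random.default_rng(4)
groups = [np.arange(5 * j, 5 * j + 5) for j in range(10)]  # 10 disjoint groups, p = 50
w_star = np.zeros(50)
w_star[groups[2]] = 1.0
w_star[groups[7]] = -2.0                                   # 2 active groups
X = rng.standard_normal((200, 50)) / np.sqrt(200)          # well-conditioned design
y = X @ w_star                                             # noiseless measurements
w_hat = iht(X, y, groups, k=2, eta=0.5, iters=500)         # recovers w_star
```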
G  Proof of Lemma 4.2

Proof. Consider a group $G_i$ such that $i \in G \cap G^*$. Within a group, we select the elements $S_i$ by standard hard thresholding. Hence, using Lemma 1 from [17], we have:
$$\|z_{(S_*)_i \setminus S}\|_2^2 \leq \frac{k_2^*}{k_2}\,\|z_{S \setminus (S_*)_i}\|_2^2, \qquad \forall\, i \in G \cap G^*. \tag{24}$$
Due to the greedy selection, for each pair $G_i, G_j$ such that $i \in G \setminus G^*$ and $j \in G^* \setminus G$, we have:
$$\sum_{i \in G \setminus G^*} \|z_{S_i}\|_2^2 \geq \frac{|G \setminus G^*|}{|G^* \setminus G|} \sum_{j \in G^* \setminus G} \|z_{S_j}\|_2^2.$$
That is,
$$\sum_{i \in G \setminus G^*} \|z_{S_i}\|_2^2 \geq \frac{k_2}{k_2^*} \sum_{j \in G^* \setminus G} \|z_{S_j}\|_2^2. \tag{25}$$
The lemma now follows by adding (24) and (25).
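The two-stage SoG projection that Lemma 4.2 analyzes (Algorithm 3 in the paper) can be sketched as below. The names `k1`, `k2` and this exact flow are our reading of the text: hard-threshold to the $k_2$ largest entries inside every group, then greedily keep the $k_1$ groups with the largest truncated mass.

```python
import numpy as np

# Two-stage projection for sparse overlapping groups: keep the k2 largest
# entries inside every group, then keep the k1 groups of largest kept mass.
def sog_project(z, groups, k1, k2):
    kept_idx, trunc_mass = [], []
    for G in groups:
        top = G[np.argsort(np.abs(z[G]))[-k2:]]    # S_i: top-k2 entries of group i
        kept_idx.append(top)
        trunc_mass.append(np.sum(z[top] ** 2))
    best = np.argsort(trunc_mass)[-k1:]            # greedy selection of k1 groups
    w = np.zeros_like(z)
    for j in best:
        w[kept_idx[j]] = z[kept_idx[j]]
    return w

z = np.array([5.0, 0.1, 0.2, 3.0, 2.9, 0.0, 0.3, 0.2, 0.1])
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
w = sog_project(z, groups, k1=1, k2=2)   # keeps entries {0, 2} of the first group
```

For overlapping groups the greedy group-selection stage would need to account for shared coordinates; the disjoint-group case above is only meant to illustrate the two thresholding stages.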