On the Conditions of Sparse Parameter Estimation via Log-Sum Penalty Regularization Zheng Pan, Guangdong Hou, Changshui Zhang∗
arXiv:1308.6504v1 [cs.IT] 29 Aug 2013
State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University
Abstract

For high-dimensional sparse parameter estimation problems, Log-Sum Penalty (LSP) regularization effectively reduces the required sampling size in practice. However, theoretical analysis supporting this empirical experience has been lacking. The analysis of this article shows that, like ℓ0-regularization, a sampling size of O(s) is enough for proper LSP, where s is the number of non-zero components of the true parameter. We also propose an efficient algorithm to solve the LSP regularization problem. The solutions given by the proposed algorithm are consistent parameter estimations under less restrictive conditions than ℓ1-regularization.

Keywords: parameter estimation, compressed sensing, log-sum penalty, sparse eigenvalue, proximal coordinate descent
1. Introduction

We consider a widely studied sparse linear representation model: for a vector y ∈ ℝ^n, there is an s-sparse vector θ* ∈ ℝ^p which approximately represents y under a design matrix X ∈ ℝ^{n×p}, in the sense that ‖y − Xθ*‖_2 ≤ λ_e for some noise level λ_e ≥ 0. In this paper, we are concerned with the high-dimensional case p > n. In recent years, research in the communities of compressed sensing [17, 10, 9, 8] and high-dimensional inference [41, 26] has shown that the solutions of the regularized regression in Eqn. (1) can give consistent parameter estimations of the true parameter θ* in the sense that ‖θ̂ − θ*‖_2 ≲ λ_e/√n:

θ̂ = arg min_θ F(θ), where F(θ) = (1/2n)‖y − Xθ‖_2^2 + R(θ).  (1)

∗Corresponding author. Email addresses: [email protected] (Zheng Pan), [email protected] (Guangdong Hou), [email protected] (Changshui Zhang)

Preprint submitted to Elsevier, August 30, 2013
In Eqn. (1), R(θ) is a sparsity-encouraging regularizer, e.g., the ℓ1-norm, the ℓq-norm (0 < q < 1), SCAD [18], MCP [42] or the capped-ℓ1 norm [45]. If the error z = y − Xθ* is modeled as Gaussian noise z ∼ N(0, σ²I_n), we have λ_e ≤ σ√(n + 2√(n log n)) with probability at least 1 − 1/n [5]. For Gaussian noise, the consistency of parameter estimations means ‖θ̂ − θ*‖_2 ≲ σ. In this paper, we call n the sampling size. We only study component-decomposable regularizers, i.e., regularizers that can be written as R(θ) = Σ_{i=1}^p r(|θ_i|) for any vector θ = (θ_1, ⋯, θ_p).

Whether the parameter estimation is consistent depends on the design matrix, the regularizer and the optimization method for Problem (1). In this paper, we call the conditions under which the parameter estimation is consistent for any s-sparse vector θ* parameter estimation conditions (PECs). PECs are usually established with sparse eigenvalues (SE) (Definition 1) or restricted eigenvalues (RE) [3].

Definition 1. For an integer t ≥ 1, we say that κ−(t) and κ+(t) are the minimum and maximum sparse eigenvalues (SE) of a matrix X if

κ−(t) ≤ ‖Xθ‖_2^2/(n‖θ‖_2^2) ≤ κ+(t)  (2)

holds for all θ with ‖θ‖_0 ≤ t.

A concept close to SE is the restricted isometry constant (RIC), defined as δ_t = inf{δ : 1 − δ ≤ ‖Xθ‖_2^2/n ≤ 1 + δ, ‖θ‖_2 = 1, θ is t-sparse}. For RIC, we have δ_t = (κ+(t) − κ−(t))/(κ+(t) + κ−(t)), where δ_t is actually the RIC of X rescaled by √(2/(κ+(t) + κ−(t))). The rescaling of X avoids the scaling problem of RIC [19]. For ℓ1-regularization (the ℓ1-norm as the regularizer), many PECs based on SE, RIC and RE were proposed in recent years, e.g., κ+(2s)/κ−(2s) or the RIC is upper bounded by a constant [19, 6, 23], or the RE is strictly larger than 0 [3, 26]. In spite of being
described in different forms, these PECs actually require n to be at least O(s log p) in the case of Gaussian design matrices [1, 29], where Gaussian design matrices have i.i.d. elements drawn from N(0, 1). In the asymptotic sense, the PECs for Gaussian design matrices are tightened to n > 2s log(p − s) + s + 1 by Wainwright [41]. Apart from the ℓ1-norm, the other component-decomposable sparsity-encouraging regularizers are non-convex. As a result, the solutions of Problem (1) are algorithm-dependent, and the corresponding PECs are also algorithm-dependent.
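For intuition, Definition 1 can be checked by brute force on tiny matrices: κ−(t) and κ+(t) are the extreme eigenvalues of the normalized Gram blocks X_T^T X_T/n over all supports T with |T| = t. The following is a minimal sketch; the 3×3 matrix is made up for illustration, closed-form 2×2 eigenvalues keep it dependency-free, and the enumeration is exponential in p:

```python
# Brute-force sparse eigenvalues kappa_-(t), kappa_+(t) from Definition 1,
# for t <= 2 only, via closed-form eigenvalues of the Gram blocks X_T^T X_T / n.
from itertools import combinations
from math import sqrt

def gram_block(X, cols, n):
    """Normalized Gram matrix (X_T^T X_T) / n for the columns in `cols`."""
    return [[sum(X[r][i] * X[r][j] for r in range(n)) / n for j in cols]
            for i in cols]

def sym_eigs(G):
    """Eigenvalues of a 1x1 or 2x2 symmetric matrix, in closed form."""
    if len(G) == 1:
        return [G[0][0]]
    a, b, c = G[0][0], G[0][1], G[1][1]
    d = sqrt((a - c) ** 2 + 4.0 * b * b)
    return [(a + c - d) / 2.0, (a + c + d) / 2.0]

def sparse_eigenvalues(X, t):
    n, p = len(X), len(X[0])
    lo, hi = float("inf"), 0.0
    for cols in combinations(range(p), t):
        eigs = sym_eigs(gram_block(X, cols, n))
        lo, hi = min(lo, min(eigs)), max(hi, max(eigs))
    return lo, hi

X = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 1.0]]
k_lo, k_hi = sparse_eigenvalues(X, 2)
delta_2 = (k_hi - k_lo) / (k_hi + k_lo)   # RIC of the rescaled matrix
```

For this particular X, every 2-column Gram block happens to be identical, giving κ−(2) = 0.25, κ+(2) = 7/12 and δ_2 = 0.4.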
The greedy algorithms OMP [37], ROMP [25] and CoSaMP [24] can be treated as algorithms for ℓ0-regularization (the ℓ0-norm as the regularizer). Their PECs are of the form δ_{as} ≤ c for some a ≥ 1 and c < 1, e.g., δ_{31s} ≤ 1/3 [46], δ_{2s} ≤ 0.03/√(log s) [25] and δ_{4s} ≤ 0.1 [24]. IHT [4] is another algorithm for ℓ0-regularization and its PECs have the same form, e.g., δ_{3s} < 1/√32 [4] or δ_{2s} < 1/3 [20]. The PLUS algorithm [42], proposed for MCP, has a PEC of κ+(t)/κ−(t) < 2t/s − 1 for some t ≥ s.

The iterative reweighted L1-minimization (IRL1) [11] as well as its generalizations [45] can be applied to general non-convex regularizers. They need almost the same PECs as ℓ1-regularization, e.g., κ+(2s)/κ−(6s) ≤ 2 [45]. Furthermore, the PECs of IRL1 cannot be improved beyond those of ℓ1-regularization, in the sense that when δ_{2s} > 1/√2, IRL1 fails to give consistent parameter estimation just as ℓ1-regularization does [14].

Proximal methods are also applicable to general non-convex regularizers [21]. But the existing PECs require the sparseness of the solutions in advance [27], which means the PECs cannot be verified until the methods output the solutions. The same problem occurs in the DC-based methods with the capped-ℓ1 norm as the regularizer [32, 33].

Zhang and Zhang [44] and Loh and Wainwright [22] show that some sparse local minimizers of Problem (1) give consistent parameter estimations of θ* if the RE is strictly positive. Their PECs are the same as those of ℓ1-regularization. For Gaussian design matrices, the above PECs for non-convex regularizers also require the sampling size to be at least O(s log p), since all of them either need a positive RE, or an RIC close to 0, or κ+(t_1)/κ−(t_2) upper bounded by a fixed value for some t_1 ≥ s and t_2 > 2s.

As a comparison, we consider the PECs for the global solutions of Eqn. (1) regularized by the ℓ0-norm. ℓ0-regularization only needs a sampling size of 2s [16] for the
noiseless case, or as (where a > 2 is a constant) for the noisy case [43]. The value of s can be regarded as the level of information contained in the true parameters. Therefore, ℓ0-regularization can estimate the true parameters with a sampling size at the information level. In previous work, moving from the ℓ0-norm to other regularizers, the sampling sizes had to be raised from O(s) to O(s log p). So there is a big gap from the viewpoint of PECs.

Recent research showed that some non-convex regularizers can give good parameter estimations with far smaller sampling sizes (or, equivalently, less restrictive PECs) than ℓ1-regularization in theory and practice. ℓq-regularization (0 < q ≤ 1) has a PEC of

κ+(2t)/κ−(2t) ≤ 4(√2 − 1)(t/s)^{1/q − 1/2} + 1  (3)

for some t > s [19], which is much less restrictive than that of the ℓ1-norm when q is small. If n ≥ 2t, which implies κ−(2t) > 0 with probability 1, there exists q > 0 such that the parameter estimation is consistent with ℓq-regularization. However, no algorithm can efficiently solve Problem (1) with the ℓq-norm as the regularizer, since the problem is strongly NP-hard [13]. For other non-convex regularizers, many empirical works also reported the superiority of non-convex regularizers in terms of sampling sizes [15, 34, 38, 40, 35, 11, 36, 39, 12, 28]. However, as stated above, the (theoretical) PECs for them, except ℓq-regularization, are still almost the same as those of ℓ1-regularization.

Feature selection is a stronger task than parameter estimation, which concerns estimating the support sets of the true parameters. Shen et al. [33] provide a necessary condition for feature selection, and their DC methods for the capped-ℓ1 penalty only need an almost necessary condition for feature selection. Their conditions, roughly κ−(s) ≳ s log p/n, still need n to be at least on the order of s log p.

Summarizing the advantages of the different regularizers, we think that an ideal regularizer should have the following three properties:

• Estimation by non-optimal solutions: Since Problem (1) is hard to solve exactly, an ideal regularizer should allow some non-optimal solutions to be consistent parameter estimations of the true parameters.

• Efficiency: There exist efficient algorithms to output the desired non-optimal solutions.

• Weak PECs: The PECs for global solutions only need O(s) sampling size, as ℓ0-regularization does. The PECs for the non-optimal solutions are also less restrictive than those of ℓ1-regularization.

This paper aims to seek such good regularizers and the corresponding efficient algorithms. Specifically, we study the regularizer named Log-Sum Penalty (LSP) [11]. Roughly speaking, we prove that the above three properties are satisfied for proper LSP regularizers. In particular, we obtain the following conclusions:

1. With proper LSP as the regularizer, the PECs for global solutions of Problem (1) only need O(s) sampling sizes for general design matrices.
2. The proposed OMP-PCD method gives suboptimal solutions efficiently. The main part of OMP-PCD is a coordinate descent algorithm, which only needs O(n/λ_e²) iterations to achieve a desired suboptimal solution.
3. For proper LSP, the solutions given by OMP-PCD are consistent parameter estimations of the true parameters under proper PECs. The proposed PECs are of the form κ+(t)/κ−(t) ≤ M_LSP, where the upper bound M_LSP, like Eqn. (3), becomes arbitrarily large as LSP approaches the ℓ0-norm.
4. For Gaussian design matrices, we derive concrete LSP regularizers that allow consistent parameter estimations with O(s) sampling sizes.
2. Main Result

First of all, we give the definition of LSP.

Definition 2. For any vector θ = (θ_1, ⋯, θ_p), the LSP of θ with parameters λ, γ is defined as R(θ) = Σ_{i=1}^p r(|θ_i|), where r(u) = λ² log(1 + u/(λγ)) for u ≥ 0.

In the rest of this paper, R(θ) and r(u) denote the LSP in particular. We use the following notation for simplicity. Suppose the design matrix is X = (x_1, x_2, ⋯, x_p) and define

ξ = max_{1≤i≤p} ‖x_i‖_2/√n and λ* = inf{ξ²u/2 + r(u)/u : u > 0}.

2.1. OMP-PCD Algorithm

Since LSP is a non-convex regularizer, the optimization method for Problem (1) is crucial for the consistency of parameter estimation. We study a method of Proximal Coordinate Descent (PCD) initialized by Orthogonal Matching Pursuit (OMP), which we name OMP-PCD. Algorithm 1 shows the pipeline of OMP-PCD.

Algorithm 1 OMP-PCD(X, y, λ, γ, s_0, ψ, τ)
Require: the design matrix X, the response y, the parameters λ and γ of LSP, the max number of non-zero components s_0, the parameter ψ of PCD, the tolerance τ.
1: θ^(0) = OMP(X, y, s_0) (OMP is given in Algorithm 2)
2: k ← 1
3: repeat
4:   for i = 1 to p do
5:     θ_i^(k) ← arg min_{u∈ℝ} (1/2)(‖x_i‖_2²/n + ψ²)u² − (ψ²θ_i^(k−1) + ω_i^(k))u + r(|u|),  (4)
       where ω_i^(k) = (1/n)(x_i^T y − Σ_{j≠i} x_i^T x_j θ_j) and each θ_j takes its most recently updated value
6:   end for
7:   k ← k + 1
8: until ‖X^T(Xθ^(k−1) − y)/n‖_∞ ≤ τ
9: return the last θ^(k−1)
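As a concrete (and deliberately naive) illustration of the update in Eqn. (4), the sketch below performs one PCD coordinate step on a tiny made-up instance, solving the one-dimensional subproblem by grid search rather than by a closed-form shrinkage. Because the candidate grid contains the current value, the step can never increase F, in line with Theorem 1:

```python
# One proximal-coordinate-descent step for LSP-regularized least squares,
# with the scalar subproblem of Eqn. (4) solved by brute-force grid search.
# X, y and all parameter values below are made up for the demonstration.
from math import log

lam, gamma, psi = 1.0, 0.5, 1.0
X = [[1.0, 0.3], [0.2, 1.0], [0.1, 0.1]]   # n = 3, p = 2
y = [6.0, 0.5, 0.2]
n, p = len(X), len(X[0])

def r(u):
    """LSP component penalty r(|u|) = lam^2 * log(1 + |u|/(lam*gamma))."""
    return lam * lam * log(1.0 + abs(u) / (lam * gamma))

def F(theta):
    """Objective of Problem (1)."""
    res = [y[i] - sum(X[i][j] * theta[j] for j in range(p)) for i in range(n)]
    return sum(v * v for v in res) / (2.0 * n) + sum(r(t) for t in theta)

def pcd_step(theta, i):
    col_sq = sum(X[k][i] ** 2 for k in range(n)) / n
    resid = [y[k] - sum(X[k][j] * theta[j] for j in range(p) if j != i)
             for k in range(n)]
    omega = sum(X[k][i] * resid[k] for k in range(n)) / n
    a = col_sq + psi * psi
    # candidates for Eqn. (4); keeping the current value guarantees descent
    cands = [theta[i]] + [k / 1000.0 - 3.0 for k in range(6001)]
    best = min(cands, key=lambda u: 0.5 * a * u * u
               - (psi * psi * theta[i] + omega) * u + r(u))
    out = list(theta)
    out[i] = best
    return out

theta0 = [0.0, 0.0]
theta1 = pcd_step(theta0, 0)
F0, F1 = F(theta0), F(theta1)
```

On this instance the first coordinate moves to a non-zero value and F strictly decreases, while the untouched coordinate stays exactly zero.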
Algorithm 2 OMP(X, y, s_0)
Require: the design matrix X, the response y, the maximum number of non-zero components s_0.
1: T = ∅, θ_OMP^(0) = 0
2: for k = 1 to s_0 do
3:   j ← arg max_i |x_i^T(Xθ_OMP^(k−1) − y)|
4:   T ← T ∪ {j}
5:   θ_OMP^(k) = arg min_θ ‖Xθ − y‖_2² s.t. supp(θ) = T
6: end for
7: return θ_OMP^(s_0)
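A minimal runnable sketch of Algorithm 2, with the correlation step and the restricted least-squares refit written out in plain Python. The tiny linear solver and the toy data are for illustration only; the refit solves the normal equations on the current support:

```python
# Sketch of Algorithm 2 (OMP): greedily pick the column most correlated
# with the residual, then least-squares refit on the selected support.

def solve(A, b):
    """Gaussian elimination with partial pivoting, for small dense systems."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda rr: abs(M[rr][c]))
        M[c], M[piv] = M[piv], M[c]
        for rr in range(c + 1, m):
            f = M[rr][c] / M[c][c]
            for k in range(c, m + 1):
                M[rr][k] -= f * M[c][k]
    x = [0.0] * m
    for c in range(m - 1, -1, -1):
        x[c] = (M[c][m] - sum(M[c][k] * x[k] for k in range(c + 1, m))) / M[c][c]
    return x

def omp(X, y, s0):
    n, p = len(X), len(X[0])
    T, theta = [], [0.0] * p
    for _ in range(s0):
        resid = [y[i] - sum(X[i][j] * theta[j] for j in range(p)) for i in range(n)]
        corr = [abs(sum(X[i][j] * resid[i] for i in range(n))) for j in range(p)]
        # skip already-selected columns (their correlation is zero after refit)
        j = max((jj for jj in range(p) if jj not in T), key=lambda jj: corr[jj])
        T.append(j)
        # least squares restricted to support T, via the normal equations
        A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in T] for a in T]
        rhs = [sum(X[i][a] * y[i] for i in range(n)) for a in T]
        sol = solve(A, rhs)
        theta = [0.0] * p
        for a, v in zip(T, sol):
            theta[a] = v
    return theta

X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [2.0, 3.0, 0.0]
theta_omp = omp(X, y, 2)
```

With orthonormal columns, two OMP iterations recover the 2-sparse representation exactly.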
Eqn. (4) is the main step of Algorithm 1, and it is derived from the following minimization problem:

θ_i^(k) = arg min_{u∈ℝ} F((θ_1^(k), ⋯, θ_{i−1}^(k), u, θ_{i+1}^(k−1), ⋯, θ_p^(k−1))^T) + (ψ²/2)(u − θ_i^(k−1))²,  (5)

where i = 1, ⋯, p, ψ > 0 is a positive constant and F(θ) is defined in Eqn. (1). The iteration in Eqn. (5) is called proximal coordinate descent [2]. The constant ψ balances decreasing F(θ) against not moving far from the previous iterate; ψ also affects the time complexity of OMP-PCD, as shown later. The parameter s_0 of OMP specifies the sparseness of the initial solution θ^(0) for PCD. The parameter τ is preferred to be 2λ* in this paper.

Eqn. (4) is a non-convex one-dimensional problem, but all of its solutions lie between 0 and (ψ²θ_i^(k−1) + ω_i^(k))/(ψ² + ‖x_i‖_2²/n). Define the following shrinkage function:

T_h(w) = arg min_u f_w(u), where f_w(u) = (1/2)(u − w)² + hλ² log(1 + |u|/(λγ)).  (6)

With T_h(w), θ_i^(k) can be written as θ_i^(k) = T_h((ψ²θ_i^(k−1) + ω_i^(k))/(ψ² + ‖x_i‖_2²/n)) with h = 1/(ψ² + ‖x_i‖_2²/n). One advantage of LSP is that we can derive a closed form for its shrinkage function:

T_h(w) = 0, if |w| < λ(2√h − γ) or w²/2 ≤ f_{|w|}(g(|w|));
T_h(w) = (sgn(w)/2)(|w| − λγ + √((|w| + λγ)² − 4hλ²)), if w²/2 > f_{|w|}(g(|w|)),  (7)
Figure 1: Shrinkage function of LSP with λ = 1, h = 1 and γ = 0.5, 0.1, 0.01 and 0.001.
where g(w) = (1/2)λγ(w/(λγ) − 1 + √((w/(λγ) + 1)² − 4h/γ²)). Figure 1 illustrates some examples of the shrinkage function. Section 4.8 shows how to obtain Eqn. (7).

The solutions given by OMP-PCD satisfy some good properties which are required later for consistent parameter estimation. We summarize them in Theorem 1 and Theorem 2.

Theorem 1. For the solution θ^(k) of the k-th iteration of OMP-PCD, we have:

1. {F(θ^(k))} is a decreasing sequence and converges;
2. Let the parameter of OMP-PCD be τ = 2λ* and let θ̂ be a global minimizer of Problem (1). Then OMP-PCD stops within K = 2ξ²(F(θ^(0)) − F(θ̂))/((λ*)²ψ²) iterations, and the output θ̃ satisfies

‖X^T(Xθ̃ − y)/n‖_∞ ≤ 2λ*.  (8)

3. Define ξ_0 = √(ξ² + ψ²). If γξ_0 ≤ 1, then

|θ_i^(k)| ≥ λ(1/ξ_0 − γ) or θ_i^(k) = 0.  (9)
Eqn. (9) shows that the solutions given by OMP-PCD are strong signals in the sense that the minimal magnitude of their non-zero components is strictly larger than zero. The second conclusion of Theorem 1 guarantees that the PCD part of OMP-PCD stops within O(1/(λ*)²) iterations. At the same time, Eqn. (8) implies that the solution θ̃ can represent y approximately in the sense that the error measurement ‖X^T(Xθ̃ − y)/n‖_∞ is on the order of the noise level. This error measurement is also used in the Dantzig Selector [7]. In fact, λ* can be upper bounded in terms of λ as follows (see Section 4.7 for the proof):

λ* ≤ λξ√(2 log(1 + 2/(ξ²γ²))).  (10)
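The quantity λ* = inf{ξ²u/2 + r(u)/u : u > 0} and the bound in Eqn. (10) are easy to check numerically; the sketch below approximates the infimum by a grid over u (the parameter values are arbitrary):

```python
# Grid approximation of lambda* and a check against the bound of Eqn. (10).
from math import log, sqrt

lam, gamma, xi = 1.0, 0.1, 1.0

def objective(u):
    return xi * xi * u / 2.0 + lam * lam * log(1.0 + u / (lam * gamma)) / u

lam_star = min(objective(k / 1000.0) for k in range(1, 20001))   # u in (0, 20]
bound = lam * xi * sqrt(2.0 * log(1.0 + 2.0 / (xi * xi * gamma * gamma)))
```

Here lam_star comes out to roughly 2.52 while the bound evaluates to roughly 3.26, consistent with Eqn. (10).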
Thus, with λ = O(λ_e/√n), the PCD part runs at most K = O(n/λ_e²) iterations. For the Gaussian noise setting z = y − Xθ* ∼ N(0, σ²I_n), we have K = O(1/σ²). Besides, Theorem 1 is independent of the initialization θ^(0) of the OMP-PCD algorithm; it actually describes the properties of the PCD part of OMP-PCD.

Another important property is that OMP-PCD outputs good suboptimal solutions under proper conditions. In this paper, suboptimal solutions are defined as follows.

Definition 3. If F(θ) ≤ F(θ*) + µ for some µ ≥ 0, we call θ a (µ, θ*)-suboptimal solution of Problem (1).

Theorem 2. Consider s_0 ≥ 1 and B ≥ 1 such that

‖y‖_2² exp(−Bλ_e²s_0/(2n‖θ*‖_1²)) ≤ 2(1 + B)λ_e².

Then the initialization θ^(0) in Algorithm 1, given by OMP, is a (µ_0, θ*)-suboptimal solution with

µ_0 ≤ λ²((1 + B)λ_e²/(nλ²) + (2λ_e/(λγ√n))√((s + s_0)/κ−(s + s_0))).

In Theorem 2, ‖y‖_2² exp(−Bλ_e²s_0/(2n‖θ*‖_1²)) decreases exponentially as s_0 increases, so the conditions of Theorem 2 are satisfied for sufficiently large s_0. The first conclusion of Theorem 1 implies that PCD gives an extra decrease to µ_0. That is to say, the solution θ̃ given by OMP-PCD is (µ_0 − µ_PCD, θ*)-suboptimal with µ_PCD = F(θ^(0)) − F(θ̃) ≥ 0. µ_PCD can be computed explicitly when OMP-PCD stops.

Based on Theorem 1 and Theorem 2, we can see how OMP and PCD cooperate. First, OMP gives a suboptimal solution θ^(0). θ^(0) may not be a strong signal as in Eqn. (9) or satisfy the error measurement constraint in Eqn. (8). However, PCD refines θ^(0) iteratively so that its solutions become a strong signal and satisfy the error
measurement constraint in Eqn. (8). Meanwhile, PCD not only preserves the (µ_0, θ*)-suboptimality of θ^(0), but also gives µ_0 an extra decrease, so that F(θ̃) is closer to the global optimum of Problem (1).

2.2. Parameter Estimation

Next, we show that suboptimal solutions satisfying Theorem 1 are consistent parameter estimations of the true parameters under proper assumptions. Before that, we denote α = (1 + η)/(1 − η) for a constant η > 0 (η is introduced in Theorem 3) and define

H_γ = √(s/(t − 1)) · ((γξ_0)^{−1/s} − 1)/((γξ_0)^{−α/(t−1)} − 1),  (11)

where ξ_0 is defined in Theorem 1. For H_γ, we have H_γ ≥ √(s/(t − 1)) (1/(γξ_0))^{1/s − α/(t−1)}. We also define

%(t) = ((√2 + 1)/4)(κ+(t)/κ−(t) − 1)

for any positive integer t.

Theorem 3. Suppose θ̃ is a (µ, θ*)-suboptimal solution of Problem (1) and the following four assumptions hold.

Assumption A The parameters λ and γ of LSP satisfy γ < 1/ξ_0 and λ ≥ (1/η)√(1/(2 log(1/(γξ)))) · λ_e/√n for some η ∈ (0, 1).
Assumption B θ̃ satisfies Eqn. (8) and Eqn. (9).
Assumption C The true parameter θ* satisfies min_{i∈supp(θ*)} |θ_i*| ≥ λ(1/ξ_0 − γ).
Assumption D There exists an integer t ≥ αs + 1 such that

%(2t) < H_γ.  (12)

Then,

‖θ̃ − θ*‖_2 ≤ C_1λ* + C_2 r^{−1}(µ/(1 − η)),  (13)

where C_1 = ((1 + √2)(2 + η)√t/(2κ−(2t))) · (2H_γ + 1)/(H_γ − %(2t)) and C_2 = (2%(2t) + 1)H_γ/(√t(H_γ − %(2t))).
Theorem 3 is established to analyze the parameter estimation for the solution θ̃ of OMP-PCD. Roughly speaking, Theorem 3 states that, if the true parameters are strong signals, OMP-PCD gives consistent parameter estimations with proper LSP regularizers and enough iterations.

We have some remarks on the conditions of Theorem 3. By Theorem 2, θ̃ is (µ_0 − µ_PCD, θ*)-suboptimal with µ_0 ≲ λ² and µ_PCD = F(θ^(0)) − F(θ̃). Assumption A is a constraint on the parameters of LSP; it needs small enough γ and λ = O(λ_e/√n). By Theorem 1, Assumption B is satisfied for the solutions of OMP-PCD after enough iterations. Assumption C requires that the true parameters are strong signals. Assumption D is actually a constraint on the sparse eigenvalue ratio κ+(2t)/κ−(2t) of the design matrix X; we will show in Section 2.3 that this assumption is satisfied for small enough γ or O(s) sampling sizes.

2.2.1. Global Solutions

In Theorem 3, µ = 0 corresponds to the case of global solutions. The following corollary shows the parameter estimation results for global solutions.

Corollary 1. Suppose Assumptions A, C and D of Theorem 3 hold for the global solution θ̂ of Problem (1). Then,

‖θ̂ − θ*‖_2 ≤ ((1 + √2)(1 + η)√t/(2κ−(2t))) · ((2H_γ + 1)/(H_γ − %(2t))) λ*.  (14)

For global solutions, we do not need Assumption B of Theorem 3, since the optimality of global solutions implies the strong-signal property of Eqn. (9) as well as ‖X^T(Xθ̂ − y)/n‖_∞ ≤ λ*, which is even stronger than Eqn. (8) (see the proof in Section 4.5).

For an example with γ ≤ 1/√3, ξ = 1, η = 0.2, t = 3s + 1 and λ = (1/η)√(1/(2 log(1/γ))) · λ_e/√n, we have

‖θ̂ − θ*‖_2 ≤ (53.2√(3s + 1)/κ−(6s + 2)) · λ_e/√n

provided that %(2t) ≤ (H_γ − 1)/2.
2.2.2. Parameter Estimation by OMP-PCD

Combining Theorem 2 and Theorem 3, we obtain the following corollary, which gives concrete examples of the parameter estimation results of OMP-PCD.

Corollary 2. Suppose the conditions of Theorem 3 hold with γ ≤ 1/√3, ξ = 1, η = 0.2, t = 3s + 1 and %(2t) ≤ (H_γ − 1)/2, and suppose the conditions of Theorem 2 hold with B = 1. Then, for λ = (1/η)√(1/(2 log(1/γ))) · λ_e/√n, we have

‖θ̃ − θ*‖_2 ≤ (T_1 + T_2)λ_e/√n,  (15)

where T_1 = 53.2√(3s + 1)/κ−(6s + 2) and

T_2 = (2.36γ^{0.8 − 1/(2s)}/√(s log(1/γ))) exp((0.906/γ)√(2(s + s_0)/κ−(s + s_0))).

For λ = 2.76λ_e/(γ√n), we have

‖θ̃ − θ*‖_2 ≤ (T_1′ + T_2′)λ_e/√n,  (16)

where T_1′ = (29.3√(3s + 1)/κ−(6s + 2)) · (1/γ)√(log(1/γ)) and

T_2′ = (1.84/√(s log(1/γ))) (1/γ)^{1/(2s)} [exp(0.33γ² + 1.82√((s + s_0)/κ−(s + s_0)) − 1.25µ_PCD/λ²) − 1].
The upper bound in Eqn. (15) increases faster than the bound in Eqn. (16) as γ → 0, but T_1 is independent of γ. As a consequence, Eqn. (15) is suitable for global solutions, for the case in which µ_PCD is large, or for the case in which λ_e is small. Eqn. (16) is preferred when µ_PCD is not so large and γ is not very small.

2.3. Less Restrictive PECs

In Theorem 3, Assumptions A and B are satisfied by setting proper parameters for the OMP-PCD algorithm, and Assumption C is a strong-signal requirement on the true parameters. Only Assumption D is a constraint on the design matrix X. In Assumption D, the smaller γ is, the larger H_γ is.

Suppose the elements of X ∈ ℝ^{n×p} are i.i.d. drawn from any distribution. If n ≥ 2t, we have κ−(2t) > 0 with probability 1, which means %(2t) is finite. On the other hand, H_γ ≥ √(s/(t − 1)) (1/(γξ_0))^{1/s − α/(t−1)} can be arbitrarily large as γ → 0 when t > αs + 1. Thus, we have the following corollary.
Corollary 3. Suppose αs is an integer and the elements of X ∈ ℝ^{n×p} are i.i.d. drawn from any distribution. If n ≥ 2αs + 4, there exists γ > 0 such that Assumption D of Theorem 3 is satisfied with probability 1.

For example, consider the case η = 0.2, so that α = 1.5. We set t = 3s + 1 and n = 6s + 2. In this case, H_γ ≥ (γξ_0)^{−1/(2s)}/√3, so %(2t) < H_γ is satisfied provided that γ is small enough.

For Gaussian design matrices, κ−(t) and %(t) can be bounded explicitly.

Lemma 1. Suppose the elements of X ∈ ℝ^{n×p} are i.i.d. drawn from N(0, 1). For any t > 0, if n ≥ 4t and p ≥ t log p, then

κ−(t) ≥ (1/4)√(t/(3ep)) (1 − (4/3)√(t/(3ep))(1 + √(4t log p/n)))  (17)

holds with probability at least 1 − C_3 exp(−C_4 n), where C_3 and C_4 are universal positive constants. Furthermore,

%(t) ≤ (709p/t) ((1 + √(4t log p/n))/(1 − 0.467√(t/p)(1 + √(4t log p/n))))² − 0.603  (19)

also holds with probability at least 1 − C_3 exp(−C_4 n).

With Eqn. (17) and Eqn. (19), we can derive a condition on γ which is sufficient for %(2t) < H_γ, i.e., for Assumption D of Theorem 3. For example, consider the case η = 0.2 again. With n = 24s + 8 and t = 3s + 1, we obtain the following value of γ, which is sufficient for %(2t) < H_γ:

ξ_0γ ≤ ((205p/s) ((1 + √(log p))/(1 − 1.14√(s/p)(1 + √(log p))))² − 1.04)^{−2s}.  (20)

Corollary 3 and its special case for Gaussian design matrices convey the message that, with small enough γ, consistent parameter estimation is achievable with only O(s) sampling size, at least for the global solutions.
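The mechanism behind Corollary 3 is easy to see numerically: for t > αs + 1, H_γ of Eqn. (11) blows up as γ → 0, so Eqn. (12) eventually holds for any finite %(2t). A small sketch (the values of s, η and γ are illustrative):

```python
# H_gamma of Eqn. (11) as a function of gamma, illustrating that it grows
# without bound as gamma -> 0 once t > alpha*s + 1.
from math import sqrt

def H(gamma, s, t, alpha, xi0=1.0):
    g = gamma * xi0
    return (sqrt(s / (t - 1.0))
            * (g ** (-1.0 / s) - 1.0) / (g ** (-alpha / (t - 1.0)) - 1.0))

s, eta = 2, 0.2
alpha = (1.0 + eta) / (1.0 - eta)        # alpha = 1.5
t = 3 * s + 1                            # the setting used in the examples
h_coarse = H(1e-1, s, t, alpha)
h_fine = H(1e-4, s, t, alpha)
```

Shrinking γ from 1e-1 to 1e-4 already grows H_γ severalfold in this setting, and the growth continues without bound.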
3. Conclusion

Consistent parameter estimation at O(s) sampling size is not an exclusive property of the ℓ0-norm. This property is also shared by LSP regularizers that are close to the ℓ0-norm but are not exactly the ℓ0-norm. Furthermore, Problem (1) is easier to solve with LSP than with the ℓ0-norm: the proposed OMP-PCD algorithm provides a theoretically justified way to obtain a good solution for parameter estimation, and the PECs of the practical solutions given by OMP-PCD are less restrictive than those of ℓ1-regularization.
4. Technical Proof

4.1. Proof of Theorem 1

Let z_{k,i} = (θ_1^(k), ⋯, θ_i^(k), θ_{i+1}^(k−1), ⋯, θ_p^(k−1))^T, so that z_{k,0} = θ^(k−1) and z_{k,p} = θ^(k). By the definition of θ_i^(k), we have

F(θ^(k)) ≤ F(θ^(k)) + ψ²(θ_p^(k) − θ_p^(k−1))²/2 ≤ F(z_{k,p−1}) ≤ F(z_{k,p−1}) + ψ²(θ_{p−1}^(k) − θ_{p−1}^(k−1))²/2 ≤ F(z_{k,p−2}) ≤ ⋯ ≤ F(z_{k,0}) = F(θ^(k−1)).

Note that F(θ^(k)) ≥ 0 for any k. Thus {F(θ^(k))} is a decreasing sequence and converges to a non-negative value.

Besides, we have F(z_{k,i}) − F(z_{k,i+1}) ≥ ψ²(θ_{i+1}^(k) − θ_{i+1}^(k−1))²/2 for any i = 0, 1, ⋯, p − 1. Summing from i = 0 to i = p − 1, we have F(z_{k,0}) − F(z_{k,p}) ≥ ψ²‖θ^(k) − θ^(k−1)‖_2²/2, i.e.,

F(θ^(k−1)) − F(θ^(k)) ≥ (ψ²/2)‖θ^(k) − θ^(k−1)‖_2².  (21)
Summing up Eqn. (21) for k = 1, 2, ⋯, K, we obtain Σ_{k=1}^K ‖θ^(k) − θ^(k−1)‖_2² ≤ 2(F(θ^(0)) − F(θ^(K)))/ψ² ≤ 2(F(θ^(0)) − F(θ̂))/ψ², which readily implies that

min_{1≤k≤K} ‖θ^(k) − θ^(k−1)‖_2 ≤ (1/√K)√(2(F(θ^(0)) − F(θ̂))/ψ²),  (22)

where θ̂ is a global minimizer of F(θ).

By the definition of θ_i^(k), for any t ∈ ℝ,

(1/2n)‖y − Xz_{k,i}‖_2² + R(z_{k,i}) ≤ (1/2n)‖y − Xz_{k,i} − tx_i‖_2² + R(z_{k,i} + te_i),

where e_i is the i-th column of the identity matrix. By the triangle inequality,

(1/n) t x_i^T(y − Xz_{k,i}) ≤ |t|(ξ²|t|/2 + r(|t|)/|t|),

which implies |x_i^T(y − Xz_{k,i})/n| ≤ λ*. So,

‖X^T(Xθ^(k) − y)/n‖_∞ ≤ max_i {|x_i^T(Xz_{k,i} − y)/n| + |x_i^T X(θ^(k) − z_{k,i})/n|} ≤ max_i {λ* + ξ‖θ^(k) − θ^(k−1)‖_∞} ≤ λ* + ξ‖θ^(k) − θ^(k−1)‖_2.

Combining the above with Eqn. (22), we have min_{1≤k≤K} ‖X^T(Xθ^(k) − y)/n‖_∞ ≤ 2λ* with K = 2ξ²(F(θ^(0)) − F(θ̂))/(ψ²(λ*)²). By the definition of θ_i^(k) in Eqn. (4), the third conclusion is derived from Lemma 2.

4.2. Lemma 2

Lemma 2. Let θ̂ be the global solution of Problem (1). For any y and X, if γξ ≤ 1, we have |θ̂_i| > λ(1/ξ − γ) or θ̂_i = 0 for i = 1, ⋯, p.

Proof. θ̂ minimizes (1/2n)‖y − Xθ‖_2² + R(θ), so the subgradient at θ̂ contains zero, i.e., |x_i^T(Xθ̂ − y)/n| ≤ ṙ(|θ̂_i|−) for all i = 1, ⋯, p with θ̂_i ≠ 0. Define θ̄ = (θ̂_1, ⋯, θ̂_{i−1}, 0, θ̂_{i+1}, ⋯, θ̂_p). We have (1/2n)‖y − Xθ̂‖_2² + R(θ̂) ≤ (1/2n)‖y − Xθ̄‖_2² + R(θ̄), which implies that

2n r(|θ̂_i|) ≤ θ̂_i²‖x_i‖_2² + 2θ̂_i x_i^T(y − Xθ̂) ≤ θ̂_i²‖x_i‖_2² + 2|θ̂_i||x_i^T(y − Xθ̂)| ≤ nξ²θ̂_i² + 2n|θ̂_i|ṙ(|θ̂_i|−).

Note that r(u) = λ² log(1 + u/(λγ)). It is easy to prove that r(u) > ξ²u²/2 + uṙ(u) for u ∈ (0, λ(1/ξ − γ)]. So |θ̂_i| > λ(1/ξ − γ) or θ̂_i = 0 for i = 1, ⋯, p.
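The inequality invoked in the last step ("it is easy to prove that r(u) > ξ²u²/2 + uṙ(u)") can be spot-checked numerically; here ṙ(u) = λ²/(λγ + u), and the parameter values below are arbitrary subject to γξ < 1:

```python
# Numeric spot-check of r(u) > xi^2 u^2 / 2 + u * r'(u) on (0, lam*(1/xi - gamma)],
# the inequality used at the end of the proof of Lemma 2.
from math import log

lam, gamma, xi = 1.0, 0.1, 1.0

def r(u):
    return lam * lam * log(1.0 + u / (lam * gamma))

def r_dot(u):
    return lam * lam / (lam * gamma + u)

u_max = lam * (1.0 / xi - gamma)                       # 0.9 here
grid = [u_max * k / 200.0 for k in range(1, 201)]
holds = all(r(u) > xi * xi * u * u / 2.0 + u * r_dot(u) for u in grid)
```

The check passes at every grid point, including the endpoint u = λ(1/ξ − γ).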
4.3. Proof of Theorem 2

Let {θ_OMP^(k)} be the sequence of solutions over the iterations of OMP(X, y, s_0), and let θ_OMP be the final solution. Denote L(θ_OMP) = (1/2n)‖y − Xθ_OMP‖_2². We first need to bound L(θ_OMP).

According to the proof of Theorem 2.7 of Shalev-Shwartz et al. [31], the sequence {θ_OMP^(k)} satisfies

L(θ_OMP^(k+1)) ≤ L(θ_OMP^(k)) − (L(θ_OMP^(k)) − L(θ*))²/(2‖θ*‖_1²),

which implies that

2‖θ*‖_1² L(θ_OMP^(k+1)) ≤ 2‖θ*‖_1² L(θ_OMP^(k)) − (L²(θ_OMP^(k)) − 2L(θ_OMP^(k))L(θ*) + L²(θ*)) = L(θ_OMP^(k))(2‖θ*‖_1² − L(θ_OMP^(k)) + 2L(θ*)) − L²(θ*).  (23)

Requiring the right-hand side of Eqn. (23) to be at most a·2‖θ*‖_1² L(θ_OMP^(k)) for some 0 < a < 1 is equivalent to requiring

L²(θ_OMP^(k)) − 2((1 − a)‖θ*‖_1² + L(θ*))L(θ_OMP^(k)) + L²(θ*) ≥ 0.

Hence, L(θ_OMP^(k+1)) ≤ aL(θ_OMP^(k)) holds under the condition that

L(θ_OMP^(k)) ≥ L(θ*) + (1 − a)‖θ*‖_1² + √((1 − a)²‖θ*‖_1⁴ + 2(1 − a)‖θ*‖_1² L(θ*)).  (24)

Since L(θ*) ≤ λ_e²/(2n), Eqn. (24) is satisfied if a = 1 − Bλ_e²/(2n‖θ*‖_1²) and

L(θ_OMP^(k)) ≥ (1 + B)λ_e²/n.  (25)

Since (‖y‖_2²/(2n))(1 − Bλ_e²/(2n‖θ*‖_1²))^{s_0} ≤ (‖y‖_2²/(2n)) exp(−Bλ_e²s_0/(2n‖θ*‖_1²)) ≤ (1 + B)λ_e²/n, there exists s̄_0 ≤ s_0 such that (‖y‖_2²/(2n))(1 − Bλ_e²/(2n‖θ*‖_1²))^{s̄_0 − 1} ≥ (1 + B)λ_e²/n ≥ (‖y‖_2²/(2n))(1 − Bλ_e²/(2n‖θ*‖_1²))^{s̄_0}. Hence,

L(θ_OMP) ≤ (‖y‖_2²/(2n))(1 − Bλ_e²/(2n‖θ*‖_1²))^{s̄_0} ≤ (1 + B)λ_e²/n.  (26)

Let Δ_OMP = θ_OMP − θ*. We have

√(2L(θ_OMP)) = ‖XΔ_OMP + Xθ* − y‖_2/√n ≥ √(κ−(s + s_0))‖Δ_OMP‖_2 − λ_e/√n.

So ‖Δ_OMP‖_2 ≤ λ_e(√(2(1 + B)) + 1)/√(nκ−(s + s_0)). Thus,

µ = L(θ_OMP) − L(θ*) + R(θ* + Δ_OMP) − R(θ*)
 ≤ L(θ_OMP) + R(Δ_OMP)
 ≤ (1 + B)λ_e²/n + (s + s_0) r(‖Δ_OMP‖_2/√(s + s_0))
 ≤ (1 + B)λ_e²/n + (s + s_0)λ² log(1 + (√(2(1 + B)) + 1)λ_e/(λγ√(n(s + s_0)κ−(s + s_0))))
 ≤ λ²((1 + B)λ_e²/(nλ²) + (2λ_e/(λγ√n))√((s + s_0)/κ−(s + s_0))).

4.4. Proof of Theorem 3

Let Δ = θ̃ − θ*, S = supp(θ*), and let T be any index set with |T| ≤ s = |S|. Denote κ− = κ−(2t), κ+ = κ+(2t) and % = (1 + √2)(κ+/κ− − 1)/4. Let i_1, i_2, ⋯ be the indices of the complement of T in {1, ⋯, p}, sorted so that |Δ_{i_1}| ≥ |Δ_{i_2}| ≥ |Δ_{i_3}| ≥ ⋯, with i_k ∈ T̄ for k ≥ 1. Given an integer t ≥ s, we partition T̄ as T̄ = ∪_{i≥1} T_i such that T_1 = {i_1, ⋯, i_t}, T_2 = {i_{t+1}, ⋯, i_{2t}}, ⋯. Define Σ = Σ_{i≥2} ‖Δ_{T_i}‖_2. The proof needs the following lemmas.

Lemma 3. For any η > 0, γ < 1/ξ and λ ≥ λ_e/(η√(2n log(1/(γξ)))), we have ‖X^T(y − Xθ*)/n‖_∞ ≤ ηλ* and it holds that

(1/n)θ^T X^T(y − Xθ*) ≤ (η/2n)‖Xθ‖_2² + ηR(θ)  (27)

for any θ ∈ ℝ^p.

Proof. Denote z = y − Xθ*. Suppose θ_η is a minimizer of the problem

min_θ (1/2n)‖Xθ − z/η‖_2² + R(θ).  (28)

By Lemma 2, if θ_η ≠ 0, then θ_η has at least one component whose magnitude is larger than λ(1/ξ − γ). Thus (1/2n)‖Xθ_η − z/η‖_2² + R(θ_η) > r(λ(1/ξ − γ)) ≥ λ_e²/(2nη²) ≥ ‖z/η‖_2²/(2n), which means θ_η is not a minimizer. Hence θ = 0 is the unique minimizer of Problem (28), which readily implies Eqn. (27). For any t ∈ ℝ and i = 1, ⋯, p, we have (1/2n)‖tx_i − z/η‖_2² + r(|t|) ≥ ‖z/η‖_2²/(2n), which readily implies that |x_i^T z/n| ≤ η(|t|ξ²/2 + r(|t|)/|t|) ≤ ηλ*.
Lemma 4. Suppose the conditions of Lemma 3 hold and θ̃ is a (µ, θ*)-suboptimal solution of Problem (1). Then,

(1/2n)‖XΔ‖_2² + R(Δ_S̄) ≤ ((1 + η)/(1 − η))R(Δ_S) + µ/(1 − η),

where Δ = θ̃ − θ* and S = supp(θ*).

Proof. Denote L(θ) = (1/2n)‖Xθ − y‖_2². θ* + Δ is a (µ, θ*)-suboptimal solution, which implies that

µ ≥ L(θ* + Δ) − L(θ*) + R(θ* + Δ) − R(θ*) ≥ ⟨∇L(θ*), Δ⟩ + (1/2n)‖XΔ‖_2² + R(Δ_S̄) − R(Δ_S).

Invoking Lemma 3 with θ replaced by Δ, we have ⟨∇L(θ*), Δ⟩ ≥ −(η/2n)‖XΔ‖_2² − ηR(Δ). Thus µ ≥ ((1 − η)/2n)‖XΔ‖_2² − ηR(Δ) + R(Δ_S̄) − R(Δ_S), and the conclusion follows.

Lemma 5. Suppose the conditions of Lemma 3 hold and θ̃ satisfies Eqn. (8). Denote Δ = θ̃ − θ*. For any index set T,

max{‖Δ_T‖_2, ‖Δ_{T_1}‖_2} ≤ %Σ + ((1 + √2)(2 + η)√t/(2κ−(2t)))λ*.  (29)

Proof. Invoking Eqn. (8) and Lemma 3, we obtain

‖X^T XΔ/n‖_∞ ≤ ‖X^T(Xθ̃ − y)/n‖_∞ + ‖X^T(Xθ* − y)/n‖_∞ ≤ (2 + η)λ*.

We modify Eqn. (12) of Foucart and Lai [19] to the following inequality:

⟨XΔ, X(Δ_T + Δ_{T_1})⟩/n ≤ (‖Δ_T‖_1 + ‖Δ_{T_1}‖_1)‖X^T XΔ/n‖_∞ ≤ (2 + η)√tλ*(‖Δ_T‖_2 + ‖Δ_{T_1}‖_2).

Then, following the proof of Theorem 3.1 in Foucart and Lai [19], Eqn. (29) follows.

Lemma 6. r(Σ/√t) ≤ R(Δ_T̄)/t.
Proof. For any i ∈ T_k and j ∈ T_{k−1} (k ≥ 2), we have |Δ_i| ≤ |Δ_j|. Thus r(|Δ_i|) ≤ R(Δ_{T_{k−1}})/t, i.e., |Δ_i|² ≤ (r^{−1}(R(Δ_{T_{k−1}})/t))². It follows that r(‖Δ_{T_k}‖_2/√t) ≤ R(Δ_{T_{k−1}})/t. Thus R(Δ_T̄)/t ≥ Σ_{k≥2} R(Δ_{T_{k−1}})/t ≥ Σ_{k≥2} r(‖Δ_{T_k}‖_2/√t) ≥ r(Σ/√t).

Next, we turn to the proof of Theorem 3. There are two cases according to the supports of θ̃ and θ*.

Case 1: supp(θ̃) = supp(θ*). In this case, Δ_i = 0 for i ∈ S̄ and Σ = 0. By Lemma 5, we obtain ‖Δ‖_2 = ‖Δ_S‖_2 ≤ c_1λ*, where c_1 = (1 + √2)(2 + η)√t/(2κ−).

Case 2: supp(θ̃) ≠ supp(θ*). Denote ρ_0 = λ(1/ξ_0 − γ). In this case, there exists j satisfying |Δ_j| ≥ ρ_0. Let T be the indices of the first s largest components of Δ in magnitude. Obviously, R(Δ_T) ≥ r(ρ_0). By the concavity of LSP and Lemma 5, we have

R(Δ_T) ≤ s r(‖Δ_T‖_2/√s) ≤ s r((%/√s)Σ + (1 + √2)(2 + η)√tλ*/(2κ−√s)).  (30)

Combining with Lemmas 6 and 4, it follows that

r^{−1}(R(Δ_T)/s) − %√(t/s) r^{−1}(αR(Δ_T)/t + µ/((1 − η)t)) ≤ ((1 + √2)(2 + η)√t/(2κ−√s))λ*.  (31)

Since r^{−1}(u) = λγ(exp(u/λ²) − 1) is convex, we have

r^{−1}(αR(Δ_T)/t + µ/((1 − η)t)) ≤ ((t − 1)/t) r^{−1}(αR(Δ_T)/(t − 1)) + (1/t) r^{−1}(µ/(1 − η)).  (32)

We observe that

r^{−1}(R(Δ_T)/s)/r^{−1}(αR(Δ_T)/(t − 1)) ≥ r^{−1}(r(ρ_0)/s)/r^{−1}(αr(ρ_0)/(t − 1)) = ((γξ_0)^{−1/s} − 1)/((γξ_0)^{−α/(t−1)} − 1).  (33)

Combining Eqn. (31), (32), (33) and the definition of H_γ, we obtain that under the condition of Eqn. (12) it holds that

r^{−1}(αR(Δ_T)/(t − 1)) ≤ c_2λ* + (c_3/(t − 1)) r^{−1}(µ/(1 − η)),  (34)

where

c_2 = √(t/(t − 1)) · (1 + √2)(2 + η)/(2κ−(H_γ − %))  (35)

and c_3 = %/(H_γ − %). Hence, we have

Σ ≤ √(t − 1) c_2λ* + ((c_3 + 1)/√t) r^{−1}(µ/(1 − η)).  (36)

With Eqn. (36) and Lemma 5, it follows that ‖Δ‖_2 ≤ C_1λ* + C_2 r^{−1}(µ/(1 − η)),  (37)

where

C_1 = ((1 + √2)(2 + η)√t/(2κ−)) · (2H_γ + 1)/(H_γ − %) ≥ c_1,  (38)
C_2 = (2% + 1)H_γ/(√t(H_γ − %)).  (39)
4.5. Proof of Corollary 1

The optimality of θ̂ implies that for any i = 1, ⋯, p and any t ∈ ℝ,

(1/2n)‖y − Xθ̂‖_2² + R(θ̂) ≤ (1/2n)‖y − Xθ̂ − tx_i‖_2² + R(θ̂ + te_i),

where e_i is the i-th column of the identity matrix. By the triangle inequality,

(1/n) t x_i^T(y − Xθ̂) ≤ |t|(ξ²|t|/2 + r(|t|)/|t|),

which readily implies that

‖X^T(y − Xθ̂)/n‖_∞ ≤ λ*.  (40)

Hence, the proof of Theorem 3 also applies to this corollary, with Assumption B of Theorem 3 replaced by Eqn. (40) and the conclusion of Lemma 2.

4.6. Proof of Lemma 1

Since the elements of X are drawn i.i.d. from N(0, 1), ‖Xθ‖_2² ∼ χ²(n) for any fixed θ with ‖θ‖_2 = 1. For any δ > 0, any integer s > 0 and any θ ∈ ℝ^s with ‖θ‖_2 = 1,

P_X(‖X_Tθ‖_2²/n ≥ δ, ∀T ⊂ {1, ⋯, p} with |T| = s)
 ≥ 1 − (p choose s) ∫_0^{nδ/2} (1/Γ(n/2)) x^{n/2−1} e^{−x} dx
 ≥ 1 − (pe/s)^s (1/Γ(n/2)) ∫_0^{nδ/2} x^{n/2−1} dx
 = 1 − (pe/s)^s (1/N!)(Nδ)^N = 1 − exp(−ND),

where N = n/2 and D = log(N!)/N − log(Nδ) − (s/N) log(pe/s). For D, we have

D ≥ (1/N) log[√(2πN)(N/e)^N] − log(Nδ) − (s/N) log(pe/s)
 = (1/N)(log √(2πN) + N log(1/(eδ)) − s log(pe/s))
 = (1/n)(log(πn) + n log(1/(eδ)) − 2s log(pe/s)).  (41)
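The only analytic fact used in the last display is the Stirling-type lower bound log(N!) ≥ log(√(2πN)(N/e)^N), which can be spot-checked with the log-gamma function:

```python
# Spot-check of the Stirling lower bound used in Eqn. (41):
# log(N!) >= log(sqrt(2*pi*N) * (N/e)^N) for N >= 1.
from math import lgamma, log, pi

def stirling_lower(N):
    return 0.5 * log(2.0 * pi * N) + N * log(N) - N

# lgamma(N + 1) = log(N!); the gap should be non-negative and shrink with N
gaps = [lgamma(N + 1) - stirling_lower(N) for N in (1, 2, 5, 10, 50)]
```

The gap is positive for every N and decays roughly like 1/(12N), consistent with the Stirling series.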
Given ε ∈ (0, 1), we choose a finite set Q ⊂ ℝ^s such that for all θ ∈ ℝ^s with ‖θ‖_2 = 1 we can find some q_θ ∈ Q with ‖q_θ − θ‖_2 ≤ ε/4. By a covering-number argument, such a set Q can be chosen with |Q| ≤ (12/ε)^s. Therefore,

‖X_Tq‖_2²/(n‖q‖_2²) ≥ δ, ∀T ⊂ {1, ⋯, p} with |T| = s, ∀q ∈ Q

holds with probability at least 1 − (12/ε)^s exp(−nD/2). There exist two universal constants c_4 > 0 and c_5 > 0 such that √(κ+(s)) ≤ √3(1 + √(4s log p/n)) with probability at least 1 − c_4 exp(−c_5n) [30]. Thus, with probability at least 1 − (12/ε)^s exp(−nD/2) − c_4 exp(−c_5n), for any T ⊂ {1, ⋯, p} with |T| = s it holds that

(1/√n)‖X_Tθ‖_2 ≥ (1/√n)‖X_Tq_θ‖_2 − (1/√n)‖X_T(q_θ − θ)‖_2
 ≥ √δ‖q_θ‖_2 − √3(1 + √(4s log p/n))‖q_θ − θ‖_2
 ≥ √δ(1 − ε/4) − (√3ε/4)(1 + √(4s log p/n))
 ≥ (3/4)(√δ − ε(1 + √(4s log p/n))).

In order to guarantee ‖X_Tθ‖_2/√n > 0 with probability at least 1 − exp(−(n/2) log(3/e)) − c_4 exp(−c_5n), we require that

D − (2s/n) log(12/ε) ≥ log(3/e) and √δ > ε(1 + √(4s log p/n)).  (42)

By Eqn. (41), the first condition can be strengthened to

log(πn) − 2s log(12/ε) + n log(1/(3δ)) − 2s log(pe/s) ≥ 0.

Let ε = 4δ. Then the first inequality of Eqn. (42) becomes

δ ≤ (s/(3ep))^{2s/(n−2s)} (πn)^{1/(n−2s)} / 3^{n/(n−2s)}.
Let n ≥ 4s. δ satisfies the above inequality when δ =
s 27ep .
With this δ, the second
inequality of Eqn. (42) becomes r 4
s (1 + 27ep
r
4s log p ) n ≥ 4s, we have 4 27ep n q √ s log p 27ep ) < 2/ 3e < 1. The second inequality of Eqn. (42) is satisfied. q p √ √ p 4s s and ε = 27ep , we have κ− (s) ≥ 43 δ(1 − 4 δ(1 + 4s log )) With δ = 27ep n holds with probability at least 1−c4 exp(−nc5 )−exp(−0.0493n) ≥ 1−C3 exp(−nC4 ), where C3 = 2 max{c4 , 1} and C4 = min{c5 , 0.0493}. Furthermore, we obtain that s p r 27e(1 + 4s log p/n) κ+ (s) p p p √ ≤ κ− (s) s 3 3e/4 − s/p(1 + 4s log p/n) holds with probability at least 1 − C3 exp(−nC4 ). 4.7. The upper bound of λ∗ Define U > 0 such that λ∗ ≤ λ( Note that
2 ξ2 γ 2
=
U2 log(1+U )
=
2 ξ2 γ 2 .
By letting u = U λγ, we have
p ξ 2 γU log(1 + U ) + ) = λξ 2 log(1 + U ) 2 γU
U2 log(1+U )
≥ U . Hence, ∗
λ ≤λ
r 2 log(1 +
2 ). ξ2γ2
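The bound above is easy to check numerically. The following sketch (illustrative only; the values of $\lambda$, $\xi$, $\gamma$ are arbitrary choices, and `solve_U` is a hypothetical helper, not part of the paper's algorithm) solves $U^2/\log(1+U) = 2/(\xi^2\gamma^2)$ by bisection and confirms that $\lambda\xi\sqrt{2\log(1+U)}$ is dominated by the closed-form bound.

```python
import math

def solve_U(xi, gamma):
    """Bisection for the U > 0 with U^2 / log(1+U) = 2 / (xi*gamma)^2.
    g(U) = U^2/log(1+U) increases from 0 on (0, inf), so the root is unique."""
    target = 2.0 / (xi * gamma) ** 2
    g = lambda U: U * U / math.log1p(U)
    lo, hi = 1e-12, 1.0
    while g(hi) < target:          # bracket the root
        hi *= 2.0
    for _ in range(200):           # bisect until the bracket is tiny
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

lam, xi, gamma = 1.0, 0.8, 0.5     # arbitrary illustrative values
U = solve_U(xi, gamma)
bound_U = lam * xi * math.sqrt(2.0 * math.log1p(U))
bound_closed = lam * xi * math.sqrt(2.0 * math.log1p(2.0 / (xi * gamma) ** 2))
# Since log(1+U) <= U, we get U <= 2/(xi*gamma)^2, hence bound_U <= bound_closed.
assert U <= 2.0 / (xi * gamma) ** 2 and bound_U <= bound_closed
```

The monotonicity of $g(U) = U^2/\log(1+U)$ (which justifies the bisection) follows from $2\log(1+U) > U/(1+U)$ for $U > 0$.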
4.8. Shrinkage Function

In OMP-PCD, the following non-convex one-dimensional problem needs to be solved exactly:
$$T_h(w) = \arg\min_u f_w(u),\quad \text{where } f_w(u) = \frac{1}{2}(u-w)^2 + h\lambda^2\log\left(1 + \frac{|u|}{\lambda\gamma}\right). \tag{43}$$
It should be noted that $T_h(w)$ belongs to $[\min\{0,w\}, \max\{0,w\}]$ and $T_h(0) = 0$.

For the case of $w > 0$, $f_w'(u) = u - w + \frac{h\lambda^2}{u + \lambda\gamma}$. When $w < \lambda(2\sqrt{h} - \gamma)$, $f_w(u)$ is increasing for $u \ge 0$, implying $T_h(w) = 0$. When $w \ge \lambda(2\sqrt{h} - \gamma)$, $f_w(u)$ has two stationary points
$$u_{s1} = \frac{1}{2}\lambda\gamma\left(\frac{w}{\lambda\gamma} - 1 - \sqrt{\left(\frac{w}{\lambda\gamma}+1\right)^2 - \frac{4h}{\gamma^2}}\right),\quad u_{s2} = \frac{1}{2}\lambda\gamma\left(\frac{w}{\lambda\gamma} - 1 + \sqrt{\left(\frac{w}{\lambda\gamma}+1\right)^2 - \frac{4h}{\gamma^2}}\right).$$
Note that $u_{s1} \le u_{s2} \le w$ and $f_w'(u) \le 0$ for $u \in [u_{s1}, u_{s2}]$. The minimum candidates are $u = 0$ and $u = u_{s2}$, with $f_w(0) = \frac{w^2}{2}$ and
$$f_w(u_{s2}) = \frac{\lambda^2\gamma^2}{8}\left(\bar{w} - \sqrt{\bar{w}^2 - 4h/\gamma^2}\right)^2 + h\lambda^2\log\frac{\bar{w} + \sqrt{\bar{w}^2 - 4h/\gamma^2}}{2},$$
where $\bar{w} = \frac{w}{\lambda\gamma} + 1$. Hence, for $w > 0$,
$$T_h(w) = \begin{cases} 0, & w < \lambda(2\sqrt{h} - \gamma) \text{ or } \frac{w^2}{2} \le f_w(u_{s2}),\\[2pt] u_{s2}, & \frac{w^2}{2} > f_w(u_{s2}). \end{cases}$$
The case of $w < 0$ is symmetric to the case of $w > 0$. Hence we obtain that, for any $w \in \mathbb{R}$,
$$T_h(w) = \begin{cases} 0, & |w| < \lambda(2\sqrt{h} - \gamma) \text{ or } \frac{w^2}{2} \le f_{|w|}(g(|w|)),\\[2pt] \frac{\mathrm{sgn}(w)}{2}\left(|w| - \lambda\gamma + \sqrt{(|w| + \lambda\gamma)^2 - 4h\lambda^2}\right), & \frac{w^2}{2} > f_{|w|}(g(|w|)), \end{cases}$$
where $g(w) = \frac{1}{2}\lambda\gamma\left(\frac{w}{\lambda\gamma} - 1 + \sqrt{\left(\frac{w}{\lambda\gamma}+1\right)^2 - \frac{4h}{\gamma^2}}\right)$.
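The closed form above translates directly into code. The sketch below is an illustration consistent with Eqn. (43), not the authors' released implementation; it assumes $2\sqrt{h} > \gamma$, the regime used in the case analysis, and compares the objective at $0$ with the objective at the larger stationary point.

```python
import math

def lsp_shrink(w, lam, gamma, h):
    """Shrinkage operator T_h(w) of Eqn. (43):
        argmin_u  0.5*(u - w)^2 + h*lam^2*log(1 + |u|/(lam*gamma)).
    Direct transcription of the case analysis above; assumes 2*sqrt(h) > gamma."""
    aw = abs(w)
    # Below the threshold, f_w is increasing on [0, |w|]: hard zero.
    if aw < lam * (2.0 * math.sqrt(h) - gamma):
        return 0.0
    # Larger stationary point u_s2 = g(|w|); the discriminant is
    # nonnegative past the threshold (the max(.,0) is a numerical guard).
    disc = (aw + lam * gamma) ** 2 - 4.0 * h * lam ** 2
    us2 = 0.5 * (aw - lam * gamma + math.sqrt(max(disc, 0.0)))
    # Compare the objective at 0, i.e. w^2/2, with the objective at u_s2.
    f_us2 = 0.5 * (us2 - aw) ** 2 + h * lam ** 2 * math.log1p(us2 / (lam * gamma))
    return math.copysign(us2, w) if 0.5 * aw ** 2 > f_us2 else 0.0
```

As in the hard-thresholding family, the operator has a jump: just above the threshold $\lambda(2\sqrt{h}-\gamma)$ the stationary point exists but the zero solution can still win the objective comparison, so the output remains $0$ until $w^2/2$ exceeds $f_w(u_{s2})$.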
References

[1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.

[2] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

[3] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[4] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[5] T.T. Cai, Guangwu Xu, and Jun Zhang. On recovery of sparse signals via `1 minimization. IEEE Transactions on Information Theory, 55(7):3388–3397, 2009.

[6] T.T. Cai, L. Wang, and G. Xu. New bounds for restricted isometry constants. IEEE Transactions on Information Theory, 56(9):4388–4394, 2010.

[7] E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

[8] E.J. Candès and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Transactions on Information Theory, 57(11):7235–7254, 2011.

[9] E.J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

[10] E.J. Candès, J.K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[11] E.J. Candès, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted `1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.

[12] R. Chartrand. Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data. In Biomedical Imaging: From Nano to Macro (ISBI'09), IEEE International Symposium on, pages 262–265. IEEE, 2009.

[13] Xiaojun Chen, Dongdong Ge, Zizhuo Wang, and Yinyu Ye. Complexity of unconstrained L2-Lp minimization. Mathematical Programming, pages 1–13, 2011.

[14] M.E. Davies and R. Gribonval. Restricted isometry constants where `p sparse recovery can fail for 0 < p ≤ 1. IEEE Transactions on Information Theory, 55(5):2203–2214, 2009.

[15] A.H. Delaney and Y. Bresler. Globally convergent edge-preserving regularized reconstruction: An application to limited-angle tomography. IEEE Transactions on Image Processing, 7(2):204–221, 1998.

[16] David L. Donoho. For most large underdetermined systems of linear equations the minimal `1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.

[17] D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[18] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[19] S. Foucart and M.J. Lai. Sparsest solutions of underdetermined linear systems via `q-minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 26(3):395–407, 2009.

[20] R. Garg and R. Khandekar. Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 337–344. ACM, 2009.

[21] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 37–45, 2013.

[22] Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv preprint arXiv:1305.2436, 2013.

[23] Qun Mo and Song Li. New bounds on the restricted isometry constant. Applied and Computational Harmonic Analysis, 31(3):460–468, 2011.

[24] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[25] Deanna Needell and Roman Vershynin. Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9(3):317–334, 2009.

[26] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[27] Zheng Pan and Changshui Zhang. High-dimensional inference via Lipschitz sparsity-yielding regularizers. AISTATS 2013, 2013.

[28] J.C. Ramirez-Giraldo, J. Trzasko, S. Leng, L. Yu, A. Manduca, and C.H. McCollough. Nonconvex prior image constrained compressed sensing (NCPICCS): Theory and simulations on perfusion CT. Medical Physics, 38(4):2157, 2011.

[29] G. Raskutti, M.J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

[30] G. Raskutti, M.J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over `q-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[31] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.

[32] Xiaotong Shen, Wei Pan, and Yunzhang Zhu. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497):223–232, 2012.

[33] Xiaotong Shen, Wei Pan, Yunzhang Zhu, and Hui Zhou. On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, pages 1–26, 2013.

[34] E.Y. Sidky, R. Chartrand, and X. Pan. Image reconstruction from few views by non-convex optimization. In Nuclear Science Symposium Conference Record (NSS'07), volume 5, pages 3526–3530. IEEE, 2007.

[35] E.Y. Sidky, I. Reiser, R.M. Nishikawa, X. Pan, R. Chartrand, D.B. Kopans, and R.H. Moore. Practical iterative image reconstruction in digital breast tomosynthesis by non-convex TpV optimization. In Proc. SPIE Medical Imaging, page 691328, 2008.

[36] E.Y. Sidky, X. Pan, I.S. Reiser, R.M. Nishikawa, R.H. Moore, and D.B. Kopans. Enhanced imaging of microcalcifications in digital breast tomosynthesis through improved image-reconstruction algorithms. Medical Physics, 36(11):4920, 2009.

[37] J.A. Tropp and A.C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[38] J. Trzasko, A. Manduca, and E. Borisch. Sparse MRI reconstruction via multiscale L0-continuation. In Statistical Signal Processing (SSP'07), IEEE/SP 14th Workshop on, pages 176–180. IEEE, 2007.

[39] J. Trzasko, A. Manduca, and E. Borisch. Highly undersampled magnetic resonance image reconstruction via homotopic `0-minimization. IEEE Transactions on Medical Imaging, 28(1):106–121, 2009.

[40] J.D. Trzasko and A. Manduca. A fixed point method for homotopic `0-minimization with application to MR image recovery. In Medical Imaging, pages 69130F–69130F. International Society for Optics and Photonics, 2008.

[41] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using `1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[42] C.H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[43] C.H. Zhang and T. Zhang. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science, 2012.

[44] C.H. Zhang and T. Zhang. A general framework of dual certificate analysis for structured sparse recovery problems. arXiv preprint arXiv:1201.3302, 2012.

[45] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081–1107, 2010.

[46] Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215–6221, 2011.