A General Theory of Pathwise Coordinate Optimization for Nonconvex Sparse Learning∗ Tuo Zhao†
Han Liu‡
Tong Zhang§
Abstract The pathwise coordinate optimization is one of the most important computational frameworks for solving high dimensional convex and nonconvex sparse learning problems. It differs from the classical coordinate optimization algorithms in three salient features: warm start initialization, active set updating, and strong rule for coordinate preselection. These three features grant superior empirical performance, but also pose significant challenge to theoretical analysis. To tackle this long lasting problem, we develop a new theory showing that these three features play pivotal roles in guaranteeing the outstanding statistical and computational performance of the pathwise coordinate optimization framework. In particular, we analyze the existing methods for pathwise coordinate optimization and provide new theoretical insights into them. The obtained theory motivates the development of several modifications to improve the pathwise coordinate optimization framework, which guarantees linear convergence to a unique sparse local optimum with optimal statistical properties (e.g. minimax optimality and oracle properties). This is the first result establishing the computational and statistical guarantees of the pathwise coordinate optimization framework in high dimensions. Thorough numerical experiments are provided to back up our theory.
1
Introduction
Modern data acquisition routinely produces massive amount of high dimensional data sets, where the number of variables d greatly exceeds the sample size n. Such data include but not limited to chip data from high throughput genomic experiments (Neale et al., 2012), image data from functional Magnetic Resonance Imaging (fMRI, Eloyan et al. (2012)), proteomic data from tandem mass spectrometry analysis (Adams et al., 2007), and climate data from geographically distributed data centers (Hair et al., 2006). To handle the curse of dimensionality, we often assume only a small subset of variables are relevant in modeling (Guyon and Elisseeff, 2003). Such a sparsity assumption motivates various sparse learning approaches. For example, in sparse linear regression, we consider ∗
The R package picasso implementing the proposed method is available on the Comprehensive R Archive Network http://cran.r-project.org/web/packages/picasso/. † Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA; e-mail:
[email protected]. ‡ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 USA; e-mail:
[email protected]. § Department of Statistics, Rutgers University, Piscataway, NJ, 08854 USA; e-mail:
[email protected].
1
a linear model y = Xθ ∗ + , where y ∈ Rn is the response vector, X ∈ Rn×d is the design matrix, θ ∗ = (θ1 , ..., θd )> ∈ Rd is the unknown sparse regression coefficient vector, and ∼ N (0, σ 2 I) is the noise vector. Let k · k2 denote the `2 norm, and Rλ (θ) denote a sparsity-inducing regularization function (a.k.a regularizer) with a tuning parameter λ > 0. We estimate θ ∗ by solving the regularized least square problem 1 min Fλ (θ), where Fλ (θ) = ky − Xθk22 + Rλ (θ). (1.1) 2n θ∈Rd P Popular choices of Rλ (θ) are usually coordinate decomposable (i.e., Rλ (θ) = dj=1 rλ (θj )), including the `1 regularizer (also known as Lasso, Tibshirani (1996), Chen et al. (1998)), SCAD regularizer (Fan and Li, 2001), and MCP regularizer (Zhang, 2010a). Lasso is computationally tractable due to its convexity. However, it incurs large estimation bias to parameter estimation, and requires a very restrictive irrepresentable condition to attain variable selection consistency (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Meinshausen and B¨ uhlmann, 2006). To handle this challenge, statisticians have resorted to sparse learning with nonconvex regularization. Fan and Li (2001); Kim et al. (2008); Zhang (2010a); Kim and Kwon (2012) investigate nonconvex regularized least square estimation using SCAD and MCP regularizers. Particularly, they show that there exits a local optimum that attains the oracle properties under much weaker conditions. However, they did not provide specific algorithms that find such a local optimum.
1.1
Pathwise Coordinate Optimization
Typical algorithmic solutions to (1.1) developed in the past decade include the homotopy strategy (Efron et al., 2004), proximal gradient (or majorization-minimization) strategy (Nesterov, 2013), and coordinate optimization strategy (Fu, 1998; Friedman et al., 2007; Mazumder et al., 2011; Breheny and Huang, 2011; Tibshirani et al., 2012). Among these methods, the pathwise coordinate optimization framework proposed by Friedman et al. (2007); Mazumder et al. (2011); Tibshirani et al. (2012) gains significant empirical success. Scalable software packages such as GLMNET, SPARSENET, NCVREG, and HUGE have been developed (Friedman et al., 2010; Mazumder et al., 2011; Breheny and Huang, 2011; Zhao et al., 2012). When the coordinate optimization strategy was first introduced to solve (1.1), however, its superiority was not widely recognized. The earliest example dates back to Fu (1998), which adopted the classical coordinate minimization algorithm. Similar algorithms were also considered by Shevade and Keerthi (2003), Krishnapuram et al. (2005), Kooij et al. (2007), and Genkin et al. (2007). These algorithms are simple and straightforward: Given a solution θ (t) at the t-th iteration, we randomly select a coordinate j and take an exact coordinate minimization step (t+1)
θj
(t)
= argmin Fλ (θj , θ\j ),
(1.2)
θj
where θ\j is a subvector of θ with the j-th entry removed. As will be shown in §2, (1.2) admits a closed form solution for the `1 , SCAD, and MCP regularizers. For notational simplicity, we write (t+1) (t+1) (t+1) (t) θj = Tλ,j (θ (t) ). Once θj is obtained, we take θ\j = θ\j . Such coordinate minimization algorithms, though simple, may not be efficient in theory and practice. Existing computational 2
theory shows that these algorithms only attain sublinear rates of convergence to local optima in high dimensions (Shalev-Shwartz and Tewari, 2011; Richt´arik and Tak´ aˇc, 2012; Lu and Xiao, 2014). Moreover, no theoretical guarantee has been established on statistical properties of the obtained estimators for nonconvex regularization. Therefore the coordinate optimization was almost neglected until a recent rediscovery by Friedman et al. (2007), Mazumder et al. (2011), and Tibshirani et al. (2012).1 As illustrated in Figure 1 and Algorithm 1, Friedman et al. (2007), Mazumder et al. (2011), and Tibshirani et al. (2012) integrate the warm start initialization, active set updating strategy, and strong rule for coordinate preselection into the classical coordinate optimization. The warm start initialization, also referred as the pathwise optimization scheme, optimizes the objective function in a multistage way with a sequence of decreasing regularization parameters corresponding to solutions from sparse to dense. For each stage, the algorithm initializes using the solution from the previous stage; The active set exploits the solution sparsity to speed up computation (Boyd and Vandenberghe, 2009). Within each pathwise optimization stage, the active set updating contains two nested loops: In the outer loop, the algorithm first divides all coordinates into active ones (active set) and inactive ones (inactive set) based on some heuristic coordinate gradient thresholding rule, also known as the strong rule proposed in Tibshirani et al. (2012). Within each outer loop, an inner loop is called to conduct coordinate optimization. In general, the algorithm runs an inner loop on the current active coordinates until convergence, with all inactive coordinates remain zero. The algorithm then exploits some heuristic rule to identify a new active set, which further decreases the objective value and repeats the inner loops. The optimization within each stage terminates when the active set in the outer loop no longer changes. In practice, the warm start initialization, active set updating strategies, and strong rule for coordinate preselection significantly boost the computational performance, making pathwise coordinate optimization one of the most important computational frameworks for solving (1.1). Despite of the popularity of the pathwise coordinate optimization framework, we are still in lack of adequate theory to justify its superior computational performance due to its complex algorithmic structure. Therefore the warm start initialization, active set updating strategy, and strong rule for coordinate preselection are only considered as engineering heuristics in existing literature (Friedman et al., 2007; Mazumder et al., 2011; Tibshirani et al., 2012). On the other hand, many experimental results have shown that the pathwise coordinate optimization framework is effective at finding local optima with good statistical properties in practice, yet no theoretical guarantee has been established. Thus a gap exists between theory and practice.
1.2
Our Contribution
To bridge this gap, we develop new theory and new algorithms to demonstrate the superiority of the pathwise coordinate optimization framework. Specifically, the contribution of this paper can be summarized as follows: (I) We develop a new theory for analyzing the computational and statistical properties of the 1
A brief history on applying coordinate optimization to sparse learning problems is presented in Hastie (2009).
3
Regularization parameter initialization
New re gulariza paramet tion er
n solutio Initial
Active coordinate minimization
Initial set active
Coordinate updating
Strong rule for coordinate preselection
Convergence
Warm start initialization ce
en verg Con
New active set
Inner loop
Active set updating
Middle loop Output solution
Outer loop Figure 1: The pathwise coordinate optimization framework contains 3 nested loops : (I) Warm start initialization; (II) Active set updating and strong rule for coordinate preselection; (III) Active coordinate minimization. Many empirical results have corroborated its outstanding practical performance. Algorithm 1: The pathwise coordinate optimization framework contains 3 nested loops : (I) Warm start initialization; (II) Active set updating and strong rule for coordinate preselection; (III) Active coordinate minimization. Many empirical results have corroborated its outstanding practical performance. Initialize: λ0 , θb{0} , η ∈ (0, 1) For K ← 1, 2, ..., N λK = ηλK−1 , θ [0] ← θ {K−1} , m ← 0 Initialize an active set A ⊆ {1, ..., d} Repeat θ + ← θ [m] Repeat θj+ ← TλK ,j (θ + ) for all j ∈ A Until convergence Update the active set A ⊆ {1, ..., d} θ [m+1] ← θ + , m ← m + 1 Until the active set no longer changes θb{K} ← θ [m] Output: {θb{K} }N K=0
(Warm Start Initialization) (Strong Rule for Coordinate Preselection)
(Active Coordinate Minimization) (Active Set Updating)
pathwise coordinate optimization framework. Particularly, our theory thoroughly analyzes the warm start initialization, active set updating strategy, and strong rule for coordinate preselection. It shows that these strategies play pivotal roles in guaranteeing the solution sparsity and restricted strong convexity, which are the two most important conditions for statistical and computational performance in high dimensions (See more details in §4, Candes and Tao (2005); Zhang and Huang (2008); Bickel et al. (2009); Zhang (2009)). In addition,
4
our theory provides new insights, and indicates two possible drawbacks of existing pathwise coordinate optimization algorithms: (1) The commonly used heuristic cyclic selection rule for active set updating does not necessarily guarantee the solution sparsity; (2) Using an all zero solution as the initialization may not guarantee the restricted strong convexity, when we extend the algorithms to solving other regularized M-estimation problems with more general loss functions. (II) To guarantee the solution sparsity, we propose a new algorithm, named PICASSO (PathwIse CalibrAted Sparse Shooting algOrithm), which improves the pathwise coordinate optimization framework. Particularly, we propose three new active set selection rules, which guarantee the solution sparsity throughout all iterations. The modification, though simple, has a profound impact: Our theory shows that PICASSO attains global linear convergence to a unique sparse local optimum θ¯ with oracle statistical properties: There exists a universal constant C such that with high probability, we have r s∗ ∗ ¯ = supp(θ ∗ ), kθ¯ − θ k2 ≤ C · σ and supp(θ) n where supp(θ) = {j | θj = 6 0} and s∗ denotes the number of nonzero entries of θ ∗ . To the best of our knowledge, this is the first result establishing the computational and statistical properties of the pathwise coordinate optimization framework in high dimensions. (III) To deal with more general loss functions, we propose a new convex relaxation approach. Particularly, we first approximately solve a convex relaxation of (1.1). We then plug the obtained solution into PICASSO as the initial solution. Theoretically, we show that the proposed convex relaxation approach also guarantees the same computational and statistical performance of PICASSO for solving regularized M-estimation problems with general loss functions. Several proximal gradient algorithms in exiting literature are closely related to PICASSO. By exploiting similar sparsity structures of the optimization problem, Wang et al. (2014); Zhao and Liu (2014); Loh and Wainwright (2015) show that these proximal gradient algorithms also attain linear or approximately linear rates of convergence to local optima with good statistical properties. We will compare these algorithms with PICASSO in §6 and §7. The rest of this paper is organized as follows: In §2, we introduce notations and briefly review some regularizers of our interests; In §3, we present the PICASSO algorithm; In §4 we develop a new theory for analyzing the pathwise coordinate optimization framework, and establish the computational and statistical properties of PICASSO for sparse linear regression; In §5, we extend PICASSO to sparse logistic regression and provide theoretical guarantees; In §6, we present thorough numerical experiments to support our theory; In §7, we discuss related work; In §8, we present the technical proof for all theorems. P Notations: Given a vector v = (v1 , . . . , vd )> ∈ Rd , we define vector norms: kvk1 = j |vj |, P 2 2 kvk2 = j vj , and kvk∞ = maxj |vj |. We denote the number of nonzero entries in v as kvk0 = 5
P
j
1(vj 6= 0). We define the soft-thresholding function and operator as Sλ (vj ) = sign(vj ) · max{|vj | − λ, 0}
and Sλ (v) = Sλ (v1 ), ..., Sλ (vd )
>
.
We denote v\j = (v1 , . . . , vj−1 , vj+1 , . . . , vd )> ∈ Rd−1 as the subvector of v with the j-th entry removed. Let A ⊆ {1, ..., d} be an index set. We use A to denote the complementary set to A, i.e. A = {j | j ∈ {1, ..., d}, j ∈ / A}. We use vA to denote a subvector of v by extracting all entries of v with indices in A. Given a matrix A ∈ Rd×d , we use A∗j = (A1j , ..., Adj )> to denote the j-th column of A, and Ak∗ = (Ak1 , ..., Akd )> to denote the k-th row of A. Let Λmax (A) and Λmin (A) be the largest and smallest eigenvalues of A. Let Φ1 (A), . . . , Φd (A) be the singular values of A, P we define the matrix norms kAk2F = j kA∗j k22 and kAk2 = maxj Φj (A). We denote A\i\j as the submatrix of A with the i-th row and the j-th column removed. We denote Ai\j as the i-th row of A with its j-th entry removed. Let A ⊆ {1, ..., d} be an index set. We use AAA to denote a submatrix of A by extracting all entries of A with both row and column indices in A.
2
Background
We first briefly review some nonconvex regularizers of our interests, and then derive updating formulas for the exact coordinate minimization in (1.2).
2.1
Folded Concave Regularization
For high dimensional problems, we are interested in exploiting sparsity-inducing regularizers to yield b where kθk b 0 n d. These regularizers are usually coordinate decomposable: sparse estimators θ, Pd Rλ (θ) = j=1 rλ (|θj |). For example, the most commonly used `1 regularization can be written as P Rλ (θ) = λkθk1 = λ dj=1 |θj | with rλ (|θj |) = λ|θj | for j = 1, ..., d. The `1 norm is convex and computationally tractable, but often induces large estimation bias (Fan and Li, 2001; Zhang, 2010a). To address this issue, many nonconvex (or folded concave) regularizers have been proposed to obtain nearly unbiased estimators. Two notable examples in existing literature are the SCAD (Smooth Clipped Absolute Deviation, Fan and Li (2001)) regularization with θj2 − 2λγ|θj | + λ2 rλ (|θj |) = λ|θj | · 1(|θj | ≤ λ) − · 1(λ < |θj | ≤ λγ) 2(γ − 1) (γ + 1)λ2 + · 1(|θj | > λγ) for γ > 2, (2.1) 2 and MCP (Minimax Concavity Penalty, Zhang (2010a)) regularization with ! θj2 λ2 γ rλ (|θj |) = λ |θj | − · 1(|θj | < λγ) + · 1(|θj | ≥ λγ) for γ > 1. (2.2) 2λγ 2 By examining (2.1) and (2.2), we observe that both regularizers share a generic form where Hλ (θ) =
Pd
j=1 hλ (|θj |)
Rλ (θ) = λkθk1 + Hλ (θ),
(2.3)
is a smooth concave and coordinate decomposable function. Particu-
6
larly, the SCAD regularization has 2λ|θj | − |θj |2 − λ2 · 1(λ < |θj | ≤ λγ) 2(γ − 1)
hλ (|θj |) = and the MCP regularization has
(2.4)
0.0
3.0
θj2 λ2 γ − 2λ|θj | hλ (|θj |) = − · 1(|θj | < λγ) + · 1(|θj | ≥ λγ). (2.5) 2γ 2 We present two examples of the SCAD and MCP regularizers in Figure 2. Existing literature has `1
-0.5
1.0
-2.0
1.5
-1.5
r (|✓j |)
h (|✓j |)
2.0
-1.0
2.5
SCAD MCP
SCAD MCP
0.0
-3.0
0.5
-2.5
`1
0.0
0.5
1.0
1.5
2.0
2.5
3.0
0.0
0.5
1.0
1.5
2.0
2.5
3.0
|✓j |
|✓j |
Figure 2: Two examples of the SCAD and MCP regularizers with λ = 1 and γ = 2.01. The SCAD and MCP regularizers can effectively reduce the estimation bias and achieve better performance than the `1 regularization in both parameter estimation and support recovery, but impose greater computational challenge. shown that the SCAD and MCP regularizers effectively reduce the estimation bias, and achieve better performance than that of the `1 regularization in both parameter estimation and support recovery. But their nonconvexity imposes great computational challenge. See more details in Fan and Li (2001); Zhang (2010a,b); Zhang and Zhang (2012); Fan et al. (2014); Xue et al. (2012); Wang et al. (2013, 2014).
2.2
Exact Coordinate Minimization
Recall that each exact coordinate minimization iteration solves the following optimization problem, 1 (t+1) (t) (t) θj = Tλ,j (θ (t) ) = argmin Fλ (θj , θ\j ) = argmin ky − X∗\j θ\j − X∗j θj k22 + rλ (θj ). (2.6) 2n θj θj For the `1 , SCAD, and MCP regularizers, (2.6) admits a closed form solution. More specifically, without loss of generality, we assume that the design matrix X satisfies the column normalization √ (t) (t) condition, i.e., kX∗j k2 = n for all j = 1, ..., d. Let θej = n1 X> ∗j (y − X∗\j θ\j ). Then we can verify: (t+1)
• For the `1 regularizer, we have θj
(t) = Sλ (θej );
7
(t+1)
• For the SCAD regularization, we have θj
(t) θej (t) Sγλ/(γ−1) (θej ) = 1 − 1/(γ − 1) S (θe(t) ) λ
(t) if |θej | ≥ γλ,
(t) if |θej | ∈ [2λ, γλ),
if |θej | < 2λ; (t)
j
(t+1)
• For the MCP regularizer, we have θj
3
(t) θej (t) = Sλ (θej ) 1 − 1/γ
if |θej | ≥ γλ, (t)
(t) if |θej | < γλ.
Pathwise Calibrated Sparse Shooting Algorithm
We derive the PICASSO algorithm for sparse linear regression. PICASSO is a pathwise coordinate optimization algorithm and contains three nested loops (as illustrated in Algorithm 1). For simplicity, we first introduce its inner loop, then its middle loop, and at last its outer loop.
3.1
Inner Loop: Iterates over Coordinates within an Active Set
We start with the inner loop of PICASSO, which is the active coordinate minimization (ActCooMin) algorithm. Recall that we are interested in the following nonconvex optimization problem 1 min Fλ (θ), where Fλ (θ) = ky − Xθk22 + Rλ (θ). (3.1) 2n θ∈Rd The ActCooMin algorithm solves (3.1) by iteratively conducting exact coordinate minimization, but it is only allowed to iterate over a subset of all coordinates, which is called “the active set”. Accordingly, the complementary set to the active set is called “the inactive set”, because the values of these coordinates do not change throughout all iterations of the inner loop. Since the active set usually contains only a few coordinates, the active set coordinate minimization algorithm is very scalable and efficient. For notational simplicity, we denote the active and inactive sets by A and A respectively. Here we select A and A based on the sparsity pattern of the initial solution of the inner loop θ (0) , (0)
A = {j | θj
6= 0}
(0)
and A = {j | θj
= 0}.
The ActCooMin algorithm then minimizes (3.1) with all coordinates of A staying at zero values, min Fλ (θ)
θ∈Rd
subject to θA = 0.
(3.2)
Different from the classical randomized coordinate minimization algorithm (introduced in §1), the ActCooMin algorithm iterates over all active coordinates in a cyclic order at each iteration. Without loss of generality, we assume |A| = s
and A = {j1 , ..., js } ⊆ {1, ..., d},
where j1 ≤ j2 ≤ ... ≤ js . Given a solution θ (t) at the t-th iteration, we construct a sequence of auxiliary solutions {w(t+1,k) }sk=0 to obtain the next solution θ (t+1) . Particularly, for k = 0, we take w(t+1,0) = θ (t) ; For k = 1, ..., s, we take (t+1,k)
wj k
(t+1,k)
= Tλ,jk (w(t+1,k−1) )
and w\jk 8
(t+1,k−1)
= w\jk
,
where Tλ,jk (·) is defined in (2.6). Once we obtain w(t+1,s) , we set θ (t+1) = w(t+1,s) for the next iteration. We terminate the ActCooMin algorithm when kθ (t+1) − θ (t) k2 ≤ τ λ,
(3.3)
where τ is a small convergence parameter (e.g. 10−5 ). We then take the output solution as θb = θ (t+1) . The ActCooMin algorithm is summarized in Algorithm 2. The ActCooMin algorithm only converges to a local optimum of (3.2), which is not necessarily a local optimum of (3.1). Thus PICASSO needs to combine this inner loop with some active set updating scheme, which allows the active set to change. This leads to the middle loop of PICASSO. Algorithm 2: The active coordinate minimization algorithm (ActCooMin) cyclically iterates over only a small subset of all coordinates. Therefore its computation is scalable and efficient. Without loss of generality, we assume |A| = s and A = {j1 , ..., js } ⊆ {1, ..., d}, where j1 ≤ j2 ≤ ... ≤ js . Algorithm: θb ← ActCooMin(λ, θ (0) , A, τ ) Initialize: t ← 0 Repeat w(t+1,0) ← θ (t) For k ← 1, ..., s (t+1,k) wj k ← Tλ,jk (w(t+1,k−1) ) (t+1,k)
w\jk
(t+1,k−1)
← w\jk
θ (t+1) ← w(t+1,s) t←t+1 Until kθ (t+1) − θ (t) k2 ≤ τ λ Output: θb ← θ (t) .
3.2
Middle Loop: Iteratively Updates Active Sets
We then introduce the middle loop of PICASSO, which is the iterative active set updating (IteActUpd) algorithm. The IteActUpd algorithm simultaneously decreases the objective value and iteratively changes the active set to ensure convergence to a local optimum to (3.1). For notational simplicity, we denote the least square loss function and its gradient as 1 1 L(θ) = ky − Xθk22 and ∇L(θ) = X> (Xθ − y). 2n n 3.2.1
Active Set Initialization by Strong Rule
We first introduce how PICASSO initializes the active set for each middle loop. Suppose that an initial solution θ [0] is supplied to the middle loop of PICASSO. Friedman et al. (2007) suggest a “simple rule” to initialize the active set based on the sparsity pattern of θ [0] , [0]
A0 = {j | θj 6= 0}
[0]
and A0 = {j | θj = 0}.
9
(3.4)
But Tibshirani et al. (2012) show that (3.4) is sometimes too conservative, and suggest a more aggressive active set initialization procedure using “strong rule”, which often leads to better computational performance in practice. Specifically, given an active set initialization parameter ϕ ∈ (0, 1), the strong rule2 for PICASSO initializes A0 and A0 as [0]
[0]
A0 = {j | θj = 0, |∇j L(θ [0] )| ≥ (1 − ϕ)λ} ∪ {j | θj 6= 0},
(3.5)
A0 = {j | θj = 0, |∇j L(θ [0] )| < (1 − ϕ)λ},
(3.6)
[0]
where ∇j L(θ [0] ) denotes the j-th entry of ∇L(θ [0] ). As can be seen from (3.5), the strong rule yields an active set, which is no smaller than the simple rule. Note that we need the initialization parameter ϕ to be a reasonably small value (e.g. 0.1). Otherwise, the strong rule will select too many active coordinates and compromise the restricted strong convexity. We summarize the active set initialization procedure in Algorithm 3. Algorithm 3: The active set of the middle loop can be initialized by either the simple rule (Friedman et al., 2007) or the strong rule (Tibshirani et al., 2012). The strong rule yields a larger active set, and often leads to better computational performance than the simple rule in practice. Algorithm: (A0 , A0 ) ← ActIntProc(λ, θ [0] , ϕ) [0] [0] Simple Rule: A0 ← {j | θj 6= 0}, A0 = {j | θj = 0} [0]
[0]
Strong Rule: A0 ← {j | θj = 0, |∇j L(θ [0] )| ≥ (1 − ϕ)λ} ∪ {j | θj 6= 0} [0]
A0 ← {j | θj = 0, |∇j L(θ [0] )| < (1 − ϕ)λ} Output: (A0 , A0 )
3.2.2
Active Set Updating Strategy
We then introduce how PICASSO updates the active set at each iteration of the middle loop. Suppose that at the m-th iteration (m ≥ 1), we are supplied with a solution θ [m] with a pair of active and inactive sets defined as [m]
Am = {j | θj
[m]
6= 0}
and Am = {j | θj
6= 0}
and Am+0.5 = {j | θj
= 0}.
Each iteration of the IteActUpd algorithm contains two stages. The first stage conducts the active coordinate minimization algorithm over the active set Am until convergence, and returns a solution θ [m+0.5] . Note that the active coordinate minimization algorithm may yield zero values for some active coordinates. Accordingly, we remove these coordinates from the active set, and obtain a new pair of active and inactive sets as [m+0.5]
Am+0.5 = {j | θj
[m+0.5]
= 0}.
The second stage checks which inactive coordinates of Am+0.5 should be added into the active set. Existing pathwise coordinate optimization algorithms usually add inactive coordinates into the active set based on a cyclic selection rule (Friedman et al., 2007; Mazumder et al., 2011). 2
Our proposed strong rule for PICASSO is sightly different from the sequential strong rule proposed in Tibshirani et al. (2012). See more details in Remark 3.2.
10
Particularly, they conduct exact coordinate minimization over all coordinates of Am+0.5 in a cyclic order. Accordingly, an inactive coordinate is added into the active set if the corresponding exact coordinate minimization yields a nonzero value. Such a cyclic selection rule, however, is heuristic and has no control over the solution sparsity. It may add too many inactive coordinates into the active set, which compromises the restricted strong convexity. To address this issue, we propose the following three new active set selection rules. (I) Greedy Selection: We select a coordinate by km = argmax |∇k L(θ [m+0.5] )|, k∈Am+0.5
L(θ [m+0.5] )
where ∇j algorithm if
denotes the j-th entry of ∇L(θ [m+0.5] ). We terminate the IteActUpd |∇km L(θ [m+0.5] )| ≤ (1 + δ)λ,
(3.7)
where δ is a small convergence parameter (e.g. 10−5 ). Otherwise, we obtain θ [m+1] by [m+1]
θk m
= Tλ,km (θ [m+0.5] )
[m+1]
and θ\km
[m+0.5]
= θ\km
,
and take the new active and inactive sets as
Am+1 = Am+0.5 ∪ {km } and Am+1 = Am+0.5 \ {km }.
We summarize the IteActUpd algorithm using the greedy selection rule in Algorithm 4. Algorithm 4: The iterative active set updating algorithm simultaneously decreases the objective value and iteratively changes the active set. To encourage the sparsity of the active set, the greedy selection rule moves only one inactive coordinate to the active set in each iteration. Algorithm: θb ← IteActUpd(λ, θ [0] , δ, τ, ϕ) Initialize: m ← 0, (A0 , A0 ) ← ActIntProc(λ, θ [0] , ϕ) Repeat θ [m+0.5] ← ActCooMin(λ, θ [m] , Am , τ ) [m+0.5] Am+0.5 ← {j | θj 6= 0} [m+0.5]
Am+0.5 ← {j | θj = 0} [m+0.5] km ← argmax |∇k L(θ )| k∈Am+0.5
[m+1] θkm [m+1] θ\km
← Tλ,km (θ [m+0.5] ) [m+0.5]
← θ\km Am+1 ← Am+0.5 ∪ {km } Am+1 ← Am+0.5 \ {km } m←m+1 Until |∇km L(θ [m+0.5] )| ≤ (1 + δ)λ Output: θb ← θ [m]
11
(II) Randomized Selection: We randomly select a coordinate km from Mm+0.5 = {k | k ∈ Am+0.5 , |∇k L(θ [m+0.5] )| ≥ (1 + δ)λ}
(3.8)
with equal probability, where δ is defined in (3.7). We terminate the IteActUpd algorithm if Mm+0.5 is an empty set, i.e., Mm+0.5 = ∅. Otherwise, we obtain θ [m+1] by [m+1]
θkm
= Tλ,km (θ [m+0.5] )
[m+1]
and θ\km
[m+0.5]
= θ\km
,
and take the new active and inactive sets as Am+1 = Am+0.5 ∪ {km }
and Am+1 = Am+0.5 \ {km }.
We summarize the IteActUpd algorithm using the randomized selection rule in Algorithm 5. Remark 3.1. The randomized selection procedure has an equivalent and efficient implementation as follows: (i) We generate a randomly shuffled order of all inactive coordinates of Am+0.5 ; (ii) We then cyclically check all inactive coordinates according to the order generated in (i), and select the first inactive coordinate k satisfying |∇k L(θ [m+0.5] )| ≥ (1 + δ)λ. If no inactive coordinate satisfies the requirement, i.e., max k∈Am+0.5
|∇k L(θ)[m+0.5] | ≤ (1 + δ)λ,
then we have Mm+0.5 = ∅ and terminate the middle loop. Algorithm 5: Similar to the greedy selection rule, the randomized selection rule also moves only one inactive coordinate to the active set in each iteration to encourage the sparsity of the active set. Algorithm: θb ← IteActUpd(λ, θ [0] , δ, τ, ϕ) Initialize: m ← 0, (A0 , A0 ) ← ActIntProc(λ, θ [0] , ϕ) Repeat θ [m+0.5] ← ActCooMin(λ, θ [m] , Am , τ ) [m+0.5] Am+0.5 ← {j | θj 6= 0} [m+0.5]
Am+0.5 ← {j | θj = 0} Randomly sample a coordinate km from
Mm+0.5 = {k | k ∈ Am+0.5 , |∇k L(θ [m+0.5] )| ≥ (1 + δ)λ} with equal probability [m+1] θkm ← Tλ,km (θ [m+0.5] ) [m+1]
[m+0.5]
θ\km ← θ\km Am+1 ← Am+0.5 ∪ {km } Am+1 ← Am+0.5 \ {km } m←m+1 Until max |∇k L(θ [m+0.5] )| ≤ (1 + δ)λ k∈Am+0.5
Output: θb ← θ [m]
(III) Truncated Cyclic Selection: We choose to iterate over all coordinates of Am+0.5 in a cyclic 12
order. But different from the cyclic selection rule in Friedman et al. (2007); Mazumder et al. (2011), we conduct exact coordinate minimization over an inactive coordinate and add it into the active set only if the corresponding coordinate gradient has a sufficiently large magnitude. Otherwise, we make this coordinate stay in the inactive set. More specifically, without loss of generality, we assume |Am+0.5 | = g
and Am+0.5 = {j1 , ..., jg } ⊆ {1, ..., d},
where j1 ≤ j2 ≤ ... ≤ jg . We terminate the IteActUpd algorithm if max
k∈Am+0.5
|∇k L(θ)[m+0.5] | ≤ (1 + δ)λ.
Otherwise, we construct a sequence of auxiliary solutions {w[m+1,k] }gk=0 as follows: For k = 0, [m+1,k] [m+1,k−1] we set w[m+1,0] = θ [m+0.5] ; For k = 1, ..., g, we take w\jk = w\jk and ( Tλ,jk (w[m+1,k−1] ) if |∇jk L(w[m+1,k−1] )| ≥ (1 + δ)λ, [m+1,k] wj k = [m+1,k−1] wj k otherwise, where δ is defined in (3.7). Note that when δ = 0, the truncated cyclic selection rule is reduced to the cyclic selection rule in Friedman et al. (2007); Mazumder et al. (2011). Once we obtain all auxiliary solutions, we set θ [m+1] = w[m+1,g] , and take the new active and inactive sets based on the sparsity pattern of θ [m+1] , i.e., [m+1]
Am+1 = {j | θj
[m+1]
6= 0} and Am+1 = {j | θj
= 0}.
We summarize the iterative active set updating algorithm using the truncated cyclic selection rule in Algorithm 6. The IteActUpd algorithm, though equipped with the three proposed active set selection rules and strong rule for coordinate preselection, can only guarantee the solution sparsity throughout all iterations when specifying an appropriate regularization parameter.3 Otherwise when an insufficiently large regularization parameter is supplied, the IteActUpd algorithm may still overselect active coordinates. To address this issue, we combine the IteActUpd algorithm with a pathwise regularization scheme, which leads to the outer loop of PICASSO.
3.3
Outer Loop: Iterates over Regularization Parameters
The outer loop of PICASSO is the warm start initialization (also known as the pathwise optimization scheme). It solves (1.1) indexed by a geometrically decreasing sequence of regularization parameters {λK = λ0 η K }N K=0 with a common decaying parameter η ∈ (0, 1), and outputs a sequence of N + 1 solutions {θb{K} }N K=0 (also known as the solution path). For sparse linear regression, PICASSO chooses the leading regularization parameter λ0 as λ0 = k∇L(0)k∞ = k n1 X> yk∞ . Recall that Hλ (θ) is defined in (2.3). By verifying the KKT condition, we have min k∇L(0) + ∇Hλ0 (0) + λ0 ξk∞ = min k∇L(0) + λ0 ξk∞ = 0,
ξ∈∂k0k1
ξ∈∂k0k1
3
As will be shown in §4, the choice of the regularization parameter is determined by the initial solution of the middle loop.
13
Algorithm 6: To encourage the sparsity of the active set, the truncated cyclic selection rule only selects coordinates only when their corresponding coordinate gradients are sufficiently large in magnitude. Without loss of generality, we assume |Am | = g and Am = {j1 , ..., jg }, where j1 ≤ j2 ≤ ... ≤ jg . Algorithm: θb ← IteActUpd(λ, θ [0] , δ, τ, ϕ) Initialize: m ← 0, (A0 , A0 ) ← ActIntProc(λ, θ [0] , ϕ) Repeat θ [m+0.5] ← ActCooMin(λ, θ [m] , Am , τ ) [m+0.5] Am+0.5 ← {j | θj 6= 0} [m+0.5]
Am+0.5 ← {j | θj = 0} [m+1,1] [m+0.5] w ←θ For k ← 1, ..., g If |∇jk L(w[m+1,k−1] )| ≥ (1 + δ)λ [m+1,k] wj k ← Tλ,jk (w[m+1,k−1] ) Else [m+1,k] [m+1,k−1] wj k ← wj k [m+1,k]
w\jk
[m+1,k−1]
← w\jk
θ [m+1] ← w[m+1,g] [m+1] Am+1 ← {j | θj 6= 0} [m+1]
Am+1 ← {j | θj = 0} m←m+1 Until max |∇k L(θ [m+0.5] )| ≤ (1 + δ)λ k∈Am+0.5
Output: θb ← θ [m]
where the first equality comes the fact ∇Hλ0 (0) = 0 for the MCP and SCAD regularizers (We will elaborate more about this in §4). This indicates that 0 is a local optimum of (1.1). Accordingly, we set θb{0} = 0. Then for K = 1, 2, ..., N , we solve (1.1) for λK using θb{K−1} as initialization. We summarize the warm start initialization in Algorithm 7. The warm start initialization starts with large regularization parameters to suppress the overselection of irrelevant coordinates {j | θj∗ = 0} (in conjunction with the IteActUpd algorithm). Therefore the solution sparsity ensures the restricted strong convexity throughout all iterations, making the algorithm behaves as if minimizing a strongly convex function. Though large regularization parameters may also yield zero values for many relevant coordinates {j | θj∗ 6= 0} and result in larger estimation errors, this can be compensated by the decreasing regularization sequence. Eventually, PICASSO will gradually recover the relevant coordinates, reduce the estimation error of each output solution, and attain a sparse output solution with good statistical properties. Remark 3.2. Tibshirani et al. (2012) propose a sequential strong rule for coordinate preselection,
14
Algorithm 7: The warm start initialization solves (1.1) with respect to a decreasing sequence of regularization parameters {λK }N K=0 . The leading regularization parameter λ0 is chosen as λ0 = k∇L(0)k∞ , which yields an all zero output solution θb{0} = 0. For K = 1, ..., N , we solve N (1.1) for λK using θb{K−1} as an initial solution. {τK }N K=1 and {δK }K=1 are two sequence of convergence parameters, where τK and δK correspond to the K-th outer loop iteration with the regularization parameter λK . Algorithm: {θb{K} }N ← PICASSO({λK }N ) K=0
K=0
N Parameter: η, ϕ, {τK }N K=1 , {δK }K=1 {0} Initialize: λ0 ← k∇L(0)k∞ , θb ← 0 For K ← 1, 2, ..., N λK ← ηλK−1 θb{K} ← IteActUpd(λK , θb{K−1} , δK , τK , ϕ) Output: {θb{K} }N K=0
which initializes the active set for the K-th middle loop as [0]
[0]
A0 = {j | θj = 0, |∇j L(θ [0] )| ≥ 2λK − λK−1 } ∪ {j | θj 6= 0}, A0 = {j |
[0] θj
= 0, |∇j L(θ [0] )| < 2λK − λK−1 }.
Recall that λK = ηλK−1 . Then we have
2λK − λK−1
1−η = 1− λK . η
Thus (3.9) and (3.10) are essentially our proposed strong rule for PICASSO with ϕ =
4
(3.9) (3.10)
1−η η .
Computational and Statistical Theory
We develop a new theory to analyze the pathwise coordinate optimization framework, and establish the computational and statistical properties of PICASSO for sparse linear regression. Recall that in P (2.3), we rewrite the nonconvex regularizer as Rλ (θ) = λkθk1 + Hλ (θ), where Hλ (θ) = dj=1 hλ (|θj |) is a smooth, concave, and coordinate decomposable function. For notational simplicity, we define Leλ (θ) = L(θ) + Hλ (θ). Accordingly we can rewrite Fλ (θ) as Fλ (θ) = L(θ) + Rλ (θ) = Leλ (θ) + λkθk1 . Before we present the main results, we first introduce four assumptions.
4.1
Preliminaries
The first assumption requires λN to be sufficiently large. Assumption 4.1. We require that the geometrically decreasing regularization sequence satisfies 4 λN ≥ 4k∇L(θ ∗ )k∞ = kX> k∞ . n Moreover, we require η ∈ [0.96, 1).
15
Assumption 4.1 has been extensively studied in existing literature on high dimensional statistical theories for sparse linear regression (Zhang and Huang, 2008; Bickel et al., 2009; Zhang, 2010a; Negahban et al., 2012). It guarantees that all regularization parameters are sufficiently large for PICASSO to eliminate irrelevant coordinates along the solution path. Remark 4.2. Note that Assumption 4.1 is deterministic for any given λN . Since kX> k∞ is a random quantity, we need to verify that Assumption 4.1 holds with high probability when applying PICASSO to sparse linear regression. The second assumption imposes several regularity conditions on the nonconvex regularizer. Assumption 4.3. We require h(·) and h0 (·) of the nonconvex regularizer to satisfy: (R.1) For any a > b ≥ 0, there exists a constant α ≥ 0 such that −α(a − b) ≤ h0λ (a) − h0λ (b) ≤ 0.
(R.2) For some γ > 0 and ∀ a ≥ 0, we have h0λ (a) ∈ [−λ, 0] if a ≤ λγ, and h0λ (a) = −λ otherwise. (R.3) hλ (·) and h0λ (·) pass through the origin, i.e., hλ (0) = 0 and h0λ (0) = 0. (R.4) For ∀ a ≥ 0, we have |h0λ1 (a) − h0λ2 (a)| ≤ |λ1 − λ2 |.
For simplicity, we call α in (R.1) the concavity parameter of hλ (·), since α characterizes how concave it is. By examining hλ (·) defined in (2.4) and (2.5), we can verify that both SCAD and MCP regularizers satisfy Assumption 4.3. Particularly, we have α = 1/(γ − 1) for SCAD and α = 1/γ for MCP. Before we present the third assumption, we define the largest and smallest s sparse eigenvalues of the Hessian matrix ∇2 L(θ) = n1 X> X as follows. Definition 4.4. Given an integer s ≥ 1, we define
v> ∇2 L(θ)v , kvk22 kvk0 ≤s
Largest s sparse eigenvalue : ρ+ (s) = sup
v> ∇2 L(θ)v . kvk22 kvk0 ≤s
Smallest s sparse eigenvalue : ρ− (s) = inf
The next lemma connects the largest and smallest s sparse eigenvalues to the restricted strong convexity and smoothness. Lemma 4.5. Suppose that there exists an integer s such that 0 ≤ α < ρ− (s) ≤ ρ+ (s) < ∞, where α is the concavity parameter defined in (R.1) of Assumption 4.3. For any θ, θ 0 ∈ Rd satisfying kθ − θ 0 k0 ≤ s, we say that L(θ) is restricted strongly convex and smooth, i.e., ρ+ (s) 0 ρ− (s) 0 kθ − θk22 ≤ L(θ 0 ) − L(θ) − (θ 0 − θ)> ∇L(θ) ≤ kθ − θk22 . 2 2 For notational simplicity, we define ρe− (s) = ρ− (s) − α. Then we say that Leλ (θ) is also restricted strongly convex and smooth, i.e., ρe− (s) 0 ρ+ (s) 0 kθ − θk22 ≤ Leλ (θ 0 ) − Leλ (θ) − (θ 0 − θ)> ∇Leλ (θ) ≤ kθ − θk22 . 2 2 Moreover, we have Fλ (θ) satisfying the restricted strong convexity, i.e., for any ξ ∈ ∂kθk1 , ρe− (s) 0 kθ − θk22 ≤ Fλ (θ 0 ) − Fλ (θ) − (θ 0 − θ)> (∇Leλ (θ) + λξ). 2 16
The proof of Lemma 4.5 is presented in Appendix A. Lemma 4.5 indicates the importance of the solution sparsity to PICASSO: When θ is sufficiently sparse, the restricted strong convexity of L(θ) can dominate the concavity of Hλ (θ). Thus if an algorithm can guarantee the solution sparsity throughout all iterations, it will behave like minimizing a strongly convex optimization problem. Accordingly, linear convergence can be established. Now we introduce the third assumption on (1.1). Assumption 4.6. Given kθ ∗ k0 ≤ s∗ , there exists an integer se such that se ≥ (484κ2 + 100κ)s∗ ,
ρ+ (s∗ + 2e s + 1) < +∞,
where κ is defined as κ = ρ+ (s∗ + 2e s + 1)/e ρ− (s∗ + 2e s + 1).
and ρe− (s∗ + 2e s + 1) > 0,
Assumption 4.6 guarantees that the optimization problem satisfies the restricted strong convexity as long as the number of active irrelevant coordinates never exceeds se along the solution path. That is closely related to the restricted isometry property (RIP) and restricted eigenvalue (RE) conditions, which have been extensively studied in existing literature. See more details in Candes and Tao (2005); Bickel et al. (2009); Zhang (2010a); Raskutti et al. (2010); Lounici et al. (2011); Negahban et al. (2012). At last, we introduce the assumption on the computational parameters of PICASSO. Assumption 4.7. Recall that the convergence parameters δK ’s and τK ’s are defined in Algorithm 7, and the active set initialization parameter ϕ is defined in (3.5). We require s 1 1 δK ρe− (1) ϕ ≤ , δK ≤ , and τK ≤ for all K = 1, ..., N. ∗ 8 8 ρ+ (s + 2e s) ρ+ (1)(s∗ + 2e s) Assumption 4.7 guarantees that all middle and inner loops of PICASSO attain adequate precision such that their output solutions satisfy the desired computational and statistical properties.
4.2
Active Coordinate Minimization
First, we start with the convergence analysis for the inner loop of PICASSO. The following theorem presents the rate of convergence in terms of the objective value for Algorithm 2. For notational simplicity, we omit the outer loop index K, and denote λK and τK by λ and τ . Theorem 4.8. Suppose that Assumptions 4.3 and 4.6 hold. If the initial solution θ (0) satisfies ¯ For t = 1, 2..., kθ (0) k0 = s ≤ s∗ + 2e s, then (3.2) is strongly convex with a unique global optimum θ. we have t sρ2+ (s) (t) ¯ ¯ Fλ (θ ) − Fλ (θ) ≤ [Fλ (θ (0) ) − Fλ (θ)]. sρ2+ (s) + ρe− (s)e ρ− (1) Moreover, we need at most sρ2+ (s) ρe− (1)τ 2 λ2 −1 log · log ¯ sρ2+ (s) + ρe− (s)e ρ− (1) 2[Fλ (θ (0) ) − Fλ (θ)] iterations such that kθ (t+1) − θ (t) k2 ≤ τ λ, where τ is the convergence parameter defined in (3.3).
The detailed proof of Theorem 4.8 is presented in §8.1. Theorem 4.8 guarantees that given a sparse initial solution, Algorithm 2 essentially minimizes a strongly convex optimization problem, 17
though (1.1) can be globally nonconvex. Therefore it attains linear convergence to a unique global optimum in terms of the objective value. To the best of our knowledge, Theorem 4.8 is the first result establishing linear rates of convergence for cyclic coordinate optimization algorithms for minimizing nonsmooth strongly convex composite functions.
4.3
Active Set Updating
Then, we proceed with the convergence analysis for the middle loop of PICASSO. The following theorem presents the rate of convergence in terms of the objective value.4 For notational simplicity, we omit the outer loop index K, and denote λK and δK by λ and δ. Moreover, we define 4λ2 s∗ 4λ = , S = {j | θj∗ 6= 0}, and S = {j | θj∗ = 0}. (4.1) ρe− (s∗ + se)
Theorem 4.9. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For any λ ≥ λN , if the initial [0] solution θ [0] in Algorithms 4-6 satisfies kθS k0 ≤ se and Fλ (θ [0] ) ≤ Fλ (θ ∗ ) + 4λ , then regardless the active set initialized by the strong rule or simple rule, we have |A0 ∩ S| ≤ se. Meanwhile, for [m] [m+0.5] m = 0, 1, 2, ..., we also have kθS k0 ≤ 2e s, kθS k0 ≤ se, and m ∗ + 2e ρ e (s s ) − [m] λ Greedy Selection : Fλ (θ ) − Fλ (θ¯ ) ≤ 1 − ∗ [Fλ (θ (0) ) − Fλ (θ¯λ )], (s + 2e s)ρ+ (1) ρe− (s∗ + 2e s) m [m] λ ¯ [Fλ (θ (0) ) − Fλ (θ¯λ )], Randomized Selection : EFλ (θ ) − Fλ (θ ) ≤ 1 − ∗ (s + 2e s)ρ+ (1) where θ¯λ is a unique sparse local optimum of (1.1), and satisfies the KKT condition Kλ (θ¯λ ) =
min
ξ∈∂kθ¯λ k1
k∇Leλ (θ¯λ ) + λξk∞ = 0
Moreover, when we terminate the IteActUpd algorithm with
and kθ¯Sλ k0 ≤ se. max
k∈Am+0.5
(4.2)
|∇k L(θ [m+0.5] )| ≤ (1 + δ)λ,
where δ is the convergence parameter defined in (3.7), the following results hold: (1) The output solution θbλ satisfies the approximate KKT condition Kλ (θbλ ) ≤ δλ;
(2) For the greedy selection, the number of active set updating iterations is at most ρe− (s∗ + 2e s) δλ −1 log 1− ∗ · log ; (s + 2e s)ρ+ (1) 3ρ+ (1)[Fλ (θ [0] ) − Fλ (θ¯λ )]
(3) For the randomized selection, given a constant ϑ ∈ (0, 1), the number of active set updating iterations is at most ! ρe− (s∗ + 2e s) ϑδλ −1 log 1− ∗ · log (s + 2e s)ρ+ (1) 3ρ+ (1) Fλ (θ (0) ) − Fλ (θ¯λ ) with probability at least 1 − ϑ;
4
Since the randomized selection rule randomly selects a coordinate in each active set updating iteration, the objective value of Fλ (θ [m] ) is essentially a random variable. Therefore its rate of convergence is analyzed in terms of the expected objective value.
18
p (4) For the truncated cyclic selection, given δ ≥ 73/(484κ + 100), the number of active set updating iterations is at most 2ρ+ (1)[Fλ (θ [0] ) − Fλ (θ¯λ )] . δ 2 λ2 The proof of Theorem 4.9 is presented in §8.2. Theorem 4.9 guarantees that when supplied a proper initial solution, the middle loop of PICASSO attains linear convergence to a unique sparse local optimum. Moreover, Theorem 4.9 has three important implications:
Added by Greedy Selection
✓ [m+1]
Failure
Success
✓ [m+1]
Added by Randomized Selection
✓ [m+1] Added by Cyclic Selection
Relevant Coordinates
✓ [m+0.5]
Success
Added by Truncated Cyclic Selection
(I) The greedy and randomized selection rules are very conservative and only select one coordinate each time. The truncated cyclic selection rule only selects coordinates whose corresponding coordinate gradients are sufficiently large in magnitude. These mechanisms prevent the overselection of irrelevant coordinates and encourage the solution sparsity. In contrast, the cyclic selection used in Friedman et al. (2007); Mazumder et al. (2011) may overselect irrelevant coordinates and compromise the restricted strong convexity. An illustrative example is provided in Figure 3. ✓ [m+1]
Success
Figure 3: An illustration of the active set selection. The greedy and randomized selection rules only select one coordinate each time, and the truncated cyclic selection rule only select coordinates only when their corresponding coordinate gradients are sufficiently large in magnitude. These mechanisms prevent the overselection of irrelevant coordinates. In contrast, the cyclic selection rule used in Friedman et al. (2007); Mazumder et al. (2011) may overselect irrelevant coordinates and compromise the restricted strong convexity. (II) Besides decreasing the objective value, the active coordinate minimization algorithm can remove some irrelevant coordinates from the active set. Then in conjunction with our proposed active set selection rules, the solution sparsity is guaranteed throughout all iterations. An illustrative example is provided in Figure 4. To the best of our knowledge, such a “forward-backward” phenomenon has not been discovered and rigorously characterized in existing literature. (III) The strong rule for PICASSO moves inactive coordinates to the active set when their 19
0.5]
Add
✓ [m] Remove ✓ [m+0.5]
b ✓
Relevant Coordinates
✓ [m
...
...
✓ [0]
Figure 4: An illustration of the active set updating algorithm. The active set updating iteration first removes some active coordinates from the active set, then add some inactive coordinates into the active set. Therefore the solution sparsity is guaranteed throughout all iterations. To the best of our knowledge, such a “forward-backward” phenomenon has not been discovered and rigorously characterized in existing literature.
corresponding coordinate gradients are sufficiently large in magnitude. Thus it can also prevent the overselection of irrelevant coordinates and guarantee the restricted strong convexity.
4.4
Warm Start Initialization
Next, we proceed with the convergence analysis for the outer loop of PICASSO. As has been shown in Theorem 4.9, each middle loop of PICASSO requires a proper initialization. Since θ ∗ and S are unknown in practice, it is difficult to manually pick such an initial solution. The next theorem shows that the warm start initialization can guide PICASSO to attain such a proper initialization for every middle loop without any prior knowledge on θ ∗ and S. Theorem 4.10. Recall that KλK (θ) is defined in (4.2). Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. If θ satisfies kθS k0 ≤ se and KλK−1 (θ) ≤ δK−1 λK−1 , then we have √ b 1 ≤ 11k∆ b S k1 ≤ 11 s∗ k∆k b 2 , Kλ (θ) ≤ λK /4, and Fλ (θ) ≤ Fλ (θ ∗ ) + 4λ . k∆k K
K
K
K
The proof of Theorem 4.10 is presented in Appendix D.1. The warm start initialization starts with an all zero local optimum and a sufficiently large regularization parameter λ0 , which naturally satisfy all requirements k0S k0 ≤ se and Kλ0 (0) = 0.
Thus θ [0] = 0 is a proper initial solution for λ1 . Then combining Theorem 4.10 with Theorem 4.9, we can show by induction that the output solution of each middle loop is always a proper initial solution for the next middle loop. Remark 4.11. From a geometric perspective, the warm start initialization yields a sequence 20
T of nested sparse basins of attraction Cλ0 ⊇ Cλ1 ⊇ ... ⊇ CλN such that θb{K−1} ∈ CλK−1 CλK . Therefore PICASSO iterates along these basins, and eventually converges to a “good” local optimum in CλN (close to θ ∗ ). An illustration of the warm start initialization is presented in Figure 5.
C
{K 2}
b{1} ✓
C
K 1
Basin of Attraction for
K
Basin of Attraction for
K+1
Basin of Attraction for
N
b{K+1} ✓
K+1
C
K
Basin of Attraction for
N
✓⇤ K
b{N } ✓
1
C
·
···
C
1}
1
··
b ✓
b{K ✓
b{K} ✓
Basin of Attraction for
1
b{0} ✓
Figure 5: An illustration of the warm start initialization (the outer loop). We start with large regularization parameters. This suppresses the overselection of irrelevant coordinates {j | θj∗ = 0} and yields highly sparse solutions. With the decrease of the regularization parameter, the PICASSO algorithm gradually recovers the relevant coordinates, and eventually obtains a sparse output solution with good statistical properties. Combining Theorems 4.8, 4.9, and 4.10, we establish the global convergence in terms of the objective value for PICASSO in the next theorem. Theorem 4.12. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. Recall that α is the concavity parameter defined in (R.1) of Assumption 4.3, δK ’s and τK ’s are the convergence parameters defined in Algorithm 7, and κ and se are defined in Assumption 4.6. For K = 1, · · · , N , we have: (1) At the K-th iteration of the outer loop, the number of exact coordinate minimization iterations within each inner loop is at most 2ρ e− (s∗ + 2e s)e ρ− (1) + (s∗ + 2e s)ρ2+ (s∗ + 2e s) ρe− (1)τK e− (s∗ + se) −1 ρ Tmax (τK ) = log · log ; 50s∗ (s∗ + 2e s)ρ2+ (s∗ + 2e s) (2) At the K-th iteration of the outer loop, the number of active set updating iterations based on the greedy selection rule is at most 2 δ ρe− (s∗ + se) ρe− (s∗ + 2e s) Grd Mmax (δK ) = log−1 1 − ∗ · log K ∗ ; (s + 2e s)ρ+ (1) 75s ρ+ (1)
(3) At the K-th iteration of the outer loop, given a constant ϑ ∈ (0, 1), the number of active set
21
updating iterations based on the randomized selection rule is at most 2 ϑδK ρe− (s∗ + se) ρe− (s∗ + 2e s) Rnd −1 · log 1− ∗ Mmax (δK ) = log (s + 2e s)ρ+ (1) 75s∗ ρ+ (1) with probability at least 1 − ϑ; p (4) At the K-th iteration of the outer loop, given δK ≥ 73/(484κ + 100), the number of active set updating iterations based on the truncated cyclic selection rule is at most 50ρ+ (1)s∗ Cyc Mmax (δK ) = 2 ; ρe− (s∗ + 2e s)δK
(5) To compute the entire solution path, the total number of exact coordinate minimization P Grd iterations in Algorithm 2 is at most N K=1 Tmax (τK ) · Mmax (δK );
(6) To compute the entire solution path using the randomized selection rule, the total number of P Rnd exact coordinate minimization iterations in Algorithm 2 is at most N K=1 Tmax (τK ) · Mmax (δK ) with probability at least 1 − N ϑ; (7) To compute the entire solution path using the truncated cyclic selection rule, the total number of P Cyc exact coordinate minimization iterations in Algorithm 2 is at most N K=1 Tmax (τK )·Mmax (δK ); (8) At the K-th iteration of the outer loop, we have
50λ2K s∗ . FλN (θb{K} ) − FλN (θ¯λN ) ≤ [1(K < N ) + 1(K = N ) · δN ] ρe− (s∗ + se)
The proof of Theorem 4.12 is presented in §8.3. Theorem 4.12 guarantees that PICASSO attains global linear convergence to a unique sparse local optimum, which is a significant improvement over sublinear convergence of the randomized coordinate minimization algorithms established in existing literature. To the best of our knowledge, this is the first result establishing the convergence properties of the pathwise coordinate optimization framework in high dimensions. Remark 4.13. We provide a simple illustration of Theorem 4.12. When we use PICASSO based on the greedy selection rule to obtain {θb{K} }N K=0 , the computational complexity of the strong rule and each active set selection step is O(d), and the computational complexity of calculating each cyclic coordinate minimization iteration is O(s∗ + 2e s). Given se = O(s∗ ), (s∗ )2 log d d, 1/δK ≤ d, and 1/τK ≤ d, the overall computational complexity is O(N d log d). In contrast, when we use the randomized coordinate minimization algorithm to obtain {θb{K} }N K=0 , the computational complexity of calculating each randomized coordinate minimization iteration is O(1), and the iteration complexity is O(d/), where is the gap toward the optimal objective value. Then given 1/ ≤ d, the overall computational complexity is O(N d2 ), which is much worse than that of PICASSO.
4.5
Statistical Theory
Finally, we analyze the statistical properties of the estimator by PICASSO for sparse linear regression. For compactness, we only consider the greedy selection rule and MCP regularizer, but the extensions 22
to the SCAD regularization and other selection rules are straightforward. We assume that kθ ∗ k0 ≤ s∗ ) and the design matrix X satisfies kXvk22 log d kXvk22 log d ≥ ψmin kvk22 − γmin kvk21 and ≤ ψmax kvk22 + γmax kvk21 , (4.3) n n n n where γmin , γmax , ψmin , and ψmax are positive constants, and do not scale with (s∗ , n, d). Existing literature has shown that (4.3) is satisfied by many common examples of sub-Gaussian random design with high probability (Raskutti et al., 2010; Negahban et al., 2012). We then verify Assumptions 4.1 and 4.6 by the following lemma. p Lemma 4.14. Given λN = 8σ log d/n, we have P λN ≥ 4k∇L(θ ∗ )k∞ = 4kX> k∞ ≥ 1 − 2d−2 . Moreover, given α = ψmin /4, there exists a universal positive constant C1 such that for large enough n, we have se = C1 s∗ ≥ [484κ2 + 100κ] · s∗ , ρe− (s∗ + 2e s + 1) ≥ ψmin /2, and ρ+ (s∗ + 2e s + 1) ≤ 5ψmax /4.
The proof of Lemma 4.14 is presented in Appendix E.1. Lemma 4.14 guarantees that the regularization sequence satisfies Assumption 4.1 with high probability, and Assumption 4.6 holds when the design matrix satisfies (4.3). Thus Theorem 4.12 guarantees that PICASSO attains linear convergence to a unique sparse local optimum in terms of the objective value with high probability for sparse linear regression. That enables us to characterize the statistical rate of convergence for the obtained estimator in the following theorem. p Theorem 4.15 (Parameter Estimation). Given α = γ −1 = ψmin /4 and λN = 8σ log d/n, for small enough δN and large enough n such that n ≥ C2 s∗ log d, where C2 is a generic constant, we have ! r r ∗ ∗ log d s s {N } ∗ 1 2 kθb − θ k2 = OP σ +σ , {z n } | {z n} | where
s∗1
= |{j |
|θj∗ |
≥ γλN }| and
s∗2
= |{j | 0
i∗ θ) for i = 1, ..., n. 1 + exp(X> i∗ θ) When θ ∗ is sparse, we consider the minimization problem n i 1 X h > log 1 + exp(X> θ) − y X θ . min L(θ) + Rλ (θ), where L(θ) = i i∗ i∗ n θ∈Rd πi (θ) =
(5.1)
(5.2)
i=1
For notational simplicity, we denote the logistic loss function in (5.2) as L(θ), and define Leλ (θ) = L(θ) + Hλ (θ). Then similar to sparse linear regression, we also write Fλ (θ) as Fλ (θ) = L(θ) + Rλ (θ) = Leλ (θ) + λkθk1 .
The logistic loss function is twice differentiable with n n 1X 1X 1 > ∇L(θ) = [πi (θ) − yi ]Xi∗ and ∇2 L(θ) = [1 − πi (θ)]πi (θ)Xi∗ X> X PX, i∗ = n n n i=1
i=1
where P = diag([1 − π1 (θ)]π1 (θ), ..., [1 − πn (θ)]πn (θ)) ∈ Rn×n . Similar to sparse linear regression, √ we also assume that the design matrix X satisfies the column normalization condition kX∗j k2 = n for all j = 1, ..., d.
5.1
Proximal Coordinate Gradient Descent
For sparse logistic regression, directly taking the minimum with respect to a selected coordinate does not admit a closed form solution, and therefore may involve sophisticated optimization algorithms such as the root-finding method (Boyd and Vandenberghe, 2009). To address this issue, Shalev-Shwartz and Tewari (2011); Richtárik and Takáč (2012) suggest a more convenient approach, which takes a proximal coordinate gradient descent iteration. Taking the classical coordinate descent algorithm as an example, we randomly select a coordinate $j$ at the $t$-th iteration and consider a quadratic approximation of $F_\lambda(\theta_j; \theta^{(t)}_{\backslash j})$,
$$Q_{\lambda,j,L}(\theta_j; \theta^{(t)}) = V_{\lambda,j,L}(\theta_j; \theta^{(t)}) + \lambda|\theta_j| + \lambda\|\theta^{(t)}_{\backslash j}\|_1,$$
where $L > 0$ is a step size parameter, and $V_{\lambda,j,L}(\theta_j; \theta^{(t)})$ is defined as
$$V_{\lambda,j,L}(\theta_j; \theta^{(t)}) = \widetilde{\mathcal{L}}_\lambda(\theta^{(t)}) + (\theta_j - \theta_j^{(t)})\nabla_j\widetilde{\mathcal{L}}_\lambda(\theta^{(t)}) + \frac{L}{2}(\theta_j - \theta_j^{(t)})^2.$$
The step size parameter $L$ usually satisfies $L \ge \max_j\nabla^2_{jj}\mathcal{L}(\theta)$ such that $Q_{\lambda,j,L}(\theta_j; \theta^{(t)}) \ge F_\lambda(\theta_j, \theta^{(t)}_{\backslash j})$ for all $j = 1, \ldots, d$. We then take
$$\theta_j^{(t+1)} = \mathop{\mathrm{argmin}}_{\theta_j}\ Q_{\lambda,j,L}(\theta_j; \theta^{(t)}) = \mathop{\mathrm{argmin}}_{\theta_j}\ V_{\lambda,j,L}(\theta_j; \theta^{(t)}) + \lambda|\theta_j|. \qquad (5.3)$$
Different from the exact coordinate minimization, (5.3) always has a closed form solution obtained by soft thresholding. Particularly, we define $\widetilde\theta_j^{(t)} = \theta_j^{(t)} - \nabla_j\widetilde{\mathcal{L}}_\lambda(\theta^{(t)})/L$. Then we have
$$\theta_j^{(t+1)} = \mathop{\mathrm{argmin}}_{\theta_j}\ \frac{1}{2}(\theta_j - \widetilde\theta_j^{(t)})^2 + \frac{\lambda}{L}|\theta_j| = \mathcal{S}_{\lambda/L}(\widetilde\theta_j^{(t)}) \quad\text{and}\quad \theta_{\backslash j}^{(t+1)} = \theta_{\backslash j}^{(t)}.$$
For notational convenience, we write $\theta^{(t+1)} = \mathcal{T}_{\lambda,j,L}(\theta^{(t)})$. When applying PICASSO to solve sparse
logistic regression, we only need to replace $\mathcal{T}_{\lambda,j}(\cdot)$ with $\mathcal{T}_{\lambda,j,L}(\cdot)$ in Algorithms 2-7.
Remark 5.1 (Step Size Parameter). For sparse logistic regression, we have $\nabla^2_{jj}\mathcal{L}(\theta) = \frac{1}{n}X_{*j}^\top P X_{*j}$. Since $P$ is a diagonal matrix and $\pi_i(\theta) \in (0,1)$ for any $\theta \in \mathbb{R}^d$, we have $\|P\|_2 = \max_i P_{ii} \in (0, 1/4]$. Then we have $X_{*j}^\top P X_{*j} \le \|P\|_2\|X_{*j}\|_2^2 \le n/4$, where the last inequality comes from the column normalization condition of $X$. Thus we choose $L = 1/4$.
We then analyze the computational and statistical properties of the estimator obtained by PICASSO for sparse logistic regression. For compactness, we only consider the greedy selection rule and the MCP regularizer, but the extensions to the SCAD regularizer and other selection rules are straightforward.
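The update $\mathcal{T}_{\lambda,j,L}$ in (5.3) reduces to a single soft thresholding operation per coordinate; the minimal sketch below (our own illustration, not the picasso implementation) makes this explicit, with $L = 1/4$ as discussed in Remark 5.1:

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_coordinate_update(theta, j, grad_j_tilde, lam, L=0.25):
    """One proximal coordinate gradient step T_{lam,j,L} on coordinate j:
    theta_j <- S_{lam/L}(theta_j - grad_j_tilde / L), other coordinates unchanged.
    grad_j_tilde is the j-th partial derivative of L_tilde_lambda = L + H_lambda;
    L = 1/4 matches the logistic step size of Remark 5.1. Illustrative sketch only."""
    theta = theta.copy()
    theta[j] = soft_threshold(theta[j] - grad_j_tilde / L, lam / L)
    return theta
```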
5.2 Initialization by Convex Relaxation
We assume that $\|\theta^*\|_0 \le s^*$, and for any $\|\theta - \theta^*\|_2 \le R$, the design matrix satisfies
$$v^\top\nabla^2\mathcal{L}(\theta)v \ge \psi_{\min}\|v\|_2^2 - \gamma_{\min}\sqrt{\frac{\log d}{n}}\|v\|_1\|v\|_2, \qquad (5.4)$$
$$v^\top\nabla^2\mathcal{L}(\theta)v \le \psi_{\max}\|v\|_2^2 + \gamma_{\max}\sqrt{\frac{\log d}{n}}\|v\|_1\|v\|_2, \qquad (5.5)$$
where $\gamma_{\min}$, $\gamma_{\max}$, $\psi_{\min}$, $\psi_{\max}$, and $R$ are positive constants that do not scale with $(s^*, n, d)$. Existing literature has shown that many common examples of sub-Gaussian random design satisfy (5.4) and (5.5) with high probability (Raskutti et al., 2010; Negahban et al., 2012; Loh and Wainwright, 2015). Similar to sparse linear regression, we verify Assumptions 4.1 and 4.6 for sparse logistic regression by the following lemma.
Lemma 5.2. Recall that the $\pi_i(\theta)$'s are defined in (5.1). Given $\lambda_N = 16\sqrt{\log d/n}$, we have
$$\mathbb{P}\Big(\lambda_N \ge 4\|\nabla\mathcal{L}(\theta^*)\|_\infty = \tfrac{4}{n}\|X^\top w\|_\infty\Big) \ge 1 - d^{-7},$$
where $w = (\pi_1(\theta^*) - y_1, \ldots, \pi_n(\theta^*) - y_n)^\top$. Moreover, given $\alpha = \psi_{\min}/4$ and $\|\theta - \theta^*\|_2 \le R$, there exists a universal positive constant $C_1$ such that for large enough $n$, we have $\tilde{s} = C_1 s^* \ge [484\kappa^2 + 100\kappa]\cdot s^*$, $\widetilde{\rho}_-(s^* + 2\tilde{s} + 1) \ge \psi_{\min}/2$, and $\rho_+(s^* + 2\tilde{s} + 1) \le 5\psi_{\max}/4$.
The proof of Lemma 5.2 directly follows Appendix E.1 and Loh and Wainwright (2015), and is therefore omitted. Lemma 5.2 guarantees that the regularization sequence satisfies Assumption 4.1 with high probability, and that Assumption 4.6 holds when the design matrix satisfies (5.4) and (5.5). Different from sparse linear regression, however, the restricted strong convexity and smoothness only hold over an $\ell_2$ ball centered at $\theta^*$ for sparse logistic regression. Thus directly choosing $\hat\theta^{\{0\}} = 0$ may violate the restricted strong convexity. A simple counterexample is $\|\theta^*\|_2 > R$, which results in $\|0 - \theta^*\|_2 > R$. To address this issue, we propose a new convex relaxation approach to obtain an initial solution for $\lambda_0$. Particularly, we solve the following convex relaxation of (1.1),
$$\min_{\theta\in\mathbb{R}^d}\widetilde{F}_{\lambda_0}(\theta), \quad\text{where } \widetilde{F}_{\lambda_0}(\theta) = \mathcal{L}(\theta) + \lambda_0\|\theta\|_1, \qquad (5.6)$$
up to an adequate precision. For example, we choose $\theta^{\mathrm{relax}}$ satisfying the approximate KKT condition of (5.6) as follows,
$$\min_{\xi\in\partial\|\theta^{\mathrm{relax}}\|_1}\|\nabla\mathcal{L}(\theta^{\mathrm{relax}}) + \lambda_0\xi\|_\infty \le \delta_0\lambda_0, \qquad (5.7)$$
where $\delta_0 \in (0,1)$ is the initial precision parameter for $\lambda_0$. Since $\delta_0$ in (5.7) can be chosen as a sufficiently large value (e.g., $\delta_0 = 1/8$), calculating $\theta^{\mathrm{relax}}$ becomes very efficient even for algorithms with only sublinear rates of convergence to global optima, e.g., the classical coordinate minimization and proximal gradient algorithms. For notational convenience, we call the above initialization procedure the convex relaxation approach.
Lemma 5.3. Suppose that Assumptions 4.3 and 4.6 hold only for $\|\theta - \theta^*\|_2 \le R$. Given $\rho_-(s^* + \tilde{s})R \ge 9\lambda_0\sqrt{s^*} \ge 18\lambda_N\sqrt{s^*}$ and $\delta_0 = 1/8$, we have
$$\|\theta^{\mathrm{relax}}_{S}\|_0 \le \tilde{s}, \quad \|\theta^{\mathrm{relax}} - \theta^*\|_2 \le R, \quad\text{and}\quad F_{\lambda_0}(\theta^{\mathrm{relax}}) \le F_{\lambda_0}(\theta^*) + 4\lambda_0.$$
The proof of Lemma 5.3 is provided in Appendix D.2. Lemma 5.3 guarantees that θ relax is a proper initial solution for λ0 . Therefore all convergence results in Theorem 4.12 still hold, and PICASSO attains linear convergence to a unique sparse local optimum in terms of the objective value with high probability. An illustration of the initialization by the convex relaxation approach is provided in Figure 6.
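A minimal sketch of the convex relaxation initialization is given below (our own illustration of solving (5.6) by proximal gradient until the approximate KKT condition (5.7) holds; the function names, fixed step size, and iteration cap are assumptions of this sketch, not the authors' implementation):

```python
import numpy as np

def approx_kkt_gap(theta, grad, lam):
    """min over xi in the subdifferential of ||theta||_1 of ||grad + lam*xi||_inf, as in (5.7)."""
    gap = np.where(theta != 0,
                   np.abs(grad + lam * np.sign(theta)),
                   np.maximum(np.abs(grad) - lam, 0.0))
    return gap.max()

def convex_relaxation_init(grad_fn, d, lam0, delta0=1/8, step=0.25, max_iter=5000):
    """Solve min_theta L(theta) + lam0*||theta||_1 by proximal gradient descent,
    stopping once the approximate KKT gap is at most delta0*lam0; grad_fn returns
    the gradient of the smooth loss L. Illustrative sketch only."""
    theta = np.zeros(d)
    for _ in range(max_iter):
        g = grad_fn(theta)
        if approx_kkt_gap(theta, g, lam0) <= delta0 * lam0:
            break
        z = theta - step * g                                     # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam0, 0.0)  # soft thresholding
    return theta
```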
Figure 6: An illustration of the initialization by the convex relaxation approach. When the restricted strong convexity and smoothness only hold over a neighborhood $\{\theta : \|\theta - \theta^*\|_2 \le R\}$ around $\theta^*$, directly choosing $0$ as the initial solution may violate the restricted strong convexity. Thus we adopt a convex relaxation approach to obtain an initial solution $\hat\theta^{\mathrm{relax}}$, which is guaranteed to be sparse and to belong to the desired neighborhood, so that the subsequent solutions $\hat\theta^{\{0\}}, \hat\theta^{\{1\}}, \ldots, \hat\theta^{\{N\}}$ stay within the corresponding basins of attraction.
Remark 5.4. To guarantee that the solution satisfies the restricted strong convexity throughout
all iterations, Wang et al. (2014) solve (1.1) with an additional constraint,
$$\min_{\theta\in\mathbb{R}^d}\mathcal{L}(\theta) + \mathcal{R}_\lambda(\theta) \quad\text{subject to } \|\theta\|_2 \le R/2. \qquad (5.8)$$
The $\ell_2$ norm constraint in (5.8) introduces an additional tuning parameter. Moreover, their statistical analysis assumes $\|\theta^*\|_2 \le R/2$. This assumption is very strong, since $R$ is a constant and cannot scale with $(n, s^*, d)$. In contrast, our proposed convex relaxation approach avoids this assumption, and allows $\|\theta^*\|_2$ to be arbitrarily large.
The next theorem presents the statistical rate of convergence of the obtained estimator in parameter estimation.
Theorem 5.5 (Parameter Estimation). Given $\alpha = \psi_{\min}/4$ and $\lambda_N = 16\sqrt{\log d/n}$, for large enough $n$ and small enough $\delta_N$, we have
$$\|\hat\theta^{\{N\}} - \theta^*\|_2 = O_P\Bigg(\sqrt{\frac{s_1^*\log s_1^*}{n}} + \sqrt{\frac{s_2^*\log d}{n}}\Bigg),$$
where
$s_1^* = |\{j \mid |\theta_j^*| \ge \gamma\lambda_N\}|$ and $s_2^* = |\{j \mid 0 < |\theta_j^*| < \gamma\lambda_N\}|$.
Lemma 8.6. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For the proximal coordinate gradient descent and $m = 0, 1, 2, \ldots$, we have
$$F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\theta^{[m+1]}) \ge \frac{1}{s^* + 2\tilde{s}}\Big[F_\lambda(\theta^{[m+0.5]}) - J_{\lambda,L}(w^{[m+1]}; \theta^{[m+0.5]})\Big].$$
Lemma 8.7. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For the proximal coordinate gradient descent and $m = 0, 1, 2, \ldots$, we have
$$F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda) \le \frac{L}{\widetilde{\rho}_-(s^* + 2\tilde{s})}\Big[F_\lambda(\theta^{[m+0.5]}) - J_{\lambda,L}(w^{[m+1]}; \theta^{[m+0.5]})\Big].$$
The proofs of Lemmas 8.6 and 8.7 are presented in Appendices C.5 and C.8. Lemmas 8.6 and 8.7 characterize the successive descent in each iteration and the gap towards the optimal objective value after each iteration, respectively. Combining Lemmas 8.6 and 8.7, we obtain
$$F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda) \le \frac{(s^* + 2\tilde{s})L}{\widetilde{\rho}_-(s^* + 2\tilde{s})}\Big\{[F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda)] - [F_\lambda(\theta^{[m+1]}) - F_\lambda(\bar\theta^\lambda)]\Big\}. \qquad (8.5)$$
By simple manipulation, (8.5) implies
$$F_\lambda(\theta^{[m+1]}) - F_\lambda(\bar\theta^\lambda) \le \Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})L}\Big)[F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda)] \overset{(i)}{\le} \Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})L}\Big)[F_\lambda(\theta^{[m]}) - F_\lambda(\bar\theta^\lambda)] \overset{(ii)}{\le} \Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})L}\Big)^{m+1}[F_\lambda(\theta^{[0]}) - F_\lambda(\bar\theta^\lambda)], \qquad (8.6)$$
where (i) comes from (8.4), and (ii) comes from recursively applying (i). For the exact coordinate minimization, at the $m$-th iteration we only need to conduct a proximal coordinate gradient descent iteration with $L = \rho_+(1)$ and obtain an auxiliary solution $\widetilde\theta^{[m+1]}$. Since $F_\lambda(\theta^{[m+1]}) \le F_\lambda(\widetilde\theta^{[m+1]})$, by (8.6) we further have
$$F_\lambda(\theta^{[m+1]}) - F_\lambda(\bar\theta^\lambda) \le \Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})\rho_+(1)}\Big)\Big[F_\lambda(\theta^{[m]}) - F_\lambda(\bar\theta^\lambda)\Big]. \qquad (8.7)$$
[Empirical Iteration Complexity]: Before we proceed with the proof, we introduce the following lemma.
Lemma 8.8. Suppose that Assumptions 4.3 and 4.6 hold. For any $\theta$, we conduct an exact coordinate minimization or proximal coordinate gradient descent iteration over a coordinate $k$, and obtain $w$. Then we have
$$F_\lambda(\theta) - F_\lambda(w) \ge \frac{\nu_-(1)}{2}(w_k - \theta_k)^2.$$
Moreover, when $\theta_k = 0$ and $|\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta)| \ge (1+\delta)\lambda$, we have
$$|w_k| \ge \frac{\delta\lambda}{L} \quad\text{and}\quad F_\lambda(\theta) - F_\lambda(w) \ge \frac{\delta^2\lambda^2}{2\nu_+(1)}.$$
The proof of Lemma 8.8 is presented in Appendix B.3. Lemma 8.8 characterizes the sufficient descent obtained when adding the selected inactive coordinate into the active set. Assume that the selected coordinate $k_m$ satisfies $|\nabla_{k_m}\widetilde{\mathcal{L}}_\lambda(\theta^{[m+0.5]})| \ge (1+\delta)\lambda$. Then by Lemma 8.8, we have
$$F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda) \ge F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\theta^{[m+1]}) \ge \frac{\delta^2\lambda^2}{2\nu_+(1)}. \qquad (8.8)$$
Moreover, by (8.6) and (8.7), we need at most
$$m = \log^{-1}\Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})\nu_+(1)}\Big)\log\Big(\frac{\delta^2\lambda^2}{3\nu_+(1)[F_\lambda(\theta^{[0]}) - F_\lambda(\bar\theta^\lambda)]}\Big)$$
iterations such that $F_\lambda(\theta^{[m+0.5]}) - F_\lambda(\bar\theta^\lambda) \le \frac{\delta^2\lambda^2}{3\nu_+(1)}$, which contradicts (8.8). Therefore we must have $\max_{k\in\overline{\mathcal{A}}_m}|\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta^{[m+0.5]})| \le (1+\delta)\lambda$, and the algorithm terminates.
[Approximate Optimal Output Solution] By Lemma 8.4, we know that when every inner loop terminates, the approximate KKT condition must hold over the active set. Since $\nabla_{\overline{\mathcal{A}}_m}\mathcal{H}_\lambda(\theta^{[m+0.5]}) = 0$, the stopping criterion $\max_{k\in\overline{\mathcal{A}}_m}|\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta^{[m+0.5]})| \le (1+\delta)\lambda$ implies that the approximate KKT condition holds over the inactive set,
$$\min_{\xi_{\overline{\mathcal{A}}_m}\in\partial\|\theta^{[m+0.5]}_{\overline{\mathcal{A}}_m}\|_1}\big\|\nabla_{\overline{\mathcal{A}}_m}\widetilde{\mathcal{L}}_\lambda(\theta^{[m+0.5]}) + \lambda\xi_{\overline{\mathcal{A}}_m}\big\|_\infty \le \delta\lambda.$$
Combining the above two approximate KKT conditions implies that $\theta^{[m+0.5]}$ satisfies the approximate KKT condition $\mathcal{K}_\lambda(\theta^{[m+0.5]}) \le \delta\lambda$. Moreover, the randomized selection rule can be analyzed in a similar manner. Due to space limitations, we present its proof in Appendix C.10.
We then analyze the convergence properties of the middle loop using the truncated cyclic selection rule. To guarantee the sparsity of the active set, we need to show that the truncated cyclic selection moves no more than $\tilde{s}$ inactive coordinates to the active set in each active set updating step. We prove this by contradiction. Before we proceed with the proof, we first introduce the following lemmas.
Lemma 8.9. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. Given a solution $\theta^{[m+0.5]}$, we assume that the truncated cyclic selection adds exactly $\tilde{s}+1$ coordinates into the active set. Then we have
$$F_\lambda(\theta^{[m]}) \le F_\lambda(\theta^{[m+1]}) + \frac{36\lambda^2 s^*}{\widetilde{\rho}_-(s^* + 2\tilde{s} + 1)}.$$
Lemma 8.10. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. Given a solution $\theta^{[m+0.5]}$, we assume that the truncated cyclic selection adds exactly $\tilde{s}+1$ coordinates into the active set. Then we have
$$F_\lambda(\theta^{[m+1]}) \le F_\lambda(\theta^{[m+0.5]}) \le F_\lambda(\theta^{[m]}) - \frac{(\tilde{s}+1)\delta^2\lambda^2}{2\nu_+(1)}.$$
The proofs of Lemmas 8.9 and 8.10 are presented in Appendices C.11 and C.12. Lemma 8.9 characterizes the maximum descent we can gain within each iteration of the middle loop. Lemma 8.10 characterizes the successive descent obtained by adding $\tilde{s}+1$ inactive coordinates into the active set.
[Solution sparsity] By simple manipulation of Lemmas 8.9 and 8.10, we have
$$\tilde{s} \le \frac{72\nu_+(1)s^*}{\delta^2\widetilde{\rho}_-(s^* + 2\tilde{s} + 1)} \le \frac{72\kappa}{\delta^2}\cdot s^*. \qquad (8.9)$$
Thus if $\delta \ge \sqrt{73/(484\kappa + 100)}$, then (8.9) implies $\tilde{s} < (484\kappa^2 + 100\kappa)s^*$, which contradicts Assumption 4.6. Thus the truncated cyclic selection cannot add more than $\tilde{s}$ inactive coordinates into the active set.
[Empirical Iteration Complexity] Since the algorithm guarantees the solution sparsity, i.e., $\|\theta_S\|_0 \le \tilde{s}$, by the restricted strong convexity of $F_\lambda(\theta)$ we have
$$F_\lambda(\theta) - F_\lambda(\bar\theta^\lambda) \ge (\theta - \bar\theta^\lambda)^\top(\nabla\widetilde{\mathcal{L}}_\lambda(\bar\theta^\lambda) + \lambda\xi) + \frac{\rho_-(s^* + 2\tilde{s})}{2}\|\theta - \bar\theta^\lambda\|_2^2 \ge 0,$$
where $\xi \in \partial\|\bar\theta^\lambda\|_1$ satisfies the optimality condition of (1.1), i.e., $\nabla\widetilde{\mathcal{L}}_\lambda(\bar\theta^\lambda) + \lambda\xi = 0$. Thus we know that $F_\lambda(\theta)$ is always lower bounded by $F_\lambda(\bar\theta^\lambda)$. Moreover, by Lemma 8.8, we have
$$F_\lambda(\theta^{[m]}) - F_\lambda(\theta^{[m+1]}) \ge \frac{\delta^2\lambda^2}{2\nu_+(1)}.$$
Therefore, the number of active set updating iterations is at most
$$\frac{2\nu_+(1)[F_\lambda(\theta^{[0]}) - F_\lambda(\bar\theta^\lambda)]}{\delta^2\lambda^2}.$$

8.3 Proof of Theorem 4.12
Proof. [Result (1)] Before we proceed with the proof, we introduce the following lemma.
Lemma 8.11. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For any $\lambda \ge \lambda_N$, if $\theta$ satisfies $\|\theta_S\|_0 \le \tilde{s}$ and $\mathcal{K}_\lambda(\theta) \le \delta\lambda$, where $\delta \le 1/8$, then for any $\lambda' \in [\lambda_N, \lambda]$ we have
$$F_{\lambda'}(\theta) - F_{\lambda'}(\bar\theta^{\lambda'}) \le \frac{40\big(\mathcal{K}_\lambda(\theta) + 3(\lambda - \lambda')\big)(\lambda + \lambda')s^*}{\widetilde{\rho}_-(s^* + \tilde{s})}.$$
The proof of Lemma 8.11 is presented in Appendix D.3. If we take $\lambda = \lambda' = \lambda_K$ and $\theta = \hat\theta^{\{K-1\}}$, then Lemma 8.11 implies
$$F_{\lambda_K}(\hat\theta^{\{K-1\}}) - F_{\lambda_K}(\bar\theta^{\lambda_K}) \le \frac{25 s^*\lambda_K^2}{\widetilde{\rho}_-(s^* + \tilde{s})}. \qquad (8.10)$$
Since the objective value always decreases in each middle loop, for any inner loop with $\lambda_K$ we have $F_{\lambda_K}(\theta^{(0)}) - F_{\lambda_K}(\bar\theta) \le F_{\lambda_K}(\hat\theta^{\{K-1\}}) - F_{\lambda_K}(\bar\theta^{\lambda_K})$. Therefore, by Theorem 4.8 and (8.10), the number of iterations within each inner loop is at most
$$\log^{-1}\Big(\frac{s\rho_+^2(s)}{2\widetilde{\rho}_-(s)\nu_-(1) + s\rho_+^2(s)}\Big)\log\Big(\frac{\nu_-(1)\tau_K^2\,\widetilde{\rho}_-(s^* + \tilde{s})}{25 s^*}\Big),$$
which is Result (1) in Theorem 4.12.
[Results (2)-(4)] Combining Result (2) in Theorem 4.9 with (8.10), we know that the number of active set updating iterations within each middle loop is at most
$$\log^{-1}\Big(1 - \frac{\widetilde{\rho}_-(s^* + 2\tilde{s})}{(s^* + 2\tilde{s})\nu_+(1)}\Big)\log\Big(\frac{\delta_K^2\,\widetilde{\rho}_-(s^* + \tilde{s})}{75\nu_+(1)s^*}\Big),$$
which is Result (2) in Theorem 4.12. Similarly, combining (8.10) with Results (3) and (4) in Theorem 4.9, we obtain Results (3) and (4) in Theorem 4.12, respectively.
[Results (5)-(7)] These are straightforward combinations of Results (1)-(4).
[Result (8)] For $K < N$, we take $\lambda' = \lambda_N$, $\lambda = \lambda_K$, and $\theta = \hat\theta^{\{K\}}$. Then by Lemma 8.11, we have
$$F_{\lambda_N}(\hat\theta^{\{K\}}) - F_{\lambda_N}(\bar\theta^{\lambda_N}) \le \frac{25(\lambda_K + \lambda_N)\big(\mathcal{K}_{\lambda_K}(\hat\theta^{\{K\}}) + 3(\lambda_K - \lambda_N)\big)s^*}{\widetilde{\rho}_-(s^* + \tilde{s})}. \qquad (8.11)$$
Since $\lambda_K > \lambda_N$ for $K = 0, \ldots, N-1$, we directly obtain Result (8) using (8.11).
8.4 Proof of Theorem 4.15

Before we proceed with the main proof, we first introduce the following lemmas.
Lemma 8.12. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. Then we have
$$\|\hat\theta^{\{N\}} - \theta^*\|_2 = O\Bigg(\underbrace{\frac{\|\nabla_{S_1}\mathcal{L}(\theta^*)\|_2}{\widetilde{\rho}_-(s^* + 2\tilde{s})}}_{V_1} + \underbrace{\frac{\lambda_N\sqrt{|S_2^*|}}{\widetilde{\rho}_-(s^* + \tilde{s})}}_{V_2} + \underbrace{\frac{\delta_N\lambda_N\sqrt{s^*}}{\widetilde{\rho}_-(s^* + \tilde{s})}}_{V_3}\Bigg),$$
where $S_1 = \{j \mid |\theta_j^*| \ge \gamma\lambda_N\}$ and $S_2^* = \{j \mid 0 < |\theta_j^*| < \gamma\lambda_N\}$.
The proof of Lemma 8.12 is presented in Appendix E.2. Lemma 8.12 divides the estimation error of $\hat\theta^{\{N\}}$ into three parts: $V_1$ is the error for strong signals, $V_2$ is the error for weak signals, and $V_3$ is the optimization error.
Lemma 8.13. Suppose that Assumptions 4.3 and 4.6 hold, $X$ satisfies the column normalization condition, and the observation noise $\epsilon \sim N(0, \sigma^2 I)$ is Gaussian. We then have
$$\mathbb{P}\Bigg(\frac{1}{n}\|X_{*S_1}^\top\epsilon\|_2 \ge 3\sigma\sqrt{\frac{\rho_+(|S_1|)\cdot|S_1|}{n}}\Bigg) \le 2\exp(-2|S_1|).$$
The proof of Lemma 8.13 is provided in Appendix E.3. Lemma 8.13 characterizes the large deviation properties of $\|\nabla_{S_1}\mathcal{L}(\theta^*)\|_2$ in Lemma 8.12 for sparse linear regression. We then proceed with the main proof. For notational simplicity, we omit the index $N$ and denote $\hat\theta^{\{N\}}$, $\lambda_N$, and $\delta_N$ by $\hat\theta$, $\lambda$, and $\delta$, respectively. If we choose a sufficiently small $\delta$ such that $\delta \le \frac{1}{40\sqrt{s^*}}$, then applying Lemmas 8.12 and 8.13 we obtain
$$\|\widehat\Delta\|_2 \le \frac{3\sigma\sqrt{\rho_+(|S_1|)\cdot|S_1|/n}}{\widetilde{\rho}_-(s^* + 2\tilde{s})} + \frac{3\lambda\sqrt{|S_2|}}{\widetilde{\rho}_-(s^* + 2\tilde{s})} + \frac{0.3\lambda}{\widetilde{\rho}_-(s^* + 2\tilde{s})}.$$
Since all the above results rely on Assumptions 4.1 and 4.6, by Lemma 4.14 we have
$$\|\widehat\Delta\|_2 \le \frac{15\sigma\sqrt{\rho_+(|S_1|)\cdot|S_1|}}{\psi_{\min}\sqrt{n}} + \frac{(96\sqrt{|S_2|} + 10)\sigma}{\psi_{\min}}\sqrt{\frac{\log d}{n}}$$
with probability at least $1 - 2\exp(-2\log d) - 2\exp(-2|S_1|)$, which completes the proof.
8.5 Proof of Theorem 4.16

Proof. For any $\theta^*$, we consider a partition of $\mathbb{R}^d$ as
$$S_1 = \Big\{j \;\Big|\; |\theta_j^*| \ge \frac{C\sigma}{\sqrt{s_1^* + s_2^*}}\Big\} \quad\text{and}\quad S_2 = \Big\{j \;\Big|\; |\theta_j^*| < \frac{C\sigma}{\sqrt{s_1^* + s_2^*}}\Big\}.$$
We consider the first scenario, where $\theta^*_{S_2}$ is known. Then we establish the minimax lower bound for estimating $\theta^*_{S_1}$. Particularly, let $\widetilde\theta_{S_1}$ denote any estimator of $\theta^*_{S_1}$ based on $y - X_{S_2}\theta^*_{S_2} \sim N(X_{S_1}\theta^*_{S_1}, \sigma^2 I)$. This is essentially a low dimensional linear regression problem since $s_1^* \ll n$. By the minimax lower bound for the standard linear regression model established in Chapter 2 of Duchi (2015), we have
$$\inf_{\widetilde\theta_{S_1}}\sup_{\theta\in\Theta_\sigma(s_1^*, s_2^*, d)}\mathbb{E}\|\widetilde\theta_{S_1} - \theta^*_{S_1}\|_2 \ge C_6\,\sigma\sqrt{\frac{s_1^*}{n}}$$
for a universal constant $C_6$. We then consider the second scenario, where $\theta^*_{S_1}$ is known. Then we establish the minimax lower bound for estimating $\theta^*_{S_2}$. Particularly, let $\widetilde\theta_{S_2}$ denote any estimator of $\theta^*_{S_2}$ based on $y - X_{S_1}\theta^*_{S_1} \sim N(X_{S_2}\theta^*_{S_2}, \sigma^2 I)$. This is essentially a high dimensional sparse linear regression problem. By the minimax lower bound for the sparse linear regression model established in Raskutti et al. (2011), we have
$$\inf_{\widetilde\theta_{S_2}}\sup_{\theta\in\Theta_\sigma(s_1^*, s_2^*, d)}\mathbb{E}\|\widetilde\theta_{S_2} - \theta^*_{S_2}\|_2 \ge 2C_7\,\sigma\sqrt{\frac{s_2^*\log(d - s_2^*)}{n}} \ge C_7\,\sigma\sqrt{\frac{s_2^*\log d}{n}},$$
where $C_7$ is a universal constant, and the last inequality comes from the fact that $s_2^* \ll d$. Combining the two scenarios, and abbreviating $\Theta_\sigma = \Theta_\sigma(s_1^*, s_2^*, d)$, we have
$$\inf_{\hat\theta}\sup_{\theta\in\Theta_\sigma}\mathbb{E}\|\hat\theta - \theta^*\|_2 \ge \max\Big\{\inf_{\widetilde\theta_{S_1}}\sup_{\theta\in\Theta_\sigma}\mathbb{E}\|\widetilde\theta_{S_1} - \theta^*_{S_1}\|_2,\ \inf_{\widetilde\theta_{S_2}}\sup_{\theta\in\Theta_\sigma}\mathbb{E}\|\widetilde\theta_{S_2} - \theta^*_{S_2}\|_2\Big\} \ge \frac{1}{2}\inf_{\widetilde\theta_{S_1}}\sup_{\theta\in\Theta_\sigma}\mathbb{E}\|\widetilde\theta_{S_1} - \theta^*_{S_1}\|_2 + \frac{1}{2}\inf_{\widetilde\theta_{S_2}}\sup_{\theta\in\Theta_\sigma}\mathbb{E}\|\widetilde\theta_{S_2} - \theta^*_{S_2}\|_2 \ge \frac{C_6}{2}\sigma\sqrt{\frac{s_1^*}{n}} + \frac{C_7}{2}\sigma\sqrt{\frac{s_2^*\log d}{n}} \ge C_4\Bigg(\sigma\sqrt{\frac{s_1^*}{n}} + \sigma\sqrt{\frac{s_2^*\log d}{n}}\Bigg),$$
where $C_4 = \min\{C_6/2, C_7/2\}$.
8.6 Proof of Theorem 4.17
Proof. For notational simplicity, we denote $\lambda_N$ by $\lambda$ and $\bar\theta^{\lambda_N}$ by $\bar\theta$. Before we proceed with the proof, we first introduce the following lemmas.
Lemma 8.14. Suppose that $X$ satisfies the column normalization condition, and the observation noise $\epsilon \sim N(0, \sigma^2 I)$ is Gaussian. Then we have
$$\mathbb{P}\Bigg(\frac{1}{n}\|X^\top\epsilon\|_\infty \ge 2\sigma\sqrt{\frac{\log d}{n}}\Bigg) \le 2d^{-2}.$$
Lemma 8.15. Suppose that Assumptions 4.1, 4.3, and 4.6, and the following event
$$\mathcal{E}_1 = \Bigg\{\frac{1}{n}\|X^\top\epsilon\|_\infty \le 2\sigma\sqrt{\frac{\log d}{n}}\Bigg\}$$
hold. We have
$$\frac{1}{n}X_{*S}^\top(Y - X\hat\theta^{\,o}) + \nabla_S\mathcal{H}_\lambda(\hat\theta^{\,o}) + \lambda\nabla\|\hat\theta^{\,o}_S\|_1 = 0.$$
Lemma 8.16. Suppose that Assumptions 4.1, 4.3, and 4.6, and the following event
$$\mathcal{E}_2 = \Bigg\{\frac{1}{n}\|U\epsilon\|_\infty \le 2\sigma\sqrt{\frac{\log d}{n}}\Bigg\}$$
hold, where $U = X^\top\big(I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top\big)$. There exists some $\hat\xi^{\,o}_{\overline{S}} \in \partial\|\hat\theta^{\,o}_{\overline{S}}\|_1$ such that
$$\frac{1}{n}X_{*\overline{S}}^\top(Y - X\hat\theta^{\,o}) + \nabla_{\overline{S}}\mathcal{H}_\lambda(\hat\theta^{\,o}) + \lambda\hat\xi^{\,o}_{\overline{S}} = 0.$$
The proof of Lemma 8.14 is provided in Negahban et al. (2012) and is therefore omitted. The proofs of Lemmas 8.15 and 8.16 are presented in Appendices E.4 and E.5. Lemmas 8.15 and 8.16 imply that $\hat\theta^{\,o}$ satisfies the KKT condition of (1.1) over $S$ and $\overline{S}$, respectively. Note that the above results only
depend on Conditions E1 and E2 . Meanwhile, we also have
$$\|U_{*j}\|_2 = \big\|X_{*j}^\top\big(I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top\big)\big\|_2 \le \big\|I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top\big\|_2\|X_{*j}\|_2 \le \|X_{*j}\|_2 = \sqrt{n}, \qquad (8.12)$$
where the last inequality comes from $\|I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top\|_2 \le 1$. Therefore (8.12) implies that Lemma 8.14 can also be applied to verify $\mathcal{E}_2$. Moreover, since both $\hat\theta^{\{N\}}$ and $\hat\theta^{\,o}$ are sparse local optima, by Lemma C.1 we further have $\mathbb{P}(\hat\theta^{\,o} = \bar\theta) \ge 1 - 4d^{-2}$.
Moreover, since $\hat\theta$ converges to $\bar\theta$, given a sufficiently small $\delta_N$ we have
$$\|\nabla\widetilde{\mathcal{L}}_\lambda(\bar\theta) - \nabla\widetilde{\mathcal{L}}_\lambda(\hat\theta)\|_\infty \le \|\nabla\widetilde{\mathcal{L}}_\lambda(\bar\theta) - \nabla\widetilde{\mathcal{L}}_\lambda(\hat\theta)\|_2 \le \rho_+(s^*)\|\bar\theta - \hat\theta\|_2 \le \omega \le \frac{\lambda}{4}.$$
Since we have proved $\|\nabla_{\overline{S}}\widetilde{\mathcal{L}}_\lambda(\bar\theta)\|_\infty \le \lambda/4$ in Lemma 8.16, we have
$$\|\nabla_{\overline{S}}\widetilde{\mathcal{L}}_\lambda(\hat\theta)\|_\infty \le \|\nabla_{\overline{S}}\widetilde{\mathcal{L}}_\lambda(\bar\theta)\|_\infty + \|\nabla\widetilde{\mathcal{L}}_\lambda(\bar\theta) - \nabla\widetilde{\mathcal{L}}_\lambda(\hat\theta)\|_\infty \le \frac{\lambda}{4} + \omega.$$
Since $\hat\theta$ also satisfies the approximate KKT condition and $\delta \le 1/8$, we must have $\hat\theta_{\overline{S}} = 0$. Moreover, since we have also proved in Lemma 8.15 that there exists some constant $C_8$ such that $\min_{j\in S}|\bar\theta_j| \ge C_8\sigma\sqrt{\log d/n}$, then for $\omega/\rho_-(s^*) \ll C_8\sigma\sqrt{\log d/n}$ we have
$$\min_{j\in S}|\hat\theta_j| \ge \min_{j\in S}|\bar\theta_j| - \omega \ge C_8\sigma\sqrt{\frac{\log d}{n}} - \omega > 0.$$
Combining this with the fact $\hat\theta_{\overline{S}} = 0$, we have $\mathrm{supp}(\hat\theta) = \mathrm{supp}(\bar\theta) = \mathrm{supp}(\theta^*)$. Meanwhile, since all signals are strong enough, by Theorem 4.15 we also have
$$\|\hat\theta - \theta^*\|_2 \le C_3\sigma\sqrt{\frac{s^*}{n}},$$
which completes the proof.
References Adams, R. P. et al. (2007). Identification of essential oil components by gas chromatography/mass spectrometry. Ed. 4, Allured publishing corporation. Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics 37 1705–1732. Boyd, S. and Vandenberghe, L. (2009). Convex Optimization. Cambridge University Press. Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics 5 232–253. ¨ hlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and Bu applications. Springer. Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory 51 4203–4215. Chen, S., Donoho, D. and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20 33–61. Duchi, J. (2015). Lecture notes for statistics and information theory. http://stanford.edu/class/ stats311/Lectures/full_notes.pdf. Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32 407–499.
Eloyan, A., Muschelli, J., Nebel, M. B., Liu, H., Han, F., Zhao, T., Barber, A. D., Joel, S., Pekar, J. J., Mostofsky, S. H. et al. (2012). Automated diagnoses of attention deficit hyperactive disorder using magnetic resonance imaging. Frontiers in systems neuroscience 6. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360. Fan, J., Xue, L., Zou, H. et al. (2014). Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics 42 819–849. ¨ fling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Friedman, J., Hastie, T., Ho The Annals of Applied Statistics 1 302–332. Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33 1–13. Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7 397–416. Genkin, A., Lewis, D. D. and Madigan, D. (2007). Large-scale bayesian logistic regression for text categorization. Technometrics 49 291–304. Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research 3 1157–1182. Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E. and Tatham, R. L. (2006). Multivariate data analysis, vol. 6. Pearson Prentice Hall Upper Saddle River, NJ. Hastie, T. (2009). Fast regularization paths via coordinate descent. The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Denver 2008. Plenary Talk. Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica 18 1603. Kim, Y., Choi, H. and Oh, H.-S. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association 103 1665–1673. Kim, Y. and Kwon, S. (2012). Global optimality of nonconvex penalized estimators. Biometrika 99 315–325. Kooij, A. J. et al. (2007). Prediction accuracy and stability of regression with optimal scaling transformations. Child & Family Studies and Data Theory (AGP-D), Department of Education and Child Studies, Faculty of Social and Behavioural Sciences, Leiden University. Krishnapuram, B., Carin, L., Figueiredo, M. A. and Hartemink, A. J. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 957–968. Ledoux, M. and Talagrand, M. (2011). Probability in Banach Spaces: isoperimetry and processes. Springer. Loh, P.-L. and Wainwright, M. J. (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research To appear. Lounici, K., Pontil, M., van de Geer, S., Tsybakov, A. B. et al. (2011). Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics 39 2164–2204. Lu, Z. and Xiao, L. (2014). On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming To appear. Mazumder, R., Friedman, J. H. and Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106 1125–1138. ¨ hlmann, P. (2006). High dimensional graphs and variable selection with the Meinshausen, N. and Bu lasso. The Annals of Statistics 34 1436–1462. Neale, B. M., Kou, Y., Liu, L., Ma’Ayan, A., Samocha, K. 
E., Sabo, A., Lin, C.-F., Stevens, C.,
Wang, L.-S., Makarov, V. et al. (2012). Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485 242–245. Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science 27 538–557. Nesterov, Y. (2013). Gradient methods for minimizing composite objective function. Mathematical Programming Series B 140 125–161. Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research 11 2241–2259. Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over-balls. Information Theory, IEEE Transactions on 57 6976–6994. ´rik, P. and Taka ´c ˇ, M. (2012). Iteration complexity of randomized block-coordinate descent methods Richta for minimizing a composite function. Mathematical Programming 1–38. Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L. et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences 103 14429–14434. Shalev-Shwartz, S. and Tewari, A. (2011). Stochastic methods for `1 -regularized loss minimization. The Journal of Machine Learning Research 12 1865–1892. Shen, X., Pan, W. and Zhu, Y. (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107 223–232. Shevade, S. K. and Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19 2246–2253. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288. Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 245–266. Wainwright, M. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using `1 constrained quadratic programming. IEEE Transactions on Information Theory 55 2183–2201. Wang, L., Kim, Y. and Li, R. (2013). Calibrating nonconvex penalized regression in ultra-high dimension. The Annals of Statistics 41 2505–2536. Wang, Z., Liu, H. and Zhang, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics 42 2164–2201. Xue, L., Zou, H. and Cai, T. (2012). Nonconcave penalized composite conditional likelihood estimation of sparse ising models. The Annals of Statistics 40 1403–1429. Zhang, C.-H. (2010a). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38 894–942. Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics 36 1567–1594. Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science 27 576–593. Zhang, T. (2009). Some sharp performance bounds for least squares regression with `1 regularization. The Annals of Statistics 37 2109–2144. Zhang, T. (2010b). Analysis of multi-stage convex relaxation for sparse regularization. 
The Journal of Machine Learning Research 11 1081–1107.
Zhang, T. et al. (2013). Multi-stage convex relaxation for feature selection. Bernoulli 19 2277–2293. Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2563. Zhao, T. and Liu, H. (2014). Accelerated path-following iterative shrinkage thresholding algorithm. Tech. rep., Princeton University. Zhao, T., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research 13 1059–1062. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 1418–1429. Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36 1509–1533.
Supplementary Materials to “A General Theory of Pathwise Coordinate Optimization” Tuo Zhao, Han Liu, and Tong Zhang
Abstract. The Supplementary Materials contain the supplementary proofs of the theoretical results in the paper “A General Theory of Pathwise Coordinate Optimization” authored by Tuo Zhao, Han Liu, and Tong Zhang. Particularly, Appendix A provides the proof of Lemma 4.5; Appendix B provides the proofs of all lemmas related to Theorem 4.8; Appendix C provides the proofs of all lemmas related to Theorem 4.9; Appendix D provides the proofs of all lemmas related to Theorem 4.12; Appendix E provides the proofs of all lemmas related to the statistical theory.
A Proof of Lemma 4.5

Proof. Since $\mathcal{L}(\theta)$ is twice differentiable and $\|\theta - \theta'\|_0 \le s$, by the mean value theorem we have
$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - (\theta' - \theta)^\top\nabla\mathcal{L}(\theta) = \frac{1}{2}(\theta' - \theta)^\top\nabla^2\mathcal{L}(\widetilde\theta)(\theta' - \theta), \qquad (A.1)$$
where $\widetilde\theta = (1-\beta)\theta' + \beta\theta$ for some $\beta \in (0,1)$. By Definition 4.4, we have
$$\frac{\rho_-(s)}{2}\|\theta' - \theta\|_2^2 \le \frac{1}{2}(\theta' - \theta)^\top\nabla^2\mathcal{L}(\widetilde\theta)(\theta' - \theta) \le \frac{\rho_+(s)}{2}\|\theta' - \theta\|_2^2. \qquad (A.2)$$
Combining (A.1) with (A.2), we have
$$\frac{\rho_-(s)}{2}\|\theta' - \theta\|_2^2 \le \mathcal{L}(\theta') - \mathcal{L}(\theta) - (\theta' - \theta)^\top\nabla\mathcal{L}(\theta) \le \frac{\rho_+(s)}{2}\|\theta' - \theta\|_2^2. \qquad (A.3)$$
By (R.1) in Assumption 4.3, we have
$$-\frac{\alpha}{2}\|\theta' - \theta\|_2^2 \le \mathcal{H}_\lambda(\theta') - \mathcal{H}_\lambda(\theta) - (\theta' - \theta)^\top\nabla\mathcal{H}_\lambda(\theta) \le 0. \qquad (A.4)$$
Combining (A.3) with (A.4), we have
$$\frac{\rho_-(s) - \alpha}{2}\|\theta' - \theta\|_2^2 \le \widetilde{\mathcal{L}}_\lambda(\theta') - \widetilde{\mathcal{L}}_\lambda(\theta) - (\theta' - \theta)^\top\nabla\widetilde{\mathcal{L}}_\lambda(\theta) \le \frac{\rho_+(s)}{2}\|\theta' - \theta\|_2^2. \qquad (A.5)$$
By the convexity of $\|\theta\|_1$, we have
$$\|\theta'\|_1 \ge \|\theta\|_1 + (\theta' - \theta)^\top\xi \qquad (A.6)$$
for any $\xi \in \partial\|\theta\|_1$. Combining (A.6) with (A.5), we obtain
$$F_\lambda(\theta') \ge F_\lambda(\theta) + (\theta' - \theta)^\top(\nabla\widetilde{\mathcal{L}}_\lambda(\theta) + \lambda\xi) + \frac{\widetilde{\rho}_-(s)}{2}\|\theta' - \theta\|_2^2.$$
B Lemmas for Proving Theorem 4.8

B.1 Proof of Lemma 8.1
Proof. By Lemma 8.8, we have
$$F_\lambda(w^{(t+1,k-1)}) - F_\lambda(w^{(t+1,k)}) \ge \frac{\nu_-(1)}{2}\big(w_k^{(t+1,k-1)} - w_k^{(t+1,k)}\big)^2 = \frac{\nu_-(1)}{2}\big(\theta_k^{(t+1)} - \theta_k^{(t)}\big)^2,$$
which further implies
$$F_\lambda(\theta^{(t)}) - F_\lambda(\theta^{(t+1)}) = \sum_{k=1}^s\big[F_\lambda(w^{(t+1,k-1)}) - F_\lambda(w^{(t+1,k)})\big] \ge \frac{\nu_-(1)}{2}\|\theta^{(t)} - \theta^{(t+1)}\|_2^2.$$

B.2 Proof of Lemma 8.2
Proof. We first analyze the gap for the proximal coordinate gradient descent. Let $\theta \in \mathbb{R}^d$ be a vector satisfying $\theta_{\overline{\mathcal{A}}} = 0$. By the restricted strong convexity of $F_\lambda(\theta)$, we have
$$F_\lambda(\theta) \ge F_\lambda(\theta^{(t+1)}) + \big(\nabla_{\mathcal{A}}\widetilde{\mathcal{L}}_\lambda(\theta^{(t+1)}) + \lambda\xi_{\mathcal{A}}^{(t+1)}\big)^\top(\theta - \theta^{(t+1)})_{\mathcal{A}} + \frac{\widetilde{\rho}_-(s)}{2}\|\theta - \theta^{(t+1)}\|_2^2, \qquad (B.1)$$
where $\xi_{\mathcal{A}}^{(t+1)}$ satisfies the optimality condition of the proximal coordinate gradient descent,
$$\nabla V_{\lambda,k,L}(\theta_k^{(t+1)}; w^{(t+1,k-1)}) + \lambda\xi_k^{(t+1)} = 0 \quad\text{for any } k \in \mathcal{A}. \qquad (B.2)$$
We minimize both sides of (B.1) with respect to $\theta_{\mathcal{A}}$ and set $\theta_{\overline{\mathcal{A}}} = 0$. We then obtain
$$F_\lambda(\theta^{(t+1)}) - F_\lambda(\bar\theta) \le \frac{1}{2\widetilde{\rho}_-(s)}\big\|\nabla_{\mathcal{A}}\widetilde{\mathcal{L}}_\lambda(\theta^{(t+1)}) + \lambda\xi_{\mathcal{A}}^{(t+1)}\big\|_2^2 \overset{(i)}{=} \frac{1}{2\widetilde{\rho}_-(s)}\sum_{k=1}^s\big\|\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta^{(t+1)}) - \nabla V_{\lambda,k,L}(\theta_k^{(t+1)}; w^{(t+1,k-1)})\big\|_2^2 \overset{(ii)}{\le} \frac{\rho_+^2(s)}{2\widetilde{\rho}_-(s)}\sum_{k=1}^s\|\theta^{(t+1)} - w^{(t+1,k-1)}\|_2^2 \le \frac{s\rho_+^2(s)}{2\widetilde{\rho}_-(s)}\|\theta^{(t+1)} - \theta^{(t)}\|_2^2, \qquad (B.3)$$
where (i) comes from (B.2), and (ii) comes from $\nabla V_{\lambda,k,L}(\theta_k^{(t+1)}; w^{(t+1,k-1)}) = \nabla_k\widetilde{\mathcal{L}}_\lambda(w^{(t+1,k-1)})$ and the restricted strong smoothness of $\widetilde{\mathcal{L}}_\lambda(\theta)$. For the exact coordinate minimization, we have $\nabla V_{\lambda,k,L}(\theta_k^{(t+1)}; w^{(t+1,k-1)}) = \nabla Y_{\lambda,k}(\theta_k^{(t+1)}; w^{(t+1,k-1)})$. Therefore (B.3) also holds for the exact coordinate minimization.

B.3 Proof of Lemma 8.8
Proof. For the proximal coordinate gradient descent, we have
$$F_\lambda(\theta) = V_{\lambda,k,L}(\theta_k; \theta) + \lambda|\theta_k| + \lambda\|\theta_{\backslash k}\|_1, \qquad (B.4)$$
$$F_\lambda(w) \le V_{\lambda,k,L}(w_k; \theta) + \lambda|w_k| + \lambda\|\theta_{\backslash k}\|_1. \qquad (B.5)$$
Since $V_{\lambda,k,L}(\theta_k; \theta)$ is strongly convex in $\theta_k$, we have
$$V_{\lambda,k,L}(\theta_k; \theta) - V_{\lambda,k,L}(w_k; \theta) \ge (\theta_k - w_k)\nabla V_{\lambda,k,L}(w_k; \theta) + \frac{L}{2}(w_k - \theta_k)^2. \qquad (B.6)$$
By the convexity of the absolute value function, we have
$$|\theta_k| - |w_k| \ge (\theta_k - w_k)\xi_k, \qquad (B.7)$$
where $\xi_k \in \partial|w_k|$ satisfies the optimality condition of the proximal coordinate gradient descent,
$$\nabla V_{\lambda,k,L}(w_k; \theta) + \lambda\xi_k = 0. \qquad (B.8)$$
Subtracting (B.5) from (B.4), we have
$$F_\lambda(\theta) - F_\lambda(w) \ge V_{\lambda,k,L}(\theta_k; \theta) - V_{\lambda,k,L}(w_k; \theta) + \lambda|\theta_k| - \lambda|w_k| \overset{(i)}{\ge} (\theta_k - w_k)\big(\nabla V_{\lambda,k,L}(w_k; \theta) + \lambda\xi_k\big) + \frac{L}{2}(w_k - \theta_k)^2 \overset{(ii)}{=} \frac{L}{2}(w_k - \theta_k)^2,$$
where (i) comes from (B.6) and (B.7), and (ii) comes from (B.8). For the exact coordinate minimization, we only need to slightly trim the above analysis. More specifically, we replace $V_{\lambda,k,L}(w_k; \theta)$ with $Y_{\lambda,k}(w_k; \theta) = \widetilde{\mathcal{L}}_\lambda(w_k, \theta_{\backslash k})$. Since $\widetilde{\mathcal{L}}_\lambda(\theta)$ is restricted strongly convex, we have
$$Y_{\lambda,k}(\theta_k; \theta) - Y_{\lambda,k}(w_k; \theta) \ge (\theta_k - w_k)\nabla Y_{\lambda,k}(w_k; \theta) + \frac{\widetilde{\rho}_-(1)}{2}(w_k - \theta_k)^2.$$
Eventually, we can obtain
$$F_\lambda(\theta) - F_\lambda(w) \ge \frac{\widetilde{\rho}_-(1)}{2}(w_k - \theta_k)^2.$$
We then proceed to analyze the descent for the proximal coordinate gradient descent when $\theta_k = 0$ and $|\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta)| \ge (1+\delta)\lambda$. Then we have
$$|w_k| = \big|\mathcal{S}_{\lambda/L}\big(-\nabla_k\widetilde{\mathcal{L}}_\lambda(\theta)/L\big)\big| \ge \frac{\delta\lambda}{L},$$
where the last inequality comes from the definition of the soft thresholding function. Thus we obtain
$$F_\lambda(\theta) - F_\lambda(w) \ge \frac{L}{2}w_k^2 \ge \frac{\delta^2\lambda^2}{2L}.$$
For the exact coordinate minimization, we construct an auxiliary solution $w'$ by a proximal coordinate gradient descent iteration using $L = \rho_+(1)$. Since $w$ is obtained by the exact minimization, we have
$$F_\lambda(\theta) - F_\lambda(w) \ge F_\lambda(\theta) - F_\lambda(w') \ge \frac{\delta^2\lambda^2}{2\rho_+(1)}.$$
C Lemmas for Proving Theorem 4.9

C.1 Proof of Lemma 8.3
Proof. Before we proceed with the proof, we first introduce the following lemma.
Lemma C.1. Suppose that Assumptions 4.3 and 4.6 hold. If $\bar\theta^\lambda$ satisfies $\|\bar\theta^\lambda_S\|_0 \le \tilde{s}$ and $\mathcal{K}_\lambda(\bar\theta^\lambda) = 0$, then $\bar\theta^\lambda$ is the unique sparse local optimum to (1.1).
The proof of Lemma C.1 is presented in Appendix C.9. We then proceed with the proof. We consider a sequence of auxiliary solutions obtained by the proximal gradient algorithm. The details for generating such a sequence can be found in Wang et al. (2014). By Theorem 5.1 in Wang et al. (2014), we know that such a sequence of solutions converges to a sparse local optimum $\bar\theta^\lambda$. By Lemma C.1, we know that the sparse local optimum is unique.
C.2 Proof of Lemma 8.4
Proof. Before we proceed with the proof, we first introduce the following lemma. Lemma C.2. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For any λ ≥ λN , if θ satisfies 4λ2 s∗ kθS k0 ≤ s and Fλ (θ) ≤ Fλ (θ ∗ ) + , (C.1) ρe− (s∗ + s) where s ≤ 2e s + 1, then we have √ 9λ s∗ 25λs∗ ∗ kθ − θ k2 ≤ and kθ − θ ∗ k1 ≤ . ∗ ρe− (s + s) ρe− (s∗ + s)
The proof of Lemma C.2 is presented in Appendix C.3. Lemma C.2 characterizes the estimation errors of any sufficiently sparse solution with a sufficiently small objective value. When the inner loop terminates, we have the output solution as θb = θ (t+1) . Since both the exact and proximal coordinate gradient descent iterations always decrease the objective value, we have 4λ2 s∗ Fλ (θ (t+1) ) ≤ Fλ (θ ∗ ) + . (C.2) ρe− (s∗ + 2e s) By (B.3) in Appendix B.2, we have shown (t+1) 2 k ≤ (s∗ + 2e s)ρ2 (s∗ + 2e s)kθ (t+1) − θ (t) k2 . (C.3) k∇A Leλ (θ (t+1) ) + λξ A
2
+
2
Since Assumption 4.7 holds and ρe− (1) ≤ ν+ (1), we have kθ (t+1) − θ (t) k22 ≤ τ 2 λ2 ≤
δ 2 λ2 . (s∗ + 2e s)ρ2+ (s∗ + 2e s)
(C.4)
Combining (C.3) with (C.4), we have θ (t+1) satisfying the approximate KKT condition over the active set, (t+1) min k∇A Leλ (θ (t+1) ) + λξA k∞ ≤ k∇A Leλ (θ (t+1) ) + λξ k2 ≤ δλ. (t+1)
ξA ∈∂kθA
A
k1
We now proceed to characterize the sparsity of θb = θ (t+1) by exploiting the above approximate KKT condition. By Assumption 4.1, we have λ ≥ 4k∇Leλ (θ ∗ )k∞ , which implies j |∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A = 0. (C.5) 4
We then consider an arbitrary set S 0 such that b − ∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A . S 0 = j | |∇j Leλ (θ) 2 Let s0 = |S 0 |. Then there exists a v ∈ Rd such that kvk∞ = 1, kvk0 ≤ s0 ,
b − ∇Leλ (θ ∗ )). and s0 λ/2 ≥ v> (∇Leλ (θ) 4
(C.6)
By Cauchy-Schwarz inequality, (C.6) implies √ s0 λ b − ∇Leλ (θ ∗ )k2 ≤ s0 k∇Leλ (θ) b − ∇Leλ (θ ∗ )k2 ≤ kvk2 k∇Leλ (θ) 2 (ii) (i) √ √ s) s0 ≤ ρ+ (s∗ + 2e s) s0 kθb − θ ∗ k2 ≤ ρ+ (s∗ + 2e
√ 9λ s∗ , ρe− (s∗ + 2e s)
(C.7)
where (i) comes from the restricted strong smoothness of Leλ (θ), and (ii) comes from (C.2) and Lemma C.2. By simple manipulation, (C.7) can be rewritten as √ √ 18ρ+ (s∗ + 2e s) s∗ 0 s ≤ . (C.8) ρe− (s∗ + 2e s) Since S 0 is arbitrary defined, by simple manipulation, (C.8) implies b − ∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A ≤ 364κ2 s∗ . j | |∇j Leλ (θ) (C.9) 2 Combining (C.5) with (C.9), we have b ≥ 3λ , j ∈ S ∩ A ≤ j |∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A j | |∇j Leλ (θ)| 4 4 b − ∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A ≤ 364κ2 s∗ < se, + j | |∇j Leλ (θ) (C.10) 2 where the last inequality comes from Assumption 4.6. Since we require δ ≤ 1/8 in Assumption 4.7, (C.10) implies that for any u ∈ Rd satisfying kuk∞ ≤ 1, we have b + δλuj | ≥ 7λ , j ∈ S ∩ A ≤ se. j | |∇j Leλ (θ) 8 b + δλuj | ≤ 7λ/8, there exists a ξj such that Then for any j ∈ S ∩ A satisfying |∇j Leλ (θ) |ξj | ≤ 1
b + δλuj + λξj = 0, and ∇j Leλ (θ)
which further implies θbj = 0. Therefore we must have kθbS k0 ≤ se.
C.3
Proof of Lemma C.2
Proof. For notational simplicity, we define ∆ = θ − θ ∗ . We first rewrite (C.1) as 4λ2 s∗ λkθ ∗ k1 − λkθk1 + ≥ Leλ (θ) − Leλ (θ ∗ ). ρe− (s∗ + s) By the restricted strong convexity of Leλ (θ), we have
(i) ρe− (s∗ + s) ∗ ∗ > ∗ k∆k22 ≥ ∆> Leλ (θ) − Leλ (θ ∗ ) − S [∇S L(θ ) + ∇S Hλ (θ )] + ∆S ∇S L(θ ) 2
(C.11)
(ii)
≥ −k∆S k1 k∇L(θ ∗ )k∞ − k∆S k1 k∇L(θ ∗ )k∞ − k∆S k1 k∇S Hλ (θ ∗ )k∞ , (θ ∗ )
(C.12)
where (i) comes from ∇S Hλ = 0 by (R.3) of Assumption 4.3, and (ii) comes from H¨older’s inequality. By Assumption 4.1 and (R.2) of Assumption 4.3, we have λ k∇L(θ ∗ )k∞ ≤ and k∇S Hλ (θ ∗ )k∞ ≤ λ. (C.13) 4
5
Combining (C.12) with (C.13), we obtain 5λ λ ρe− (s∗ + s) Leλ (θ) − Leλ (θ ∗ ) ≥ − k∆S k1 − k∆S k1 + k∆k22 . 4 4 2 Plugging (C.14) and
(C.14)
kθ ∗ k1 − kθk1 = kθS∗ k1 − (kθS k1 + k∆S k1 ) ≤ k∆S k1 − k∆S k1
into (C.11), we obtain 9λ 4λ2 s∗ 3λ ρe− (s∗ + s) k∆S k1 + ≥ k∆ k + k∆k22 . (C.15) 1 S 4 ρe− (s∗ + s) 4 2 We consider the first case: ρe− (s∗ + s)k∆k1 > 16λs∗ . Then we have λ ρe− (s∗ + s) 5λ k∆S k1 ≥ k∆S k1 + k∆k22 . (C.16) 2 2 2 By simple manipulation, (C.16) implies 5λ √ ∗ ρe− (s∗ + s) 5λ 5λ √ ∗ s k∆S k2 ≤ s k∆k2 , (C.17) k∆k22 ≤ k∆S k1 ≤ 2 2 2 2 ∗ where the second inequality comes from the fact that ∆S only contains s entries. By simple manipulation, (C.17) further implies √ 5λ s∗ . (C.18) k∆k2 ≤ ρe− (s∗ + s) Meanwhile, (C.16) also implies k∆S k1 ≤ 5k∆S k1 .
Combining (C.18) with (C.19), we obtain √ √ k∆k1 ≤ 5k∆S k1 ≤ 5 s∗ k∆S k2 ≤ 5 s∗ k∆k2 ≤
25λs∗ . ρe− (s∗ + s) We then consider the second case: ρe− (s∗ + s)k∆k1 ≤ 16λs∗ . Then (C.15) implies √ 9λ s∗ . k∆k2 ≤ ρe− (s∗ + s) Combining two cases, we obtain √ 9λ s∗ 25λs∗ k∆k2 ≤ and k∆k1 ≤ . ∗ ρe− (s + s) ρe− (s∗ + s)
C.4
(C.19)
(C.20)
Proof of Lemma 8.5
Proof. By Assumption 4.1, we have λ ≥ 4k∇Leλ (θ ∗ )k∞ , which implies j |∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A = 0. 4
We then consider an arbitrary set S 0 such that λ 0 [0] ∗ e e S = j | |∇j Lλ (θ ) − ∇j Lλ (θ )| ≥ , j ∈ S . 2
6
(C.21)
Let s0 = |S 0 |. Then there exists a v ∈ Rd such that kvk∞ = 1, kvk0 ≤ s0 ,
and s0 λ/2 ≥ v> (∇Leλ (θ [0] ) − ∇Leλ (θ ∗ )).
By Cauchy-Schwarz inequality, (C.22) implies √ s0 λ ≤ kvk2 k∇Leλ (θ [0] ) − ∇Leλ (θ ∗ )k2 ≤ s0 k∇Leλ (θ [0] ) − ∇Leλ (θ ∗ )k2 2 (i) (ii) √ √ ≤ ρ+ (s∗ + 2e s) s0 kθ [0] − θ ∗ k2 ≤ ρ+ (s∗ + 2e s) s0
√ 9λ s∗ , ρe− (s∗ + 2e s)
(C.22)
(C.23)
where (i) comes from the restricted strong smoothness of Leλ (θ), and (ii) comes from Lemma C.2. By simple manipulation, (C.23) can be rewritten as √ √ 18ρ+ (s∗ + 2e s) s∗ 0 . (C.24) s ≤ ρe− (s∗ + 2e s) Since S 0 is arbitrary defined, by simple manipulation, (C.8) implies j | |∇j Leλ (θ [0] ) − ∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A ≤ 364κ2 s∗ . (C.25) 2 Combining (C.21) with (C.25), we have j | |∇j Leλ (θ [0] )| ≥ 3λ , j ∈ S ∩ A ≤ j |∇j Leλ (θ ∗ )| ≥ λ , j ∈ S ∩ A 4 4 λ (C.26) + j | |∇j Leλ (θ [0] ) − ∇j Leλ (θ ∗ )| ≥ , j ∈ S ∩ A ≤ 364κ2 s∗ < se, 2 where the last inequality comes from Assumption 4.6. Since Assumption 4.7 requires ϕ ≤ 1/8, we have (1 − ϕ)λ > 34 λ. Thus (C.26) implies that the strong rule selects at most se irrelevant coordinates.
C.5
Proof of Lemma 8.6
Proof. Before we proceed with the proof, we first introduce the following lemmas. Lemma C.3. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For any λ ≥ λN , if θ satisfies 4λ2 s∗ kθS k0 ≤ se and Fλ (θ) ≤ Fλ (θ ∗ ) + , (C.27) ρe− (s∗ + se) then we have k[Tλ,L (θ)]S k0 ≤ se.
The proof of Lemma C.3 is presented in Appendix C.6. Since θ [m+0.5] satisfies (C.27) for all [m+0.5] k0 ≤ se for all m = 0, 1, 2, .... m = 0, 1, 2, ..., by Lemma C.3, we have kwS Lemma C.4. Suppose that Assumptions 4.1, 4.3, 4.6, and 4.7 hold. For every active set updating iteration, if we select a coordinate as km = argmax |∇k Leλ (θ [m+0.5] )|, k∈Am
then we have
km = argmin Qλ,k,L (Tλ,k,L (θ [m+0.5] ); θ [m+0.5] ). k
7
The proof of Lemma C.4 is presented in Appendix C.7. Lemma C.4 guarantees that our selected coordinate km leads to a sufficient descent in the objective value. Therefore we have [m+1]
Fλ (θ [m+0.5] ) − Fλ (θ [m+1] ) ≥ Fλ (θ [m+0.5] ) − Qλ,km ,L (θkm ; θ [m+0.5] ) 1 X [m+0.5] [m+0.5] ≥ Fλ (θ [m+0.5] ) − Qλ,k,L (wk ;θ ), (C.28) |Bm | k∈Bm [m+0.5] [m+1] where Bm = k | wk 6= 0 and |Bm | ≤ s∗ + 2e s. By rearranging (C.28), we 6= 0 or θk obtain h i 1 Fλ (θ [m+0.5] ) − Fλ (θ [m+1] ) ≥ ∗ Fλ (θ [m+0.5] ) − Jλ,L (w[m+1] ; θ [m+0.5] ) . s + 2e s
C.6
Proof of Lemma C.3
Proof. We define an auxiliary solution 1 1 1 θe = θ − ∇Leλ (θ) = θ − ∇Leλ (θ ∗ ) + (∇Leλ (θ) − ∇Leλ (θ ∗ )). L L L For notational simplicity, we denote ∆ = θ − θ ∗ . We first consider j ∈ S : |θj | ≥ λ ≤ j ∈ S : |∆j | ≥ λ 4L 4L ≤
4L 4L 100Ls∗ k∆S k1 ≤ k∆k1 ≤ , λ λ ρe− (s∗ + se)
(C.29)
where the last inequality comes from Lemma C.2. By Assumption 4.1, we have k∇Leλ (θ ∗ )k∞,2 ≤ λ/4, which implies j ∈ S : |∇j Leλ (θ ∗ )| ≥ λ = 0. (C.30) 4 Recall that in Appendix C.2, we have shown that j | |∇j Leλ (θ)| ≥ λ , j ∈ S ∩ A ≤ 364κ2 s∗ . 2
(C.31)
Combining (C.29) and (C.30) with (C.31), we have λ λ λ ∗ j ∈ S : |θej | ≥ ≤ j ∈ S : |θj | ≥ + j ∈ S : |∇j Leλ (θ )| ≥ L 4L 4 λ 100Ls∗ 2 e + j | |∇j Lλ (θ)| ≥ , j ∈ S ∩ A ≤ 364κ + s∗ ≤ se, (C.32) 2 ρe− (s∗ + se) where the last inequality comes from L ≤ ρ+ (s∗ + 2e s) and Assumption 4.6. By definition of the soft thresholding operator, we have [Tλ,L (θ)]j = Sλ/L (θej ). Therefore (C.32) further implies k[Tλ,L (θ)]S k0 ≤ se.
C.7
Proof of Lemma C.4
Proof. Suppose that there exists a coordinate k such that [m+0.5] θ = 0 and |∇k Leλ (θ [m+0.5] )| ≥ (1 + δ)λ. k
8
(C.33)
We conduct a proximal coordinate gradient descent iteration over the coordinate k, and obtain an [m+1] [m+1] auxiliary solution wk . Since wk is obtained by the proximal coordinate gradient descent over the coordinate k, we have [m+1]
wk
= argmin Qλ,k,L (wk ; θ [m+0.5] ).
(C.34)
wk
[m+1]
We then proceed to derive an upper bound for Qλ,k,L (wk
; θ [m+0.5] ). We consider
[m+1]
[m+1] [m+0.5] ; θ [m+0.5] ) = Leλ (θ [m+0.5] ) + (wk − θk )∇k Leλ (θ [m+0.5] ) L [m+1] [m+0.5] 2 [m+1] [m+0.5] − θk ) + λ|wk | + λkθ\k k1 . + (wk 2 By the convexity of the absolute value function, we have
Qλ,k,L (wk
[m+0.5]
|θk
[m+1]
[m+1]
| ≥ |wk
[m+0.5]
| + (θk
[m+1]
− wk
)ξk ,
(C.35) (C.36)
where ξk ∈ ∂|wk
| satisfies the optimality condition of (C.34), i.e., λ 1 [m+1] [m+1] [m+0.5] |. wk − θk + ∇k Leλ (θ [m+0.5] ) + ξk = 0 for some ξk ∈ ∂|wk L L Combining (C.36) with (C.35), we have [m+1]
Qλ,k,L (wk
; θ [m+0.5] ) − Fλ (θ [m+0.5] )
[m+1]
≤ (wk
[m+0.5]
− θk
)(∇k Leλ (θ [m+0.5] ) + λξk ) +
L [m+1] [m+0.5] 2 (w − θk ) 2 k
(ii) δ 2 λ2 L [m+1] [m+0.5] 2 = − (wk − θk ) ≤− , 2 2L where (i) comes from (C.37) and (ii) comes from Lemma 8.8 and (C.33). [m+0.5] We assume that there exists another coordinate j with θj = 0 such that (i)
(C.37)
|∇k Leλ (θ [m+0.5] )| > |∇j Leλ (θ [m+0.5] )|.
(C.38)
(C.39)
Similarly, we conduct a proximal coordinate gradient descent iteration over the coordinate j, and [m+1] obtain an auxiliary solution wj . By definition of the soft thresholding function, we can rewrite [m+1]
wk
[m+1]
and wj
as
zk ∇k Leλ (θ [m+0.5] ) L where zk and zj are defined as λ zk = 1 − |∇k Leλ (θ [m+0.5] )| [m+1]
wk
=−
[m+1]
and wj
=−
and zj = 1 −
zj ∇j Leλ (θ [m+0.5] ), L λ
|∇j Leλ (θ [m+0.5] )|
.
By (C.39), we know zk ≥ zj . Moreover, we define |∇j Leλ (θ [m+0.5] )| z [m+1] z= · zj and w ek = − ∇k Leλ (θ [m+0.5] ). [m+0.5] e L |∇k Lλ (θ )|
9
(C.40)
[m+1]
Note that we have |w ek
[m+1]
| = |wj
|. We then consider
[m+1] [m+0.5] Qλ,k,L (w ek ;θ ) − Leλ (θ [m+0.5] ) z L [m+1] 2 [m+1] [m+0.5] = − |∇k Leλ (θ [m+0.5] )|2 + |w e | + λ|w ek | + λkθ\k k1 L 2 k L [m+1] 2 (i) zj [m+1] [m+0.5] = − |∇k Leλ (θ [m+0.5] )| · |∇j Leλ (θ [m+0.5] )| + |w e | + λ|w ek | + λkθ\k k1 L 2 k (ii) zj L [m+1] 2 [m+1] [m+0.5] < − |∇j Leλ (θ [m+0.5] )|2 + |wj | + λ|wj | + λkθ\k k1 L 2 [m+1] [m+0.5] = Qλ,k,L (wj ;θ ) − Leλ (θ [m+0.5] ),
where (i) comes from (C.40) and (ii) comes from (C.33). We then have [m+1]
Qλ,k,L (wk
[m+1]
; θ [m+0.5] ) ≤ Qλ,k,L (w ek
[m+1]
; θ [m+0.5] ) ≤ Qλ,j,L (wj
; θ [m+0.5] ),
(C.41)
where the least inequality comes from (C.34). Thus (C.41) guarantees [m+0.5]
Qλ,km ,L (wkm
[m+1]
; θ [m+0.5] ) = min Qλ,j,L (wj
; θ [m+0.5] ),
where km = argmaxk∈Am |∇Lek (θ)[m+0.5] |. For any j ∈ Am , we construct two auxiliary solutions w[m+1] and v[m+1] by [m+1]
wj
(C.42)
j∈Am
[m+1]
= argmin Qλ,j,L (vj ; θ [m+0.5] ) and vj vj
[m+0.5]
= argmin Fλ (vj , θ\j
).
(C.43)
vj
Recall that θ [m+0.5] is the output solution of the previous inner loop, i.e, θ [m+0.5] = θ (t+1) . By the restricted strong convexity of Fλ (θ), we have (∇j Leλ (θ (t+1) ) + λξj )2 k∇A Leλ (θ (t+1) ) + λξA k22 [m+1] (t+1) Fλ (θ (t+1) ) − Fλ (vj , θ\j ) ≤ ≤ , 2e ρ− (1) 2e ρ− (1) (t+1)
for some ξA ∈ ∂kθA
k1 . Since the inner loop terminates when kθ (t+1) − θ (t) k22 ≤ τ 2 λ2 , we have
(s∗ + 2e s)ρ2+ (s∗ + 2e s)kθ (t+1) − θ (t) k22 δ 2 λ2 ≤ , 2e ρ− (1) 2L where the equality comes from Assumption 4.7. Therefore (C.44) further implies δ 2 λ2 [m+1] [m+0.5] [m+1] (t+1) Qλ,j,L (wj ;θ ) − Fλ (θ [m+0.5] ) ≥ Fλ (θ (t+1) ) − Fλ (vj , θ\j ) ≥ − . 2L Since j is arbitrarily selected from Am , combining (C.38) with (C.45), we have [m+1]
Fλ (θ (t+1) ) − Fλ (vj
(t+1)
, θ\j
)≤
[m+0.5]
Qλ,km ,L (wkm
[m+1]
; θ [m+0.5] ) ≤ min Qλ,j,L (wj j∈Am
; θ [m+0.5] ).
Combining (C.42) with (C.46), we have [m+0.5]
Qλ,km ,L (wkm
[m+1]
; θ [m+0.5] ) = min Qλ,j,L (wj j
which completes the proof.
10
; θ [m+0.5] ),
(C.44)
(C.45) (C.46)
C.8
Proof of Lemma 8.7
Proof. Define Dm = {w | w ∈ Rd , wBm = 0}, we have
Jλ,L (w[m+1] ; θ [m+0.5] ) = min Jλ,L (w; θ [m+0.5] ) w∈Dm
L = min Leλ (θ [m+0.5] ) + (w − θ [m+0.5] )> ∇Leλ (θ [m+0.5] ) + λkwk1 + kw − θ [m+0.5] k22 w∈Dm 2 ∗ (L − ρ− (s + 2e s)) ≤ min Fλ (w) + kw − θ [m+0.5] k22 , w∈Dm 2 where the last inequality coms from the restricted strong convexity of Leλ (θ), i.e., ρ− (s∗ + 2e s) Leλ (w) ≤ Leλ (θ [m+0.5] ) + (w − θ [m+0.5] )> ∇Leλ (θ [m+0.5] ) + kw − θ [m+0.5] k22 . 2 Let w = z θ¯λ + (1 − z)θ [m+0.5] ) for z ∈ [0, 1]. Then we have z 2 (L − ρ− (s∗ + 2e s)) ¯λ Jλ,L (w[m+1] ; θ [m+0.5] ) ≤ min Fλ (z θ¯λ + (1 − z)θ [m+0.5] ) + kθ − θ [m+0.5] k22 2 z∈[0,1] ≤ Fλ (θ [m+0.5] ) + min z[Fλ (θ¯λ ) − Fλ (θ [m+0.5] )] z∈[0,1]
(z 2 L − zρ− (s∗ + 2e s)) ¯λ kθ − θ [m+0.5] k22 , (C.47) 2 where the last inequality comes from the restricted strong convexity of Fλ (θ), i.e., z(1 − z)ρ− (s∗ + 2e s) ¯λ Fλ (z θ¯λ + (1 − z)θ [m+0.5] ) + kθ − θ [m+0.5] k22 ≤ zFλ (θ¯λ ) + (1 − z)Fλ (θ [m+0.5] ). 2 By the restricted strong convexity of Fλ (θ), we have 2[Fλ (θ [m+0.5] ) − Fλ (θ¯λ )] kθ¯λ − θ [m+0.5] k22 ≤ . (C.48) ρ− (s∗ + 2e s) Combining (C.48) with (C.47), we obtain z2L [m+1] (t) [m+0.5] Jλ,L (w ; θ ) − Fλ (θ ) ≤ min − 2z [Fλ (θ [m+0.5] ) − Fλ (θ¯λ )]. (C.49) s) z∈[0,1] ρ− (s∗ + 2e By setting z = ρe− (s∗ + 2e s)/L, we minimize the R.H.S of (C.49) and obtain ρe− (s∗ + 2e s) Fλ (θ [m+0.5] ) − Jλ,L (w[m+1] ; θ (t) ) ≥ [Fλ (θ [m+0.5] ) − Fλ (θ¯λ )]. L +
C.9
Proof of Lemma C.1
Proof. We prove the uniqueness of θ¯λ by contradiction. Assume that there exist two different local optima θ¯λ and θeλ . Let ξ¯ ∈ ∂kθ¯λ k1 and ξe ∈ ∂kθeλ k1 be two subgradient vectors satisfying ∇Leλ (θ¯λ ) + λξ¯ = 0 and ∇Leλ (θeλ ) + λξe = 0. (C.50)
Since kθ¯Sλ k0 ≤ se and kθeSλ k0 ≤ se, by the restricted strong convexity of Fλ (θ), we have ρe− (s∗ + 2e s) ¯λ eλ 2 kθ − θ k2 , 2 ∗ s) ¯λ eλ 2 ¯ + ρe− (s + 2e Fλ (θeλ ) ≥ Fλ (θ¯λ ) + (θeλ − θ¯λ )> (∇L¯λ (θeλ ) + λξ) kθ − θ k2 . 2 e + Fλ (θ¯λ ) ≥ Fλ (θeλ ) + (θ¯λ − θeλ )> (∇Leλ (θeλ ) + λξ) 11
(C.51) (C.52)
Combining (C.50) and (C.51) with (C.52), we have kθ¯λ − θeλ k22 = 0 implying θ¯λ = θeλ . That is contradicted by the assumption. Thus the local optimum θ¯λ is unique.
C.10
Proof of Theorem 4.9 (cont’d)
Proof. We can prove similar results for the randomized selection rule by slightly trimming the proof [m] [m] of the greedy selection rule. We first follow similar lines to show kθS k0 ≤ se and kθS k0 ≤ se + 1 for all m = 0, 1, 2, ..., since the randomized selection also moves only one inactive coordinate into the active set in each iteration. For the proximal coordinate gradient descent, we construct the same auxiliary solution w[m+1] = [m+0.5] [m+0.5] > (w1 , ..., wd ) , where [m+1]
wk [m+1] wk
= argmin Qλ,k,L (θk ; θ [m+0.5] ). θk
[m+0.5] θk }.
We define B = {j | 6= Similarly, Lemma 8.4 guarantees |B| ≤ s∗ + 2e s. We then divide all coordinates into three subsets M1 = B ∩ {j | j ∈ Am , |∇j Leλ (θ [m+0.5] )| ≥ (1 + δ)λ}, M2 = B ∩ {j | j ∈ Am , |∇j Leλ (θ [m+0.5] )| ≤ (1 + δ)λ}, M3 = B ∩ {j | j ∈ Am }.
Following similar lines to the proof of Lemma C.4, we have [m+1]
max Qλ,k,L (wk
k∈M1
; θ [m+0.5] ) ≤
min
j∈M2 ∪M3
[m+1]
Qλ,j,L (wj
; θ [m+0.5] ),
which implies that X 1 1 X [m+1] [m+0.5] [m+1] [m+0.5] Qλ,j,L (wj ;θ )≥ Qλ,k,L (wk ;θ ). |B| |M1 | k∈|M1 |
j∈|B|
Then conditioning on
(C.53)
θ [m+0.5] ,
we have X 1 [m+1] [m+0.5] EFλ (θ [m+1] )|θ [m+0.5] = Fλ (wk , θ\k ) |M1 | k∈M1 X 1 X 1 [m+1] [m+0.5] [m+1] [m+0.5] ≤ Qλ,k,L (wk ;θ )≤ Qλ,j,L (wj ;θ ), |M1 | |B| k∈|M1 |
(C.54)
j∈|B|
where the last inequality comes from (C.53). By rearranging (C.54), we have 1 Fλ (θ [m+0.5] ) − E[Fλ (θ [m+1] )|θ [m+0.5] ] ≥ ∗ [Jλ,L (w[m+1] ; θ [m+0.5] ) − Fλ (θ [m+0.5] )]. s + 2e s We then follow similar lines to show ρe− (s∗ + 2e s) [m+1] λ ¯ EFλ (θ ) − F λ (θ ) ≤ 1 − ∗ [Fλ (θ [m] ) − Fλ (θ¯λ )]. (C.55) (s + 2e s)L
For the exact coordinate minimization, we construct a similar auxiliary solution θ [m+0.75] obtain by
12
the proximal coordinate gradient descent using L = ρ+ (1), then follow similar lines to show h i EFλ (θ [m+1] ) − Fλ (θ¯λ ) ≤ EFλ (θ [m+0.75] ) − Fλ (θ¯λ ) h i ρe− (s∗ + 2e s) ≤ 1− ∗ Fλ (θ [m] ) − Fλ (θ¯λ ) . (C.56) (s + 2e s)ρ+ (1) We then combine (C.55) with (C.56), and take the expectation over m = 0, 1, 2, ..., and then obtain i ρe− (s∗ + 2e s) [m] h [m+1] λ ¯ Fλ (θ [m] ) − Fλ (θ¯λ ) (C.57) EFλ (θ ) − Fλ (θ ) ≤ 1 − ∗ (s + 2e s)ρ+ (1) The iteration complexity can also be derived in a similar fashion. When we have δ 2 λ2 Fλ (θ [m] ) − Fλ (θ¯λ ) ≤ , (C.58) 3ν+ (1) we must have |∇k Leλ (θ [m+0.5] )| ≤ (1 + δ)λ. That implies that the algorithm must terminate when m
(C.58) holds. Applying the Markov inequality to (C.57), we have 3ν+ (1) δ 2 λ2 [m] λ ¯ ≤ 2 2 [EFλ (θ [m] ) − Fλ (θ¯λ )] P Fλ (θ ) − Fλ (θ ) ≥ 3ν+ (1) δ λ i ∗ ρe− (s + 2e s) m h 3ν+ (1) Fλ (θ (0) ) − Fλ (θ¯λ ) ≤ ϑ. 1− ∗ ≤ 2 2 δ λ (s + 2e s)ν+ (1) By simple manipulation, we need ! ρe− (s∗ + 2e s) ϑδ 2 λ2 −1 m ≥ log 1− ∗ log (s + 2e s)ν+ (1) 3ν+ (1) Fλ (θ (0) ) − Fλ (θ¯λ ) iterations such that
C.11
δ 2 λ2 [m] λ ¯ P Fλ (θ ) − Fλ (θ ) ≤ ≥ 1 − ϑ. 3ν+ (1)
Proof of Lemma 8.9
Proof. We assume that the truncated cyclic selection adds exactly se + 1 inactive coordinates into the active set, and obtain a new active set A[m+1] . We define an auxiliary set B = A[m+1] ∪ S. Since the objective value always decreases in each middle loop, we have 4λ2 s∗ Fλ (θ [m+1] ) ≤ Fλ (θ ∗ ) + . (C.59) ρe− (s∗ + se)
We then define w[m+1] as a local optimum to the following optimization problem, min Fλ (θ)
θ∈Rd
subject to θB = 0.
(C.60)
By Assumption 4.6, we know that (C.60) is a strongly convex optimization problem. Therefore w[m+1] is a unique global optimum. Moreover, since |B ∩ S| ≤ 2e s + 1, by Lemma C.2, we have ∗ 25λs kw[m+1] − θ ∗ k1 ≤ . ρe− (s∗ + 2e s + 1)
13
By the restricted strong convexity of Fλ (θ), for any ξ ∈ ∂kθ ∗ k1 such that Fλ (θ ∗ ) − Fλ (w[m+1] ) ≤ −(w[m+1] − θ ∗ )> (∇Leλ (θ ∗ ) + λξ) (i)
≤ kw[m+1] − θ ∗ k1 · k∇Leλ (θ ∗ ) + λξk∞ (ii) λ 25λs∗ 125λ2 s∗ ≤ + λ ≤ , (C.61) ρe− (s∗ + 2e s + 1) 4 4e ρ− (s∗ + 2e s + 1) where (i) comes from H¨older’s inequality, and (ii) comes from Assumption 4.1 and the fact kξk∞ ≤ 1. Since w[m+1] is the global optimum to (C.60), (C.61) further implies 125λ2 s∗ 125λ2 s∗ [m+1] Fλ (θ ∗ ) ≤ Fλ (w[m+1] ) + ≤ F (θ ) + . (C.62) λ 4e ρ− (s∗ + 2e s + 1) 4e ρ− (s∗ + 2e s + 1) Combining (C.62) with (C.59), we have 36λ2 s∗ Fλ (θ [m] ) ≤ Fλ (θ [m+1] ) + . ρe− (s∗ + 2e s + 1)
C.12
Proof of Lemma 8.10
Proof. Since the truncated cyclic selection only selects an inactive coordinate when its corresponding coordinate gradient is sufficiently large in magnitude, by Lemma 8.8, we know that adding se + 1 inactive coordinates leads to (e s + 1)δ 2 λ2 . Fλ (θ [m+1] ) ≤ Fλ (θ [m+0.5] ) ≤ Fλ (θ [m] ) − 2ν+ (1)
D Lemmas for Proving Theorem 4.12

D.1 Proof of Theorem 4.10
Proof. For notational simplicity, we define ∆ = θ − θ ∗ . Let ξe ∈ ∂kθk1 be a subgradient vector satisfying e ∞. KλK−1 (θ) = k∇LeλK−1 (θ) + λK−1 ξk
We then consider the following decomposition e ∞ ≤ k∇Leλ Kλ (θ) ≤ k∇Leλ (θ) + λK ξk K
K
K−1
(i)
e ∞ + kλK ξe − λK−1 ξk e∞ (θ) + λK−1 ξk (ii)
λK , (D.1) 4 ≤ 1/8 and 1 − η ≤ 1/24 in
+ k∇HλK (θ) − ∇HλK−1 (θ)k∞ ≤ δK−1 λK−1 + 3(1 − η)λK−1 ≤
where (i) comes from (R.4) in Assumption 4.3, and (ii) comes from δK−1 Assumption 4.1. We then proceed to characterize the statistical error of θ in terms of λK . For notational simplicity, we omit the index K and denote λK by λ. Since (D.1) implies that θ satisfies the approximate
14
KKT condition for λ, then by the restricted strong convexity of Leλ (θ), we have ρe− (s∗ + se) e Fλ (θ ∗ ) − k∆k22 ≥ Fλ (θ) − ∆> (∇Leλ (θ) + λξ) 2 (i) (ii) e ∞ · k∆k1 ≥ Fλ (θ) − λ k∆k1 , (D.2) ≥ Fλ (θ) − k∇Leλ (θ) + λξk 4 where (i) comes from H¨ older’s inequality and (ii) comes from (D.1). We then rewrite (D.2) as λ ρe− (s∗ + se) λkθ ∗ k1 − λkθk1 + k∆k1 ≥ Leλ (θ) − Leλ (θ ∗ ) + k∆k22 . (D.3) 4 2 By the restricted strong convexity of Leλ (θ) again, we have ρe− (s∗ + se) Leλ (θ) − Leλ (θ ∗ ) − k∆k22 ≥ ∆> ∇Leλ (θ ∗ ) 2 (i)
∗ > ∗ ∗ > = ∆> S ∇S Lλ (θ ) + ∆S ∇S Lλ (θ ) + ∆S ∇S Hλ (θ )
(ii)
≥ −k∆S k1 k∇L(θ ∗ )k∞ − k∆S k1 k∇L(θ ∗ )k∞ − k∆S k1 k∇S Hλ (θ ∗ )k∞ ,
(D.4)
(θ ∗ )
where (i) comes from ∇S Hλ = 0 by (R.3) of Assumption 4.3, and (ii) comes from H¨older’s inequality. By Assumption 4.1 and (R.2) of Assumption 4.3, we have k∇L(θ ∗ )k∞ ≤ λ/4
and k∇S Hλ (θ ∗ )k∞ ≤ λ.
Combining (D.4) with (D.5), we obtain 3 λ Leλ (θ) − Leλ (θ ∗ ) ≥ − λk∆S k1 − k∆S k1 + ρe− (s∗ + se)k∆k22 . 2 2 Plugging (D.6) and
(D.5)
(D.6)
kθ ∗ k1 − kθk1 = kθS∗ k1 − (kθS k1 + k∆S k1 ) ≤ k∆S k1 − k∆S k1
into (D.3), we obtain
11λ λ (D.7) k∆S k1 ≥ k∆S k1 + ρe− (s∗ + se)k∆k22 . 4 4 By simple manipulation, (D.7) implies 11λ √ ∗ 11λ √ ∗ 11λ ρe− (s∗ + se)k∆k22 ≤ k∆S k1 ≤ s k∆S k2 ≤ s k∆k2 , (D.8) 4 4 4 where the second inequality comes from the fact that ∆S only contains s∗ rows. By simple manipulation again, (D.8) implies √ 11λ s∗ k∆k2 ≤ . (D.9) 4e ρ− (s∗ + se) Meanwhile, (D.7) also implies k∆S k1 ≤ 11k∆S k1 .
Combining (D.9) with (D.10), we obtain √ √ k∆k1 ≤ 11k∆S k1 ≤ 11 s∗ k∆S k2 ≤ 11 s∗ k∆k2 ≤ Plugging (D.11) and (D.9) into (D.2), we have Fλ (θ) − Fλ (θ ∗ ) ≤ δλk∆k1 ≤
15
(D.10)
31λs∗ . ρe− (s∗ + se)
4λ2 s∗ . ρe− (s∗ + se)
(D.11)
D.2
Proof of Lemma 5.3
Proof. For notational simplicity, we denote $\theta^{\rm relax}$ by $\theta$ and write $\widetilde{\mathcal{F}}_\lambda(\theta) = \mathcal{L}(\theta) + \lambda\|\theta\|_1$. Let $\widetilde{\xi} \in \partial\|\theta\|_1$ be a subgradient vector satisfying
$$\|\nabla\mathcal{L}(\theta) + \lambda\widetilde{\xi}\|_\infty = \min_{\xi\in\partial\|\theta\|_1}\|\nabla\mathcal{L}(\theta) + \lambda\xi\|_\infty.$$
For notational simplicity, we define $\Delta = \theta - \theta^*$. Since $\widetilde{\mathcal{F}}_\lambda(\theta)$ is a convex function, we have
$$\widetilde{\mathcal{F}}_\lambda(\theta^*) \ge \widetilde{\mathcal{F}}_\lambda(\theta) - \Delta^\top(\nabla\mathcal{L}(\theta) + \lambda\widetilde{\xi}) \ge \widetilde{\mathcal{F}}_\lambda(\theta) - \|\Delta\|_1\|\nabla\mathcal{L}(\theta) + \lambda\widetilde{\xi}\|_\infty \ge \widetilde{\mathcal{F}}_\lambda(\theta) - \frac{\lambda}{8}\|\Delta\|_1, \quad (D.12)$$
where the second inequality comes from Hölder's inequality, and the last inequality comes from (5.7). To establish the statistical properties of $\theta$, we need to verify that $\theta$ satisfies $\|\theta - \theta^*\|_2 \le R$ such that the restricted strong convexity holds for $\theta$. We prove it by contradiction. We first assume $\|\theta - \theta^*\|_2 > R$. Then there exists some $z\in(0,1)$ such that
$$\widetilde{\theta} = (1-z)\theta + z\theta^* \quad\textrm{and}\quad \|\widetilde{\theta} - \theta^*\|_2 = R. \quad (D.13)$$
Then by the convexity of $\widetilde{\mathcal{F}}_\lambda(\theta)$ again, (D.12) and (D.13) imply
$$\widetilde{\mathcal{F}}_\lambda(\widetilde{\theta}) \le (1-z)\widetilde{\mathcal{F}}_\lambda(\theta) + z\widetilde{\mathcal{F}}_\lambda(\theta^*) \le (1-z)\widetilde{\mathcal{F}}_\lambda(\theta^*) + \frac{(1-z)\lambda}{8}\|\Delta\|_1 + z\widetilde{\mathcal{F}}_\lambda(\theta^*) \le \widetilde{\mathcal{F}}_\lambda(\theta^*) + \frac{\lambda}{8}\|\widetilde{\Delta}\|_1, \quad (D.14)$$
where $\widetilde{\Delta} = \widetilde{\theta} - \theta^*$ and the last inequality comes from the fact
$$\|\widetilde{\Delta}\|_1 = \|\widetilde{\theta} - \theta^*\|_1 = \|(1-z)\theta + z\theta^* - \theta^*\|_1 = (1-z)\|\Delta\|_1.$$
By simple manipulation, we can rewrite (D.14) as
$$\mathcal{L}(\widetilde{\theta}) - \mathcal{L}(\theta^*) \le \lambda\|\theta^*\|_1 - \lambda\|\widetilde{\theta}\|_1 + \frac{\lambda}{8}\|\widetilde{\Delta}\|_1. \quad (D.15)$$
By the convexity of $\mathcal{L}(\theta)$, we have
$$\mathcal{L}(\widetilde{\theta}) - \mathcal{L}(\theta^*) \ge \widetilde{\Delta}^\top\nabla\mathcal{L}(\theta^*) \ge -\|\widetilde{\Delta}\|_1\|\nabla\mathcal{L}(\theta^*)\|_\infty \ge -\frac{\lambda}{8}\|\widetilde{\Delta}_S\|_1 - \frac{\lambda}{8}\|\widetilde{\Delta}_{\overline{S}}\|_1, \quad (D.16)$$
where the last inequality comes from our assumption $\lambda \ge 8\|\nabla\mathcal{L}(\theta^*)\|_\infty$. By the decomposability of the $\ell_1$ norm, we have
$$\|\theta^*\|_1 - \|\widetilde{\theta}\|_1 + \frac{1}{8}\|\widetilde{\Delta}\|_1 = \|\theta_S^*\|_1 - (\|\widetilde{\theta}_S\|_1 + \|\widetilde{\Delta}_{\overline{S}}\|_1) + \frac{1}{8}\|\widetilde{\Delta}_S\|_1 + \frac{1}{8}\|\widetilde{\Delta}_{\overline{S}}\|_1 \le \frac{9}{8}\|\widetilde{\Delta}_S\|_1 - (1-\delta)\|\widetilde{\Delta}_{\overline{S}}\|_1 \le \frac{9}{8}\|\widetilde{\Delta}_S\|_1 - \frac{7}{8}\|\widetilde{\Delta}_{\overline{S}}\|_1. \quad (D.17)$$
Combining (D.15) with (D.16) and (D.17), we obtain
$$\|\widetilde{\Delta}_{\overline{S}}\|_1 \le \frac{5}{3}\|\widetilde{\Delta}_S\|_1. \quad (D.18)$$
To establish the statistical properties of $\widetilde{\theta}$, we define the following sets:
$$S_0 = \Big\{ j \;\Big|\; j\in\overline{S},\; \textstyle\sum_{k\in\overline{S}} \mathbb{1}(|\widetilde{\theta}_k| \ge |\widetilde{\theta}_j|) \le \widetilde{s} \Big\},$$
$$S_1 = \Big\{ j \;\Big|\; j\in\overline{S}\setminus S_0,\; \textstyle\sum_{k\in\overline{S}\setminus S_0} \mathbb{1}(|\widetilde{\theta}_k| \ge |\widetilde{\theta}_j|) \le \widetilde{s} \Big\},$$
$$S_2 = \Big\{ j \;\Big|\; j\in\overline{S}\setminus(S_0\cup S_1),\; \textstyle\sum_{k\in\overline{S}\setminus(S_0\cup S_1)} \mathbb{1}(|\widetilde{\theta}_k| \ge |\widetilde{\theta}_j|) \le \widetilde{s} \Big\},$$
$$S_3 = \Big\{ j \;\Big|\; j\in\overline{S}\setminus(S_0\cup S_1\cup S_2),\; \textstyle\sum_{k\in\overline{S}\setminus(S_0\cup S_1\cup S_2)} \mathbb{1}(|\widetilde{\theta}_k| \ge |\widetilde{\theta}_j|) \le \widetilde{s} \Big\}, \;\ldots$$
Before we proceed with the proof, we introduce the following lemma.

Lemma D.1 (Lemma 6.9 in Bühlmann and van de Geer (2011)). Let $b_1 \ge b_2 \ge \ldots \ge 0$. For $s \in \{1,2,\ldots\}$, we have
$$\sqrt{\sum_{j\ge s+1} b_j^2} \le \sum_{k=1}^{\infty}\sqrt{\sum_{j=ks+1}^{(k+1)s} b_j^2} \le \frac{1}{\sqrt{s}}\sum_{j\ge 1} b_j.$$
The proof of Lemma D.1 is provided in Bühlmann and van de Geer (2011), and therefore is omitted.
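As a quick numerical sanity check of the second inequality in Lemma D.1 (this check is our addition and is not part of the original proof; the sequence length and block size below are arbitrary), one can verify the blocking bound in R:

# Check: for a nonincreasing nonnegative sequence b and block size s, the sum of the
# l2 norms of blocks 2, 3, ... is at most ||b||_1 / sqrt(s), as stated in Lemma D.1.
set.seed(0)
b <- sort(abs(rnorm(1000)), decreasing = TRUE)
s <- 10
blocks <- split(b, ceiling(seq_along(b) / s))                 # consecutive blocks of size s
lhs <- sum(sapply(blocks[-1], function(bk) sqrt(sum(bk^2))))  # sum over k >= 1 of block l2 norms
rhs <- sum(b) / sqrt(s)
lhs <= rhs                                                    # TRUE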
By Lemma D.1 and (D.18), we have
$$\sum_{j\ge 1}\|\widetilde{\Delta}_{S_j}\|_2 \le \frac{1}{\sqrt{\widetilde{s}}}\|\widetilde{\Delta}_{\overline{S}}\|_1 \le \frac{5}{3\sqrt{\widetilde{s}}}\|\widetilde{\Delta}_S\|_1 \le \frac{5}{3}\sqrt{\frac{s^*}{\widetilde{s}}}\|\widetilde{\Delta}_S\|_2 \le \frac{5}{3}\sqrt{\frac{s^*}{\widetilde{s}}}\|\widetilde{\Delta}_A\|_2,$$
where $A = S\cup S_0$. By definition of the largest sparse eigenvalue and Assumption 4.6, given $\ddot{\theta} = z\widetilde{\theta} + (1-z)\theta^*$ for any $z\in[0,1]$ and $j\ge 1$, we have
$$\widetilde{\Delta}_{S_j}^\top\nabla^2_{S_j A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A \le \rho_+(s^*+\widetilde{s})\|\widetilde{\Delta}_{S_j}\|_2\|\widetilde{\Delta}_A\|_2,$$
which further implies
$$|\widetilde{\Delta}_{\overline{A}}^\top\nabla^2_{\overline{A}A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A| \le \sum_{j\ge 1}|\widetilde{\Delta}_{S_j}^\top\nabla^2_{S_j A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A| \le \frac{5\rho_+(s^*+\widetilde{s})}{3}\sqrt{\frac{s^*}{\widetilde{s}}}\|\widetilde{\Delta}_A\|_2^2. \quad (D.19)$$
By definition of the smallest sparse eigenvalue and Assumption 4.6 again, we have
$$\frac{\widetilde{\Delta}_A^\top\nabla^2_{AA}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A}{\|\widetilde{\Delta}_A\|_2^2} \ge \rho_-(s^*+\widetilde{s}). \quad (D.20)$$
Combining (D.19) with (D.20), we have
$$|\widetilde{\Delta}_{\overline{A}}^\top\nabla^2_{\overline{A}A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A| \le \frac{5\rho_+(s^*+\widetilde{s})}{3\rho_-(s^*+\widetilde{s})}\sqrt{\frac{s^*}{\widetilde{s}}}\;\widetilde{\Delta}_A^\top\nabla^2_{AA}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A,$$
which further implies
$$\frac{|\widetilde{\Delta}_{\overline{A}}^\top\nabla^2_{\overline{A}A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A|}{\widetilde{\Delta}_A^\top\nabla^2_{AA}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A} \le \frac{5\rho_+(s^*+\widetilde{s})}{3\rho_-(s^*+\widetilde{s})}\sqrt{\frac{s^*}{\widetilde{s}}}.$$
Eventually, we have
$$\widetilde{\Delta}^\top\nabla^2\mathcal{L}(\ddot{\theta})\widetilde{\Delta} \ge \Bigg(1 - \frac{|\widetilde{\Delta}_{\overline{A}}^\top\nabla^2_{\overline{A}A}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A|}{\widetilde{\Delta}_A^\top\nabla^2_{AA}\mathcal{L}(\ddot{\theta})\widetilde{\Delta}_A}\Bigg)\rho_-(s^*+\widetilde{s})\|\widetilde{\Delta}_A\|_2^2 \ge \Bigg(1 - \frac{9\rho_+(s^*+\widetilde{s})}{7\rho_-(s^*+\widetilde{s})}\sqrt{\frac{s^*}{\widetilde{s}}}\Bigg)\rho_-(s^*+\widetilde{s})\|\widetilde{\Delta}_A\|_2^2 \ge \frac{7\rho_-(s^*+\widetilde{s})}{8}\|\widetilde{\Delta}_A\|_2^2, \quad (D.21)$$
where the last inequality comes from Assumption 4.6. Then by the mean value theorem, we choose some $z$ such that
$$\mathcal{L}(\widetilde{\theta}) - \mathcal{L}(\theta^*) - \widetilde{\Delta}^\top\nabla\mathcal{L}(\theta^*) = \frac{1}{2}\widetilde{\Delta}^\top\nabla^2\mathcal{L}(\ddot{\theta})\widetilde{\Delta} \ge \frac{7\rho_-(s^*+\widetilde{s})}{16}\|\widetilde{\Delta}_A\|_2^2,$$
which implies
$$\mathcal{L}(\widetilde{\theta}) - \mathcal{L}(\theta^*) \ge \widetilde{\Delta}^\top\nabla\mathcal{L}(\theta^*) + \frac{7\rho_-(s^*+\widetilde{s})}{16}\|\widetilde{\Delta}_A\|_2^2 \ge \frac{7\rho_-(s^*+\widetilde{s})}{16}\|\widetilde{\Delta}_A\|_2^2 - \frac{\lambda}{8}\|\widetilde{\Delta}_S\|_1 - \frac{\lambda}{8}\|\widetilde{\Delta}_{\overline{S}}\|_1.$$
Then by (D.15) and (D.17), we have
$$\rho_-(s^*+\widetilde{s})\|\widetilde{\Delta}_S\|_2^2 \le \rho_-(s^*+\widetilde{s})\|\widetilde{\Delta}_A\|_2^2 \le \frac{20}{7}\lambda\|\widetilde{\Delta}_S\|_1 \le \frac{20\sqrt{s^*}}{7}\lambda\|\widetilde{\Delta}_S\|_2 \le \frac{20\sqrt{s^*}}{7}\lambda\|\widetilde{\Delta}_A\|_2,$$
which implies
$$\|\widetilde{\Delta}_S\|_2 \le \|\widetilde{\Delta}_A\|_2 \le \frac{20\sqrt{s^*}\lambda}{7\rho_-(s^*+\widetilde{s})} \quad\textrm{and}\quad \|\widetilde{\Delta}_S\|_1 \le \frac{20 s^*\lambda}{7\rho_-(s^*+\widetilde{s})}. \quad (D.22)$$
By Lemma D.1, (D.22) implies
$$\|\widetilde{\Delta}_{\overline{A}}\|_2 \le \frac{\|\widetilde{\Delta}_{\overline{S}}\|_1}{\sqrt{\widetilde{s}}} \le \frac{5\|\widetilde{\Delta}_S\|_1}{3\sqrt{\widetilde{s}}} \le \frac{24\sqrt{s^*}\lambda}{5\rho_-(s^*+\widetilde{s})}.$$
Combining the above results, we have
$$\|\widetilde{\Delta}\|_2 = \sqrt{\|\widetilde{\Delta}_A\|_2^2 + \|\widetilde{\Delta}_{\overline{A}}\|_2^2} \le \frac{17\sqrt{s^*}\lambda}{3\rho_-(s^*+\widetilde{s})} < R,$$
where the last inequality comes from the initial condition of $\theta$. This conflicts with our assumption $\|\widetilde{\Delta}\|_2 = R$. Therefore we must have $\|\theta-\theta^*\|_2 \le R$. Consequently, we repeat the above proof for $\theta$, and obtain
$$\|\Delta\|_2 \le \frac{17\sqrt{s^*}\lambda}{3\rho_-(s^*+\widetilde{s})} \quad\textrm{and}\quad \|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{\overline{S}}\|_1 \le \frac{23 s^*\lambda}{3\rho_-(s^*+\widetilde{s})}.$$
We now proceed to characterize the sparsity of $\theta$. By Assumption 4.1 and the initial condition of $\theta$, we have $\lambda = 2\lambda_N \ge 8\|\nabla\mathcal{L}(\theta^*)\|_\infty$, which further implies
$$\Big|\Big\{ j \;\Big|\; |\nabla_j\mathcal{L}(\theta^*)| \ge \frac{\lambda}{8},\; j\in\overline{S} \Big\}\Big| = 0. \quad (D.23)$$
We then consider an arbitrary set $S'$ such that
$$S' \subseteq \Big\{ j \;\Big|\; |\nabla_j\mathcal{L}(\theta) - \nabla_j\mathcal{L}(\theta^*)| \ge \frac{5\lambda}{8},\; j\in\overline{S} \Big\}.$$
Let $s' = |S'|$. Then there exists $v$ such that
$$\|v\|_\infty = 1,\quad \|v\|_0 \le s', \quad\textrm{and}\quad \frac{5s'\lambda}{8} \le v^\top(\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\theta^*)).$$
Since $\mathcal{L}(\theta)$ is twice differentiable, then by the mean value theorem, there exists a convex combination of $\theta$ and $\theta^*$ such that for some $z_1\in[0,1]$, we have
$$\ddot{\theta} = z_1\theta + (1-z_1)\theta^* \quad\textrm{and}\quad \nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\theta^*) = \nabla^2\mathcal{L}(\ddot{\theta})\Delta.$$
Then we have
$$\frac{5s'\lambda}{8} \le v^\top\nabla^2\mathcal{L}(\ddot{\theta})\Delta \le \sqrt{v^\top\nabla^2\mathcal{L}(\ddot{\theta})v}\sqrt{\Delta^\top\nabla^2\mathcal{L}(\ddot{\theta})\Delta}.$$
Since we have $\|v\|_0 \le s'$, then we obtain
$$\frac{5s'\lambda}{8} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\Delta^\top(\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\theta^*))} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\|\Delta\|_1\cdot\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\theta^*)\|_\infty} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\|\Delta\|_1(\|\nabla\mathcal{L}(\theta)\|_\infty + \|\nabla\mathcal{L}(\theta^*)\|_\infty)} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\|\Delta\|_1(\|\nabla\mathcal{L}(\theta) + \lambda\widetilde{\xi}\|_\infty + \lambda\|\widetilde{\xi}\|_\infty + \|\nabla\mathcal{L}(\theta^*)\|_\infty)} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\frac{115 s^*\lambda^2}{12\rho_-(s^*+\widetilde{s})}}.$$
By simple manipulation, we have $\frac{5s'}{8} \le \sqrt{\rho_+(s')}\sqrt{s'}\sqrt{\frac{115 s^*}{12\rho_-(s^*+\widetilde{s})}}$, which implies $s' \le \frac{184\rho_+(s')}{15\rho_-(s^*+\widetilde{s})}\cdot s^*$. Since $S'$ is arbitrarily chosen, by simple manipulation, we have
$$\Big|\Big\{ j \;\Big|\; |\nabla_j\mathcal{L}(\theta) - \nabla_j\mathcal{L}(\theta^*)| \ge \frac{5\lambda}{8},\; j\in\overline{S} \Big\}\Big| \le 13\kappa s^* < \widetilde{s}. \quad (D.24)$$
Therefore (D.23) and (D.24) imply that for any $u\in\mathbb{R}^d$ satisfying $\|u\|_\infty \le 1$, we have
$$\Big|\Big\{ j \;\Big|\; \Big|\nabla_j\mathcal{L}(\theta) + \frac{\lambda}{8}u_j\Big| \ge \frac{7\lambda}{8},\; j\in\overline{S} \Big\}\Big| \le \widetilde{s}.$$
Then for any $j\in\overline{S}$ satisfying $|\nabla_j\mathcal{L}(\theta) + \frac{\lambda}{8}u_j| \le 7\lambda/8$, there exists some $\xi_j$ satisfying
$$|\xi_j| \le 1 \quad\textrm{and}\quad \nabla_j\mathcal{L}(\theta) + \frac{\lambda}{8}u_j + \lambda\xi_j = 0,$$
which implies $\theta_j = 0$. Therefore we have $\|\theta_{\overline{S}}\|_0 \le \widetilde{s}$. Since $\theta$ is sufficiently sparse, we know that the restricted strong convexity holds for $\theta$ and $\theta^*$. Then we can refine our analysis for $\theta$. By the restricted strong convexity of $\widetilde{\mathcal{F}}_\lambda(\theta)$, we have
$$\widetilde{\mathcal{F}}_\lambda(\theta^*) - \frac{\rho_-(s^*+\widetilde{s})}{2}\|\Delta\|_2^2 \ge \widetilde{\mathcal{F}}_\lambda(\theta) - \Delta^\top(\nabla\mathcal{L}(\theta) + \lambda\widetilde{\xi}) \ge \widetilde{\mathcal{F}}_\lambda(\theta) - \frac{\lambda}{8}\|\Delta\|_1. \quad (D.25)$$
By simple manipulation, we can rewrite (D.25) as
$$\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \le \lambda\|\theta^*\|_1 - \lambda\|\theta\|_1 + \frac{\lambda}{8}\|\Delta\|_1.$$
By the restricted strong convexity of $\mathcal{L}(\theta)$, we have
$$\mathcal{L}(\theta) - \mathcal{L}(\theta^*) - \frac{\rho_-(s^*+\widetilde{s})}{2}\|\Delta\|_2^2 \ge \Delta^\top\nabla\mathcal{L}(\theta^*) \ge -\frac{\lambda}{8}\|\Delta_S\|_1 - \frac{\lambda}{8}\|\Delta_{\overline{S}}\|_1, \quad (D.26)$$
where the last inequality comes from our assumption $\lambda \ge 8\|\nabla\mathcal{L}(\theta^*)\|_\infty$. By the decomposability of the $\ell_1$ norm, we have
$$\|\theta^*\|_1 - \|\theta\|_1 + \frac{1}{8}\|\Delta\|_1 = \|\theta_S^*\|_1 - (\|\theta_S\|_1 + \|\Delta_{\overline{S}}\|_1) + \frac{1}{8}\|\Delta_S\|_1 + \frac{1}{8}\|\Delta_{\overline{S}}\|_1 \le \frac{9}{8}\|\Delta_S\|_1 - (1-\delta)\|\Delta_{\overline{S}}\|_1 \le \frac{9}{8}\|\Delta_S\|_1 - \frac{7}{8}\|\Delta_{\overline{S}}\|_1, \quad (D.27)$$
where the last inequality comes from $\delta < 1/8$ in Assumption 4.1. Combining (D.18) and (D.15) with (D.26) and (D.27), we obtain
$$\rho_-(s^*+\widetilde{s})\|\Delta\|_2^2 \le \frac{5\lambda}{4}\|\Delta_S\|_1 \le \frac{5\lambda\sqrt{s^*}}{4}\|\Delta_S\|_2 \le \frac{5\lambda\sqrt{s^*}}{4}\|\Delta\|_2,$$
which implies that
$$\|\Delta\|_2 \le \frac{5\lambda\sqrt{s^*}}{4\rho_-(s^*+\widetilde{s})} \quad\textrm{and}\quad \|\Delta_S\|_1 \le \sqrt{s^*}\|\Delta_S\|_2 \le \frac{5\lambda s^*}{4\rho_-(s^*+\widetilde{s})}.$$
By (D.18), we further have
$$\|\Delta\|_1 \le \frac{8}{3}\|\Delta_S\|_1 \le \frac{10\lambda s^*}{3\rho_-(s^*+\widetilde{s})}. \quad (D.28)$$
Plugging (D.28) into (D.25), we have
$$\widetilde{\mathcal{F}}_\lambda(\theta) \le \widetilde{\mathcal{F}}_\lambda(\theta^*) + \frac{8\lambda^2 s^*}{7\rho_-(s^*+\widetilde{s})}.$$
By the concavity of $\mathcal{H}_\lambda(\theta)$ and Hölder's inequality, we have
$$\mathcal{H}_\lambda(\theta^{\rm relax}) \le \mathcal{H}_\lambda(\theta^*) + (\theta^{\rm relax} - \theta^*)^\top\nabla\mathcal{H}_\lambda(\theta^*) \le \mathcal{H}_\lambda(\theta^*) + \|\theta^{\rm relax} - \theta^*\|_1\|\nabla\mathcal{H}_\lambda(\theta^*)\|_\infty.$$
Since we have $\|\nabla\mathcal{H}_\lambda(\theta^*)\|_\infty \le \lambda$, by Lemma 5.3, we have
$$\mathcal{H}_\lambda(\theta^{\rm relax}) \le \mathcal{H}_\lambda(\theta^*) + \lambda\|\theta^{\rm relax} - \theta^*\|_1 \le \mathcal{H}_\lambda(\theta^*) + 4\lambda_0.$$
Since $\mathcal{F}_{\lambda_0}(\theta) = \widetilde{\mathcal{F}}_{\lambda_0}(\theta) + \mathcal{H}_{\lambda_0}(\theta)$, by Lemma 5.3 again, we have $\mathcal{F}_{\lambda_0}(\theta^{\rm relax}) \le \mathcal{F}_{\lambda_0}(\theta^*) + 4\lambda_0$. Therefore $\theta^{\rm relax}$ is a proper initial solution for solving (1.1) with $\lambda_0$ by PICASSO.
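Lemma 5.3 concerns the $\ell_1$-regularized (relaxed) subproblem solved only approximately, up to the KKT tolerance in (5.7), to produce the initializer $\theta^{\rm relax}$. The following R sketch shows one standard way to compute such an approximate solution by cyclic coordinate descent for the least-squares loss; it is a simplified illustration rather than the picasso implementation (no active sets, strong rule, or nonconvex regularizers), the names soft_threshold and lasso_cd are ours, and it reuses the kkt_residual helper sketched earlier.

# Minimal sketch: cyclic coordinate descent for
#   min_theta ||y - X theta||^2 / (2n) + lambda * ||theta||_1,
# stopped once the approximate KKT condition (cf. (5.7)) holds with tolerance tol.
# Assumes no all-zero columns in X.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(X, y, lambda, theta0 = numeric(ncol(X)), tol = lambda / 8, max_pass = 1000L) {
  n <- nrow(X)
  theta <- theta0
  col_sq <- colSums(X^2) / n                        # coordinate-wise curvature (1/n) X_j^T X_j
  r <- as.vector(y - X %*% theta)                   # current residual y - X theta
  for (pass in seq_len(max_pass)) {
    for (j in seq_len(ncol(X))) {
      old <- theta[j]
      zj <- sum(X[, j] * r) / n + col_sq[j] * old   # (1/n) X_j^T (r + X_j * old)
      theta[j] <- soft_threshold(zj, lambda) / col_sq[j]
      if (theta[j] != old) r <- r - X[, j] * (theta[j] - old)
    }
    if (kkt_residual(X, y, theta, lambda) <= tol) break
  }
  theta
}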
D.3 Proof of Theorem 8.11

Proof. Let $\widetilde{\xi}\in\partial\|\theta\|_1$ be a subgradient vector satisfying $\mathcal{K}_\lambda(\theta) = \|\nabla\widetilde{\mathcal{L}}_\lambda(\theta) + \lambda\widetilde{\xi}\|_\infty$. By the restricted strong convexity of $\widetilde{\mathcal{L}}_{\lambda'}(\theta)$, we have
$$\mathcal{F}_{\lambda'}(\theta) - \mathcal{F}_{\lambda'}(\bar{\theta}^{\lambda'}) \le (\theta - \bar{\theta}^{\lambda'})^\top(\nabla\mathcal{L}(\theta) + \nabla\mathcal{H}_{\lambda'}(\theta) + \lambda'\widetilde{\xi}) = (\theta - \bar{\theta}^{\lambda'})^\top(\nabla\mathcal{L}(\theta) + \nabla\mathcal{H}_\lambda(\theta) + \lambda\widetilde{\xi} - \lambda\widetilde{\xi} + \lambda'\widetilde{\xi} - \nabla\mathcal{H}_\lambda(\theta) + \nabla\mathcal{H}_{\lambda'}(\theta)) \overset{(i)}{\le} \|\theta - \bar{\theta}^{\lambda'}\|_1\big(\|\nabla\mathcal{L}(\theta) + \nabla\mathcal{H}_\lambda(\theta) + \lambda\widetilde{\xi}\|_\infty + (\lambda - \lambda') + \|\nabla\mathcal{H}_\lambda(\theta) - \nabla\mathcal{H}_{\lambda'}(\theta)\|_\infty\big) \overset{(ii)}{\le} (\mathcal{K}_\lambda(\theta) + 3(\lambda - \lambda'))\|\theta - \bar{\theta}^{\lambda'}\|_1, \quad (D.29)$$
where (i) comes from Hölder's inequality and $\|\widetilde{\xi}\|_\infty \le 1$, and (ii) comes from (R.3) of Assumption 4.3. Meanwhile, since we have
$$\|\bar{\theta}^{\lambda'}_{\overline{S}}\|_0 \le \widetilde{s},\quad \mathcal{K}_{\lambda'}(\bar{\theta}^{\lambda'}) = 0 \le \lambda'/4,\quad \|\theta_{\overline{S}}\|_0 \le \widetilde{s},\quad\textrm{and}\quad \mathcal{K}_\lambda(\theta) \le \lambda/4,$$
following similar lines to the proof of Theorem 4.10, we have
$$\|\bar{\theta}^{\lambda'} - \theta^*\|_1 \le \frac{25\lambda' s^*}{\widetilde{\rho}_-(s^*+\widetilde{s})} \quad\textrm{and}\quad \|\theta - \theta^*\|_1 \le \frac{25\lambda s^*}{\widetilde{\rho}_-(s^*+\widetilde{s})},$$
which further implies
$$\|\theta - \bar{\theta}^{\lambda'}\|_1 \le \|\theta^* - \theta\|_1 + \|\theta^* - \bar{\theta}^{\lambda'}\|_1 \le \frac{50(\lambda + \lambda')s^*}{\widetilde{\rho}_-(s^*+\widetilde{s})}. \quad (D.30)$$
Plugging (D.30) into (D.29), we obtain
$$\mathcal{F}_{\lambda'}(\theta) - \mathcal{F}_{\lambda'}(\bar{\theta}^{\lambda'}) \le \frac{50(\mathcal{K}_\lambda(\theta) + 3(\lambda - \lambda'))(\lambda + \lambda')s^*}{\widetilde{\rho}_-(s^*+\widetilde{s})}.$$
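Theorem 8.11 quantifies why an approximate solution at $\lambda$ is a good warm start at the next, smaller parameter $\lambda'$ on the path. The sketch below strings the helpers above into a pathwise warm-start loop over a geometric sequence of regularization parameters; the starting value $\lambda_{\max} = \|\frac{1}{n}X^\top y\|_\infty$, the decay factor, and the number of path points are illustrative assumptions on our part, and solve_path is a hypothetical name, not a picasso function.

# Minimal sketch: warm-start initialization along a geometric regularization path,
# reusing the hypothetical lasso_cd() helper above.
solve_path <- function(X, y, eta = 0.95, n_lambda = 50L) {
  n <- nrow(X); d <- ncol(X)
  lambda_max <- max(abs(crossprod(X, y))) / n            # all-zero solution for lambda >= lambda_max
  lambdas <- lambda_max * eta^(seq_len(n_lambda) - 1)    # decreasing geometric sequence
  thetas <- matrix(0, d, n_lambda)
  theta <- numeric(d)
  for (t in seq_along(lambdas)) {
    theta <- lasso_cd(X, y, lambdas[t], theta0 = theta)  # warm start from the previous solution
    thetas[, t] <- theta
  }
  list(lambda = lambdas, theta = thetas)
}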
E Lemmas for Statistical Theory

E.1 Proof of Lemma 4.14
Proof. By Lemma 8.14, we have
$$\|\nabla\mathcal{L}(\theta^*)\|_\infty = \Big\|\frac{1}{n}X^\top(y - X\theta^*)\Big\|_\infty = \Big\|\frac{1}{n}X^\top\epsilon\Big\|_\infty. \quad (E.1)$$
Since we take $\lambda = 8\sigma\sqrt{\log d/n}$, combining (E.1) with Lemma 8.14, we obtain
$$\mathbb{P}(\lambda \ge 4\|\nabla\mathcal{L}(\theta^*)\|_\infty) \ge 1 - \frac{2}{d^2}.$$
Moreover, for any $v\in\mathbb{R}^d$ such that $\|v\|_0 \le s$, we have $\|v\|_1 \le \sqrt{s}\|v\|_2$. Then (4.3) implies
$$\frac{\|Xv\|_2^2}{n} \ge \psi_{\min}\|v\|_2^2 - \gamma_{\min}\frac{s\log d}{n}\|v\|_2^2. \quad (E.2)$$
By simple manipulation, (E.2) implies
$$\frac{\|Xv\|_2^2}{n} \ge \frac{3\psi_{\min}}{4}\|v\|_2^2 \quad (E.3)$$
for $n$ large enough such that
$$\gamma_{\min}\frac{s\log d}{n} \le \frac{\psi_{\min}}{4}. \quad (E.4)$$
Similarly, we can show that (4.3) implies
$$\frac{\|Xv\|_2^2}{n} \le \frac{5\psi_{\max}}{4}\|v\|_2^2 \quad (E.5)$$
for $n$ large enough such that
$$\gamma_{\max}\frac{s\log d}{n} \le \frac{\psi_{\max}}{4}. \quad (E.6)$$
Since $v$ is an arbitrary sparse vector, for $\alpha = \psi_{\min}/4$, (E.3) and (E.5) guarantee
$$\widetilde{\rho}_-(s) \ge \rho_-(s) - \alpha \ge \frac{\psi_{\min}}{2} \quad\textrm{and}\quad \rho_+(s) \le \frac{5\psi_{\max}}{4}. \quad (E.7)$$
Let $s = s^* + 2\widetilde{s} + 1$. (E.7) implies
$$484\kappa^2 + 100\kappa \le 484\cdot\frac{25\psi_{\max}^2}{4\psi_{\min}^2} + 100\cdot\frac{5\psi_{\max}}{2\psi_{\min}}.$$
Then we can choose $C_1$ as
$$C_1 = 3025\cdot\frac{\psi_{\max}^2}{\psi_{\min}^2} + 250\cdot\frac{\psi_{\max}}{\psi_{\min}}$$
such that $\widetilde{s} = C_1 s^* \ge (484\kappa^2 + 100\kappa)s^*$. Meanwhile, we also need a large enough $n$ satisfying
$$\frac{\log d}{n} \le \frac{\psi_{\min}}{4\gamma_{\min}(s^* + 2C_1 s^* + 1)} \quad\textrm{and}\quad \frac{\log d}{n} \le \frac{\psi_{\max}}{4\gamma_{\max}(s^* + 2C_1 s^* + 1)}$$
such that (E.4) and (E.6) hold.
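As a quick arithmetic check of the constant $C_1$ (our illustration; the values $\psi_{\min} = 0.5$ and $\psi_{\max} = 2$ are hypothetical), one can verify in R that $484\kappa^2 + 100\kappa \le C_1$ when $\kappa$ takes its extreme value $5\psi_{\max}/(2\psi_{\min})$; at that extreme the two sides coincide.

# Check the choice of C1 in the proof of Lemma 4.14 for illustrative psi values.
psi_min <- 0.5; psi_max <- 2
kappa <- (5 * psi_max / 4) / (psi_min / 2)              # upper bound 5*psi_max / (2*psi_min) from (E.7)
C1 <- 3025 * psi_max^2 / psi_min^2 + 250 * psi_max / psi_min
c(lhs = 484 * kappa^2 + 100 * kappa, C1 = C1)           # lhs <= C1 (here both equal 49400)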
E.2 Proof of Lemma 8.12
b λ, Proof. For notational simplicity, we omit the index N and denote θb{N } , λN , and δN by θ, b 1 be a subgradient vector satisfying b = θb − θ ∗ . Let ξb ∈ ∂kθk and δ respectively. We define ∆ b = k∇Leλ (θ) b + λξk b ∞ ≤ δλ. Then by the restricted strong convexity of Fλ (θ), we have Kλ (θ) ∗ b ≥ Fλ (θ ∗ ) + ∆ e + ρe− (s + se) k∆k b > (∇Leλ (θ ∗ ) + λξ) b 2, Fλ (θ) (E.8) 2 2 ∗ b −∆ b + λξ) b + ρe− (s + se) k∆k b 2, b > (∇Leλ (θ) (E.9) Fλ (θ ∗ ) ≥ Fλ (θ) 2 2 where ξe ∈ ∂kθ ∗ k1 . Combining (E.8) with (E.9), we have b + λξk b ∞−∆ e b 2 ≤ k∆k b 1 k∇Leλ (θ) b > (∇L(θ ∗ ) + ∇Hλ (θ ∗ ) + λξ) ρe− (s∗ + 2e s)k∆k 2
e + δλk∆k b > (∇L(θ ∗ ) + ∇Hλ (θ ∗ ) + λξ)| b . ≤ |∆ | {z } | {z }1 V0
(E.10)
V4
[Bounding V0 ] We consider the following decomposition X e ≤ b > (∇L(θ ∗ ) + ∇Hλ (θ ∗ ) + λξ)| b > (∇A L(θ ∗ ) + ∇A Hλ (θ ∗ ) + λξeA )|, |∆ |∆ A
A∈{S1 ,S2 ,S}
where S1 = {j | |θj∗ | ≥ γλ} and S2 = {j | 0 < |θj∗ | < γλ}. For S, we have k∇S L(θ ∗ )k∞ ≤ λ/4 and ∇S Hλ (θ ∗ ) = 0. Thus there exists some ξeS ∈ ∂kθS∗ k1 such that ∇S L(θ ∗ ) + ∇S Hλ (θ ∗ ) + λξeS = 0, which implies b > (∇ L(θ ∗ ) + ∇ Hλ (θ ∗ ) + λξe )| = 0. |∆ (E.11) S
S
S θj∗ .
|θj∗ |
For all j ∈ S1 , we have > γλ and |θj | is smooth at θj = Thus by (R.2) of Assumption 4.3, ∗ we have ∇S1 Hλ (θ ) + λξeS1 = 0, which implies b > (∇S L(θ ∗ ) + ∇S Hλ (θ ∗ ) + λξeS )| = |∆ b > ∇S L(θ ∗ )| |∆ S1 S1 1 1 1 1 b S k2 k∇S L(θ ∗ )k2 ≤ k∆k b 2 k∇S L(θ ∗ )k2 . ≤ k∆ (E.12) 1
1
We then consider S2 . Then we have b > (∇S L(θ ∗ ) + ∇S Hλ (θ ∗ ) + λξeS )| |∆ S2 2 2 2
b S k1 (k∇S L(θ ∗ )k∞ + k∇S Hλ (θ ∗ )k∞ + kλξeS k∞ ) ≤ 3λ ≤ k∆ 2 2 2 2
Combining (E.11) and (E.12) with (E.13), we have p b 2 + 3λ |S2 |k∆k b 2. V0 ≤ k∇S L(θ ∗ )k2 k∆k 1
1
p b 2. |S2 |k∆k
(E.13) (E.14)
[Bounding V4 ] We then proceed to bound V4 . Since θ satisfies the approximate KKT condition, by 22
√ b 1 ≤ 11 s∗ k∆k b 2 . Therefore by (E.14) into (E.10), we have Theorem 4.10, we have k∆k p √ b 2 + 3λ |S2 |k∆k b 2 ≤ k∇S L(θ ∗ )k2 k∆k b 2 + 11δλ s∗ k∆k b 2. ρe− (s∗ + se)k∆k 2 1 Solving the above inequality, we complete the proof.
E.3 Proof of Lemma 8.13
Proof. For any $\epsilon, \widetilde{\epsilon}\in\mathbb{R}^n$, we have
$$\Big\|\frac{1}{n}X_{*S_1}^\top(\epsilon - \widetilde{\epsilon})\Big\|_2 = \frac{1}{n}\sqrt{(\epsilon - \widetilde{\epsilon})^\top X_{*S_1}X_{*S_1}^\top(\epsilon - \widetilde{\epsilon})} \le \sqrt{\frac{1}{n}\Lambda_{\max}\Big(\frac{1}{n}X_{*S_1}X_{*S_1}^\top\Big)}\,\|\epsilon - \widetilde{\epsilon}\|_2. \quad (E.15)$$
Since we have
$$\Lambda_{\max}\Big(\frac{1}{n}X_{*S_1}X_{*S_1}^\top\Big) = \Lambda_{\max}\Big(\frac{1}{n}X_{*S_1}^\top X_{*S_1}\Big) = \rho_+(|S_1|),$$
then (E.15) implies
$$\Big\|\frac{1}{n}X_{*S_1}^\top(\epsilon - \widetilde{\epsilon})\Big\|_2 \le \sqrt{\frac{\rho_+(|S_1|)}{n}}\,\|\epsilon - \widetilde{\epsilon}\|_2.$$
Therefore $\|\frac{1}{n}X_{*S_1}^\top\epsilon\|_2$ is a Lipschitz continuous function of $\epsilon$ with a Lipschitz constant $\sqrt{\rho_+(|S_1|)/n}$. By Talagrand's inequality (Ledoux and Talagrand, 2011), we have
$$\mathbb{P}\Big(\Big\|\frac{1}{n}X_{*S_1}^\top\epsilon\Big\|_2 \ge \mathbb{E}\Big\|\frac{1}{n}X_{*S_1}^\top\epsilon\Big\|_2 + t\sqrt{\rho_+(|S_1|)}\Big) \le 2\exp\Big(-\frac{nt^2}{2\sigma^2}\Big).$$
For any vector $v\in\mathbb{R}^{|S_1|}$, we define a zero mean Gaussian random variable $Z(v) = \frac{1}{n}v^\top X_{*S_1}^\top\epsilon$. Then we have
$$\Big\|\frac{1}{n}X_{*S_1}^\top\epsilon\Big\|_2 = \max_{\|v\|_2 = 1} Z(v),$$
which is the supremum of a Gaussian process, and can be upper bounded by the Gaussian comparison principle. For any $v, \widetilde{v}\in\mathbb{R}^{|S_1|}$ satisfying $\|v\|_2 \le 1$ and $\|\widetilde{v}\|_2 \le 1$, we have
$$\mathbb{E}(Z(v) - Z(\widetilde{v}))^2 = \frac{\sigma^2}{n^2}\|X_{*S_1}(v - \widetilde{v})\|_2^2 \le \frac{\sigma^2}{n}\Lambda_{\max}\Big(\frac{1}{n}X_{*S_1}^\top X_{*S_1}\Big)\|v - \widetilde{v}\|_2^2 \le \frac{\rho_+(|S_1|)\sigma^2}{n}\|v - \widetilde{v}\|_2^2.$$
We then define a second Gaussian process
$$Y(v) = \sqrt{\frac{\rho_+(|S_1|)}{n}}\, v^\top w,$$
where $w\sim N(0, \sigma^2 I)$ is a Gaussian vector. By construction, for any pair of $v$ and $\widetilde{v}$, we have
$$\mathbb{E}(Y(v) - Y(\widetilde{v}))^2 = \frac{\rho_+(|S_1|)\sigma^2}{n}\|v - \widetilde{v}\|_2^2 \ge \mathbb{E}(Z(v) - Z(\widetilde{v}))^2,$$
then by the Sudakov-Fernique comparison principle, we have
$$\mathbb{E}\Big\|\frac{1}{n}X_{*S_1}^\top\epsilon\Big\|_2 = \mathbb{E}\max_{\|v\|_2 = 1} Z(v) \le \mathbb{E}\max_{\|v\|_2 = 1} Y(v).$$
By definition of $Y(v)$, we have
$$\mathbb{E}\max_{\|v\|_2 = 1} Y(v) = \sqrt{\frac{\rho_+(|S_1|)}{n}}\,\mathbb{E}\|w\|_2 \le \sqrt{\frac{\rho_+(|S_1|)}{n}}\sqrt{\mathbb{E}\|w\|_2^2} = \sqrt{\frac{\rho_+(|S_1|)}{n}}\,\sigma\sqrt{|S_1|}.$$
Taking $t = 2\sigma\sqrt{|S_1|/n}$, we have
$$\mathbb{P}\Bigg(\Big\|\frac{1}{n}X_{*S_1}^\top\epsilon\Big\|_2 \ge 3\sigma\sqrt{\frac{\rho_+(|S_1|)|S_1|}{n}}\Bigg) \le 2\exp(-2|S_1|),$$
which completes the proof.
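The tail bound in Lemma 8.13 is easy to probe numerically. The snippet below (our illustration; the sizes $n$, $|S_1|$, and the number of Monte Carlo draws are arbitrary) draws Gaussian noise repeatedly and compares the empirical frequency of $\|\frac{1}{n}X_{*S_1}^\top\epsilon\|_2$ exceeding $3\sigma\sqrt{\rho_+(|S_1|)|S_1|/n}$ with the theoretical bound $2\exp(-2|S_1|) \approx 9\times 10^{-5}$.

# Monte Carlo sanity check of the tail bound in Lemma 8.13.
set.seed(1)
n <- 200; s1 <- 5; sigma <- 1
X <- matrix(rnorm(n * s1), n, s1)                       # plays the role of X_{*S1}
rho_plus <- max(eigen(crossprod(X) / n, symmetric = TRUE, only.values = TRUE)$values)
thr <- 3 * sigma * sqrt(rho_plus * s1 / n)
stat <- replicate(2000, {
  eps <- rnorm(n, sd = sigma)
  sqrt(sum((crossprod(X, eps) / n)^2))                  # || (1/n) X_{*S1}^T eps ||_2
})
mean(stat >= thr)                                       # empirical tail probability, typically 0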
E.4 Proof of Lemma 8.15
Proof. We then proceed to establish the error bound of the oracle estimator under the $\ell_\infty$ norm. Since Lemma 4.14 guarantees that $\rho_-(s) > 0$, (4.4) is a strongly convex problem over $\theta_S$ with a unique optimum
$$\widehat{\theta}^{\rm o}_S = (X_{*S}^\top X_{*S})^{-1}X_{*S}^\top y. \quad (E.16)$$
Then conditioning on the event $\mathcal{E}_1 = \Big\{\frac{1}{n}\|X^\top\epsilon\|_\infty \le 2\sigma\sqrt{\frac{\log d}{n}}\Big\}$, we can rewrite (E.16) as
$$\|\widehat{\theta}^{\rm o}_S - \theta_S^*\|_\infty = \|(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top(y - \mathbb{E}y)\|_\infty = \|(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top\epsilon\|_\infty \le \max_{j\in S}\frac{1}{\rho_-(s^*)}\Big|\frac{1}{n}X_{*j}^\top\epsilon\Big| \le \frac{1}{\rho_-(s^*)}\Big\|\frac{1}{n}X_{*S}^\top\epsilon\Big\|_\infty \le \frac{2\sigma}{\rho_-(s^*)}\sqrt{\frac{\log d}{n}}. \quad (E.17)$$
Since $\theta^*$ satisfies (4.5), (E.17) implies
$$\min_{j\in S}|\widehat{\theta}^{\rm o}_j| = \min_{j\in S}|\widehat{\theta}^{\rm o}_j - \theta_j^* + \theta_j^*| \ge \min_{j\in S}|\theta_j^*| - \|\widehat{\theta}^{\rm o}_S - \theta_S^*\|_\infty \ge \frac{C_5\sigma}{\psi_{\min}}\sqrt{\frac{\log d}{n}} - \frac{2\sigma}{\rho_-(s^*)}\sqrt{\frac{\log d}{n}} \ge \frac{C_5\sigma}{\psi_{\min}}\sqrt{\frac{\log d}{n}} - \frac{4\sigma}{\psi_{\min}}\sqrt{\frac{\log d}{n}}, \quad (E.18)$$
where the last inequality comes from Lemma 4.14. We then take $C_5 = 260$, and (E.18) implies
$$\min_{j\in S}|\widehat{\theta}^{\rm o}_j| \ge \sigma\Big(\frac{C_5}{8\psi_{\min}} - \frac{1}{2\psi_{\min}}\Big)\sqrt{\frac{\log d}{n}} \ge \frac{32\sigma}{\psi_{\min}}\sqrt{\frac{\log d}{n}} = \gamma\lambda,$$
where the last equality comes from $\gamma = 1/\alpha = 4/\psi_{\min}$. Then by (R.2) of Assumption 4.3, we have
$$\nabla_S\mathcal{H}_\lambda(\widehat{\theta}^{\rm o}) + \lambda\nabla\|\widehat{\theta}^{\rm o}_S\|_1 = 0. \quad (E.19)$$
Combining (E.19) with the optimality condition of (4.4), we have
$$\frac{1}{n}X_{*S}^\top(y - X\widehat{\theta}^{\rm o}) + \nabla_S\mathcal{H}_\lambda(\widehat{\theta}^{\rm o}) + \lambda\nabla\|\widehat{\theta}^{\rm o}_S\|_1 = 0. \quad (E.20)$$
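For intuition, the oracle estimator in (E.16) is ordinary least squares restricted to the true support $S$. The R sketch below (our illustration on simulated data; all sizes and the signal value are arbitrary) computes it and reports the sup-norm error that (E.17) controls at the order $\sigma\sqrt{\log d/n}$.

# Oracle least-squares estimator on the true support S, cf. (E.16)-(E.17).
set.seed(2)
n <- 200; d <- 500; s_star <- 5; sigma <- 1
X <- matrix(rnorm(n * d), n, d)
S <- 1:s_star
theta_star <- numeric(d); theta_star[S] <- 2
y <- X %*% theta_star + rnorm(n, sd = sigma)
theta_oracle_S <- solve(crossprod(X[, S]), crossprod(X[, S], y))  # (X_S^T X_S)^{-1} X_S^T y
max(abs(theta_oracle_S - theta_star[S]))                          # sup-norm error, order sigma*sqrt(log(d)/n)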
E.5 Proof of Lemma 8.16
Proof. We then consider the decomposition
$$\|X_{*\overline{S}}^\top(y - X\widehat{\theta}^{\rm o})\|_\infty = \|X_{*\overline{S}}^\top(y - X_{*S}\widehat{\theta}^{\rm o}_S)\|_\infty = \big\|X_{*\overline{S}}^\top\big(X_{*S}\theta_S^* + \epsilon - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top(X_{*S}\theta_S^* + \epsilon)\big)\big\|_\infty = \|X_{*\overline{S}}^\top(I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top)\epsilon\|_\infty \le \|U^\top\epsilon\|_\infty, \quad (E.21)$$
where $U = (I - X_{*S}(X_{*S}^\top X_{*S})^{-1}X_{*S}^\top)X$. Conditioning on the event $\mathcal{E}_2 = \Big\{\frac{1}{n}\|U^\top\epsilon\|_\infty \le 2\sigma\sqrt{\frac{\log d}{n}}\Big\}$, (E.21) implies
$$\frac{1}{n}\|X_{*\overline{S}}^\top(y - X\widehat{\theta}^{\rm o})\|_\infty \le \frac{\lambda}{4}. \quad (E.22)$$
By (R.3) of Assumption 4.3, we have $\nabla_{\overline{S}}\mathcal{H}_\lambda(\widehat{\theta}^{\rm o}) = 0$. Since $|\theta_j|$ is non-differentiable when $\theta_j = 0$, (E.22) implies that there exists some $\widehat{\xi}^{\rm o}_{\overline{S}}\in\partial\|\widehat{\theta}^{\rm o}_{\overline{S}}\|_1$ such that
$$\frac{1}{n}X_{*\overline{S}}^\top(y - X\widehat{\theta}^{\rm o}) + \nabla_{\overline{S}}\mathcal{H}_\lambda(\widehat{\theta}^{\rm o}) + \lambda\widehat{\xi}^{\rm o}_{\overline{S}} = 0. \quad (E.23)$$
E.6 Proof of Theorem 5.5
Proof. For notational simplicity, we omit the index $N$ and denote $\widehat{\theta}^{\{N\}}$, $\lambda_N$, and $\delta_N$ by $\widehat{\theta}$, $\lambda$, and $\delta$ respectively. According to Lemma 8.12, we only need to bound $\|\nabla_{S_1}\mathcal{L}(\theta^*)\|_2$. We have
$$\|\nabla_{S_1}\mathcal{L}(\theta^*)\|_2 = \Big\|\frac{1}{n}\sum_{i=1}^n (y_i - \pi_i(\theta^*))X_{iS_1}\Big\|_2 \le \sqrt{|S_1|}\cdot\Big\|\frac{1}{n}\sum_{i=1}^n (y_i - \pi_i(\theta^*))X_{iS_1}\Big\|_\infty.$$
By Lemma 5.2, we obtain
$$\mathbb{P}\Bigg(\|\nabla_{S_1}\mathcal{L}(\theta^*)\|_2 \le 4\sqrt{\frac{|S_1|\log|S_1|}{n}}\Bigg) \ge 1 - \frac{3}{|S_1|^3}.$$
We then follow similar lines in Appendix 8.4 and derive the rate of convergence under the $\ell_2$ norm.
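The gradient appearing in this proof is the usual logistic-loss gradient, with $\pi_i(\theta)$ the logistic mean. A short R sketch (our illustration; logistic_grad is a hypothetical name) makes the formula concrete:

# Gradient of the logistic loss L(theta) = (1/n) * sum_i [ log(1 + exp(x_i^T theta)) - y_i * x_i^T theta ],
# which equals -(1/n) * X^T (y - pi(theta)); plogis() is the logistic CDF in base R.
logistic_grad <- function(X, y, theta) {
  -crossprod(X, y - plogis(X %*% theta)) / nrow(X)
}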