Linear Convergence of Variance-Reduced Projected Stochastic Gradient without Strong Convexity

arXiv:1406.1102v1 [cs.NA] 4 Jun 2014

Pinghua Gong∗

Jieping Ye†

June 5, 2014

Abstract

Stochastic gradient algorithms compute the gradient based on only one sample (or just a few samples) and enjoy low computational cost per iteration. They are widely used in large-scale optimization problems. However, stochastic gradient algorithms are usually slow to converge and achieve sub-linear convergence rates, due to the inherent variance in the gradient computation. To accelerate the convergence, some variance-reduced stochastic gradient algorithms have been proposed. Under the strongly convex condition, these variance-reduced stochastic gradient algorithms achieve a linear convergence rate. However, in many machine learning problems, the objective function to be minimized is convex but not strongly convex. In this paper, we propose a Variance-Reduced Projected Stochastic Gradient (VRPSG) algorithm, which can efficiently solve a class of constrained optimization problems. As the main technical contribution of this paper, we show that the proposed VRPSG algorithm achieves a linear convergence rate without the strong convexity assumption. To the best of our knowledge, this is the first work that establishes the linear convergence rate for a variance-reduced stochastic gradient algorithm without strong convexity.

1 Introduction

Convex optimization has played an important role in machine learning, as many machine learning problems can be cast as convex optimization problems. Nowadays the emergence of big data makes such optimization problems challenging to solve, and first-order stochastic gradient algorithms are often preferred due to their simplicity and low per-iteration cost. Stochastic gradient algorithms compute the gradient based on only one sample or just a few samples, and have been extensively studied in large-scale optimization problems [29, 4, 9, 26, 6, 18, 12, 21]. In general, the standard stochastic gradient algorithm randomly draws only one sample (or just a few samples) at each iteration to compute the gradient and then updates the model parameter. The standard stochastic gradient algorithm computes the gradient without involving all samples, and the computational cost per iteration is independent of the sample size. Thus, it is very suitable for large-scale problems. However, standard stochastic gradient algorithms usually suffer from slow convergence. In particular, even under the strongly convex condition, the convergence rates of standard stochastic gradient algorithms are only sub-linear. In contrast, it is well-known that full (proximal) gradient descent algorithms achieve linear convergence rates under the strongly convex condition [16]. It has been recognized that the slow convergence of the standard stochastic gradient algorithm results from the inherent variance in the gradient evaluation. To this end, some (implicit or explicit) variance-reduced stochastic gradient algorithms have been proposed; examples include Stochastic Average Gradient (SAG) [13], Stochastic Dual Coordinate Ascent (SDCA) [19, 20], Epoch Mixed Gradient Descent (EMGD) [28], Stochastic Variance Reduced Gradient (SVRG) [10], Semi-Stochastic Gradient Descent (S2GD) [11] and Proximal Stochastic Variance Reduced Gradient (Prox-SVRG) [27]. Under the strongly convex condition, these variance-reduced stochastic gradient algorithms achieve linear convergence rates. However, in practical problems, many objective functions to be minimized are convex but not strongly convex. For example, in machine learning, the least squares regression and logistic regression problems are extensively studied, and both objective functions are not strongly convex when the dimension $d$ is larger than the sample size $n$. Moreover, even without the strongly convex condition, linear convergence rates have been proved for some full (proximal) gradient descent algorithms [24, 23, 8]. This inspires us to address the following question: can a variance-reduced stochastic gradient algorithm achieve a linear convergence rate without strong convexity?

In this paper, we give an affirmative answer to this question. Specifically, inspired by the variance-reduction techniques adopted in SVRG [10] and Prox-SVRG [27], we propose a Variance-Reduced Projected Stochastic Gradient (VRPSG) algorithm to solve a class of constrained optimization problems. In particular, we establish a linear convergence rate for the proposed VRPSG algorithm without strong convexity. One challenge in proving the linear convergence rate for the proposed VRPSG algorithm without the strongly convex condition is that the optimization problem may have an optimal solution set containing an infinite number of optimal solutions. Although we can establish a recursive relationship between the distance of the current feasible solution to some fixed optimal solution and the distance of the previous feasible solution to the same optimal solution, it is still very difficult to establish the linear convergence rate without strong convexity. We address this problem by establishing the recursive relationship between the distance of the current feasible solution to the optimal solution set and the distance of the previous feasible solution to the optimal solution set. Another challenge in proving the linear convergence rate for the proposed VRPSG algorithm is how to upper bound the distance of any feasible solution to the optimal solution set by the gap between the objective function value at the feasible solution and the optimal objective function value. This upper bound can be easily established under the condition that the objective function is strongly convex. However, without the strongly convex condition, it is not trivial to obtain such an upper bound. In this paper, by making suitable assumptions but without strong convexity, we address this problem by adopting Hoffman's bound [7, 14, 25]. To the best of our knowledge, our work establishes the first linear convergence rate for a variance-reduced stochastic gradient algorithm without strong convexity.

∗Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA ([email protected])
†Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA ([email protected])

2 VRPSG: Variance-Reduced Projected Stochastic Gradient

We first introduce the general optimization problem, mild assumptions about the problem, and some examples that satisfy the assumptions. Then we present the proposed Variance-Reduced Projected Stochastic Gradient (VRPSG) algorithm to solve the optimization problem. Finally, we present a detailed convergence analysis for the VRPSG algorithm.

2.1 Optimization Problems, Assumptions and Examples

We consider the following constrained optimization problem:
$$\min_{w\in\mathcal{W}} \left\{ f(w) = h(Xw) \right\}, \quad \text{where } w \in \mathbb{R}^d,\ X \in \mathbb{R}^{n\times d}, \tag{1}$$
and make the following assumptions on the above problem throughout the paper:

A1 $f(w)$ is the average of $n$ convex components $f_i(w)$, that is,
$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w),$$
where $\nabla f(w)$ and $\nabla f_i(w)$ are Lipschitz continuous with constants $L$ and $L_i$, respectively.

A2 The effective domain of $h$, denoted by $\mathrm{dom}(h)$, is open and non-empty. Moreover, $h(u)$ is continuously differentiable on $\mathrm{dom}(h)$ and strongly convex on any convex compact subset of $\mathrm{dom}(h)$.

A3 The constraint set, denoted by $\mathcal{W} = \{w \in \mathbb{R}^d : Cw \le b\}$, $C \in \mathbb{R}^{l\times d}$, $b \in \mathbb{R}^l$, is a polyhedral set which is compact.

Remark 1 Assumption A2 is the same as Assumption 2.1 in [8], which indicates that $h(u)$ may not be strongly convex on $\mathrm{dom}(h)$ but is strictly convex on $\mathrm{dom}(h)$. According to Weierstrass's Theorem (Proposition A.8 in [2]), assumption A3 implies that the optimal solution set of the optimization problem in Eq. (1), denoted by $\mathcal{W}^\star$, is non-empty. Notice that $f(\cdot)$ is convex, so $\mathcal{W}^\star$ must be convex and the Euclidean projection of any $w \in \mathbb{R}^d$ onto $\mathcal{W}^\star$ must be unique. Moreover, for any $w, u \in \mathcal{W}$, $Xw$ and $Xu$ must belong to a convex compact subset $\mathcal{U} \subseteq \mathrm{dom}(h)$. Thus, considering assumption A2, there exists a constant $\mu > 0$ such that
$$h(Xw) \ge h(Xu) + \nabla h(Xu)^T(Xw - Xu) + \frac{\mu}{2}\|Xw - Xu\|^2, \quad \forall w, u \in \mathcal{W}.$$

There are many examples that satisfy assumptions A1-A3, including two popular problems: the $\ell_1$-constrained least squares (i.e., Lasso [22]) and the $\ell_1$-constrained logistic regression. Specifically, for the $\ell_1$-constrained least squares: the objective function is $f(w) = \frac{1}{2n}\|Xw - y\|^2$; the convex component is $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$, where $x_i^T$ is the $i$-th row of $X$; the strongly convex function is $h(u) = \frac{1}{2n}\|u - y\|^2$; the polyhedral set $\mathcal{W} = \{w : \|w\|_1 \le \tau\} = \{w : Cw \le b\}$ is compact, where each row of $C \in \mathbb{R}^{2^d\times d}$ is a $d$-tuple of the form $[\pm 1, \cdots, \pm 1]$, and each entry of $b \in \mathbb{R}^{2^d}$ is $\tau$. For the $\ell_1$-constrained logistic regression: the objective function is $f(w) = \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i x_i^T w))$; the convex component is $f_i(w) = \log(1+\exp(-y_i x_i^T w))$, where $X = [x_1^T; \cdots; x_n^T]$; the strongly convex function¹ is $h(u) = \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i u_i))$; the polyhedral set is the same as for the $\ell_1$-constrained least squares. Additional constraint sets that satisfy assumption A3 include the box constraint set $\mathcal{W} = \{w : l_i \le w_i \le u_i\}$ with $-\infty < l_i \le u_i < +\infty$ ($i = 1, \cdots, d$) and the $\ell_{1,\infty}$-ball set $\mathcal{W} = \{w : \sum_{i=1}^T \|w_{G_i}\|_\infty \le \tau\}$ with $\cup_{i=1}^T G_i = \{1, \cdots, d\}$ and $G_i \cap G_j = \emptyset$ for $i \ne j$ [17].

¹The function $h(u) = \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i u_i))$ is strictly convex on $\mathbb{R}^n$ and strongly convex on any convex compact subset of $\mathbb{R}^n$.
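To make assumption A1 concrete, the following minimal sketch (our own illustration with synthetic data; all function and variable names are ours, not the paper's) instantiates the components $f_i$, their gradients and the Lipschitz constants $L_i$ for the $\ell_1$-constrained least squares example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def f(w):
    # Full objective f(w) = h(Xw) with h(u) = (1/2n)||u - y||^2.
    return np.sum((X @ w - y) ** 2) / (2 * n)

def f_i(w, i):
    # Convex component f_i(w) = (1/2)(x_i^T w - y_i)^2.
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_f_i(w, i):
    # Gradient of the i-th component: (x_i^T w - y_i) x_i.
    return (X[i] @ w - y[i]) * X[i]

# The Hessian of f_i is x_i x_i^T, so L_i = ||x_i||^2 is a Lipschitz
# constant of grad f_i.
L_i = np.sum(X ** 2, axis=1)

# Sanity check of A1: f is the average of its n components.
w = rng.standard_normal(d)
assert np.isclose(f(w), np.mean([f_i(w, i) for i in range(n)]))
```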

2.2 Algorithm and Main Result

A standard method for solving Eq. (1) is projected gradient descent, which generates the sequence $\{w^k\}$ via
$$w^k = \Pi_{\mathcal{W}}\left(w^{k-1} - \eta\nabla f(w^{k-1})\right) = \arg\min_{w\in\mathcal{W}}\frac{1}{2}\left\|w - \left(w^{k-1} - \eta\nabla f(w^{k-1})\right)\right\|^2. \tag{2}$$
Assuming that $\eta < 1/L$ and assumptions A1-A3 hold, the objective function sequence $\{f(w^k)\}$ generated by Eq. (2) has a linear convergence rate [25]. At each iteration of projected gradient descent, a full gradient involving all samples is required. Thus, the computational burden per iteration is heavy when the sample size $n$ is large. To reduce the per-iteration cost, the projected stochastic gradient algorithm can be adopted to generate the sequence $\{w^k\}$ as follows:
$$w^k = \Pi_{\mathcal{W}}\left(w^{k-1} - \eta_k\nabla f_{i_k}(w^{k-1})\right), \tag{3}$$
where $i_k$ is drawn uniformly at random from $\{1,\cdots,n\}$. At each iteration, the projected stochastic gradient algorithm computes the gradient involving only a single sample and thus is suitable for large-scale problems with large $n$. Although we have an unbiased gradient estimate at each step, i.e., $\mathbb{E}[\nabla f_{i_k}(w^{k-1})] = \nabla f(w^{k-1})$, the variance $\mathbb{E}[\|\nabla f_{i_k}(w^{k-1}) - \nabla f(w^{k-1})\|^2]$ introduced by sampling forces the step size $\eta_k$ to diminish in order to guarantee convergence, which finally results in slow convergence. Therefore, the key to improving the convergence rate of the projected stochastic gradient algorithm is to reduce the variance introduced by sampling. Inspired by the variance-reduction techniques in [10, 27] and the linear convergence result of full gradient algorithms without strong convexity [25], we propose a Variance-Reduced Projected Stochastic Gradient (VRPSG) algorithm (Algorithm 1) to efficiently solve Eq. (1). Specifically, we construct an unbiased gradient estimate whose variance is reduced in a multi-stage manner. To see how the unbiased gradient estimate is constructed and how the variance is reduced, please refer to Algorithm 1, Lemma 2 and Remark 2. The main technical contribution of this paper lies in the linear convergence rate analysis for the proposed VRPSG algorithm without strong convexity (summarized in Theorem 1). Note that strong convexity is required by all other variance-reduced stochastic gradient algorithms to achieve linear convergence rates.

Algorithm 1: VRPSG: Variance-Reduced Projected Stochastic Gradient
1: Choose the update frequency $m$ and the learning rate $\eta$;
2: Initialize $\tilde{w}^0 \in \mathcal{W}$;
3: Choose $p_i \in (0,1)$ for $i \in \{1,\cdots,n\}$ such that $\sum_{i=1}^n p_i = 1$;
4: for $k = 1, 2, \cdots$ do
5:     $\tilde{\xi}^{k-1} = \nabla f(\tilde{w}^{k-1})$;
6:     $w_0^k = \tilde{w}^{k-1}$;
7:     for $t = 1, 2, \cdots, m$ do
8:         Randomly pick $i_t^k \in \{1,\cdots,n\}$ according to the probability $P = \{p_1,\cdots,p_n\}$;
9:         $v_t^k = \left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right)/(np_{i_t^k}) + \tilde{\xi}^{k-1}$;
10:        $w_t^k = \Pi_{\mathcal{W}}(w_{t-1}^k - \eta v_t^k) = \arg\min_{w\in\mathcal{W}}\frac{1}{2}\|w - (w_{t-1}^k - \eta v_t^k)\|^2$;
11:    end
12:    $\tilde{w}^k = \frac{1}{m}\sum_{t=1}^m w_t^k$;
13: end

Theorem 1 Let $w^\star \in \mathcal{W}^\star$ be any optimal solution to Eq. (1), $f^\star = f(w^\star)$ be the optimal objective function value in Eq. (1) and $L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$ with $p_i \in (0,1)$, $\sum_{i=1}^n p_i = 1$. In addition, let $0 < \eta < 1/(4L_P)$ and $m$ be sufficiently large such that
$$\rho = \frac{4L_P\eta(m+1)}{(1-4L_P\eta)m} + \frac{\beta}{\mu\eta(1-4L_P\eta)m} < 1, \tag{4}$$
where $\mu > 0$, $\beta > 0$ are constants. Then under assumptions A1-A3, the VRPSG algorithm (summarized in Algorithm 1) achieves a linear convergence rate in expectation:
$$\mathbb{E}_{\mathcal{F}_m^k}\left[f(\tilde{w}^k) - f^\star\right] \le \rho^k\left(f(\tilde{w}^0) - f^\star\right),$$
where $\tilde{w}^k$ is defined in Algorithm 1 and $\mathbb{E}_{\mathcal{F}_m^k}[\cdot]$ denotes the expectation with respect to the random variable $\mathcal{F}_m^k$, with $\mathcal{F}_t^k$ ($1 \le t \le m$) being defined as
$$\mathcal{F}_t^k = \{i_1^1, \cdots, i_m^1, i_1^2, \cdots, i_m^2, \cdots, i_1^{k-1}, \cdots, i_m^{k-1}, i_1^k, \cdots, i_t^k\},$$
and $\mathcal{F}_0^k = \mathcal{F}_m^{k-1}$, where $i_t^k$ is the sampling random variable in Algorithm 1.

We have the following remarks on the convergence result above:

• The linear convergence rate $\rho$ in Eq. (4) is similar to that of Prox-SVRG [27]. The difference is that an additional constant $\beta > 0$ is introduced in the numerator of the second term, which is needed since our proposed algorithm does not require the strongly convex condition.

• Let $\eta = \gamma/L_P$ with $0 < \gamma < 1/4$. When $m$ is sufficiently large, we have
$$\rho \approx \frac{4\gamma}{1-4\gamma} + \frac{\beta L_P/\mu}{\gamma(1-4\gamma)m},$$
where $\beta L_P/\mu$ can be treated as a pseudo condition number of the problem in Eq. (1). If we choose $\gamma = 0.1$ and $m = 100\beta L_P/\mu$, then $\rho \approx 5/6$. Notice that at each outer iteration of Algorithm 1, $n + 2m$ gradient evaluations are required (computing the gradient on a single sample counts as one gradient evaluation). Thus, to obtain an $\epsilon$-accuracy solution (i.e., $\mathbb{E}_{\mathcal{F}_m^k}[f(\tilde{w}^k)] - f^\star \le \epsilon$), we need $O(n + \beta L_P/\mu)\log(1/\epsilon)$ gradient evaluations by setting $m = \Theta(\beta L_P/\mu)$. In particular, the complexity becomes $O(n + \beta L_{avg}/\mu)\log(1/\epsilon)$ if we choose $p_i = L_i/\sum_{i=1}^n L_i$ for all $i \in \{1,\cdots,n\}$, and $O(n + \beta L_{max}/\mu)\log(1/\epsilon)$ if we choose $p_i = 1/n$ for all $i \in \{1,\cdots,n\}$, where $L_{avg} = \sum_{i=1}^n L_i/n$ and $L_{max} = \max_{i\in\{1,\cdots,n\}} L_i$. Notice that $L_{avg} \le L_{max}$. Thus, sampling in proportion to the Lipschitz constants is better than sampling uniformly.

• At each outer iteration of VRPSG, the number of gradient evaluations is similar to that of full gradient methods. However, the overall complexity of VRPSG is superior to that of full gradient methods. Specifically, based on the last remark and Remark 3, if $f$ is strongly convex with parameter $\tilde{\mu}$ and $p_i = L_i/\sum_{i=1}^n L_i$, VRPSG has the same complexity as Prox-SVRG [27], that is, VRPSG needs $O(n + L_{avg}/\tilde{\mu})\log(1/\epsilon)$ gradient evaluations to obtain an $\epsilon$-accuracy solution. In contrast, full gradient methods require $O(nL/\tilde{\mu})\log(1/\epsilon)$ gradient evaluations to obtain a solution of the same accuracy. Obviously, the complexity $O(n + L_{avg}/\tilde{\mu})\log(1/\epsilon)$ is far superior to $O(nL/\tilde{\mu})\log(1/\epsilon)$ when the sample size $n$ and the condition number $L/\tilde{\mu}$ are very large.

• If the Lipschitz constant $L_i$ is unknown and difficult to compute, we can use an upper bound $\hat{L}_i$ instead of $L_i$ to define $L_P = \max_{i\in\{1,\cdots,n\}}[\hat{L}_i/(np_i)]$ and the theorem still holds.

• We can obtain a convergence rate with high probability. According to Markov's inequality with $f(\tilde{w}^k) - f^\star \ge 0$, Theorem 1 implies that
$$\Pr\left(f(\tilde{w}^k) - f^\star \ge \epsilon\right) \le \frac{\mathbb{E}_{\mathcal{F}_m^k}\left[f(\tilde{w}^k) - f^\star\right]}{\epsilon} \le \frac{\rho^k\left(f(\tilde{w}^0) - f^\star\right)}{\epsilon}.$$
Therefore, we have $\Pr(f(\tilde{w}^k) - f^\star \le \epsilon) \ge 1 - \delta$ if $k \ge \log\left(\frac{f(\tilde{w}^0) - f^\star}{\delta\epsilon}\right)/\log(1/\rho)$.
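As a quick numerical check of the second remark (our own illustration; the value chosen for the pseudo condition number is arbitrary), evaluating the exact rate in Eq. (4) at $\gamma = 0.1$ and $m = 100\beta L_P/\mu$ reproduces $\rho \approx 5/6$:

```python
# Check rho in Eq. (4) for gamma = 0.1 and m = 100 * beta * L_P / mu.
# The value of beta * L_P / mu is an arbitrary illustrative choice.
L_P = 1.0
pseudo_cond = 50.0                  # stands in for beta * L_P / mu
beta_over_mu = pseudo_cond / L_P    # = beta / mu since L_P = 1

gamma = 0.1
eta = gamma / L_P
m = int(100 * pseudo_cond)

rho = (4 * L_P * eta * (m + 1)) / ((1 - 4 * L_P * eta) * m) \
      + beta_over_mu / (eta * (1 - 4 * L_P * eta) * m)
print(rho)  # ~0.8335, close to 5/6 = 0.8333...
```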

3 Technical Proof

In this section, we first provide several fundamental lemmas, based on which we complete the proof of Theorem 1. The key idea of the convergence proofs in [10, 27] is to establish a recursive relationship between the distance of the current feasible solution to a unique optimal solution and the distance of the previous feasible solution to the same optimal solution. Different from [10, 27], we prove the linear convergence rate by establishing the recursive relationship between the distance of the current feasible solution to the optimal solution set and the distance of the previous feasible solution to the optimal solution set, due to the lack of strong convexity. Note that Lemmas 1, 2, 3 are established for constrained optimization problems and are adapted from Lemmas 1, 3 and Corollary 3 for regularized optimization problems in [27]. Lemma 4 establishes an upper bound on the distance of any feasible solution to the optimal solution set in terms of the gap between the objective function value at the feasible solution and the optimal objective function value, which is a key to establishing the linear convergence rate for the proposed VRPSG algorithm. It is well-known that the bound in Lemma 4 holds under the strongly convex condition. However, without the strongly convex condition, it is non-trivial to obtain this bound. To address this problem, we make suitable assumptions but without strong convexity to establish this inequality by adopting Hoffman's bound [25]. Note that although Lemmas 1, 2, 3 for constrained optimization problems can be adapted from regularized optimization problems, Lemma 4 may not be easily extended to regularized optimization problems. Besides, Lemma 4 may not be easily extended to non-polyhedral constrained optimization problems (please refer to Section 4 for more details).

The following lemma establishes a relation between the difference of gradients on components and the difference of objective functions.

Lemma 1 Let $w^\star \in \mathcal{W}^\star$ be any optimal solution to the problem in Eq. (1), $f^\star = f(w^\star)$ be the optimal objective function value in Eq. (1) and $L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$ with $p_i \in (0,1)$, $\sum_{i=1}^n p_i = 1$. Then under assumptions A1-A3, for all $w \in \mathcal{W}$, we have
$$\frac{1}{n}\sum_{i=1}^n \frac{1}{np_i}\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2 \le 2L_P\left[f(w) - f^\star\right].$$

Proof For any $i \in \{1,\cdots,n\}$, we consider the following function
$$\phi_i(w) = f_i(w) - f_i(w^\star) - \nabla f_i(w^\star)^T(w - w^\star).$$
It follows from the convexity of $\phi_i(w)$ and $\nabla\phi_i(w^\star) = 0$ that $\min_{w\in\mathbb{R}^d}\phi_i(w) = \phi_i(w^\star) = 0$. Recalling that $\nabla\phi_i(w) = \nabla f_i(w) - \nabla f_i(w^\star)$ is $L_i$-Lipschitz continuous, we have for all $w \in \mathcal{W}$:
$$0 = \phi_i(w^\star) \le \min_{\eta\in\mathbb{R}}\phi_i\left(w - \eta\nabla\phi_i(w)\right) \le \min_{\eta\in\mathbb{R}}\left[\phi_i(w) - \eta\left\|\nabla\phi_i(w)\right\|^2 + \frac{L_i\eta^2}{2}\left\|\nabla\phi_i(w)\right\|^2\right] = \phi_i(w) - \frac{1}{2L_i}\left\|\nabla\phi_i(w)\right\|^2 = \phi_i(w) - \frac{1}{2L_i}\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2,$$
which implies for all $w \in \mathcal{W}$:
$$\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2 \le 2L_i\phi_i(w) = 2L_i\left(f_i(w) - f_i(w^\star) - \nabla f_i(w^\star)^T(w - w^\star)\right).$$
Dividing the above inequality by $n^2 p_i$ and summing over $i = 1, \cdots, n$, we have
$$\frac{1}{n}\sum_{i=1}^n \frac{1}{np_i}\left\|\nabla f_i(w) - \nabla f_i(w^\star)\right\|^2 \le 2L_P\left(f(w) - f(w^\star) - \nabla f(w^\star)^T(w - w^\star)\right), \tag{5}$$
where we use $L_{avg} = \sum_{i=1}^n L_i/n \le L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$ (see Lemma 5 in the Appendix) and $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$. Recalling that $w^\star \in \mathcal{W}^\star$ is an optimal solution to Eq. (1) and $w \in \mathcal{W}$, it follows from the optimality condition of Eq. (1) that $\nabla f(w^\star)^T(w - w^\star) \ge 0$, which together with Eq. (5) and $f^\star = f(w^\star)$ immediately proves the lemma. □

Based on Lemma 1, we bound the variance of $v_t^k$ in terms of the difference of objective functions.

Lemma 2 Let $w^\star \in \mathcal{W}^\star$ be any optimal solution to the problem in Eq. (1) and $f^\star = f(w^\star)$ be the optimal objective function value in Eq. (1). Then under assumptions A1-A3, we have
$$\mathbb{E}_{\mathcal{F}_t^k}\left[v_t^k \mid \mathcal{F}_{t-1}^k\right] = \nabla f(w_{t-1}^k), \tag{6}$$
$$\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|v_t^k - \nabla f(w_{t-1}^k)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] \le 4L_P\left[f(w_{t-1}^k) - f^\star + f(\tilde{w}^{k-1}) - f^\star\right], \tag{7}$$
where $\mathcal{F}_t^k$ is defined in Theorem 1; $v_t^k$, $w_{t-1}^k$, $\tilde{w}^{k-1}$ are defined in Algorithm 1; $L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$.

Proof Taking expectation with respect to $\mathcal{F}_t^k$ conditioned on $\mathcal{F}_{t-1}^k$ and noticing that $\mathcal{F}_t^k = \mathcal{F}_{t-1}^k\cup\{i_t^k\}$, we have
$$\mathbb{E}_{\mathcal{F}_t^k}\left[\frac{1}{np_{i_t^k}}\nabla f_{i_t^k}(w_{t-1}^k) \mid \mathcal{F}_{t-1}^k\right] = \sum_{i=1}^n \frac{p_i}{np_i}\nabla f_i(w_{t-1}^k) = \nabla f(w_{t-1}^k),$$
$$\mathbb{E}_{\mathcal{F}_t^k}\left[\frac{1}{np_{i_t^k}}\nabla f_{i_t^k}(\tilde{w}^{k-1}) \mid \mathcal{F}_{t-1}^k\right] = \sum_{i=1}^n \frac{p_i}{np_i}\nabla f_i(\tilde{w}^{k-1}) = \nabla f(\tilde{w}^{k-1}).$$
It follows that
$$\mathbb{E}_{\mathcal{F}_t^k}\left[v_t^k \mid \mathcal{F}_{t-1}^k\right] = \mathbb{E}_{\mathcal{F}_t^k}\left[\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right) + \nabla f(\tilde{w}^{k-1}) \mid \mathcal{F}_{t-1}^k\right] = \nabla f(w_{t-1}^k).$$
We next prove Eq. (7) as follows:
$$\begin{aligned}
&\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|v_t^k - \nabla f(w_{t-1}^k)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] \\
&= \mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right) + \nabla f(\tilde{w}^{k-1}) - \nabla f(w_{t-1}^k)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] \\
&= \mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] - \left\|\nabla f(w_{t-1}^k) - \nabla f(\tilde{w}^{k-1})\right\|^2 \\
&\le \mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] \\
&\le 2\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(w^\star)\right)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] + 2\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(\tilde{w}^{k-1}) - \nabla f_{i_t^k}(w^\star)\right)\right\|^2 \mid \mathcal{F}_{t-1}^k\right] \\
&= 2\sum_{i=1}^n \frac{p_i}{(np_i)^2}\left\|\nabla f_i(w_{t-1}^k) - \nabla f_i(w^\star)\right\|^2 + 2\sum_{i=1}^n \frac{p_i}{(np_i)^2}\left\|\nabla f_i(\tilde{w}^{k-1}) - \nabla f_i(w^\star)\right\|^2 \\
&\le 4L_P\left[f(w_{t-1}^k) - f(w^\star) + f(\tilde{w}^{k-1}) - f(w^\star)\right] \\
&= 4L_P\left[f(w_{t-1}^k) - f^\star + f(\tilde{w}^{k-1}) - f^\star\right],
\end{aligned}$$
where the second equality is due to $\mathbb{E}_{\mathcal{F}_t^k}\left[\frac{1}{np_{i_t^k}}\left(\nabla f_{i_t^k}(w_{t-1}^k) - \nabla f_{i_t^k}(\tilde{w}^{k-1})\right) \mid \mathcal{F}_{t-1}^k\right] = \nabla f(w_{t-1}^k) - \nabla f(\tilde{w}^{k-1})$ and $\mathbb{E}[\|\xi - \mathbb{E}[\xi]\|^2] = \mathbb{E}[\|\xi\|^2] - \|\mathbb{E}[\xi]\|^2$ for any random vector $\xi \in \mathbb{R}^d$; the second inequality is due to $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$; the third inequality is due to Lemma 1 with $w_{t-1}^k, \tilde{w}^{k-1} \in \mathcal{W}$, where $w_{t-1}^k \in \mathcal{W}$ is obvious and $\tilde{w}^{k-1} \in \mathcal{W}$ follows from the fact that $\tilde{w}^{k-1}$ is a convex combination of vectors in the convex set $\mathcal{W}$. □

Remark 2 Eq. (6) implies that $v_t^k$ is an unbiased estimate of $\nabla f(w_{t-1}^k)$. To see that the variance is reduced, we notice that, according to Eq. (7), the variance $\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|v_t^k - \nabla f(w_{t-1}^k)\right\|^2 \mid \mathcal{F}_{t-1}^k\right]$ approaches zero when both $\tilde{w}^{k-1}$ and $w_{t-1}^k$ converge to any optimal solution $w^\star$.
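Lemma 2 and Remark 2 are easy to check empirically. The sketch below (our own illustration, reusing the least squares setup from Section 2.1 with uniform sampling) estimates the mean and variance of $v_t^k$ by Monte Carlo: the mean matches the full gradient $\nabla f(w_{t-1}^k)$ up to sampling noise (Eq. (6)), and the variance decays as the iterate and the snapshot approach an optimal solution (Remark 2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

grad_f = lambda w: X.T @ (X @ w - y) / n           # full gradient
grad_f_i = lambda w, i: (X[i] @ w - y[i]) * X[i]   # component gradient
p = np.full(n, 1.0 / n)                            # uniform sampling

w_star = np.linalg.lstsq(X, y, rcond=None)[0]      # an (unconstrained) optimum

def mc_mean_var(w, w_tilde, num_samples=50_000):
    # Monte Carlo mean and variance of the VRPSG estimate
    # v = (grad_f_i(w) - grad_f_i(w_tilde)) / (n p_i) + grad_f(w_tilde).
    xi = grad_f(w_tilde)
    idx = rng.choice(n, size=num_samples, p=p)
    vs = np.stack([(grad_f_i(w, i) - grad_f_i(w_tilde, i)) / (n * p[i]) + xi
                   for i in idx])
    return vs.mean(axis=0), np.mean(np.sum((vs - grad_f(w)) ** 2, axis=1))

offset = rng.standard_normal(d)
for shrink in [1.0, 0.1, 0.01]:
    w = w_star + shrink * offset                # iterate near the optimum
    w_tilde = w_star + 0.5 * shrink * offset    # snapshot near the optimum
    mean_v, var_v = mc_mean_var(w, w_tilde)
    # mean error is pure Monte Carlo noise; the variance decays ~ shrink^2
    print(shrink, np.linalg.norm(mean_v - grad_f(w)), var_v)
```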

The following lemma presents a bound independent of the algorithm. The terms on the left-hand side of the bound will appear in the proof of Theorem 1.

Lemma 3 Let $w^\star \in \mathcal{W}^\star$ be any optimal solution to the problem in Eq. (1), $f^\star = f(w^\star)$ be the optimal objective function value in Eq. (1), $\delta_t^k = \nabla f(w_{t-1}^k) - v_t^k$, $g_t^k = (w_{t-1}^k - w_t^k)/\eta$ and $0 < \eta \le 1/L$. Then we have
$$\left(w^\star - w_{t-1}^k\right)^T g_t^k + \frac{\eta}{2}\left\|g_t^k\right\|^2 \le f^\star - f(w_t^k) - \left(w^\star - w_t^k\right)^T\delta_t^k.$$

Proof We know that $w^\star \in \mathcal{W}^\star \subseteq \mathcal{W}$. Thus, by the optimality condition of $w_t^k = \Pi_{\mathcal{W}}(w_{t-1}^k - \eta v_t^k) = \arg\min_{w\in\mathcal{W}}\frac{1}{2}\|w - (w_{t-1}^k - \eta v_t^k)\|^2$, we have
$$\left(w_t^k - w_{t-1}^k + \eta v_t^k\right)^T\left(w^\star - w_t^k\right) \ge 0,$$
which together with $g_t^k = (w_{t-1}^k - w_t^k)/\eta$ implies that
$$\left(w^\star - w_t^k\right)^T v_t^k \ge \left(w^\star - w_t^k\right)^T g_t^k. \tag{8}$$
By the convexity of $f(\cdot)$, we have
$$f(w^\star) \ge f(w_{t-1}^k) + \nabla f(w_{t-1}^k)^T\left(w^\star - w_{t-1}^k\right). \tag{9}$$
Recalling that $f(\cdot)$ has an $L$-Lipschitz continuous gradient, we have
$$f(w_{t-1}^k) \ge f(w_t^k) - \nabla f(w_{t-1}^k)^T\left(w_t^k - w_{t-1}^k\right) - \frac{L}{2}\left\|w_t^k - w_{t-1}^k\right\|^2,$$
which together with Eq. (9) implies that
$$\begin{aligned}
f(w^\star) &\ge f(w_t^k) - \nabla f(w_{t-1}^k)^T\left(w_t^k - w_{t-1}^k\right) - \frac{L}{2}\left\|w_t^k - w_{t-1}^k\right\|^2 + \nabla f(w_{t-1}^k)^T\left(w^\star - w_{t-1}^k\right) \\
&= f(w_t^k) + \nabla f(w_{t-1}^k)^T\left(w^\star - w_t^k\right) - \frac{L\eta^2}{2}\left\|g_t^k\right\|^2 \\
&= f(w_t^k) + \left(w^\star - w_t^k\right)^T\delta_t^k + \left(w^\star - w_t^k\right)^T v_t^k - \frac{L\eta^2}{2}\left\|g_t^k\right\|^2 \\
&\ge f(w_t^k) + \left(w^\star - w_t^k\right)^T\delta_t^k + \left(w^\star - w_t^k\right)^T g_t^k - \frac{L\eta^2}{2}\left\|g_t^k\right\|^2 \\
&= f(w_t^k) + \left(w^\star - w_t^k\right)^T\delta_t^k + \left(w^\star - w_{t-1}^k + w_{t-1}^k - w_t^k\right)^T g_t^k - \frac{L\eta^2}{2}\left\|g_t^k\right\|^2 \\
&= f(w_t^k) + \left(w^\star - w_t^k\right)^T\delta_t^k + \left(w^\star - w_{t-1}^k\right)^T g_t^k + \frac{\eta}{2}(2 - L\eta)\left\|g_t^k\right\|^2 \\
&\ge f(w_t^k) + \left(w^\star - w_t^k\right)^T\delta_t^k + \left(w^\star - w_{t-1}^k\right)^T g_t^k + \frac{\eta}{2}\left\|g_t^k\right\|^2,
\end{aligned}$$
where the first and fourth equalities are due to $g_t^k = (w_{t-1}^k - w_t^k)/\eta$; the second equality is due to $\delta_t^k = \nabla f(w_{t-1}^k) - v_t^k$; the second inequality is due to Eq. (8); the last inequality is due to $0 < \eta \le 1/L$. Rearranging the above inequality by noticing that $f^\star = f(w^\star)$, we prove the lemma. □

The following lemma presents an upper bound on the distance of any feasible solution to the optimal solution set in terms of the gap between the objective function value at the feasible solution and the optimal objective function value, which is the key to establishing the linear convergence without strong convexity.

Lemma 4 Let $w \in \mathcal{W} = \{w : Cw \le b\}$, $\bar{w} = \Pi_{\mathcal{W}^\star}(w)$ and $f^\star$ be the optimal objective function value in Eq. (1). Then under assumptions A1-A3, there exist constants $\mu > 0$ and $\beta > 0$ such that
$$f(w) - f^\star \ge \frac{\mu}{2\beta}\left\|w - \bar{w}\right\|^2, \quad \forall w \in \mathcal{W}.$$

Proof If $w \in \mathcal{W}^\star$, then $\bar{w} = w$ and the inequality holds for any constants $\mu > 0$ and $\beta > 0$. We next prove the inequality for $w \in \mathcal{W}$, $w \notin \mathcal{W}^\star$. According to Lemma 6 in the Appendix, we know that there exists a unique $r^\star$ such that $\mathcal{W}^\star = \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$, which is non-empty. For any $w \in \mathcal{W} = \{w : Cw \le b\}$, the Euclidean projection of $Cw - b$ onto the non-negative orthant, denoted by $[Cw - b]_+$, is $0$. Considering Hoffman's bound in Lemma 7, for $w \in \mathcal{W} = \{w : Cw \le b\}$, there exist a $w^\star \in \mathcal{W}^\star$ and a constant $\theta > 0$ such that
$$\|w - w^\star\| \le \theta\|Xw - r^\star\|.$$
Noticing that $\bar{w} = \Pi_{\mathcal{W}^\star}(w)$, we have $\|w - \bar{w}\| \le \|w - w^\star\|$ and $X\bar{w} = r^\star$. Thus,
$$\left\|Xw - X\bar{w}\right\|^2 = \left\|Xw - r^\star\right\|^2 \ge \frac{1}{\beta}\left\|w - w^\star\right\|^2 \ge \frac{1}{\beta}\left\|w - \bar{w}\right\|^2, \tag{10}$$
where $\beta = \theta^2$. By assumption A3, we know that $\mathcal{W}$ is compact. Thus, for any $w \in \mathcal{W}$, both $Xw$ and $X\bar{w}$ belong to some convex compact subset $\mathcal{U} \subseteq \mathbb{R}^n$. By the strong convexity of $h(\cdot)$ on the subset $\mathcal{U}$, there exists a constant $\mu > 0$ such that
$$h(Xw) - h(X\bar{w}) \ge \nabla h(X\bar{w})^T\left(Xw - X\bar{w}\right) + \frac{\mu}{2}\left\|Xw - X\bar{w}\right\|^2,$$
which together with $f(w) = h(Xw)$ implies that
$$f(w) - f(\bar{w}) \ge \nabla f(\bar{w})^T(w - \bar{w}) + \frac{\mu}{2}\left\|Xw - X\bar{w}\right\|^2. \tag{11}$$
Noticing that $w \in \mathcal{W}$ and $\bar{w} \in \mathcal{W}^\star$, we have
$$\nabla f(\bar{w})^T(w - \bar{w}) \ge 0,$$
which together with Eqs. (10), (11) proves the lemma. □

We are now ready to complete the proof of Theorem 1.

Proof of Theorem 1 Different from the convergence proofs in [10, 27], we begin the proof by establishing the recursive relationship between the distance of the current feasible solution to the optimal solution set and the distance of the previous feasible solution to the optimal solution set. Let $\bar{w}_t^k = \Pi_{\mathcal{W}^\star}(w_t^k)$ for all $k, t \ge 0$. Then we have $\bar{w}_{t-1}^k \in \mathcal{W}^\star$, which together with the definition of $\bar{w}_t^k$ and $g_t^k = (w_{t-1}^k - w_t^k)/\eta$ implies that
$$\begin{aligned}
\left\|w_t^k - \bar{w}_t^k\right\|^2 &\le \left\|w_t^k - \bar{w}_{t-1}^k\right\|^2 = \left\|w_{t-1}^k - \eta g_t^k - \bar{w}_{t-1}^k\right\|^2 \\
&= \left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2 + 2\eta\left(\bar{w}_{t-1}^k - w_{t-1}^k\right)^T g_t^k + \eta^2\left\|g_t^k\right\|^2 \\
&\le \left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2 + 2\eta\left(f^\star - f(w_t^k) - \left(\bar{w}_{t-1}^k - w_t^k\right)^T\delta_t^k\right),
\end{aligned} \tag{12}$$
where the last inequality is due to Lemma 3 with $\bar{w}_{t-1}^k \in \mathcal{W}^\star$ and $0 < \eta < 1/(4L_P) < 1/L_P \le 1/L$ (see Lemma 5). To bound the quantity $-(\bar{w}_{t-1}^k - w_t^k)^T\delta_t^k$, we define an auxiliary vector
$$\hat{w}_t^k = \Pi_{\mathcal{W}}\left(w_{t-1}^k - \eta\nabla f(w_{t-1}^k)\right).$$
Thus, we have
$$\begin{aligned}
-\left(\bar{w}_{t-1}^k - w_t^k\right)^T\delta_t^k &= \left(w_t^k - \hat{w}_t^k + \hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\delta_t^k \\
&\le \left\|w_t^k - \hat{w}_t^k\right\|\left\|\delta_t^k\right\| + \left(\hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\delta_t^k \\
&\le \left\|w_{t-1}^k - \eta v_t^k - \left(w_{t-1}^k - \eta\nabla f(w_{t-1}^k)\right)\right\|\left\|\delta_t^k\right\| + \left(\hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\delta_t^k \\
&= \eta\left\|\delta_t^k\right\|^2 + \left(\hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\delta_t^k,
\end{aligned}$$
where the second inequality is due to the non-expansive property of the projection (Proposition B.11(c) in [2]). The above inequality and Eq. (12) imply that
$$\left\|w_t^k - \bar{w}_t^k\right\|^2 \le \left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2 - 2\eta\left(f(w_t^k) - f^\star\right) + 2\eta^2\left\|\delta_t^k\right\|^2 + 2\eta\left(\hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\delta_t^k. \tag{13}$$
Considering Lemma 2 with $\delta_t^k = \nabla f(w_{t-1}^k) - v_t^k$ and noticing that $\hat{w}_t^k - \bar{w}_{t-1}^k$ is independent of the random variable $i_t^k$ and $\mathcal{F}_t^k = \mathcal{F}_{t-1}^k\cup\{i_t^k\}$, we have $\mathbb{E}_{\mathcal{F}_t^k}[\|\delta_t^k\|^2 \mid \mathcal{F}_{t-1}^k] \le 4L_P\left(f(w_{t-1}^k) - f^\star + f(\tilde{w}^{k-1}) - f^\star\right)$ and $\mathbb{E}_{\mathcal{F}_t^k}[(\hat{w}_t^k - \bar{w}_{t-1}^k)^T\delta_t^k \mid \mathcal{F}_{t-1}^k] = (\hat{w}_t^k - \bar{w}_{t-1}^k)^T\mathbb{E}_{\mathcal{F}_t^k}[\delta_t^k \mid \mathcal{F}_{t-1}^k] = 0$. Taking expectation with respect to $\mathcal{F}_t^k$ conditioned on $\mathcal{F}_{t-1}^k$ on both sides of Eq. (13), we have
$$\begin{aligned}
\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|w_t^k - \bar{w}_t^k\right\|^2 \mid \mathcal{F}_{t-1}^k\right] &\le \left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2 - 2\eta\mathbb{E}_{\mathcal{F}_t^k}\left[f(w_t^k) - f^\star \mid \mathcal{F}_{t-1}^k\right] \\
&\quad + 2\eta^2\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|\delta_t^k\right\|^2 \mid \mathcal{F}_{t-1}^k\right] + 2\eta\left(\hat{w}_t^k - \bar{w}_{t-1}^k\right)^T\mathbb{E}_{\mathcal{F}_t^k}\left[\delta_t^k \mid \mathcal{F}_{t-1}^k\right] \\
&\le \left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2 - 2\eta\mathbb{E}_{\mathcal{F}_t^k}\left[f(w_t^k) - f^\star \mid \mathcal{F}_{t-1}^k\right] \\
&\quad + 8L_P\eta^2\left(f(w_{t-1}^k) - f^\star + f(\tilde{w}^{k-1}) - f^\star\right).
\end{aligned}$$
Taking expectation with respect to $\mathcal{F}_{t-1}^k$ on both sides of the above inequality and considering the fact that $\mathbb{E}_{\mathcal{F}_{t-1}^k}\left[\mathbb{E}_{\mathcal{F}_t^k}\left[\|w_t^k - \bar{w}_t^k\|^2 \mid \mathcal{F}_{t-1}^k\right]\right] = \mathbb{E}_{\mathcal{F}_t^k}\left[\|w_t^k - \bar{w}_t^k\|^2\right]$, we have
$$\begin{aligned}
\mathbb{E}_{\mathcal{F}_t^k}\left[\left\|w_t^k - \bar{w}_t^k\right\|^2\right] &\le \mathbb{E}_{\mathcal{F}_{t-1}^k}\left[\left\|w_{t-1}^k - \bar{w}_{t-1}^k\right\|^2\right] - 2\eta\mathbb{E}_{\mathcal{F}_t^k}\left[f(w_t^k) - f^\star\right] \\
&\quad + 8L_P\eta^2\mathbb{E}_{\mathcal{F}_{t-1}^k}\left[f(w_{t-1}^k) - f^\star + f(\tilde{w}^{k-1}) - f^\star\right].
\end{aligned}$$
Summing the above inequality over $t = 1, 2, \cdots, m$ and noticing that $\mathcal{F}_0^k = \mathcal{F}_m^{k-1}$, we have
$$\begin{aligned}
&\mathbb{E}_{\mathcal{F}_m^k}\left[\left\|w_m^k - \bar{w}_m^k\right\|^2\right] + 2\eta\sum_{t=1}^m\mathbb{E}_{\mathcal{F}_t^k}\left[f(w_t^k) - f^\star\right] \\
&\le \mathbb{E}_{\mathcal{F}_m^{k-1}}\left[\left\|w_0^k - \bar{w}_0^k\right\|^2\right] + 8L_P\eta^2\sum_{t=1}^m\mathbb{E}_{\mathcal{F}_{t-1}^k}\left[f(w_{t-1}^k) - f^\star\right] + 8L_P\eta^2 m\,\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right].
\end{aligned}$$
Thus, we have
$$\begin{aligned}
&2\eta\mathbb{E}_{\mathcal{F}_m^k}\left[f(w_m^k) - f^\star\right] + 2\eta(1 - 4L_P\eta)\sum_{t=1}^{m-1}\mathbb{E}_{\mathcal{F}_t^k}\left[f(w_t^k) - f^\star\right] + \mathbb{E}_{\mathcal{F}_m^k}\left[\left\|w_m^k - \bar{w}_m^k\right\|^2\right] \\
&\le \mathbb{E}_{\mathcal{F}_m^{k-1}}\left[\left\|w_0^k - \bar{w}_0^k\right\|^2\right] + 8L_P\eta^2\left(\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(w_0^k) - f^\star\right] + m\,\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right]\right),
\end{aligned}$$
which together with $\mathbb{E}_{\mathcal{F}_m^k}\left[f(w_m^k) - f^\star\right] \ge 0$, $2\eta > 2\eta(1 - 4L_P\eta) > 0$, $\mathbb{E}_{\mathcal{F}_m^k}\left[\left\|w_m^k - \bar{w}_m^k\right\|^2\right] \ge 0$ and $w_0^k = \tilde{w}^{k-1}$ implies that
$$2\eta(1 - 4L_P\eta)\sum_{t=1}^m\mathbb{E}_{\mathcal{F}_m^k}\left[f(w_t^k) - f^\star\right] \le \mathbb{E}_{\mathcal{F}_m^{k-1}}\left[\left\|w_0^k - \bar{w}_0^k\right\|^2\right] + 8L_P\eta^2(m+1)\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right], \tag{14}$$
where we use the fact that $\mathbb{E}_{\mathcal{F}_{t-1}^k}\left[f(w_0^k) - f^\star\right] = \mathbb{E}_{\mathcal{F}_{t-1}^k}\left[f(\tilde{w}^{k-1}) - f^\star\right] = \mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right]$. By the convexity of $f(\cdot)$, we have
$$f(\tilde{w}^k) = f\left(\frac{1}{m}\sum_{t=1}^m w_t^k\right) \le \frac{1}{m}\sum_{t=1}^m f(w_t^k).$$
Thus, we have
$$m\left(f(\tilde{w}^k) - f^\star\right) \le \sum_{t=1}^m\left(f(w_t^k) - f^\star\right). \tag{15}$$
Considering Lemma 4 with $\tilde{w}^{k-1} = w_0^k \in \mathcal{W}$ and $\bar{w}_0^k = \Pi_{\mathcal{W}^\star}(w_0^k)$, we have
$$f(\tilde{w}^{k-1}) - f^\star = f(w_0^k) - f^\star \ge \frac{\mu}{2\beta}\left\|w_0^k - \bar{w}_0^k\right\|^2,$$
which together with Eqs. (14), (15) implies that
$$\begin{aligned}
2\eta(1 - 4L_P\eta)m\,\mathbb{E}_{\mathcal{F}_m^k}\left[f(\tilde{w}^k) - f^\star\right] &\le \mathbb{E}_{\mathcal{F}_m^{k-1}}\left[\left\|w_0^k - \bar{w}_0^k\right\|^2\right] + 8L_P\eta^2(m+1)\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right] \\
&\le \left(8L_P\eta^2(m+1) + \frac{2\beta}{\mu}\right)\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right].
\end{aligned}$$
Thus, we have
$$\mathbb{E}_{\mathcal{F}_m^k}\left[f(\tilde{w}^k) - f^\star\right] \le \left(\frac{4L_P\eta(m+1)}{(1 - 4L_P\eta)m} + \frac{\beta}{\mu\eta(1 - 4L_P\eta)m}\right)\mathbb{E}_{\mathcal{F}_m^{k-1}}\left[f(\tilde{w}^{k-1}) - f^\star\right].$$
Using the above recursive relation and considering the definition of $\rho$ in Eq. (4), we complete the proof of the theorem. □

Remark 3 If $f$ is strongly convex with parameter $\tilde{\mu}$, then the inequality in Lemma 4 holds with $\beta = 1$ and $\mu = \tilde{\mu}$. Therefore, we can easily obtain from the proof of Theorem 1 that
$$\mathbb{E}_{\mathcal{F}_m^k}\left[f(\tilde{w}^k) - f^\star\right] \le \left(\frac{4L_P\eta(m+1)}{(1 - 4L_P\eta)m} + \frac{1}{\tilde{\mu}\eta(1 - 4L_P\eta)m}\right)^k\left(f(\tilde{w}^0) - f^\star\right),$$
which has the same convergence rate as [27].
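As an end-to-end sanity check of Theorem 1 (our own experiment, not one from the paper), the `vrpsg` sketch from Section 2.2 can be run on an $\ell_1$-constrained least squares problem with $d > n$, where $f$ is convex but not strongly convex; the objective gap then shrinks with the number of outer stages $k$ (linearly in expectation by Theorem 1). Problem sizes, $\tau$ and $\eta$ are arbitrary illustrative choices:

```python
import numpy as np

# Reuses vrpsg (and project_l1) from the sketch in Section 2.2.
rng = np.random.default_rng(2)
n, d = 50, 200                            # d > n: f is convex, not strongly convex
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
tau = 1.0

f = lambda w: np.sum((X @ w - y) ** 2) / (2 * n)
grad_f = lambda w: X.T @ (X @ w - y) / n
grad_f_i = lambda w, i: (X[i] @ w - y[i]) * X[i]

L_i = np.sum(X ** 2, axis=1)
p = L_i / L_i.sum()                       # Lipschitz-proportional sampling
eta = 0.1 / np.max(L_i / (n * p))         # eta = 0.1 / L_P < 0.25 / L_P

# High-accuracy reference value from a long run.
w_ref = vrpsg(grad_f, grad_f_i, n, d, p, eta, m=2 * n, num_stages=200, tau=tau)
f_star = f(w_ref)

for k in [1, 5, 10, 20]:
    w_k = vrpsg(grad_f, grad_f_i, n, d, p, eta, m=2 * n, num_stages=k, tau=tau)
    print(k, f(w_k) - f_star)             # gap shrinks as k grows
```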

4 Discussion

Recall that one of the assumptions for the convergence analysis in Theorem 1 is that the constraint set is polyhedral. An interesting question is whether we can extend the linear convergence rate in Theorem 1 to constrained optimization problems beyond polyhedral sets. Let us consider the following sphere constrained optimization problem:
$$\min_{w\in\mathbb{R}^d}\left\{f(w) \quad \text{s.t. } w \in \mathcal{W} = \{w : \|w\| \le \tau\}\right\}, \tag{16}$$
where $f(w) = h(Xw)$ satisfies assumptions A1, A2 and $\tau > 0$. Obviously the sphere constraint set does not satisfy assumption A3. Let $f(w) = (w_1 + w_2 - \sqrt{2})^2$ and $\tau = 1$. Then the optimal solution set of Eq. (16) is
$$\mathcal{W}^\star = \left\{[\sqrt{2}/2, \sqrt{2}/2]^T\right\}.$$
Let $w = [\cos(\omega), \sin(\omega)]^T$. It is easy to obtain that $w \in \mathcal{W}$, $\|w - \bar{w}\|^2 = (\cos(\omega) - \sqrt{2}/2)^2 + (\sin(\omega) - \sqrt{2}/2)^2 = -\sqrt{2}(\cos(\omega) + \sin(\omega) - \sqrt{2})$ and $f(w) - f^\star = (\cos(\omega) + \sin(\omega) - \sqrt{2})^2$. Thus, we have
$$\lim_{\omega\to\pi/4}\frac{f(w) - f^\star}{\|w - \bar{w}\|^2} = 0,$$
which implies that Lemma 4 does not hold for the sphere constrained optimization problem in Eq. (16). Notice that Lemma 4 may not be a necessary condition for the linear convergence analysis in Theorem 1, so we cannot conclude that it is impossible to extend the linear convergence rate in Theorem 1 to the sphere constrained optimization problem in Eq. (16). However, this simple example illustrates that the extension of the linear convergence analysis to non-polyhedral constrained optimization problems may not be easy.

It is well-known that the constrained optimization problem in Eq. (1) is equivalent to some regularized optimization problem under certain conditions. A natural question is whether the convergence analysis in Theorem 1 can be extended to the equivalent counterpart of Eq. (1) [i.e., the regularized form of Eq. (1)]. To be specific, let us consider the following $\ell_1$-constrained and $\ell_1$-regularized optimization problems:
$$\min_{w\in\mathbb{R}^d}\left\{f(w) \quad \text{s.t. } \|w\|_1 \le \tau\right\}, \tag{17}$$
$$\min_{w\in\mathbb{R}^d}\left\{F(w) = f(w) + \lambda\|w\|_1\right\}, \tag{18}$$
where $f(w) = h(Xw)$ satisfies assumptions A1, A2. It is well-known that Eq. (17) and Eq. (18) have the same optimal solution set when $\tau$ and $\lambda$ take appropriate values. In the following, we focus on the case where Eq. (17) and Eq. (18) have the same optimal solution set. It is easy to verify that the $\ell_1$-constrained problem in Eq. (17) satisfies assumptions A1-A3 and thus Theorem 1 is applicable to Eq. (17). It is known that we can use Algorithm 1 to solve the $\ell_1$-regularized problem in Eq. (18) by replacing the projection step in Algorithm 1 (line 10) by the proximal step. But the question is whether we can extend the convergence analysis in Theorem 1 with respect to $F(\cdot)$. One key building block to establish a similar linear convergence rate as in Theorem 1 is to prove a bound similar to Lemma 4. Specifically, is there a constant $\theta > 0$ such that
$$F(w) - F^\star \ge \theta\|w - \bar{w}\|^2 \tag{19}$$
holds for all $w \in \mathbb{R}^d$ [where $F^\star$ is the optimal objective function value in Eq. (18)]? Let us consider the following example by setting $f(w) = (w_1 + w_2 - 1)^2$, $\tau = 0.5$ and $\lambda = 1$. It is easy to verify that Eq. (17) and Eq. (18) have the same optimal solution set
$$\mathcal{W}^\star = \{w : w_1 + w_2 = 0.5, w_1 \ge 0, w_2 \ge 0\}.$$
Let $w = [w_1, w_2]^T$ with $w_1 + w_2 = 0.5$ and $w_1 > 0$, $w_2 < 0$. It is easy to obtain that $\bar{w} = \Pi_{\mathcal{W}^\star}(w) = [0.5, 0]^T$, $\|w - \bar{w}\|^2 = (w_1 - 0.5)^2 + w_2^2 = 2w_2^2 > 0$ and $F(w) - F^\star = w_1 - w_2 - 0.5 = -2w_2 > 0$. Thus, we have
$$\lim_{w_2\to-\infty}\frac{F(w) - F^\star}{\|w - \bar{w}\|^2} = 0,$$
which implies that there does not exist a constant $\theta > 0$ such that Eq. (19) holds for all $w \in \mathbb{R}^d$. Since Eq. (19) may not be a necessary condition for the linear convergence rate for solving Eq. (18), the example above only suggests that the convergence analysis in Theorem 1 may not extend to regularized optimization problems. However, this example illustrates that the analysis may be highly non-trivial even if the extension is possible.
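The failure of Eq. (19) in this example is easy to verify numerically; the following sketch (our own check, not from the paper) evaluates the ratio $(F(w) - F^\star)/\|w - \bar{w}\|^2$ along the line $w_1 + w_2 = 0.5$ as $w_2 \to -\infty$:

```python
import numpy as np

lam = 1.0
F = lambda w: (w[0] + w[1] - 1.0) ** 2 + lam * np.abs(w).sum()
F_star = F(np.array([0.5, 0.0]))        # optimal value 0.25 + 0.5 = 0.75
w_bar = np.array([0.5, 0.0])            # projection of the points below onto W*

for w2 in [-1.0, -10.0, -100.0, -1000.0]:
    w = np.array([0.5 - w2, w2])        # stays on the line w1 + w2 = 0.5
    ratio = (F(w) - F_star) / np.sum((w - w_bar) ** 2)
    print(w2, ratio)                    # ratio = -1/w2 -> 0, so no theta > 0 works
```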

5 Experiments

In this section, we validate the effectiveness of VRPSG by solving the following $\ell_1$-constrained logistic regression problem:
$$\min_{w\in\mathbb{R}^d}\left\{f(w) = \frac{1}{n}\sum_{i=1}^n\log\left(1 + \exp(-y_i x_i^T w)\right)\right\}, \quad \text{s.t. } \|w\|_1 \le \tau,$$
where $n$ is the number of samples; $\tau > 0$ is the constraint parameter; $x_i \in \mathbb{R}^d$ is the $i$-th sample; $y_i \in \{1, -1\}$ is the label of the sample $x_i$. For the above problem, it is easy to obtain that the convex component is $f_i(w) = \log(1 + \exp(-y_i x_i^T w))$ and the Lipschitz constant of $\nabla f_i(w)$ is $\|x_i\|^2/4$. We conduct experiments on three real-world data sets: classic ($n = 7094$, $d = 41681$), reviews ($n = 4069$, $d = 18482$) and sports ($n = 8580$, $d = 14866$). The three data sets are multi-class sparse text data and can be downloaded from http://www.shi-zhong.com/software/docdata.zip. To adapt the data to the two-class logistic regression problem, we transform the multi-class data into two-class data by labeling the first half of all classes as the positive class and the remaining classes as the negative class.
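For concreteness, the quantities driving the experiments (component gradients, the bounds $L_i = \|x_i\|^2/4$ and the non-uniform sampling distribution) can be computed as follows (a minimal sketch with synthetic data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def grad_f_i(w, i):
    # Gradient of f_i(w) = log(1 + exp(-y_i x_i^T w)).
    s = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ w)))   # sigmoid(-y_i x_i^T w)
    return -y[i] * s * X[i]

# L_i = ||x_i||^2 / 4 bounds the curvature of f_i (the logistic loss
# has second derivative at most 1/4).
L_i = np.sum(X ** 2, axis=1) / 4.0

p_nonuniform = L_i / L_i.sum()     # sample proportionally to L_i
p_uniform = np.full(n, 1.0 / n)

# L_P = max_i L_i / (n p_i): smaller under Lipschitz-proportional sampling.
print(np.max(L_i / (n * p_nonuniform)))  # = L_avg = mean(L_i)
print(np.max(L_i / (n * p_uniform)))     # = L_max = max(L_i)
```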

5.1 Sensitivity Studies for VRPSG

We conduct sensitivity studies for VRPSG on the sampling distribution parameter $p = [p_1, \cdots, p_n]^T$, the inner iteration number $m$ and the step size $\eta$, by varying one parameter and keeping the other two fixed. Notice that the projection (line 10 in Algorithm 1) onto the $\ell_1$-ball is easy to solve [3, 15, 5] and thus the dominant computational cost is the gradient computation. To provide an implementation-independent result, we report the objective function value $f(\tilde{w}^k)$ vs. the number of gradient evaluations² (#grad/n) plots in Figure 1, Figure 2 and Figure 3. From these results, we have the following observations: (a) The VRPSG algorithm with non-uniform sampling (i.e., $p_i = L_i/\sum_{i=1}^n L_i$) is much more efficient than that with uniform sampling (i.e., $p_i = 1/n$), which is consistent with the analysis in the remarks of Theorem 1. (b) In general, the VRPSG algorithm with $m = 0.5n, n$ has the most stable performance, which indicates that a too small or too large $m$ will degrade the performance of the VRPSG algorithm. (c) The optimal step sizes of the VRPSG algorithm on different data sets are slightly different. Moreover, the VRPSG algorithm with step sizes $\eta = 1/L_P$ and $\eta = 5/L_P$ converges quickly, which demonstrates that the VRPSG algorithm still performs well even if the step size is much larger than that required in the theoretical analysis ($\eta < 0.25/L_P$ is required in Theorem 1). This shows the robustness of the VRPSG algorithm.

²Computing the gradient on a single sample counts as one gradient evaluation.

Figure 1: Sensitivity study of VRPSG on the parameter $p = [p_1, \cdots, p_n]^T$: the objective function value $f(\tilde{w}^k)$ vs. the number of gradient evaluations (#grad/n) plots on the classic, reviews and sports data sets (averaged on 10 runs). "Uniform" and "Non-uniform" indicate that $p_i = 1/n$ and $p_i = L_i/\sum_{i=1}^n L_i$, respectively. Other parameters are set as $\tau = 10$, $m = n$, $\eta = 1/L_P$.

Figure 2: Sensitivity study of VRPSG on the parameter $m$: the objective function value $f(\tilde{w}^k)$ vs. the number of gradient evaluations (#grad/n) plots on the classic, reviews and sports data sets (averaged on 10 runs), with $m \in \{0.2n, 0.5n, n, 2n, 4n\}$. Other parameters are set as $\tau = 10$, $p_i = L_i/\sum_{i=1}^n L_i$, $\eta = 1/L_P$.

Figure 3: Sensitivity study of VRPSG on the parameter $\eta$: the objective function value $f(\tilde{w}^k)$ vs. the number of gradient evaluations (#grad/n) plots on the classic, reviews and sports data sets (averaged on 10 runs), with $\eta \in \{10/L_P, 5/L_P, 1/L_P, 0.2/L_P, 0.04/L_P\}$. Other parameters are set as $\tau = 10$, $m = n$, $p_i = L_i/\sum_{i=1}^n L_i$.

5.2 Comparison with Other Algorithms

We conduct the comparison by including the following algorithms³:

• AFG: the accelerated full gradient algorithm proposed in [1], where the adaptive line search scheme is used.

• SGD: the stochastic gradient descent algorithm in Eq. (3). As suggested by [4], we set the step size as $\eta_k = \eta_0/\sqrt{k}$, where $\eta_0$ is an initial step size (a minimal implementation sketch appears at the end of this subsection).

• VRPSG: the variance-reduced projected stochastic gradient algorithm proposed in this paper.

• VRPSG2: a hybrid algorithm that executes SGD for one pass over the data and then switches to the VRPSG algorithm (similar schemes are also adopted in [10, 27]).

³We do not include SAG [13] and SDCA [19] in the comparison, since SAG is only applicable to unconstrained optimization problems and SDCA is adopted to solve regularized optimization problems.

Notice that SGD is sensitive to the initial step size $\eta_0$ [4]. To have a fair comparison of different algorithms, we set different values of $\eta_0$ for SGD to obtain the best performance ($\eta_0 = 5, 1, 0.2, 0.04$). To comprehensively show the convergence behaviors of different algorithms, we report the objective function value $f(\tilde{w}^k)$ and the objective function value gap $f(\tilde{w}^k) - f^\star$ vs. the number of gradient evaluations (#grad/n) plots in Figure 4, from which we have the following observations: (a) Both stochastic algorithms (VRPSG and SGD with a proper initial step size) outperform the full gradient algorithm (AFG). (b) SGD decreases the objective function value quickly in the beginning but gradually slows down in later iterations, whereas VRPSG keeps decreasing the objective function value quickly. This phenomenon is expected due to the sub-linear convergence rate of SGD and the linear convergence rate of VRPSG. (c) VRPSG2 performs slightly better than VRPSG, which demonstrates that the hybrid scheme can empirically improve the performance (similar results are also reported in [10, 27]).
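For reference, the SGD baseline above is simply the projected stochastic gradient iteration of Eq. (3) with the decaying step size $\eta_k = \eta_0/\sqrt{k}$; a minimal sketch (our own, reusing `project_l1` and `grad_f_i` from the earlier sketches):

```python
import numpy as np

def sgd(grad_f_i, n, d, eta0, num_iters, tau, seed=0):
    # Projected SGD (Eq. (3)) with decaying step size eta_k = eta0 / sqrt(k),
    # as suggested by [4]; reuses project_l1 from the Section 2.2 sketch.
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for k in range(1, num_iters + 1):
        i = rng.integers(n)                       # uniform sample i_k
        w = project_l1(w - (eta0 / np.sqrt(k)) * grad_f_i(w, i), tau)
    return w
```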

Figure 4: Comparison of different algorithms: the objective function value $f(\tilde{w}^k)$ (first row) and the objective function value gap $f(\tilde{w}^k) - f^\star$ (second row) vs. the number of gradient evaluations (#grad/n) plots on the classic, reviews and sports data sets (averaged on 10 runs). The parameters of VRPSG are set as $\tau = 10$, $\eta = 1/L_P$, $m = n$, $p_i = L_i/\sum_{i=1}^n L_i$; the step size of SGD is set as $\eta_k = \eta_0/\sqrt{k}$.

6 Conclusion

In this paper, we propose a Variance-Reduced Projected Stochastic Gradient (VRPSG) algorithm to efficiently solve a class of constrained optimization problems. Our main technical contribution is to establish a linear convergence rate for the VRPSG algorithm without strong convexity. To the best of our knowledge, this is the first linear convergence result for variance-reduced stochastic gradient algorithms without the strongly convex condition. In future work, we will try to develop a more general convergence analysis for a wider range of problems, including both non-polyhedral constrained optimization problems and regularized optimization problems.

Appendix

Lemma 5 Let $L$ and $L_i$ be the Lipschitz constants of $\nabla f(w)$ and $\nabla f_i(w)$, respectively. Moreover, let $L_{avg} = \sum_{i=1}^n L_i/n$, $L_{max} = \max_{i\in\{1,\cdots,n\}} L_i$ and $L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$ with $p_i \in (0,1)$, $\sum_{i=1}^n p_i = 1$. Then we have $L \le L_{avg} \le L_P$ and $L_{avg} \le L_{max}$.

Proof Based on the definition of Lipschitz continuity, we obtain that $L$ and $L_i$ are the smallest positive constants such that for all $w, u \in \mathbb{R}^d$:
$$\|\nabla f(w) - \nabla f(u)\| \le L\|w - u\|, \tag{20}$$
$$\|\nabla f_i(w) - \nabla f_i(u)\| \le L_i\|w - u\|. \tag{21}$$
Dividing Eq. (21) by $n$ and summing over $i = 1, \cdots, n$, we have
$$\frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w) - \nabla f_i(u)\| \le \frac{1}{n}\sum_{i=1}^n L_i\|w - u\|. \tag{22}$$
Based on the triangle inequality and $\nabla f(w) = \frac{1}{n}\sum_{i=1}^n\nabla f_i(w)$, we have
$$\|\nabla f(w) - \nabla f(u)\| \le \frac{1}{n}\sum_{i=1}^n\|\nabla f_i(w) - \nabla f_i(u)\|,$$
which together with $L_{avg} = \sum_{i=1}^n L_i/n$ and Eqs. (20), (22) implies that $L \le L_{avg}$.

Define $s = [L_1/p_1, \cdots, L_n/p_n]^T$. Noticing that $L_P = \max_{i\in\{1,\cdots,n\}}[L_i/(np_i)]$ with $p_i \in (0,1)$, $\sum_{i=1}^n p_i = 1$ and considering the definition of the dual norm, we have
$$nL_P = \max_{i\in\{1,\cdots,n\}}\frac{L_i}{p_i} = \|s\|_\infty = \sup_{\|t\|_1\le 1} t^T s \ge \sum_{i=1}^n p_i\frac{L_i}{p_i} = \sum_{i=1}^n L_i,$$
which together with $L_{avg} = \sum_{i=1}^n L_i/n$ immediately implies that $L_{avg} \le L_P$. $L_{avg} \le L_{max}$ is obvious by the definitions of $L_{avg} = \sum_{i=1}^n L_i/n$ and $L_{max} = \max_{i\in\{1,\cdots,n\}} L_i$. □

Lemma 6 Under assumptions A1-A3, for all $w^\star \in \mathcal{W}^\star$, there exists a unique $r^\star$ such that $Xw^\star = r^\star$. Moreover, $\mathcal{W}^\star = \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$.

Proof By assumption A3, we know that $\mathcal{W}^\star$ is not empty. Assume that there are $w_1^\star, w_2^\star \in \mathcal{W}^\star$ such that $Xw_1^\star \ne Xw_2^\star$. Then, the optimal objective function value is $f^\star = f(w_1^\star) = f(w_2^\star)$. Due to $w_1^\star, w_2^\star \in \mathcal{W}^\star$ and the convexity of $\mathcal{W}^\star$, we have $(w_1^\star + w_2^\star)/2 \in \mathcal{W}^\star$. Therefore,
$$f^\star = f\left(\frac{1}{2}(w_1^\star + w_2^\star)\right) = h\left(\frac{1}{2}Xw_1^\star + \frac{1}{2}Xw_2^\star\right). \tag{23}$$
On the other hand, the strong convexity of $h(\cdot)$ implies that
$$h\left(\frac{1}{2}Xw_1^\star + \frac{1}{2}Xw_2^\star\right) < \frac{1}{2}h(Xw_1^\star) + \frac{1}{2}h(Xw_2^\star) = \frac{1}{2}\left(f(w_1^\star) + f(w_2^\star)\right) = f^\star,$$
leading to a contradiction with Eq. (23). Thus, there exists a unique $r^\star$ such that for all $w^\star \in \mathcal{W}^\star$, $Xw^\star = r^\star$ and $f^\star = h(r^\star)$.

If $w^\star \in \mathcal{W}^\star$, then $w^\star \in \mathcal{W}$ and $Xw^\star = r^\star$, that is, $w^\star \in \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$ and hence $\mathcal{W}^\star \subseteq \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$. If $w^\star \in \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$, then $w^\star$ is a feasible solution and $f(w^\star) = h(Xw^\star) = h(r^\star) = f^\star$, that is, $w^\star \in \mathcal{W}^\star$ and hence $\{w^\star : Cw^\star \le b, Xw^\star = r^\star\} \subseteq \mathcal{W}^\star$. Therefore, we have $\mathcal{W}^\star = \{w^\star : Cw^\star \le b, Xw^\star = r^\star\}$. □

Lemma 7 (Hoffman's bound, Lemma 4.3 in [25]) Let $\mathcal{V} = \{w : Cw \le b, Xw = r\}$ be a non-empty polyhedron. Then for any $w \in \mathbb{R}^d$, there exist a feasible point $w^\star$ of $\mathcal{V}$ and a constant $\theta > 0$ such that
$$\|w - w^\star\| \le \theta\left\|\begin{bmatrix}[Cw - b]_+ \\ Xw - r\end{bmatrix}\right\|,$$
where $[Cw - b]_+$ denotes the Euclidean projection of $Cw - b$ onto the non-negative orthant and $\theta$ only depends on $C$ and $X$.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[2] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[3] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
[4] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
[5] P. Gong, K. Gai, and C. Zhang. Efficient euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing, 74(17):2754–2766, 2011.
[6] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. Journal of Machine Learning Research - Proceedings Track, 19:421–436, 2011.
[7] A. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263–265, 1952.
[8] K. Hou, Z. Zhou, A. So, and Z. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In NIPS, pages 710–718, 2013.
[9] C. Hu, J. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In NIPS, volume 22, pages 781–789, 2009.
[10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
[11] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
[12] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
[13] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
[14] W. Li. Sharp Lipschitz constants for basic optimal solutions and basic feasible solutions of linear programs. SIAM Journal on Control and Optimization, 32(1):140–153, 1994.
[15] J. Liu and J. Ye. Efficient euclidean projections in linear time. In ICML, pages 657–664, 2009.
[16] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[17] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In ICML, pages 857–864, 2009.
[18] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[19] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.
[20] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
[21] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pages 71–79, 2013.
[22] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[23] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, 2010.
[24] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–423, 2009.
[25] P. Wang and C. Lin. Iteration complexity of feasible descent methods for convex optimization. Department of Computer Science, National Taiwan University, Tech. Rep., 2013.
[26] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(4):2543–2596, 2010.
[27] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. arXiv preprint arXiv:1403.4699, 2014.
[28] L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In NIPS, pages 980–988, 2013.
[29] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.