Katyusha: Accelerated Variance Reduction for Faster SGD


arXiv:1603.05953v1 [math.OC] 18 Mar 2016

Zeyuan Allen-Zhu ([email protected])
Princeton University

March 18, 2016

Abstract

We consider minimizing f(x) that is an average of n convex, smooth functions f_i(x), and provide the first direct stochastic gradient method, Katyusha, that has an accelerated convergence rate. It converges to an ε-approximate minimizer using O((n + √(nκ)) · log((f(x_0) − f(x∗))/ε)) stochastic gradients, where κ is the condition number. Katyusha is a primal-only method, supporting proximal updates, non-Euclidean norm smoothness, mini-batch sampling, as well as non-uniform sampling. It also resolves the following open questions in machine learning:
• If f(x) is not strongly convex (e.g., Lasso, logistic regression), Katyusha gives the first stochastic method that achieves the optimal 1/√ε rate.
• If f(x) is strongly convex and each f_i(x) is "rank-one" (e.g., SVM), Katyusha gives the first stochastic method that achieves the optimal 1/√ε rate.
• If f(x) is not strongly convex and each f_i(x) is "rank-one" (e.g., ℓ1-SVM), Katyusha gives the first stochastic method that achieves the optimal 1/ε rate.
The main ingredient in Katyusha is a novel "negative momentum on top of momentum" that can be elegantly coupled with the existing variance-reduction trick for stochastic gradient descent. Since variance reduction has already been applied successfully to a fast-growing list of practical problems, our paper suggests that one had better hurry up and give Katyusha a hug in each of them, in the hope of a faster running time in practice as well.

1   Introduction

Consider the following composite convex minimization problem:

    min_{x ∈ R^d}  { F(x) := f(x) + ψ(x) := (1/n) Σ_{i=1}^{n} f_i(x) + ψ(x) } .        (1.1)

Here, f(x) = (1/n) Σ_{i=1}^{n} f_i(x) is a convex function that is a finite average of n smooth, convex functions f_i(x), and ψ(x) is a convex, lower semicontinuous (but possibly non-differentiable) function, sometimes referred to as the proximal function. We shall mostly focus on the case when ψ(x) is σ-strongly convex in this paper. Both the smoothness assumptions on f_i(x) and the strong convexity assumption on ψ(x) can be relaxed; see Section 1.2.

We are interested in finding an approximate minimizer x ∈ R^d satisfying F(x) ≤ F(x∗) + ε, where x∗ is a minimizer of F(x). Problem (1.1) arises in many places in machine learning, statistics, and operations research. For instance, all convex regularized empirical risk minimization (ERM) problems fall into this category,

see Section 1.2 for details. It has also been observed recently that efficient stochastic algorithms for solving (1.1) give rise to fast training algorithms for neural nets [2, 15].

Perhaps the simplest first-order method to solve (1.1) is proximal gradient descent:

    x_{k+1} ← arg min_{y ∈ R^d} { (1/(2η)) ||y − x_k||₂² + ⟨∇f(x_k), y⟩ + ψ(y) } .

Above, η is the step length, and if the proximal function ψ(y) equals zero, the update simply reduces to x_{k+1} ← x_k − η∇f(x_k). Since computing the full gradient ∇f(·) is usually very expensive, stochastic gradient update rules have been proposed instead:

    x_{k+1} ← arg min_{y ∈ R^d} { (1/(2η)) ||y − x_k||₂² + ⟨∇̃_k, y⟩ + ψ(y) } ,

where ∇̃_k is a random vector satisfying E[∇̃_k] = ∇f(x_k), referred to as the gradient estimator.

Given the "finite average" structure f(x) = (1/n) Σ_{i=1}^{n} f_i(x), a popular choice for the gradient estimator is to set ∇̃_k = ∇f_i(x_k) for a uniformly random index i ∈ [n] per iteration. Methods based on this choice are known as stochastic gradient descent (SGD) methods [9, 34]. As the computation of ∇f_i(x) is usually n times faster than that of ∇f(x), SGD is suitable for large-scale machine learning tasks.
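To make the update above concrete, here is a minimal sketch, written by us for illustration only, of one proximal stochastic gradient step for the ℓ1-regularized choice ψ(y) = λ||y||₁, in which the arg min has the closed-form soft-thresholding solution. The function names and the choice of ψ are our own assumptions, not prescribed by the paper.

    import numpy as np

    def soft_threshold(v, t):
        # Coordinate-wise solution of min_y 0.5*||y - v||^2 + t*||y||_1.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def prox_sgd_step(x_k, grad_estimate, eta, lam):
        # x_{k+1} = argmin_y { 1/(2*eta)*||y - x_k||^2 + <grad_estimate, y> + lam*||y||_1 },
        # i.e., soft-thresholding applied to the plain gradient step.
        return soft_threshold(x_k - eta * grad_estimate, eta * lam)

    # Plain SGD uses grad_estimate = grad_f_i(x_k) for a uniformly random index i.

With ψ ≡ 0 the same step reduces to x_{k+1} = x_k − η·grad_estimate, as stated above.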

Variance Reduction. Recently, the convergence speed of SGD has been improved using the variance-reduction technique [8, 10, 11, 15, 22, 23, 28–30, 32, 33]. In all of these cited results, the authors have, in one way or another, shown that SGD converges much faster if one makes a better choice of the gradient estimator ∇̃_k, so that its variance E[||∇̃_k − ∇f(x_k)||₂²] decreases as k increases.

One particular way to choose this estimator can be described as follows. Keep a snapshot x̃ = x_k after every m stochastic update steps (where m is a parameter that is usually on the order of n), and compute the full gradient ∇f(x̃) only for such snapshots. Then, set ∇̃_k = ∇f_i(x_k) − ∇f_i(x̃) + ∇f(x̃) as the gradient estimator (a code sketch of this estimator is given after the list below). One can verify that, under this choice of ∇̃_k, it satisfies E[∇̃_k] = ∇f(x_k) and lim_{k→∞} E[||∇̃_k − ∇f(x_k)||₂²] = 0.

Unfortunately, all of these cited results on variance reduction provide non-accelerated convergence rates for solving (1.1). For instance, the widely used SVRG and SAGA algorithms obtain ε-approximate minimizers of (1.1) in O((n + L/σ) · log((F(x_0) − F(x∗))/ε)) iterations of stochastic gradient updates. One usually denotes by κ := L/σ the condition number of the problem, and it has been an open question how to obtain an accelerated method that gives the optimal square-root dependence on κ, rather than a linear dependence on κ.

The recent Catalyst results [13, 19], obtained by two independent groups of researchers, partially answered this open question. They demonstrate that one can solve (1.1) in only O((n + √(nκ)) · log κ · log(1/ε)) stochastic iterations, through a black-box reduction (which they refer to as Catalyst) applied to non-accelerated methods. Their result is still imperfect, at least for the following reasons:
• Optimality. It does not match the optimal dependence on κ. It does not give the optimal rate 1/√ε if F(·) is not strongly convex. It does not give the optimal rate 1/√ε if f(·) is non-smooth. It does not give the optimal rate 1/ε if both F(·) is not strongly convex and f(·) is non-smooth. To the best of our knowledge, it does not support non-Euclidean norm smoothness on f_i(·).
• Practicality. Catalyst is not very practical, since each of its inner iterations needs to be executed very accurately. This makes the stopping criterion hard to tune, and makes Catalyst slower than its competitors on several subclasses of problem (1.1) (such as the famous ERM problems; see Section 1.2).

• Non-convexity. The non-accelerated variance-reduction algorithms apply very well even to non-convex problems in practice (such as training neural nets both empirically [15] and theoretically [2]). Therefore, it is very desirable to develop a direct accelerated variance-reduction method due to its potential applicability to non-convex problems as well. Unfortunately, the Catalyst reduction does not seem to help in non-convex cases.
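Returning to the snapshot-based estimator described before the list above, the following minimal sketch (ours, with hypothetical function names) shows how ∇̃_k = ∇f_i(x_k) − ∇f_i(x̃) + ∇f(x̃) is typically formed; it is unbiased, and its variance vanishes as x_k and x̃ both approach the minimizer.

    import numpy as np

    def vr_gradient_estimator(grad_fi, x_k, x_snap, full_grad_snap, i):
        # grad_fi(i, x) returns the gradient of the i-th component at x;
        # full_grad_snap = (1/n) * sum_i grad_fi(i, x_snap) is recomputed only at snapshots.
        return grad_fi(i, x_k) - grad_fi(i, x_snap) + full_grad_snap

    # Example with least-squares components f_i(x) = 0.5*(<a_i, x> - l_i)^2:
    def make_grad_fi(A, l):
        def grad_fi(i, x):
            return (A[i] @ x - l[i]) * A[i]
        return grad_fi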

1.1   Our Results

In this paper, we provide a direct, accelerated stochastic gradient method Katyusha for solving (1.1) in O((n + √(nκ)) · log((F(x_0) − F(x∗))/ε)) iterations, where x_0 is the given starting vector. Each iteration of Katyusha requires only the computation of O(1) stochastic gradients ∇f_i(x). This gives both the optimal dependence on κ and on ε which, to the best of our knowledge, was never obtained before among the class of stochastic gradient methods. (Of course, the non-stochastic full-gradient method of Nesterov achieves this optimal dependence on κ and ε.)

If F(·) is not strongly convex, Katyusha can also work for such objectives and needs a total of O(n · log((F(x_0) − F(x∗))/ε) + √(nL) · ||x_0 − x∗|| / √ε) iterations. This gives the optimal rate 1/√ε which, to the best of our knowledge, was never obtained before among the class of stochastic gradient methods. (Again, the non-stochastic full-gradient method of Nesterov can achieve the optimal 1/√ε rate.)

Our Algorithm. Ignoring the proximal term ψ(·), our Katyusha method iteratively updates:
• x_{k+1} ← τ_1 z_k + τ_2 x̃ + (1 − τ_1 − τ_2) y_k ;
• define ∇̃_{k+1} ← ∇f(x̃) + ∇f_i(x_{k+1}) − ∇f_i(x̃), where i is a random index in [n] ;
• y_{k+1} ← x_{k+1} − (1/(3L)) ∇̃_{k+1} ; and
• z_{k+1} ← z_k − α ∇̃_{k+1} .
Above, x̃ is a snapshot point which is updated every n iterations. The vector ∇̃_{k+1} is the gradient estimator, which satisfies E_i[∇̃_{k+1}] = ∇f(x_{k+1}) and is defined in the variance-reduction manner, similar to the known non-accelerated variance-reduction methods. Keeping a sequence of three vectors (x_k, y_k, z_k) is a common ingredient found in all existing accelerated methods. (One can of course rewrite the algorithm to keep track of only two vectors per iteration; however, that would make the statement of the algorithm less clean, so we refrain from doing so in this paper.)

Our New Technique – Negative Momentum. The most surprising part of Katyusha is the novel choice of x_{k+1}, which is a convex combination of three vectors: y_k, z_k, and x̃. Our theoretical analysis suggests the parameter choices τ_2 = 0.5 and τ_1 = min{√(nσ/L), 0.5}. To properly explain this novel combination, let us recall a "momentum" view of accelerated methods.

In a classical accelerated non-stochastic gradient method, x_{k+1} is only a convex combination of y_k and z_k (see for instance [4]). In fact, z_k plays the role of "momentum", which adds a weighted sum of the history of the gradients into y_{k+1}. As an illustrative example, suppose that τ_2 = 0, τ_1 = τ, and x_0 = y_0 = z_0. Then, one can compute that

    y_1 = x_0 − (1/(3L)) ∇̃_1 ;
    y_2 = x_0 − (1/(3L)) ∇̃_2 − ( (1−τ)/(3L) + τα ) ∇̃_1 ;
    y_3 = x_0 − (1/(3L)) ∇̃_3 − ( (1−τ)/(3L) + τα ) ∇̃_2 − ( (1−τ)²/(3L) + (1 − (1−τ)²) α ) ∇̃_1 .

Since the parameter α is usually much larger than 1/(3L), the above recursion suggests that one gradually increases the weight of gradients from earlier iterations; this is known as "momentum" and is at the heart of accelerated first-order methods.

Unfortunately, momentum is very dangerous for stochastic gradients. For instance, if one of the historical gradient estimators ∇̃_t is somewhat inaccurate (i.e., it is very different from ∇f(x_t)), then
further moving in this direction may put us in trouble and fail to decrease the objective. This is one of the major reasons why a majority of researchers working on stochastic gradient descent have found acceleration / momentum not very useful in practice.

In Katyusha, we put a "magnet" around x̃, which we define to be essentially the average of the y_k's from the most recent n iterations. Whenever we define x_{k+1}, it is attracted by the magnet x̃ with weight τ_2 = 0.5. This is a very strong magnet: it ensures that x_{k+1} is not too far away from x̃, so the gradient estimator remains reasonably accurate; at the same time, it retracts x_{k+1} back towards x̃, which can be understood as introducing a negative momentum that removes a fraction of the past stochastic gradients. This summarizes the high-level idea behind Katyusha; its formal convergence analysis can be found in the subsequent sections.

Comparison with Other Accelerated Gradient Methods. For smooth convex minimization problems, gradient descent converges in O(L/ε) iterations, or O((L/σ) · log(1/ε)) iterations if the objective is σ-strongly convex. This is not optimal among the class of first-order methods. In 1983, Nesterov showed that the optimal rates are O(√(L/ε)) and O(√(L/σ) · log(1/ε)) respectively, and this was achieved by his celebrated accelerated gradient descent method [24].

Randomized Coordinate Descent. An alternative way to define the gradient estimator is to set ∇̃_k = d · ∇_i f(x_k) · e_i, where ∇_i f(x_k) is the i-th coordinate of the gradient, e_i is the i-th basis vector, and i is chosen uniformly at random from {1, 2, ..., d}. This is known as (randomized) coordinate descent, as opposed to stochastic gradient descent. We emphasize that designing accelerated methods for coordinate descent is significantly easier than doing so for stochastic gradient descent, and this has indeed been done in many previous results, including [7, 12, 18, 20, 21, 26]. The reason can be understood as follows. If a function f(·) is L-smooth with respect to coordinate i, then the constant-step update x′ ← x − (1/L) ∇_i f(x) · e_i at least guarantees a decrease of the objective, i.e., f(x − (1/L) ∇_i f(x) · e_i) ≤ f(x). Decreasing the objective value is usually viewed as an important component of existing accelerated methods; see for instance the gradient descent step summarized in [4]. Unfortunately, this property fails for stochastic gradient descent, because f(x_k − η∇̃_k) may be even larger than f(x_k) even for a very small step length η > 0. The state-of-the-art accelerated coordinate descent method is NUACDM [7].

Linear Coupling. In a recent work, Allen-Zhu and Orecchia proposed a framework called linear coupling that facilitates the design of accelerated gradient methods [4]. Their framework not only reconstructs Nesterov's accelerated (full-)gradient method [4] and yields an even faster accelerated coordinate descent method [7], but also leads to several recent breakthroughs in designing accelerated methods for non-smooth problems (such as positive LP [5, 6] and positive SDP [3]) and even for general non-convex problems [2]. The present paper also falls into this linear-coupling framework.
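The inner-loop updates described in this subsection (with the proximal term ψ ignored and uniform sampling) can be sketched as follows. This is our own simplified illustration of the four update rules and the "magnet" weight τ_2 = 0.5, not the full Algorithm 1 given later; the variable names and the exact constant inside τ_1 are our assumptions.

    import numpy as np

    def katyusha_inner_step(x_snap, full_grad_snap, y, z, grad_fi, n, L, sigma):
        tau2 = 0.5
        tau1 = min(np.sqrt(n * sigma / L), 0.5)   # as suggested by the analysis, up to constants
        alpha = 1.0 / (3.0 * tau1 * L)
        # negative momentum: x_{k+1} is attracted toward the snapshot (the "magnet")
        x = tau1 * z + tau2 * x_snap + (1.0 - tau1 - tau2) * y
        i = np.random.randint(n)
        grad = full_grad_snap + grad_fi(i, x) - grad_fi(i, x_snap)   # variance-reduced estimator
        y_new = x - grad / (3.0 * L)        # short, gradient-descent-style step
        z_new = z - alpha * grad            # long, mirror-descent-style step (momentum)
        return y_new, z_new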

1.2   Optimal Convergence Rates for Empirical Risk Minimization Problems

There are a few interesting subcategories of problem (1.1), and each of them corresponds to a well-known training problem in machine learning. Suppose we are given n vectors a_1, ..., a_n ∈ R^d that are the feature vectors of n samples. Then, it is interesting to study the special case of (1.1) where each f_i(x) is "rank-one" structured, that is, f_i(x) := f_i(⟨a_i, x⟩). (Assuming "rank-one" structure simplifies the notation; all of the results stated in this subsection generalize to rank-O(1) structured f_i(x).) In such a case, we rewrite (1.1) as

    min_{x ∈ R^d}  { F(x) := f(x) + ψ(x) := (1/n) Σ_{i=1}^{n} f_i(⟨a_i, x⟩) + ψ(x) } .        (1.2)



Without loss of generality, we assume that each a_i has Euclidean norm 1. Denoting by l_i the training label of data sample a_i, one can consider the following four interesting classes of (1.2):

Case 1: ψ(x) is σ-strongly convex and f(x) is L-smooth. Examples:
• ridge regression: f(x) = (1/(2n)) Σ_{i=1}^{n} (⟨a_i, x⟩ − l_i)² and ψ(x) = (σ/2)||x||₂².
• elastic net: f(x) = (1/(2n)) Σ_{i=1}^{n} (⟨a_i, x⟩ − l_i)² and ψ(x) = (σ/2)||x||₂² + λ||x||₁.

Case 2: ψ(x) is not strongly convex and f(x) is L-smooth. Examples:
• Lasso: f(x) = (1/(2n)) Σ_{i=1}^{n} (⟨a_i, x⟩ − l_i)² and ψ(x) = λ||x||₁.
• ℓ1-regularized logistic regression: f(x) = (1/n) Σ_{i=1}^{n} log(1 + exp(−l_i ⟨a_i, x⟩)) and ψ(x) = λ||x||₁.

Case 3: ψ(x) is σ-strongly convex and f(x) is non-smooth but Lipschitz continuous. Example:
• SVM: f(x) = (1/n) Σ_{i=1}^{n} max{0, 1 − l_i ⟨a_i, x⟩} and ψ(x) = σ||x||₂².

Case 4: ψ(x) is not strongly convex and f(x) is non-smooth but Lipschitz continuous. Example:
• ℓ1-SVM: f(x) = (1/n) Σ_{i=1}^{n} max{0, 1 − l_i ⟨a_i, x⟩} and ψ(x) = λ||x||₁.
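For the "rank-one" structure f_i(x) = f_i(⟨a_i, x⟩) used in all four cases, every stochastic gradient is a scalar multiple of the feature vector, ∇f_i(x) = f_i′(⟨a_i, x⟩) · a_i. The following sketch (ours; the losses are just the examples listed above) illustrates this for the squared, logistic, and hinge losses.

    import numpy as np

    def rank_one_grad(a_i, l_i, x, loss):
        # (Sub)gradient of f_i(x) = loss(<a_i, x>; l_i); it is always a multiple of a_i.
        m = a_i @ x
        if loss == "squared":            # 0.5 * (<a_i,x> - l_i)^2
            s = m - l_i
        elif loss == "logistic":         # log(1 + exp(-l_i * <a_i,x>))
            s = -l_i / (1.0 + np.exp(l_i * m))
        elif loss == "hinge":            # max{0, 1 - l_i * <a_i,x>}  (a subgradient)
            s = -l_i if l_i * m < 1.0 else 0.0
        else:
            raise ValueError(loss)
        return s * a_i

For instance, with ||a_i|| = 1 the squared loss above is 1-smooth, since ||∇f_i(x) − ∇f_i(y)|| = |⟨a_i, x − y⟩| ≤ ||x − y||, while the hinge loss is 1-Lipschitz but non-smooth.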

For all of the four classes above, accelerated stochastic methods have already been introduced in the literature, most notably AccSDCA [31], APCG [20], and SPDC [35]. However, to the best of our knowledge, all known accelerated methods have suboptimal convergence rates for Cases 2, 3 and 4. (In fact, they also have a suboptimal dependence on the condition number L/σ for Case 1.) In particular, the best known convergence rate was log(1/ε)/√ε for Cases 2 and 3, and log(1/ε)/ε for Case 4. This is a factor log(1/ε) worse than the optimal rate for each of the three classes, and is especially interesting to the optimization community because obtaining the "optimal rate" is one of the ultimate goals in optimization. (See for instance an interesting attempt by Lan and Zhou [17] at removing this log factor, as well as successful attempts at removing a similar log factor in online learning, which is a different problem [14, 16, 27].)

Our Katyusha algorithm simultaneously closes the gap for all three of these classes of problems, with the help of the recent optimal reductions by Allen-Zhu and Hazan [1]. In short, we obtain an ε-approximate minimizer for Case 2 in O(n log(1/ε) + √(nL)/√ε) iterations, for Case 3 in O(n log(1/ε) + √(n/(σε))) iterations, and for Case 4 in O(n log(1/ε) + √n/ε) iterations. In contrast, none of the existing accelerated methods can achieve such optimal rates even if the optimal reductions of [1] are used; see the discussion in Section 5.

Besides obtaining the optimal rates for the mentioned classes, our algorithm is also a primal-only method, because we only need to compute ∇f_i(x) at different points x and for different indices i. None of the existing accelerated stochastic methods for solving ERM problems are primal-only, and this is essentially the reason that their convergence rates are suboptimal. (This is so because, even for Case 1, converting an ε-approximate maximizer of the dual objective to the primal only yields an nκε-approximate minimizer of the primal objective. As a result, algorithms like APCG that work directly on the dual, algorithms like SPDC that maintain both primal and dual variables, and algorithms like RPDG [17] that are primal-like but still use a dual analysis, all have to suffer this log loss in their convergence rates.)

1.3   Other Extensions

Mini-batch. Katyusha naturally extends to the mini-batch scenario. Instead of using a single stochastic gradient ∇f_i(·) per iteration, one can use the average (1/b) Σ_{i∈S} ∇f_i(·) of b stochastic gradients, where S is a random subset of [n] of cardinality b (a small sketch appears at the end of this subsection). All of our results extend
to this setting; the only change needed in the algorithm is to re-compute the snapshot every n/b iterations rather than every n iterations.

Non-Uniform Sampling. If each f_i(·) has a different smoothness parameter L_i, then one has to select the random index i from a non-uniform distribution in order to obtain the fastest running time. This can be done following the same techniques proposed in [7], but it would make the notation significantly heavier, especially in the presence of the proximal term ψ(·). We refrain from doing so in this version of the paper.

Non-Euclidean Norms. If the smoothness of the functions f_i(x) is with respect to a non-Euclidean norm (such as the well-known ℓ1-norm case over the simplex), our results in this paper still hold. In particular, the update on the y_{k+1} side becomes a non-Euclidean-norm gradient descent step, and the update on the z_{k+1} side becomes a non-Euclidean-norm mirror descent step with respect to the Bregman divergence of a strongly convex potential function. Our analysis can be translated to this more general scenario following the techniques of [4]. Since this extension is simple but complicates the notation, we defer it to a full version of this paper.
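As a small illustration of the mini-batch extension (our own sketch, with our own function names), the single stochastic gradient in the variance-reduced estimator is simply replaced by an average over a random subset S of size b:

    import numpy as np

    def minibatch_vr_estimator(grad_fi, x, x_snap, full_grad_snap, n, b, rng):
        # Average of b variance-reduced stochastic gradients over a random subset S of [n].
        S = rng.choice(n, size=b, replace=False)
        diffs = [grad_fi(i, x) - grad_fi(i, x_snap) for i in S]
        return full_grad_snap + sum(diffs) / b

The snapshot would then be refreshed every n/b iterations, as stated above.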

1.4   Conclusion

Roadmap. We provide the necessary notation and useful theorems in Section 2. In Section 3, we analyze a single iteration of Katyusha, and in Section 4 we provide the convergence analysis of Katyusha for the strongly convex case of problem (1.1). In Section 5, we apply Katyusha to non-strongly convex or non-smooth objectives via the optimal reductions of [1]. In Section 6, we provide a direct algorithm for solving the non-strongly convex case of problem (1.1) with the optimal 1/√ε rate, and compare it with the literature. We shall include experimental results in a future version of this paper.

2   Preliminaries

Throughout this paper, we denote by ||·|| the Euclidean norm. We denote by ∇f(x) the full gradient of f if it is differentiable, or a subgradient if f is only Lipschitz continuous. We recall some classical definitions on strong convexity and smoothness.

Definition 2.1 (smoothness and strong convexity). For a convex function f: R^n → R,
• We say f is σ-strongly convex if for all x, y ∈ R^n, it satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)||x − y||².
• We say f is L-smooth if for all x, y ∈ R^n, it satisfies ||∇f(x) − ∇f(y)|| ≤ L||x − y||.

We also need the following definition by Allen-Zhu and Hazan:

Definition 2.2 ([1]). An algorithm solving the strongly convex case of problem (1.1) satisfies the homogeneous objective decrease (HOOD) property with time Time(L, σ) if, for every starting point x_0, it produces an output x′ satisfying E[F(x′) − F(x∗)] ≤ (F(x_0) − F(x∗))/4 in time at most Time(L, σ).

Allen-Zhu and Hazan provided three black-box reduction algorithms, AdaptReg, AdaptSmooth, and JointAdaptRegSmooth, in their paper [1] to convert any algorithm satisfying the HOOD property with time Time(L, σ) respectively to (1) the non-strongly convex but smooth case, (2) the strongly convex but non-smooth case, and (3) the non-strongly convex and non-smooth case. We simplify and restate their theorems as follows:


Algorithm 1 Katyusha(x_0, S, σ, L)
 1: m ← n;                                              ▷ the time window for re-computing the snapshot x̃
 2: τ_2 ← 1/2,  τ_1 ← min{ √(mσ)/√(3L), 1/2 },  α ← 1/(3τ_1 L);      ▷ parameters
 3: y_0 = z_0 = x̃^0 ← x_0;                              ▷ initial vectors
 4: for s ← 0 to S − 1 do
 5:     µ^s ← ∇f(x̃^s);                                  ▷ compute the full gradient only once every m iterations
 6:     for j ← 0 to m − 1 do
 7:         k ← (sm) + j;
 8:         x_{k+1} ← τ_1 z_k + τ_2 x̃^s + (1 − τ_1 − τ_2) y_k;
 9:         ∇̃_{k+1} ← µ^s + ∇f_i(x_{k+1}) − ∇f_i(x̃^s), where i is chosen uniformly at random from {1, 2, ..., n};
10:         z_{k+1} = arg min_z { (1/(2α)) ||z − z_k||² + ⟨∇̃_{k+1}, z⟩ + ψ(z) };
11:         Option I:  y_{k+1} ← arg min_y { (3L/2) ||y − x_{k+1}||² + ⟨∇̃_{k+1}, y⟩ + ψ(y) };
12:         Option II: y_{k+1} ← x_{k+1} + τ_1 (z_{k+1} − z_k);
                            ▷ we analyze only Option I in this paper, but Option II also works
13:     end for
14:     x̃^{s+1} ← ( Σ_{j=0}^{m−1} (1 + ασ)^j )^{−1} · Σ_{j=0}^{m−1} (1 + ασ)^j · x_{sm+j+1};
                            ▷ weighted average of the previous m iterations
15: end for
16: return x̃^S.

Theorem 2.3 (AdaptReg). Suppose that in problem (1.1), f(·) is L-smooth and x_0 is a starting vector. Then, AdaptReg produces an output x satisfying E[F(x) − F(x∗)] ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(L, σ_0 · 2^{−t}), where σ_0 = (F(x_0) − F(x∗)) / ||x_0 − x∗||² and T = log_2((F(x_0) − F(x∗))/ε).

Theorem 2.4 (AdaptSmooth). Suppose that in problem (1.2), ψ(·) is σ-strongly convex, each f_i(·) is G-Lipschitz continuous, and x_0 is a starting vector. Then, AdaptSmooth produces an output x satisfying E[F(x) − F(x∗)] ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(2^t/λ_0, σ), where λ_0 = (F(x_0) − F(x∗)) / G² and T = log_2((F(x_0) − F(x∗))/ε).

Theorem 2.5 (JointAdaptRegSmooth). Suppose that in problem (1.2), each f_i(·) is G-Lipschitz continuous and x_0 is a starting vector. Then, JointAdaptRegSmooth produces an output x satisfying E[F(x) − F(x∗)] ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(2^t/λ_0, σ_0 · 2^{−t}), where λ_0 = (F(x_0) − F(x∗)) / G², σ_0 = (F(x_0) − F(x∗)) / ||x_0 − x∗||², and T = log_2((F(x_0) − F(x∗))/ε).
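The following is a minimal, dense-vector Python sketch of Algorithm 1 (Option I) specialized to ridge regression, i.e., f_i(x) = ½(⟨a_i, x⟩ − l_i)² and ψ(x) = (σ/2)||x||₂², for which the proximal steps in Lines 10–11 have simple closed forms. It is our own illustration under these assumptions, not an official implementation; it also omits the sparse tricks mentioned in Remark 4.2.

    import numpy as np

    def katyusha_ridge(A, l, sigma, L, S, x0):
        # Sketch of Algorithm 1 (Option I) for f_i(x) = 0.5*(<a_i,x> - l_i)^2, psi(x) = (sigma/2)*||x||^2.
        n, d = A.shape
        m = n                                               # epoch length (Line 1)
        tau2 = 0.5
        tau1 = min(np.sqrt(m * sigma / (3.0 * L)), 0.5)     # Line 2
        alpha = 1.0 / (3.0 * tau1 * L)
        theta = 1.0 + alpha * sigma
        y, z, x_snap = x0.copy(), x0.copy(), x0.copy()
        rng = np.random.default_rng(0)

        def grad_fi(i, x):
            return (A[i] @ x - l[i]) * A[i]

        for s in range(S):
            mu = A.T @ (A @ x_snap - l) / n                 # full gradient at the snapshot (Line 5)
            weighted_sum, weight_total = np.zeros(d), 0.0
            for j in range(m):
                x = tau1 * z + tau2 * x_snap + (1 - tau1 - tau2) * y      # Line 8
                i = rng.integers(n)
                g = mu + grad_fi(i, x) - grad_fi(i, x_snap)               # Line 9
                z = (z - alpha * g) / (1.0 + alpha * sigma)               # Line 10, closed form for ridge psi
                y = (3.0 * L * x - g) / (3.0 * L + sigma)                 # Line 11 (Option I), closed form
                w = theta ** j
                # new snapshot: (1+alpha*sigma)^j-weighted average of the recent iterates
                # (cf. Line 14; here we average the y's, as described in Section 1.1)
                weighted_sum += w * y
                weight_total += w
            x_snap = weighted_sum / weight_total
        return x_snap

With feature rows normalized to unit norm one can take L = 1 and, e.g., σ of order 1/n, in which case τ_1 ≈ 1/√3 · ... is capped at 1/2 only for very well-conditioned problems.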

3   One-Iteration Analysis

In this section, we focus on analyzing the behavior of Katyusha (see Algorithm 1) in a single iteration (i.e., for a fixed k). We view y_k, z_k and x_{k+1} as fixed in this section, so the only randomness comes from the choice of i in iteration k. We abbreviate x̃ = x̃^s, where s is the epoch that iteration k belongs to, and denote by σ_{k+1}² := ||∇f(x_{k+1}) − ∇̃_{k+1}||², so that E[σ_{k+1}²] is the variance of the gradient estimator ∇̃_{k+1} in this iteration.

Our first lemma lower bounds the expected objective decrease F(x_{k+1}) − E[F(y_{k+1})]. The quantity Prog(x_{k+1}) defined below is a non-negative, classical quantity that would be a lower bound on the amount of objective decrease if ∇̃_{k+1} were equal to ∇f(x_{k+1}); see for instance [4]. However, since the variance σ_{k+1}² is non-zero, this lower bound must be compensated by a negative term that depends on E[σ_{k+1}²].

Lemma 3.1 (proximal gradient descent). If

    y_{k+1} = arg min_y { (3L/2)||y − x_{k+1}||² + ⟨∇̃_{k+1}, y − x_{k+1}⟩ + ψ(y) − ψ(x_{k+1}) } , and
    Prog(x_{k+1}) := − min_y { (3L/2)||y − x_{k+1}||² + ⟨∇̃_{k+1}, y − x_{k+1}⟩ + ψ(y) − ψ(x_{k+1}) } ≥ 0 ,

we have

    F(x_{k+1}) − E[F(y_{k+1})] ≥ E[Prog(x_{k+1})] − (1/(4L)) E[σ_{k+1}²] .

Proof.

    Prog(x_{k+1}) = − min_y { (3L/2)||y − x_{k+1}||² + ⟨∇̃_{k+1}, y − x_{k+1}⟩ + ψ(y) − ψ(x_{k+1}) }
    = − ( (3L/2)||y_{k+1} − x_{k+1}||² + ⟨∇̃_{k+1}, y_{k+1} − x_{k+1}⟩ + ψ(y_{k+1}) − ψ(x_{k+1}) )                       (①)
    = − ( (L/2)||y_{k+1} − x_{k+1}||² + ⟨∇f(x_{k+1}), y_{k+1} − x_{k+1}⟩ + ψ(y_{k+1}) − ψ(x_{k+1}) )
        − L||y_{k+1} − x_{k+1}||² + ⟨∇f(x_{k+1}) − ∇̃_{k+1}, y_{k+1} − x_{k+1}⟩
    ≤ − ( f(y_{k+1}) − f(x_{k+1}) + ψ(y_{k+1}) − ψ(x_{k+1}) ) + (1/(4L))||∇f(x_{k+1}) − ∇̃_{k+1}||² .                    (②)

Above, ① is by the definition of y_{k+1}, and ② uses the smoothness of the function f(·), as well as the inequality ⟨a, b⟩ − ½||b||² ≤ ½||a||². Taking expectations on both sides, we arrive at the desired result. □

The following lemma provides a novel upper bound on the expected variance of the gradient estimator. All known variance-reduction analyses for convex optimization, in one way or another, upper bound this variance essentially by 4L · (f(x̃) − f(x∗)), the objective distance to the minimizer (c.f. [10, 15]). The recent work of Allen-Zhu and Hazan [1] upper bounds it by the point distance ||x_{k+1} − x̃||² for non-convex objectives, which is tighter when x̃ is close to x_{k+1} but unfortunately not enough for the purpose of this paper. In this paper, we upper bound it by the tightest possible quantity, which is almost 2L · (f(x̃) − f(x_{k+1})) ≤ 4L · (f(x̃) − f(x∗)). Unfortunately, this upper bound has to be compensated by an additional term ⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩, which could be positive but which we shall cancel using the introduced negative momentum.

Lemma 3.2.

    E[ ||∇̃_{k+1} − ∇f(x_{k+1})||² ] ≤ 2L · ( f(x̃) − f(x_{k+1}) − ⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ ) .

Proof. Each f_i(x), being convex and L-smooth, satisfies the following inequality, which is classical in convex optimization and can be found for instance in Theorem 2.1.5 of the textbook of Nesterov [25]:

    ||∇f_i(x_{k+1}) − ∇f_i(x̃)||² ≤ 2L ( f_i(x̃) − f_i(x_{k+1}) − ⟨∇f_i(x_{k+1}), x̃ − x_{k+1}⟩ ) .

Therefore, taking the expectation over the random choice of i, we have

    E[ ||∇̃_{k+1} − ∇f(x_{k+1})||² ] = E[ || (∇f_i(x_{k+1}) − ∇f_i(x̃)) − (∇f(x_{k+1}) − ∇f(x̃)) ||² ]
    ≤ E[ ||∇f_i(x_{k+1}) − ∇f_i(x̃)||² ]                                                             (①)
    ≤ 2L · E[ f_i(x̃) − f_i(x_{k+1}) − ⟨∇f_i(x_{k+1}), x̃ − x_{k+1}⟩ ]
    = 2L · ( f(x̃) − f(x_{k+1}) − ⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ ) .

Above, ① is because for any random vector ζ ∈ R^d, it holds that E||ζ − Eζ||² = E||ζ||² − ||Eζ||². □
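As an independent numerical sanity check of Lemma 3.2 (our own snippet, not part of the paper), one can draw random least-squares components and verify that the variance of the estimator never exceeds the stated bound:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, L = 50, 10, 1.0
    A = rng.standard_normal((n, d)); A /= np.linalg.norm(A, axis=1, keepdims=True)   # ||a_i|| = 1, so L = 1
    l = rng.standard_normal(n)

    def grad_fi(i, x): return (A[i] @ x - l[i]) * A[i]
    def f(x): return 0.5 * np.mean((A @ x - l) ** 2)
    def grad_f(x): return A.T @ (A @ x - l) / n

    x_next, x_snap = rng.standard_normal(d), rng.standard_normal(d)
    est = np.array([grad_fi(i, x_next) - grad_fi(i, x_snap) + grad_f(x_snap) for i in range(n)])
    variance = np.mean(np.sum((est - grad_f(x_next)) ** 2, axis=1))
    bound = 2 * L * (f(x_snap) - f(x_next) - grad_f(x_next) @ (x_snap - x_next))
    assert variance <= bound + 1e-9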



The next lemma is a classical one for proximal mirror descent.

Lemma 3.3 (proximal mirror descent). Suppose ψ(·) is σ-strongly convex. If ∇̃_{k+1} is fixed and

    z_{k+1} = arg min_z { ½||z − z_k||² + α⟨∇̃_{k+1}, z − z_k⟩ + αψ(z) − αψ(z_k) } ,

then it satisfies for all u ∈ R^d,

    α⟨∇̃_{k+1}, z_{k+1} − u⟩ + αψ(z_{k+1}) − αψ(u) ≤ −½||z_k − z_{k+1}||² + ½||z_k − u||² − ((1+ασ)/2)||z_{k+1} − u||² .

Proof. By the minimality definition of z_{k+1}, we have

    z_{k+1} − z_k + α∇̃_{k+1} + αg = 0 ,

where g is some subgradient of ψ(z) at the point z = z_{k+1}. This implies that for every u it satisfies

    0 = ⟨ z_{k+1} − z_k + α∇̃_{k+1} + αg, z_{k+1} − u ⟩ .

At this point, using the equality ⟨z_{k+1} − z_k, z_{k+1} − u⟩ = ½||z_k − z_{k+1}||² − ½||z_k − u||² + ½||z_{k+1} − u||², as well as the inequality ⟨g, z_{k+1} − u⟩ ≥ ψ(z_{k+1}) − ψ(u) + (σ/2)||z_{k+1} − u||², which comes from the strong convexity of ψ(·), we can write

    α⟨∇̃_{k+1}, z_{k+1} − u⟩ + αψ(z_{k+1}) − αψ(u)
    = −⟨z_{k+1} − z_k, z_{k+1} − u⟩ − ⟨αg, z_{k+1} − u⟩ + αψ(z_{k+1}) − αψ(u)
    ≤ −½||z_k − z_{k+1}||² + ½||z_k − u||² − ((1+ασ)/2)||z_{k+1} − u||² .    □

The following lemma combines Lemma 3.1, Lemma 3.2 and Lemma 3.3 together, using the special choice of x_{k+1} as a convex combination of y_k, z_k and x̃.

Lemma 3.4 (coupling step 1). If x_{k+1} = τ_1 z_k + τ_2 x̃ + (1 − τ_1 − τ_2) y_k, where τ_1 ≤ 1/(3αL) and τ_2 = 1/2, then it satisfies

    α⟨∇f(x_{k+1}), z_k − u⟩ − αψ(u)
    ≤ (α/τ_1) ( F(x_{k+1}) − E[F(y_{k+1})] + τ_2 F(x̃) − τ_2 E[f(x_{k+1})] − τ_2 ⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ )
      + ½||z_k − u||² − ((1+ασ)/2) E[||z_{k+1} − u||²] + (α(1−τ_1−τ_2)/τ_1) ψ(y_k) − (α/τ_1) ψ(x_{k+1}) .

Proof. We first apply Lemma 3.3 and get

    α⟨∇̃_{k+1}, z_k − u⟩ + αψ(z_{k+1}) − αψ(u)
    = α⟨∇̃_{k+1}, z_k − z_{k+1}⟩ + α⟨∇̃_{k+1}, z_{k+1} − u⟩ + αψ(z_{k+1}) − αψ(u)
    ≤ α⟨∇̃_{k+1}, z_k − z_{k+1}⟩ − ½||z_k − z_{k+1}||² + ½||z_k − u||² − ((1+ασ)/2)||z_{k+1} − u||² .        (3.1)

By defining v := τ_1 z_{k+1} + τ_2 x̃ + (1 − τ_1 − τ_2) y_k, we have x_{k+1} − v = τ_1(z_k − z_{k+1}) and therefore

    E[ α⟨∇̃_{k+1}, z_k − z_{k+1}⟩ − ½||z_k − z_{k+1}||² ]
    = E[ (α/τ_1)⟨∇̃_{k+1}, x_{k+1} − v⟩ − (1/(2τ_1²))||x_{k+1} − v||² ]
    = E[ (α/τ_1)( ⟨∇̃_{k+1}, x_{k+1} − v⟩ − (1/(2ατ_1))||x_{k+1} − v||² − ψ(v) + ψ(x_{k+1}) ) + (α/τ_1)( ψ(v) − ψ(x_{k+1}) ) ]
    ≤ E[ (α/τ_1)( ⟨∇̃_{k+1}, x_{k+1} − v⟩ − (3L/2)||x_{k+1} − v||² − ψ(v) + ψ(x_{k+1}) ) + (α/τ_1)( ψ(v) − ψ(x_{k+1}) ) ]        (①)
    ≤ E[ (α/τ_1)( F(x_{k+1}) − F(y_{k+1}) + (1/(4L))σ_{k+1}² ) + (α/τ_1)( ψ(v) − ψ(x_{k+1}) ) ]                                  (②)
    ≤ E[ (α/τ_1)( F(x_{k+1}) − F(y_{k+1}) + ½( f(x̃) − f(x_{k+1}) − ⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ ) )
         + (α/τ_1)( τ_1 ψ(z_{k+1}) + τ_2 ψ(x̃) + (1−τ_1−τ_2) ψ(y_k) − ψ(x_{k+1}) ) ] .                                           (③)  (3.2)

Above, ① uses our choice τ_1 ≤ 1/(3αL), ② uses Lemma 3.1, and ③ uses Lemma 3.2 together with the convexity of ψ(·) applied to v = τ_1 z_{k+1} + τ_2 x̃ + (1−τ_1−τ_2) y_k. Finally, noticing that E[⟨∇̃_{k+1}, z_k − u⟩] = ⟨∇f(x_{k+1}), z_k − u⟩ and τ_2 = 1/2, we obtain the desired inequality by combining (3.1) and (3.2). □

The next lemma simplifies the left-hand side of Lemma 3.4 using the convexity of f(·), and gives an inequality that relates the objective-distance-to-minimizer quantities F(y_k) − F(x∗), F(y_{k+1}) − F(x∗), and F(x̃) − F(x∗) to the point-distance-to-minimizer quantities ||z_k − x∗||² and ||z_{k+1} − x∗||².

Lemma 3.5 (coupling step 2). Under the same choices of τ_1, τ_2 as in Lemma 3.4, we have

    0 ≤ (α(1−τ_1−τ_2)/τ_1) (F(y_k) − F(x∗)) − (α/τ_1) E[F(y_{k+1}) − F(x∗)] + (α/τ_1) ( τ_2 F(x̃) − τ_2 F(x∗) )
        + ½||z_k − x∗||² − ((1+ασ)/2) E[||z_{k+1} − x∗||²] .

Proof. We first compute that

    α ( f(x_{k+1}) − f(u) )
    ≤ α⟨∇f(x_{k+1}), x_{k+1} − u⟩                                                                         (①)
    = α⟨∇f(x_{k+1}), x_{k+1} − z_k⟩ + α⟨∇f(x_{k+1}), z_k − u⟩
    = (ατ_2/τ_1)⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ + (α(1−τ_1−τ_2)/τ_1)⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + α⟨∇f(x_{k+1}), z_k − u⟩   (②)
    ≤ (ατ_2/τ_1)⟨∇f(x_{k+1}), x̃ − x_{k+1}⟩ + (α(1−τ_1−τ_2)/τ_1)( f(y_k) − f(x_{k+1}) ) + α⟨∇f(x_{k+1}), z_k − u⟩ .      (③)

Above, ① uses the convexity of f(·), ② uses the choice x_{k+1} = τ_1 z_k + τ_2 x̃ + (1−τ_1−τ_2) y_k, and ③ uses the convexity of f(·) again. By applying Lemma 3.4 to the above inequality, we have

    α ( f(x_{k+1}) − F(u) ) ≤ (α(1−τ_1−τ_2)/τ_1)( F(y_k) − f(x_{k+1}) )
        + (α/τ_1)( F(x_{k+1}) − E[F(y_{k+1})] + τ_2 F(x̃) − τ_2 f(x_{k+1}) )
        + ½||z_k − u||² − ((1+ασ)/2) E[||z_{k+1} − u||²] − (α/τ_1) ψ(x_{k+1}) ,

which implies

    α ( F(x_{k+1}) − F(u) ) ≤ (α(1−τ_1−τ_2)/τ_1)( F(y_k) − F(x_{k+1}) )
        + (α/τ_1)( F(x_{k+1}) − E[F(y_{k+1})] + τ_2 F(x̃) − τ_2 F(x_{k+1}) )
        + ½||z_k − u||² − ((1+ασ)/2) E[||z_{k+1} − u||²] .

After rearranging and setting u = x∗, the above inequality yields

    0 ≤ (α(1−τ_1−τ_2)/τ_1)( F(y_k) − F(x∗) ) − (α/τ_1) E[F(y_{k+1}) − F(x∗)] + (α/τ_1)( τ_2 F(x̃) − τ_2 F(x∗) )
        + ½||z_k − x∗||² − ((1+ασ)/2) E[||z_{k+1} − x∗||²] .    □

4   Strongly Convex Case

In this section we telescope Lemma 3.5 from the previous section across all iterations k, and prove the following theorem.

Theorem 4.1. If each f_i(x) is convex and L-smooth, and ψ(x) is σ-strongly convex in (1.1), then Katyusha(x_0, S, σ, L) satisfies

    E[F(x̃^S) − F(x∗)] ≤ O((1 + ασ)^{−Sm}) · (F(x_0) − F(x∗)) ,    if mσ/L ≤ 3/4 ;
    E[F(x̃^S) − F(x∗)] ≤ O(1.5^{−S}) · (F(x_0) − F(x∗)) ,          if mσ/L > 3/4 .

In other words, Katyusha achieves an ε-additive error (i.e., E[F(x̃^S) − F(x∗)] ≤ ε) using at most O( (n + √(nL/σ)) · log((F(x_0) − F(x∗))/ε) ) iterations. (As in all stochastic first-order methods, one can apply a Markov inequality to conclude that, with probability at least 2/3, Katyusha satisfies F(x̃^S) − F(x∗) ≤ ε within the same stated asymptotic running time.)
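To get a feel for the improvement from a linear to a square-root dependence on the condition number, here is a tiny back-of-the-envelope computation (ours) comparing the per-log-factor iteration counts n + κ of non-accelerated variance reduction with the n + √(nκ) of Theorem 4.1:

    n, kappa = 1_000_000, 1_000_000_000       # example sizes: n samples, condition number kappa
    nonaccelerated = n + kappa                 # SVRG/SAGA-style count per log(1/eps) factor
    katyusha = n + int((n * kappa) ** 0.5)     # Theorem 4.1: n + sqrt(n*kappa)
    print(nonaccelerated, katyusha)            # ~1.0e9 vs ~3.3e7, roughly a 30x reduction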

Remark 4.2. Because m = n, each iteration of Katyusha computes only O(1) stochastic gradients ∇f_i(·) in the amortized sense. Therefore, the per-iteration time complexity of Katyusha is dominated by the computation time of ∇f_i(·) and the proximal update in Line 10 of Algorithm 1, plus an O(d) overhead. If ∇f_i(·) has at most d′ ≤ d non-zero entries, this O(d) overhead can be improved to O(d′) using a sparse implementation of Katyusha. (This requires one to defer updates such as x_{k+1} ← τ_1 z_k + τ_2 x̃^s + (1 − τ_1 − τ_2) y_k rather than implementing them naively; an experienced programmer can consult for instance [5] for a detailed treatment of a similar issue on a different problem.) As a result, for all the ERM problems defined in (1.2), the amortized per-iteration complexity of Katyusha is only O(d′), where d′ is the sparsity of the feature vectors, asymptotically the same as the per-iteration complexity of SGD.

Proof of Theorem 4.1. We define D_k := F(y_k) − F(x∗) and D̃^s := F(x̃^s) − F(x∗), and rewrite Lemma 3.5 as follows:

0≤

 (1 − τ1 − τ2 ) 1 1 τ2  e s  1 + ασ  Dk − Dk+1 + E D + kzk − x∗ k2 − E kzk+1 − x∗ k2 . τ1 τ1 τ1 2α 2α

At this point, let us define θ = 1 + ασ and multiply the above inequality by θj for each k = sm + j. Then, we sum up the resulting m inequalities for all j = 0, 1, . . . , m − 1: m−1 m−1 h (1 − τ − τ ) m−1 i τ X X 1 X 1 2 2 es j j 0≤E Dsm+j · θ − Dsm+j+1 · θ + D · θj τ1 τ1 τ1 j=0

j=0

j=0

 1 θm  + kzsm − x∗ k2 − kz(s+1)m − x∗ k2 . 2α 2α

Note that in the above inequality we have assumed that all the randomness in the first s − 1 epochs is fixed and the only source of randomness comes from epoch s. We can rearrange the terms in the
above inequality and get m h τ + τ − (1 − 1/θ) X i (1 − τ − τ )  X   τ2 s m−1 1 2 1 2 e · Dsm − θm E D(s+1)m + D E Dsm+j · θj ≤ θj τ1 τ1 τ1 j=1

j=0

 θm  1 kzsm − x∗ k2 − E kz(s+1)m − x∗ k2 . + 2α 2α  P P m−1 j −1 j Using the special choice that x es+1 = · m−1 j=0 θ j=0 xsm+j+1 ·θ and the convexity of F (·), we  e s+1 ≤ Pm−1 θj −1 · Pm−1 Dsm+j+1 · θj . Substituting this into the above inequality, derive that D j=0 j=0 we get m−1 X   τ2 s m−1 τ1 + τ2 − (1 − 1/θ)  e s+1  X j (1 − τ1 − τ2 )  e · · θE D θ ≤ Dsm − θm E D(s+1)m + D θj τ1 τ1 τ1 j=0

j=0

+

We consider two cases next. Case 1. √ √mσ 3L

Suppose that

mσ L



3 4.

θm

  1 kzsm − x∗ k2 − E kz(s+1)m − x∗ k2 . 2α 2α

In this case, we choose α =

√ 1 3mσL

and τ1 =

1 3αL

(4.1)

= mασ =

∈ [0, 12 ] as in Katyusha. Our parameter choices imply ασ ≤ 1/2m and therefore the following inequality holds: 1 1 ) ≤ (m − 1)ασ + ασ = mασ = τ1 . τ2 (θm−1 − 1) + (1 − 1/θ) = ((1 + ασ)m−1 − 1) + (1 − 2 1 + ασ In other words, we have τ1 + τ2 − (1 − 1/θ) ≥ τ2 θm−1 and thus (4.1) implies that m−1 i hτ X 1 1 − τ1 − τ2 2 e s+1 E D · D(s+1)m + kz(s+1)m − x∗ k2 θj + τ1 τ1 2α j=0

≤θ

−m

·



2

τ1

es · D

m−1 X

θj +

j=0

 1 − τ1 − τ2 1 Dsm + kzsm − x∗ k2 τ1 2α

If we telescope the above inequality over all epochs s = 0, 1, . . . , S − 1, we obtain

     S  ¬ −Sm e ≤θ e 0 + D0 + τ1 kx0 − x∗ k2 E F (e xS ) − F (x∗ ) = E D ·O D αm   ­ τ 1 ≤ θ−Sm · O 1 + · (F (x0 ) − F (x∗ )) αmσ  ® = O((1 + ασ)−Sm ) · F (x0 ) − F (x∗ ) .

(4.2)

P 1 j Above, ¬ uses the fact that m−1 j=0 θ ≥ m and τ2 = 2 ; ­ uses the strong convexity of F (·) which implies F (x0 ) − F (x∗ ) ≥ σ2 kx0 − x∗ k2 ; and ® uses our choice of τ1 . 3 Case 2. Suppose that mσ L > 4 . In this case, we choose τ1 = Our parameter choices help us simplify (4.1) as

1 2

and α =

1 3τ1 L

=

2 3L

as in Katyusha.

m−1 X X   s+1  m−1 θm  1 j s e e 2E D · θ ≤D · θj + kzsm − x∗ k2 − E kz(s+1)m − x∗ k2 . 2α 2α j=0

j=0


Since θm = (1 + ασ)m ≥ 1 + ασm = 1 +

2σm 3L

≥ 32 , the above inequality implies

m−1 X X  m−1 3  e s+1  9L  3L es · E D + E kz(s+1)m − x∗ k2 · kzsm − x∗ k2 . θj ≤ D θj + 2 8 4 j=0

j=0

If we telescope this inequality over all the epochs s = 0, 1, . . . , S − 1, we immediately have m−1 m−1 i  h X 2 S  e 0 X j 3L 3L eS · kzSm − x∗ k2 ≤ · D · θ + kz0 − x∗ k2 . E D θj + 4 3 4 j=0

j=0

P σ j ∗ 2 ∗ Finally, since m−1 j=0 θ ≥ m and 2 kz0 − x k ≤ F (x0 ) − F (x ) owing to the strong convexity of F (·), we conclude that     E F (e xS ) − F (x∗ ) ≤ O 1.5−S · F (x0 ) − F (x∗ ) . (4.3)



Combining (4.2) and (4.3), we finish the proof of Theorem 4.1. □

5   Corollaries Via Reduction

It is immediately clear from Theorem 4.1 that Katyusha satisfies the HOOD property:

Corollary 5.1. Katyusha satisfies the HOOD property with Time(L, σ) = O(n + √(nL/σ)) iterations.

Remark 5.2. Notice that existing accelerated stochastic methods (even those only for solving the simpler problem (1.2)) either do not satisfy the HOOD property, or satisfy it with an additional factor of log(L/σ) in the number of iterations. This is why they cannot be combined with the reductions in [1] to obtain the optimal convergence rates.

Based on the HOOD property, we can apply Theorems 2.3, 2.4 and 2.5 from Section 2 to deduce the following corollaries.

Corollary 5.3. If each f_i(x) is convex, L-smooth and ψ(·) is not necessarily strongly convex in (1.1), then by applying AdaptReg on Katyusha with a starting vector x_0, we obtain an output x satisfying E[F(x)] − F(x∗) ≤ ε in at most

    O( n log((F(x_0) − F(x∗))/ε) + √(nL) · ||x_0 − x∗|| / √ε )   iterations.

In contrast, the best known convergence rate was only log²(1/ε)/√ε, due to Catalyst, or log(1/ε)/√ε by applying the new AdaptReg reduction on Catalyst.

Corollary 5.4. If each f_i(x) is G-Lipschitz continuous and ψ(x) is σ-strongly convex in (1.2), then by applying AdaptSmooth on Katyusha with a starting vector x_0, we obtain an output x satisfying E[F(x)] − F(x∗) ≤ ε in at most

    O( n log((F(x_0) − F(x∗))/ε) + √n · G / √(σε) )   iterations.

In contrast, the best known convergence rate was only log(1/ε)/√ε, due to APCG and SPDC.

Algorithm 2 Katyusha^ns(x_0, S, σ, L)
 1: m ← n;                                       ▷ the time window for re-computing the snapshot x̃
 2: τ_2 ← 1/2;
 3: y_0 = z_0 = x̃^0 ← x_0;                       ▷ initial vectors
 4: for s ← 0 to S − 1 do
 5:     τ_{1,s} ← 2/(s + 4),  α_s ← 1/(3τ_{1,s} L);      ▷ different parameter choices compared to Katyusha
 6:     µ^s ← ∇f(x̃^s);                           ▷ compute the full gradient only once every m iterations
 7:     for j ← 0 to m − 1 do
 8:         k ← (sm) + j;
 9:         x_{k+1} ← τ_{1,s} z_k + τ_2 x̃^s + (1 − τ_{1,s} − τ_2) y_k;
10:         ∇̃_{k+1} ← µ^s + ∇f_i(x_{k+1}) − ∇f_i(x̃^s), where i is chosen uniformly at random from {1, 2, ..., n};
11:         z_{k+1} = arg min_z { (1/(2α_s)) ||z − z_k||² + ⟨∇̃_{k+1}, z⟩ + ψ(z) };
12:         Option I:  y_{k+1} ← arg min_y { (3L/2) ||y − x_{k+1}||² + ⟨∇̃_{k+1}, y⟩ + ψ(y) };
13:         Option II: y_{k+1} ← x_{k+1} + τ_{1,s} (z_{k+1} − z_k);
                            ▷ we analyze only Option I in this paper, but Option II also works
14:     end for
15:     x̃^{s+1} ← (1/m) Σ_{j=1}^{m} x_{sm+j};
16: end for
17: return x̃^S.

Corollary 5.5. If each f_i(x) is G-Lipschitz continuous and ψ(x) is not necessarily strongly convex in (1.2), then by applying JointAdaptRegSmooth on Katyusha with a starting vector x_0, we obtain an output x satisfying E[F(x)] − F(x∗) ≤ ε in at most

    O( n log((F(x_0) − F(x∗))/ε) + √n · G · ||x_0 − x∗|| / ε )   iterations.

In contrast, the best known convergence rate was only log(1/ε)/ε, due to APCG and SPDC.

6   Non-Strongly Convex Case

In this section we consider a variant of Katyusha that directly works on non-strongly convex objectives F (·) for problem (1.1). We call this algorithm Katyushans , see Algorithm 2. As in the strongly convex case, we set τ2 = 12 throughout the algorithm, but choose τ1 = τ1,s to be a parameter that depends on the epoch index s, and accordingly αs = 3Lτ11,s . These parameter choices will satisfy the presumptions in Lemma 3.4. We prove the following theorem in this section: Theorem 6.1. If each fi (x) is convex, L-smooth in (1.1) and ψ(·) is not necessarily strongly convex, then Katyushans (x0 , S, L) satisfies  F (x ) − F (x∗ ) Lkx − x∗ k2    0 0 E F (e xS ) − F (x∗ ) ≤ O + S2 nS 2   In other words, Katyushans achieves an ε-additive error (i.e., E F (e xS ) − F (x∗ ) ≤ ε) using at √  √  n F (x0 )−F (x∗ ) nLkx0 −x∗ k √ √ most O + stochastic gradients and the same number of iterations. ε ε
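The only differences from Algorithm 1 are the epoch-dependent parameters and the plain averaging used for the snapshot; a minimal sketch of the parameter schedule (our own, with our own function name) is:

    def katyusha_ns_params(s, L):
        # Epoch-dependent parameters of Algorithm 2: tau_{1,s} = 2/(s+4), alpha_s = 1/(3*tau_{1,s}*L).
        tau1_s = 2.0 / (s + 4)
        alpha_s = 1.0 / (3.0 * tau1_s * L)
        return tau1_s, alpha_s

    # tau_2 stays 1/2, and the snapshot is the plain average of the epoch's iterates (Line 15).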


Remark 6.2. Katyusha^ns is a direct, accelerated solver for the non-strongly convex case of problem (1.1). One should compare it with the convergence guarantee of a direct, non-accelerated solver for the same setting, which can usually be written as follows when translated to our notation (see for instance SAGA [10]):

    E[F(x) − F(x∗)] ≤ O( (F(x_0) − F(x∗))/S + L||x_0 − x∗||²/(nS) ) .

It is clear at this point that Katyusha^ns is at least a factor S faster than non-accelerated methods such as SAGA. This convergence guarantee can also be written in terms of the number of iterations, which is O( n(F(x_0) − F(x∗))/ε + L||x_0 − x∗||²/ε ).

Remark 6.3. Our stated convergence result in Theorem 6.1 is slightly worse than the desired complexity O( n log((F(x_0) − F(x∗))/ε) + √(nL)·||x_0 − x∗||/√ε ) obtained from using the optimal reduction, see Corollary 5.3. This can be fixed by making some non-trivial changes to the epoch lengths, and is omitted in the current version of this paper (see footnote 9 below).

Proof of Theorem 6.1. Again defining D_k := F(y_k) − F(x∗) and D̃^s := F(x̃^s) − F(x∗), we can rewrite Lemma 3.5 as follows:

0≤

 αs τ2   αs (1 − τ1,s − τ2 ) αs  e s + 1 kzk − x∗ k2 − 1 E kzk+1 − x∗ k2 . Dk − E Dk+1 + nD τ1,s τ1,s τ1,s 2 2

Summing up the above inequality for all the iterations k = sm, sm + 1, . . . , sm + m − 1, we have m h 1−τ −τ i τ1,s + τ2 X 1,s 2 E αs D(s+1)m + αs Dsm+j τ1,s τ1,s j=1

≤ αs

 1 − τ1,s − τ2 τ2 e s 1 1  Dsm + αs nD + kzsm − x∗ k2 − E kz(s+1)m − x∗ k2 . τ1,s τ1,s 2 2

(6.1)

Note that in the above inequality we have assumed all the randomness in the first s − 1 epochs are fixed and the only sourcePof randomness comes from epoch s. m 1 es es = m j=1 x(s−1)m+j , then by the convexity of function F (·) we have mD ≤ PnIf we define x j=1 D(s−1)m+j . Therefore, for every s ≥ 1 we can derive from (6.1) that m−1 h 1 i τ1,s + τ2 X E 2 D(s+1)m + D sm+j 2 τ1,s τ1,s j=1



m−1  1 − τ1,s τ2 X 3L 3L  D + D(s−1)m+j + kzsm − x∗ k2 − E kz(s+1)m − x∗ k2 . sm 2 2 2 2 τ1,s τ1,s j=1

(6.2)

For the base case s = 0, we can also rewrite (6.1) as m−1 i h 1 τ1,0 + τ2 X E 2 Dm + D j 2 τ1,0 τ1,0 j=1

≤ 9

 1 − τ1,0 − τ2 τ2 n e 0 3L 3L  D0 + 2 D + kz0 − x∗ k2 − E kzm − x∗ k2 . 2 2 2 τ1,0 τ1,0

(6.3)

Footnote 9: More precisely, recall that a similar issue also happened in the non-accelerated world: the O((n + L)/ε) iteration complexity of SAGA can be improved to O(n log(1/ε) + L/ε) using varying epoch lengths [8]. The technique in [8] also applies to this paper.

At this point, if we choose τ1,s = 1 2 τ1,s



2 s+4

≤ 12 , it satisfies

1 − τ1,s+1 2 τ1,s+1

and

τ1,s + τ2 τ2 ≥ 2 . 2 τ1,s τs+1

Using these two inequalities, we can telescope (6.3) and (6.2) for all s = 0, 1, . . . , S − 1. We obtain in the end that h E

1 2 τ1,S−1

D(s+1)m +

m−1 i 1−τ −τ τ1,S−1 + τ2 X τ2 n e 0 3L 3L 1,0 2 ∗ 2 D0 + 2 D + kz0 −x∗ k2 D + kz −z k ≤ Sm (S−1)m+j 2 2 2 2 τ1,S−1 τ τ 1,0 1,0 j=1

e S ≤ 1 Pm D(S−1)m+j which is no greater than Since we have D j=1 m of (6.4), we conclude that

(6.4)

2 2τ1,S−1 m

times the left hand side

2  τ1,S    S   1 − τ1,0 − τ2 τ2 n e 0 3L S ∗ ∗ 2 e + D x ) − F (x ) = E D ≤ O · D + kz − x k E F (e 0 0 2 2 m 2 τ1,0 τ1,0     1 ∗ ∗ 2 =O . · m F (x ) − F (x ) + Lkx − x k 0 0 mS 2



References

[1] Zeyuan Allen-Zhu and Elad Hazan. Optimal Black-Box Reductions Between Optimization Objectives. ArXiv e-prints, abs/1603.05642, March 2016.
[2] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. ArXiv e-prints, abs/1603.05643, March 2016.
[3] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the 27th ACM-SIAM Symposium on Discrete Algorithms, SODA '16, 2016.
[4] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, abs/1407.1537, July 2014.
[5] Zeyuan Allen-Zhu and Lorenzo Orecchia. Nearly-Linear Time Positive LP Solver with Faster Convergence Rate. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, STOC '15, 2015.
[6] Zeyuan Allen-Zhu and Lorenzo Orecchia. Using optimization to break the epsilon barrier: A faster and simpler width-independent algorithm for solving positive linear programs in parallel. In Proceedings of the 26th ACM-SIAM Symposium on Discrete Algorithms, SODA '15, 2015.
[7] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. ArXiv e-prints, abs/1512.09103, December 2015.
[8] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. ArXiv e-prints, abs/1506.01972, June 2015.
[9] Léon Bottou. Stochastic gradient descent. http://leon.bottou.org/projects/sgd.


[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems, NIPS 2014, 2014.
[11] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 2014.
[12] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015. First appeared on ArXiv 1312.5799 in 2013.
[13] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, pages 1–28, 2015.
[14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, January 2014.
[15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013.
[16] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, abs/1212.2002, 2012.
[17] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. ArXiv e-prints, abs/1507.02000, October 2015.
[18] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 147–156. IEEE, 2013.
[19] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[20] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In Advances in Neural Information Processing Systems, NIPS 2014, pages 3059–3067, 2014.
[21] Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, pages 1–28, 2013.
[22] Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, pages 674–682, 2013.
[23] Julien Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829–855, April 2015. Preliminary version appeared in ICML 2013.


[24] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.
[25] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic Course, volume I. Kluwer Academic Publishers, 2004.
[26] Yurii Nesterov. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341–362, January 2012.
[27] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, 2012.
[28] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.
[29] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint arXiv:1211.2717, pages 1–18, 2012.
[30] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
[31] Shai Shalev-Shwartz and Tong Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 64–72, 2014.
[32] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[33] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980–988, 2013.
[34] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning, ICML 2004, 2004.
[35] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, 2015.
