Neighborhood Watch: Stochastic Gradient Descent with Neighbors


arXiv:1506.03662v1 [cs.LG] 11 Jun 2015

Thomas Hofmann, Aurelien Lucchi, Brian McWilliams
Department of Computer Science, ETH Zürich
{thomas.hofmann, aurelien.lucchi, brian.mcwilliams}@inf.ethz.ch

Abstract

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet it is also known to be slow relative to steepest descent. The variance in the stochastic update directions only allows for sublinear or (with iterate averaging) linear convergence rates. Recently, variance reduction techniques such as SVRG and SAGA have been proposed to overcome this weakness. With asymptotically vanishing variance, a constant step size can be maintained, resulting in geometric convergence rates. However, these methods are either based on occasional computations of full gradients at pivot points (SVRG), or on keeping per-data-point corrections in memory (SAGA). This has the disadvantage that one cannot employ these methods in a streaming setting and that speed-ups relative to SGD may need a certain number of epochs in order to materialize. This paper investigates a new class of algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points. While not meant to offer advantages in an asymptotic setting, there are significant benefits in the transient optimization phase, in particular in a streaming or single-epoch setting. We investigate this family of algorithms in a thorough analysis and show supporting experimental results. As a by-product we provide a simple and unified proof technique for a broad class of variance reduction algorithms.

1 Introduction

We consider a general problem that is pervasive in machine learning, namely optimization of an empirical or regularized convex risk function. Given a convex loss $l$ and a strongly convex regularizer $\Omega$, one aims at finding a parameter vector $w$ which minimizes the (empirical) expectation:

$$w^* \in \operatorname*{argmin}_{w} f(w), \qquad f(w) := \mathbf{E} f_i(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad f_i(w) := l(w, (x_i, y_i)) + \Omega(w). \tag{1}$$

We assume throughout that the data-dependent functions $f_i$ have Lipschitz-continuous gradients. Steepest descent is a straightforward algorithm to find a minimizer $w^*$ of $f$, but it requires the repeated computation of full gradients $f'(w)$, which becomes prohibitive in the case of massive data sets and which is impossible in the case of streaming data. Stochastic gradient descent (SGD) is a popular alternative, in particular in the context of large-scale learning [2, 9]. The key advantage of SGD is that each update only involves a single example. This is sufficient to compute a stochastic version of the true gradient as $f_i'(w)$, which provides an unbiased estimate since $\mathbf{E} f_i'(w) = f'(w)$. It is a surprising recent finding [10, 5, 8, 6] that the additive structure of $f$ allows for significantly faster convergence in expectation. Instead of $O(1/t)$ or $O(\log t/t)$ rates for standard SGD variants, it is possible to obtain geometric rates, i.e. to guarantee exponentially fast convergence. While the classic convergence proof of SGD requires vanishing step sizes, typically at a rate of $O(1/t)$ [7], these more recent methods introduce corrections to the stochastic gradients that ensure convergence for finite step sizes.

Based on the work mentioned above, the contributions of our paper are as follows: First, we provide a novel, simple, and more general analysis technique for studying variance-reduced versions of SGD. Second, based on the above analysis, we present new insights into the trade-offs between freshness and biasedness of the corrections computed from previous stochastic gradients. Third, we present a new class of algorithms that resolves this trade-off by computing corrections based on stochastic gradients at neighboring points. Fourth, we show experimentally that this new class of algorithms offers advantages in the regime of streaming data and single/few-epoch learning.
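To fix notation, the following is a minimal sketch of the plain-SGD baseline in this setting. The quadratic loss instance, the $O(1/t)$ step-size schedule constants, and all variable names are illustrative choices of ours, not prescriptions of the paper:

```python
import numpy as np

def sgd(X, y, n_steps, gamma0=1.0, T0=1.0, mu=1e-3):
    """Plain SGD on f(w) = (1/n) sum_i f_i(w) with a vanishing step size.

    Illustrative instance: f_i(w) = 0.5*(x_i.w - y_i)^2 + 0.5*mu*||w||^2,
    so f_i'(w) = (x_i.w - y_i)*x_i + mu*w is an unbiased estimate of f'(w).
    """
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(n_steps):
        i = rng.integers(n)                    # index i uniformly at random
        g = (X[i] @ w - y[i]) * X[i] + mu * w  # stochastic gradient f_i'(w)
        w -= gamma0 / (T0 + t) * g             # vanishing step size, O(1/t)
    return w
```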

2 Algorithms

Variance Reduced SGD  Given an optimization problem as in (1), we investigate a class of stochastic gradient descent algorithms that generates an iterate sequence $w^t$ ($t \ge 0$) with updates taking the form:

$$w^+ = w - \gamma g_i(w), \qquad g_i(w) = f_i'(w) - \bar\alpha_i. \tag{2}$$

Here $w$ is the current and $w^+$ the new parameter vector, $\gamma$ is the step size, and $i$ is an index selected uniformly at random. The $\bar\alpha_i$ are variance correction terms such that $\mathbf{E}\bar\alpha_i = 0$. The aim is to define updates of asymptotically vanishing variance, i.e. $g_i(w) \to 0$ as $w \to w^*$, which requires that $\bar\alpha_i \to f_i'(w^*)$. This implies that the corrections need to be designed in a way to exactly cancel out the stochasticity of $f_i'(w^*)$ at the optimum.

SAGA  The SAGA algorithm [3] maintains variance corrections $\alpha_i$ by memorizing stochastic gradients. The update rule is $\alpha_i^+ = f_i'(w)$ for the selected $i$, and $\alpha_j^+ = \alpha_j$ for $j \ne i$. Note that these corrections will be used the next time the same index $i$ gets sampled. Bias-adjusted versions are then defined via $\bar\alpha_i := \alpha_i - \bar\alpha$, where $\bar\alpha := \frac{1}{n}\sum_{i=1}^n \alpha_i$. Obviously, $\bar\alpha$ can be updated incrementally. One convenient property of SAGA is that it can reuse the stochastic gradient $f_i'(w)$ computed at step $t$ to update both $w$ and the correction $\alpha_i$ at no additional cost. We also consider a variant of SAGA, which we call $q$-SAGA, that on average updates $q \ge 1$ randomly chosen $\alpha_j$ variables at each iteration. While this is practically less interesting, it is a convenient reference point in our analysis as well as in the experiments to investigate the advantages of "fresher" corrections. Note that in SAGA the corrections will be on average $n$ iterations "old". In $q$-SAGA this can be controlled to be $n/q$.

SVRG  To show the fruitfulness of our framework, we reformulate a variant of Stochastic Variance Reduced Gradient (SVRG) [5]. We use a randomization argument similar to (but much simpler than) the one suggested in [6] and define a real-valued parameter $q > 0$. In each iteration we sample $r \sim \text{Uniform}[0, 1)$. If $r < q/n$ we perform a complete update of the $\alpha$ variables, $\alpha_j^+ = f_j'(w)$ ($\forall j$); otherwise they are left unchanged. For $q = 1$, in expectation one parameter is updated per step, which is the same as in SAGA. The main difference is that SAGA always updates exactly one $\alpha_i$ in each iteration, while SVRG occasionally updates all $\alpha$ parameters by triggering an additional sweep through the data. One can also see that here $\bar\alpha = f'(w)$. There is an option to not maintain the $\alpha$ variables explicitly and to save on space by storing only $\bar\alpha$ and the parameter vector $\tilde w$ at which the last full re-computation happened.

Uniform Memorization Algorithms  Motivated by SAGA and SVRG, we define a class of algorithms, which we call memorization algorithms. Memorization algorithms use the $\alpha_i$ variables to keep track of past stochastic gradients, i.e. $\alpha_i^t = f_j'(w^{\tau_i})$ for some iteration $\tau_i < t$. Note that for SAGA as well as SVRG $i = j$. We refer to these special cases as algorithms without sharing. We will go beyond this restriction by considering $i \ne j$ below. It will turn out to be a technical requirement of our analysis that the probability of updating $\alpha_i$ is the same for all $i$. We call memorization algorithms with this property uniform memorization algorithms.
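The following is a minimal sketch of the SAGA-style memorization update (2), under the same illustrative ridge-regression instance as in the SGD sketch above; the helper `grad` and all names are ours, not taken from [3]:

```python
import numpy as np

def saga(X, y, n_steps, gamma=1e-3, mu=1e-3):
    """SAGA-style variance-reduced SGD (update (2) with the SAGA rule).

    alpha[i] memorizes the last stochastic gradient f_i' seen for point i;
    alpha_bar keeps the running mean (1/n) sum_i alpha_i incrementally.
    """
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros((n, d))          # memorized gradients alpha_i
    alpha_bar = np.zeros(d)           # (1/n) sum_i alpha_i
    rng = np.random.default_rng(0)

    def grad(i, w):                   # f_i'(w) for the illustrative instance
        return (X[i] @ w - y[i]) * X[i] + mu * w

    for _ in range(n_steps):
        i = rng.integers(n)
        g_i = grad(i, w)
        w -= gamma * (g_i - alpha[i] + alpha_bar)  # f_i'(w) - bar_alpha_i, Eq. (2)
        alpha_bar += (g_i - alpha[i]) / n          # incremental mean update
        alpha[i] = g_i                             # memorize the fresh gradient
    return w
```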

Figure 1: The N-SAGA method presented in this paper generalizes the SAGA correction by exploiting a neighborhood structure $N_i$ for computing corrections. (The figure sketches a point $x_i$ with an $\epsilon$-ball neighborhood containing $x_j$, annotated with $\mathbf{E}\|f_i'(w) - f_j'(w)\|^2 < \eta$.)

N-SAGA  We assume that we have structured our data set such that for each data index $i$ there is a set of neighbors $N_i$ available. Specifically, we propose two variants: (i) the balanced case, in which $|N_i| = q$ ($\forall i$). We call this variant qN-SAGA. It is mainly important for the analysis, since it fulfills the requirements of a uniform memorization algorithm. (ii) The case of $\epsilon$-neighborhoods, where $N_i := \{j : \|x_i - x_j\| \le \epsilon\}$, which we refer to as εN-SAGA. As the analysis will show, the quantity that matters most is the expectation of the squared pairwise distances $\|f_i'(w) - f_j'(w)\|^2$. Since computing those explicitly defeats the purpose of sharing stochastic gradients across points, εN-SAGA resorts to the heuristic of constructing neighborhoods based on distances in input space. This can be more formally motivated by assuming a Lipschitz condition of the loss with regard to the inputs. Based on a neighborhood system, we define a modified version of the SAGA updates on selecting $i$ via

$$\alpha_j^+ = \begin{cases} f_i'(w) & \text{if } i \in N_j \\ \alpha_j & \text{otherwise} \end{cases} \tag{3}$$

Intuitively, the stochastic gradient computed for the observed or sampled data point $(x_i, y_i)$ is propagated to those data points $(x_j, y_j)$ for which $i$ is a neighbor of $j$.

Computing Neighborhoods  From a practical point of view it is critical to limit the computational overhead of εN-SAGA. However, note that we do not need exactness in finding nearest neighbors. Obviously, the closer the neighbors, the better the expected improvements. Yet, one can be opportunistic here and use plain SAGA as a fallback option, e.g. in the εN-variant with occasionally trivial neighborhoods $N_i = \{i\}$.

From a practical point of view: (i) We should create or maintain a candidate set of $k$ data vectors, where $k \in o(n)$, most drastically $k \in O(1)$. The simplest way to do this is via uniform subsampling of $k$ points from the training data. More refined methods are available from the literature on coresets, e.g. [4]. How $\epsilon$ will decrease with $k$ remains data-set dependent, though. (ii) We can use locality-sensitive hashing to index these $k$ data points, perhaps even using data-dependent hashing [1], whenever this cost can be amortized. Note also that as SGD is a sequential method, increasing the computational cost in a highly parallelizable way may not affect data throughput.
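As an illustration, here is a minimal sketch of the εN-SAGA sharing rule (3) on top of the SAGA sketch above. The brute-force neighborhood construction stands in for the subsampling/LSH schemes just discussed and is our own simplification:

```python
import numpy as np

def eps_neighborhoods(X, eps):
    """Brute-force N_i = {j : ||x_i - x_j|| <= eps}; illustrative only
    (a subsampled candidate set plus LSH would replace this in practice)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return [np.flatnonzero(row <= eps ** 2) for row in d2]

def eps_n_saga(X, y, n_steps, eps, gamma=1e-3, mu=1e-3):
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros((n, d))
    alpha_bar = np.zeros(d)
    # With symmetric eps-balls, i in N_j iff j in N_i, so iterating over
    # N[i] visits exactly the j whose alpha_j receives f_i'(w) under (3).
    N = eps_neighborhoods(X, eps)
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        i = rng.integers(n)
        g_i = (X[i] @ w - y[i]) * X[i] + mu * w     # f_i'(w)
        w -= gamma * (g_i - alpha[i] + alpha_bar)   # same step as SAGA, Eq. (2)
        for j in N[i]:                              # share: alpha_j^+ = f_i'(w)
            alpha_bar += (g_i - alpha[j]) / n
            alpha[j] = g_i
    return w
```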

3 Analysis

Primal Recurrence  The evolution equation (2) in expectation implies the recurrence

$$\mathbf{E}\|w^+ - w^*\|^2 = \|w - w^*\|^2 - 2\gamma\,\langle f'(w), w - w^* \rangle + \gamma^2\,\mathbf{E}\|g_i(w)\|^2. \tag{4}$$
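For completeness, (4) follows by expanding the square and using the unbiasedness $\mathbf{E} g_i(w) = f'(w)$ (which holds since $\mathbf{E}\bar\alpha_i = 0$):

$$\mathbf{E}\|w^+ - w^*\|^2 = \mathbf{E}\|w - w^* - \gamma g_i(w)\|^2 = \|w - w^*\|^2 - 2\gamma\langle \mathbf{E} g_i(w),\, w - w^*\rangle + \gamma^2\,\mathbf{E}\|g_i(w)\|^2.$$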

From here we utilize a number of well-known bounds (see e.g. [3]), which exploit strong convexity of $f$ (wherever $\mu$ appears) as well as Lipschitz continuity of the $f_i$-gradients (wherever $L$ appears):

$$\langle f'(w), w - w^* \rangle \ge f(w) - f(w^*) + \frac{\mu}{2}\|w - w^*\|^2, \tag{5}$$

$$\mathbf{E}\|g_i(w)\|^2 \le (1+\beta)\,\mathbf{E}\|f_i'(w) - f_i'(w^*)\|^2 - \beta\|f'(w)\|^2 + (1+\beta^{-1})\,\mathbf{E}\|\bar\alpha_i - f_i'(w^*)\|^2 \quad (\beta > 0), \tag{6}$$

$$\|f'(w)\|^2 \ge 2\mu\,(f(w) - f(w^*)), \tag{7}$$

$$\|f_i'(w) - f_i'(w^*)\|^2 \le 2L\,h_i(w), \qquad h_i(w) := f_i(w) - f_i(w^*) - \langle w - w^*,\, f_i'(w^*) \rangle, \tag{8}$$

$$\mathbf{E}\|f_i'(w) - f_i'(w^*)\|^2 \le 2L\,(f(w) - f(w^*)), \tag{9}$$

$$\mathbf{E}\|\bar\alpha_i - f_i'(w^*)\|^2 = \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 - \|\bar\alpha\|^2 \le \mathbf{E}\|\alpha_i - f_i'(w^*)\|^2. \tag{10}$$

By applying all of these, we can derive a $\beta$-parameterized bound

$$\|w - w^*\|^2 - \mathbf{E}\|w^+ - w^*\|^2 \ge \gamma\mu\|w - w^*\|^2 - \gamma^2(1+\beta^{-1})\,\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 + 2\gamma\,[1 - \gamma(L(1+\beta) - \mu\beta)]\,(f(w) - f(w^*)). \tag{11}$$

Note that in the ideal case of perfect asymptotic variance reduction, where $\alpha_i = f_i'(w^*)$, we can look at the limit $\beta \to 0$ and would immediately get a condition for a contraction by choosing $\gamma = \frac{1}{L}$, yielding a contraction rate of $1 - \rho$ with $\rho = \gamma\mu = \frac{\mu}{L}$, the inverse of the condition number. Our main result will show that we are not losing more than a factor of 4 relative to that gold standard.

Exploiting Properties of Uniform Memorization Algorithms  How can we further bound $\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2$ in the case of variance-reducing SGD? A key insight is that for memorization algorithms without sharing, we can apply the same smoothness bound as in Eq. (8):

$$\|\alpha_i - f_i'(w^*)\|^2 \le 2L\,h_i(w^{\tau_i}). \tag{12}$$

For the N-SAGA family things get slightly more complicated, but we can apply a similar parameterized bound as before (with $\phi > 0$) to split the squared norm:

$$\|\alpha_i - f_i'(w^*)\|^2 \le (1+\phi)\,\|f_i'(w^{\tau_i}) - f_i'(w^*)\|^2 + (1+\phi^{-1})\,\|f_j'(w^{\tau_i}) - f_i'(w^{\tau_i})\|^2. \tag{13}$$
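For reference, the split in (13) is an instance of the elementary inequality (a consequence of $2\langle a, b\rangle \le \phi\|a\|^2 + \phi^{-1}\|b\|^2$), applied with $a = f_i'(w^{\tau_i}) - f_i'(w^*)$ and $b = f_j'(w^{\tau_i}) - f_i'(w^{\tau_i})$:

$$\|a + b\|^2 \le (1+\phi)\|a\|^2 + (1+\phi^{-1})\|b\|^2 \qquad (\phi > 0).$$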

The first of these terms is the same as before, only re-scaled by $(1+\phi)$. The second term is the error introduced by making use of neighbor information (i.e. sharing). We assume it can be bounded in expectation by

$$\mathbf{E}\|f_j'(w) - f_i'(w)\|^2 < \eta, \tag{14}$$

where the expectation is over random index pairs $\{(i, j) : j \in N_i\}$ with regard to the distribution induced by the update rule. We require this bound to hold for each $w$ along the iterate sequence.

Lyapunov Function  We want to show that for a suitable choice of the step size $\gamma$ each iteration results in a contraction that brings us closer to the optimum, i.e. $\mathbf{E}\|w^+ - w^*\|^2 \le (1-\rho)\|w - w^*\|^2$, where $0 < \rho < 1$. However, the main challenge arises from the fact that the $\alpha_i$ store stochastic gradients from previous iterations, i.e. they constitute quantities that are not evaluated at the current iterate $w$. This requires a somewhat more complex proof technique. Inspired by the Lyapunov function method in [3], we define upper bounds $H_i \ge \|\alpha_i - f_i'(w^*)\|^2$ such that $H_i \to 0$ as $w \to w^*$. We (conceptually) initialize $H_i = 2L\,h_i(w^0)$, start with $\alpha_i^0 = 0$, and then update $H_i$ in sync with $\alpha_i$:

$$H_i^+ := \begin{cases} 2L\,h_i(w) & \text{if } \alpha_i \text{ is updated} \\ H_i & \text{otherwise} \end{cases} \tag{15}$$

so that we always maintain valid bounds $\|\alpha_i - f_i'(w^*)\|^2 \le H_i$ and $\mathbf{E}\|\alpha_i - f_i'(w^*)\|^2 \le \bar H$ with $\bar H := \frac{1}{n}\sum_{i=1}^n H_i$. Note that for N-SAGA $\|\alpha_i - f_i'(w^*)\|^2 \le (1+\phi)H_i + (1+\phi^{-1})\eta$. It should be clear that the $H_i$ are quantities showing up in the analysis, but are not computed in the algorithm. We now define a Lyapunov function (which is much simpler than in [3])

$$\mathcal{L}(w, H) = \|w - w^*\|^2 + S\sigma\bar H, \qquad \text{with } S := \frac{\gamma n}{Lq} \text{ and } 0 < \sigma < 1. \tag{16}$$

In expectation under a random update, the Lyapunov function $\mathcal{L}$ changes as $\mathbf{E}\mathcal{L}(w^+, H^+) = \mathbf{E}\|w^+ - w^*\|^2 + S\sigma\,\mathbf{E}\bar H^+$. The first part is due to the parameter update, and we can apply (11) to bound it. The second part is due to the recurrence (15), which mirrors the update of the $\alpha$ variables. For the previously defined uniform memorization algorithms we have

$$\mathbf{E}\bar H^+ = \frac{n-q}{n}\,\bar H + \frac{2Lq}{n}\,(f(w) - f(w^*)), \tag{17}$$

where we have made use of the fact that $\mathbf{E} h_i(w) = f(w) - f(w^*)$. Note that for this expectation to work out correctly without further complication requires the mentioned uniformity property. Also note that in expectation the shrinkage does not depend on the location of previous iterates $w^\tau$, and the new increment is always proportional to the sub-optimality of the current iterate $w$.
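As a quick sanity check of (17) (ours, not part of the paper), one can simulate the uniform update of $q$ randomly chosen $H_i$ and compare the empirical mean of $\bar H^+$ against the predicted mixture; all names and the chosen distributions are illustrative:

```python
import numpy as np

# Monte Carlo check of Eq. (17): with q of the n bounds H_i redrawn as
# 2*L*h_i(w), and (1/n) sum_i h_i(w) = f(w) - f(w*), the mean of H-bar^+
# should be ((n - q)/n) * H_bar + (2*L*q/n) * (f(w) - f(w*)).
rng = np.random.default_rng(0)
n, q, L = 1000, 50, 2.0
H = rng.uniform(0.0, 1.0, size=n)        # current bounds H_i (arbitrary)
h = rng.uniform(0.0, 0.5, size=n)        # stand-ins for h_i(w) >= 0
subopt = h.mean()                        # plays the role of f(w) - f(w*)

trials = []
for _ in range(20000):
    Hp = H.copy()
    idx = rng.choice(n, size=q, replace=False)  # uniform: each i equally likely
    Hp[idx] = 2.0 * L * h[idx]                  # Eq. (15) for the updated indices
    trials.append(Hp.mean())

predicted = (n - q) / n * H.mean() + 2.0 * L * q / n * subopt
print(np.mean(trials), predicted)        # the two numbers should agree closely
```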

General Convergence Analysis  Our main contraction result for uniform memorization algorithms (without sharing) can be stated as follows:

Theorem 1. Consider an instance of a uniform memorization algorithm without sharing as defined above. For any choice of $\beta > 0$ and $0 < \sigma < 1$ there is a step size $\gamma > 0$ such that we get a contraction $\mathcal{L}(w, H) - \mathbf{E}\mathcal{L}(w^+, H^+) \ge \rho(\sigma, \beta)\,\mathcal{L}(w, H)$ with

$$\rho(\sigma, \beta) := \frac{\mu}{L}\,\min\left\{\frac{\sigma}{R\sigma + (1+\beta^{-1})},\ \frac{1-\sigma}{1+\beta}\right\}, \qquad \text{where } R := \frac{n\mu}{qL}.$$

As it provides a lot of motivation and insight, we carry out the proof in the main text. From Eq. (5), we can see that it may be reasonable to aim for $\rho \le \gamma\mu$ based on the $\|w - w^*\|^2$ part of $\mathcal{L}$. The question then is how $\bar H$ changes and what constraints that may impose on $\gamma$. First of all, note that there is a shrinkage of $\bar H$ that only depends on $q$:

$$\Delta_H^+ := S\sigma\bar H - S\,\frac{n-q}{n}\,\sigma\bar H = S\,\frac{q}{n}\,\sigma\bar H = \frac{\gamma}{L}\,\sigma\bar H. \tag{18}$$

But as we are using $H_i$ to bound $\|\alpha_i - f_i'(w^*)\|^2$ as it appears in (11), we need to subtract a suitable term, namely $\Delta_H^- := \gamma^2(1+\beta^{-1})\bar H$. Combining the two terms, we pull out $S$ and collect constants:

$$\Delta_H := \Delta_H^+ - \Delta_H^- = S\sigma\,\frac{Lq}{n}\left[\frac{1}{L} - \frac{\gamma(1+\beta^{-1})}{\sigma}\right]\bar H. \tag{19}$$

As we aim for a contraction rate of $\rho = \gamma\mu$, this leads to a constraint on the step size

$$\frac{1}{L} - \frac{\gamma(1+\beta^{-1})}{\sigma} \ge \mu\gamma\left(\frac{qL}{n}\right)^{-1} = R\gamma \iff \gamma \le \frac{1}{L}\cdot\frac{\sigma}{R\sigma + (1+\beta^{-1})}, \tag{20}$$

with $R$ as defined in the claim of the theorem. We next investigate terms that can be bounded by the sub-optimality $f(w) - f(w^*)$. From Eq. (11) we get a factor $\Delta_f^1 := 2\gamma\,[1 - \gamma(L(1+\beta) - \mu\beta)]$. However, we also have $f(w) - f(w^*)$ occur in the $\bar H$ recurrence in (17). After working through cancellations of the constants one gets a second factor $\Delta_f^2 := -2\gamma\sigma$. Combining both terms, we require

$$\Delta_f^1 + \Delta_f^2 \ge 0 \iff \sigma + \gamma L(1+\beta) \le 1 + \gamma\mu\beta \iff \gamma \le \frac{1-\sigma}{L + \beta(L-\mu)}. \tag{21}$$

Here we can directly see that $\sigma < 1$ is needed in order to get a positive step size, as $\mu \le L$ implies that the denominator is positive. We would like to simplify the bound, which we can get by strengthening as follows:

$$\frac{1-\sigma}{L + \beta(L-\mu)} \ge \frac{1-\sigma}{L(1+\beta)} \ge \gamma. \tag{22}$$

So we have derived two bounds which, together with the identity $\rho = \mu\gamma$, can be summarized in the claim of the theorem.

The theorem provides a two-dimensional family of bounds, as $\beta > 0$ and $0 < \sigma < 1$ can be chosen arbitrarily. The question is which choice of constants gives the best (maximal) rate $\rho^*$. The optimal choice of $\sigma$ is provided by the following corollary.

Corollary 1. In Theorem 1, the maximal rate $\rho^*(\beta) = \rho(\beta, \sigma^*) = \max_\sigma \rho(\sigma, \beta)$ is obtained at

$$\sigma^* = \frac{1}{2R}\left[R - (a+b) + \sqrt{R^2 + (a+b)^2 - 2R(a-b)}\right], \tag{23}$$

where $a = (1+\beta)$ and $b = (1+\beta^{-1})$.

Proof. We have to equate both expressions in the minimum of Theorem 1, resulting in

$$a\sigma = (R\sigma + b)(1-\sigma) \iff R\sigma^2 + (a+b-R)\sigma - b = 0. \tag{24}$$

Applying the standard formula for quadratic equations provides the claim.

Solving the resulting bound for $\beta$ yields the best bound. Here, we simply state a specialization to the case of $\beta = 1$.

Corollary 2. The optimal rate is lower bounded by

$$\rho^* = \max_\beta \rho^*(\beta) \ge \rho^*(1) = \frac{\mu}{4L}\left[1 + \frac{4}{R} - \sqrt{1 + \left(\frac{4}{R}\right)^2}\,\right]. \tag{25}$$
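As a numerical sanity check (ours, not the paper's), one can maximize $\rho(\sigma, 1)$ over $\sigma$ on a grid and compare against the closed forms (23) and (25); the parameter values below are arbitrary:

```python
import numpy as np

def rho(sigma, beta, R, mu_over_L):
    """Rate from Theorem 1: (mu/L) * min{sigma/(R*sigma + b), (1-sigma)/a}."""
    a, b = 1.0 + beta, 1.0 + 1.0 / beta
    return mu_over_L * min(sigma / (R * sigma + b), (1.0 - sigma) / a)

R, mu_over_L, beta = 20.0, 0.01, 1.0
a, b = 1.0 + beta, 1.0 + 1.0 / beta

# Closed-form optimum, Eq. (23)
sigma_star = (R - (a + b) + np.sqrt(R**2 + (a + b)**2 - 2*R*(a - b))) / (2*R)

# Grid search over sigma in (0, 1)
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
best = max(rho(s, beta, R, mu_over_L) for s in grid)

# Lower bound of Corollary 2, Eq. (25), which is exact for beta = 1
z = 4.0 / R
bound = mu_over_L / 4.0 * (1.0 + z - np.sqrt(1.0 + z**2))

print(rho(sigma_star, beta, R, mu_over_L), best, bound)
# All three numbers should coincide up to the grid resolution.
```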

Note that the constant $R$ is the ratio of $n/q$ and the condition number $L/\mu$. In the large data case, where $n/q \gg L/\mu$ (i.e. $R \gg 1$), we get the following result.

Corollary 3. The optimal contraction rate is guaranteed to be at least

$$\rho^* \ge \frac{\mu}{4L}\cdot\frac{4}{R} + O(R^{-2}) = \frac{q}{n} + O(R^{-2}). \tag{26}$$

Proof. Set $z = 4/R$ and perform a Taylor approximation of the bound in Corollary 2 around $z = 0$:

$$b(z) := \frac{\mu}{4L}\left[1 + z - \sqrt{1+z^2}\right], \qquad b'(z) = \frac{\mu}{4L}\left[1 - \frac{z}{\sqrt{1+z^2}}\right]. \tag{27}$$

Approximating $b(z) \approx b(0) + z\,b'(0) = z\,\frac{\mu}{4L}$ gives the claim.

The effect of $q$ on the rate can be elucidated as follows. Using $b'(z)$ as defined in Eq. (27) and noting that $z = \frac{4L}{\mu n}\,q$, we see that $b'(q) = \frac{1}{n}\left(1 - z/\sqrt{1+z^2}\right) \le \frac{1}{n}$. Moreover, it is easy to see that $b''(q) < 0$. Hence the effect of increasing $q$ on the bound in Corollary 2 is at most $\frac{1}{n}$. If $\mu/L \ll 1/n$, i.e. for small $\mu$ because of weak regularization, improved freshness has a small effect, yet in a regime where $\mu/L \gtrsim 1/n$ the rate will increase proportionally to $q$.

N-SAGA  We now provide results for the N-SAGA algorithm in the (easier to analyze) qN variant. Taking the above analysis as a starting point, we first need to incorporate the additional penalty term $(1+\phi) > 1$. So wherever we had $(1+\beta^{-1})$ before, we will now have a factor $b = (1+\beta^{-1})(1+\phi)$, e.g. in Corollary 1. More challenging is the error $\|f_i'(w) - f_j'(w)\|^2$ for $i \in N_j$, as it introduces a finite (non-vanishing) bias that can only be controlled indirectly via the granularity of the neighborhoods. What is possible to obtain in this case is geometric convergence towards a neighborhood of $w^*$.

Theorem 2. Let $\rho$ be a rate guaranteed by Theorem 1 for a uniform memorization algorithm without sharing (e.g. $q$-SAGA). For a fixed $w$ assume that $\mathbf{E}\|f_i'(w) - f_j'(w)\|^2 < \eta$ ($\forall i \in N_j$) and define $\rho' = (1-\zeta)\rho$, $0 < \zeta < 1$. Then for qN-SAGA we have $\mathcal{L}(w, H) - \mathbf{E}\mathcal{L}(w^+, H^+) \ge \rho'\,\mathcal{L}(w, H)$ as long as $\|w - w^*\| \ge \delta$ with $\delta := \frac{2(1-\zeta)}{\mu}\sqrt{\frac{\eta\rho}{\zeta}}$.

Proof. Based on the assumptions it is straightforward to derive

$$\mathcal{L}(w, H) - \mathbf{E}\mathcal{L}(w^+, H^+) \ge \rho\,\mathcal{L}(w, H) - 4\gamma^2\eta \tag{28}$$

by setting $\phi = 1$. In order to get a contraction of $\rho' = (1-\zeta)\rho$, it is required that $\zeta\rho\,\mathcal{L}(w, H) \ge 4\gamma^2\eta$. As $\mathcal{L}(w, H) \ge \|w - w^*\|^2$, and with an adjusted step size $\gamma = \frac{\rho'}{\mu}$, we get as a sufficient condition

$$\zeta\rho\|w - w^*\|^2 \ge 4\left(\frac{\rho'}{\mu}\right)^2\eta \iff \|w - w^*\|^2 \ge \frac{4\eta}{\mu^2}\cdot\frac{(1-\zeta)^2\rho}{\zeta}. \tag{29}$$

Corollary 4. As a specialization of Theorem 2 we choose $\zeta = 2 - \sqrt{3}$ and get a contraction of $\rho' = (\sqrt{3} - 1)\rho \approx 0.732\,\rho$ as long as $\|w - w^*\| \ge \frac{2}{\mu}\sqrt{2\eta\rho}$.

Figure 2: Comparison of εN-SAGA, q-SAGA, SAGA and SGD on three datasets. (Panels (a) cov, (b) ijcnn1, (c) year plot the objective value on a log scale over epochs 1-5 for SGD with a constant step size, εN-SAGA, SAGA, plain SGD, and q-SAGA.)

Corollary 5. Under the assumptions of Theorem 2, if for $i \in N_j$, $\mathbf{E}\|f_i'(w) - f_j'(w)\|^2 < \eta$ holds for all iterates $w = w^t$, then $w^t$ converges towards a $\delta$-ball around $w^*$ at a geometric rate of $\rho'$.

So the theoretically expected behavior of qN-SAGA is to achieve very fast initial progress towards $w^*$ (for large enough $q$, faster than SAGA) until the iterate reaches a $\delta$-ball around the optimum. The size of this $\delta$-ball scales with $\sqrt{\eta}$. Within this ball, we either get a random-walk behavior or need to switch to an SGD-style schedule for the step size to ensure convergence. Alternatively, we can lower $q$ or ultimately switch to SAGA without sharing (effectively $q = 1$), if highly accurate convergence towards the empirical risk minimizer is desired. In practice we have found εN-SAGA to behave better than qN-SAGA. This is sensible, as the constant neighborhood size in qN-SAGA seems more like a technical requirement than a principled feature.

4 Experimental Results

Algorithms  We present experimental results on the performance of the different variants of memorization algorithms for variance-reduced SGD as discussed in this paper. Since SAGA has shown uniformly superior behavior to SVRG in our experiments, and εN-SAGA has been almost uniformly superior to qN-SAGA (albeit sometimes with small differences), we focus on these algorithms alongside SGD as a straw man and q-SAGA as a point of reference for speed-ups. We have chosen $q = 50$ in q-SAGA and (based on some crude experimentation) chose $\epsilon$ such that the average neighborhood size is $q \approx 20$. The same setting was used across all data sets and experiments.

Data Sets  As special cases for the choice of the loss function and regularizer in Eq. (1), we consider two commonly occurring problems in machine learning, namely least-squares regression and $\ell_2$-regularized logistic regression. We apply least-squares regression to the million song year regression dataset from the UCI repository. This dataset contains $n = 515{,}345$ data points, each described by $d = 90$ input features. We apply logistic regression to the cov and ijcnn1 datasets obtained from the libsvm website¹. The cov dataset contains $n = 581{,}012$ data points, each described by $d = 54$ input features. The ijcnn1 dataset contains $n = 49{,}990$ data points, each described by $d = 22$ input features. We added an $\ell_2$-regularizer $\Omega(w) = \mu\|w\|_2^2$ with $\mu = 10^{-3}$ to ensure the objective is strongly convex.

Experimental Protocol  We have run the algorithms in question in an i.i.d. sampling setting and averaged the results over 5 runs. Figure 2 shows the evolution of the value of the objective function $f$ as a function of the number of update steps performed. Note that all algorithms compute one stochastic gradient per update step, with the exception of q-SAGA, which is included here not as a practically relevant algorithm, but as an indication of potential improvements that could be achieved by using fresher corrections. A constant step size $\gamma$ has been used everywhere, except for plain SGD; $\gamma = 10^{-3}$ was found to be roughly the best in the set $\{10^{-1}, \dots, 10^{-5}\}$. For plain SGD we used a

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
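For concreteness, here is a minimal sketch of the $\ell_2$-regularized logistic-regression instance of Eq. (1) used in these experiments. The paper fixes only the loss, the regularizer, and $\mu = 10^{-3}$; the implementation details below are our own:

```python
import numpy as np

MU = 1e-3  # regularization strength used in the experiments

def f_i(w, x_i, y_i):
    """f_i(w) = log(1 + exp(-y_i * <x_i, w>)) + mu * ||w||_2^2, y_i in {-1, +1}."""
    return np.logaddexp(0.0, -y_i * (x_i @ w)) + MU * (w @ w)

def f_i_grad(w, x_i, y_i):
    """Gradient of f_i; plugged into update (2) as the stochastic gradient."""
    s = -y_i / (1.0 + np.exp(y_i * (x_i @ w)))  # d/dz of log(1 + exp(-y*z))
    return s * x_i + 2.0 * MU * w
```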


Figure 3: Test set error for εN-SAGA, q-SAGA, SAGA and SGD on three datasets. (Panels (a) cov, (b) ijcnn1, (c) year plot the test objective on a log scale over epochs 1-5 for the same algorithms as in Figure 2.)

schedule of the form $\gamma_t = \gamma_0/(T_0 + t)$ with constants optimized coarsely via cross-validation. The x-axis is expressed in units of $n$ (suggestively called "epochs").

SAGA vs. SGD  As we can see, if we run SGD with the same constant step size as SAGA, it takes at least 2-4 epochs until SAGA really shows a significant gain. Of course, if the SGD step size is chosen more conservatively, the gains are more significant from the start.

SAGA vs. q-SAGA  q-SAGA outperforms plain SAGA quite consistently when counting stochastic update steps. This establishes optimistic reference curves of what we can expect to achieve with εN-SAGA. The actual speed-up is somewhat data-set dependent.

εN-SAGA vs. SAGA and q-SAGA  εN-SAGA realizes quite a fraction of the possible gains and typically traces nicely between the SAGA and q-SAGA curves. On cov we see solid speed-ups in epochs 1 and 2, which then start wearing off. The same is true for ijcnn1. On year the differences are less significant, but note that here SAGA has only a small edge on SGD with a constant step size. At least the differences between εN-SAGA and SAGA are bigger than those between SAGA and SGD.

Asymptotics  It should be clearly stated that running εN-SAGA at a fixed $\epsilon$ for longer will not result in good asymptotics on the empirical risk. In our experiments, the cross-over point with SAGA was typically after 5-10 epochs. Note that the gains are quite significant, though, for single-epoch learning. We have found very similar results in a streaming setting, i.e. presenting the data once in random order.

Generalization Error  Although our analysis was carried out purely for the empirical loss, it is important to also take a look at the expected risk on a test set (as a proxy for generalization performance). Here the curves are somewhat more noisy. On cov and ijcnn1, εN-SAGA shows some improvements in early epochs; however, on year we get a less satisfying result. Note that here already q-SAGA fails.

5 Conclusion

We proposed a novel analysis method for variance-reducing SGD methods that demonstrates geometric convergence rates and provides a number of new insights, in particular about the role of the freshness of stochastic gradients evaluated at previous iterates. We have also investigated the effect of additional errors in the variance correction terms on the convergence behavior. Motivated by this, we have proposed εN-SAGA, a modification of SAGA that can achieve consistent and at times significant speed-ups in the initial phase of the optimization. Most remarkably, this algorithm can be run in a streaming mode, which is, to our knowledge, the first of its kind within the family of variance-reduced methods.


References

[1] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. arXiv preprint arXiv:1501.01062, 2015.
[2] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177-186. Springer, 2010.
[3] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.
[4] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems, pages 2142-2150, 2011.
[5] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[6] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
[7] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.
[8] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[9] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
[10] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567-599, 2013.
