Towards stability and optimality in stochastic gradient descent

arXiv:1505.02417v2 [stat.ME] 20 Oct 2015

Panos Toulis*, Dustin Tran*, and Edoardo M. Airoldi*

*Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Abstract

Iterative procedures for parameter estimation based on stochastic gradient descent allow the estimation to scale to massive data sets. However, in both theory and practice, they suffer from numerical instability. Moreover, they are statistically inefficient as estimators of the true parameter value. To address these two issues, we propose a new iterative procedure termed AISGD. For statistical efficiency, AISGD employs averaging of the iterates, which achieves the optimal Cramér-Rao bound under strong convexity, i.e., it is an optimal unbiased estimator of the true parameter value. For numerical stability, AISGD employs an implicit update at each iteration, which is related to proximal operators in optimization. In practice, AISGD achieves competitive performance with other state-of-the-art procedures. Furthermore, it is more stable than averaging procedures that do not employ proximal updates, and is simple to implement as it requires fewer tunable hyperparameters than procedures that do employ proximal updates.

Contents

1 Introduction
  1.1 Related work
  1.2 Overview of results
2 Preliminaries
3 Theory
  3.1 Computational efficiency
  3.2 Non-asymptotic analysis
4 Experiments
  4.1 Statistical efficiency and stability
  4.2 Classification error
5 Conclusion
6 Appendix
  6.1 Notation and assumptions
  6.2 Proof of Lemma 2
  6.3 Proof of Theorem 3
    6.3.1 Useful lemmas
  6.4 Proof of Theorem 5
  6.5 Data sets used in experiments
  6.6 Sensitivity analysis

1 Introduction

The majority of problems in statistical estimation can be cast as finding the parameter value θ? ∈ Θ such that

θ? = arg min_{θ∈Θ} E(L(θ, ξ)),   (1)

where the expectation is with respect to the random variable ξ ∈ Ξ ⊆ R^d that represents the data, Θ ⊆ R^p is the parameter space, and L : Θ × Ξ → R is a loss function. A popular procedure for solving Eq.(1) is stochastic gradient descent (SGD) (Zhang, 2004; Bottou, 2004), where a sequence θn approximates θ?, and is updated iteratively, one data point at a time, through the iteration

θn = θn−1 − γn ∇L(θn−1, ξn),   (2)

where {ξ1, ξ2, . . .} is a stream of i.i.d. realizations of ξ, and {γn} is a non-increasing sequence of positive real numbers, known as the learning rate. The nth iterate θn in SGD (2) can be viewed as an estimator of θ?. To evaluate such iterative estimators it is typical to consider three properties: convergence rate and numerical stability, by studying the mean-squared errors E(||θn − θ?||²); and statistical efficiency, by studying the limit nVar(θn), as n → ∞.

While computationally efficient, the SGD procedure (2) suffers from numerical instability and statistical inefficiency. Regarding stability, SGD is sensitive to the specification of the learning rate γn, since the mean-squared errors can diverge arbitrarily when γn is misspecified with respect to problem parameters, e.g., the convexity and Lipschitz parameters of the loss function (Benveniste et al., 1990; Moulines and Bach, 2011). Regarding statistical efficiency, SGD loses statistical information. In fact, the amount of information loss depends on the misspecification of γn with respect to the spectral gap of the matrix E(∇²L(θ?, ξ)) (Toulis et al., 2014), also known as the Fisher information matrix. Several solutions have been proposed to resolve these two issues, e.g., projections and gradient clipping. However, they are usually heuristic and hard to generalize. In this paper, we aim for the ideal combination of computational efficiency, numerical stability, and statistical efficiency using the following procedure:

θn = θn−1 − γn ∇L(θn, ξn),   (3)
θ̄n = (1/n) Σ_{i=1}^n θi.   (4)   (AI-SGD)

Our proposed procedure, termed AISGD, is comprised of two inner procedures. The first employs the updates given in Eq.(3), which are implicit because the iterate θn appears on both sides of the equation. Procedure (3), also known as implicit SGD (Toulis et al., 2014), aims to stabilize the updates of the classic SGD procedure (2). In fact, implicit SGD can be motivated as the limit of a sequence of improved classic SGD procedures, as follows. First, fix the sample history Fn−1 = {θ0^s, ξ1, ξ2, . . . , ξn−1}, where we use the superscript "s" in the classic SGD procedure in order to distinguish it from implicit SGD. Then, θn^s = θ_{n−1}^s − γn ∇L(θ_{n−1}^s, ξn) ≜ θn^(1). If we "trust" θn^(1) to be a better estimate of θ? than θ_{n−1}^s, then we can use θn^(1) instead of θ_{n−1}^s in computing the loss function at data point ξn. This leads to a revised update θn^s = θ_{n−1}^s − γn ∇L(θn^(1), ξn) ≜ θn^(2). Likewise, we can use θn^(2) instead of θn^(1), and so on. If we repeat this argument ad infinitum, we get the following sequence of improved SGD procedures:

θn^s = θ_{n−1}^s − γn ∇L(θ_{n−1}^s, ξn),
θn^s = θ_{n−1}^s − γn ∇L(θn^(1), ξn),
θn^s = θ_{n−1}^s − γn ∇L(θn^(2), ξn),
. . .
θn^s = θ_{n−1}^s − γn ∇L(θn^(∞), ξn),   (5)

where θn^(i) = θ_{n−1}^s − γn ∇L(θn^(i−1), ξn), with initial condition θn^(0) = θ_{n−1}^s. In the limit, assuming a unique fixed point is reached almost surely, the final procedure of sequence (5) satisfies θn^s = θ_{n−1}^s − γn ∇L(θn^(∞), ξn) = θn^(∞). This can be rewritten as θn^s = θ_{n−1}^s − γn ∇L(θn^s, ξn), which is equivalent to implicit SGD. Thus, implicit SGD can be viewed as a repeated application of classic SGD, where we keep updating the same iterate θ_{n−1}^s using the same data point ξn, until a fixed point is reached.
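As a concrete illustration of Eqs. (3)-(4), the following is a minimal sketch of AISGD for squared-error loss, where the implicit update has a closed form; the function name, defaults, and the 1/2 scaling of the loss are our own choices, not the authors' code.

```python
import numpy as np

def aisgd_least_squares(X, y, gamma1=1.0, gamma_exp=2/3):
    """Minimal AISGD sketch for squared-error loss L(theta, (x, y)) = (y - x'theta)**2 / 2.

    For this loss the implicit update (3) can be solved exactly: from
    theta_n = theta_{n-1} + gamma_n * (y_n - x_n'theta_n) * x_n one obtains
    theta_n = theta_{n-1} + gamma_n / (1 + gamma_n * ||x_n||^2) * (y_n - x_n'theta_{n-1}) * x_n.
    """
    n_obs, p = X.shape
    theta = np.zeros(p)       # implicit-SGD iterate theta_n, Eq. (3)
    theta_bar = np.zeros(p)   # running average bar(theta)_n, Eq. (4)
    for n in range(1, n_obs + 1):
        x, yn = X[n - 1], y[n - 1]
        gamma_n = gamma1 * n ** (-gamma_exp)       # learning rate, gamma_n = gamma_1 * n^(-gamma)
        residual = yn - x @ theta                  # residual at theta_{n-1}
        theta = theta + gamma_n / (1.0 + gamma_n * (x @ x)) * residual * x
        theta_bar += (theta - theta_bar) / n       # online iterate averaging
    return theta_bar
```

This closed form is the normalized least-mean-squares update of Nagumo and Noda (1967) discussed in Section 1.1; for general losses the implicit step requires the one-dimensional fixed-point solve of Section 3.1.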


The improvement in stability achieved by implicit SGD can be motivated through the following argument (a toy numerical check appears at the end of this section). Assume, for simplicity, that L is strongly convex, almost surely, with parameter µ > 0. Then, for the implicit SGD procedure (3), we have

θn + γn ∇L(θn, ξn) = θn−1,
||θn − θ?||² + 2γn (θn − θ?)ᵀ ∇L(θn, ξn) ≤ ||θn−1 − θ?||²,
(1 + γn µ) ||θn − θ?||² ≤ ||θn−1 − θ?||²,
||θn − θ?||² ≤ (1/(1 + γn µ)) ||θn−1 − θ?||²,

which implies that ||θn − θ?||² is contracting almost surely. The classic SGD procedure does not share this contracting property, even when L is strongly convex.

While the implicit updates of Eq.(3) aim to achieve stability, the averaging of the iterates in Eq.(4) aims to achieve statistical optimality. Ruppert (1988) gave a nice intuition on why iterate averaging can lead to statistical optimality. When the learning rate is γn ∝ n^{−1}, then θ̄n − θ? is a weighted average of n error variables ∇L(θi−1, ξi), which are therefore significantly autocorrelated. However, when γn ∝ n^{−γ}, γ ∈ (0, 1), then θ̄n − θ? is the average of n^γ log n error variables, which become uncorrelated in the limit. Thus, averaging improves the estimation accuracy.
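As promised above, the contraction can be checked numerically on a toy deterministic quadratic. This is our own illustration, not an experiment from the paper:

```python
# Deterministic 1-d quadratic L(theta) = (mu / 2) * theta**2, so grad L(theta) = mu * theta.
# Explicit SGD:  theta_n = (1 - gamma * mu) * theta_{n-1}  -> overshoots when gamma * mu > 2.
# Implicit SGD:  theta_n = theta_{n-1} / (1 + gamma * mu)  -> contracts for any gamma > 0.
mu, gamma = 10.0, 1.0
theta_exp, theta_imp = 1.0, 1.0
for n in range(10):
    theta_exp = (1 - gamma * mu) * theta_exp
    theta_imp = theta_imp / (1 + gamma * mu)
print(theta_exp)  # ~ 3.5e9: explicit iterates diverge
print(theta_imp)  # ~ 3.9e-11: implicit iterates contract
```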

1.1 Related work

The implicit update (3) is equivalent to

θn = arg min_{θ∈Θ} { (1/(2γn)) ||θ − θn−1||² + L(θ, ξn) }.   (6)

(A numerical sketch of this update appears at the end of this subsection.) Arguably, the first method that used an update similar to (6) for estimation was the normalized least-mean squares filter of Nagumo and Noda (1967), used in signal processing. This update is also used by the incremental proximal method in optimization (Bertsekas, 2011), and has shown superior performance to classic SGD both in theory and in applications (Bertsekas, 2011; Toulis et al., 2014; Défossez and Bach, 2015; Toulis and Airoldi, 2015). In particular, implicit updates lead to similar convergence rates as classic SGD updates, but are significantly more stable. This stability can also be motivated from a Bayesian interpretation of Eq.(6), where θn is the posterior mode of a model with the standard multivariate normal N(θn−1, γn I) as the prior, L(θ, ·) as the log-likelihood, and ξn as the observation.

A statistical analysis of procedure (3) without averaging was done by Toulis et al. (2014), who derived the asymptotic variance Var(θn) of θn, and provided an algorithm to efficiently solve the fixed-point equation (3) for θn in the family of generalized linear models, which we generalize in the current work. In the online learning literature, Kivinen et al. (2006) and Kulis and Bartlett (2010) have also analyzed implicit updates; Schuurmans and Caelli (2007) have further applied implicit procedures to learning with kernels.

Assuming that the expected loss ℓ is known, instead of update (6) we could use the update

θn⁺ = arg min_{θ∈Θ} { (1/(2γn)) ||θ − θn−1||² + ℓ(θ) }.   (7)

In optimization, the mapping from θn−1 to θn⁺ in Eq. (7) is known as a proximal operator, and is a special instance of the proximal point algorithm (Rockafellar, 1976). Thus, implicit SGD involves mappings that are stochastic versions of mappings from proximal operators. The stochastic proximal gradient algorithm (Singer and Duchi, 2009; Parikh and Boyd, 2013; Rosasco et al., 2014) is related to, but different from, implicit SGD. In contrast to implicit SGD, the stochastic proximal gradient algorithm first makes a classic SGD update (forward step), and then an implicit update (backward step). Thus, only the forward step is stochastic, whereas the backward proximal step is not. This may increase convergence speed, but may also introduce instability due to the forward step.

Interest in proximal operators has surged in recent years because they are non-expansive and converge under minimal assumptions. Furthermore, they can be applied to non-smooth objectives, and can easily be combined in modular algorithms for optimization in large-scale and distributed settings (Parikh and Boyd, 2013). The idea has also been generalized through splitting algorithms (Lions and Mercier, 1979; Beck and Teboulle, 2009; Singer and Duchi, 2009; Duchi et al., 2011). Krakowski et al. (2007) and Nemirovski et al. (2009) have shown that proximal methods can better fit the geometry of the parameter space Θ, and Toulis and Airoldi (2014) have made a connection to shrinkage methods in statistics.

Two recent procedures based on stochastic proximal updates are PROXSVRG (Xiao and Zhang, 2014) and PROXSAG (Schmidt et al., 2013, Section 6). The main idea in both methods is to periodically compute an estimate of the full gradient, averaged over all data points, in order to reduce the variance of the stochastic gradients. This requires a finite data setting, whereas AISGD also applies to streaming data. Moreover, the periodic calculations in PROXSVRG are controlled by additional hyperparameters, and the periodic calculations in PROXSAG require storage of the full gradient at every iteration. AISGD differs because it employs averaging to achieve statistical efficiency, has no additional hyperparameters or major storage requirements, and thus has a simpler implementation.

Averaging of the iterates in Eq.(4) is the other key component of AISGD. Averaging was proposed and analyzed in the stochastic approximation literature by Ruppert (1988) and Bather (1989). Polyak and Juditsky (1992) substantially expanded the scope of the averaging method by proving asymptotic optimality of the classic SGD procedure with averaging, under suitable assumptions. Their results showed clearly that slowly convergent stochastic approximations (achieved when the learning rates are large) need to be averaged. Recent work has analyzed only classic SGD with averaging (Zhang, 2004; Xu, 2011; Shamir and Zhang, 2012; Bach and Moulines, 2013) and has shown its superiority in numerous learning tasks.
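To make Eq. (6) concrete, the following sketch computes one implicit update by direct numerical minimization. This is a generic, illustrative implementation only (all names are ours); Section 3.1 shows how Lemma 1 reduces it to a far cheaper one-dimensional problem for losses of the form L(xᵀθ, y).

```python
import numpy as np
from scipy.optimize import minimize

def implicit_update(theta_prev, xi, loss, gamma_n):
    """One implicit/proximal update, Eq. (6): numerically minimize
    ||theta - theta_prev||^2 / (2 * gamma_n) + L(theta, xi) over theta."""
    objective = lambda theta: (np.sum((theta - theta_prev) ** 2) / (2.0 * gamma_n)
                               + loss(theta, xi))
    return minimize(objective, x0=theta_prev).x

# Example: logistic loss for one observation xi = (x, y), y in {0, 1}.
logistic = lambda theta, xi: np.log1p(np.exp(xi[0] @ theta)) - xi[1] * (xi[0] @ theta)
theta_next = implicit_update(np.zeros(3), (np.ones(3), 1.0), logistic, gamma_n=0.1)
```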

1.2 Overview of results

In this paper, we study the iterates θn and use the results to study θ̄n as an estimator of θ?. Under strong convexity of the expected loss, we derive upper bounds for the squared errors E(||θn − θ?||²) and E(||θ̄n − θ?||²) in Theorem 1 and Theorem 2, respectively. In the appendix, we also give bounds for E(||θn − θ?||⁴).

Two main results are derived from our theoretical analysis. First, θ̄n achieves the Cramér-Rao bound, i.e., no other unbiased estimator of θ? can do better in the limit; this is equivalent to the optimal O(1/n) rate of convergence for first-order procedures. Second, AISGD is significantly more stable than classic averaged SGD procedures under misspecification of the learning rate with respect to the learning problem parameters, e.g., convexity and Lipschitz constants. Finally, we perform experiments on several standard machine learning tasks, which show that AISGD comes closer to combining optimality, stability, and simplicity than other competing methods.

2 Preliminaries

Let Fn = {θ0, ξ1, ξ2, . . . , ξn} denote the filtration that the process θn (3) is adapted to. The norm || · || will denote the L2 norm. The symbol ≜ indicates a definition, and the symbol =ᵈᵉᶠ denotes "equal by definition". For example, x ≜ y defines x as equal to the known variable y, whereas x =ᵈᵉᶠ y denotes that x is equal to y by definition. We will not use this formalism when defining constants. For two positive sequences an, bn, we write bn = O(an) if there exists a fixed c > 0 such that bn ≤ c an, for all n; also, bn = o(an) if bn/an → 0. When a positive scalar sequence an is monotonically decreasing to zero, we write an ↓ 0. Similarly, for a sequence Xn of vectors or matrices, Xn = O(an) denotes that ||Xn|| = O(an), and Xn = o(an) denotes that ||Xn|| = o(an). For two matrices A, B, A ⪯ B denotes that B − A is nonnegative-definite; tr(A) denotes the trace of A. We now introduce the main assumptions pertaining to the theory of this paper.

Assumption 1. The loss function L(θ, ξ) is almost-surely differentiable. The random vector ξ can be decomposed as ξ = (x, y), x ∈ R^p, y ∈ R^d, such that

L(θ, ξ) = L(xᵀθ, y).   (8)

Assumption 2. The learning rate sequence {γn} is defined as γn = γ1 n^{−γ}, where γ1 > 0 and γ ∈ (1/2, 1].

Assumption 3 (Lipschitz conditions). For all θ1, θ2 ∈ Θ, a combination of the following conditions is satisfied almost-surely:
(a) The loss function L is Lipschitz-continuous with parameter λ0, i.e., |L(θ1, ξ) − L(θ2, ξ)| ≤ λ0 ||θ1 − θ2||;
(b) The map ∇L is Lipschitz-continuous with parameter λ1, i.e., ||∇L(θ1, ξ) − ∇L(θ2, ξ)|| ≤ λ1 ||θ1 − θ2||;
(c) The map ∇²L is Lipschitz-continuous with parameter λ2, i.e., ||∇²L(θ1, ξ) − ∇²L(θ2, ξ)|| ≤ λ2 ||θ1 − θ2||.

Assumption 4. The observed Fisher information matrix, Î(θ) ≜ ∇²L(θ, ξ), has non-vanishing trace, i.e., there exists φ > 0 such that tr(Î(θ)) ≥ φ, almost-surely, for all θ ∈ Θ. The expected Fisher information matrix, I(θ) ≜ E(Î(θ)), has minimum eigenvalue 0 < λf ≤ φ, for all θ ∈ Θ.

Assumption 5. The zero-mean random variable Wθ ≜ ∇L(θ, ξ) − ∇ℓ(θ) is square-integrable, such that, for a fixed positive-definite matrix Σ, E(Wθ? Wθ?ᵀ) ⪯ Σ.

Remarks. Assumption 1 puts a constraint on the loss function, but it is not very restrictive because the majority of machine learning models indeed depend on the parameter θ through a linear combination with features. A notable exception is loss functions with a regularization term. Although it is easy to add regularization to AISGD, we will not do so in this paper because AISGD works well without it, since the proximal operator (6) already regularizes the estimate θn towards θn−1. In experiments, regularization neither improved nor worsened AISGD (see the appendix for more details).

Assumption 2 on learning rates and Assumption 5 are standard in the literature of stochastic approximations, dating back to the original paper of Robbins and Monro (1951) in the one-dimensional parameter case. The assumptions on Lipschitz gradients (Assumption 3(b), Assumption 3(c)) can be relaxed; for example, Benveniste et al. (1990) relax this assumption using ||θ1 − θ2||^q. However, these two Lipschitz conditions are commonly used in order to simplify the non-asymptotic analysis (Moulines and Bach, 2011). Assumption 3(a) is less standard in the classic SGD literature but, so far, it is standard in the limited literature on implicit SGD (Bertsekas, 2011). However, we can forgo this assumption and still maintain identical rates for the errors, although at the expense of a more complicated analysis. It is also an open problem whether a stability result similar to Theorem 1 can be derived under Assumption 3(b) instead of Assumption 3(a). We discuss this issue after the proof of Theorem 3 in the appendix.

Assumption 4 makes two claims. The first claim, on the observed Fisher information matrix, is a relaxed form of strong convexity for the loss L(θ, ξ). However, in contrast to strong convexity, this claim allows several eigenvalues of ∇²L to be zero. The second claim of Assumption 4 is equivalent to strong convexity of the expected loss ℓ(θ). From a statistical perspective, strong convexity posits that there is information in the data for all elements of θ?. This assumption is necessary to derive bounds on the errors E(||θn − θ?||²), and has been used to show optimality of classic SGD with averaging (Polyak and Juditsky, 1992; Ljung et al., 1992; Xu, 2011; Moulines and Bach, 2011).

Overall, our assumptions are weaker than the assumptions in the limited literature on implicit SGD. For example, Bertsekas (2011, Assumptions 3.1, 3.2) assumes almost-surely bounded gradients ∇L(θ, ξ) in addition to Assumption 3(a); Ryu and Boyd (2014) assume strong convexity of L(θ, ξ), in expectation, which can simplify the analysis significantly. We discuss more details in the appendix after the proof of Theorem 3.

3 Theory

In this section we present our theoretical analysis of AISGD. All proofs are given in the appendix. The main technical challenge in analyzing implicit SGD (3) is that, unlike in the typical analysis of classic SGD (2), the noise ξn is not conditionally independent of θn. This implies that E(∇L(θn, ξn) | θn) ≠ ∇ℓ(θn), which makes it no longer possible to use the convexity properties of ℓ to analyze the errors E(||θn − θ?||²), as is common in the literature.

As mentioned earlier, to circumvent this issue other authors have made strict almost-sure assumptions on the implicit procedure (3) (Bertsekas, 2011; Ryu and Boyd, 2014). In this paper, we rely on weaker conditions, namely the Lipschitz assumptions 3(a)-3(c), which are also used for non-implicit procedures. Our proof strategy relies on a master lemma (Lemma 4 in the appendix) for the analysis of recursions that appear to be typical in implicit procedures. This result is novel to the best of our knowledge, and it can be useful in future research on implicit procedures.

3.1 Computational efficiency

Our first result enables efficient computation of the implicit update (3). In general, this update can be expensive because it requires solving a fixed-point equation in many dimensions at every iteration. We reduce this multi-dimensional equation to an equation in only one dimension. Furthermore, under almost-sure convexity of the loss function, efficient search bounds for the one-dimensional fixed-point equation are available. This result generalizes an earlier result on efficient computation of implicit updates for generalized linear models (Toulis et al., 2014, Algorithm 1).

Definition 1. Suppose that Assumption 1 holds. For observation ξ = (x, y), the first derivative with respect to the natural parameter xᵀθ is denoted by L′(θ, ξ), and is defined as

L′(θ, ξ) ≜ ∂L(θ, ξ)/∂(xᵀθ) =ᵈᵉᶠ ∂L(xᵀθ, y)/∂(xᵀθ).   (9)

Similarly, L″(θ, ξ) ≜ ∂L′(θ, ξ)/∂(xᵀθ).

Lemma 1. Suppose that Assumption 1 holds, and consider the functions L′, L″ from Definition 1. Then, almost-surely,

∇L(θn, ξn) = sn ∇L(θn−1, ξn);   (10)

the scalar sn satisfies the fixed-point equation

sn κn−1 = L′(θn−1 − sn γn κn−1 xn, ξn),   (11)

where κn−1 ≜ L′(θn−1, ξn). Moreover, if L″(θ, ξ) ≥ 0 almost-surely for all θ ∈ Θ, then sn ∈ [κn−1, 0) if κn−1 < 0, and sn ∈ [0, κn−1] otherwise.

Remarks. Lemma 1 has two parts. First, it shows that the implicit update can be performed by obtaining sn from the fixed-point Eq.(11), and then using ∇L(θn, ξn) = sn ∇L(θn−1, ξn) in the implicit update (3). The fixed-point equation can be solved through a numerical root-finding procedure (Kivinen et al., 2006; Kulis and Bartlett, 2010; Toulis et al., 2014). Second, when the loss function is convex, narrow search bounds for sn are available. This property holds, for example, when the loss function is the negative log-likelihood in the exponential family of models.
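As an illustration of the remarks above, the following sketch implements the one-dimensional reduction for logistic loss, using the search bounds of Lemma 1 to bracket a standard root-finder. This is a hypothetical helper written by us under the notation of the appendix proof (Eq. (24)), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import brentq

def implicit_logistic_update(theta, x, y, gamma_n):
    """One implicit update for logistic loss L(x'theta, y) = log(1 + exp(x'theta)) - y * x'theta,
    with y in {0, 1}, via Lemma 1's one-dimensional reduction.

    With g(u) = -L'(u) = y - sigmoid(u), the update solves the scalar fixed-point
    equation u* = gamma_n * g(u0 + u* * ||x||^2) (Eq. (24) in the appendix); the search
    bounds of Lemma 1 give the bracket between 0 and gamma_n * g(u0)."""
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    g = lambda u: y - sigmoid(u)                   # -L', nonincreasing since L'' >= 0
    u0, c = x @ theta, x @ x
    bound = gamma_n * g(u0)                        # gamma_n * g(u0) bounds u* (Lemma 1)
    if bound == 0.0:
        return theta
    h = lambda u: u - gamma_n * g(u0 + u * c)      # root of h is the fixed point u*
    u_star = brentq(h, min(0.0, bound), max(0.0, bound))
    return theta + u_star * x                      # theta_n = theta_{n-1} + u* x_n
```

The bracket is valid because h changes sign on it whenever g(u0) ≠ 0, which is exactly the case analysis (Cases 1-3) in the proof of Lemma 2 in the appendix.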

3.2 Non-asymptotic analysis

Our next result concerns the mean-squared errors E(||θn − θ?||²). These errors show the stability and convergence rates of implicit SGD, and are used in combination with bounds on the errors E(||θn − θ?||⁴) to derive bounds on the errors E(||θ̄n − θ?||²) of the averaged procedure.¹

Theorem 1. Suppose that Assumptions 1, 2, 3(a), and 4 hold. Define δn ≜ E(||θn − θ?||²), and constants Γ² = 4λ0² Σ γi² < ∞, ε = (1 + γ1(φ − λf))⁻¹, and λ = 1 + γ1 λf ε. Then, there exists a constant n0 > 0 such that, for all n > 0,

δn ≤ (8λ0² γ1 λ/(λf ε)) n^{−γ} + e^{−ε log λ · n^{1−γ}} [δ0 + λ^{n0} Γ²].

Remarks. According to Theorem 1, the convergence rate of the implicit iterates θn is O(n^{−γ}). This matches earlier results on rates of classic SGD (Benveniste et al., 1990; Moulines and Bach, 2011). The most important difference, however, is that the implicit procedure discounts the initial conditions δ0 at an exponential rate, regardless of the specification of the learning rate. As shown by Moulines and Bach (2011, Theorem 1), in classic SGD there exists a term exp(λ1² γ1² n^{1−2γ}) in front of the initial conditions, which can be catastrophic if the learning rate parameter γ1 is misspecified. In contrast, the implicit iterates are unconditionally stable, i.e., any specification of the learning rate will lead to a stable discounting of the initial conditions.

Theorem 2. Consider the AISGD procedure (4), and suppose that Assumptions 1, 2, 3(a), 3(c), 4, and 5 hold. Then,

(E(||θ̄n − θ?||²))^{1/2} ≤ (1/√n) tr(∇²ℓ(θ?)⁻¹ Σ ∇²ℓ(θ?)⁻¹)^{1/2} + O(n^{−1+γ/2}) + O(n^{−γ}) + O(exp(−ε log λ · n^{1−γ}/2)).

Remarks. The full version of Theorem 2, which includes all constants, is given in the appendix. Even in its shortened form, Theorem 2 delivers three main results. First, the iterates θ̄n attain the Cramér-Rao lower bound, i.e., no other unbiased estimator of θ? can have lower mean-squared error than θ̄n. From an optimization perspective, θ̄n attains the rate O(1/n), which is optimal for first-order methods (Nesterov, 2004). This result matches the asymptotic optimality of averaged iterates from classic SGD procedures, which was proven by Polyak and Juditsky (1992). Second, the remaining rates for the squared error are O(n^{−2+γ}) and O(n^{−2γ}); balancing them (−2 + γ = −2γ) implies the optimal choice γ = 2/3 for the exponent of the learning rate. This extends the results of Ruppert (1988), and more recently of Xu (2011) and Moulines and Bach (2011), on optimal exponents for classic SGD procedures. Third, as with the non-averaged implicit iterates in Theorem 1, the averaged iterates θ̄n discount the initial conditions regardless of the specification of the learning rate parameter. This stability property is inherited from the underlying implicit SGD procedure (3) that is being averaged. In contrast, averaged iterates of classic SGD procedures can diverge numerically, because arbitrarily large terms can appear in front of the initial conditions (Moulines and Bach, 2011, Theorem 3).

¹ The bounds for the fourth moments E(||θn − θ?||⁴) are given in the appendix because they rely on the same intermediate results as E(||θn − θ?||²).

4 Experiments

In this section, we show that AISGD achieves comparable, and sometimes superior, results to other methods while combining statistical efficiency, stability, and simplicity. In our experiments, we compare our procedure to the following procedures:

• SGD: Classic stochastic gradient descent in its standard formulation (Sakrison, 1965; Zhang, 2004), which employs the update θn = θn−1 − γn ∇L(θn−1, ξn).

• ISGD: The stochastic gradient descent procedure introduced in Toulis et al. (2014), which employs the implicit update (3) without averaging. It is robust to misspecification of the learning rate, but also exhibits slower convergence in practice relative to classic SGD.

• ASGD: Averaged stochastic gradient descent with standard updates of the iterates (Xu, 2011; Shamir and Zhang, 2012; Bach and Moulines, 2013). This is equivalent to AISGD where the update (3) is replaced by the explicit step θn = θn−1 − γn ∇L(θn−1, ξn).

• PROXSVRG: A proximal version of stochastic gradient descent with progressive variance reduction (SVRG) (Xiao and Zhang, 2014).

• PROXSAG: A proximal version of the stochastic average gradient (SAG) procedure (Schmidt et al., 2013). While its theory has not been formally established, PROXSAG has shown similar convergence properties to PROXSVRG.

• ADAGRAD: A stochastic gradient descent procedure with a form of diagonal scaling to adapt the learning rate (Duchi et al., 2011); see the sketch after this list.

Note that PROXSVRG and PROXSAG are applicable only to fixed data sets and not to the streaming setting. Therefore the theoretical linear convergence rate of these methods refers to convergence to an empirical minimizer (e.g., maximum likelihood, or maximum a-posteriori if there is regularization), and not to the ground truth θ?. On the other hand, AISGD can be applied in both data settings. We also note that ADAGRAD and similar adaptive schedules (Tieleman and Hinton, 2012; Kingma and Ba, 2015) effectively approximate the natural gradient I(θ)⁻¹ ∇L(θ, ξ) by using a multi-dimensional learning rate. These learning rates have the advantage of being less sensitive than one-dimensional rates to the tuning of hyperparameters, and can effectively be combined in practice with AISGD.
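For concreteness, a standard sketch of the ADAGRAD diagonal-scaling step follows. This is the textbook form of the update from Duchi et al. (2011), not code from this paper; names and default values are ours.

```python
import numpy as np

def adagrad_update(theta, grad, G_diag, eta=0.1, eps=1e-8):
    """One ADAGRAD step: per-coordinate learning rates derived from the
    accumulated squared gradients G_diag. Returns updated (theta, G_diag)."""
    G_diag = G_diag + grad ** 2
    theta = theta - eta * grad / (np.sqrt(G_diag) + eps)
    return theta, G_diag
```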

4.1 Statistical efficiency and stability

We first demonstrate the theoretical results on the stability and statistical optimality of AISGD. To do so, we follow a simple normal linear regression example from Bach and Moulines (2013). Let N = 10⁶ be the number of observations, and p = 20 the number of features. Let θ? = (0, 0, . . . , 0)ᵀ be the ground truth. The random variable ξ is decomposed as ξn = (xn, yn), where the feature vectors x1, . . . , xN ∼ Np(0, H) are i.i.d. normal random variables, and H is a randomly generated symmetric matrix with eigenvalues 1/k, for k = 1, . . . , p. The outcome yn is sampled from a normal distribution as yn | xn ∼ N(xnᵀθ?, 1), for n = 1, . . . , N. Our loss function is defined as the squared residual, i.e., L(θ, ξn) = (yn − xnᵀθ)², and thus ℓ(θ) = E(L(θ, ξ)) = (θ − θ?)ᵀ H (θ − θ?).
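A minimal sketch of this data-generating process follows. The exact construction of H is not specified by the paper; drawing a random orthogonal basis and fixing the stated eigenvalues, as below, is one choice consistent with the description.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10**6, 20
# Random symmetric H with eigenvalues 1/k, k = 1..p (construction is our assumption).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))          # random orthogonal basis
H = Q @ np.diag(1.0 / np.arange(1, p + 1)) @ Q.T
theta_star = np.zeros(p)                              # ground truth theta_* = 0
X = rng.multivariate_normal(np.zeros(p), H, size=N)   # x_n ~ N_p(0, H); ~160 MB for N = 1e6
y = X @ theta_star + rng.normal(size=N)               # y_n | x_n ~ N(x_n' theta_*, 1)
gamma = 1.0 / np.trace(H)                             # constant rate gamma = 1 / R^2
```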


We choose a constant learning rate γn ≡ γ according to the average radius of the data, R² = tr(H), and for both ASGD and AISGD we collect the iterates θn, n = 1, . . . , N, and keep the average θ̄n. In Figure 1, we plot the log excess risk ℓ(θ̄n) at each iteration, for a maximum of N iterations, in log-log space.


Figure 1: Loss of AISGD, ASGD, and ISGD on simulated multivariate normal data with N = 10⁶ observations and p = 20 features, for learning rates γ = 1/R² and γ = 2/R². The plot shows that AISGD achieves stability regardless of the specification of the learning rate γn ≡ γ. In contrast, ASGD diverges when the learning rate is only slightly misspecified (e.g., γ = 2/R²).

Figure 1 shows that AISGD performs on par with ASGD for the rates at which ASGD is known to be optimal. However, the benefit of the implicit procedure (3) in AISGD becomes clear as the learning rate increases. Notably, AISGD remains stable for learning rates that are above the theoretical threshold, i.e., when γ > 1/R2 , whereas ASGD diverges above that threshold, e.g., when γ = 2/R2 . This stable behavior is also exhibited in ISGD, but ISGD converges at a slower rate than AISGD, and thus does not combine stability with statistical efficiency. This behavior is also reflected for AISGD when using decaying learning rates, e.g., γn ∝ 1/n.

4.2 Classification error

We now conduct a study of AISGD's empirical performance on standard benchmarks of large-scale linear classification. For brevity, we display results on four data sets, although we have seen similar results on eight additional ones (see the appendix for more details). Table 1 displays a summary of the data sets. The COVTYPE data set (Blackard, 1998) consists of forest cover types, in which the task is to classify class 2 among 7 forest cover types. DELTA is synthetic data offered in the PASCAL Large Scale Challenge (Sonnenburg et al., 2008), and we apply the default processing offered by the challenge organizers. The task in RCV1 is to classify documents belonging to class CCAT in the text data set (Lewis et al., 2004), where we apply the standard preprocessing provided by Bottou (2012). In the MNIST data set (Le Cun et al., 1998) of images of handwritten digits, the task is to classify the digit 9 against all others.

For AISGD and ASGD, we use the learning rate γn = η0 (1 + η0 n)^{−3/4} prescribed in Xu (2011), where the constant η0 is determined through preprocessing on a small subset of the data; a minimal sketch of this schedule follows. Hyperparameters for the other methods are set based on a computationally intensive grid search over the entire hyperparameter space: this includes step sizes for PROXSAG, PROXSVRG, and ADAGRAD, and the inner iteration count for PROXSVRG. For all methods we use L2 regularization with a parameter λ that varies for each data set, and which is also used in Xu (2011).
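The Xu (2011) schedule used above is one line of code; a minimal sketch (the function name is ours):

```python
def xu_learning_rate(n, eta0):
    """Learning rate gamma_n = eta0 * (1 + eta0 * n) ** (-3/4) from Xu (2011),
    as used for ASGD and AISGD; eta0 is tuned on a small subset of the data."""
    return eta0 * (1.0 + eta0 * n) ** (-0.75)
```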

data set   description            type     features   training set   test set   λ
covtype    forest cover type      sparse   54         464,809        116,203    10^-6
delta      synthetic data         dense    500        450,000        50,000     10^-2
rcv1       text data              sparse   47,152     781,265        23,149     10^-5
mnist      digit image features   dense    784        60,000         10,000     10^-3

Table 1: Summary of data sets and the L2 regularization parameter, following the settings in Xu (2011).

The results are shown in Figure 2. We see that AISGD achieves comparable performance with the tuned proximal methods PROXSVRG and PROXSAG, as well as ADAGRAD. All methods have a comparable convergence rate and take roughly a single pass in order to converge. Interestingly, ADAGRAD exhibits a larger variance in its estimate than the proximal methods. This comes from the lesser-known fact that the learning rate in ADAGRAD is a suboptimal approximation of the Fisher information, and hence it is statistically inefficient.


[Figure 2 panels: test error vs. number of passes on covtype, delta, rcv1, and mnist, for AdaGrad, AISGD, ASGD, Prox-SAG, Prox-SVRG, and SGD.]

Figure 2: Large-scale linear classification with log loss on four data sets. Each plot indicates the test error of the stochastic gradient methods of Section 4 over a single pass of the data.

5 Conclusion

We propose a statistical learning procedure, termed AISGD, and investigate its theoretical and empirical properties. AISGD combines simple stochastic proximal steps, also known as implicit updates, with iterate averaging and larger step sizes. The proximal steps allow AISGD to be significantly more stable than classic SGD procedures, with or without averaging of the iterates; this stability comes at virtually no computational cost for a large family of machine learning models. Furthermore, the averaging of the iterates leads AISGD to be statistically optimal, i.e., the variance of the iterate θ̄n of AISGD achieves the Cramér-Rao lower bound, under strong convexity. Last but not least, AISGD is as simple to implement as classic SGD. In comparison, other stochastic proximal procedures, such as PROXSVRG or PROXSAG, require tuning of hyperparameters that control periodic calculations over the entire data set, and possibly storage of the full gradient.


References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781.

Bather, J. (1989). Stochastic approximation: A generalisation of the Robbins-Monro procedure, volume 89. Mathematical Sciences Institute, Cornell University.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Benveniste, A., Priouret, P., and Métivier, M. (1990). Adaptive algorithms and stochastic approximations.

Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195.

Blackard, J. (1998). Comparison of neural networks and discriminant analysis in predicting forest cover types. PhD thesis, Department of Forest Sciences, Colorado State University.

Bottou, L. (2004). Stochastic learning. In Advanced Lectures on Machine Learning, pages 146–168. Springer.

Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, volume 1, pages 421–436.

Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 205–213.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Kivinen, J., Warmuth, M. K., and Hassibi, B. (2006). The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5):1782–1793.

Krakowski, K. A., Mahony, R. E., Williamson, R. C., and Warmuth, M. K. (2007). A geometric view of non-linear on-line stochastic gradient descent. Author website.

Kulis, B. and Bartlett, P. L. (2010). Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 575–582.

Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lewis, D., Yang, Y., Rose, T., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397.


Lions, P.-L. and Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979.

Ljung, L., Pflug, G. C., and Walk, H. (1992). Stochastic approximation and optimization of random systems, volume 17. Springer.

Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459.

Nagumo, J.-I. and Noda, A. (1967). A learning method for system identification. IEEE Transactions on Automatic Control, 12(3):282–287.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Nesterov, Y. (2004). Introductory lectures on convex optimization, volume 87. Springer Science & Business Media.

Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898.

Rosasco, L., Villa, S., and Vũ, B. C. (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint arXiv:1403.5074.

Ruppert, D. (1988). Efficient estimators from a slowly convergent Robbins-Monro process. Technical report, School of Operations Research and Industrial Engineering, Cornell University.

Ryu, E. K. and Boyd, S. (2014). Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent. Author website, early draft.

Sakrison, D. J. (1965). Efficient recursive estimation; application to estimating the parameters of a covariance function. International Journal of Engineering Science, 3(4):461–483.

Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Technical report, HAL 00860051.

Schuurmans, L. C. S. and Caelli, S. (2007). Implicit online learning with kernels. Advances in Neural Information Processing Systems, 19:249.

Shamir, O. and Zhang, T. (2012). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. arXiv preprint arXiv:1212.1824.


Singer, Y. and Duchi, J. C. (2009). Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pages 495–503.

Sonnenburg, S., Franc, V., Yom-Tov, E., and Sebag, M. (2008). Pascal large scale learning challenge.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Toulis, P. and Airoldi, E. M. (2014). Implicit stochastic gradient descent. arXiv preprint arXiv:1408.2923.

Toulis, P. and Airoldi, E. M. (2015). Implicit stochastic approximation. arXiv preprint arXiv:1510.00967.

Toulis, P., Rennie, J., and Airoldi, E. (2014). Statistical analysis of stochastic gradient methods for generalized linear models. In 31st International Conference on Machine Learning.

Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24:2057–2075.

Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM.


6 Appendix

Consider a random variable ξ ∈ Ξ, a parameter space Θ that is convex and compact, and a loss function L : Θ × Ξ → R. We wish to solve the following stochastic optimization problem:

θ? = arg min_{θ∈Θ} E(L(θ, ξ)),   (12)

where the expectation is with respect to ξ. Define the expected loss

ℓ(θ) = E(L(θ, ξ)),   (13)

where L is differentiable almost-surely. In this work we study a stochastic approximation procedure to solve (12), defined through the iterations

θn = θn−1 − γn ∇L(θn, ξn), θ0 ∈ Θ,   (14)
θ̄n = (1/n) Σ_{i=1}^n θi,   (15)

where {ξ1, ξ2, . . .} are i.i.d. realizations of ξ, and ∇L(θ, ξn) is the gradient of the loss function with respect to θ given the realized value ξn. The sequence {γn} is a non-increasing sequence of positive real numbers. We will refer to the procedure defined by (14) and (15) as averaged implicit stochastic gradient descent, or AISGD for short. AISGD combines two ideas, namely an implicit update in Eq. (14), as θn appears on both sides of the update, and averaging of the iterates θn in Eq. (15).

6.1 Notation and assumptions

Let Fn = {θ0, ξ1, ξ2, . . . , ξn} denote the filtration that the process θn (14) is adapted to. The norm || · || will denote the L2 norm. The symbol ≜ indicates a definition, and the symbol =ᵈᵉᶠ denotes "equal by definition". For example, x ≜ y defines x as equal to the known variable y, whereas x =ᵈᵉᶠ y denotes that x is equal to y by definition. We will not use this formalism when defining constants. For two positive sequences an, bn, we write bn = O(an) if there exists a fixed c > 0 such that bn ≤ c an, for all n; also, bn = o(an) if bn/an → 0. When a positive scalar sequence an is monotonically decreasing to zero, we write an ↓ 0. Similarly, for a sequence Xn of vectors or matrices, Xn = O(an) denotes that ||Xn|| = O(an), and Xn = o(an) denotes that ||Xn|| = o(an). For two matrices A, B, A ⪯ B denotes that B − A is nonnegative-definite; tr(A) denotes the trace of A. We now introduce the main assumptions pertaining to the theory of this paper.

Assumption 6. The loss function L(θ, ξ) is almost-surely differentiable. The random vector ξ can be decomposed as ξ = (x, y), x ∈ R^p, y ∈ R^d, such that

L(θ, ξ) = L(xᵀθ, y).   (16)

Assumption 7. The learning rate sequence {γn} is defined as γn = γ1 n^{−γ}, where γ1 > 0 and γ ∈ (1/2, 1].

Assumption 8 (Lipschitz conditions). For all θ1, θ2 ∈ Θ, a combination of the following conditions is satisfied almost-surely:
(a) The loss function L is Lipschitz-continuous with parameter λ0, i.e., |L(θ1, ξ) − L(θ2, ξ)| ≤ λ0 ||θ1 − θ2||;
(b) The map ∇L is Lipschitz-continuous with parameter λ1, i.e., ||∇L(θ1, ξ) − ∇L(θ2, ξ)|| ≤ λ1 ||θ1 − θ2||;
(c) The map ∇²L is Lipschitz-continuous with parameter λ2, i.e., ||∇²L(θ1, ξ) − ∇²L(θ2, ξ)|| ≤ λ2 ||θ1 − θ2||.

Assumption 9. The observed Fisher information matrix, Î(θ) ≜ ∇²L(θ, ξ), has non-vanishing trace, i.e., there exists φ > 0 such that tr(Î(θ)) ≥ φ, almost-surely, for all θ ∈ Θ. The expected Fisher information matrix, I(θ) ≜ E(Î(θ)), has minimum eigenvalue 0 < λf ≤ φ, for all θ ∈ Θ.

Assumption 10. The zero-mean random variable Wθ ≜ ∇L(θ, ξ) − ∇ℓ(θ) is square-integrable, such that, for a fixed positive-definite matrix Σ, E(Wθ? Wθ?ᵀ) ⪯ Σ.

6.2 Proof of Lemma 2

Definition 2. Suppose that Assumption 6 holds. For observation ξ = (x, y), the first derivative with respect to the natural parameter xᵀθ is denoted by L′(θ, ξ), and is defined as

L′(θ, ξ) ≜ ∂L(θ, ξ)/∂(xᵀθ) =ᵈᵉᶠ ∂L(xᵀθ, y)/∂(xᵀθ).   (17)

Similarly, L″(θ, ξ) ≜ ∂L′(θ, ξ)/∂(xᵀθ).

Lemma 2. Suppose that Assumption 6 holds, and consider the functions L′, L″ from Definition 2. Then, almost-surely,

∇L(θn, ξn) = sn ∇L(θn−1, ξn);   (18)

the scalar sn satisfies the fixed-point equation

sn κn−1 = L′(θn−1 − sn γn κn−1 xn, ξn),   (19)

where κn−1 ≜ L′(θn−1, ξn). Moreover, if L″(θ, ξ) ≥ 0 almost-surely for all θ ∈ Θ, then sn ∈ [κn−1, 0) if κn−1 < 0, and sn ∈ [0, κn−1] otherwise.

Proof. Consider ξn = (xn, yn) according to Assumption 6. It holds that

∇L(θ, ξn) = ∇L(xnᵀθ, yn)   [by Assumption 6]
          = L′(θ, ξn) xn.   [by the chain rule and Definition 2]   (20)

Therefore, the gradient at each point θ has a direction that depends only on xn, which is parameter-free. Thus,

∇L(θn, ξn) = sn ∇L(θn−1, ξn),   (21)

for some scalar sn, since the two gradients are collinear regardless of the parameter θ. It follows that

θn = θn−1 − γn ∇L(θn, ξn)   [by definition of implicit SGD (14)]
   = θn−1 − γn sn ∇L(θn−1, ξn).   [by Eq.(21)]   (22)

Starting from Eq.(21), we now proceed as follows:

∇L(θn, ξn) = sn ∇L(θn−1, ξn)
L′(θn, ξn) xn = sn L′(θn−1, ξn) xn   [by Eq.(20)]
L′(θn, ξn) = sn L′(θn−1, ξn)
L′(θn−1 − γn sn ∇L(θn−1, ξn), ξn) = sn L′(θn−1, ξn)   [by Eq.(22)]
L′(θn−1 − γn sn κn−1 xn, ξn) = sn κn−1.   [using κn−1 ≜ L′(θn−1, ξn) and Eq.(20)]   (23)

Eq.(23) is equivalent to Eq.(19).

We now prove the second claim of the lemma regarding the search bounds for sn. As before, we fix ξn = (xn, yn). Because L(θ, ξn) = L(xnᵀθ, yn), the derivative L′(θ, ξn) in Eq.(17) is only a function of xnᵀθ. For notational convenience, we can therefore define g(u) ≜ −L′(θ, ξn), where u = xnᵀθ ∈ R. Also let u0 = xnᵀθn−1, c ≜ ||xn||² ≥ 0, and u? = γn sn g(u0); then the fixed-point equation (19) can be written as

u? = γn g(u0 + u? c),   (24)

where g is nonincreasing because L″ ≥ 0, by the lemma's assumption.

Case 1. If g(u0) = 0, then u? = 0.

Case 2. If g(u0) > 0, then u? > 0 because g is nonincreasing. Also γn g(u0 + uc) ≤ γn g(u0) for all u > 0, because g(u0 + uc) is nonincreasing as c > 0; taking u = u? in the previous inequality yields γn g(u0) ≥ γn g(u0 + u? c) = u?, by the fixed-point equation (24). Thus, 0 < u? ≤ γn g(u0).

Case 3. Similarly, if g(u0) < 0, then u? < 0 and γn g(u0 + uc) ≥ γn g(u0) for all u < 0, since g(u0 + uc) is nonincreasing; taking u = u? yields γn g(u0) ≤ γn g(u0 + u? c) = u?, by the fixed-point equation. Thus, γn g(u0) ≤ u? < 0.

A visual proof is given in Figure 3.

Figure 3: Search bounds for solution of Eq. (24). Case g(u0 ) > 0: Corresponds to curve (a) defined as γn g(u0 + uc), c > 0. The solution u? of fixed point equation (24) (corresponding to right triangle) is between 0 and γn g(u0 ) since curve (a) is nonincreasing. Case g(u0 ) < 0: Corresponds to curve (b) also defined as γn g(u0 + uc). The solution u? of fixed point equation (24) (left triangle) is between γn g(u0 ) and 0 since curve (b) is also nonincreasing.

6.3 Proof of Theorem 3

6.3.1 Useful lemmas

In this section, we prove certain lemmas on recursions that will be useful for the non-asymptotic analysis of the implicit procedures.

Lemma 3. Consider a sequence bn such that bn ↓ 0 and Σ_{i=1}^∞ bi = ∞. Then, there exists a positive constant K > 0 such that

∏_{i=1}^n 1/(1 + bi) ≤ exp(−K Σ_{i=1}^n bi).   (25)

Proof. The function x log(1 + 1/x) is increasing-concave in (0, ∞). Since bn ↓ 0 we can set x = 1/bn, implying that log(1 + bn)/bn is increasing. Let K = log(1 + b1)/b1; then log(1 + bn)/bn ≥ K, which implies that (1 + bn)⁻¹ ≤ exp(−K bn). Successive applications of this inequality yield Ineq. (25).

Lemma 4. Consider scalar sequences an ↓ 0, bn ↓ 0, and cn ↓ 0 such that an = o(bn) and A ≜ Σ_{i=1}^∞ ai < ∞. Suppose there exists n0 such that cn/bn < 1 for all n > n0. Define

δn ≜ (1/an)(a_{n−1}/b_{n−1} − an/bn)  and  ζn ≜ (cn/b_{n−1})(a_{n−1}/an),   (26)

and suppose that δn ↓ 0 and ζn ↓ 0. Fix n0 > 0 such that δn + ζn < 1 and (1 + cn)/(1 + bn) < 1, for all n ≥ n0. Consider a positive sequence yn > 0 that satisfies the recursive inequality

yn ≤ ((1 + cn)/(1 + bn)) y_{n−1} + an.   (27)

Then, for every n > 0,

yn ≤ K0 (an/bn) + Q^n_1 y0 + Q^n_{n0+1} (1 + c1)^{n0} A,   (28)

where K0 = (1 + b1)(1 − δ_{n0} − ζ_{n0})⁻¹, and Q^n_i = ∏_{j=i}^n (1 + cj)/(1 + bj), such that Q^n_i = 1 if n < i, by definition.

Proof. We consider the cases n < n0 and n ≥ n0, and then combine the respective bounds.

Analysis for n < n0. We first find a crude bound for Q^n_{i+1}. It holds that

Q^n_{i+1} ≤ (1 + c_{i+1})(1 + c_{i+2}) · · · (1 + cn) ≤ (1 + c1)^{n0},   (29)

since c1 ≥ cn (cn ↓ 0 by definition) and there are no more than n0 terms in the product (n < n0). From Ineq. (27) we get

yn ≤ Q^n_1 y0 + Σ_{i=1}^n Q^n_{i+1} ai   [by expanding recursive Ineq. (27)]
   ≤ Q^n_1 y0 + (1 + c1)^{n0} Σ_{i=1}^n ai   [using Ineq. (29)]
   ≤ Q^n_1 y0 + (1 + c1)^{n0} A.   (30)

This inequality also holds for n = n0; i.e.,

y_{n0} ≤ Q^{n0}_1 y0 + (1 + c1)^{n0} A.   (31)

Analysis for n ≥ n0. In this case, we have for all n ≥ n0,

(1 + b1)(1 − δn − ζn)⁻¹ ≤ K0   [by definition of n0, K0, and δn + ζn ↓ 0]
K0 (δn + ζn) + 1 + b1 ≤ K0
K0 (δn + ζn) + 1 + bn ≤ K0   [because bn ≤ b1, since bn ↓ 0]
K0 (a_{n−1}/b_{n−1} − an/bn) + K0 cn (a_{n−1}/b_{n−1}) + an (1 + bn) ≤ K0 an   [by definition of δn, ζn, multiplying by an]
an (1 + bn) ≤ K0 an + K0 (an/bn) − K0 (1 + cn)(a_{n−1}/b_{n−1})
an ≤ K0 ( an/bn − ((1 + cn)/(1 + bn)) (a_{n−1}/b_{n−1}) ).   (32)

Now we combine Ineqs. (32) and (27) to obtain

yn − K0 (an/bn) ≤ ((1 + cn)/(1 + bn)) (y_{n−1} − K0 (a_{n−1}/b_{n−1})).   (33)

For brevity, define zn ≜ yn − K0 an/bn. Then, from Ineq. (33), zn ≤ ((1 + cn)/(1 + bn)) z_{n−1}, and unfolding this recursion down to n0 gives

zn ≤ Q^n_{n0+1} z_{n0}.   (34)

Since z_{n0} ≤ y_{n0}, combining Ineq. (34) with Ineq. (31) yields, for every n > 0,

yn ≤ K0 (an/bn) + Q^n_1 y0 + Q^n_{n0+1} (1 + c1)^{n0} A,   (35)

since Q^n_i = 1 for n < i, by definition.

Corollary 1. In Lemma 4 assume an = a1 n^{−α}, bn = b1 n^{−β}, and cn = 0, where a1, b1, β > 0 and max{β, 1} < α < 1 + β. Then,

yn ≤ 2 (a1(1 + b1)/b1) n^{−α+β} + exp(−log(1 + b1) n^{1−β}) [y0 + (1 + b1)^{n0} A],   (36)

where n0 > 0 and A = Σ_i ai < ∞.

Proof. In this proof we will assume, for simplicity, (n − 1)^{−c} − n^{−c} ≤ n^{−1−c}, c ∈ (0, 1), for every n > 0. Furthermore, we assume Σ_{i=1}^n i^{−γ} ≥ n^{1−γ}, for every n > 0. Formally, this holds for n ≥ n0, where n0 in practice is very small (e.g., n0 = 14 if γ = 0.1, n0 = 5 if γ = 0.5, and n0 = 9 if γ = 0.9). It is straightforward to derive appropriate constants to make the inequalities tight for every n > 0. By definition,

δn = (1/an)(a_{n−1}/b_{n−1} − an/bn)
   = (1/(a1 n^{−α})) (a1/b1) ((n − 1)^{−α+β} − n^{−α+β})
   = (1/(b1 n^{−α})) [(n − 1)^{−α+β} − n^{−α+β}]
   ≤ (1/b1) n^{−1+β}.   (37)

Also, ζn = 0 since cn = 0. We can take n0 = ⌈(2/b1)^{1/(1−β)}⌉, for which δ_{n0} ≤ (1/b1) n0^{−1+β} ≤ 1/2. Therefore, K0 = (1 + b1)(1 − δ_{n0})⁻¹ ≤ 2(1 + b1). Since cn = 0, Q^n_i =ᵈᵉᶠ ∏_{j=i}^n (1 + bj)⁻¹. Thus,

Q^n_1 ≥ (1 + b1)^{−n},   (38)

and also

Q^n_1 ≤ exp(−(log(1 + b1)/b1) Σ_{i=1}^n bi)   [by Lemma 3]
Q^n_1 ≤ exp(−log(1 + b1) n^{1−β}).   [because Σ_{i=1}^n i^{−β} ≥ n^{1−β}]   (39)

Lemma 4 and Ineqs. (38) and (39) imply

yn ≤ K0 (an/bn) + Q^n_1 y0 + Q^n_{n0+1} (1 + c1)^{n0} A   [by Lemma 4]
   ≤ 2 (a1(1 + b1)/b1) n^{−α+β} + Q^n_1 [y0 + (1/Q^{n0}_1) A]   [by cn = 0, and Q^{n0}_1 Q^n_{n0+1} = Q^n_1]
   ≤ 2 (a1(1 + b1)/b1) n^{−α+β} + Q^n_1 [y0 + (1 + b1)^{n0} A]   [by Ineq. (38)]
   ≤ 2 (a1(1 + b1)/b1) n^{−α+β} + exp(−log(1 + b1) n^{1−β}) [y0 + (1 + b1)^{n0} A].   [by Ineq. (39)]   (40)

Lemma 5. Suppose that Assumptions 6, 8(a), and 9 hold. Then, almost surely,

sn ≥ 1/(1 + γn φ),   (41)
||θn − θn−1||² ≤ 4λ0² γn²,   (42)

where sn is defined in Lemma 2, and θn is the nth iterate of implicit SGD (14).

Proof. For the first part, from Lemma 2 and Assumption 6, the random variable sn satisfies (see Eq. (21) for the derivation)

L′(xnᵀθn, yn) = sn L′(xnᵀθn−1, yn).   (43)

Using definitions (14) and (17),

θn = θn−1 − γn sn L′(xnᵀθn−1, yn) xn,   (44)

where we used the fact that L′ is a function of θ only through xnᵀθ (see also the proof of Lemma 2). We use this expression for θn in Eq. (43) and perform a Taylor approximation on L′ to obtain

L′(xnᵀθn, yn) = L′(xnᵀθn−1, yn) − L̃″ γn sn L′(xnᵀθn−1, yn) ||xn||²,   (45)

where L̃″ ≜ L″(xnᵀθ̃, yn), θ̃ = δθn−1 + (1 − δ)θn, and δ ∈ [0, 1]. Starting from Eq. (43),

L′(xnᵀθn, yn) = sn L′(xnᵀθn−1, yn)   [Eq. (43)]
(1 + γn L̃″ ||xn||²) sn = 1   [by Eq. (45)]
(1 + γn tr(Î(θ̃))) sn = 1   [by Assumption 6, Î(θ̃) = L″(θ̃, ξ) xn xnᵀ]
(1 + γn φ) sn ≥ 1.   [by Assumption 9]   (46)

For the second part of this lemma, since the loss is almost-surely differentiable (Assumption 6) we can rewrite the definition of implicit SGD (14) as

θn = arg min_θ { (1/(2γn)) ||θ − θn−1||² + L(xnᵀθ, yn) }.

Therefore, comparing θ = θn−1 and θ = θn in the above objective, we obtain

(1/(2γn)) ||θn − θn−1||² + L(xnᵀθn, yn) ≤ L(xnᵀθn−1, yn)
||θn − θn−1||² ≤ 2γn (L(xnᵀθn−1, yn) − L(xnᵀθn, yn))
||θn − θn−1||² ≤ 2γn λ0 ||θn − θn−1||   [by Assumption 8(a)]
||θn − θn−1|| ≤ 2γn λ0
||θn − θn−1||² ≤ 4λ0² γn².   (47)

Theorem 3. Suppose that Assumptions 6, 7, 8(a), and 9 hold. Define δn ≜ E(||θn − θ?||²), and constants Γ² = 4λ0² Σ γi² < ∞, ε = (1 + γ1(φ − λf))⁻¹, and λ = 1 + γ1 λf ε. Then, there exists a constant n0 > 0 such that, for all n > 0,

δn ≤ (8λ0² γ1 λ/(λf ε)) n^{−γ} + e^{−ε log λ · n^{1−γ}} [δ0 + λ^{n0} Γ²].

(48)

The last term can be bounded since −γn ∇L(θn , ξn ) = θn − θn−1 by definition; thus, γn2 ||∇L(θn , ξn )||2 = ||θn − θn−1 ||2 ≤ 4λ20 γn2 ,

(49)

which holds almost-surely by Lemma 5-Ineq.(42). For the second-term we can bound its expectation as follows, E(2γn sn (θn−1 − θ? )| ∇L(θn−1 , ξn )) 2γn E ((θn−1 − θ? )| ∇L(θn−1 , ξn )) [by Lemma 5-Ineq. (41)] ≥ 1 + γn φ 2γn ≥ E ((θn−1 − θ? )| ∇`(θn−1 )) [where ∇`(θn−1 ) def = E(∇L(θn−1 , ξn )|Fn−1 )] 1 + γn φ γn λf ≥ ||θn−1 − θ? ||2 . [by strong-convexity in Assumption 9] (50) 1 + γn φ

24

Taking expectations in Eq. (48) and substituting Ineq. (49) and Ineq. (50) into Eq.(48) yields the recursion, γn λf   )E ||θn−1 − θ? ||2 + 4λ20 γn2 . E ||θn − θ? ||2 ≤ (1 − 1 + γn φ

(51)

By Assumption 9 φ ≥ λf and through simple algebra we obtain, (1 −

γn λf 1 + γn φ

)≤

1 , 1 + γn λf 

(52)

for all n > 0, where  = (1 + γ1 (φ − λf ))−1 . Therefore we can write recursion (51) as  E ||θn − θ? ||2 ≤

 1 E ||θn−1 − θ? ||2 + 4λ20 γn2 . 1 + γn λf 

(53)

We can now apply Corollary 1 with an ≡ 4λ20 γn2 and bn ≡ γn λf . Remarks. #1. Assuming Lipschitz continuity of the gradient ∇L instead of function L, i.e., Assumption 8(b) over Assumption 8(a) would not alter the main result of Theorem 3 about the O(n−γ ) rate of the mean-squared error. Assuming Lipschitz continuity with constant λ1 of ∇L and boundedness of E (||∇L(θ? , ξn )||2 ) ≤ σ 2 , as it is typical in the literature, would simply add a term γn2 λ21 E (||θn − θ? ||2 ) + γn2 σ 2 in the right-hand side of Ineq.(48). Specifically, by Lemma 2, sn ≤ 1, and thus    E ||∇L(θn , ξn )||2 = E s2n ||∇L(θn−1 , ξn )||2 ≤ E ||∇L(θn−1 , ξn )||2 = E ||∇L(θn−1 , ξn ) − ∇L(θ? , ξn ) + ∇L(θ? , ξn )||2   ≤ λ21 E ||θn−1 − θ? ||2 + γn2 E ||∇L(θ? , ξn )||2  ≤ λ21 E ||θn−1 − θ? ||2 + γn2 σ 2 . (54) The recursion for the implicit errors would then be  E ||θn − θ? ||2 ≤ (

 1 + λ21 γn2 )E ||θn−1 − θ? ||2 + γn2 σ 2 , 1 + γn λf 

which also implies the O(n−γ ) convergence rate. However, it is an open problem whether it is possible to derive a nice stability property for implicit SGD under Assumption 8(b) similar to the result of Theorem 3 under Assumption 8(a). Remarks. #2. An assumption of almost-sure convexity can simplify the analysis significantly. For example, similar to the assumption of Ryu and Boyd (2014), assume that L(θ, ξ) is stongly convex in expectation, i.e., (θn − θ? )| ∇L(θn , ξn ) ≥

25

µn ||θn − θ? ||2 , 2

(55)



where µn ≥ 0 and E (µn ) = µ > 0. Then, θn + 2γn ∇L(θn , ξn ) = θn−1 [by definition of implicit SGD (14)] ||θn − θ? || + 2γn (θn − θ? )| ∇L(θn , ξn ) ≤ ||θn−1 − θ? ||2 . (1 + γn µn )||θn − θ? ||2 ≤ ||θn−1 − θ? ||2 .   1 E ||θn − θ? ||2 ≤ E ||θn−1 − θ? ||2 + SD(1 + γn µn )SD(||θn − θ? ||2 ), 1 + γn µ (56) 2

where the last inequality follows from the identity E (XY ) ≥ E (X) E (Y ) − SD(X)SD(Y ). However, SD(1 + γn µn ) = O(γn ), and assuming bounded θn we get  E ||θn − θ? ||2 ≤

 1 E ||θn−1 − θ? ||2 + O(γn ), 1 + γn µ

(57)

which indicates a fast convergence towards θ? . It is also possible to use ||θn − θ? ||2 ≤

1 ||θn−1 − θ? ||2 , 1 + γn µn

(58)

and then use a stochastic version of Lemma 4.

6.4

Proof of Theorem 5

In this section, we prove Theorem 5. To do so, we need bounds for E(||θn − θ?||²), which are available through Theorem 3, but also bounds for E(||θn − θ?||⁴), which are established in the following theorem.

Theorem 4. Suppose that Assumptions 6, 7, 8(a), and 9 hold. For a constant K3 > 0, define ζn ≜ E(||θn − θ?||⁴), and constants ∆3 ≜ K3 Σ γi³ < ∞, ε ≜ (1 + γ1(φ − λf))⁻¹, and λ ≜ 1 + γ1 λf ε. Then, there exists a constant n0 such that, for all n > 0,

ζn ≤ (2K3 γ1² λ/(λf ε)) n^{−2γ} + e^{−ε log λ · n^{1−γ}} [ζ0 + λ^{n0} ∆3].

Proof. We start from Eq. (48). Define Wn ≜ sn (θn−1 − θ?)ᵀ ∇L(θn−1, ξn) for compactness, and proceed as follows:

||θn − θ?||² = ||θn−1 − θ?||² − 2γn sn (θn−1 − θ?)ᵀ ∇L(θn−1, ξn) + γn² ||∇L(θn, ξn)||²   [from Eq. (48)]
||θn − θ?||² = ||θn−1 − θ?||² − 2γn Wn + γn² ||∇L(θn, ξn)||²   [by definition]
||θn − θ?||² ≤ ||θn−1 − θ?||² − 2γn Wn + 4λ0² γn²   [from Ineq. (49)]
||θn − θ?||⁴ ≤ ||θn−1 − θ?||⁴ + 4γn² Wn² + 16λ0⁴ γn⁴ − 2γn ||θn−1 − θ?||² Wn + 4λ0² γn² ||θn−1 − θ?||² − 8λ0² γn³ Wn.   (59)

By Ineq. (50), we have

E(Wn | Fn−1) ≥ (λf/(2(1 + γn φ))) ||θn−1 − θ?||².   (60)

Furthermore,

E(Wn² | Fn−1) =ᵈᵉᶠ E([sn (θn−1 − θ?)ᵀ ∇L(θn−1, ξn)]² | Fn−1)
= E([(θn−1 − θ?)ᵀ ∇L(θn, ξn)]² | Fn−1)   [by Eq. (43)]
≤ ||θn−1 − θ?||² E(||∇L(θn, ξn)||² | Fn−1)   [by the Cauchy-Schwarz inequality]
≤ 4λ0² ||θn−1 − θ?||².   [by Ineq. (49)]   (61)

Define Bn ≜ E(||θn − θ?||²) for notational brevity. We use results (60) and (61) to get

E(||θn − θ?||⁴) ≤ (1 − γn λf/(1 + γn φ)) E(||θn−1 − θ?||⁴) + 4λ0² γn² (5 − γn λf/(1 + γn φ)) Bn−1 + 16λ0⁴ γn⁴
E(||θn − θ?||⁴) ≤ (1 − γn λf/(1 + γn φ)) E(||θn−1 − θ?||⁴) + 20λ0² γn² Bn−1 + 16λ0⁴ γn⁴
E(||θn − θ?||⁴) ≤ (1/(1 + γn λf ε)) E(||θn−1 − θ?||⁴) + 20λ0² γn² Bn−1 + 16λ0⁴ γn⁴   [by Eq. (52)]
E(||θn − θ?||⁴) ≤ (1/(1 + γn λf ε)) E(||θn−1 − θ?||⁴) + K0 γn³ + e^{−ε log λ · n^{1−γ}} K1 + K2 γn⁴,   [by Theorem 3]   (62)

where λ = 1 + γ1 λf ε and Γ² = 4λ0² Σ γi² (as in Theorem 3), K0 ≜ 160λ0⁴ λ/(λf ε), K1 ≜ 20λ0² (E(||θ0 − θ?||²) + λ^{n0} Γ²), K2 ≜ 16λ0⁴, and n0 is a constant defined in the proof of Theorem 3. Now, define

K3 ≜ K0 + K2 γ1 + max_n { e^{−ε log λ · n^{1−γ}} K1 / γn³ },   (63)

which exists and is finite. Through simple algebra it is easy to verify that

K0 γn³ + e^{−ε log λ · n^{1−γ}} K1 + K2 γn⁴ ≤ K3 γn³,   (64)

for all n. Therefore, we can simplify Ineq. (62) as

E(||θn − θ?||⁴) ≤ (1/(1 + γn λf ε)) E(||θn−1 − θ?||⁴) + K3 γn³.   (65)

27

Theorem 5. Consider the AISGD procedure 15 and suppose that Assumptions 6, 7, 8(a), 8(c), 9, and 10 hold. Then, 1/2  1 (E ||θ¯n − θ? ||2 )1/2 ≤ √ trace(∇2 `(θ? )−1 Σ∇2 `(θ? )−1 ) n 2γ + 1 + 1/2 (8λ20 γ1 λ/λf )1/2 n−1+γ/2 λf γ1 2γ + 1 1−γ + 1/2 [δ0 + λn0,1 Γ2 ]1/2 e− log λ·n /2 λf nγn +

λ2 (2K3 γ12 λ/λf )1/2 n−γ 1/2 2λf

+

λ2 [ζ + λn0,2 ∆3 ]1/2 K2 (n). 1/2 0 2nλf

(66)

P where K2 (n) = ni=1 exp (− log λ · i1−γ /2), and constants λ, , n0,1 , δ0 , Γ2 are defined in Theorem 3 (susbtituting n0 for n0,1 ), and ζ0 , n0,2 , ∆3 are defined in Theorem 4, substituting (n0 for n0,2 ). Proof. We leverage a result shown for averaged explicit stochastic gradient descent. In particular, it has been shown that the squared error for the averaged iterate satisfies:  1/2 1 (E ||θ¯n − θ? ||2 )1/2 ≤ √ trace(∇2 `(θ? )−1 Σ∇2 `(θ? )−1 ) n  2γ + 1 + 1/2 (E ||θn − θ? ||2 )1/2 λf nγn n 1/2 λ2 X (E ||θi − θ? ||4 . (67) + 1/2 2nλf i=1 The proof technique for (67) was first devised by Polyak and Juditsky (1992), but was later significantly refined by Xu (2011), and Moulines and Bach (2011). In this paper,we follow the formulation of Moulines and Bach (2011, Theorem 3, page 20); the derivation of Ineq.(67) for the implicit procedure is identical to the derivation for the explicit one, however the two procedures differ in the terms that appear in the bound (67). All such terms in (67) have been bounded in the previous sections. In particular, we can use Theorem 3 for E (||θn − θ? ||2 ); we can also use Theorem 5 and the concavity of the square-root to derive n n   X 1/2 X 1−γ (E ||θi − θ? ||4 ≤ (2K3 γ12 λ/λf )1/2 i−γ + e− log λ·i /2 [ζ0 + λn0,2 ∆3 ]1/2 i=1

i=1

≤ (2K3 γ12 λ/λf )1/2 n1−γ + K2 (n)[ζ0 + λn0,2 ∆3 ]1/2 , (68)  P where K2 (n) = ni=1 exp − log2 λ i1−γ , ζ0 = E (||θ0 − θ? ||4 ), and ∆3 , n0,2 are defined in Theorem 4, substituing n0 for n0,2 in the Theorem. Similarly, using Theorem 3, 1/2 1−γ (E ||θn − θ? ||2 ≤ (8λ20 γ1 λ/λf )1/2 n−γ/2 + e− log λ·n /2 [δ0 + λn0,1 Γ2 ]1/2 , 28

where δ0 = E (||θn − θ? ||2 ), and n0,1 , Γ2 are defined in Theorem 3, substituing n0,1 for n0 . These two bounds can be used in Ineq.(67) and thus yield the result of Theorem 5.

6.5 Data sets used in experiments

data set   description            type     features   training set   test set   λ
covtype    forest cover type      sparse   54         464,809        116,203    10^-6
delta      synthetic data         dense    500        450,000        50,000     10^-2
rcv1       text data              sparse   47,152     781,265        23,149     10^-5
mnist      digit image features   dense    784        60,000         10,000     10^-3
sido       molecular activity     dense    4,932      10,142         2,536      10^-3
alpha      synthetic data         dense    500        400k           50k        10^-5
beta       synthetic data         dense    500        400k           50k        10^-4
gamma      synthetic data         dense    500        400k           50k        10^-3
epsilon    synthetic data         dense    2,000      400k           50k        10^-5
zeta       synthetic data         dense    2,000      400k           50k        10^-5
fd         character image        dense    900        1000k          470k       10^-5
ocr        character image        dense    1,156      1000k          500k       10^-5
dna        DNA sequence           sparse   800        1000k          1000k      10^-3

Table 2: Summary of data sets and the L2 regularization parameter λ used.

Table 2 includes a full summary of all data sets considered in our experiments. The majority of regularization parameters are set according to Xu (2011).

6.6 Sensitivity analysis

We examine the inherent stability of the aforementioned procedures by perturbing their hyperparameters. That is, we perform a sensitivity analysis by varying any hyperparameters that the user must tweak in order to fine-tune the convergence of each procedure. We do so for the hyperparameters in ASGD (the learning rate), PROXSVRG (the proximal step size η and the inner iteration count m), and AISGD (the learning rate). The results are shown in Figure 4.

When we decrease the regularization parameter, ASGD performs progressively worse. While it may converge, the test error can be arbitrarily large. On the other hand, AISGD always achieves convergence and is not affected by the choice of the hyperparameter. When the regularization parameter is about 1/N, e.g., when λ < 1e-6, ASGD remains stable and achieves the same performance as AISGD. Similar results hold when perturbing the hyperparameters η and m in PROXSVRG, whereas AISGD does not require specification of such hyperparameters.

[Figure 4 panels: test error vs. number of passes. Top: AISGD vs. ASGD at λ = 1e-6, 1e-5, 1e-4. Bottom: AISGD (λ = 1e-7, 1e-6, 1e-4) vs. PROXSVRG (η = 1e-4 and η = 1e-2 with m = N/10; η = 1e-1 with m = N/20).]

Figure 4: Top: Logistic regression on the RCV1 dataset, performing sensitivity analysis of AISGD and ASGD for the choice of regularization parameter λ. Bottom: linear SVM on the covtype dataset, performing sensitivity analysis of AISGD and PROXSVRG, in which PROXSVRG has additional hyperparameters η according to the step size of the proximal update and m according to the inner iteration count.
