ON STOCHASTIC PROXIMAL GRADIENT ... - Semantic Scholar

Report 3 Downloads 161 Views
ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS ´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

Abstract. We study a perturbed version of the proximal gradient algorithm for which the gradient is not known in closed form and should be approximated. We address the convergence and derive a non-asymptotic bound on the convergence rate for the perturbed proximal gradient, a perturbed averaged version of the proximal gradient algorithm and a perturbed version of the fast iterative shrinkagethresholding (FISTA) of Beck and Teboulle (2009). When the approximation is achieved by using Monte Carlo methods, we derive conditions involving the Monte Carlo batch-size and the step-size of the algorithm under which convergence is guaranteed. In particular, we show that the Monte Carlo approximations of some averaged proximal gradient algorithms and a Monte Carlo approximation of FISTA achieve the same convergence rates as their deterministic counterparts. To illustrate, we apply the algorithms to high-dimensional generalized linear mixed models using `1 -penalization.

Keywords Proximal Gradient Methods; Stochastic Optimization; Monte Carlo approximations; Non convex optimization; Perturbed Majorization-Minimization algorithms 1. Introduction This paper deals with statistical optimization problems of the form: (P)

min F (θ)

with F = f + g ,

θ∈Θ

where f is a smooth function and g is a possibly non-smooth convex penalty term. It focuses on the case where f + g and ∇f are both intractable. In penalized likelihood estimation, f = −` where ` is the log-likelihood function. Intractability arises because the likelihood is known only up to a normalization factor, depending on the parameter to estimate, which cannot be computed explicitly. This is the case when considering for example parameter inference in random fields or undirected graphical models. In such case, ` and ∇` are integrals with respect to (w.r.t.) some Gibbs probability measure πθ on some measurable space X and known only up to the partition function Zθ (normalization constant). Intractable likelihood 2000 Mathematics Subject Classification. 60F15, 60G42. Y. F. Atchad´e: University of Michigan, 1085 South University, Ann Arbor, 48109, MI, United States. E-mail address: [email protected]. Gersende Fort: LTCI, CNRS & Telecom ParisTech, 46, rue Barrault 75634 Paris Cedex 13, France. E-mail address: [email protected]. Eric Moulines: LTCI, Telecom ParisTech & CNRS, 46, rue Barrault 75634 Paris Cedex 13, France. E-mail address: [email protected]. 1

2

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

functions and intractable gradient also arise when dealing with hierarchical latent variable models, including missing data or mixed effects models. Here again, `(θ) and ∇`(θ) are integrals w.r.t. a distribution πθ which represents the conditional distribution of the latent/missing variables given the parameter and the data. In both cases, πθ not only depends on θ, but is often difficult to simulate and typically requires to use Markov Chain Monte Carlo (MCMC) methods. Another source of intractability PN arises when doing learning on a huge data set. In this case, f is written as f = i=1 fi where N is the sample size. In this case, f and ∇f can be computed explicitly but it is more practical to reduce the computational cost by using Monte Carlo estimation of these quantities (see e.g. the survey on incremental stochastic gradient and subgradient methods by Bertsekas (2012)) Intractable problems also occur in onlineR learning and stochastic approximation, where the function f takes the form f (θ) = f¯(θ; x)π(dx) with an unknown distribution π: the user is only provided with random samples from π by an oracle. The goal is to minimize some parameter-dependent functional of the distribution to minimize an empirical risk or maximize a likelihood function. To cope with problems where f + g is intractable and possibly non-smooth, various methods have been proposed. Some of these work focused on stochastic sub-gradient and mirror descent algorithms; see Nemirovski et al. (2008); Duchi et al. (2011); Cotter et al. (2011); Lan (2012); Juditsky and Nemirovski (2012a,b). Other authors have proposed algorithms based on proximal operators to better exploit the smoothness of f and the properties of g (Hu et al. (2009); Xiao (2010)). Proximal algorithms are well established optimization algorithms for dealing with non-smooth objective functions; Beck and Teboulle (2010); Parikh and Boyd (2013); Juditsky and Nemirovski (2012a,b). The current paper focuses on the proximal gradient algorithm (see e.g. Nesterov (2004)) and some perturbed versions: the gradient ∇f (θn ) at the current estimate θn is replaced by an approximation Hn+1 . to solve the problem (P). We provide sufficient conditions on the perturbation ηn+1 = Hn+1 − ∇f (θn ) to obtain convergence of the sequence {F (θn ), n ∈ N} and convergence of the sequence {θn , n ∈ N}. These results are obtained under weak conditions on f and we provide a characterization of the limit points of the sequences {F (θn ), n ∈ N} and {θn , n ∈ N} (see Theorem 6). In the case where f is convex, the limit set of {θn , n ∈ N} is the set of the solutions of (P ) (see Theorem 7). Our results therefore cover the case of strongly convex function f , despite in this setting, the tools are significantly different and the rates of convergence as well (see. e.g. Rosasco et al. (2014); Nitanda (2014); Xiao and Zhang (2014)). Corollary 8 gives a special emphasis to the case the approximation error ηn+1 is random and f is convex. Conditions (expressed in terms of the conditional bias and the conditional variance of the error ηn+1 ) are given in order to ensure convergence; these conditions improve on the earlier works by Combettes and Wajs (2005); Rosasco et al. (2014). We then consider an averaging scheme of the perturbed proximal gradient algodef Pn rithm: given non-negative weights {an , n ∈ N} withPpartial sum An = k=1 ak , n Theorem 10 addresses the rate of convergence of A−1 a F (θ ) to the minimum k k n k=1

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

3

F? of the function F . Rates of convergence in expectation are also provided in the case of a stochastic perturbation ηn+1 (see Corollary 11). We also extend the analysis to perturbed version of the fast iterative shrinkage-thresholding algorithm (FISTA) of Beck and Teboulle (2009). Theorem 13 provides sufficient conditions for the convergence of {F (θn ), n ∈ N}. Here again, an emphasis is given on the stochastic setting: Corollary 14 and Corollary 15 provide rates of convergence of {F (θn ), n ∈ N} to the minimum F? of the function F , for resp. almost-sure convergence and convergence in expectation. In this stochastic setting, our results for the averaging scheme and the perturbed FISTA improve on previous work (see e.g. Schmidt et al. (2011)). Both the perturbed averaged proximal gradient algorithm and the perturbed FISTA depend on design parameters: the stepsize sequence {γn , n ∈ N} in the proximal operator, the weight sequence {an , n ∈ N} in the averaged proximal gradient (resp. the weight sequence in FISTA). We discuss the choice of these parameters in the case of Monte Carlo proximal gradient wherePat each iteration the current gradient mn+1 ∇f (θn ) is approximated by the mean m−1 n+1 k=1 Hθn (Xn+1,k ). The choice of the Monte Carlo batch size mn+1 is also discussed under assumptions on the convergence rates to zero of the error ηn+1 which are satisfied with Importance Sampling Monte Carlo and Markov chain Monte Carlo approximations. We show that this perturbed averaged proximal gradient algorithm (resp. the perturbed FISTA) can reach the same convergence rate as the non perturbed ones for convenient choices of {γn , n ∈ N}, {mn , n ∈ N} and of the weight sequences: the perturbed averaged proximal gradient algorithm (resp. perturbed FISTA) converges at the rate n−1 (resp. n−2 ) when the stepsize is constant γn = γ and the Monte Carlo batch size {mn , n ∈ N} increases to infinity at a convenient rate. However, since these algorithms require the approximation of the gradient at increasing precision (which is costly), the convergence rates expressed as a function of the number of iterations do not yield the complete picture as they do not account for the computational cost of approximating the gradient. A better measure of the performance of these algorithms is the total number of Monte Carlo samples needed to approximate the solution at a given precision δ > 0. Our results imply that the number of Monte Carlo samples needed to achieve a precision δ is O δ −2 for the stochastic averaged gradient proximal algorithm and O δ −2−κ for some κ > 0, for the perturbed FISTA. We illustrate these results with the problem of estimating the parameters of a `1 penalized random effect logistic regression model. The simulation results support the theoretical findings reported above. The paper is organized as follows. In section 2 our basic assumptions are stated, the proximal gradient algorithm is described and its main properties are recalled. In section 3, the convergence results for the perturbed proximal gradient, some averaged version and the FISTA algorithm are stated. section 4 is devoted to the case ηn+1 is a Monte Carlo sum. An application of the stochastic proximal gradient algorithm to

4

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

the `1 -penalized inference of a logistic regression with mixed effects is presented in section 5. The proofs are postponed to section 6. 2. The proximal gradient algorithm In the sequel Θ denotes a finite-dimensional Euclidean space with norm k · k and inner product h·, ·i. Let f : Θ → R, and g : Θ → (−∞, +∞] be functions. We consider the problem of finding solutions to (P) when f, g satisfy the conditions H 1. g : Θ → (−∞, +∞] is convex, not identically +∞, and lower semi-continuous. The function f : Θ → R is continuously differentiable and there exists a finite constant L such that, for all θ, θ0 ∈ Θ, k∇f (θ) − ∇f (θ0 )k ≤ Lkθ − θ0 k , where ∇f denotes the gradient of f . This assumption implicitly holds in all the sequel. Note that since the function g is convex, it is continuous on its domain Dom(g) = {θ ∈ Θ, |g(θ)| < ∞}. In the examples below, f is the negative log-likelihood (and is denoted by −`) and g is a penalty function. Example 1 (High-dimensional logistic regression with random effects). We model binary responses {Yi }N i=1 ∈ {0, 1} as N conditionally independent realizations of a random effect logistic regression model,  ind. Yi |U ∼ Ber s(x0i β + σzi0 U) , 1 ≤ i ≤ N , (1) where xi ∈ Rp is the vector of covariates, zi ∈ Rq are (known) loading vector, Ber(α) denotes the Bernoulli distribution with parameter α ∈ [0, 1], s(x) = ex /(1 + ex ) is the cumulative distribution function of the standard logistic distribution. The random effect U is assumed to be standard Gaussian U ∼ Nq (0, I). The log-likelihood of the observations at θ = (β, σ) ∈ Θ = Rp × (0, ∞) is given by `(θ) = log

Z Y N

1−Yi s(x0i β + σzi0 u)Yi 1 − s(x0i β + σzi0 u) φ(u)du ,

(2)

i=1

where φ is the density of a Rq -valued standard Gaussian random vector. The number of covariates p is possibly larger than N , but only a very small number of these covariates are relevant which suggests to use the elastic-net penalty   1−α 2 λ kβk2 + αkβk1 , (3) 2 P where λ > 0 is the regularization parameter, kβkr = ( pi=1 |βi |r )1/r and α ∈ [0, 1] controls the trade-off between the `1 and the `2 penalties. In this example,   1−α g(θ) = λ kβk22 + αkβk1 . (4) 2

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

5

Define the conditional log-likelihood of Y = (Y1 , . . . , YN ) given U (the dependence upon Y is omitted) by N X    `c (θ|u) = Yi x0i β + σzi0 u − ln 1 + exp x0i β + σzi0 u , i=1

and the conditional distribution of the random effect U given the observations Y and the parameter θ πθ (u) = exp (`c (θ|u) − `(θ)) φ(u) . (5) The Fisher identity implies that the gradient of the log-likelihood (2) is given by  ) Z Z (X N x 0 0 ∇`(θ) = ∇θ `c (θ|u) πθ (u) du = (Yi − s(xi β + σzi u)) 0 i πθ (u) du . zi u i=1

The Hessian of the log-likelihood ` is given by (see e.g.(McLachlan and Krishnan, 2008, Chapter 3))   ∇2 `(θ) = Eπθ ∇2θ `c (θ|U) + Covπθ (∇θ `c (θ|U)) where Eπθ and Covπθ denotes the expectation and the covariance with respect to the distribution πθ , respectively. Since    0 N X  xi xi 0 0 0 0 2 s(xi β + σzi u) 1 − s(xi β + σzi u) , ∇θ `c (θ|u) = − zi0 u zi0 u i=1 R and supθ∈Θ kuk2 πθ (u) du < ∞ (see Appendix A), ∇2 `(θ) is bounded on Θ. Hence, ∇`(θ) satisfies the Lipschitz condition showing that H 1 is satisfied.  Example 2 (Network structure estimation). Consider a Markov random field over a finite set X with joint probability distribution   p   X X 1 fθ (x1 , . . . , xp ) = exp θii B0 (xi ) + θij B(xi , xj ) , (6)   Zθ i=1

1≤j 0. Notice

6

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

that we do not penalize the diagonal terms of θ. Therefore, we compute the minimum of F = −` + g where N 1 X D ¯ (i) E `(θ) = θ, B(x ) − log Zθ and g(θ) = λ N i=1

X

|θjk | + IKa (θ) ;

1≤k<j≤p

¯ : Xp → Rp×p is defined by the matrix-valued function B ¯ii (x) = B0 (xi ) B

¯ij (x) = B(xi , xj ) , i 6= j , B

λ is the positive regularization parameter and IC is defined by  0 if ϑ ∈ C , IC (ϑ) = +∞ otherwise .

(7)

Observe that g is convex, not identically equal to +∞ and is lower-continuous (since C is closed, see (Bauschke and Combettes, 2011, Example 1.25)). Upon noting that (6) is a canonical exponential model, (Shao, 2003, Section 4.4.2) shows that θ 7→ −`(θ) is convex and Z N 1 X ¯ (i) ¯ B(z)f (8) B(x ) − ∇`(θ) = θ (z)µ(dz) , N p X i=1

where µ is the counting measure on Xp . In addition, (see Appendix B)  k∇`(θ) − ∇`(ϑ)k2 ≤ p (p − 1)osc2 (B) + osc2 (B0 ) kθ − ϑk2 ,

(9)

˜ : X × X → R, osc(B) ˜ = supx,y,u,v∈X |B(x, ˜ y) − B(u, ˜ v)|. Hence, where for a function B H 1 holds.  Example 3 (large scale convex optimization). Consider the case when f is the average of many smooth component functions fi i.e. f (θ) =

N 1 X fi (θ) , N i=1

and all the component functions fi admit a Lipschitz-continuous gradient which can be explicitly computed. Such a situation occurs in statistical inference problems (in that case N is the total number of observations and fi is the loss function associated to the i-th observation), in distributed optimization, in the minimization of an expected value when the random variable takes many finite values; see e.g. Bertsekas (2012) and the references therein. In the large scale problems i.e. when N  1, the computational cost of ∇f (θ) is so high that it is advocated to useP incremental methods by substituting, at iteration n+1, mn+1 the gradient ∇f (θn ) by m−1 n+1 k=1 ∇fIk (θn ) where {Ik , k ≥ 1} are, conditionally to the past Fn , i.i.d. random variables drawn at random in {1, · · · , N } and independent of Fn . 

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

7

The proximal map (see e.g. Parikh and Boyd (2013)) associated to g is defined for γ > 0 by:   1 def 2 Proxγ (θ) = Argminϑ∈Θ g(ϑ) + (10) kϑ − θk . 2γ Note that under H 1, there exists an unique point ϑ minimizing the RHS for any θ ∈ Θ and γ > 0. There are many important examples of penalty functions for which the proximal map is tractable. For example, for the projection (7), the proximal operator is the orthogonal projection on C, that is the map ΠC : Θ → Θ such that kθ − ΠC (θ)k = minϑ∈C kθ − ϑk. For the elastic-net penalty (3), Proxγ (θ) is the component-wise soft-thresholding operator defined as  θi −γλα  if θi ≥ γλα ,  1+γλ(1−α) θi +γλα (sγ,λ,α (θ))i = (11) if θi ≤ −γλα ,   1+γλ(1−α) 0 if θi ∈ (−γλα, γλα) . Note that the case α = 1 corresponds to the Lasso penalty and the case α = 0 corresponds to the ridge-regression penalty. When both of these penalties are combined, and C is a rectangle, the proximal operator is ΠC (sγ,λ,α (θ)). We can now introduce the proximal gradient algorithm (see Beck and Teboulle (2009)): Algorithm 1 (Proximal gradient algorithm). Let θ0 ∈ Dom(g) denote the starting estimate, and {γn , n ∈ N} be a sequence of positive step sizes. Given θn , compute θn+1 = Proxγn+1 (θn − γn+1 ∇f (θn )) .

(12)

The proximal gradient is a special instance of the Majorization-Minimization (MM) technique (see (Beck and Teboulle, 2009, Section 1.3.)). Consider the surrogate function ϑ 7→ Qγ (ϑ|θ) given by 1 kϑ − θk2 + g(ϑ) 2γ 1 γ = f (θ) + kϑ − (θ − γ∇f (θ))k2 − k∇f (θ)k2 + g(ϑ) , 2γ 2 def

Qγ (ϑ|θ) = f (θ) + h∇f (θ), ϑ − θi +

(13)

for a positive constant γ ∈ (0, 1/L], where L is defined in H 1. By construction, F (θ) = Qγ (θ|θ) for all θ ∈ Θ and F (ϑ) ≤ Qγ (ϑ|θ) for all θ, ϑ ∈ Θ. Hence, ϑ 7→ Qγ (ϑ|θ) is a majorizing function for ϑ 7→ F (ϑ) which suggests to iteratively compute the minimum of the majorizing surrogate, leading to (12). Observe that (12) is of the form θn+1 = Tγn+1 (θn ) with the point-to-point map Tγ defined by def

Tγ (θ) = Proxγ (θ − γ∇f (θ)) = argminϑ∈Dom(g) Qγ (ϑ|θ) .

(14)

Hence, when γn = γ for any n, any convergent sequence {θn , n ∈ N} has a limiting point which is a fixed point of Tγ . Set def

L = {θ ∈ Dom(g) : 0 ∈ f (θ) + ∂g(θ)} ,

(15)

8

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

where ∂g(θ) denotes the subdifferential of the convex function g at θ in Dom(g). Then for any γ > 0, L = {θ : θ = Tγ (θ)} (16) (see Beck and Teboulle (2009); see also Lemma 20 in section 6). If in addition f is convex, then L is the set of the minimizers of F = f + g: L = Argminθ∈Θ F (θ) .

(17)

Convergence results for Algorithm 1 applied with constant stepsize (γn = γ for any n) can be obtained by applying general results for monotonic optimization algorithm adapted to MM algorithms; see (Meyer, 1976, Theorem 3.1.) and (Schifano et al., 2010, Appendix A). The following convergence result is established in Section 6.2. Theorem 4. Assume H 1 and set γ ∈ (0, 1/L]. Let {θn , n ∈ N} be the sequence given by Algorithm 1 applied with γn = γ for any n ≥ 1 and assume that the sequence {θn , n ∈ N} remains in a compact set K. Then, (i) the set L given by (15) is not empty and all accumulation points of {θn , n ∈ N} are in L ∩ K, (ii) there exists θ? ∈ L ∩ K such that limn F (θn ) = F (θ? ), (iii) limn kθn+1 − θn k = 0. The key properties for the proof of this result is that (i) Tγ is monotonic (see Lemma 20 in section 6): for any θ, F (Tγ (θ)) ≤ F (θ), which implies that the sequence {F (θn ), n ∈ N} is non-increasing; and (ii) Tγ : Θ → Θ is Lipschitz (see Lemma 18 in section 6): there exists C such that for any θ, θ0 , kTγ (θ) − Tγ (θ0 )k ≤ Ckθ − θ0 k. Since Tγ is monotonic, F (θn+1 ) ≤ F (θn ) for any n which implies that {θn , n ∈ N} remains in a compact set as soon as the level set {θ : F (θ) ≤ F (θ0 )} is compact. The property (iii) implies that either the sequence {θn , n ∈ N} converges to θ? ∈ L ∩ K, or the set of limit points of this sequence forms a continuum (and the sequence fails to converge). It also implies that if {θn , n ∈ N} has an isolated accumulation point θ? , then limn θn = θ? . By the properties (ii) and (iii), it is easily seen that if the number of points of L having a given value of F is finite, then the sequence {θn , n ∈ N} converges to one of these points. We stress that this convergence result is obtained without assuming that f is convex. If the following assumption holds H 2. The function f is convex and the set argminθ∈Θ F (θ) is not empty then Theorem 4 implies the stronger convergence result (see e.g. (Beck and Teboulle, 2009, Theorem 1.2.) and references therein). Corollary 5. Assume H 1, H 2 and fix γ ∈ (0, 1/L]. Let {θn , n ∈ N} be the sequence given by Algorithm 1 applied with γn = γ for any n ≥ 1. Then there exists θ? ∈ L such that limn θn = θ? . Corollary 5 follows from Theorem 4 upon noting that in the convex case, L is not empty if and only if the sequence {θn , n ∈ N} remains in a compact set (see Section 6.3).

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

9

3. Perturbed proximal gradient algorithms As shown in the above examples, the gradient ∇f is not always tractable so the proximal gradient does not apply. We introduce a perturbed proximal gradient algorithm whereby the gradient at the n-th iteration of the algorithm ∇f (θn ) is approximated by Hn+1 ∈ Θ. This yields the algorithm Algorithm 2 (Perturbed Proximal Gradient algorithm). Let θ0 ∈ Dom(g) be the initial solution and {γn , n ∈ N} be a sequence of positive non-increasing step sizes. For n ≥ 1, given (θ1 , . . . , θn ): (1) Obtain Hn+1 ∈ Θ, an approximation of ∇f (θn ). (2) Compute θn+1 = Proxγn+1 (θn − γn+1 Hn+1 ). The convergence of this algorithm to the solutions of the problem (P) is established when f is convex in (Combettes and Wajs, 2005, Theorem 3.4). We provide in Theorem 6 a convergence result, which holds when f is not convex and under weaker conditions on the approximation error : def

ηn+1 = Hn+1 − ∇f (θn ) .

(18)

We also give sufficient conditions for the convergence to a point θ? in L of this algorithm with general step size, under the stronger assumption that f is convex (see Theorem 7 and Corollary 8)). We then provide in Theorem 10 convergence rates for (possibly weighted) means of the sequence {F (θn ), n ∈ N} in the convex case. Finally, we study in Theorem 13 the rate of convergence of a perturbed version of the Fast Iterative Shrinkage-Thresholding algorithm (FISTA). Some results are stated in the general case of a random approximation Hn+1 , which covers the case of a deterministic approximation. Theorem 6. Assume H 1. Set γ ∈ (0, 1/L] and let {θn , n ∈ N} be given by Algorithm 2 with γn = γ. Assume in addition that the set L given by (15) is not empty, lim supn kθn k is finite and limn ηn = 0 a.s.Then the sequence {F (θn ), n ∈ N} converges a.s. to a connected component of F (L). If in addition, F (L) has an empty interior, there exists θ? ∈ L such that (i) limn F (θn ) = F (θ? ) a.s. (ii) the sequence {θn , n ∈ N} converges to the set L ∩ {θ : F (θ) = F (θ? )} a.s. Proof. See Section 6.4.



By using the Lipschitz properties of Prox and Tγ (see Lemma 16 and Lemma 18 in section 6), it is easy to prove that for any θ? ∈ L, kθn+1 −θ? k ≤ γn+1 kηn+1 k+kθn −θ? k. Hence, by a trivial induction we have for any n ≥ m ≥ 0, X kθn+1 − θ? k ≤ γk+1 kηk+1 k + kθm − θ? k . (19) k≥m

P The above inequality shows that when n γn+1 kηn+1 k < ∞ a.s. , then lim supn kθn k < ∞ a.s. In addition, if the sequence {θn , n ∈ N} possesses a limit point θ? in L, then the sequence converges to θ? . Nevertheless this condition on the convergence of the

10

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

sum reveals to be strong in many situations as shown in Remark 3 below (see also the case of Monte Carlo Proximal Gradient algorithms in section 4). When lim supn kθn k is finite almost-surely, the almost-sure convergence to zero of the sequence {ηn , n ≥ 1} is equivalent to the almost-sure convergence to zero of the sequence {ηn 1kθn−1 k≤B ), n ≥ 1}, for any B > 0. Therefore, it is easily seen by application of the conditional Borel-Cantelli lemma that a sufficient condition for the almost-sure convergence to zero of {ηn+1 1kθn k≤B , n ≥ 1} is: for any ε > 0, X P (kηn+1 k ≥ ε|Fn ) 1kθn k≤B < ∞ , (20) n≥1

where {Fn , n ∈ N} is the filtration defined by def

Fn = σ(θ0 , H1 , · · · , Hn ) .

(21)

In section 4, the case where ∇f is an expectation and Hn is a Monte Carlo approximation is discussed and sufficient conditions for (20) to hold are given. When f is convex, we know from (17) that L is the set of the minimizers of F ; hence, F (L) has an empty interior since it is a singleton of R. In this case, Theorem 6 establishes the convergence of the perturbed proximal gradient sequence {θn , n ∈ N} to a subset of the solutions of (P). When this set is a singleton {θ? } - which occurs for example when F is strictly convex -, the sequence {θn , n ∈ N} converges almostsurely to θ? . This convergence result can be compared to the convergence result given by (Combettes and Wajs, 2005, Theorem 3.4.) in the convex case and when Hn+1 is deterministic: our theorem requires a stronger condition on the stepsize γ but a weaker condition P on the noise ηn (in Combettes and Wajs (2005), it is assumed that γ ∈ (0, 2/L) and n kηn k < ∞). Example (large scale convex optimization (continued)). In this example, ηn+1 =

1

mn+1

mn+1 PN −1

X

{∇fIk (θn ) − ∇f (θn )} .

k=1

Since E [∇fIk (θn )|Fn ] = N i=1 ∇fi (θn ), conditionally to the past Fn the error ηn+1 is unbiased: E [ηn+1 |Fn ] = 0. Let B > 0 be fixed; upon noting that for any i ∈ {1, · · · , N }, θ 7→ ∇fi (θ) − ∇f (θ) is Lipschitz-continuous and supkθk≤B k∇fi (θ)k < ∞, the uniform law of large numbers implies (see e.g. (Jennrich, 1969, Theorem 2)) n 1 X lim sup a.s. {∇fIk (θ) − ∇f (θ)} = 0 n→∞ {θ,kθk≤B} n k=1

Therefore, a sufficient condition for the almost-sure convergence of {ηn , n ∈ N} to zero is limn mn = +∞.  The result below addresses the boundedness and the convergence of the sequence {θn , n ∈ N} to a point θ? ∈ L in the convex case. Theorem 7. Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 2 with stepsizes satisfying γn ∈ (0, 1/L] for any n ≥ 1.

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

11

(i) For any θ? in L (see (17)) and for any n ≥ m ≥ 0, kθn+1 − θ? k2 ≤ kθm − θ? k2 − 2

n X

n X

2 γk+1 Tγk+1 (θk ) − θ? , ηk+1 + 2 γk+1 kηk+1 k2 . (22)

k=m

k=m

(ii) Assume that for any θ? ∈ L, X

γn+1 Tγn+1 (θn ) − θ? , ηn+1 exists,

X

2 γn+1 kηn+1 k2 < ∞ .

(23)

n≥0

n≥0

Then, for any θ? in L, limn kθn − θ? k exists. If in addition there exists θ∞ ∈ L such that limn θn = θ∞ .

P

n γn

Proof. See Section 6.5.

= +∞, then 

When Hn+1 is a random approximation, we provide below sufficient conditions for the conclusions of Theorem 7 to hold a.s. Define i h def 2 (2) def (1) = kE [ η | F ]k ,  = E kη k (24) Fn . n+1 n n+1 n n Corollary 8 (of Theorem 7). Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 2 with stepsizes satisfying γn ∈ (0, 1/L] for any n ≥ 1. Assume also that lim supn kθn k < ∞ a.s. and for any B ≥ 0, n o X (2) γn+1 (1) + γ  1kθn k≤B < ∞ a.s. (25) n+1 n n n≥0

Then for any θ? ∈ L, limn kθn − θ? k exists a.s. If in addition there exists θ? ∈ L such that limn θn = θ? a.s.

P

n γn

Proof. See Section 6.6

= +∞, then 

Remark 9. A closely related result has been established in (Combettes and Pesquet, 2014, Theorem 2.5) in the case γn = γ for any n, but under stronger conditions; in particular, the authors assume that all the limit points of {θn , n ∈ N} are in L which is not an assumption in Corollary 8. Example (large scale convex optimization (continued)). We already proved that (1) n = 0. Since conditionally to Fn , {Ik , k ∈ N} are independent and independent of θn , we have   −1 −1 2 (2) n 1kθn k≤B = mn+1 E k∇fI1 (θn ) − ∇f (θn )k |Fn 1kθn k≤B ≤ M mn+1 forPa positive constant M . This inequality shows that (25) holds a.s. for any B > 0 2 m−1 < ∞. if n γn+1 n+1  Let {an , n ∈ N} be a sequence of non-negative weights, and denote n X def An = ak . k=1

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

12

Theorem 10 provides a control of the weighted mean A−1 n θ? is any minimizer of F .

Pn

k=1 ak F (θk )−F (θ? )

where

Theorem 10. Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 2 with stepsizes satisfying γn ∈ (0, 1/L] for any n ≥ 1. ForPany non-negative sequence n {an , n ∈ N}, any minimizer θ? of F and any n ≥ 1, A−1 n k=1 ak F (θk ) − F (θ? ) ≤ Bn where  n  X ak a1 ak−1 def 1 Bn = kθk−1 − θ? k2 + kθ1 − θ? k2 − 2An γk γk−1 2γ1 An k=2



n 1 X  ak hTγk (θk−1 ) − θ? , ηk i + γk kηk k2 , (26) An k=1

and Tγ and ηn are resp. given by (14) and (18). Proof. See Section 6.7.



We deduce convergence rates in expectation from Theorem 10. Corollary 11 (of Theorem 10). In addition to the assumptions of Theorem 10, assume that {an /γn , n ≥ 1} is non-decreasing and there exists a constant B such that P(supn∈N kθn k ≤ B) = 1. Then, for any minimizer θ? of F , n n h i h io 1 X B 2 an 1 X n (1) (2) , (27) ak E [F (θk )] − F (θ? ) ≤ ? + ak B? E k−1 + γk E k−1 An 2γn An An k=1

k=1

def

where B? = B + kθ? k. Proof. See Section 6.8



Theorem 10 provides a convergence rate in a quite general setting: the stepsizes {γn , n ∈ N} are not necessarily constant and the weights {an , n ∈ N} are nonnegative but otherwise arbitrary. Taking aj = 1 provides a bound for the cumulative regret. When the approximation Hn+1 is deterministic, the weight sequence an = 1 and the stepsize is constant γn = 1/L, (Schmidt et al.,P 2011, Proposition 1) provides a bound of order O(1/n) under thePassumption that n kηn+1 k < ∞. Note that (19) and Lemma 22, the assumption n kηn k < ∞ implies that limn kθn − θ? k exists for any θ? ∈ L. Using the inequality | T1/L (θk ) − θ? , ηk+1 | ≤ L−1 kθk − θ? kkηk+1 k (see (55)), the upper bound Bn in (26) is also O(1/n). Theorem 10 applied with Pn ηn = 0 can be used to derive the rate of convergence of the weighted mean {A−1 n j=1 aj F (θj ), n ≥ 1} to the minimum F (θ? ) for the (exact) proximal gradient sequence {θn , n ∈ N} given by Algorithm 1. This rate is an γn−1 A−1 n when supn kθn − θ? k < ∞ and the sequence {an /γn , n ≥ 1} is non decreasing. For example, if the algorithm is run with γn ∼ Cc n−c for c ∈ [0, 1), the weighted mean converges at the rate nc−1 for any weight sequence an ∼ na with a > −1, showing that the minimal regret is achieved by taking c = 0 i.e. a constant stepsize γn = γ.

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

13

Remark 12. Consider the weighted averaged sequence {θ¯n , n ∈ N} defined by n X def 1 ¯ θn = ak θ k , An

(28)

k=1

for a given sequence {an , n ∈ N} of non-negative weights. It generalizes the PolyakJudistky averaging in Stochastic Approximation and Juditsky (1992)). Under  P(Polyak n −1 ¯ H1 and H2, F is convex so that F θn ≤ An k=1 ak F (θk ). Therefore, Theorem 10 also provides convergence rates for F (θ¯n ) − F (θ? ) and Corollary 11 provides L1 -rates. Nesterov (1983) introduced an acceleration scheme of the deterministic gradient method that is shown to converge at a rate O(1/n2 ). The algorithm was then extended in (Beck and Teboulle, 2009, Section 1.5) to the proximal gradient algorithm, yielding FISTA. We consider a perturbed version of FISTA. Let {tn , n ∈ N} be a sequence of positive numbers and {γn , n ∈ N} be positive stepsizes with the following property t0 = 1 ,

γn+1 tn (tn − 1) ≤ γn t2n−1 ,

n ≥ 1.

(29)

For instance, when {γn , n ∈ N} is non-increasing, (29) holds for the following sequences: (i) all sequences tn ∝ (n + n0 )β , with β ∈ (0, 1), for some n0 > 0; (ii) tn = n/2 + 1 and (iii) the sequence defined recursively by p 1 + 1 + 4t2n tn+1 = . (30) 2 It is easily seen from (30) that tn ∼ n/2 as n goes to infinity. Algorithm 3 (Perturbed FISTA). Let θ0 ∈ Dom(g), {tn , n ∈ N} and {γn , n ∈ N} be positive sequences satisfying (29). Compute θ1 = Proxγ1 (θ0 − γ1 H1 ), where H1 is an approximation of ∇f (θ0 ). For n ≥ 1, given (θ0 , . . . , θn ): (1) Compute ϑn = θn + t−1 (31) n (tn−1 − 1)(θn − θn−1 ) . (2) Obtain Hn+1 an approximation of ∇f (ϑn ), and set θn+1 = Proxγn+1 (ϑn − γn+1 Hn+1 ) .

(32)

For this algorithm, the approximation error is defined by def

ηˇn+1 = Hn+1 − ∇f (ϑn ) .

(33)

Theorem 13. Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 3, with {tn , n ∈ N} and {γn , n ∈ N} be positive sequences satisfying (29) and γn ∈ (0, 1/L] ¯ n def for any n ≥ 1. Set ∆ = (1 − tn )θn + tn Tγ (ϑn ) where Tγ is given by (14). n+1

(i) If limn γn+1 t2n = +∞ and there exists θ? ∈ L such that X X

2 ¯ n − θ? , ηˇn+1 exists, γn+1 t2n kˇ ηn+1 k2 < ∞ , γn+1 tn ∆ n≥0

n≥0

then limn F (θn ) = F (θ? ) and all the limit points of {θn , n ∈ N} are in L.

14

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

−1 (ii) For all n ≥ 1 and any minimizer θ? of F , F (θn+1 ) − F (θ? ) ≤ Bn t−2 n γn+1 where n n X X

1 def 2 ¯ k − θ? , ηˇk+1 . Bn = γ1 (F (θ1 ) − F (θ? ))+ kθ1 −θ? k2 + γk+1 t2k kˇ ηk+1 k2 − γk+1 tk ∆ 2 k=1 k=1 P ¯ (iii) If n γn+1 tn kˇ ηn k < ∞ then supn k∆n k < ∞.

Proof. See Section 6.9.



When Hn+1 is random, we provide below sufficient conditions for the conclusions def of Theorem 13 to hold a.s. Set Fn = σ (θ0 , H1 , · · · , Hn ) and i h def 2 (2) def (34) ˇ(1) = kE [ η ˇ | F ]k ,  ˇ = E kˇ η k Fn . n+1 n n+1 n n Corollary 14 (of Theorem 13). Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 3, with {tn , n ∈ N} and {γn , n ∈ N} be positive sequences satisfying (29) and γn ∈ (0, 1/L] for any n ≥ 1. If lim supn kϑn k < ∞ a.s. , limn γn t2n = +∞ and for any B > 0, there exist constants {Mn , n ∈ N} such that   X (2) ˇ(1) +  ˇ 1Tk≤n {kϑk k≤B} ≤ Mn a.s. γn+1 tn {1 + γn+1 tn }Mn < ∞ , (35) n n n≥0

then (i) Almost-surely, limn F (θn ) = F (θ? ) for θ? ∈ L and the limit points of {θn , n ∈ N} are in L. −1 (ii) For all n ≥ 1 and any minimizer θ? of F , F (θn+1 ) − F (θ? ) ≤ Bn t−2 n γn+1 a.s. where the r.v. Bn is given by Theorem 13 and is finite a.s. Proof. See Section 6.10.



Example (large scale convex optimization (continued)). Following the same lines as (1) above, it is easily established that ˇn = 0 and for any B, there exists a constant M such that for any n, −1 T ˇ(2) ˇ(2) n 1 k≤n {kϑk k≤B} ≤  n 1kϑn k≤B ≤ M mn+1 . P Therefore (35) holds if n γn+1 tn (1 + γn+1 tn )m−1 n+1 < ∞.

 We will show in section 4 (see (39) and (40)) that the condition (35) is satisfied when Hn+1 is a Markov chain Monte Carlo approximation of the intractable gradient ∇f (ϑn ). Theorem 13 provides a convergence rate of F (θn ) to the minimum F (θ? ) for arbitrary sequences {γn , n ∈ N} and {tn , n ∈ N}. It extends the previous work by Schmidt et al. (2011) which considers the case of a deterministic approximation Hn+1 , a fixed step size γn = 1/L and the sequence tn = n/2 + 1: under the assumption that P ¯ n − θ? k < ∞ and (ii) the rate of conηj+1 k < ∞, they prove that (i) supn k∆ j≥1 tj kˇ ¯ n − θ? k < ∞ vergence is n−2 . In this specific setting, Theorem 13 shows that supn k∆ −2 and supn Bn < ∞ so that the rate of convergence is n too.

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

15

Theorem 13 applied with ηˇn = 0 provides the rate of convergence of {F (θn ), n ∈ N} −1 to F (θ? ) for the (non perturbed) FISTA. This rate is t−2 n γn . For example, if the −c algorithm is run with tn ∼ Cn and with γn ∼ Cc n for c ∈ [0, 2), the exact algorithm converges at the rate nc−2 ; this rate is O(n−2 ) when the stepsize is constant γn = γ for some γ ∈ (0, 1/L] (see (Beck and Teboulle, 2009, Theorem 1.4)). We deduce convergence rates in expectation from Theorem 13. Corollary 15 (of Theorem 13). Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 3, with {tn , n ∈ N} and {γn , n ∈ N} be positive sequences satisfying (29) and γn ∈ (0, 1/L] for any n ≥ 1. If P(supn kθn k ≤ B) = 1 and h i X X 2 γn+1 tn kˇ (1) γn+1 t2n E ˇ(2) 0. In addition, we choose an = na ,

mn = bCb nb c ,

γn = Cc n−c ,

where b·c denotes the lower integer part, a > −1, b ≥ 0, c ≥ −a ∨ 0 and Cb , Cc > 0. Let us first discuss the case where the approximation of ∇f (θn ) is unbiased (i.e. C1 = 0 in (41)). For a given c ∈ [0, 1), the maximal rate of the RHS of (27) is n1−c which is obtained by choosing b ≥ 0 such that b + c = 1 − c. This yields to c ∈ [0, 1/2], a > −c and b = 1 − 2c; the rate of convergence is 1/n1−c . It is maximal when c = 0 which requires to increase the batch size linearly mn = bCb nc; the number of iterations required to obtain En ≤ δ increases at a rate δ −1 and the total number of simulations increases as δ −2 . The rate δ −2 can not be improved by other choices of the parameters but can be reached by other strategies: by choosing c ∈ [0, 1/2], a > −c and b = 1 − 2c, the rate is also δ −2 . Let us now discuss the case when the approximation error is biased. For a given c ∈ [0, 1), the maximal rate of the RHS of (27) is n1−c which is obtained by choosing b = 1 − c and a > 0. This rate is maximal when c = 0, b = 1, which yields a convergence rate O(1/n). Here again, for this choice, the total number of Monte Carlo samples required to obtain En ≤ δ increases as δ −2 . As a conclusion of the above discussion, we report in Table 1 the choices of (a, b, c) leading to the optimal rate and the number of Monte Carlo samples in order to reach a precision δ. c a b Rate MC no bias 0 (0, ∞) 1 1/n 1/δ 2 1−c (C1 = 0) [0, 1/2] (−c, +∞) 1 − 2c 1/n 1/δ 2 1−c [0, 1) −c (1 − 2c, ∞) ∩ [0, ∞) 1/n 1/δ (1+b)/(1−c) with bias 0 (0, ∞) 1 1/n 1/δ 2 1−c (C1 > 0) 1−c 1/n 1/δ (2−c)/(1−c) [0, 1) (0, ∞) [0, 1) −c (1 − c, ∞) 1/n1−c 1/δ (1+b)/(1−c) C1 = C2 = 0 [0, 1) (−1, ∞) 1/n1−c Table 1. [Averaged Perturbed Proximal Gradient] Values of (a, b, c) in order to reach the rate of convergence Rate. The column MC reports the number of Monte Carlo samples in this strategy to reach a precision En = O(δ). As a reference, the last row reports the rate when ηn = 0.

We now investigate the rates of convergence of the Perturbed FISTA algorithm with Monte Carlo approximation that can be deduced from Corollary 14 as a function of the design parameters mn and γn . Here again, we assume that there exist C1 , C2 ≥ 0 and B > 0 such that   −1 −1 (1) (2) E[ˇ n ] ≤ C1 mn+1 , E[ˇ n ] ≤ C2 mn+1 , and P sup kθn − θ? k ≤ B = 1 , (42) n∈N

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

18

and we choose m n = Cb n b , γn = Cc n−c , with b, c ≥ 0, and Cb , Cc > 0. Let us discuss the rate of convergence that can be deduced from Corollary 14item ii when tn ∼ Cn as n → ∞, which is the case for example for the sequence given by (30); note that the rate of decay is at most nc /t2n ∼ 1/n2−c which implies that hereafter, we choose c ∈ [0, 2). Consider first the unbiased case (C1 = 0 in (42)); then the maximal rate of convergence nc−2 can be reached by choosing b > 3 − 2c. It is therefore possible to reach the maximal rate n−2 (which is the rate of the FISTA algorithm, see (Beck and Teboulle, 2009, Theorem 1.4)) by choosing c = 0 (i.e. a constant step-size) and b > 3 (i.e. a number of Monte Carlo samples which increases as n3 ). The number of iterations to reach E[F (θn )] − F (θ? ) ≤ δ is O(n1/(2−c) ); since the number of Monte Carlo after n iterations is O(n1+b ) then, O(δ −(1+b)/(2−c) ) samples are required to reach a precision δ. Note that since b > 3 − 2c, we have (1 + b)/(2 − c) > 2 that is this computational cost is always larger than the computational cost of the plain stochastic proximal gradient algorithm (see Table 1). In the biased case (i.e. C1 > 0 in (42)), the optimal rate nc−2 can be reached by choosing b > 3 − c. Here again, it is possible to reach the rate n−2 by choosing c = 0 and b > 3. We report in Table 2 the conclusions of this discussion. c b Rate MC 0 (3, ∞) 1/n2 1/δ (b+1)/2 [0, 2) (3 − 2c, +∞) ∩ [0, ∞) 1/n2−c 1/δ (b+1)/(2−c) 0 (3, ∞) 1/n2 1/δ (b+1)/2 [0, 1) (3 − 2c, ∞) 1/n2−c 1/δ (b+1)/(2−c) [1, 2) (2 − c, ∞) 1/n2−c 1/δ (b+1)/(2−c) C1 = C2 = 0 [0, 2) 1/n2−c Table 2. [Perturbed FISTA] Values of (b, c) in order to reach the rate of convergence Rate, when tn = O(n2 ). The column MC reports the number of Monte Carlo samples in this strategy to reach a precision E[F (θn )] − F (θ? ) = O(δ). As a reference, the last row reports the rate when ηˇn = 0. no bias (C1 = 0) with bias (C1 > 0)

We conclude this section by a comparison of the approximation of F (θ? ) provided by the weighted approach and by the perturbed FISTA. As shown in Table 1 and Table 2, it is possible to choose a step-size sequence and a number of Monte Carlo samples mn for these stochastic algorithms to reach the same rate of convergence as the deterministic ones, that is n−1 for the first algorithm and n−2 for the second one. But, taking into account the number of simulations requires to obtain a given precision δ, the perturbed averaged Proximal Gradient has a faster rate of convergence than the perturbed FISTA.

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

19

5. Example: High-dimensional logistic regression with random effects We illustrate the results using Example 1. Note that the assumptions H 1 and H 3 hold but H2 is not in general satisfied. Nevertheless, the numerical study below shows that the conclusions reached in section 3 and section 4 provide useful information to tune the design parameters of the algorithms. The gradient of the negative log-likelihood for the logistic regression model with R random effect is −∇`(θ) = Hθ (u)πθ (u)du with θ = (β, σ), πθ is given by (5) and   N X x 0 0 Hθ (u) = − (Yi − F (xi β + σzi u)) 0 i . (43) zi u i=1

The distribution πθ is sampled using the MCMC sampler proposed in Polson et al. (2013) based R on data-augmentation - see also Choi and Hobert (2013). We qwrite −∇`(θ) = Rq ×RN Hθ (u)˜ πθ (u, w) dudw where π ˜θ (u, w) is defined for u ∈ R and N w = (w1 , · · · , wN ) ∈ R by ! N Y  0 0 π ˜θ (u, w) = π ¯PG wi ; xi β + σzi u πθ (u) ; i=1

in this expression, π ¯PG (·; c) is the density of the Polya-Gamma distribution on the positive real line with parameter c given by  π ¯PG (w; c) = cosh(c/2) exp −wc2 /2 ρ(w)1R+ (w) , P where ρ(w) ∝ k≥0 (−1)k (2k +1) exp(−(2k +1)2 /(8w))w−3/2 (see (Biane et al., 2001, Section 3.1)). Thus, we have π ˜θ (u, w) = Cθ φ(u)

N Y

 exp σ(Yi − 1/2)zi0 u − wi (x0i β + σzi0 u)2 /2 ρ(wi )1R+ (wi ) ,

i=1

P 0 where ln Cθ = −N ln 2 − `(θ) + N i=1 (Yi − 1/2)xi β. This target distribution can be sampled using a Gibbs algorithm: given the current value (ut , wt ) of the chain, the next point is obtained by sampling ut+1 under the conditional distribution of u given wt , and wt+1 under the conditional distribution of w given ut+1 . In the present case, these conditional distributions are given respectively by π ˜θ (u|w) ≡ Nq (µθ (w); Γθ (w))

π ˜θ (w|u) =

N Y

π ¯PG (wi ; |x0i β + σzi0 u|)

i=1

with Γθ (w) =

I +σ

2

N X i=1

!−1 wi zi zi0

,

µθ (w) = σΓθ (w)

N X

 (Yi − 1/2) − wi x0i β zi .

i=1

(44) Exact samples of these conditional distributions can be obtained (see (Polson et al., 2013, Algorithm 1) for sampling under a Polya-Gamma distribution).

20

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

We test the algorithms with N = 500, p = 1, 000 and q = 5. We generate the N × p covariates matrix X columnwise, by sampling a stationary RN -valued autoregressive p model with parameter ρ = 0.8 and Gaussian noise 1 − ρ2 NN (0, I). We generate the vector of regressors βtrue from the uniform distribution on [1, 5] and randomly set 98% of the coefficients to zero. The variance of the random effect is set to σ 2 = 0.1. We consider a repeated measurement setting so that zi = ediq/N e where {ej , j ≤ q} is the canonical basis of Rq and d·e denotes the upper integer part. With such a simple expression for the random effect, we will be able to approximate the value F (θ) in order to illustrate the theoretical results obtained in this paper. We use the Lasso penalty (α = 1 in (3)) with λ = 30. We first illustrate the ability of Monte Carlo Proximal Gradient algorithms to find a minimizer of F . We compare the following algorithms (i) the Monte Carlo proximal gradient algorithm with γn = γ = 0.005, mn =√200+n (Algo 1), γn √ = γ = 0.001, mn = 200 + n (Algo 2) and γn = 0.05/ n and mn = 270 + d ne (Algo 3). (ii) the Monte Carlo FISTA with γn = γ = 0.001, mn = 45 + dn3.1 /6000e (Algo F1); γn = 0.005 ∧ (0.1/n), mn = 155 + dn2.1 /100e (Algo F2). In both cases, tn is given by (30). Each algorithm is run for 150 iterations. The batch sizes {mn , n ≥ 0} are chosen so that after 150 iterations, each algorithm used approximately the same number of Monte Carlo samples. We denote by β∞ the value obtained at iteration 150. A path of the relative error kβn − β∞ k/kβ∞ k is displayed on Figure 1[right] for each algorithm; a path of the sensitivity SENn and of the precision PREn defined by P P i 1{|βn,i |>0} 1{|β∞,i |>0} i 1{|βn,i |>0} 1{|β∞,i |>0} P P , PREn = , SENn = i

1{|β∞,i |>0}

i

1{|βn,i |>0}

are displayed on Figure 2. All these sequences are plotted versus the total number of Monte Carlo samples up to iteration n. These plots show that the rate of convergence depends on the step-size sequence {γn , n ∈ N} and the batch-size sequence {mn , n ∈ N}. Fixed stepsize strategies are preferable to decreasing step-size; these findings are consistent with Table 1-biased case. On Figure 1[left], we report on the bottom row the indices j such that βtrue,j is non null and on the top row, the indices j such that β∞,j given by Algo 1 is non null. We now study the convergence of {F (θn ), n ∈ N} where θn is obtained by one of the algorithms described above. We repeat 50 independent runs for each algorithm and estimate E [F (θn )] by the empirical mean over these runs. On Figure 3[left], n 7→ F (θn ) is displayed for several runs of Algo 1 and Algo 3. The figure shows that all the paths have the same limiting value, which is approximately F? = 87; we observed the same behavior on the 50 runs of each algorithm. On Figure 3[right], we report the Monte Carlo estimation of E[F (θn )] versus the total number of Monte Carlo samples used up to iteration n. The best strategies are Algo 1 and Algo F1, with an advantage for the Monte Carlo FISTA. Finally, we observe the distribution of {F (θn ), n ∈ N} through the boxplot of the 50 independent runs. We also consider in this study averaging techniques:

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

21

2

10

Algo 1 Algo 2 Algo 3 Algo F1 Algo F2

1

10

beta estime

0

10

−1

10

−2

10

beta_true

−3

10

−4

10 0

100

200

300

400

500

600

700

800

900

0.5

1000

1

1.5

2

2.5

3

3.5

4 4

x 10

Figure 1. [left] The support of the sparse vector β∞ obtained by Algo 1 (resp. βtrue ) on the top (resp. on the bottom). [right] Relative error along one path of each algorithm as a function of the total number of Monte Carlo samples drawn from the initialization of the algorithm.

1 1 0.9 0.8

0.9

0.7 0.8

0.6 0.5

0.7 0.4 0.3

0.6 Alg 1 Alg 2 Alg 3 Alg 4 Alg 5

0.5 0.5

1

1.5

2

2.5

3

3.5

Alg 1 Alg 2 Alg 3 Alg F1 Alg F2

0.2 0.1

4

0.5 4

x 10

1

1.5

2

2.5

3

3.5

4 4

x 10

Figure 2. The sensitivity SENn [left] and the precision PREn [right] along a path, versus the total number of Monte Carlo samples up to time n

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

22

4

4

10

Algo 1 Algo 3

3

10

10

3

10

Algo 1 Algo 2 Algo 3 Algo F1 Algo F2

2

10

2

20

40

60

80

100

120

140

10

2

10

3

10

4

10

Figure 3. [left] n 7→ F (θn ) for several independent runs. [right] E [F (θn )] versus the total number of Monte Carlo samples up to iteration n (iii) (Algo W) θ¯n is the weighted average of the output of Algo 1 (see (28)) with three different strategies for the averaging sequence: decreasing weight an = 1/n0.1 , √ uniform weights an = 1 and increasing weights an = n. The averaging procedure can be implemented in parallel of Algo 1, and of course is of interest when computed from the outputs in the convergence phase of Algo 1. The averaging process is started at iteration n = 35, discarding the previous estimates. On Figure 4[left], we compare the three strategies for Algo W. For n = 50, 100, 150, the boxplots of F (θ¯n ) are displayed with - from left to right - the decreasing, the uniform and the increasing weight sequences {an , n ∈ N}. It appears that an increasing weight sequence is more efficient to reduce the fluctuations of F (θ¯n ) around its limiting value. On Figure 4[right], we compare Algo 1, Algo F1 and Algo W with increasing weight sequence after the same number of Monte Carlo samples namely 10500, 24200, 41300; this corresponds to iteration n = 50, 100, 150 for Algo 1 and Algo W and n = 98, 128, 150 for Algo F1. Algo F1 is the best one in terms of rate of convergence but seems to have a higher variability. Not surprisingly, there is a gain in averaging the output of Algo 1 in order to reduce the variability. 6. Proofs 6.1. Preliminary lemmas. Lemma 16. Assume that g is lower semi-continuous and convex. For θ, ϑ ∈ Θ and γ>0 γ {g (Proxγ (θ)) − g(ϑ)} ≤ − hProxγ (θ) − ϑ, Proxγ (θ) − θi . (45)

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

23

Figure 4. [left] Algo W: boxplot of F (θ¯n ) for n = 50, 100, 150 with - from left to right - the decreasing, the uniform and the increasing weight sequence. [right] Boxplot of F (θn ) and F (θ¯n ) with n chosen such that the total number of Monte Carlo samples up to time n is about 10 500, 24 200, 41 300. For any γ > 0, the operator θ 7→ Proxγ (θ) is firmly non-expansive, i.e. for any θ, ϑ ∈ Θ, hProxγ (θ) − Proxγ (ϑ), θ − ϑi ≥ k Proxγ (θ) − Proxγ (ϑ)k2 . (46) In particular, the maps θ 7→ Proxγ (θ) and θ 7→ θ − Proxγ (θ) are Lipschitz with Lipschitz constants that can be taken equal to 1. Proof. See (Bauschke and Combettes, 2011, Propositions 4.2., 12.26 and 12.27).



Lemma 16 has the following useful consequence. Lemma 17. Assume H 1 and let γ ∈ (0, 1/L]. Then for all θ, ϑ ∈ Θ, − 2γ (F (Proxγ (θ)) − F (ϑ)) ≥ k Proxγ (θ) − ϑk2 + 2 hProxγ (θ) − ϑ, ϑ − γ∇f (ϑ) − θi . (47) If in addition f is convex, then for all θ, ϑ, ξ ∈ Θ, − 2γ (F (Proxγ (θ)) − F (ϑ)) ≥ k Proxγ (θ) − ϑk2 + 2 hProxγ (θ) − ϑ, ξ − γ∇f (ξ) − θi − kϑ − ξk2 . (48) Proof. Since ∇f is Lipschitz, the descent lemma implies that for any γ −1 ≥ L f (p) − f (ϑ) ≤ h∇f (ϑ), p − ϑi +

1 kp − ϑk2 . 2γ

(49)

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

24

This inequality applied with p = Proxγ (θ) combined with (45) yields (47). When f is convex, f (ξ) + h∇f (ξ), ϑ − ξi − f (ϑ) ≤ 0 which, combined again with (45) and (49) applied with (p, ϑ) ← (Proxγ (θ), ξ) yields the result.  Lemma 18. Assume H 1. Then for any γ > 0, kθ − γ∇f (θ) − ϑ + γ∇f (ϑ)k ≤ (1 + γL)kθ − ϑk ,

(50)

kTγ (θ) − Tγ (ϑ)k ≤ (1 + γL)kθ − ϑk .

(51)

If in addition f is convex then for any γ ∈ (0, 2/L], kθ − γ∇f (θ) − ϑ + γ∇f (ϑ)k ≤ kθ − ϑk ,

(52)

kTγ (θ) − Tγ (ϑ)k ≤ kθ − ϑk .

(53)

Proof. (51) and (53) follows from (50) and (52) respectively by the Lipschitz property of the proximal map Proxγ (see Lemma 16). (50) follows directly from the Lipschitz property of f assumed in H1. It remains to prove (52). Since f is a convex function with Lipschitz-continuous gradients, (Nesterov, 2004, Theorem 2.1.5) shows that, for all θ, ϑ ∈ Θ, h∇f (θ) − ∇f (ϑ), θ − ϑi ≥ L1 k∇f (θ) − ∇f (ϑ)k2 . The result follows.  def

def

Lemma 19. Assume H 1. Set Sγ (θ) = Proxγ (θ − γH) and η = H − ∇f (θ). For any θ ∈ Θ and γ > 0, kTγ (θ) − Sγ (θ)k ≤ γkηk . (54) For any θ, H ∈ Θ and γ ∈ (0, 1/L], |F [Sγ (θ)] − F [Tγ (θ)]| ≤ (1 + c γL)kηk (γkηk + (1 + c γL)kθ − Tγ (θ)k) ,

(55)

where c = 0 if f is convex or c = 1 otherwise. Proof. We have |Tγ (θ) − Sγ (θ)k = k Proxγ (θ − γ∇f (θ)) − Proxγ (θ − γH)k and (54) follows since θ 7→ Proxγ (θ) is a Lipschitz contraction (see Lemma 16). Apply (47) with ϑ ← Tγ (θ) and θ ← θ − γH: F (Sγ (θ)) − F (Tγ (θ)) ≤ ≤

1 hTγ (θ) − Sγ (θ), Tγ (θ) − γ∇f (Tγ (θ)) − θ + γHi γ

1 kTγ (θ) − Sγ (θ)k (kTγ (θ) − γ∇f (Tγ (θ)) − θ + γ∇f (θ)k + γkηk) . γ

(56)

By plugging (54) and (50,52) in (56), we get F (Sγ (θ)) − F (Tγ (θ)) ≤ kηk (γkηk + (1 + c γL)kθ − Tγ (θ)k) . Similarly, we apply the same inequality with ϑ ← Sγ (θ), and θ ← θ − γ∇f (θ) to get 1 hTγ (θ) − Sγ (θ), Sγ (θ) − γ∇f (Sγ (θ)) − θ + γ∇f (θ)i γ ≥ −(1 + c γL)kηkkθ − Sγ (θ)k

F (Sγ (θ)) − F (Tγ (θ)) ≥

≥ −(1 + c γL)kηk (kθ − Tγ (θ)k + γkηk) , where we used again (54) and Lemma 18. The results follows.



ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

25

Lemma 20. Assume H 1. For any γ > 0 {θ : θ = Tγ (θ)} = {θ ∈ Dom(g) : 0 ∈ ∇f (θ) + ∂g(θ)} .

(57)

For any γ ∈ (0, 1/L] and θ ∈ Θ, F (Tγ (θ)) − F (θ) ≤ 0 and F ◦ Tγ (θ) = F (θ) iff θ = Tγ (θ). If in addition f is convex, then F (Tγ (θ)) − F (θ) ≤ −

1 kTγ (θ) − θk2 . 2γ

(58)

Proof. Let γ > 0. By (14), we have θ = Tγ (θ) ⇐⇒ θ = argminϑ Qγ (ϑ|θ) ⇐⇒ 0 ∈ ∂u Qγ (u|θ)|u=θ and |g(θ)| < ∞ , where ∂u denotes the sub-differential of the convex function u 7→ Qγ (u|θ). The proof of (57) is concluded upon noting that ∂u Qγ (u|θ) = ∇f (θ) + ∂g(θ). Let γ ∈ (0, 1/L]. We have from the MM approach (see section 2), for any θ, ϑ and γ ≤ 1/L: F (ϑ) ≤ Qγ (ϑ|θ), F (θ) = Qγ (θ|θ) and by definition of Tγ (θ), Qγ (Tγ (θ)|θ) ≤ Qγ (ϑ|θ). Therefore F (Tγ (θ)) ≤ Qγ (Tγ (θ)|θ) ≤ Qγ (θ|θ) = F (θ) . The above inequality also implies that if F ◦Tγ (θ) = F (θ) then Qγ (Tγ (θ)|θ) = Qγ (θ|θ) which implies that Tγ (θ) = θ. When f is convex, the inequality (58) is a consequence of (48) applied with ϑ ← θ, and θ ← θ − γ∇f (θ).  Lemma 21. n , n ∈ N} and {en , n ∈ N} be sequences satisfying P Let {un , n ∈ N}, {vP u2n ≤ vn + nk=0 uk ek and 2vn + nk=0 e2k ≥ 0. Then for any n ≥ 0, ! n n−1 1X 2 1X ek ek , |ek | sup uk − ≤ U vn + 2 2 2 0≤k≤n k=0

with the convention that

P−1

k=0

k=0

√ def = 0 and U(a, b) = b + a + b2 .

Proof. The proof is adapted from (Schmidt et al., 2011, Lemma 1). For any n ≥ 1, 

un −

n−1

n

n−1

k=0

k=0

k=0

X e n 2 1 1 X 2 X ek  ≤ vn + e2n + uk ek ≤ vn + ek + uk − ek . 2 4 2 2

Set n

n

ek def sn = sup uk − . 2 0≤k≤n k=0 k=0 √ Then s2n ≤ s2n−1 ∨ {An + sn−1 2Bn−1 }. By induction (note that s0 ≤ A0 and B−1 = 0), this yields for any n ≥ 0, 1/2 2 0 ≤ sn ≤ Bn−1 + Bn−1 + An . def

An = vn +

1X 2 ek 2

def

Bn =

1X |ek | 2



´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

26

6.2. Proof of Theorem 4. Since the sequence {θn , n ∈ N} is in the compact set K, it possesses at least one accumulation point in K. For the proof of the other statements, we apply (Meyer, 1976, Theorem 3.1): let us check its assumptions and show that Tγ is a point-to-set mapping, uniformly compact, strictly monotonic and upper semi-continuous (see the definitions (Meyer, 1976, p.109). Under H 1, Tγ is a point-to-point mapping. Since K is compact, there exist θ0 and r > 0 such that for any θ ∈ K, kθ − θ0 k ≤ r. By Lemma 18, kTγ (θ) − Tγ (θ0 )k ≤ r thus showing that Tγ (K) is included in a compact set. Hence, we proved that Tγ is uniformly compact. Let θ ∈ K and {θn , n ∈ N} be a K-valued sequence such that limn θn = θ. Assume that the sequence {Tγ (θn ), n ∈ N} converges to θ0 . Let us prove that θ0 = Tγ (θ). By Lemma 18, kTγ (θ) − θ0 k ≤ kTγ (θ) − Tγ (θn )k + kTγ (θn ) − θ0 k ≤ (1 + γL) kθn − θk + kTγ (θn ) − θ0 k . The RHS tends to zero as n → ∞ thus proving that θ0 = Tγ (θ). Hence the mapping Tγ is upper-semicontinuous. Finally, Lemma 20 shows that Tγ is strictly monotonic w.r.t. the function F . 6.3. Proof of Corollary 5. Let us prove that L is not empty iff the sequence {θn , n ∈ N} is compact. If the sequence is compact, it possesses an accumulation point which is in the set L by Theorem 4; hence, L is not empty. Conversely, if L is not empty, then there exists θ? such that θ? = Tγ (θ? ) which implies by Lemma 18 and a trivial induction kθn+1 − θ? k = kTγ (θn ) − Tγ (θ? )k ≤ kθn − θ? k ≤ · · · ≤ kθ0 − θ? k . Hence θn+1 is in the closed ball centered at θ? with radius kθ0 − θ? k. 6.4. Proof of Theorem 6. The claims are a consequence of the deterministic result for random iterative maps developed in (Fort and Moulines, 2003, Proposition 9): let us check the assumptions of this Proposition. Since lim supn kθn k < ∞, there exists a compact set K such that θn ∈ K for any n. By Lemma 18, θ 7→ Tγ (θ) is continuous which implies that L is closed; hence, L ∩ K is compact. By Lemma 20, for any θ ∈ Θ, F (Tγ (θ)) − F (θ) ≤ 0 and for any compact K ⊆ Θ \ L, inf K F ◦ Tγ − F < 0 (note that since F ◦ Tγ − F is lower semi-continuous, it achieves its infimum on any compact set (see e.g. (Bauschke and Combettes, 2011, Theorem 1.28))). Note also that we can assume that F is positive by replacing F with exp(F ). Hence, F is a Lyapunov function relative to (Tγ , L) (see (Fort and Moulines, 2003, p. 1243) for the definitions). We now prove that for θ? ∈ L (which exists since L is not empty) and for any B > 0, lim |F (θn+1 ) − F (Tγ (θn ))| 1kθn −θ? k≤B = 0 . (59) n→∞

By Lemma 19, |F (θn+1 ) − F (Tγ (θn ))| ≤ (1 + cγL)kηn+1 k (γkηn+1 k + (1 + cγL)kθn − Tγ (θn )k) .

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

27

Since θ? = Tγ (θ? ) and since Tγ is a Lipschitz function (see Lemma 18), we conclude that there exists C > 0 such that for any n ≥ 1, |F (θn+1 ) − F (Tγ (θn ))| 1kθn −θ? k≤M ≤ C kηn+1 k . Since limn ηn = 0, the proof of (59) is concluded. Finally, (Fort and Moulines, 2003, Proposition 9) requires F to be continuous; a careful reading of their proof shows that this property is used to prove that (i) F (L ∩ K) is compact and (ii) for any α small enough, F −1 (Oα ) is an open set of Θ where Oα is the α-neighborhood of F (L ∩ K). Under our assumptions, F is lower semi-continuous and is continuous on the domain of g. By (57), L is in the domain of g, so F (L ∩ K) is compact. In addition, F (L ∩ K) ⊂ R, so Oα ⊂ R and this implies that F −1 (Oα ) is in the domain of g. 6.5. Proof of Theorem 7. We preface the proof with a preliminary lemma, which might be seen as a deterministic version of the Robbins-Siegmund Lemma Lemma 22. Let {vn , n P ∈ N} and {χn , n ∈ N} be non-negative sequences and {η , n ∈ N} be such that n ηn exists. If for any n ≥ 0, vn+1 ≤ vn − χn + ηn then Pn n χn < ∞ and limn vn exists. P P def Proof. Set wn = vn + k≥n ηk + M with M = − inf n k≥n ηk so that inf n wn ≥ 0. Then X 0 ≤ wn+1 ≤ vn − χn + ηn + η k + M ≤ wn − χn . k≥n+1

{wn ,Pn ∈ N} is non-negative Pand non increasing; therefore it converges. Furthermore, n 0 ≤ k=0 χk ≤ w0 so that n χn < ∞. The convergence of {wn , n ∈ N} also implies the convergence of {vn , n ∈ N}. This concludes the proof.  def

Let θ? be a minimizer of F and set F? = F (θ? ). Since F is convex, we have by (48) applied with θ ← θn − γn+1 Hn+1 , ξ ← θn , ϑ ← θ? , γ ← γn+1 kθn+1 − θ? k2 ≤ kθn − θ? k2 − 2γn+1 (F (θn ) − F? ) − 2γn+1 hθn+1 − θ? , ηn+1 i . We write θn+1 − θ? = θn+1 − Tγn+1 (θn ) + Tγn+1 (θn ) − θ? . By Lemma 19, kθn+1 − Tγn+1 (θn )k ≤ γn+1 kηn+1 k so that, kθn+1 − θ? k2 ≤ kθn − θ? k2 − 2γn+1 (F (θn ) − F? )

2 + 2γn+1 kηn+1 k2 − 2γn+1 Tγn+1 (θn ) − θ? , ηn+1 . (60) Proof of (i) Since F (θ) − F? ≥ 0, we have by iterating (60) kθn+1 − θ? k2 ≤ kθm − θ? k2 + 2

n X k=m

2 γk+1 kηk+1 k2 − 2

n X

γk+1 Tγk+1 (θk ) − θ? , ηk+1 .

k=m

Proof of (ii) Under (23), (60) and Lemma 22 show that: limn kθn − θ? k exists and P γ {F (θn ) − F? } < ∞. n n+1P Since n γn = +∞, there exists a subsequence {θφn , n ∈ N} such that limn F (θφn ) = F? . Since {θn , n ∈ N} is bounded, we can assume without loss of generality that

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

28

{θφn , n ∈ N} converges; let θ∞ be the limiting value. Since F is convex, it is continuous on its domain and we have F (θ∞ ) = F? . Hence, θ∞ ∈ L. By (22), for any m and n ≥ φm kθn+1 − θ∞ k2 ≤ kθφm − θ∞ k2 − 2

n X

γk+1 { Tγk+1 (θk ) − θ∞ , ηk+1 + γk+1 kηk+1 k2 } .

k=φm

For any  > 0, there exists m such that the RHS is upper bounded by . Hence, for any n ≥ φm , kθn+1 − θ∞ k2 ≤ , which proves the convergence of {θn , n ∈ N} to θ∞ . 6.6. Proof of Corollary 8. Define ( )   [ X \ def (1) (2) Ω? = γn+1 {n + γn+1 n }1kθn k≤B < ∞ lim sup kθn k < ∞ . B∈N

n

n

Note that P(Ω? ) = 1. Using the conditional Borel-Cantelli lemma (see (Chen, P 2 2 kη 1978, Theorem 1)), for any B > 0, γ n n+1 n+1 k 1kθn k≤B < ∞ on Ω? since P 2 (2) n γn+1 n 1kθn k≤B < ∞ on Ω? . def

Consider now the first termPin (23). Set ξk = γk hTγk (θk−1 ) − θ? , ηk i. Here again, we prove that for any B > 0, k ξk 1kθk−1 k≤B converges a.s. Fix B > 0: X

|E [ξk+1 |Fk ]| 1kθk k≤B ≤ (B + kθ? k)

k≥0

X

(1)

γk+1 k

1kθk k≤B ,

k≥0

where we used that θ? = Tγ (θ? ) by definition of L, and kTγ (θ)P − Tγ (θ? )k ≤ kθ − θ? k ≤ kθk+kθ? k (see Lemma 18). Since the RHS is finite on Ω? , then k |E [ξk+1 |Fk ] |1kθk k≤B < def Pn ∞ on Ω? . Define Mn = k=1 (ξk − E [ξk |Fk−1 ]) 1kθk−1 k≤B ; it is a martingale w.r.t. the filtration {Fn , n ∈ N}. We have X  X    E |ξk − E [ξk |Fk−1 ] |2 |Fk−1 1kθk−1 k≤B ≤ E |ξk |2 |Fk−1 1kθk−1 k≤B k≥1

k≥0

≤ (B + kθ? k)2

X

(2)

γk2 k

1kθk−1 k≤B

k≥0

and the RHS is finite on Ω? under (25). By (Hall and Heyde, 1980, Theorem 2.17), lim P n Mn exists on Ω? . This concludes the proof of the almost-sure convergence of k ξk 1kθk−1 k≤B . 6.7. Proof of Theorem 10. We first apply (48) with θ ← θj − γj+1 Hj+1 , ξ ← θj , ϑ ← θ? , γ ← γj+1 : V (θj+1 ) ≤

 1 kθj − θ? k2 − kθj+1 − θ? k2 − hθj+1 − θ? , ηj+1 i , 2γj+1

ON STOCHASTIC PROXIMAL GRADIENT ALGORITHMS

29

since V (θ? ) = 0. Multiplying both side by aj+1 gives:   aj 1 aj+1 aj aj+1 V (θj+1 ) ≤ − kθj − θ? k2 + kθj − θ? k2 2 γj+1 γj 2γj aj+1 − kθj+1 − θ? k2 − aj+1 hθj+1 − θ? , ηj+1 i . 2γj+1 Summing from j = 0 to n − 1 gives n

n

X an 1X kθn − θ? k2 + aj V (θj ) ≤ 2γn 2 j=1

j=1



aj−1 aj − γj γj−1



n X



kθj−1 − θ? k2

aj hθj − θ? , ηj i +

j=1

a0 kθ0 − θ? k2 . (61) 2γ0

We decompose hθj − θ? , ηj i as follows:



hθj − θ? , ηj i = θj − Tγj (θj−1 ), ηj + Tγj (θj−1 ) − θ? , ηj .

By Lemma 19, we get θj − Tγj (θj−1 ), ηj ≤ γj kηj k2 which concludes the proof. 6.8. Proof of Corollary 11. By conditioning on Fj−1 and using that Tγj (θ? ) = θ? and Tγj is a 1-Lipschitz operator (see Lemma 18), we have 

 E Tγ (θj−1 ) − θ? , ηj Fj−1 ≤ kθj−1 − θ? k kE [ ηj | Fj−1 ] k = kθj−1 − θ? k (1) . j j−1 The corollary now follows from Theorem 10. 6.9. Proof of Theorem 13. Set for any j ≥ 0, def

∆j+1 = tj θj+1 − (tj − 1)θj = tj (θj+1 − θj ) + θj . ¯ j = tj (θj+1 − Tγ (ϑj )) and by Lemma 19, Note that ∆j+1 − ∆ j+1 ¯ j − θ? k ≤ γj+1 tj kˇ k∆j+1 − θ? k − k∆ ηj+1 k .

(62)

(63)

Lemma 23. Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 3, with {tn , n ∈ N} and {γn , n ∈ N} be positive sequences such that γn ∈ (0, 1/L] for any n ≥ 1. For any minimizer θ? ∈ L, any j ≥ 1,  1 γj t2j−1 − γj+1 tj (tj − 1) V (θj ) + γj+1 t2j V (θj+1 ) + k∆j+1 − θ? k2 2 1 2 2 ≤ γj tj−1 V (θj ) + k∆j − θ? k − γj+1 tj h∆j+1 − θ? , ηˇj+1 i (64) 2

1 2 ¯ j − θ? , ηˇj+1 ≤ γj t2j−1 V (θj ) + k∆j − θ? k2 + γj+1 t2j kˇ ηj+1 k2 − γj+1 tj ∆ (65) 2 def

where V (θ) = F (θ) − F (θ? ).

´ GERSENDE FORT, AND ERIC MOULINES YVES F. ATCHADE,

30

Proof. Let j ≥ 1. We first apply (48) with ϑ ← θj , ξ ← ϑj , θ ← ϑj − γj+1 Hj+1 and γ ← γj+1 to get 2γj+1 V (θj+1 ) ≤ 2γj+1 V (θj ) + kθj − ϑj k2 − kθj+1 − θj k2 − 2γj+1 hθj+1 − θj , ηˇj+1 i . We apply again (48) with ϑ ← θ? to get 2γj+1 V (θj+1 ) ≤ kθ? − ϑj k2 − kθj+1 − θ? k2 − 2γj+1 hθj+1 − θ? , ηˇj+1 i . We now compute a combination of these two inequalities with coefficients tj (tj − 1) and tj . This yields 2γj+1 t2j V (θj+1 ) + tj (tj − 1)kθj+1 − θj k2 + tj kθj+1 − θ? k2 ≤ 2tj (tj − 1)γj+1 V (θj ) + tj (tj − 1)kθj − ϑj k2 + tj kϑj − θ? k2 − 2γj+1 tj h∆j+1 − θ? , ηˇj+1 i . Then, by using the definition of ϑj , we obtain 2γj+1 t2j V (θj+1 ) + k∆j+1 − θ? k2 ≤ 2γj t2j−1 V (θj ) + k∆j − θ? k2 − 2γj+1 tj h∆j+1 − θ? , ηˇj+1 i  − 2 γj t2j−1 − γj+1 tj (tj − 1) V (θj ) . This concludes the proof.



Proof of Theorem 13. First note that under (29), γj t2j−1 − γj+1 tj (tj − 1) ≥ 0. (i) Lemma 22 and (65) show that supn γn+1 t2n V (θn+1 ) < ∞. Since limn γn+1 t2n = +∞, this implies that limn V (θn ) = 0 from which we deduce that the limit points of {θn , n ∈ N} are in L. (ii) is a trivial consequence of Lemma 23. (iii) By iterating (64), we have for any n ≥ 1 n

X 1 1 k∆n+1 − θ? k2 ≤ γ1 V (θ1 ) + k∆1 − θ? k2 + γk+1 tk k∆k+1 − θ? kkˇ ηk+1 k . 2 2 k=1

This inequality that supn k∆n − θ? k < ∞ (see Lemma 21). By (63) and the P implies 2 t2 kˇ 2 ¯ assumption n γn+1 n ηn+1 k < ∞, this yields supn k∆n k < ∞. 6.10. Proof of Corollary 14 and Corollary 15. Corollary 14 and Corollary 15 are a consequence of Theorem 13 and the following lemma. For B > 0, define the stopping-time def

τB = inf{n ≥ 1, kϑn k > B} . Lemma 24. Assume H 1 and H 2. Let {θn , n ∈ N} be given by Algorithm 3, with {tn , n ∈ N} and {γn , n ∈ N} be positive sequences satisfying (29) and γn ∈ (0, 1/L] for any n ≥ 1. h i P (1) (2) (i) If n γn+1 tn {kˇ n 1n