Stochastic Compositional Gradient Descent - Optimization Online

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Mengdi Wang*, Ethan X. Fang*, Han Liu*

Abstract

Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function of expected values, i.e., problems of the form $\min_x f\big(\mathbb{E}_w[g_w(x)]\big)$. In this paper, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD updates the solution based on random sample gradients/subgradients of $f, g_w$ and uses an auxiliary variable to track the unknown quantity $\mathbb{E}[g_w(x)]$. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations with different time scales. For nonsmooth convex problems, the averaged iterates of SCGD achieve a convergence rate of $O(k^{-1/4})$ in the general case and $O(k^{-2/3})$ in the strongly convex case. For smooth convex problems, SCGD can be accelerated to converge at a rate of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex case. For nonconvex optimization problems, we prove that SCGD converges to a stationary point and provide a convergence rate analysis. Indeed, the stochastic setting in which one wants to optimize a nonlinear function of expected values using noisy samples is very common in practice. The proposed SCGD methods may find wide application in learning, estimation, dynamic programming, etc.

1 Introduction

Stochastic gradient descent (SGD) methods have been prominent in minimizing convex functions using noisy gradients. They find wide applications in simulation, distributed optimization, data-based optimization, statistical estimation, online learning, etc. SGD (also known as stochastic approximation or incremental gradient) methods have been extensively studied and are well recognized as fast first-order methods that can be adapted to deal with problems involving large-scale or streaming data. For certain special cases, it has been shown that SGD exhibits the optimal sample error complexity in the statistical sense. Classical SGD methods update iteratively by using "unbiased" samples of the iterates' gradients. In other words, the objective function is required to be linear in the sampling probabilities. Indeed, this linearity is the key to analyzing SGD and leads to many of its nice properties. However, there has been little study of how to use SGD, and of how it performs, when linearity in the sampling probabilities is absent. In this paper, we aim to explore the regime where the linearity in sampling probabilities is lost. We will develop a class of methods that we refer to as stochastic compositional gradient descent (SCGD) methods, analyze their convergence properties, and demonstrate their potential applications to a broader range of stochastic problems. Consider the optimization problem
\[
\min_{x\in X}\; F(x) = (f\circ g)(x), \qquad (1)
\]
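The compositional structure of problem (1) admits a simple two-timescale scheme: an auxiliary variable tracks the inner expectation while the main iterate descends along sampled compositional gradients. The sketch below illustrates this idea on an assumed toy instance ($g_w(x) = x + w$ with zero-mean noise $w$, $f(y) = \|y\|^2$, unconstrained $X$, stepsize exponents $a = 3/4$, $b = 1/2$); it is an illustration of the mechanism, not the paper's Algorithm 1 verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of min_x f(E_w[g_w(x)]) (assumed for illustration only):
# inner samples g_w(x) = x + w with E[w] = 0, outer f(y) = ||y||^2,
# so that F(x) = ||x||^2 + const and the minimizer is x* = 0.
def g_sample(x):
    """Noisy sample of the inner function g_w(x)."""
    return x + rng.normal(scale=0.1, size=x.shape)

def grad_g_sample(x):
    """Sampled Jacobian of g_w; here g_w(x) = x + w, so it is the identity."""
    return np.eye(x.size)

def grad_f_sample(y):
    """Gradient of the outer function f(y) = ||y||^2."""
    return 2.0 * y

def scgd(x0, iters=2000):
    """Basic two-timescale SCGD sketch with tail averaging (X = R^n)."""
    x, y = x0.astype(float), np.zeros_like(x0, dtype=float)
    tail = []
    for k in range(1, iters + 1):
        alpha, beta = k ** -0.75, k ** -0.5       # a = 3/4, b = 1/2
        y = (1.0 - beta) * y + beta * g_sample(x)  # track E[g_w(x_k)]
        x = x - alpha * grad_g_sample(x) @ grad_f_sample(y)
        if k > iters // 2:                         # keep the trailing iterates
            tail.append(x.copy())
    return np.mean(tail, axis=0)                   # averaged iterate x-hat

x_hat = scgd(np.ones(3))
```

On this toy problem the averaged iterate settles near the minimizer $x^* = 0$; the projection step of the constrained algorithm is omitted here since $X = \Re^n$ in this sketch.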



Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; e-mail: {mengdiw,xingyuan,[email protected]}. Revised in August 2015.


where $f : \Re^m \mapsto \Re$ is a continuous function and $g : \Re^n \mapsto \Re^m$ is given by the expectation $g(x) = \mathbb{E}[g_w(x)]$.

\[
\sum_{k=0}^{\infty} \alpha_k \|\nabla F(x_k)\|^2 \;\ge\; \sum_{k\in K} \alpha_k \|\nabla F(x_k)\|^2 \;>\; \epsilon^2 \sum_{k\in K} \alpha_k.
\]
By using the uniform boundedness of random variables generated from the SO, we have for some $M > 0$ that $\|x_{k+1} - x_k\| \le \alpha_k \|\nabla g_{w_k}(x_k) \nabla f_{v_k}(y_{k+1})\| \le M\alpha_k$ for all $k$. It follows that
\[
\sum_{k\in K} \|x_{k+1} - x_k\| \;\le\; M \sum_{k\in K} \alpha_k \;<\; \infty.
\]
This yields a contradiction with the result $\sum_{k\in K} \|x_{k+1} - x_k\| = \infty$ we just obtained. It follows that there does not exist a limit point $\tilde{x}$ such that $\|\nabla F(\tilde{x})\| > 2\epsilon$. Since $\epsilon$ can be made arbitrarily small, there does not exist any limit point that is nonstationary. Finally, we note that the set of such sample paths (to which the preceding analysis applies) has probability measure 1. In other words, any limit point of $x_k$ is a stationary point of $F(x)$ with probability 1. $\square$

Theorem 1(a) establishes that "for convex optimization, $x_k$ converges almost surely to an optimal solution." We remark that the limiting optimal solution is a random variable that depends on the realization of the specific sample path. For each realization $\{x_k(\omega)\}$ of the sample path, the iterates form a convergent sequence whose limit is an optimal solution. Theorem 1(b) establishes that "for nonconvex optimization, any limit point generated by Algorithm 1 is a stationary point with probability 1." We remark that this does not exclude the possibility that $\{x_k\}$ has no limit point at all. If $F(x)$ is a "bad" function, e.g., $F(x) = 1/|x|$, it is possible for $x_k$ to diverge without any limit point. This is a general limitation of the class of gradient descent methods for nonconvex optimization. To guarantee the existence of a limit point, we need additional assumptions or mechanisms that ensure the boundedness of the iterates.
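The convergence argument hinges on the interplay of two time scales: the auxiliary sequence $y_{k+1} = (1-\beta_k) y_k + \beta_k g_{w_k}(x_k)$ runs on the faster $O(\beta_k)$ timescale and so can track the moving quantity $g(x_k)$ while $x_k$ drifts on the slower $O(\alpha_k)$ timescale. A hedged toy simulation of this tracking effect (the drift model, the choice $g = \sin$, and the noise level are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustration: x_k drifts on the slow O(alpha_k) timescale while
# y_{k+1} = (1 - beta_k) y_k + beta_k g_w(x_k) runs on the faster O(beta_k)
# timescale, so y_k tracks the moving target g(x_k) = E[g_w(x_k)].
g = np.sin                      # inner function (an arbitrary smooth choice)
x, y = 2.0, 0.0
errors = []
for k in range(1, 5001):
    alpha, beta = k ** -0.75, k ** -0.5
    x -= alpha * 0.3            # slow drift of the main iterate
    y = (1.0 - beta) * y + beta * (g(x) + rng.normal(scale=0.1))
    errors.append(abs(y - g(x)))

early, late = float(np.mean(errors[:100])), float(np.mean(errors[-100:]))
```

The tracking error $|y_k - g(x_k)|$ shrinks even though the target keeps moving, which is the mechanism the two-timescale analysis above exploits.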

2.3 Rate of Convergence

Now we establish the convergence rate and sample complexity of the basic SCGD algorithm. To do this, we consider the averaged iterates of Algorithm 1, given by
\[
\hat{x}_k = \frac{1}{N_k} \sum_{t=k-N_k}^{k} x_t,
\]
where we take $N_k = \lceil k/2 \rceil$ for simplicity. Note that the convergence rate is related to the stepsizes of the algorithm. By convention, we choose the stepsizes $\{\alpha_k\}$ and $\{\beta_k\}$ as powers of $k$, i.e.,
\[
\alpha_k = k^{-a}, \qquad \beta_k = k^{-b},
\]
and we aim to minimize the error bound over the parameters $a, b$. In what follows, we analyze three cases separately: (i) the case where $F = f\circ g$ is a convex function; (ii) the case where $F = f\circ g$ is a strongly convex function; (iii) the case where $F = f\circ g$ is not necessarily convex. For the convex and strongly convex cases (i) and (ii), we consider the rate of convergence of Algorithm 1 in terms of the optimality error $F(x_k) - F^*$ and the distance to the optimal solution $\|x_k - x^*\|$, respectively. For the nonconvex case (iii), we consider the convergence rate in terms of a metric of nonstationarity.

Theorem 6 (Convergence Rate of basic SCGD for Convex Problems) Suppose that Assumption 1 holds and $F$ is convex. Let $D_x > 0$ be such that $\sup_{x\in X} \|x - x^*\|^2 \le D_x$, and let $D_y > 0$ be the scalar defined in Lemma 2. Let the stepsizes be
\[
\alpha_k = k^{-a}, \qquad \beta_k = k^{-b},
\]
where $a, b$ are scalars in $(0, 1)$. Then:

(a) The averaged iterates generated by Algorithm 1 are such that
\[
\mathbb{E}\big[F(\hat{x}_k) - F^*\big] = O\Big( (1 + L_f^2 C_g^2)(D_x + D_y)\big(k^{a-1} + k^{b-a}(\log k)^{1_{a=b+1}}\big) + C_f C_g\, k^{-a}(\log k)^{1_{a=1}} + V_g\, k^{a-2b}(\log k)^{1_{1+a=2b}} + C_g^2 C_f\, k^{b-a}(\log k)^{1_{a=b+1}} \Big).
\]

(b) For $k$ sufficiently large, the preceding bound is minimized at $a = 3/4$, $b = 1/2$, yielding
\[
\mathbb{E}\big[F(\hat{x}_k) - F^*\big] = O\left( \frac{(1 + L_f^2 C_g^2)(D_x + D_y) + V_g + C_g^2 C_f}{k^{1/4}} + \frac{C_f C_g}{k^{3/4}} \right) = O\left(\frac{1}{k^{1/4}}\right).
\]

Proof. Define the random variable $J_k = \|x_k - x^*\|^2 + \|y_k - g(x_{k-1})\|^2$, so that $\mathbb{E}[J_k] \le D_x + D_y \equiv D$ for all $k$. We multiply Eq. (5) by $(1 + \beta_k)$ and take its sum with Eq. (9), obtaining
\[
\mathbb{E}[J_{k+1} \mid \mathcal{F}_k] \le \Big(1 + L_f^2 C_g^2 \frac{\alpha_k^2}{\beta_k}\Big) J_k - 2\alpha_k \big(F(x_k) - F^*\big) + C_f C_g \alpha_k^2 + 2 V_g \beta_k^2 (1 + \beta_k) + \frac{C_g (1 + \beta_k)}{\beta_k} \|x_k - x_{k-1}\|^2.
\]
Taking the expectation of both sides and using the fact $1 + \beta_k \le 2$, we obtain
\[
\mathbb{E}[J_{k+1}] \le \Big(1 + L_f^2 C_g^2 \frac{\alpha_k^2}{\beta_k}\Big) \mathbb{E}[J_k] - 2\alpha_k\, \mathbb{E}[F(x_k) - F^*] + C_f C_g \alpha_k^2 + 4 V_g \beta_k^2 + 2 C_g^2 C_f \frac{\alpha_k^2}{\beta_k}.
\]
Let $N > 0$. By reordering the terms in the preceding relation and taking its sum over $t = k-N, \ldots, k$,

we have
\[
\begin{aligned}
2\sum_{t=k-N}^{k} \mathbb{E}[F(x_t) - F^*]
&\le \sum_{t=k-N}^{k} \frac{1}{\alpha_t}\Big(1 + L_f^2 C_g^2 \frac{\alpha_t^2}{\beta_t}\Big)\big(\mathbb{E}[J_t] - \mathbb{E}[J_{t+1}]\big) + \sum_{t=k-N}^{k}\Big( C_f C_g \alpha_t + 4 V_g \frac{\beta_t^2}{\alpha_t} + 2 C_g^2 C_f \frac{\alpha_t}{\beta_t}\Big) \\
&\le \sum_{t=k-N}^{k}\Big(\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}}\Big)\mathbb{E}[J_t] - \frac{1}{\alpha_k}\mathbb{E}[J_{k+1}] + \frac{1}{\alpha_{k-N-1}}\mathbb{E}[J_{k-N}] \\
&\qquad + L_f^2 C_g^2 \sum_{t=k-N}^{k} \frac{\alpha_t}{\beta_t}\mathbb{E}[J_t] + C_f C_g \sum_{t=k-N}^{k}\alpha_t + 4 V_g \sum_{t=k-N}^{k}\frac{\beta_t^2}{\alpha_t} + 2 C_g^2 C_f \sum_{t=k-N}^{k}\frac{\alpha_t}{\beta_t} \\
&\le \sum_{t=k-N}^{k}\Big(\frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}}\Big) D + \frac{1}{\alpha_{k-N-1}} D + L_f^2 C_g^2 \sum_{t=k-N}^{k}\frac{\alpha_t}{\beta_t} D + C_f C_g \sum_{t=k-N}^{k}\alpha_t + 4 V_g \sum_{t=k-N}^{k}\frac{\beta_t^2}{\alpha_t} + 2 C_g^2 C_f \sum_{t=k-N}^{k}\frac{\alpha_t}{\beta_t} \\
&= \frac{1}{\alpha_k} D + L_f^2 C_g^2 \sum_{t=k-N}^{k}\frac{\alpha_t}{\beta_t} D + C_f C_g \sum_{t=k-N}^{k}\alpha_t + 4 V_g \sum_{t=k-N}^{k}\frac{\beta_t^2}{\alpha_t} + 2 C_g^2 C_f \sum_{t=k-N}^{k}\frac{\alpha_t}{\beta_t}.
\end{aligned}
\]

Let $\alpha_k = k^{-a}$ and $\beta_k = k^{-b}$, where $a, b$ are scalars in $(0, 1)$. We have
\[
\begin{aligned}
2\sum_{t=k-N}^{k} \mathbb{E}[F(x_t) - F^*]
\le O\Big( & k^{a} D + L_f^2 C_g^2 \big(k^{1+b-a} - (k-N)^{1+b-a}\big) D (\log k)^{1_{a=b+1}} \\
& + C_f C_g \big(k^{1-a} - (k-N)^{1-a}\big)(\log k)^{1_{a=1}} + V_g \big(k^{1+a-2b} - (k-N)^{1+a-2b}\big)(\log k)^{1_{1+a=2b}} \\
& + C_g^2 C_f \big(k^{1+b-a} - (k-N)^{1+b-a}\big)(\log k)^{1_{a=b+1}} \Big),
\end{aligned}
\]
where $1_A = 1$ if $A$ is true and $1_A = 0$ if $A$ is false. Note that the $\log k$ term only occurs in the rare situations where we take the sum $\sum_{t=1}^{k} \frac{1}{t}$, which is not of substantial importance to our analysis.

Using the convexity of $F$ and taking $N = N_k = \lceil k/2 \rceil$, we obtain
\[
\begin{aligned}
\mathbb{E}\big[F(\hat{x}_k) - F^*\big]
&\le \frac{1}{N_k} \sum_{t=k-N_k}^{k} \mathbb{E}[F(x_t) - F^*] \\
&\le O\Big( \big(k^{a-1} + L_f^2 C_g^2\, k^{b-a}(\log k)^{1_{a=b+1}}\big) D + C_f C_g\, k^{-a}(\log k)^{1_{a=1}} + V_g\, k^{a-2b}(\log k)^{1_{1+a=2b}} + C_g^2 C_f\, k^{b-a}(\log k)^{1_{a=b+1}} \Big).
\end{aligned}
\]
The order of the bound is minimized when $a = 3/4$ and $b = 1/2$, which completes the proof. $\square$
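The stepsize choice $a = 3/4$, $b = 1/2$ can be checked mechanically: ignoring log factors, each term of the Theorem 6 bound decays like $k^{-p_i}$ with exponents $1-a$, $a-b$, $a$, and $2b-a$, and maximizing their minimum over a grid recovers the stated optimum. A small exact-arithmetic check:

```python
from fractions import Fraction

# Ignoring log factors, the terms of the Theorem 6 bound decay like
# k^{-(1-a)}, k^{-(a-b)}, k^{-a}, and k^{-(2b-a)}, so the overall rate
# is k^{-p(a, b)} with
def p(a, b):
    return min(1 - a, a - b, a, 2 * b - a)

# Exhaustive search over a rational grid recovers the stated stepsizes.
grid = [Fraction(i, 100) for i in range(1, 100)]
p_star, a_star, b_star = max((p(a, b), a, b) for a in grid for b in grid)
```

The maximum $p^* = 1/4$ is attained only at $(a, b) = (3/4, 1/2)$, matching part (b) of the theorem.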


Next we consider a special case where $F = f\circ g$ is strongly convex in the following sense: there exists a scalar $\sigma > 0$ such that
\[
F(x) - F^* \ge \sigma \|x - x^*\|^2, \qquad \forall\, x \in X. \qquad (14)
\]
In the next theorem we show that a faster convergence rate can be obtained assuming strong convexity. This is consistent with the well-known complexity results for convex optimization.

Theorem 7 (Convergence Rate of basic SCGD for Strongly Convex Problems) Suppose that Assumption 1 holds and $F$ is strongly convex satisfying (14). Let the stepsizes be
\[
\alpha_k = \frac{1}{k\sigma}, \qquad \beta_k = \frac{1}{k^{2/3}}.
\]
Then
\[
\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\left( \frac{C_f C_g \log k}{\sigma^2 k} + \frac{L_f^2 C_g \big(C_f C_g^2/\sigma^2 + V_g\big)}{\sigma^2 k^{2/3}} \right) = O\left(\frac{1}{k^{2/3}}\right),
\]
and
\[
\mathbb{E}\big[\|\hat{x}_k - x^*\|^2\big] = O\left(\frac{1}{k^{2/3}}\right).
\]

Proof. We follow the line of analysis of Lemma 3. We first derive a different bound for the $u_k$ term given by Eq. (10). This time we have
\[
\mathbb{E}[u_k \mid \mathcal{F}_k] \le \alpha_k \sigma \|x_k - x^*\|^2 + \frac{\alpha_k L_f^2 C_g}{\sigma}\, \mathbb{E}\big[\|g(x_k) - y_{k+1}\|^2 \mid \mathcal{F}_k\big]. \qquad (15)
\]
Plugging Eqs. (14) and (15) into Eq. (11) and taking the expectation of both sides, we obtain
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] \le (1 - \sigma\alpha_k)\, \mathbb{E}\big[\|x_k - x^*\|^2\big] + C_f C_g \alpha_k^2 + \frac{\alpha_k L_f^2 C_g}{\sigma}\, \mathbb{E}\big[\|g(x_k) - y_{k+1}\|^2\big]. \qquad (16)
\]
We denote $a_k = \mathbb{E}\big[\|x_k - x^*\|^2\big]$ and $b_k = \mathbb{E}\big[\|y_k - g(x_{k-1})\|^2\big]$. We multiply Eq. (5) by $\Lambda_{k+1} + \frac{\alpha_k L_f^2 C_g}{\sigma}$ and take its sum with Eq. (16), where
\[
\Lambda_{k+1} = \max\left\{ \frac{L_f^2 C_g \alpha_k/\sigma}{\beta_k - \sigma\alpha_k} - \frac{L_f^2 C_g \alpha_k}{\sigma},\; 0 \right\} = \Theta\left( \frac{L_f^2 C_g \alpha_k}{\beta_k \sigma} \right).
\]
We obtain
\[
a_{k+1} + \Big(\Lambda_{k+1} + \frac{\alpha_k L_f^2 C_g}{\sigma}\Big) b_{k+1} \le (1 - \sigma\alpha_k)\, a_k + (1 - \beta_k)\Big(\Lambda_{k+1} + \frac{\alpha_k L_f^2 C_g}{\sigma}\Big) b_k + \xi_k + \frac{\alpha_k L_f^2 C_g}{\sigma}\, b_{k+1},
\]
where $\xi_k = C_f C_g \alpha_k^2 + \frac{L_f^2 C_g}{\sigma}\, O\Big( C_f C_g^2 \frac{\alpha_k^3}{\beta_k^2} + V_g \alpha_k \beta_k \Big)$, implying that
\[
a_{k+1} + \Lambda_{k+1} b_{k+1} \le (1 - \sigma\alpha_k)\big( a_k + \Lambda_{k+1} b_k \big) + \xi_k.
\]
Note that $0 < \Lambda_{k+1} \le \Lambda_k$ for sufficiently large $k$. (Indeed, under the chosen stepsizes $\Lambda_k = \Theta\big(L_f^2 C_g\, k^{-1/3}/\sigma^2\big)$, which is eventually decreasing in $k$ since its derivative with respect to $k$ is negative.) Thus, if $k$ is large enough, we have
\[
a_{k+1} + \Lambda_{k+1} b_{k+1} \le (1 - \sigma\alpha_k)\big( a_k + \Lambda_k b_k \big) + \xi_k.
\]
Letting $J_k = \|x_k - x^*\|^2 + \Lambda_k \|g(x_{k-1}) - y_k\|^2$, we have for $k$ sufficiently large that
\[
\mathbb{E}[J_{k+1}] \le (1 - \sigma\alpha_k)\, \mathbb{E}[J_k] + C_f C_g \alpha_k^2 + \frac{L_f^2 C_g}{\sigma}\, O\Big( C_f C_g^2 \frac{\alpha_k^3}{\beta_k^2} + V_g \alpha_k \beta_k \Big).
\]
Taking $\alpha_k = 1/(\sigma k)$, $\beta_k = k^{-2/3}$ and multiplying both sides of the preceding inequality by $k$, we have for $k$ sufficiently large that
\[
k\,\mathbb{E}[J_{k+1}] \le (k-1)\,\mathbb{E}[J_k] + \frac{C_f C_g}{k\sigma^2} + O\left( \frac{L_f^2 C_g \big(C_f C_g^2/\sigma^2 + V_g\big)}{\sigma^2 k^{2/3}} \right).
\]
Applying the preceding inequality inductively, we have
\[
k\,\mathbb{E}[J_{k+1}] \le O\left( \frac{C_f C_g}{\sigma^2}\, \log k + \frac{L_f^2 C_g \big(C_f C_g^2/\sigma^2 + V_g\big)}{\sigma^2}\, k^{1/3} \right).
\]
Finally, we have
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] \le \mathbb{E}[J_{k+1}] \le O\left( \frac{C_f C_g \log k}{\sigma^2 k} + \frac{L_f^2 C_g \big(C_f C_g^2/\sigma^2 + V_g\big)}{\sigma^2 k^{2/3}} \right) = O\big(k^{-2/3}\big).
\]
By the convexity of $\|\cdot\|^2$, the averaged iterates $\hat{x}_k$ satisfy the same bound. $\square$

Let us compare the rates of convergence for convex and strongly convex problems after $k$ queries to the SO. For convex problems, the error is of order $k^{-1/4}$, while for strongly convex problems the error is of order $k^{-2/3}$. As expected, strongly convex problems are "easier" to solve than problems lacking strong convexity.

Finally, we analyze the behavior of the basic SCGD for nonconvex optimization problems. Without convexity, the algorithm is no longer guaranteed to find a global optimum. However, we have shown that any limit point of the iterates produced by the algorithm must be a stationary point of the nonconvex problem. In the next theorem, we provide an estimate that quantifies how fast the nonstationarity metric $\|\nabla F(x_k)\|$ decreases to zero.

Theorem 8 (Convergence Rate of basic SCGD for Nonconvex Problems) Suppose that Assumption 1 holds, $F$ is Lipschitz differentiable with parameter $L_F$, and $X = \Re^n$. Let the stepsizes be
\[
\alpha_k = k^{-a}, \qquad \beta_k = k^{-b},
\]

where $a, b$ are scalars in $(0, 1)$. Let
\[
T_\epsilon = \min\Big\{ k \,:\, \inf_{0 \le t \le k} \mathbb{E}\big[\|\nabla F(x_t)\|^2\big] \le \epsilon \Big\};
\]
then $T_\epsilon \le O(\epsilon^{-1/p})$, where $p = \min\{1-a,\, a-b,\, 2b-a,\, a\}$. By minimizing the complexity bound over $a, b$, we obtain $T_\epsilon \le O(\epsilon^{-4})$ with $a = 3/4$, $b = 1/2$.

Proof. Define the random variable $J_k = F(x_k) + \|y_k - g(x_{k-1})\|^2$. Multiplying Eq. (5) by $(1 + \beta_k)$ and taking its sum with Eq. (12), we have for $k$ sufficiently large that
\[
\mathbb{E}[J_{k+1} \mid \mathcal{F}_k] \le J_k - (\alpha_k/2)\|\nabla F(x_k)\|^2 + 2\beta_k^{-1} C_g \|x_k - x_{k+1}\|^2 + 4 V_g \beta_k^2 + \alpha_k^2 L_F C_f C_g.
\]
Note that $\{\mathbb{E}[F(x_k)]\}$ and $\{\mathbb{E}[J_k]\}$ are bounded from below. We follow the same line of analysis as in Theorem 6. By a similar argument, we can show that
\[
\frac{1}{k} \sum_{t=1}^{k} \mathbb{E}\big[\|\nabla F(x_t)\|^2\big] = O\big( k^{a-1} J_0 + k^{b-a} C_f^2 C_g + 4 V_g\, k^{a-2b} + k^{-a} L_F C_f C_g \big) = O(k^{-p}),
\]
where $p(a, b) = \min\{1-a,\, a-b,\, 2b-a,\, a\}$. By the definition of $T_\epsilon$, we have
\[
\mathbb{E}\big[\|\nabla F(x_k)\|^2\big] > \epsilon \quad \text{if } k < T_\epsilon.
\]
Combining the preceding two relations, we obtain
\[
\epsilon \le \frac{1}{T_\epsilon} \sum_{k=1}^{T_\epsilon} \mathbb{E}\big[\|\nabla F(x_k)\|^2\big] = O\big(T_\epsilon^{-p}\big),
\]
implying that $T_\epsilon \le O(\epsilon^{-1/p})$. By minimizing the complexity bound $O(\epsilon^{-1/p})$ over $a, b$, we obtain $T_\epsilon \le O(\epsilon^{-4})$ with $a = 3/4$, $b = 1/2$. $\square$

3 Acceleration for Smooth Convex Optimization

In this section, we propose an accelerated version of SCGD that achieves a faster rate of convergence when the objective function is differentiable, which we refer to as accelerated SCGD. Recall our optimization problem
\[
\min_{x\in X}\; (f\circ g)(x), \qquad \text{where } g(x) = \mathbb{E}[g_w(x)], \quad f(y) = \mathbb{E}[f_v(y)],
\]
for all $x \in \Re^n$, $y \in \Re^m$.

Theorem 13 Let $D > 0$ be a constant such that $\sup_{x\in X} \|x - x^*\|^2 \le D$, and let the stepsizes be
\[
\alpha_k = k^{-a}, \qquad \beta_k = k^{-b},
\]
where $a, b$ are scalars in $(0, 1)$. Then:

(a) The averaged iterates generated by the accelerated SCGD (Algorithm 2) are such that
\[
\mathbb{E}\big[F(\hat{x}_k) - F^*\big] = O\Big( D k^{a-1} + C_f C_g\, k^{-a} + C_1\, k^{-b/2} (\log k)^{1_{b=1/2}} + C_2\, k^{-2a+2b} (\log k)^{1_{4a=3b+1}} \Big),
\]
where $C_1 = \sqrt{D C_g V_g}\, L_f$ and $C_2 = \sqrt{D C_g}\, L_f L_g C_g C_f$.

(b) For $k$ sufficiently large, the preceding error bound is minimized at $a = 5/7$, $b = 4/7$, yielding
\[
\mathbb{E}\big[F(\hat{x}_k) - F^*\big] = O\big(k^{-2/7}\big).
\]

Proof. By the proof of Theorem 9, we have
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k\big] \le \|x_k - x^*\|^2 + \alpha_k^2 C_f C_g - 2\alpha_k \big(F(x_k) - F^*\big) + \mathbb{E}[u_k \mid \mathcal{F}_k],
\]
where $u_k = 2\alpha_k (x_k - x^*)' \nabla g_{w_k}(x_k)\big(\nabla f_{v_k}(g(x_k)) - \nabla f_{v_k}(y_k)\big)$. Taking the expectation of both sides and reordering the terms, we obtain
\[
2\,\mathbb{E}[F(x_k) - F^*] \le \frac{1}{\alpha_k}\, \mathbb{E}\big[\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\big] + \alpha_k C_f C_g + \frac{1}{\alpha_k}\,\mathbb{E}[u_k].
\]
Taking the sum of the preceding inequalities over $t = k-N, \ldots, k$, we have

\[
\begin{aligned}
2 \sum_{t=k-N}^{k} \mathbb{E}[F(x_t) - F^*]
&\le \sum_{t=k-N}^{k} \frac{1}{\alpha_t}\Big( \mathbb{E}\big[\|x_t - x^*\|^2\big] - \mathbb{E}\big[\|x_{t+1} - x^*\|^2\big] \Big) + \sum_{t=k-N}^{k} \Big( C_f C_g \alpha_t + \frac{1}{\alpha_t}\,\mathbb{E}[u_t] \Big) \\
&= \sum_{t=k-N}^{k} \Big( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \Big)\, \mathbb{E}\big[\|x_t - x^*\|^2\big] - \frac{1}{\alpha_k}\,\mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] + \frac{1}{\alpha_{k-N-1}}\, \mathbb{E}\big[\|x_{k-N} - x^*\|^2\big] \\
&\qquad + C_f C_g \sum_{t=k-N}^{k} \alpha_t + \sum_{t=k-N}^{k} \frac{1}{\alpha_t}\, \mathbb{E}[u_t].
\end{aligned}
\]

Using the fact $\|x_k - x^*\|^2 \le D$, the Lipschitz continuity of $\nabla f$, and the Cauchy-Schwarz inequality, we obtain
\[
\begin{aligned}
\frac{1}{\alpha_k}\,\mathbb{E}[u_k]
&\le 2 \|x_k - x^*\|\, \mathbb{E}\big[ \|\nabla g_{w_k}(x_k)\| \, \|\nabla f_{v_k}(y_k) - \nabla f_{v_k}(g(x_k))\| \big] \\
&\le 2\sqrt{D}\, L_f\, \mathbb{E}\big[\|\nabla g_{w_k}(x_k)\|^2\big]^{1/2}\, \mathbb{E}\big[\|y_k - g(x_k)\|^2\big]^{1/2} \\
&\le 2\sqrt{D C_g}\, L_f\, \mathbb{E}\big[e_k^2\big]^{1/2}.
\end{aligned}
\]
Using $\|x_k - x^*\|^2 \le D$ again, we further obtain
\[
\begin{aligned}
2 \sum_{t=k-N}^{k} \mathbb{E}[F(x_t) - F^*]
&\le \sum_{t=k-N}^{k} \Big( \frac{1}{\alpha_t} - \frac{1}{\alpha_{t-1}} \Big) D + \frac{1}{\alpha_{k-N-1}}\, D + C_f C_g \sum_{t=k-N}^{k} \alpha_t + 2\sqrt{D C_g}\, L_f \sum_{t=k-N}^{k} \mathbb{E}\big[e_t^2\big]^{1/2} \\
&= \frac{D}{\alpha_k} + C_f C_g \sum_{t=k-N}^{k} \alpha_t + 2\sqrt{D C_g}\, L_f \sum_{t=k-N}^{k} \mathbb{E}\big[e_t^2\big]^{1/2}. \qquad (22)
\end{aligned}
\]

By taking $\alpha_k = k^{-a}$, $\beta_k = k^{-b}$, we find an upper bound for the quantity $\sum_{t=k-N}^{k} \mathbb{E}[e_t^2]^{1/2}$. Using Lemma 12, we have
\[
\mathbb{E}\big[e_{k+1}^2\big] \le \big(1 - k^{-b}/2\big)\, \mathbb{E}\big[e_k^2\big] + O\big( V_g\, k^{-2b} + L_g^2 C_g^2 C_f^2\, k^{-4a+3b} \big), \qquad (23)
\]
and we can easily prove by induction that for all $k$,
\[
\mathbb{E}\big[e_{k+1}^2\big] \le O\big( V_g\, k^{-2b+1} (\log k)^{1_{b=1/2}} + L_g^2 C_g^2 C_f^2\, k^{-4a+3b+1} (\log k)^{1_{4a=3b+1}} \big),
\]
where $1_A$ is the indicator function of the event $A$, and the $\log k$ term only occurs if $b = 1/2$ or $4a = 3b + 1$. Letting $N = k/2$, we reorder Eq. (23) and take its sum over $t = k-N, \ldots, k$ to obtain
\[
\begin{aligned}
\frac{1}{N} \sum_{t=k-N}^{k} (t^{-b}/2)\, \mathbb{E}\big[e_t^2\big]
&\le \frac{1}{N} \sum_{t=k-N}^{k} \Big( \mathbb{E}\big[e_t^2\big] - \mathbb{E}\big[e_{t+1}^2\big] + O\big( V_g\, t^{-2b} + L_g^2 C_g^2 C_f^2\, t^{-4a+3b} \big) \Big) \\
&\le \frac{1}{N}\, \mathbb{E}\big[e_{k-N}^2\big] + \frac{1}{N} \sum_{t=k-N}^{k} O\big( V_g\, t^{-2b} + L_g^2 C_g^2 C_f^2\, t^{-4a+3b} \big) \\
&= O\big( V_g\, k^{-2b} (\log k)^{1_{b=1/2}} + L_g^2 C_g^2 C_f^2\, k^{-4a+3b} (\log k)^{1_{4a=3b+1}} \big).
\end{aligned}
\]
Using the Cauchy-Schwarz inequality, we have
\[
\frac{1}{N} \sum_{t=k-N}^{k} \mathbb{E}\big[e_t^2\big]^{1/2} \le \left( \frac{1}{N} \sum_{t=k-N}^{k} t^{b} \right)^{1/2} \left( \frac{1}{N} \sum_{t=k-N}^{k} t^{-b}\, \mathbb{E}\big[e_t^2\big] \right)^{1/2} = O\Big( V_g^{1/2}\, k^{-b/2} (\log k)^{1_{b=1/2}} + L_g C_g C_f\, k^{-2a+2b} (\log k)^{1_{4a=3b+1}} \Big).
\]
Finally, we return to Eq. (22). Using the convexity of $F$ and applying the inequality above, we obtain
\[
\begin{aligned}
\mathbb{E}\big[F(\hat{x}_k) - F^*\big]
&\le \frac{1}{N_k} \sum_{t=k-N_k}^{k} \mathbb{E}[F(x_t) - F^*] \\
&\le D k^{a-1} + C_f C_g\, O(k^{-a}) + \sqrt{D C_g}\, L_f\, O\Big( V_g^{1/2}\, k^{-b/2} (\log k)^{1_{b=1/2}} + L_g C_g C_f\, k^{-2a+2b} (\log k)^{1_{4a=3b+1}} \Big),
\end{aligned}
\]
where the second inequality uses the fact $N_k = \Theta(k)$. To minimize the upper bound given by Theorem 13 for $k$ sufficiently large, we take
\[
a^* = \frac{5}{7}, \qquad b^* = \frac{4a^*}{5} = \frac{4}{7}.
\]
Then we have
\[
\mathbb{E}\big[F(\hat{x}_k) - F^*\big] = O\left( \frac{D + \sqrt{D C_g V_g}\, L_f + \sqrt{D C_g}\, L_f L_g C_g C_f}{k^{2/7}} + \frac{C_f C_g}{k^{5/7}} \right) = O\big(k^{-2/7}\big).
\]
This proves acceleration as compared to the $O(k^{-1/4})$ error bound of the basic SCGD shown in Theorem 6. $\square$

Next we consider the case where the problem is strongly convex, i.e., there exist $\sigma > 0$ and a unique optimal solution $x^*$ such that $F(x) - F^* \ge \sigma \|x - x^*\|^2$ for all $x \in X$.
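Before turning to the strongly convex case, note that the exponent choice $a^* = 5/7$, $b^* = 4/7$ of the preceding proof can be verified mechanically in the same way as for the basic method: ignoring log factors, the terms of the Theorem 13(a) bound decay like $k^{-(1-a)}$, $k^{-a}$, $k^{-b/2}$, and $k^{-(2a-2b)}$, and maximizing the minimum exponent over a rational grid recovers the stated optimum:

```python
from fractions import Fraction

# Each term of the Theorem 13(a) bound is O(k^{-q_i}) (up to log factors)
# with exponents 1-a, a, b/2, and 2a-2b, so the overall rate is
# k^{-q(a, b)} with
def q(a, b):
    return min(1 - a, a, b / 2, 2 * a - 2 * b)

# A rational grid containing multiples of 1/7 recovers the optimum.
grid = [Fraction(i, 140) for i in range(1, 140)]
q_star, a_star, b_star = max((q(a, b), a, b) for a in grid for b in grid)
```

The maximum $q^* = 2/7$ is attained only at $(a, b) = (5/7, 4/7)$, matching part (b).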

Theorem 14 (Accelerated Convergence Rate for Strongly Convex Problems) Let Assumptions 1 and 2 hold, and let $F$ be strongly convex with parameter $\sigma$. Let the stepsizes be
\[
\alpha_k = \frac{1}{\sigma k}, \qquad \beta_k = \frac{1}{k^{4/5}}.
\]
Then the iterates generated by Algorithm 2 are such that
\[
\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\left( \frac{C_f C_g \log k}{\sigma^2 k} + \frac{L_f^2 C_g \big( L_g^2 C_f^2 C_g^2/\sigma^3 + V_g \big)}{\sigma^2 k^{4/5}} \right) = O\big(k^{-4/5}\big).
\]

Proof. Recalling the proof of Theorem 9, we have
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k\big] \le \|x_k - x^*\|^2 + \alpha_k^2 C_f C_g - 2\alpha_k \big(F(x_k) - F^*\big) + \mathbb{E}[u_k \mid \mathcal{F}_k], \qquad (24)
\]
where $u_k = 2\alpha_k (x_k - x^*)' \nabla g_{w_k}(x_k)\big(\nabla f_{v_k}(g(x_k)) - \nabla f_{v_k}(y_k)\big)$. By strong convexity, we have $F(x_k) - F^* \ge \sigma \|x_k - x^*\|^2$. By Assumption 1, we have
\[
\begin{aligned}
\mathbb{E}[u_k \mid \mathcal{F}_k] &= 2\alpha_k (x_k - x^*)'\, \mathbb{E}\big[ \nabla g_{w_k}(x_k)\big(\nabla f_{v_k}(g(x_k)) - \nabla f_{v_k}(y_k)\big) \mid \mathcal{F}_k \big] \\
&\le \alpha_k \sigma \|x_k - x^*\|^2 + \frac{L_f^2 C_g}{\sigma}\, \alpha_k\, \mathbb{E}\big[\|g(x_k) - y_k\|^2 \mid \mathcal{F}_k\big] \\
&\le \alpha_k \sigma \|x_k - x^*\|^2 + \frac{L_f^2 C_g}{\sigma}\, \alpha_k\, \mathbb{E}\big[e_k^2 \mid \mathcal{F}_k\big],
\end{aligned}
\]
where the third relation uses the fact $e_k^2 \ge \|y_k - g(x_k)\|^2$ from Lemma 12. Taking the expectation of both sides of (24) and applying the preceding relations, we obtain
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] \le (1 - \sigma\alpha_k)\, \mathbb{E}\big[\|x_k - x^*\|^2\big] + C_f C_g \alpha_k^2 + \frac{L_f^2 C_g}{\sigma}\, \alpha_k\, \mathbb{E}\big[e_k^2\big]. \qquad (25)
\]
Recalling Lemma 12, we have
\[
\mathbb{E}\big[e_1^2\big] \le 2 V_g, \qquad \mathbb{E}\big[e_{k+1}^2\big] \le \Big(1 - \frac{\beta_k}{2}\Big)\, \mathbb{E}\big[e_k^2\big] + O\Big( L_g^2 C_f^2 C_g^2\, \frac{\alpha_k^4}{\beta_k^3} + V_g \beta_k^2 \Big). \qquad (26)
\]
In what follows we analyze the convergence rate based on these two iterative inequalities. Let us define the variable $J_k = \|x_k - x^*\|^2 + \Lambda_k e_k^2$, where $\Lambda_k$ is a scalar given by
\[
\Lambda_{k+1} = \max\left\{ \frac{\alpha_k L_f^2 C_g}{\sigma\big(2^{-1}\beta_k - \sigma\alpha_k\big)},\; 0 \right\}.
\]
By our choice of $\alpha_k$ and $\beta_k$, the scalar $\Lambda_k$ satisfies $0 \le \Lambda_{k+1} \le \Lambda_k = \frac{L_f^2 C_g}{\sigma}\, \Theta\big(\alpha_k \beta_k^{-1}\big)$ for $k$ sufficiently large, since its derivative with respect to $k$ is negative.

By applying Eq. (25), Eq. (26), and the properties of $\Lambda_k$, we obtain for $k$ sufficiently large that
\[
\begin{aligned}
\mathbb{E}[J_{k+1}] &= \mathbb{E}\big[\|x_{k+1} - x^*\|^2 + \Lambda_{k+1} e_{k+1}^2\big] \le \mathbb{E}\big[\|x_{k+1} - x^*\|^2 + \Lambda_k e_{k+1}^2\big] \\
&\le (1 - \sigma\alpha_k)\, \mathbb{E}[J_k] + C_f C_g \alpha_k^2 + \frac{L_f^2 C_g}{\sigma}\, O\big( 2\alpha_k \beta_k V_g + \alpha_k^5 \beta_k^{-4} L_g^2 C_g^2 C_f^2 \big).
\end{aligned}
\]
Taking $\alpha_k = (\sigma k)^{-1}$ and $\beta_k = k^{-4/5}$, we have for $k$ sufficiently large that
\[
\mathbb{E}[J_{k+1}] \le \Big(1 - \frac{1}{k}\Big)\, \mathbb{E}[J_k] + \frac{C_f C_g}{\sigma^2 k^2} + O\left( \frac{L_f^2 C_g \big( V_g + L_g^2 C_g^2 C_f^2/\sigma^3 \big)}{\sigma^2 k^{9/5}} \right).
\]
Multiplying both sides by $k$ and using induction, we obtain
\[
k\,\mathbb{E}[J_{k+1}] \le (k-1)\,\mathbb{E}[J_k] + \frac{C_f C_g}{k\sigma^2} + O\left( \frac{L_f^2 C_g \big( V_g + L_g^2 C_g^2 C_f^2/\sigma^3 \big)}{\sigma^2 k^{4/5}} \right) = O\left( \frac{C_f C_g (\log k + 1)}{\sigma^2} + \frac{k^{1/5} L_f^2 C_g \big( V_g + L_g^2 C_g^2 C_f^2/\sigma^3 \big)}{\sigma^2} \right).
\]
Therefore
\[
\mathbb{E}\big[\|x_{k+1} - x^*\|^2\big] \le \mathbb{E}[J_{k+1}] \le O\left( \frac{C_f C_g \log k}{\sigma^2 k} + \frac{L_f^2 C_g \big( V_g + L_g^2 C_g^2 C_f^2/\sigma^3 \big)}{\sigma^2 k^{4/5}} \right) = O\big(k^{-4/5}\big). \qquad \square
\]
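Collecting the exponents established in Theorems 6-7 and 13-14, a quick exact-arithmetic sanity check that acceleration strictly improves the rate in both regimes (the accuracy value used in the query count is an arbitrary illustration):

```python
from fractions import Fraction as F

# Exponents p in the O(k^{-p}) error bounds established above.
rates = {
    "convex": {"basic": F(1, 4), "accelerated": F(2, 7)},
    "strongly convex": {"basic": F(2, 3), "accelerated": F(4, 5)},
}

# Queries to the SO needed to drive the error below eps scale as eps^(-1/p);
# eps = 1e-2 here is purely illustrative arithmetic.
def queries(p, eps=1e-2):
    return eps ** (-1 / float(p))

improvement = {case: r["accelerated"] - r["basic"] for case, r in rates.items()}
```

For instance, at accuracy $10^{-2}$ the convex-case query count drops from $10^{8}$ (exponent $1/4$) to $10^{7}$ (exponent $2/7$).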


Lastly, we consider the case where the objective function is not necessarily convex. We obtain the following convergence rate in terms of the nonstationarity metric $\mathbb{E}\big[\|\nabla F(x_k)\|^2\big]$.

Theorem 15 (Convergence Rate of accelerated SCGD for Nonconvex Problems) Suppose that Assumptions 1 and 2 hold, $F$ has Lipschitz continuous gradient, and $X = \Re^n$.