arXiv:1604.03584v2 [cs.LG] 14 Apr 2016

Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization

Zhouyuan Huo [email protected]
Heng Huang [email protected]

April 15, 2016

Abstract

We provide the first theoretical analysis of the convergence rate of asynchronous stochastic variance reduced gradient (SVRG) descent for non-convex optimization. Recent studies have shown that asynchronous stochastic gradient descent (SGD) algorithms with variance reduction converge at a linear rate on convex problems. However, there has been no analysis of asynchronous SGD with variance reduction on non-convex problems. In this paper, we study two asynchronous parallel implementations of SVRG: one for distributed memory systems and one for shared memory systems. We prove that both algorithms obtain a convergence rate of O(1/T), and that linear speedup is achievable when the number of workers is upper bounded.

1 Introduction

We consider the following non-convex finite-sum problem:

    min_{x ∈ R^d} f(x) = (1/n) Σ_{i=1}^{n} f_i(x),    (1)

where f(x) and the f_i(x) are Lipschitz smooth. In this paper, we use F_n to denote the class of all functions of the form above. Due to its efficiency and effectiveness, stochastic gradient descent (SGD) has been widely used to solve this kind of problem. However, because ∇f_i(x) is used as an estimate of the full gradient, the variance of the stochastic gradient and the decreasing learning rate lead to a slow convergence rate of O(1/√T), even for convex problems. Recently, variance reduced SGD algorithms [6, 14, 3] have gained much attention for solving machine learning problems of the form (1). These methods achieve a linear convergence rate on convex problems. In [12], variance reduced stochastic gradient methods are analyzed on non-convex problems, and a sublinear convergence rate of O(1/T) is proved. Although a faster convergence rate can be achieved with variance reduction, a sequential method on a single machine may still not be enough to solve large-scale problems. There is therefore growing interest in asynchronous distributed machine learning and optimization [11, 7, 8, 15, 16, 1, 5, 9, 10, 2]. The key idea of asynchronous

parallelism is to allow workers to work independently, without the need to synchronize with each other. In general, there are two main categories of distributed architecture: shared memory architectures [11, 16] and distributed memory architectures [8, 15]. [8] considered asynchronous stochastic gradient descent on non-convex problems; however, they did not use any variance reduction technique, which can yield a substantial acceleration. [15, 16, 13] proposed distributed variance reduced stochastic gradient methods and proved that a linear convergence rate can be obtained on convex problems. However, there is no theoretical analysis of the corresponding non-convex situation. To fill these gaps, in this paper we focus on asynchronous stochastic gradient descent with variance reduction for non-convex optimization. Two different algorithms and analyses are proposed, one per distributed category: one for shared memory systems (multicore, multi-GPU) and one for distributed memory systems. The key difference between these two categories is that a distributed memory system can ensure atomic reads and writes of the whole vector x, while a shared memory system can usually only ensure atomic reads and writes of a single coordinate of x. We apply asynchronous SVRG to both systems and show that both obtain an ergodic convergence rate of O(1/T). Besides, we also prove that linear speedup is achievable if the number of workers is upper bounded. We list our main contributions as follows:

• Our asynchronous SVRG for distributed memory systems improves the earlier convergence rate analysis of ASYSG-CON for non-convex optimization in [8] and extends the asynchronous distributed semi-stochastic gradient optimization of [15] to the non-convex case. We obtain a sublinear convergence rate of O(1/T) on non-convex problems.

• Our asynchronous SVRG for shared memory systems improves the earlier convergence rate analysis of ASYSG-INCON for non-convex optimization in [8] and extends AsySVRG [16] to the non-convex case. We obtain a sublinear convergence rate of O(1/T) on non-convex problems.

2 Background

In the convex case, f(x) − f(x*) or ||x − x*|| is used as the convergence criterion. Unfortunately, since we focus on non-convex problems, such criteria cannot be used here. Following [8, 12], we use the weighted average of the ℓ2 norm of all gradients, ||∇f(x)||², as the metric. Although f(x) − f(x*), ||x − x*|| and ||∇f(x)||² are not directly comparable, they can be assumed to be of the same order [4]. For the analysis throughout this paper, we make the following assumptions for problem (1). All of them are standard assumptions in the theoretical analysis of stochastic gradient algorithms.

Assumption 1 We assume the following conditions hold:

• Independence: All random samples i_t are selected independently of each other.

• Unbiased Gradient: The stochastic gradient ∇f_{i_t}(x) is unbiased (illustrated numerically below),

    E[∇f_{i_t}(x)] = ∇f(x).    (2)

• Lipschitz Gradient: We say f(x) is L-smooth if there is a constant L such that

    ||∇f(x) − ∇f(y)|| ≤ L||x − y||.    (3)

  Throughout, we assume that each function f_{i_t}(x) is L-smooth, so that ||∇f_{i_t}(x) − ∇f_{i_t}(y)|| ≤ L||x − y||.

• Bounded Delay: The delay variable τ is bounded: max τ ≤ ∆.
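As a quick numerical illustration of the Unbiased Gradient assumption, the following Python sketch builds a small least-squares finite sum (the data, dimensions, and function names are assumptions for illustration, not part of the paper) and checks that the component gradients ∇f_i(x) average to the full gradient ∇f(x) under uniform sampling.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    A = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    x = rng.standard_normal(d)

    def grad_fi(x, i):
        # Gradient of a hypothetical component f_i(x) = 0.5 * (A[i] @ x - y[i])**2.
        return A[i] * (A[i] @ x - y[i])

    full_grad = A.T @ (A @ x - y) / n                                      # gradient of f(x) = (1/n) sum_i f_i(x)
    avg_sample_grad = np.mean([grad_fi(x, i) for i in range(n)], axis=0)   # E_i[grad f_i(x)] under uniform sampling
    print(np.allclose(full_grad, avg_sample_grad))                         # True: the stochastic gradient is unbiased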

3 Asynchronous SVRG for Distributed Memory Systems

In this section, we propose an asynchronous SVRG algorithm for distributed memory systems and analyze its convergence rate.

3.1 Algorithm Description

In each iteration, the parameter x is updated through the following update rule,

    x_{t+1}^{s+1} = x_t^{s+1} − η v_t^{s+1},    (4)

where the learning rate η is constant and v_t^{s+1} denotes the variance reduced gradient

    v_t^{s+1} = ∇f_{i_t}(x_{t−τ}^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s),    (5)

where i_t denotes the index of the sampled function, τ denotes the time delay, and x̃^s denotes the snapshot of x taken every m iterations. We summarize the asynchronous stochastic gradient method with variance reduction on a distributed memory system in Algorithm 1.

Algorithm 1 AsySVRG 1
  Initialize x^0 ∈ R^d.
  for s = 0, 1, 2, ..., S − 1 do
    x̃^s ← x^s;
    Compute the full gradient ∇f(x̃^s) ← (1/n) Σ_{i_t=1}^{n} ∇f_{i_t}(x̃^s);
    for t = 0, 1, 2, ..., m − 1 do
      Randomly select i_t from {1, ..., n};
      Compute the update vector: v_t^{s+1} ← ∇f_{i_t}(x_{t−τ}^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s);
      Update x_{t+1}^{s+1} ← x_t^{s+1} − η v_t^{s+1};
    end for
    x^{s+1} ← x_m^{s+1};
  end for
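To make the update rules (4) and (5) concrete, here is a minimal Python sketch that simulates one outer loop of Algorithm 1 on a single machine, emulating the bounded delay by evaluating the stochastic gradient at a stale iterate x_{t−τ}. The quadratic component functions, the delay model, and all names are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 10
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)

    def grad_fi(x, i):
        # Gradient of a hypothetical component f_i(x) = 0.5 * (A[i] @ x - b[i])**2.
        return A[i] * (A[i] @ x - b[i])

    def full_grad(x):
        # Full gradient (1/n) * sum_i grad f_i(x).
        return A.T @ (A @ x - b) / n

    def asysvrg_epoch(x, m=200, eta=0.05, max_delay=4):
        # One outer epoch of Algorithm 1, with the delay simulated by reading an old iterate.
        x_tilde = x.copy()               # snapshot x~^s
        g_tilde = full_grad(x_tilde)     # full gradient at the snapshot
        history = [x.copy()]             # past iterates, used to read a delayed x_{t - tau}
        for t in range(m):
            i = rng.integers(n)                                   # random sample i_t
            tau = rng.integers(0, min(max_delay, len(history)))   # simulated bounded delay tau <= Delta
            x_delayed = history[-1 - tau]                         # stale parameter x_{t - tau}
            v = grad_fi(x_delayed, i) - grad_fi(x_tilde, i) + g_tilde   # Eq. (5)
            x = x - eta * v                                       # Eq. (4)
            history.append(x.copy())
        return x

    x = np.zeros(d)
    for s in range(5):
        x = asysvrg_epoch(x)
    print("squared gradient norm:", np.linalg.norm(full_grad(x)) ** 2)

In a real distributed-memory deployment, the snapshot and full gradient would live on a parameter server and each worker would run the inner loop against possibly stale copies of x; the single-process version above only illustrates the arithmetic of the update.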

3.2 Convergence Analysis

Assumption 2 For the distributed memory architecture specifically,

• x_{t−τ} denotes the old parameter used to compute the gradient, where τ ≤ ∆.

The intuition behind variance reduced SGD methods is to reduce the variance of the stochastic gradients. To analyze the convergence of Algorithm 1, a key step is to obtain an upper bound on the ℓ2 norm ||v_t^{s+1}||².

Lemma 3.1 For the variance reduced gradient v_t^{s+1} defined in Equation (5), define

    u_t^{s+1} = ∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s).    (6)

Then we have the following inequalities:

    Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2/(1 − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²],    (7)

    E[||u_t^{s+1}||²] ≤ 2E[||∇f(x_t^{s+1})||²] + 2L² E[||x_t^{s+1} − x̃^s||²].    (8)

Furthermore, we are able to show the convergence rate of Algorithm 1 based on Lemma 3.1 as follows.

Theorem 3.2 Suppose f ∈ F_n, x ∈ R^d. We define

    c_t = c_{t+1}(1 + ηβ_t + 4L²η²/(1 − 2L²∆²η²)) + [4L²/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2),    (9)

    Γ_t = η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²),    (10)

where c_m = 0, the learning rate η_t = η > 0 is constant, and β_t = β > 0 is chosen such that Γ_t > 0 for all t ∈ [0, m − 1]; T denotes the total number of iterations. Define γ = min_t Γ_t and let x* be the optimal solution. Then, we have the following ergodic convergence rate for iteration T:

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (11)

We note that γ depends on n, L, ∆. To clarify its dependence, we simply set η and β and obtain the following theorem.

Theorem 3.3 Suppose f ∈ F_n. Let η = u_0/(Ln^α), where 0 < u_0 < 1 and 0 < α ≤ 1, β = L/n^{α/2}, m = ⌊n^{3α/2}/(5u_0)⌋, and let T denote the total number of iterations. Then there exist universal constants u_0, σ such that γ ≥ σ/(Ln^α) in Theorem 3.2, and if the time delay ∆ is upper bounded by

    ∆² < min{ (1 − 8u_0)/(8u_0² + 2L²η²), (1 − 8u_0 n^{−α})/(6L²η²) },    (12)

then we have

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(Tσ).    (13)

Since this rate does not depend on the delay parameter ∆, the negative effect of using old values of x for the stochastic gradient evaluation vanishes asymptotically; namely, we can achieve linear speedup when we increase the number of workers.
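The following short sketch shows one way the parameter choices of Theorem 3.3 could be instantiated in code; the constants L, n, u_0 and α below are placeholder assumptions, and the printed delay bound is simply Eq. (12) evaluated for those values.

    import math

    # Hypothetical problem constants (assumptions for illustration only).
    L, n = 1.0, 10000            # Lipschitz constant and number of component functions
    u0, alpha = 0.1, 1.0         # constants from Theorem 3.3: 0 < u0 < 1, 0 < alpha <= 1

    eta = u0 / (L * n**alpha)                        # step size eta = u0 / (L n^alpha)
    beta = L / n**(alpha / 2)                        # beta = L / n^{alpha/2}
    m = math.floor(n**(1.5 * alpha) / (5 * u0))      # epoch length m = floor(n^{3 alpha / 2} / (5 u0))

    # Upper bound on the squared delay Delta^2 from Eq. (12).
    delta_sq_bound = min((1 - 8 * u0) / (8 * u0**2 + 2 * L**2 * eta**2),
                         (1 - 8 * u0 * n**(-alpha)) / (6 * L**2 * eta**2))

    print(f"eta={eta:.2e}, beta={beta:.2e}, m={m}, Delta^2 < {delta_sq_bound:.2f}")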

3.3 Mini-Batch Extension

In this section, we extend Algorithm 1 to a mini-batch version. The mini-batch strategy is widely used in distributed computing: it not only greatly reduces communication costs, but also reduces the variance of the stochastic gradient. We use a mini-batch I_t of size b, and the gradient v_t^{s+1} in Algorithm 1 is replaced with

    v_t^{s+1} = (1/|I_t|) Σ_{i_t ∈ I_t} ( ∇f_{i_t}(x_{t−τ_i}^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) ),    (14)

where i_t denotes the index of a sample, τ_i denotes the time delay for each sample i, and the mini-batch size is |I_t| = b.
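The sketch below computes the mini-batch variance reduced gradient of Eq. (14) for a hypothetical least-squares finite sum; the per-sample delayed iterates are passed in explicitly, and all data and names are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, b = 100, 10, 8
    A = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    def grad_fi(x, i):
        # Gradient of a hypothetical component f_i(x) = 0.5 * (A[i] @ x - y[i])**2.
        return A[i] * (A[i] @ x - y[i])

    def minibatch_vr_gradient(x_delayed_per_sample, x_tilde, g_tilde, batch):
        # Mini-batch variance reduced gradient of Eq. (14).
        # x_delayed_per_sample[i] is the (possibly stale) iterate x_{t - tau_i} used for sample i.
        v = np.zeros_like(x_tilde)
        for i in batch:
            v += grad_fi(x_delayed_per_sample[i], i) - grad_fi(x_tilde, i) + g_tilde
        return v / len(batch)

    # Toy usage: every sample reads the current iterate x (zero delay).
    x = np.zeros(d)
    x_tilde = x.copy()
    g_tilde = A.T @ (A @ x_tilde - y) / n            # full gradient at the snapshot
    batch = rng.choice(n, size=b, replace=False)
    v = minibatch_vr_gradient({i: x for i in batch}, x_tilde, g_tilde, batch)
    print("||v|| =", np.linalg.norm(v))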

Lemma 3.4 For the variance reduced gradient v_t^{s+1} defined in Equation (14), define

    u_t^{s+1} = (1/|I_t|) Σ_{i_t ∈ I_t} ( ∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) ),    (15)

where |I_t| = b. Then we have the following inequalities:

    Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2/(1 − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²],    (16)

    E[||u_t^{s+1}||²] ≤ 2E[||∇f(x_t^{s+1})||²] + (2L²/b) E[||x_t^{s+1} − x̃^s||²].    (17)

Theorem 3.5 Suppose f ∈ F_n. Let c_m = 0, let the learning rate η_t = η > 0 be constant, β_t = β > 0, and let b denote the size of the mini-batch. We define

    c_t = c_{t+1}(1 + ηβ_t + 4L²η²/((1 − 2L²∆²η²)b)) + [4L²/((1 − 2L²∆²η²)b)](L²∆²η³/2 + η²L/2),    (18)

    Γ_t = η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²),    (19)

such that Γ_t > 0 for 0 ≤ t ≤ m − 1. Define γ = min_t Γ_t and let x* be the optimal solution. Then, we have the following ergodic convergence rate for iteration T:

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (20)

Theorem 3.6 Suppose f ∈ F_n. Let η_t = η = u_0 b/(Ln^α), where 0 < u_0 < 1 and 0 < α ≤ 1, β = L/n^{α/2}, m = ⌊n^{3α/2}/(5u_0 b)⌋, and let T be the total number of iterations. If the time delay ∆ is upper bounded by

    ∆² < min{ (1 − 8u_0 b)/(8u_0²b² + 2L²η²), (b − 8u_0 b n^{−α})/(6L²η²) },    (21)

then there exist universal constants u_0, σ such that γ ≥ σb/(Ln^α) in Theorem 3.5, and

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(bTσ).    (22)

4 Asynchronous SVRG for Shared Memory Systems

In this section, we propose an asynchronous SVRG algorithm for shared memory systems and analyze its convergence rate.

4.1 Algorithm Description

Following the setting in [8], we define one iteration as a modification of any single component of x in the shared memory. We use x_t^{s+1} to denote the value of the parameter x in the shared memory after (ms + t) iterations, and x̂_t^{s+1} to denote the value of the parameter used to compute the current gradient:

    (x_{t+1}^{s+1})_{k_t} = (x_t^{s+1})_{k_t} − η (v_t^{s+1})_{k_t},    (23)

where k_t is the index of the updated coordinate of x, k_t ∈ {1, ..., d}, and the learning rate η is constant;

    x̂_t^{s+1} = x_t^{s+1} − Σ_{j ∈ J(t)} (x_{j+1} − x_j),    (24)

    v_t^{s+1} = ∇f_{i_t}(x̂_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s).    (25)

Here i_t denotes the index of a sample, J(t) ⊆ {t − 1, ..., t − ∆} is a subset of indices of previous iterations, and ∆ is the upper bound of the time delay. The definition of x̂_t^{s+1} differs from the analysis in [11], where x̂_t^{s+1} is assumed to be some earlier state of x in the shared memory, as in a distributed memory system; however, that assumption does not hold in practice. In Algorithm 2, we summarize asynchronous SVRG on a shared memory system.

Algorithm 2 AsySVRG 2
  Initialize x^0 ∈ R^d.
  for s = 0, 1, 2, ..., S − 1 do
    x̃^s ← x^s;
    Compute the full gradient ∇f(x̃^s) ← (1/n) Σ_{i_t=1}^{n} ∇f_{i_t}(x̃^s);
    for t = 0, 1, 2, ..., m − 1 do
      Randomly select i_t from {1, ..., n};
      Compute the update vector: v_t^{s+1} ← ∇f_{i_t}(x̂_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s);
      Randomly select k_t from {1, ..., d};
      Update (x_{t+1}^{s+1})_{k_t} ← (x_t^{s+1})_{k_t} − η (v_t^{s+1})_{k_t};
    end for
    x^{s+1} ← x_m^{s+1};
  end for
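As a rough single-process simulation of Algorithm 2, the following Python sketch performs single-coordinate updates and emulates the inconsistent read x̂_t^{s+1} of Eq. (24) by undoing a few of the most recent coordinate writes. The quadratic losses, the delay model, and all names are illustrative assumptions, not the authors' lock-free implementation.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 100, 10
    A = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    def grad_fi(x, i):
        # Gradient of a hypothetical component f_i(x) = 0.5 * (A[i] @ x - y[i])**2.
        return A[i] * (A[i] @ x - y[i])

    def asysvrg_shared_epoch(x, m=500, eta=0.05, max_delay=4):
        # One epoch of Algorithm 2: each iteration updates one random coordinate k_t,
        # and the gradient is evaluated at an inconsistent read x_hat that misses up to
        # max_delay recent single-coordinate writes (Eq. (24)).
        x_tilde = x.copy()
        g_tilde = A.T @ (A @ x_tilde - y) / n        # full gradient at the snapshot
        writes = []                                  # history of (coordinate, change) writes
        for t in range(m):
            i = rng.integers(n)
            num_missed = min(int(rng.integers(0, max_delay + 1)), len(writes))
            x_hat = x.copy()
            for k, delta in writes[len(writes) - num_missed:]:
                x_hat[k] -= delta                    # undo the missed writes to form the inconsistent read
            v = grad_fi(x_hat, i) - grad_fi(x_tilde, i) + g_tilde   # Eq. (25)
            k = rng.integers(d)                      # coordinate k_t to update
            delta = -eta * v[k]
            x[k] += delta                            # Eq. (23): single-coordinate write
            writes.append((k, delta))
        return x

    x = np.zeros(d)
    for s in range(5):
        x = asysvrg_shared_epoch(x)
    print("squared gradient norm:", np.linalg.norm(A.T @ (A @ x - y) / n) ** 2)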

4.2 Convergence Analysis

As per the definition of x̂_t^{s+1} above, the time delay assumption can be stated as follows.

Assumption 3 For the shared memory architecture specifically,

• x̂_t^{s+1} = x_t^{s+1} − Σ_{j ∈ J(t)} (x_{j+1} − x_j) denotes the old parameter, where J(t) ⊆ {t − 1, ..., t − ∆}.

In this case, we can also obtain an upper bound on ||v_t^{s+1}||².

Lemma 4.1 For the variance reduced gradient v_t^{s+1} defined in Equation (25), define

    u_t^{s+1} = ∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s).    (26)

Then we have the following inequalities:

    Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2d/(d − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²],    (27)

    E[||u_t^{s+1}||²] ≤ 2E[||∇f(x_t^{s+1})||²] + 2L² E[||x_t^{s+1} − x̃^s||²].    (28)

Furthermore, the convergence rate of Algorithm 2 is as follows.

Theorem 4.2 Suppose f ∈ F_n, x ∈ R^d. We define

    c_t = c_{t+1}(1 + ηβ_t/d + 4L²η²/(d − 2L²∆²η²)) + [4L²/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2),    (29)

    Γ_t = η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²),    (30)

where c_m = 0, the learning rate η_t = η > 0 is constant, and β_t = β > 0 is chosen such that Γ_t > 0 for all t ∈ [0, m − 1]; T denotes the total number of iterations. Defining γ = min_t Γ_t and letting x* be the optimal solution, we have the following ergodic convergence rate for iteration T:

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (31)

So far, we can conclude that this algorithm has a sublinear convergence rate O(1/T). To further illustrate the dependence of γ, we have the following theorem.

Theorem 4.3 Suppose f ∈ F_n. Let η = u_0/(Ln^α), where 0 < u_0 < 1 and 0 < α ≤ 1, β = L/n^{α/2}, m = ⌊dn^{3α/2}/(5u_0)⌋, and let T denote the total number of iterations. Then there exist universal constants u_0, σ such that γ ≥ σ/(Ln^α) in Theorem 4.2, and if the time delay has an upper bound

    ∆² < min{ (d² − 8du_0 n^{−α})/(2dL²η² + 4L²η²), (d − 8du_0)/(8u_0² + 2L²η²) },    (32)

then we have

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(Tσ).    (33)

Since this rate does not depend on the delay parameter ∆, the negative effect of using old values of x for the stochastic gradient evaluation vanishes asymptotically; namely, we can achieve linear speedup when we increase the number of workers.

4.3 Mini-Batch Extension

In this section, we extend Algorithm 2 to a mini-batch version. We use a mini-batch I_t of size b, and the gradient v_t^{s+1} in Algorithm 2 is replaced with

    v_t^{s+1} = (1/|I_t|) Σ_{i_t ∈ I_t} ( ∇f_{i_t}(x̂_{t,i_t}^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) ),    (34)

where x̂_{t,i_t}^{s+1} denotes the parameter used to compute the gradient with sample i_t, i_t denotes the index of a sample, J(t) ⊆ {t − 1, ..., t − ∆} is a subset of indices of previous iterations, and ∆ is the upper bound of the time delay.

Lemma 4.4 For the variance reduced gradient v_t^{s+1} defined in Equation (34), define

    u_t^{s+1} = (1/|I_t|) Σ_{i_t ∈ I_t} ( ∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) ).    (35)

Then we have the following inequalities:

    Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2d/(d − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²],    (36)

    E[||u_t^{s+1}||²] ≤ 2E[||∇f(x_t^{s+1})||²] + (2L²/b) E[||x_t^{s+1} − x̃^s||²].    (37)

Theorem 4.5 Suppose f ∈ F_n. Let c_m = 0, let the learning rate η_t = η > 0 be constant, β_t = β > 0, and let b denote the size of the mini-batch. We define

    c_t = c_{t+1}(1 + ηβ_t/d + 4L²η²/((d − 2L²∆²η²)b)) + [4L²/((d − 2L²∆²η²)b)](L²∆²η³/(2d) + η²L/2),    (38)

    Γ_t = η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²),    (39)

such that Γ_t > 0 for 0 ≤ t ≤ m − 1. Define γ = min_t Γ_t and let x* be the optimal solution. Then, we have the following ergodic convergence rate for iteration T:

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (40)

Theorem 4.6 Suppose f ∈ F_n. Let η_t = η = u_0 b/(Ln^α), where 0 < u_0 < 1 and 0 < α ≤ 1, β = L/n^{α/2}, m = ⌊dn^{3α/2}/(5u_0 b)⌋, and let T be the total number of iterations. If the time delay ∆ is upper bounded by

    ∆² < min{ (d² − 8u_0 bd n^{−α})/(2dL²η² + 4L²η²), (d − 8u_0 d)/(8u_0²b + 2L²η²) },    (41)

then there exist universal constants u_0, σ such that γ ≥ σb/(Ln^α) in Theorem 4.5, and

    (1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(bTσ).    (42)

5 Conclusion

In this paper, we propose and analyze two asynchronous stochastic gradient descent algorithms with variance reduction for non-convex optimization, corresponding to two distributed architecture categories: one for shared memory systems and the other for distributed memory systems. We also extend both methods to mini-batch versions. We prove that both algorithms achieve an ergodic convergence rate of O(1/T) and that linear speedup is achievable if the number of workers is upper bounded.

References

[1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

[2] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[3] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[4] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[5] Mingyi Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach. arXiv preprint arXiv:1412.6058, 2014.

[6] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[7] Mu Li, David G. Andersen, Alex J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.

[8] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2719–2727, 2015.

[9] Ji Liu, Stephen J. Wright, and Srikrishna Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint arXiv:1401.4780, 2014.

[10] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

[11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[12] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.

[13] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.

[14] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

[15] Ruiliang Zhang, Shuai Zheng, and James T. Kwok. Fast distributed asynchronous SGD with variance reduction. arXiv preprint arXiv:1508.01633, 2015.

[16] Shen-Yi Zhao and Wu-Jun Li. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. 2016.

A Proof of Lemma 3.1 and Lemma 3.4

Proof 1 (Proof of Lemma 3.1 and Lemma 3.4) By the definitions of v_t^{s+1} and u_t^{s+1} in Lemma 3.4,

E[||v_t^{s+1}||²] = E[||v_t^{s+1} − u_t^{s+1} + u_t^{s+1}||²]
  ≤ 2E[||v_t^{s+1} − u_t^{s+1}||²] + 2E[||u_t^{s+1}||²]
  = 2E[||(1/b) Σ_{i∈I_t} (∇f_{i_t}(x_{t−τ_i}^{s+1}) − ∇f_{i_t}(x_t^{s+1}))||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²/b) Σ_{i∈I_t} E[||x_{t−τ_i}^{s+1} − x_t^{s+1}||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²/b) Σ_{i∈I_t} E[||Σ_{j=t−τ_i}^{t−1} (x_j^{s+1} − x_{j+1}^{s+1})||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²∆η²/b) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²] + 2E[||u_t^{s+1}||²],    (43)

where the first inequality follows from ||a + b||² ≤ 2||a||² + 2||b||², and the second inequality follows from the Lipschitz continuity of the gradients of f(x). Summing v_t^{s+1} from t = 0 to m − 1,

Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ Σ_{t=0}^{m−1} [ (2L²∆η²/b) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²] + 2E[||u_t^{s+1}||²] ]
  ≤ 2L²∆²η² Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] + 2 Σ_{t=0}^{m−1} E[||u_t^{s+1}||²].    (44)

In the following proof we constrain 1 − 2L²∆²η² > 0, so we obtain the inequality

Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2/(1 − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²].    (45)

For u_t^{s+1},

E[||u_t^{s+1}||²] = E[||(1/b) Σ_{i_t∈I_t} (∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s))||²]
  ≤ 2E[||(1/b) Σ_{i_t∈I_t} (∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s)) − (1/b) Σ_{i_t∈I_t} (∇f(x_t^{s+1}) − ∇f(x̃^s))||²] + 2E[||∇f(x_t^{s+1})||²]
  ≤ 2E[||∇f(x_t^{s+1})||²] + (2/b) E[||∇f_{i_t}(x_t^{s+1}) − ∇f_{i_t}(x̃^s)||²]
  ≤ 2E[||∇f(x_t^{s+1})||²] + (2L²/b) E[||x_t^{s+1} − x̃^s||²],    (46)

where the second inequality follows from E[||ξ − E[ξ]||²] ≤ E[||ξ||²] together with the independence of the samples in I_t. If |I_t| = 1, we obtain Lemma 3.1.

B Proof of Theorem 3.2

Proof 2 (Proof of Theorem 3.2)

E[||x_{t+1}^{s+1} − x̃^s||²] = E[||x_{t+1}^{s+1} − x_t^{s+1} + x_t^{s+1} − x̃^s||²]
  = E[||x_{t+1}^{s+1} − x_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² + 2⟨x_{t+1}^{s+1} − x_t^{s+1}, x_t^{s+1} − x̃^s⟩]
  = E[η²||v_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² − 2η⟨∇f(x_{t−τ}^{s+1}), x_t^{s+1} − x̃^s⟩]
  ≤ η²E[||v_t^{s+1}||²] + E[||x_t^{s+1} − x̃^s||²] + 2ηE[ (1/(2β_t))||∇f(x_{t−τ}^{s+1})||² + (β_t/2)||x_t^{s+1} − x̃^s||² ]
  = η²E[||v_t^{s+1}||²] + (1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] + (η/β_t)E[||∇f(x_{t−τ}^{s+1})||²],    (47)

where the first inequality follows from 2⟨a, b⟩ ≤ ||a||² + ||b||².

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1}) + ⟨∇f(x_t^{s+1}), x_{t+1}^{s+1} − x_t^{s+1}⟩ + (L/2)||x_{t+1}^{s+1} − x_t^{s+1}||²]
  = E[f(x_t^{s+1})] − ηE[⟨∇f(x_t^{s+1}), ∇f(x_{t−τ}^{s+1})⟩] + (η²L/2)E[||v_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||² + ||∇f(x_{t−τ}^{s+1})||² − ||∇f(x_t^{s+1}) − ∇f(x_{t−τ}^{s+1})||²] + (η²L/2)E[||v_t^{s+1}||²],    (48)

where the first inequality follows from the Lipschitz continuity of ∇f(x).

||∇f(x_t^{s+1}) − ∇f(x_{t−τ}^{s+1})||² ≤ L²||x_t^{s+1} − x_{t−τ}^{s+1}||²
  = L²||Σ_{d=t−τ}^{t−1} (x_d^{s+1} − x_{d+1}^{s+1})||²
  ≤ L²∆ Σ_{d=t−τ}^{t−1} ||x_d^{s+1} − x_{d+1}^{s+1}||²
  ≤ L²∆η² Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²],    (49)

where the first inequality follows from the Lipschitz continuity of ∇f(x), and the second inequality follows from ||Σ_{i=1}^{r} a_i||² ≤ r Σ_{i=1}^{r} ||a_i||²; ∆ denotes the upper bound of the time delay and τ ≤ ∆. Plugging inequality (49) into inequality (48), we have

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] − (η/2)E[||∇f(x_{t−τ}^{s+1})||²]
  + (η²L/2)E[||v_t^{s+1}||²] + (L²∆η³/2) Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²].    (50)

Define the Lyapunov function

R_t^{s+1} = E[f(x_t^{s+1}) + c_t||x_t^{s+1} − x̃^s||²].    (51)

Then

R_{t+1}^{s+1} = E[f(x_{t+1}^{s+1}) + c_{t+1}||x_{t+1}^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] − (η/2)E[||∇f(x_{t−τ}^{s+1})||²] + (η²L/2)E[||v_t^{s+1}||²]
    + c_{t+1}( η²E[||v_t^{s+1}||²] + (1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] + (η/β_t)E[||∇f(x_{t−τ}^{s+1})||²] )
    + (L²∆η³/2) Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] − (η/2 − c_{t+1}η/β_t)E[||∇f(x_{t−τ}^{s+1})||²]
    + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²]
    + (L²∆η³/2) Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²]
  ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + (L²∆η³/2) Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²]
    + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²].    (52)

In the final inequality, we require (η/2 − c_{t+1}η/β_t) > 0. Then, summing R_{t+1}^{s+1} over t,

Σ_{t=0}^{m−1} R_{t+1}^{s+1} ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + (L²∆η³/2) Σ_{d=t−τ}^{t−1} E[||v_d^{s+1}||²]
    + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²]
    + (L²∆²η³/2 + η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²]
    + [2/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²)E[||u_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} E[f(x_t^{s+1})]
    + Σ_{t=0}^{m−1} [ c_{t+1}(1 + ηβ_t) + [4L²/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²) ] E[||x_t^{s+1} − x̃^s||²]
    − Σ_{t=0}^{m−1} [ η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²) ] E[||∇f(x_t^{s+1})||²],    (53)

where the last two inequalities follow from Lemma 3.1. We define c_t and Γ_t as

c_t = c_{t+1}(1 + ηβ_t) + [4L²/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²),    (54)

Γ_t = η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²).    (55)

So the above inequality can be written as

Σ_{t=0}^{m−1} R_{t+1}^{s+1} ≤ Σ_{t=0}^{m−1} R_t^{s+1} − Σ_{t=0}^{m−1} Γ_t E[||∇f(x_t^{s+1})||²].    (56)

Setting c_m = 0, x̃^{s+1} = x_m^{s+1}, and γ = min_t Γ_t,

R_m^{s+1} = E[f(x_m^{s+1})] = E[f(x̃^{s+1})],    (57)

R_0^{s+1} = E[f(x_0^{s+1})] = E[f(x̃^s)],    (58)

Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x̃^s) − f(x̃^{s+1})]/γ.    (59)

Summing over all epochs, we prove Theorem 3.2,

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ),    (60)

where T denotes the total number of iterations, x^0 is the initial parameter and x* is the optimal solution of min_x f(x).

C Proof of Theorem 3.3

Proof 3 (Proof of Theorem 3.3) Set c_m = 0, η_t = η = u_0/(Ln^α), β_t = β = L/n^{α/2}, with 0 < u_0 < 1 and 0 < α ≤ 1. Define θ; then we obtain the upper bound

θ = ηβ + 4L²η²/(1 − 2L²∆²η²)
  = u_0/n^{3α/2} + 4u_0²/(n^{2α} − 2∆²u_0²)
  ≤ 5u_0/n^{3α/2},    (61)

where in the final inequality we constrain n^{3α/2} ≤ n^{2α} − 2∆²u_0². Because of the definition of c_t, it is easy to see that c_0 > c_1 > ... > c_{m−1} > c_m. Setting m = ⌊n^{3α/2}/(5u_0)⌋, we have

c_0 = [2L²/(1 − 2L²∆²η²)](L²∆²η³ + η²L) · [(1 + θ)^m − 1]/θ
  ≤ [2L(u_0³∆² + u_0²)] / [(1 − 2L²∆²η²)(n^{α/2}u_0 + 4u_0²)] · ((1 + θ)^m − 1)
  ≤ [2L(u_0²∆² + u_0)/(1 − 2L²∆²η²)] n^{−α/2}(e − 1),    (62)

where the final inequality follows from the fact that (1 + 1/l)^l is increasing for l > 0 and lim_{l→∞}(1 + 1/l)^l = e. Thus c_t is decreasing with respect to t, and c_0 is upper bounded. Then, we can get a lower bound of γ as follows,

γ = min_t Γ_t
  ≥ η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_0η²)
  ≥ η/2 − 2L²∆²η³/(1 − 2L²∆²η²) − 2η²L/(1 − 2L²∆²η²) − 4c_0η²/(1 − 2L²∆²η²)
  ≥ η/2 − [2L²∆²η²/(1 − 2L²∆²η²)]η − [2u_0 n^{−α}/(1 − 2L²∆²η²)]η − [2u_0 n^{−3α/2}/(1 − 2L²∆²η²)]η
  = η[ 1/2 − (2L²∆²η² + 2u_0 n^{−α} + 2u_0 n^{−3α/2})/(1 − 2L²∆²η²) ]
  ≥ σ/(Ln^α),    (63)

where the third inequality follows from

η²L = (u_0/n^α)η,    c_0η² ≤ (β/2)η² = (u_0 n^{−3α/2}/2)η.    (64)

There exists a small value σ such that the final inequality holds if

(2L²∆²η² + 2u_0 n^{−α} + 2u_0 n^{−3α/2})/(1 − 2L²∆²η²) < 1/2,    (65)

[2L(u_0²∆² + u_0)/(1 − 2L²∆²η²)] n^{−α/2}(e − 1) < (L/2) n^{−α/2}.    (66)

Thus

∆² < min{ (1 − 8u_0)/(8u_0² + 2L²η²), (1 − 8u_0 n^{−α})/(6L²η²) }.    (67)

Above all, we prove Theorem 3.3,

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(Tσ).    (68)

D Proof of Theorem 3.5

Proof 4 (Proof of Theorem 3.5)

E[||x_{t+1}^{s+1} − x̃^s||²] = E[||x_{t+1}^{s+1} − x_t^{s+1} + x_t^{s+1} − x̃^s||²]
  = E[||x_{t+1}^{s+1} − x_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² + 2⟨x_{t+1}^{s+1} − x_t^{s+1}, x_t^{s+1} − x̃^s⟩]
  = E[η²||v_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² − 2η⟨(1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1}), x_t^{s+1} − x̃^s⟩]
  ≤ η²E[||v_t^{s+1}||²] + 2ηE[ (1/(2β_t))||(1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})||² + (β_t/2)||x_t^{s+1} − x̃^s||² ] + E[||x_t^{s+1} − x̃^s||²]
  = η²E[||v_t^{s+1}||²] + (1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] + (η/β_t)E[||(1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})||²],    (69)

where the first inequality follows from 2⟨a, b⟩ ≤ ||a||² + ||b||².

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1}) + ⟨∇f(x_t^{s+1}), x_{t+1}^{s+1} − x_t^{s+1}⟩ + (L/2)||x_{t+1}^{s+1} − x_t^{s+1}||²]
  = E[f(x_t^{s+1})] − ηE[⟨∇f(x_t^{s+1}), (1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})⟩] + (η²L/2)E[||v_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/2)E[ ||∇f(x_t^{s+1})||² + ||(1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})||² − ||∇f(x_t^{s+1}) − (1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})||² ]
    + (η²L/2)E[||v_t^{s+1}||²],    (70)

where the first inequality follows from the Lipschitz continuity of ∇f(x).

||∇f(x_t^{s+1}) − (1/b) Σ_{i_t∈I_t} ∇f(x_{t−τ_i}^{s+1})||² ≤ (1/b) Σ_{i_t∈I_t} ||∇f(x_t^{s+1}) − ∇f(x_{t−τ_i}^{s+1})||²
  ≤ (L²/b) Σ_{i_t∈I_t} ||x_t^{s+1} − x_{t−τ_i}^{s+1}||²
  = (L²/b) Σ_{i_t∈I_t} ||Σ_{j=t−τ_i}^{t−1} (x_j^{s+1} − x_{j+1}^{s+1})||²
  ≤ (L²∆/b) Σ_{i_t∈I_t} Σ_{j=t−τ_i}^{t−1} ||x_j^{s+1} − x_{j+1}^{s+1}||²
  = (L²∆η²/b) Σ_{i_t∈I_t} Σ_{j=t−τ_i}^{t−1} ||v_j^{s+1}||²,    (71)

where the second inequality follows from the Lipschitz continuity of ∇f(x); ∆ denotes the upper bound of the time delay and τ ≤ ∆. Above all,

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] − (η/2)E[||(1/b) Σ_{i∈I_t} ∇f(x_{t−τ_i}^{s+1})||²]
  + (η²L/2)E[||v_t^{s+1}||²] + (L²∆η³/(2b)) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²].    (72)

Define the Lyapunov function

R_t^{s+1} = E[f(x_t^{s+1}) + c_t||x_t^{s+1} − x̃^s||²].    (73)

Then

R_{t+1}^{s+1} = E[f(x_{t+1}^{s+1}) + c_{t+1}||x_{t+1}^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] − (η/2 − c_{t+1}η/β_t)E[||(1/b) Σ_{i∈I_t} ∇f(x_{t−τ_i}^{s+1})||²]
    + (L²∆η³/(2b)) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²] + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²]
    + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2b)) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²]
    + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²].    (74)

In the final inequality, we require (η/2 − c_{t+1}η/β_t) > 0. Then, summing R_{t+1}^{s+1} over t,

Σ_{t=0}^{m−1} R_{t+1}^{s+1} ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2b)) Σ_{i∈I_t} Σ_{j=t−τ_i}^{t−1} E[||v_j^{s+1}||²]
    + (η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²]
    + (L²∆²η³/2 + η²L/2 + c_{t+1}η²)E[||v_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/2)E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t)E[||x_t^{s+1} − x̃^s||²]
    + [2/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²)E[||u_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} R_t^{s+1} − Σ_{t=0}^{m−1} Γ_t E[||∇f(x_t^{s+1})||²],    (75)

where the last two inequalities follow from Lemma 3.4, and we define

c_t = c_{t+1}(1 + ηβ_t + 4L²η²/((1 − 2L²∆²η²)b)) + [4L²/((1 − 2L²∆²η²)b)](L²∆²η³/2 + η²L/2),    (76)

Γ_t = η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_{t+1}η²).    (77)

Setting c_m = 0, x̃^{s+1} = x_m^{s+1}, and γ = min_t Γ_t, we have

R_m^{s+1} = E[f(x_m^{s+1})] = E[f(x̃^{s+1})],    (78)

R_0^{s+1} = E[f(x_0^{s+1})] = E[f(x̃^s)],    (79)

Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x̃^s) − f(x̃^{s+1})]/γ.    (80)

Summing over all epochs, the following inequality holds,

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (81)

E Proof of Theorem 3.6

Proof 5 (Proof of Theorem 3.6) Following the same argument as in the proof of Theorem 3.3, we set c_m = 0, η_t = η = u_0 b/(Ln^α), β_t = β = L/n^{α/2}, with 0 < u_0 < 1 and 0 < α ≤ 1. We define θ and get its upper bound,

θ = ηβ + 4L²η²/((1 − 2L²∆²η²)b)
  = u_0 b/n^{3α/2} + 4u_0²b/(n^{2α} − 2∆²u_0²b²)
  ≤ 5u_0 b/n^{3α/2}.    (82)

We set m = ⌊n^{3α/2}/(5u_0 b)⌋, and c_0 is upper bounded,

c_0 = [2L²/((1 − 2L²∆²η²)b)](L²∆²η³ + η²L) · [(1 + θ)^m − 1]/θ
  ≤ [2L(u_0²∆²b² + u_0 b)] / [(1 − 2L²∆²η²)(n^{α/2} + 4u_0)] · ((1 + θ)^m − 1)
  ≤ [2L(u_0²∆²b² + u_0 b)/(1 − 2L²∆²η²)] n^{−α/2}(e − 1),    (83)

where the final inequality follows from the fact that (1 + 1/l)^l is increasing for l > 0 and lim_{l→∞}(1 + 1/l)^l = e. Thus c_t is decreasing with respect to t, and c_0 is upper bounded. Now we can get a lower bound of γ,

γ = min_t Γ_t
  ≥ η/2 − [4/(1 − 2L²∆²η²)](L²∆²η³/2 + η²L/2 + c_0η²)
  ≥ η/2 − 2L²∆²η³/(1 − 2L²∆²η²) − 2η²L/(1 − 2L²∆²η²) − 4c_0η²/(1 − 2L²∆²η²)
  ≥ η/2 − [2L²∆²η²/(1 − 2L²∆²η²)]η − [2u_0 b n^{−α}/(1 − 2L²∆²η²)]η − [2u_0 b n^{−3α/2}/(1 − 2L²∆²η²)]η
  = η[ 1/2 − (2L²∆²η² + 2u_0 b n^{−α} + 2u_0 b n^{−3α/2})/(1 − 2L²∆²η²) ]
  ≥ σb/(Ln^α),    (84)

where the third inequality follows from

η²L = (u_0 b/n^α)η,    c_0η² ≤ (β/2)η² = (u_0 b n^{−3α/2}/2)η.    (85)

There exists a small value σ such that the final inequality holds if

(2L²∆²η² + 2u_0 b n^{−α} + 2u_0 b n^{−3α/2})/(1 − 2L²∆²η²) < 1/2,    (86)

[2L(u_0²∆²b² + u_0 b)/(1 − 2L²∆²η²)] n^{−α/2}(e − 1) < β/2.    (87)

So, if ∆² has the following upper bound, the final inequality holds:

∆² < min{ (1 − 8u_0 b)/(8u_0²b² + 2L²η²), (b − 8u_0 b n^{−α})/(6L²η²) }.    (88)

Above all, we replace γ in Theorem 3.5 and obtain

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(bTσ).    (89)

F Proof of Lemma 4.1 and Lemma 4.4

Proof 6 (Proof of Lemma 4.1 and Lemma 4.4)

E[||v_t^{s+1}||²] = E[||v_t^{s+1} − u_t^{s+1} + u_t^{s+1}||²]
  ≤ 2E[||v_t^{s+1} − u_t^{s+1}||²] + 2E[||u_t^{s+1}||²]
  = 2E[||(1/b) Σ_{i_t∈I_t} (∇f_{i_t}(x̂_{t,i_t}^{s+1}) − ∇f_{i_t}(x_t^{s+1}))||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²/b) Σ_{i_t∈I_t} E[||x̂_{t,i_t}^{s+1} − x_t^{s+1}||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²/b) Σ_{i_t∈I_t} E[||Σ_{j∈J(t,i_t)} (x_j^{s+1} − x_{j+1}^{s+1})_{k_j}||²] + 2E[||u_t^{s+1}||²]
  ≤ (2L²∆η²/(bd)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²] + 2E[||u_t^{s+1}||²],    (90)

where the first inequality follows from ||a + b||² ≤ 2||a||² + 2||b||². Summing over t,

Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ Σ_{t=0}^{m−1} [ (2L²∆η²/(bd)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²] + 2E[||u_t^{s+1}||²] ]
  ≤ (2L²∆²η²/d) Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] + 2 Σ_{t=0}^{m−1} E[||u_t^{s+1}||²].    (91)

Thus ||v_t^{s+1}||² can be bounded in terms of ||u_t^{s+1}||²,

Σ_{t=0}^{m−1} E[||v_t^{s+1}||²] ≤ [2d/(d − 2L²∆²η²)] Σ_{t=0}^{m−1} E[||u_t^{s+1}||²].    (92)

From Lemma 3.4, we know that

E[||u_t^{s+1}||²] ≤ 2E[||∇f(x_t^{s+1})||²] + (2L²/b) E[||x_t^{s+1} − x̃^s||²].    (93)

When |I_t| = 1, we can derive Lemma 4.1 as well.

G Proof of Theorem 4.2

Proof 7 (Proof of Theorem 4.2) As in the proofs above, we first obtain an upper bound of the term E[||x_{t+1}^{s+1} − x̃^s||²]:

E[||x_{t+1}^{s+1} − x̃^s||²] = E[||x_{t+1}^{s+1} − x_t^{s+1} + x_t^{s+1} − x̃^s||²]
  = E[||x_{t+1}^{s+1} − x_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² + 2⟨x_{t+1}^{s+1} − x_t^{s+1}, x_t^{s+1} − x̃^s⟩]
  = E[(η²/d)||v_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² − (2η/d)⟨∇f(x̂_t^{s+1}), x_t^{s+1} − x̃^s⟩]
  ≤ (η²/d)E[||v_t^{s+1}||²] + E[||x_t^{s+1} − x̃^s||²] + (2η/d)E[ (1/(2β_t))||∇f(x̂_t^{s+1})||² + (β_t/2)||x_t^{s+1} − x̃^s||² ]
  = (η²/d)E[||v_t^{s+1}||²] + (1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²] + (η/(dβ_t))E[||∇f(x̂_t^{s+1})||²],    (94)

where the third equality follows from the update rule of Algorithm 2 (a single uniformly random coordinate is updated per iteration).

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1}) + ⟨∇f(x_t^{s+1}), x_{t+1}^{s+1} − x_t^{s+1}⟩ + (L/2)||x_{t+1}^{s+1} − x_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/d)E[⟨∇f(x_t^{s+1}), ∇f(x̂_t^{s+1})⟩] + (η²L/(2d))E[||v_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||² + ||∇f(x̂_t^{s+1})||² − ||∇f(x_t^{s+1}) − ∇f(x̂_t^{s+1})||²] + (η²L/(2d))E[||v_t^{s+1}||²],    (95)

where the first inequality follows from the Lipschitz continuity of ∇f(x).

E[||∇f(x_t^{s+1}) − ∇f(x̂_t^{s+1})||²] ≤ L²E[||x_t^{s+1} − x̂_t^{s+1}||²]
  = L²E[||Σ_{j∈J(t)} (x_j^{s+1} − x_{j+1}^{s+1})||²]
  ≤ L²∆ Σ_{j∈J(t)} E[||x_j^{s+1} − x_{j+1}^{s+1}||²]
  ≤ (L²∆η²/d) Σ_{j∈J(t)} E[||v_j^{s+1}||²],    (96)

where the first inequality follows from the Lipschitz continuity of ∇f(x) and the second follows from the triangle inequality; ∆ denotes the upper bound of the time delay. Above all,

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] − (η/(2d))E[||∇f(x̂_t^{s+1})||²]
  + (η²L/(2d))E[||v_t^{s+1}||²] + (L²∆η³/(2d²)) Σ_{j∈J(t)} E[||v_j^{s+1}||²].    (97)

Define the Lyapunov function

R_t^{s+1} = E[f(x_t^{s+1}) + c_t||x_t^{s+1} − x̃^s||²].    (98)

Then

R_{t+1}^{s+1} = E[f(x_{t+1}^{s+1}) + c_{t+1}||x_{t+1}^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] − (η/(2d) − c_{t+1}η/(dβ_t))E[||∇f(x̂_t^{s+1})||²]
    + (L²∆η³/(2d²)) Σ_{j∈J(t)} E[||v_j^{s+1}||²] + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²]
    + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2d²)) Σ_{j∈J(t)} E[||v_j^{s+1}||²]
    + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²].    (99)

In the final inequality, we require (η/(2d) − c_{t+1}η/(dβ_t)) > 0. Then, summing R_{t+1}^{s+1} over t,

Σ_{t=0}^{m−1} R_{t+1}^{s+1} ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2d²)) Σ_{j∈J(t)} E[||v_j^{s+1}||²]
    + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
    + (L²∆²η³/(2d²) + η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
    + [2d/(d − 2L²∆²η²)](L²∆²η³/(2d²) + η²L/(2d) + c_{t+1}η²/d)E[||u_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] + c_t E[||x_t^{s+1} − x̃^s||²] ] − Σ_{t=0}^{m−1} Γ_t E[||∇f(x_t^{s+1})||²],    (100)

where we set

c_t = c_{t+1}(1 + ηβ_t/d) + [4L²/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²),    (101)

Γ_t = η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²).    (102)

Setting c_m = 0, x̃^{s+1} = x_m^{s+1}, and γ = min_t Γ_t, we have R_m^{s+1} = E[f(x_m^{s+1})] = E[f(x̃^{s+1})] and R_0^{s+1} = E[f(x_0^{s+1})] = E[f(x̃^s)]. Thus we obtain

Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x̃^s) − f(x̃^{s+1})]/γ.    (103)

Summing over all epochs, we have the final inequality,

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (104)

H Proof of Theorem 4.3

Proof 8 (Proof of Theorem 4.3) Following the same argument as in the proof of Theorem 3.3, we set c_m = 0, η_t = η = u_0/(Ln^α), β_t = β = L/n^{α/2}, with 0 < u_0 < 1 and 0 < α ≤ 1. Then

θ = ηβ/d + 4L²η²/(d − 2L²∆²η²)
  = u_0/(dn^{3α/2}) + 4u_0²/(dn^{2α} − 2∆²u_0²)
  ≤ 5u_0/(dn^{3α/2}),    (105)

where in the final inequality we constrain dn^{3α/2} ≤ dn^{2α} − 2∆²u_0². We set m = ⌊dn^{3α/2}/(5u_0)⌋, and

c_0 = [2L²/(d − 2L²∆²η²)](L²∆²η³/d + η²L) · [(1 + θ)^m − 1]/θ
  ≤ [2L(u_0³∆² + du_0²)] / [(d − 2L²∆²η²)(n^{α/2}u_0 + 4u_0²)] · ((1 + θ)^m − 1)
  ≤ [2L(u_0²∆² + du_0)/(d − 2L²∆²η²)] n^{−α/2}(e − 1),    (106)

where the final inequality follows from the fact that (1 + 1/l)^l is increasing for l > 0 and lim_{l→∞}(1 + 1/l)^l = e. Thus c_t is decreasing with respect to t, and c_0 is upper bounded. Then

γ = min_t Γ_t
  ≥ η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_0η²)
  ≥ η/(2d) − 2L²∆²η³/((d − 2L²∆²η²)d) − 2η²L/(d − 2L²∆²η²) − 4c_0η²/(d − 2L²∆²η²)
  ≥ η/(2d) − [2L²∆²η²/((d − 2L²∆²η²)d)]η − [2u_0 n^{−α}/(d − 2L²∆²η²)]η − [2u_0 n^{−3α/2}/(d − 2L²∆²η²)]η
  = η[ 1/(2d) − (2L²∆²η² + 2du_0 n^{−α} + 2du_0 n^{−3α/2})/((d − 2L²∆²η²)d) ]
  ≥ σ/(Ln^α),    (107)

where the third inequality follows from

η²L = (u_0/n^α)η,    c_0η² ≤ (β/2)η² = (u_0 n^{−3α/2}/2)η.    (108)

There exists a small value σ such that the final inequality holds if

(2L²∆²η² + 2du_0 n^{−α} + 2du_0 n^{−3α/2})/((d − 2L²∆²η²)d) < 1/2,    (109)

[2L(u_0²∆² + du_0)/(d − 2L²∆²η²)] n^{−α/2}(e − 1) < β/2.    (110)

Thus

∆² < min{ (d² − 8du_0 n^{−α})/(2dL²η² + 4L²η²), (d − 8du_0)/(8u_0² + 2L²η²) }.    (111)

Above all, we get

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(Tσ).    (112)

I Proof of Theorem 4.5

Proof 9 (Proof of Theorem 4.5) At first, we obtain an upper bound of E[||x_{t+1}^{s+1} − x̃^s||²]:

E[||x_{t+1}^{s+1} − x̃^s||²] = E[||x_{t+1}^{s+1} − x_t^{s+1} + x_t^{s+1} − x̃^s||²]
  = E[||x_{t+1}^{s+1} − x_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² + 2⟨x_{t+1}^{s+1} − x_t^{s+1}, x_t^{s+1} − x̃^s⟩]
  = E[(η²/d)||v_t^{s+1}||² + ||x_t^{s+1} − x̃^s||² − (2η/d)⟨(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1}), x_t^{s+1} − x̃^s⟩]
  ≤ (η²/d)E[||v_t^{s+1}||²] + (2η/d)E[ (1/(2β_t))||(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||² + (β_t/2)||x_t^{s+1} − x̃^s||² ] + E[||x_t^{s+1} − x̃^s||²]
  = (η²/d)E[||v_t^{s+1}||²] + (1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²] + (η/(dβ_t))E[||(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||²].    (113)

Then E[f(x_{t+1}^{s+1})] is also upper bounded,

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1}) + ⟨∇f(x_t^{s+1}), x_{t+1}^{s+1} − x_t^{s+1}⟩ + (L/2)||x_{t+1}^{s+1} − x_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/d)E[⟨∇f(x_t^{s+1}), (1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})⟩] + (η²L/(2d))E[||v_t^{s+1}||²]
  = E[f(x_t^{s+1})] − (η/(2d))E[ ||∇f(x_t^{s+1})||² + ||(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||² − ||∇f(x_t^{s+1}) − (1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||² ]
    + (η²L/(2d))E[||v_t^{s+1}||²],    (114)

where the first inequality follows from the Lipschitz continuity of ∇f(x).

E[||∇f(x_t^{s+1}) − (1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||²] ≤ (L²/b) Σ_{i_t∈I_t} E[||x_t^{s+1} − x̂_{t,i_t}^{s+1}||²]
  = (L²/b) Σ_{i_t∈I_t} E[||Σ_{j∈J(t,i_t)} (x_j^{s+1} − x_{j+1}^{s+1})||²]
  ≤ (L²∆/b) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||x_j^{s+1} − x_{j+1}^{s+1}||²]
  ≤ (L²∆η²/(bd)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²],    (115)

where the first inequality follows from the Lipschitz continuity of ∇f(x); ∆ denotes the upper bound of the time delay. Above all,

E[f(x_{t+1}^{s+1})] ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] − (η/(2d))E[||(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||²]
  + (η²L/(2d))E[||v_t^{s+1}||²] + (L²∆η³/(2bd²)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²].    (116)

As per the definition of the Lyapunov function,

R_{t+1}^{s+1} = E[f(x_{t+1}^{s+1}) + c_{t+1}||x_{t+1}^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] − (η/(2d) − c_{t+1}η/(dβ_t))E[||(1/b) Σ_{i_t∈I_t} ∇f(x̂_{t,i_t}^{s+1})||²]
    + (L²∆η³/(2bd²)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²] + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²]
    + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
  ≤ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2bd²)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²]
    + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²].    (117)

In the final inequality, we require (η/(2d) − c_{t+1}η/(dβ_t)) > 0. As per Lemma 4.4, summing R_{t+1}^{s+1} over t,

Σ_{t=0}^{m−1} R_{t+1}^{s+1} ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + (L²∆η³/(2bd²)) Σ_{i_t∈I_t} Σ_{j∈J(t,i_t)} E[||v_j^{s+1}||²]
    + (η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
    + (L²∆²η³/(2d²) + η²L/(2d) + c_{t+1}η²/d)E[||v_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} [ E[f(x_t^{s+1})] − (η/(2d))E[||∇f(x_t^{s+1})||²] + c_{t+1}(1 + ηβ_t/d)E[||x_t^{s+1} − x̃^s||²]
    + [2d/(d − 2L²∆²η²)](L²∆²η³/(2d²) + η²L/(2d) + c_{t+1}η²/d)E[||u_t^{s+1}||²] ]
  ≤ Σ_{t=0}^{m−1} R_t^{s+1} − Σ_{t=0}^{m−1} Γ_t E[||∇f(x_t^{s+1})||²],    (118)

where

c_t = c_{t+1}(1 + ηβ_t/d) + [4L²/((d − 2L²∆²η²)b)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²),    (119)

Γ_t = η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_{t+1}η²).    (120)

Setting c_m = 0, x̃^{s+1} = x_m^{s+1}, and γ = min_t Γ_t, we have R_m^{s+1} = E[f(x_m^{s+1})] = E[f(x̃^{s+1})] and R_0^{s+1} = E[f(x_0^{s+1})] = E[f(x̃^s)]. Thus we obtain

Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x̃^s) − f(x̃^{s+1})]/γ.    (121)

Summing over all epochs, we have the final inequality,

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ E[f(x^0) − f(x*)]/(Tγ).    (122)

J Proof of Theorem 4.6

Proof 10 (Proof of Theorem 4.6) Set c_m = 0, η_t = η = u_0 b/(Ln^α), β_t = β = L/n^{α/2}, with 0 < u_0 < 1 and 0 < α ≤ 1. Then

θ = ηβ/d + 4L²η²/((d − 2L²∆²η²)b)
  = u_0 b/(dn^{3α/2}) + 4u_0²b/(dn^{2α} − 2∆²u_0²b²)
  ≤ 5u_0 b/(dn^{3α/2}),    (123)

where in the final inequality we constrain dn^{3α/2} ≤ dn^{2α} − 2∆²u_0²b². We set m = ⌊dn^{3α/2}/(5u_0 b)⌋, and

c_0 = [2L²/((d − 2L²∆²η²)b)](L²∆²η³/d + η²L) · [(1 + θ)^m − 1]/θ
  ≤ [2L(u_0³∆²b² + u_0²d)] / [(d − 2L²∆²η²)(n^{α/2}u_0 + 4u_0²)] · ((1 + θ)^m − 1)
  ≤ [2L(u_0²∆²b + u_0 d)/(d − 2L²∆²η²)] n^{−α/2}(e − 1),    (124)

where the final inequality follows from the fact that (1 + 1/l)^l is increasing for l > 0 and lim_{l→∞}(1 + 1/l)^l = e. Thus c_t is decreasing with respect to t, and c_0 is upper bounded. Then

γ = min_t Γ_t
  ≥ η/(2d) − [4/(d − 2L²∆²η²)](L²∆²η³/(2d) + η²L/2 + c_0η²)
  ≥ η/(2d) − 2L²∆²η³/((d − 2L²∆²η²)d) − 2η²L/(d − 2L²∆²η²) − 4c_0η²/(d − 2L²∆²η²)
  ≥ η/(2d) − [2L²∆²η²/((d − 2L²∆²η²)d)]η − [2u_0 b n^{−α}/(d − 2L²∆²η²)]η − [2u_0 b n^{−3α/2}/(d − 2L²∆²η²)]η
  = η[ 1/(2d) − (2L²∆²η² + 2u_0 bd n^{−α} + 2u_0 bd n^{−3α/2})/((d − 2L²∆²η²)d) ]
  ≥ σb/(Ln^α),    (125)

where the third inequality follows from

η²L = (u_0 b/n^α)η,    c_0η² ≤ (β/2)η² = (u_0 b n^{−3α/2}/2)η.    (126)

There exists a small value σ such that the final inequality holds if

(2L²∆²η² + 2u_0 bd n^{−α} + 2u_0 bd n^{−3α/2})/((d − 2L²∆²η²)d) < 1/2,    (127)

[2L(u_0²∆²b + u_0 d)/(d − 2L²∆²η²)] n^{−α/2}(e − 1) < β/2.    (128)

Thus

∆² < min{ (d² − 8u_0 bd n^{−α})/(2dL²η² + 4L²η²), (d − 8u_0 d)/(8u_0²b + 2L²η²) }.    (129)

Above all, we get

(1/T) Σ_{s=0}^{S−1} Σ_{t=0}^{m−1} E[||∇f(x_t^{s+1})||²] ≤ Ln^α E[f(x^0) − f(x*)]/(bTσ).    (130)