Extending the Reach of Big Data Optimization:
Stochastic and Accelerated Algorithms for Minimizing Relatively Smooth Functions

Filip Hanzely^{1,2}, Peter Richtárik^{1,2,3}, Lin Xiao^{4}
^1 KAUST   ^2 University of Edinburgh   ^3 MIPT   ^4 Microsoft Research

Optimization Problem

min_{x ∈ Q} f(x), where Q ⊆ R^n is a closed convex set and f is differentiable

Main Assumptions

• L-relative smoothness of f w.r.t. a reference function h
• Relative strong convexity of f w.r.t. h
• Standard smoothness and strong convexity arise as the special case h(x) = (1/2)‖x‖^2
• Key tool – the Bregman divergence (checked numerically in the sketch below):
  D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩
• Equivalent formulations of L-relative smoothness:
  L·h − f is convex  ⟺  f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + L·D_h(y, x) for all x, y
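As a quick sanity check on these definitions, the sketch below (not part of the poster) evaluates the Bregman divergence and verifies the L-relative-smoothness inequality on random points. The Poisson-type objective f, the Burg-entropy reference function h, and the candidate constant L = ‖b‖_1 are illustrative choices from the relative-smoothness literature, not objects defined in this work.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20
A = rng.uniform(0.1, 1.0, size=(m, n))   # positive data matrix (illustrative assumption)
b = rng.uniform(0.5, 2.0, size=m)

def f(x):                                  # Poisson-type objective (illustrative choice)
    Ax = A @ x
    return np.sum(Ax - b * np.log(Ax))

def grad_f(x):
    Ax = A @ x
    return A.T @ (1.0 - b / Ax)

def h(x):                                  # Burg-entropy reference function (illustrative choice)
    return -np.sum(np.log(x))

def grad_h(x):
    return -1.0 / x

def bregman(y, x):
    # D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>
    return h(y) - h(x) - grad_h(x) @ (y - x)

L = np.sum(b)                              # candidate relative-smoothness constant (assumption)
for _ in range(1000):
    x = rng.uniform(0.1, 2.0, size=n)
    y = rng.uniform(0.1, 2.0, size=n)
    rhs = f(x) + grad_f(x) @ (y - x) + L * bregman(y, x)
    assert f(y) <= rhs + 1e-8, "relative smoothness violated for this L"
print("f(y) <= f(x) + <grad f(x), y - x> + L * D_h(y, x) held on all sampled pairs")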

Main Contributions

• Analysis for a range of stepsize parameters
• 3 new algorithms:
  • Relative Stochastic Gradient Descent
  • Relative Randomized Coordinate Descent
  • Accelerated Relative Gradient Descent

Baseline Algorithm – Relative Gradient Descent [1, 2]

x^{k+1} ← argmin_{z ∈ Q} { ⟨∇f(x^k), z⟩ + L·D_h(z, x^k) }

Convergence result

f(x^T) − f(x^*) ≤ O( L·D_h(x^*, x^0) / T ) for convex f; a linear rate holds under relative strong convexity [1, 2].
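Assuming the same illustrative f, grad_f, L, and Burg-entropy h as in the sketch above, the baseline update has a closed form: for an interior step the optimality condition is ∇h(x^{k+1}) = ∇h(x^k) − ∇f(x^k)/L, and ∇h(x) = −1/x can be inverted explicitly. The snippet below is a minimal sketch of that special case, not a general solver for the Bregman subproblem over Q.

import numpy as np

def relative_gradient_descent(grad_f, L, x0, iters=500):
    # Relative gradient descent with Burg-entropy h; the general method
    # instead solves argmin_{z in Q} { <grad f(x), z> + L * D_h(z, x) }.
    x = x0.copy()
    for _ in range(iters):
        g = grad_f(x)
        x = 1.0 / (1.0 / x + g / L)   # inverts grad h: -1/x_new = -1/x - g/L
    return x

# usage with the objects from the previous sketch (iterates stay in the positive orthant):
# x_best = relative_gradient_descent(grad_f, L, x0=np.ones(n))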

Relative Randomized Coordinate Descent

• h is separable
• v-ESO (Expected Separable Overapproximation, defined in [3] in the case when h is the L2 norm)
• w-relative strong convexity, with a different strong convexity parameter w_i for each coordinate i

Convergence result

E[f(x^T)] − f(x^*) ≤ O( (1 − min_i w_i / v_i)^T )
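A minimal sketch of the coordinate variant under the same illustrative separable Burg-entropy h: each iteration samples one coordinate and applies a one-dimensional Bregman step with a per-coordinate constant v_i. The uniform sampling and the helper name relative_rcd are illustrative choices; [3] covers arbitrary samplings.

import numpy as np

def relative_rcd(grad_f, v, x0, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = x.size
    for _ in range(iters):
        i = rng.integers(n)                        # uniform sampling of a coordinate
        g_i = grad_f(x)[i]                         # stands in for a cheap partial-derivative oracle
        x[i] = 1.0 / (1.0 / x[i] + g_i / v[i])     # 1-D Bregman step for h_i(t) = -log t
    return x

# usage with the previous sketch's objects; v_i = L is a crude but safe choice for this example:
# x_best = relative_rcd(grad_f, v=np.full(n, L), x0=np.ones(n))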

Relative Stochastic Gradient Descent

• Access to an unbiased estimator of the gradient
• Useful when the gradient estimator is much cheaper to compute than the full gradient

Convergence result

E[ min_{t ≤ T} f(x^t) ] − f(x^*) ≤ O(…)
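A minimal sketch of the stochastic variant for the same illustrative Poisson-type objective, where f = Σ_i f_i and one random row gives the unbiased estimator m·∇f_i(x). The stepsize parameter eta below is a conservative illustrative choice that keeps the iterates positive, not the tuning analyzed on the poster.

import numpy as np

def relative_sgd(A, b, x0, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = x0.copy()
    eta = m * b.max()                             # conservative stepsize parameter (assumption), keeps x > 0 here
    for _ in range(iters):
        i = rng.integers(m)
        ai, bi = A[i], b[i]
        g_hat = m * ai * (1.0 - bi / (ai @ x))    # unbiased estimator of grad f(x)
        x = 1.0 / (1.0 / x + g_hat / eta)         # Bregman step w.r.t. h(x) = -sum(log x)
    return x

# usage with the data from the first sketch:
# x_best = relative_sgd(A, b, x0=np.ones(n))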

Accelerated Relative Gradient Descent

• f is relatively smooth and convex
• Goal: get the accelerated rate O(1/T^2)
• The algorithm and its rate depend on properties of the Bregman distance, captured by the Triangle Scaling Equality:
  D_h((1 − θ)x + θy, (1 − θ)x + θz) = g(x, y, z, θ) ν(θ) D_h(y, z),
  where ν(θ) decays fast as θ → 0 and g(x, y, z, θ) is upper bounded by a constant
• Stepsize sequence θ^k (determined)

One iteration:
  y^{k+1} ← (1 − θ^{k+1}) x^k + θ^{k+1} z^k
  find G^{k+1} ∈ R such that G^{k+1} ≥ g(x^k, z^{k+1}, z^k, θ^{k+1}) and G^{k+1} ≤ ρ G^k
  z^{k+1} ← argmin_{z ∈ Q} { ⟨∇f(y^{k+1}), z⟩ + (G^{k+1} ν(θ^{k+1}) L / θ^{k+1}) D_h(z, z^k) }
  x^{k+1} ← (1 − θ^{k+1}) x^k + θ^{k+1} z^{k+1}

Convergence result

f(x^T) − f(x^*) ≤ O( (max_{t ≤ T} G^t) / T^2 )

• Strictly better than Relative Gradient Descent
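For intuition, the sketch below instantiates the accelerated scheme in the classical case h(x) = ‖x‖^2/2, where the Triangle Scaling Equality holds with ν(θ) = θ^2 and g ≡ 1, so the gain search is trivial (G^k ≡ 1) and every step is explicit; it reduces to a standard accelerated gradient method and is not the general gain-adaptive Bregman version. f, grad_f, and L here are any L-smooth convex objective supplied by the user.

import numpy as np

def accelerated_relative_gd(grad_f, L, x0, iters=500):
    x = x0.copy()
    z = x0.copy()
    for k in range(iters):
        theta = 2.0 / (k + 2)                  # determined stepsize sequence
        y = (1 - theta) * x + theta * z        # y^{k+1}
        z = z - grad_f(y) / (theta * L)        # argmin_z <grad f(y), z> + theta*L*||z - z_k||^2/2
        x = (1 - theta) * x + theta * z        # x^{k+1}
    return x

# illustrative usage on a least-squares objective f(x) = ||A x - b||^2 / 2 with L = ||A||_2^2:
# A = np.random.default_rng(0).standard_normal((30, 10)); b = A @ np.ones(10)
# xs = accelerated_relative_gd(lambda x: A.T @ (A @ x - b), np.linalg.norm(A, 2) ** 2, np.zeros(10))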


References

[1] Bauschke, Heinz H., Jérôme Bolte, and Marc Teboulle. "A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications." Mathematics of Operations Research (2016).
[2] Lu, Haihao, Robert M. Freund, and Yurii Nesterov. "Relatively Smooth Convex Optimization by First-Order Methods, and Applications." arXiv preprint arXiv:1610.05708 (2016).
[3] Qu, Zheng, and Peter Richtárik. "Coordinate descent with arbitrary sampling II: Expected separable overapproximation." Optimization Methods and Software 31.5 (2016): 858-884.
