Extending the Reach of Big Data Optimization: Stochastic and Accelerated Algorithms for Minimizing Relatively Smooth Functions

Filip Hanzely (1,2), Peter Richtárik (1,2,3), Lin Xiao (4)
1 KAUST, 2 University of Edinburgh, 3 MIPT, 4 Microsoft Research

Optimization Problem

min_{x ∈ Q} f(x), where Q is a closed convex set

Main Assumptions

• L-relative smoothness of f w.r.t. a reference function h: f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + L D_h(x, y)
• μ-relative strong convexity of f w.r.t. h: f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + μ D_h(x, y)
• Standard smoothness and strong convexity arise as a special case for h(x) = ½‖x‖²
• Key tool – Bregman divergence: D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩
• Equivalent formulations (L-relative smoothness): Lh − f is convex; for twice differentiable f and h, ∇²f(x) ⪯ L ∇²h(x)
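To make the key object concrete, here is a minimal numerical sketch of the Bregman divergence; the helper name `bregman` and the choice h(x) = ½‖x‖² are illustrative, not taken from the poster.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman divergence D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# With h(x) = 1/2 ||x||^2 the divergence reduces to the Euclidean case,
# D_h(x, y) = 1/2 ||x - y||^2, which is why the standard notions of
# smoothness and strong convexity are recovered for this choice of h.
h = lambda x: 0.5 * np.dot(x, x)
grad_h = lambda x: x

x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(h, grad_h, x, y))  # 1.0 == 0.5 * ||x - y||^2
```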
Baseline Algorithm – Relative Gradient Descent [1, 2]

x^{k+1} = argmin_{z ∈ Q} { ⟨∇f(x^k), z⟩ + L D_h(z, x^k) }

Convergence result

f(x^T) − f(x^*) ≤ O(L D_h(x^*, x^0) / T); under μ-relative strong convexity the rate improves to a linear one, O((1 − μ/L)^T)
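A minimal sketch of this baseline, assuming the per-iteration subproblem is handed to the method as an oracle `bregman_prox`; with h(x) = ½‖x‖² and Q = ℝⁿ that oracle has the closed form x − g/L and the method reduces to ordinary gradient descent. The names and the toy least-squares problem are illustrative.

```python
import numpy as np

def relative_gd(grad_f, bregman_prox, x0, L, iters=1000):
    """Relative gradient descent:
    x^{k+1} = argmin_{z in Q} <grad f(x^k), z> + L * D_h(z, x^k),
    where bregman_prox(g, x, L) returns that argmin for the chosen h and Q."""
    x = x0
    for _ in range(iters):
        x = bregman_prox(grad_f(x), x, L)
    return x

# Euclidean special case: h(x) = 1/2 ||x||^2, Q = R^n  =>  prox step is explicit.
euclidean_prox = lambda g, x, L: x - g / L

# Toy problem f(x) = 1/2 ||A x - b||^2 with L = ||A||_2^2.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2
print(relative_gd(grad_f, euclidean_prox, np.zeros(2), L))  # -> least-squares solution
```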
Main Contributions

• Analysis for a range of stepsize parameters
• 3 new algorithms:
  • Relative Stochastic Gradient Descent
  • Relative Randomized Coordinate Descent
  • Accelerated Relative Gradient Descent

Relative Stochastic Gradient Descent

• Access to an unbiased estimator of the gradient
• Useful when the gradient estimator is much cheaper than the full gradient

Convergence result

E[min_{t ≤ T} f(x^t)] − f(x^*) ≤ O(…)
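A minimal sketch of the stochastic variant, assuming the update is the baseline Bregman step with ∇f(x^k) replaced by an unbiased estimate and a constant stepsize 1/L; the finite-sum sampling scheme and all names here are illustrative, not the poster's exact rule.

```python
import numpy as np

def relative_sgd(stoch_grad, bregman_prox, x0, L, iters=5000, seed=0):
    """Relative SGD sketch: the baseline Bregman prox step applied to an
    unbiased gradient estimate g^k instead of the full gradient."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(iters):
        g = stoch_grad(x, rng)        # unbiased estimate of grad f(x)
        x = bregman_prox(g, x, L)     # same prox mapping as the baseline method
    return x

# Finite-sum example: f(x) = (1/n) sum_i 1/2 (a_i^T x - b_i)^2, one row per step.
A = np.random.default_rng(1).standard_normal((50, 5))
b = A @ np.ones(5)

def stoch_grad(x, rng):
    i = rng.integers(len(b))          # sample one row uniformly at random
    return A[i] * (A[i] @ x - b[i])   # expectation equals grad f(x)

euclidean_prox = lambda g, x, L: x - g / L
L = np.max(np.sum(A ** 2, axis=1))    # row-wise smoothness constant
print(relative_sgd(stoch_grad, euclidean_prox, np.zeros(5), L))  # ~ ones(5)
```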
Relative Randomized Coordinate Descent

• h is separable
• v-ESO (Expected Separable Overapproximation, defined in [3] for the case when the norm is the L2 norm)
• w-relative strong convexity, with a different strong convexity parameter w_i for each coordinate
• Strictly better than Relative Gradient Descent

Convergence result

E[f(x^T)] − f(x^*) ≤ O((…)^T), a linear rate governed by the per-coordinate quantities w_i and v_i and the sampling probabilities
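A minimal sketch of the coordinate variant, assuming separable h (here the Euclidean ½‖x‖², so each coordinate step is explicit), coordinate sampling with probabilities p_i, and the ESO parameters v_i used as per-coordinate stepsize constants; the quadratic example and all names are illustrative.

```python
import numpy as np

def relative_rcd(partial_grad, x0, v, p, iters=3000, seed=0):
    """Relative randomized coordinate descent sketch (separable h = 1/2 ||.||^2):
    sample coordinate i with probability p_i, then take a one-dimensional
    Bregman step on that coordinate with stepsize 1 / v_i."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        i = rng.choice(len(x), p=p)
        x[i] -= partial_grad(x, i) / v[i]
    return x

# Quadratic example f(x) = 1/2 x^T M x - c^T x with v_i = M_ii (serial sampling).
M = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
partial_grad = lambda x, i: M[i] @ x - c[i]
print(relative_rcd(partial_grad, np.zeros(2), v=np.diag(M), p=np.full(2, 0.5)))
# -> approaches the minimizer M^{-1} c = [0.2, 0.4]
```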
Accelerated Relative Gradient Descent

• f is relatively smooth and convex
• Goal: get the rate O(1/T²)
• The algorithm and its rate depend on properties of the Bregman distance – Triangle Scaling Equality:
  D_h((1 − θ)x + θy, (1 − θ)x + θz) = g(x, y, z, θ) ν(θ) D_h(y, z),
  where g(x, y, z, θ) is upper bounded by a constant and ν(θ) decays fast for θ → 0
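For intuition, the scaling factor can be computed in closed form in the Euclidean case; this worked example (not taken from the poster) uses h(x) = ½‖x‖², for which D_h(u, v) = ½‖u − v‖²:

D_h((1 − θ)x + θy, (1 − θ)x + θz) = ½‖θ(y − z)‖² = θ² D_h(y, z),

so the Triangle Scaling Equality holds with g(x, y, z, θ) ≡ 1 and ν(θ) = θ², which indeed decays quickly as θ → 0.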
Algorithm (one iteration; the stepsize sequence θ^{k+1} is determined in advance):
  y^{k+1} ← (1 − θ^{k+1}) x^k + θ^{k+1} z^k
  find G^{k+1} ∈ ℝ such that G^{k+1} ≥ g(x^k, z^{k+1}, z^k, θ^{k+1}) and G^{k+1} ≤ ρ
  z^{k+1} ← argmin_{z ∈ Q} { ⟨∇f(y^{k+1}), z⟩ + (G^{k+1} ν(θ^{k+1}) / θ^{k+1}) L D_h(z, z^k) }
  x^{k+1} ← (1 − θ^{k+1}) x^k + θ^{k+1} z^{k+1}

Convergence result

f(x^T) − f(x^*) ≤ O(…), with the bound depending on the stepsizes θ^k and the gains G^k
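A minimal sketch of the accelerated method in the Euclidean special case, assuming h(x) = ½‖x‖² (so ν(θ) = θ², g ≡ 1 and the gain search is trivial, G^{k+1} = 1) and the classical stepsize sequence θ_k = 2/(k + 2); with these choices the scheme reduces to a standard Nesterov-type accelerated method. The stepsize rule and the toy problem are illustrative, not the poster's exact choices.

```python
import numpy as np

def accelerated_relative_gd(grad_f, x0, L, iters=300):
    """Accelerated relative gradient descent, Euclidean special case
    (h = 1/2 ||.||^2, nu(theta) = theta^2, gain G^k = 1):
      y^{k+1} = (1 - theta) x^k + theta z^k
      z^{k+1} = z^k - grad f(y^{k+1}) / (theta * L)   # explicit Bregman prox step
      x^{k+1} = (1 - theta) x^k + theta z^{k+1}
    with stepsizes theta_k = 2 / (k + 2)."""
    x = z = x0
    for k in range(iters):
        theta = 2.0 / (k + 2)
        y = (1 - theta) * x + theta * z
        z = z - grad_f(y) / (theta * L)
        x = (1 - theta) * x + theta * z
    return x

# Toy quadratic f(x) = 1/2 ||A x - b||^2, L = ||A||_2^2.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2
print(accelerated_relative_gd(grad_f, np.zeros(2), L))  # -> solution of A x = b
```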
References

[1] Bauschke, Heinz H., Jérôme Bolte, and Marc Teboulle. "A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications." Mathematics of Operations Research (2016).
[2] Lu, Haihao, Robert M. Freund, and Yurii Nesterov. "Relatively smooth convex optimization by first-order methods, and applications." arXiv preprint arXiv:1610.05708 (2016).
[3] Qu, Zheng, and Peter Richtárik. "Coordinate descent with arbitrary sampling II: Expected separable overapproximation." Optimization Methods and Software 31.5 (2016): 858-884.