An Inexact Regularized Stochastic Newton Method for Nonconvex Optimization

Xi He, Ioannis Akrotirianakis, Amit Chakraborty, Martin Takáč
Industrial and Systems Engineering

Empirical Risk Minimization

To minimize a finite-sum problem

    min_{x ∈ R^m} f(x) := (1/n) Σ_{i=1}^{n} f_i(x),        (P)

where each f_i : R^m → R is a nonconvex loss function.

Expected Solution

A point x satisfying

    ||∇f(x)|| ≤ ε_g    and    λ_min(∇²f(x)) ≥ −ε_H.

Assumption

Each loss function f_i : R^m → R is
• twice continuously differentiable,
• bounded below by f_inf.
For any given x, y ∈ R^m,
• ∇f_i is Lipschitz continuous with constant L_g: ||∇f_i(x) − ∇f_i(y)|| ≤ L_g ||x − y||,
• ∇²f_i is Lipschitz continuous with constant L_H: ||∇²f_i(x) − ∇²f_i(y)|| ≤ L_H ||x − y||.

Approximated Model

The pair (d, λ) is an acceptable inexact regularized Newton step when

    H + λI ≻ 0,    ||d + (H + λI)^{-1} g|| ≤ ε ||g||,    σ^L ||d|| ≤ λ ≤ σ^U ||d||,

where σ^L and σ^U bound the Newton regularization parameter λ.

[Illustration: ||d(λ)|| plotted against λ, with a pole at λ = −λ_min(H_k), the band ||d(λ_k)|| ± ε||g_k||, reference lines of slope 1/σ^L_k and 1/σ^U_k, and the admissible pair (d_k, λ_k) marked.]

Typical Methods

Repeat:
• Find d_k ≈ arg min_{d ∈ D} m_k(d)
• Check whether (f_k − f_{k+1}) / m_k(d_k) ≥ η
• If true: x_{k+1} = x_k + d_k
• If false: x_{k+1} = x_k and adjust m_k(d)
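To make the Approximated Model conditions above concrete, here is a minimal numpy sketch; all names are hypothetical, and an exact solve stands in for whichever inexact solver is actually used.

# Minimal sketch (hypothetical names) of the "Approximated Model" conditions:
# (d, lambda) is acceptable if H + lambda*I is positive definite, d inexactly
# solves (H + lambda*I) d = -g, and sigma_L*||d|| <= lambda <= sigma_U*||d||.
import numpy as np

def regularized_newton_pair_ok(H, g, d, lam, sigma_L, sigma_U, eps):
    """Check the three conditions defining an acceptable (d, lambda) pair."""
    M = H + lam * np.eye(H.shape[0])
    if np.linalg.eigvalsh(M).min() <= 0:          # H + lambda*I must be positive definite
        return False
    residual = np.linalg.norm(d + np.linalg.solve(M, g))   # ||d + (H + lam I)^{-1} g||
    inexact_ok = residual <= eps * np.linalg.norm(g)
    nd = np.linalg.norm(d)
    band_ok = sigma_L * nd <= lam <= sigma_U * nd           # regularization proportional to ||d||
    return inexact_ok and band_ok

# Toy usage: exact step on a small indefinite quadratic
H = np.array([[2.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, 1.0])
lam = 1.5                                         # makes H + lam*I positive definite
d = -np.linalg.solve(H + lam * np.eye(2), g)      # exact solve, so the residual is ~0
print(regularized_newton_pair_ok(H, g, d, lam, sigma_L=0.1, sigma_U=10.0, eps=0.1))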

Proposed Method

• Why a Newton method?
  Curvature information, fast local convergence, second-order convergence.
• Why stochastic?
  Reduce computation, retain the second-order convergence guarantee.
• Why inexact regularized?
  Only Hessian-vector products are required; the subroutine is solved in linear time.

Sub-sampling:

    g_k = (1/|S^g_k|) Σ_{i ∈ S^g_k} ∇f_i(x_k),    H_k = (1/|S^H_k|) Σ_{i ∈ S^H_k} ∇²f_i(x_k).

Main Algorithm (irSNT)

Algorithm 1 Inexact regularized stochastic Newton method (irSNT)
1: Input: starting point x_0, an acceptance constant η such that 0 < η < 1, two update constants γ_1, γ_2 such that γ_2 ≥ γ_1 > 1
2: Initialization: cubic regularization coefficient bounds 0 ≤ σ^L_0 < σ^U_0 ≤ σ̄, sampled gradient g_0 and Hessian H_0, regularization parameter λ̄_0 = max{√(K_H + σ^U_0 ||g_0||), 1}
3: for k = 0, 1, 2, . . . do
4:   (d_k, λ_k) ← AdaNT(g_k, H_k, λ̄_k, σ^L_k, σ^U_k)
5:   Set ρ_k = (f(x_k) − f(x_k + d_k)) / ||d_k||³
6:   if ρ_k ≥ η then
7:     x_{k+1} ← x_k + d_k   (Accept the step)
8:     σ^L_{k+1} = 0, σ^U_{k+1} = σ^U_k, λ̄_{k+1} = max{√(K_H + σ^U_k ||g_k||), 1}
9:     Sample gradient g_{k+1} and Hessian H_{k+1}
10:  else
11:    x_{k+1} ← x_k   (Reject the step)
12:    σ^L_{k+1} = γ_1 λ_k / ||d_k||, σ^U_{k+1} = γ_2 λ_k / ||d_k||, λ̄_{k+1} = min{σ^U_{k+1} ||d_k||, max{√(K_H + σ^U_k ||g_k||), 1}}
13:    g_{k+1} = g_k, H_{k+1} = H_k
14:  end if
15: end for
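Below is a rough Python sketch of the irSNT outer loop. It is a reading of Algorithm 1, not the authors' code: adant_step, grad_i, hess_i, the sampling scheme, and all default constants are hypothetical placeholders, and a practical implementation would avoid forming H_k explicitly by working with Hessian-vector products only.

import numpy as np

def irsnt(x0, f, grad_i, hess_i, n, adant_step, eta=0.1, gamma1=2.0, gamma2=4.0,
          K_H=1.0, batch=128, max_iter=100, seed=0):
    """Sketch of the irSNT outer loop (Algorithm 1); the sub-solver is passed in as adant_step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    sigma_L, sigma_U = 0.0, 1.0                    # arbitrary bounds with 0 <= sigma_L < sigma_U

    def subsample(x):
        S = rng.choice(n, size=min(batch, n), replace=False)
        g = np.mean([grad_i(x, i) for i in S], axis=0)     # sub-sampled gradient g_k
        H = np.mean([hess_i(x, i) for i in S], axis=0)     # sub-sampled Hessian H_k
        return g, H

    g, H = subsample(x)
    lam_bar = max(np.sqrt(K_H + sigma_U * np.linalg.norm(g)), 1.0)
    for k in range(max_iter):
        d, lam = adant_step(g, H, lam_bar, sigma_L, sigma_U)     # Algorithm 2 (AdaNT)
        rho = (f(x) - f(x + d)) / np.linalg.norm(d) ** 3         # cubic-style acceptance ratio
        if rho >= eta:                                           # accept the step
            x = x + d
            sigma_L = 0.0                                        # sigma_U is kept
            lam_bar = max(np.sqrt(K_H + sigma_U * np.linalg.norm(g)), 1.0)
            g, H = subsample(x)                                  # resample gradient and Hessian
        else:                                                    # reject: raise the sigma bounds
            nd = np.linalg.norm(d)
            base = max(np.sqrt(K_H + sigma_U * np.linalg.norm(g)), 1.0)
            sigma_L, sigma_U = gamma1 * lam / nd, gamma2 * lam / nd
            lam_bar = min(sigma_U * nd, base)
    return x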

Main Contributions

• Algorithm with a second-order convergence guarantee
  → irSNT: best worst-case complexity Õ(max{ε_g^{-3/2}, ε_H^{-3}})
  → adaNT: linear-time efficient subroutine
• Advantages
  → irSNT: third-order decrease (proportional to ||d_k||³) on successful iterations
  → adaNT: subroutine requiring only Hessian-vector products

Sub-solver (adaNT)

Algorithm 2 Obtain descent direction (d, λ_k) ← AdaNT(g_k, H_k, λ̄_k, σ^L_k, σ^U_k)
1: Input: λ^0_k ← λ̄_k + κ, Power Method approximation tolerance δ
2: for i = 0, 1, 2, . . . do
3:   ε_k ← min{(1 − δ)/(8λ^0_k), λ^i_k/(σ^U_k − σ^L_k), 1/||g_k||, ε̂_k}
4:   Compute d such that ||d + (H_k + λ^i_k I)^{-1} g_k|| ≤ ε_k   (linear-time subroutine, O(κ^{1/2} log(κ/ε)))
5:   if σ^L_k ||d|| ≤ λ^i_k ≤ σ^U_k ||d|| then
6:     return: (d, λ^i_k)
7:   else if λ^i_k > σ^U_k ||d|| then
8:     Derive a (1 − δ)-approximated unit eigenvector v corresponding to the largest eigenvalue of (H_k + λ^i_k I)^{-1} ≻ 0, i.e., (1 − δ) λ_max((H_k + λ^i_k I)^{-1}) ≤ v^T (H_k + λ^i_k I)^{-1} v ≤ λ_max((H_k + λ^i_k I)^{-1})
9:     Compute a vector v̂ such that ||v̂ − (H_k + λ^i_k I)^{-1} v|| ≤ ε_k ||v|| = ε_k
10:    Set λ^{i+1}_k = λ^i_k − (1 − δ)/(4(v^T v̂ − ε_k))
11:  else
12:    (d, λ*_k) ← BinarySearch(λ^i_k, λ^{i−1}_k, ε_k)
13:    return: (d, λ*_k)
14:  end if
15: end for
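The computational core of AdaNT is step 4: an inexact solve of (H_k + λI) d = −g_k that touches H_k only through Hessian-vector products. A minimal conjugate-gradient sketch of that primitive follows; hvp and the tolerance handling are hypothetical stand-ins, the CG residual is used as a practical proxy for the error measure ||d + (H_k + λI)^{-1} g_k||, and the λ-update and BinarySearch logic of Algorithm 2 are omitted.

import numpy as np

def inexact_reg_newton_step(hvp, g, lam, eps, max_iter=200):
    """Return d with a small residual of (H + lam*I) d = -g, via CG on Hessian-vector products."""
    def apply(v):                        # matrix-vector product with H + lam*I
        return hvp(v) + lam * v
    d = np.zeros_like(g)
    r = -g - apply(d)                    # residual of (H + lam*I) d = -g
    p = r.copy()
    rs = r @ r
    tol = eps * np.linalg.norm(g)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = apply(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy usage with an explicit matrix standing in for Hessian-vector products
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -1.0])
d = inexact_reg_newton_step(lambda v: H @ v, g, lam=0.5, eps=1e-6)
print(np.linalg.norm((H + 0.5 * np.eye(2)) @ d + g))   # small residual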

Convergence Analysis

Theorem (First-order global convergence). lim_{k→∞} ||∇f(x_k)|| = 0, with high probability.

Theorem (First-order worst-case complexity). Under the necessary assumptions, for a scalar 0 < ε < 1, the total number of iterations needed to achieve ||∇f(x_k)|| ≤ ε is at most

    1 + ⌊(1 + K_1) K_2 ε^{-3/2}⌋,

where K_1 = (f_0 − f_inf) / (η(3C_g + 2L_H + 2C_H + 2σ_max + L_g)) and K_2 = 1 + ⌊log(σ_max/σ_inf) / log(γ_1)⌋.

Theorem (Second-order worst-case complexity). Under the necessary assumptions, for two scalars 0 < ε_g, ε_H < 1, the total number of iterations needed to achieve ||∇f(x_k)|| ≤ ε_g and λ_min(∇²f(x_k)) ≥ −ε_H is at most

    1 + ⌊max{(1 + K_1) ε_g^{-3/2}, K_3 ε_H^{-3}} K_2⌋,    where K_3 = ⌊(f_0 − f_inf) / (η M^{-3})⌋.

Corollary. Under the necessary assumptions, for two scalars 0 < ε_g, ε_H < 1, the total number of iterations needed to achieve ||∇f(x_k)|| ≤ ε_g is at most Õ(ε_g^{-3/2}). Furthermore, λ_min(∇²f(x_k)) ≥ −ε_H can be achieved in at most Õ(max{ε_g^{-3/2}, ε_H^{-3}}) iterations.
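As a back-of-the-envelope illustration (not a claim made on the poster), plugging representative tolerances into the corollary shows that the two terms balance when ε_H = ε_g^{1/2}:

% Illustrative only: orders of magnitude implied by the corollary, ignoring the
% constants and logarithmic factors hidden in the tilde-O notation.
\[
  \varepsilon_g = 10^{-2} \;\Rightarrow\; \varepsilon_g^{-3/2} = 10^{3},
  \qquad
  \varepsilon_H = 10^{-1} \;\Rightarrow\; \varepsilon_H^{-3} = 10^{3},
\]
\[
  \tilde{\mathcal{O}}\bigl(\max\{\varepsilon_g^{-3/2},\, \varepsilon_H^{-3}\}\bigr)
  \;\approx\; 10^{3}\ \text{iterations, up to constants and logarithmic factors.}
\]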


Numerical Results

[Fig. 1: four panels (cifar10, irSNT and cifar10, SGDmom) plotted against Iterations(k); see the caption below.]

Fig. 1: Comparing the number of iterations between the different algorithms. Run on a fully connected network using a down-sampled Cifar10 dataset, starting from a point with ||∇f(x_0)|| ≤ 10^{-3}. SGDmom stands for stochastic gradient descent with momentum, with fine-tuned hyperparameters.

References

1. Xi He, Ioannis Akrotirianakis, Amit Chakraborty, and Martin Takáč. An inexact regularized stochastic Newton method for non-convex optimization.
2. Naman Agarwal, Zeyuan Allen-Zhu, et al. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
3. Frank E. Curtis, Daniel P. Robinson, et al. An inexact regularized Newton framework with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. arXiv preprint arXiv:1708.00475, 2017.