An Inexact Regularized Stochastic Newton Method for Nonconvex Optimization

Xi He, Ioannis Akrotirianakis, Amit Chakraborty, Martin Takáč
Industrial and Systems Engineering

Empirical Risk Minimization
We minimize the finite-sum problem

    min_{x ∈ R^m} f(x) := (1/n) Σ_{i=1}^n fi(x),    (P)

where each fi : R^m → R is a nonconvex loss function.
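As a concrete instance of (P), the following minimal NumPy sketch sets up a nonconvex finite-sum objective and its full gradient; the sigmoid-based least-squares loss and the data A, b are hypothetical choices for illustration, not the losses used in the experiments.

```python
# Illustrative finite-sum objective of the form (P): f(x) = (1/n) sum_i fi(x),
# with the hypothetical nonconvex loss fi(x) = (sigmoid(a_i . x) - b_i)^2.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x, A, b):
    """Average loss over the n samples (rows of A, entries of b)."""
    return np.mean((sigmoid(A @ x) - b) ** 2)

def grad_f(x, A, b):
    """Full gradient (1/n) sum_i grad fi(x)."""
    s = sigmoid(A @ x)
    return A.T @ (2.0 * (s - b) * s * (1.0 - s)) / len(b)
```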
Expected Solution

Find an approximate second-order stationary point, i.e., a point x with

    ||∇f(x)|| ≤ εg   and   λmin(∇²f(x)) ≥ −εH.

Assumption

Each loss function fi : R^m → R is
• twice continuously differentiable,
• bounded below by finf,
and, for any given x, y ∈ R^m,
• ∇fi is Lipschitz continuous with constant Lg: ||∇fi(x) − ∇fi(y)|| ≤ Lg ||x − y||,
• ∇²fi is Lipschitz continuous with constant LH: ||∇²fi(x) − ∇²fi(y)|| ≤ LH ||x − y||.

Typical Methods

Repeat:
• Find dk ≈ arg min_{d ∈ D} mk(d)
• Check whether (fk − fk+1) / mk(dk) ≥ η
  → If true: xk+1 = xk + dk
  → If false: xk+1 = xk and adjust the model m(d)

Approximated Model

The step d and the regularization parameter λ are required to satisfy

    H + λI ⪰ 0,    ||d + (H + λI)⁻¹ g|| ≤ ε ||g||,    σ^L ||d|| ≤ λ ≤ σ^U ||d||,

where σ^L and σ^U bound the Newton regularization parameter λ.

[Figure: the curve ||d(λ)|| against λ, with the band between ||d(λk)|| − ε||gk|| and ||d(λk)|| + ε||gk||, the reference slopes 1/σ^L_k and 1/σ^U_k through (0, 0), the pole at λ = −λmin(Hk), and the returned pair (dk, λk).]
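A minimal sketch of how these three conditions can be verified for a candidate pair (d, λ); the dense eigenvalue decomposition and linear solve are only for illustration (the method itself relies on Hessian-vector products), and the function name and arguments are hypothetical.

```python
# Check the approximated-model conditions for a candidate pair (d, lam).
# Dense linear algebra is used purely for illustration; H is the (sampled)
# Hessian, g the gradient, eps / sigma_L / sigma_U the tolerance and bounds.
import numpy as np

def satisfies_model(H, g, d, lam, eps, sigma_L, sigma_U):
    M = H + lam * np.eye(H.shape[0])
    # (1) H + lam*I must be positive semidefinite (strict here, so we can solve).
    if np.linalg.eigvalsh(M)[0] <= 0.0:
        return False
    # (2) inexact regularized Newton condition: ||d + (H + lam*I)^{-1} g|| <= eps*||g||.
    if np.linalg.norm(d + np.linalg.solve(M, g)) > eps * np.linalg.norm(g):
        return False
    # (3) lam must scale with the step length: sigma_L*||d|| <= lam <= sigma_U*||d||.
    return sigma_L * np.linalg.norm(d) <= lam <= sigma_U * np.linalg.norm(d)
```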
Proposed Method

• Why a Newton method? Curvature information, fast local convergence, second-order convergence.
• Why stochastic? Reduced computation while retaining the second-order convergence guarantee, via sub-sampling (see the sketch after this list):

      gk = (1/|S^g_k|) Σ_{i ∈ S^g_k} ∇fi(xk),    Hk = (1/|S^H_k|) Σ_{i ∈ S^H_k} ∇²fi(xk).

• Why inexact regularized? Only Hessian-vector products are required, and the subroutine is solved in linear time.
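The sub-sampled quantities only ever enter the method through gk and products Hk·v, so Hk never has to be formed explicitly. A minimal sketch, assuming user-supplied per-sample callables grad_fi(i, x) and hess_fi_vec(i, x, v) (hypothetical names):

```python
# Sub-sampling step: sampled gradient g_k and a matrix-free oracle v -> H_k v.
# grad_fi(i, x) and hess_fi_vec(i, x, v) are assumed per-sample callables.

def sampled_gradient(x, sample_g, grad_fi):
    """g_k = (1/|S_k^g|) * sum_{i in S_k^g} grad fi(x_k)."""
    return sum(grad_fi(i, x) for i in sample_g) / len(sample_g)

def sampled_hessian_vec(x, sample_H, hess_fi_vec):
    """Return v -> H_k v with H_k = (1/|S_k^H|) * sum_{i in S_k^H} hess fi(x_k)."""
    def hv(v):
        return sum(hess_fi_vec(i, x, v) for i in sample_H) / len(sample_H)
    return hv
```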
Main Algorithm (irSNT)

Algorithm 1 Inexact regularized stochastic Newton method (irSNT)
 1: Input: starting point x0, an acceptance constant η such that 0 < η < 1, two update constants γ1, γ2 such that γ2 ≥ γ1 > 1
 2: Initialization: cubic regularization coefficient bounds 0 ≤ σ^L_0 < σ^U_0 ≤ σ̄, sampled gradient g0 and Hessian H0, regularization parameter λ̄0 = max{√(KH + σ^U_0 ||g0||), 1}
 3: for k = 0, 1, 2, . . . do
 4:   (dk, λk) ← AdaNT(gk, Hk, λ̄k, σ^L_k, σ^U_k)
 5:   Set ρk = (f(xk) − f(xk + dk)) / ||dk||³
 6:   if ρk ≥ η then
 7:     xk+1 ← xk + dk   (accept the step)
 8:     σ^L_{k+1} = 0,  σ^U_{k+1} = σ^U_k,  λ̄_{k+1} = max{√(KH + σ^U_k ||gk||), 1}
 9:     Sample gradient gk+1 and Hessian Hk+1
10:   else
11:     xk+1 ← xk   (reject the step)
12:     σ^L_{k+1} = γ1 λk / ||dk||,  σ^U_{k+1} = γ2 λk / ||dk||,  λ̄_{k+1} = min{σ^U_{k+1} ||dk||, max{√(KH + σ^U_{k+1} ||gk||), 1}}
13:     gk+1 = gk,  Hk+1 = Hk
14:   end if
15: end for
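A simplified sketch of one iteration of Algorithm 1 (lines 4–14). The names adant and sample stand in for the sub-solver of Algorithm 2 and the gradient/Hessian sub-sampling routine; K_H, γ1, γ2, η are the constants from the algorithm, and the objective f is assumed cheap enough to evaluate for the ratio test.

```python
# Simplified sketch of one irSNT iteration (Algorithm 1, lines 4-14).
# `adant` and `sample` are placeholders for Algorithm 2 and the sub-sampling
# routine; K_H, gamma1, gamma2, eta are the constants from the text.
import numpy as np

def irsnt_step(f, x, g, H, lam_bar, sig_L, sig_U,
               adant, sample, eta, gamma1, gamma2, K_H):
    d, lam = adant(g, H, lam_bar, sig_L, sig_U)           # sub-solver call
    rho = (f(x) - f(x + d)) / np.linalg.norm(d) ** 3      # third-order ratio test
    if rho >= eta:                                        # accept the step
        x = x + d
        sig_L = 0.0
        lam_bar = max(np.sqrt(K_H + sig_U * np.linalg.norm(g)), 1.0)
        g, H = sample(x)                                  # fresh g_{k+1}, H_{k+1}
    else:                                                 # reject the step
        nd = np.linalg.norm(d)
        sig_L, sig_U = gamma1 * lam / nd, gamma2 * lam / nd
        lam_bar = min(sig_U * nd,
                      max(np.sqrt(K_H + sig_U * np.linalg.norm(g)), 1.0))
    return x, g, H, lam_bar, sig_L, sig_U
```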
Main Contributions

• Algorithm with a second-order convergence guarantee
  → irSNT: best worst-case complexity Õ(max{εg^{−3/2}, εH^{−3}})
  → adaNT: linear-time efficient subroutine
• Advantages
  → irSNT: third-order reduction on successful iterations
  → adaNT: subroutine requiring only Hessian-vector products

Sub-solver (adaNT)
Algorithm 2 Obtain a descent direction (d, λk) ← AdaNT(gk, Hk, λ̄k, σ^L_k, σ^U_k)
 1: Input: λ^0_k ← λ̄k + κ, Power Method approximation tolerance δ
 2: for i = 0, 1, 2, . . . do
 3:   εk ← min{(1 − δ)/(8λ^0_k), λ^i_k/(σ^U_k − σ^L_k), 1/||gk||, ε̂k}
 4:   Compute d such that ||d + (Hk + λ^i_k I)⁻¹ gk|| ≤ εk   (linear-time subroutine, O(κ^{1/2} log(κ/ε)))
 5:   if σ^L_k ||d|| ≤ λ^i_k ≤ σ^U_k ||d|| then
 6:     return (d, λ^i_k)
 7:   else if λ^i_k > σ^U_k ||d|| then
 8:     Derive a (1 − δ)-approximated unit eigenvector v corresponding to the largest eigenvalue of (Hk + λ^i_k I)⁻¹ ≻ 0, i.e., (1 − δ) λmax((Hk + λ^i_k I)⁻¹) ≤ vᵀ(Hk + λ^i_k I)⁻¹ v ≤ λmax((Hk + λ^i_k I)⁻¹)
 9:     Compute a vector v̂ such that ||v̂ − (Hk + λ^i_k I)⁻¹ v|| ≤ εk ||v|| = εk
10:     Set λ^{i+1}_k = λ^i_k − (1 − δ)/(4(vᵀv̂ − εk))
11:   else
12:     (d, λ^*_k) ← BinarySearch(λ^i_k, λ^{i−1}_k, εk)
13:     return (d, λ^*_k)
14:   end if
15: end for
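The O(κ^{1/2} log(κ/ε)) annotation on step 4 refers to a matrix-free linear solve; the poster does not pin the solver down, but conjugate gradient with Hessian-vector products is one standard way to reach that bound. Below is a minimal sketch of that single solve only (not of adaNT's search over λ), using SciPy; the function name and tolerance handling are assumptions for illustration.

```python
# Sketch of the matrix-free solve behind step 4 of Algorithm 2: find d with
# (H_k + lam*I) d ~= -g_k using conjugate gradient and Hessian-vector products
# only. CG controls the residual ||(H_k + lam*I)d + g_k||, which serves as a
# proxy (up to conditioning) for ||d + (H_k + lam*I)^{-1} g_k|| <= eps_k.
# CG assumes H_k + lam*I is positive definite, which the lambda iterates ensure.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inexact_regularized_solve(hess_vec, g, lam, eps, maxiter=500):
    n = g.shape[0]
    A = LinearOperator((n, n), matvec=lambda v: hess_vec(v) + lam * v, dtype=float)
    d, info = cg(A, -g, atol=eps * np.linalg.norm(g), maxiter=maxiter)
    return d, info   # info == 0 means CG reached the requested residual
```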
Convergence Analysis

Theorem (First-order global convergence). lim_{k→∞} ||∇f(xk)|| = 0, w.h.p.

Theorem (First-order worst-case complexity). Under necessary assumptions, for a scalar 0 < ε < 1, the total number of iterations needed to achieve ||∇f(xk)|| ≤ ε is at most

    1 + ⌊(1 + K1) K2 ε^{−3/2}⌋,

where K1 = (f0 − finf) / (η(3Cg + 2LH + 2CH + 2σmax + Lg)) and K2 = 1 + ⌊log(σmax/σinf) / log(γ1)⌋.
Theorem (Second-order worst-case complexity). Under necessary assumptions, for two scalars 0 < εg, εH < 1, the total number of iterations needed to achieve ||∇f(xk)|| ≤ εg and λmin(∇²f(xk)) ≥ −εH is at most

    1 + ⌊max{(1 + K1) εg^{−3/2}, K3 εH^{−3}} K2⌋,

where K3 = ⌊(f0 − finf)/(ηM^{−3})⌋.

Corollary. Under necessary assumptions, for two scalars 0 < εg, εH < 1, the total number of iterations needed to achieve ||∇f(xk)|| ≤ εg is at most Õ(εg^{−3/2}). Furthermore, λmin(∇²f(xk)) ≥ −εH can be achieved in at most Õ(max{εg^{−3/2}, εH^{−3}}) iterations.
Numerical Results

[Figure: four panels on a down-sampled Cifar10 dataset, comparing irSNT and SGDmom against Iterations (k).]

Fig. 1: Comparison of the number of iterations between the different algorithms, run on a fully connected network using a down-sampled Cifar10 dataset and started from a point with ||∇f(x0)|| ≤ 10^{−3}. SGDmom stands for stochastic gradient descent with momentum, after fine-tuning.

Reference

1. Xi He, Ioannis Akrotirianakis, Amit Chakraborty, and Martin Takáč. An inexact regularized stochastic Newton method for non-convex optimization.
2. Naman Agarwal, Zeyuan Allen-Zhu, et al. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
3. Frank E. Curtis, Daniel P. Robinson, et al. An inexact regularized Newton framework with a worst-case iteration complexity of O(ε^{−3/2}) for non-convex optimization. arXiv preprint arXiv:1708.00475, 2017.