Optimal Newton-type methods for nonconvex smooth optimization

Coralia Cartis (University of Edinburgh, UK)
joint with Nick Gould (RAL, UK) and Philippe Toint (Namur, Belgium)

WID-DOW Seminar Series, University of Wisconsin, Madison, November 14, 2011

Unconstrained optimization — a “mature” area?

Local unconstrained optimization:

    \min_{x \in \mathbb{R}^n} f(x),   where f \in C^1(\mathbb{R}^n) or C^2(\mathbb{R}^n).

Currently two main competing methodologies:
- linesearch methods
- trust-region methods
Much reliable, efficient software exists for (large-scale) problems.
- cubic regularization methods ...

Is there anything more to say?
- Global rates of convergence for optimization algorithms.
- Cubic regularization: better than Newton and steepest descent in the worst case; optimal in the second-order class.

Global efficiency of standard methods

Number of function and gradient evaluations ≃ iteration complexity.

Steepest descent method (with linesearch or trust-region): for f \in C^2(\mathbb{R}^n) with Lipschitz continuous gradient, generating \|g(x_k)\| \le \epsilon requires at most

    \kappa_{\rm sd}\,\epsilon^{-2}   function evaluations.      [cf. Nesterov ('04), Gratton et al. ('08)]

Newton's method: when convergent, requires at most ??? function evaluations.
- If the Newton step is taken within a trust-region framework, then the steepest-descent bound applies.

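For completeness, here is the standard argument behind the \kappa_{\rm sd}\,\epsilon^{-2} bound (it is not spelled out on the slide): if g is L-Lipschitz and the step x_{k+1} = x_k - g_k/L is taken, the descent lemma gives, for every iteration with \|g_k\| > \epsilon,

    f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\,\|g_k\|^2 < f(x_k) - \frac{\epsilon^2}{2L},

so summing over the first K such iterations,

    f(x_0) - f_{\rm low} \ge \sum_{k=0}^{K-1}\bigl[f(x_k) - f(x_{k+1})\bigr] > \frac{K\,\epsilon^2}{2L}
    \quad\Longrightarrow\quad
    K < \frac{2L\,\bigl(f(x_0) - f_{\rm low}\bigr)}{\epsilon^2}.
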
The worst-case bound is sharp for steepest descent

Given x_0, for any \epsilon > 0 and \tau > 0, steepest descent with (inexact) linesearch applied to the f below takes precisely

    \left\lceil \frac{1}{\epsilon^{2-\tau}} \right\rceil   function evaluations

to generate |g(x_k)| \le \epsilon.

[Figures: the objective function f and its gradient g.]

The bound is sharp for steepest descent ...

- f \in C^2(\mathbb{R}), bounded below by zero.
- The gradient g is Lipschitz continuous.
- f(x_k) \to 0 = \inf_{x \in \mathbb{R}} f(x) as k \to \infty.

[Figures: the Hessian of f and the third derivative of f.]

The bound is sharp for steepest descent ...

Unidimensional example: x_0 = 0 and, for k \ge 0,

    x_{k+1} = x_k + \alpha_k \left(\frac{1}{k+1}\right)^{\frac{1}{2}+\eta},   with 0 < \underline{\alpha} \le \alpha_k \le \bar{\alpha} < 2,

where \eta = \eta(\tau) is such that \frac{1}{2} + \eta = \frac{1}{2-\tau}. Also,

    f_0 = \tfrac{1}{2}\,\zeta(1+2\eta),
    f_{k+1} = f_k - \alpha_k\bigl(1 - \tfrac{1}{2}\alpha_k\bigr)\left(\frac{1}{k+1}\right)^{1+2\eta},
    g_k = -\left(\frac{1}{k+1}\right)^{\frac{1}{2}+\eta},   and   H_k = 1.

Use Hermite interpolation on [x_k, x_{k+1}] to construct f such that f(x_k) = f_k, g(x_k) = g_k and H(x_k) = H_k.

    \Longrightarrow   k = \left\lceil \epsilon^{-(2-\tau)} \right\rceil   such that |g_k| \le \epsilon.

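As a quick numeric sanity check of the counting claim (not from the talk; only the gradient values g_k from the example are needed, the interpolated f itself plays no role in the count), one can simply count how many k satisfy |g_k| > ε:

    import math

    def slow_sd_count(eps, tau):
        """Number of k with |g_k| > eps, where g_k = -(1/(k+1))**(1/2 + eta)
        and 1/2 + eta = 1/(2 - tau), as in the slow steepest-descent example."""
        p = 1.0 / (2.0 - tau)        # the exponent 1/2 + eta
        k = 0
        while (1.0 / (k + 1)) ** p > eps:
            k += 1
        return k

    for eps in (1e-1, 1e-2, 1e-3):
        # the loop count agrees with eps^{-(2 - tau)} up to rounding
        print(eps, slow_sd_count(eps, tau=0.1), math.ceil(eps ** -(2 - 0.1)))
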
Newton's method: as slow as steepest descent

Newton's method may require as many as

    \frac{\kappa_N}{\epsilon^{2-\tau}}   function evaluations

to generate \|g(x_k)\| \le \epsilon, for any \epsilon and \tau > 0.

Corollary: the trust-region methods' worst-case bound of O(\epsilon^{-2}) is sharp.

Here f \in C^2(\mathbb{R}^2) with bounded and (segmentwise) Lipschitz continuous Hessian.

[Figure: path of iterates and contours of f.]

Newton's method: as slow as steepest descent ...

Bidimensional example: x_0 = (0,0)^T, \eta = \eta(\tau) such that \frac{1}{2} + \eta = \frac{1}{2-\tau}; for k \ge 0,

    x_{k+1} = x_k + \left( \left(\frac{1}{k+1}\right)^{\frac{1}{2}+\eta},\ 1 \right)^T,

    f_0 = \tfrac{1}{2}\bigl[\zeta(1+2\eta) + \zeta(2)\bigr],
    f_{k+1} = f_k - \tfrac{1}{2}\left[\left(\frac{1}{k+1}\right)^{1+2\eta} + \left(\frac{1}{k+1}\right)^{2}\right],

    g_k = -\left( \left(\frac{1}{k+1}\right)^{\frac{1}{2}+\eta},\ \left(\frac{1}{k+1}\right)^{2} \right)^T,
    H_k = \mathrm{diag}\!\left(1,\ \left(\frac{1}{k+1}\right)^{2}\right).

Use the previous example for the first component x^1 (with \alpha_k = 1) and Hermite interpolation on [x_k, x_{k+1}] for the second component x^2

    \Longrightarrow   k \ge \epsilon^{-(2-\tau)}   such that \|g_k\| \le \epsilon.

Improved complexity for cubic regularization

Assume H is globally Lipschitz continuous with Lipschitz constant 2\sigma. Taylor expansion, Cauchy-Schwarz and the Lipschitz bound give

    f(x+s) = f(x) + s^T g(x) + \tfrac{1}{2} s^T H(x) s + \int_0^1 (1-\alpha)\, s^T \bigl[H(x+\alpha s) - H(x)\bigr] s \, d\alpha
           \le f(x) + s^T g(x) + \tfrac{1}{2} s^T H(x) s + \tfrac{1}{3}\sigma \|s\|_2^3 \;=:\; m(s)

\Longrightarrow reducing m from s = 0 decreases f, since m(0) = f(x).

    compute s_k \longrightarrow \min_s m_k(s).

A. Griewank (1981, technical report); Y. Nesterov & B. Polyak (2006); M. Weiser, P. Deuflhard & B. Erdmann (2007); C. Cartis, N. I. M. Gould & Ph. L. Toint (2009, 2010).

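A small numeric sanity check of the overestimation property, on a one-dimensional example of my choosing (not from the talk): for f(x) = cos x the Hessian -cos x is 1-Lipschitz, so in the slide's convention 2σ = 1, i.e. σ = 1/2.

    import numpy as np

    f = np.cos
    g = lambda x: -np.sin(x)
    H = lambda x: -np.cos(x)
    sigma = 0.5                      # Lipschitz constant of H is 2*sigma = 1

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, s = rng.uniform(-5.0, 5.0, size=2)
        m = f(x) + s * g(x) + 0.5 * s**2 * H(x) + sigma / 3.0 * abs(s)**3
        assert f(x + s) <= m + 1e-12    # the cubic model overestimates f
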
Minimizing the cubic model

f nonconvex \longrightarrow m_k(s) may be nonconvex!

    m(s) \equiv f + s^T g + \tfrac{1}{2} s^T H s + \tfrac{1}{3}\sigma \|s\|_2^3

[Figure: contours of a nonconvex cubic model with several local minimizers.]

Necessary and sufficient optimality: any global minimizer s_* of m satisfies

    (H + \lambda_* I) s_* = -g,   \lambda_* = \sigma \|s_*\|_2,   and   H + \lambda_* I positive semidefinite.

[cf. Nesterov et al. (2006), CGT (2009)]

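A sketch of how this characterization is typically exploited computationally: in the "easy case" (g ≠ 0 and the optimal multiplier lies strictly to the right of max(0, -λ_min(H))), the scalar function λ ↦ σ‖(H+λI)^{-1}g‖ - λ is strictly decreasing, so its root can be bracketed and found with a safeguarded one-dimensional root-finder. The code below is a generic dense-algebra illustration (names and tolerances are mine, the hard case is ignored), not the authors' solver.

    import numpy as np
    from scipy.optimize import brentq

    def cubic_model_min(g, H, sigma):
        """Global minimizer of m(s) = g's + 0.5 s'Hs + (sigma/3)||s||^3 via the
        characterization (H + lam I)s = -g, lam = sigma ||s||, H + lam I >= 0.
        Assumes the 'easy case' and g != 0."""
        lam_lo = max(0.0, -np.linalg.eigvalsh(H)[0]) + 1e-12

        def phi(lam):                      # sigma*||s(lam)|| - lam, decreasing in lam
            s = np.linalg.solve(H + lam * np.eye(len(g)), -g)
            return sigma * np.linalg.norm(s) - lam

        lam_hi = lam_lo + 1.0
        while phi(lam_hi) > 0.0:           # bracket the root
            lam_hi *= 2.0
        lam = brentq(phi, lam_lo, lam_hi)
        return np.linalg.solve(H + lam * np.eye(len(g)), -g)

    # illustrative use on made-up data; with sigma = 1, ||s|| equals the multiplier lam
    H = np.array([[1.0, 0.0], [0.0, -2.0]]); g = np.array([1.0, 1.0])
    s = cubic_model_min(g, H, 1.0)
    print(s, np.linalg.norm(s))
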
Adaptive cubic regularization

- Assume f \in C^1(\mathbb{R}^n) (maybe C^2(\mathbb{R}^n)); f, g (and H) at x_k are denoted f_k, g_k (and H_k).
- Use a symmetric approximation B_k (to H_k), with B_k bounded above independently of k.
- Use the cubic regularization model at x_k,

    m_k(s) \equiv f_k + s^T g_k + \tfrac{1}{2} s^T B_k s + \tfrac{1}{3}\sigma_k \|s\|_2^3,

  where \sigma_k is the iteration-dependent regularization weight.

Adaptive Regularization with Cubics (ARC)

Given x_0 and \sigma_0 > 0, for k = 0, 1, ... until convergence:

- compute a step s_k for which m_k(s_k) \le m_k(s_k^C), where the Cauchy point is

    s_k^C = -\alpha_k^C g_k   with   \alpha_k^C = \arg\min_{\alpha \in \mathbb{R}_+} m_k(-\alpha g_k);

- compute

    \rho_k = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - m_k(s_k)};

- set x_{k+1} = x_k + s_k if \rho_k > 0.1, and x_{k+1} = x_k otherwise;

- given \gamma_2 \ge \gamma_1 > 1, set

    \sigma_{k+1} \in (0, \sigma_k]                          [e.g. = \tfrac{1}{2}\sigma_k]   if \rho_k > 0.9            (very successful)
    \sigma_{k+1} \in [\sigma_k, \gamma_1\sigma_k]           [e.g. = \sigma_k]               if 0.1 \le \rho_k \le 0.9  (successful)
    \sigma_{k+1} \in [\gamma_1\sigma_k, \gamma_2\sigma_k]   [e.g. = 2\sigma_k]              otherwise                  (unsuccessful)

[cf. trust-region methods]

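A minimal runnable sketch of this loop (not the authors' code): it takes the Cauchy point itself as s_k, which satisfies m_k(s_k) ≤ m_k(s_k^C) with equality, and computes α_k^C exactly from the quadratic equation for the stationary point of α ↦ m_k(-α g_k). The thresholds 0.1 and 0.9 and the halving/doubling of σ_k follow the slide; the stopping test on ‖g_k‖, the function names and the toy example are mine.

    import numpy as np

    def arc_cauchy(f, grad, hess, x0, sigma0=1.0, eps=1e-6, max_iter=10_000):
        """Basic ARC with the Cauchy point as the step s_k."""
        x, sigma = np.asarray(x0, float), sigma0
        for _ in range(max_iter):
            g, B = grad(x), hess(x)
            gn = np.linalg.norm(g)
            if gn <= eps:
                break
            # m(-a g) = f - a||g||^2 + 0.5 a^2 g'Bg + (sigma/3) a^3 ||g||^3;
            # its derivative in a is the quadratic  -||g||^2 + a g'Bg + sigma a^2 ||g||^3
            c2, c1, c0 = sigma * gn**3, g @ B @ g, -gn**2
            a = (-c1 + np.sqrt(c1**2 - 4.0 * c2 * c0)) / (2.0 * c2)   # positive root
            s = -a * g
            m = f(x) + g @ s + 0.5 * s @ B @ s + sigma / 3.0 * np.linalg.norm(s)**3
            rho = (f(x) - f(x + s)) / (f(x) - m)
            if rho > 0.1:                                   # accept the step
                x = x + s
            sigma = 0.5 * sigma if rho > 0.9 else (sigma if rho >= 0.1 else 2.0 * sigma)
        return x

    # toy nonconvex example (mine, not from the talk): converges towards (pi, 0)
    f  = lambda x: np.cos(x[0]) + 0.5 * x[1]**2
    g  = lambda x: np.array([-np.sin(x[0]), x[1]])
    Hs = lambda x: np.array([[-np.cos(x[0]), 0.0], [0.0, 1.0]])
    print(arc_cauchy(f, g, Hs, x0=[1.0, 1.0]))
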
Basic ARC: global convergence and complexity

- f bounded below \Longrightarrow \liminf_{k\to\infty} \|g_k\| = 0.
- additionally, if g is uniformly continuous \Longrightarrow \lim_{k\to\infty} \|g_k\| = 0.
- additionally, if g is Lipschitz continuous \Longrightarrow O(\epsilon^{-2}) function/gradient evaluations to ensure \|g_k\| \le \epsilon.   [cf. steepest-descent-like methods]

Second-order ARC: beyond the Cauchy point

m_k(s_k) \le m_k(s_k^C) is achieved if:

- s_k = s_k^C \longrightarrow inefficient;
- s_k = the global \arg\min_{s \in \mathbb{R}^n} m_k(s) \longrightarrow expensive (large-scale).

[Figure: model and function contours with the local minima, the steepest-descent direction and the exact subproblem solution.]

ARC_S: s_k = global minimizer of m_k(s) over s \in S \le \mathbb{R}^n, where g_k \in S
\longrightarrow increase the subspaces until the termination criterion

    \|\nabla_s m_k(s_k)\| \le \min(1, \|s_k\|)\,\|g_k\|

is satisfied.

- Superlinear local rate with the Dennis-Moré condition on B_k; Q-quadratic if H is locally Lipschitz continuous at x_*.
- If H is globally Lipschitz continuous: global convergence to second-order critical points in the subspaces.

Minimizing the cubic model over subspaces

    m(s) \equiv f + s^T g + \tfrac{1}{2} s^T B s + \tfrac{1}{3}\sigma \|s\|_2^3

Seek the global minimizer of m(s) in a j-dimensional (j \ll n) subspace S \subseteq \mathbb{R}^n with g \in S.

Q an orthogonal basis for S \Longrightarrow s = Qu, where

    u = \arg\min_{u \in \mathbb{R}^j} \; f + u^T (Q^T g) + \tfrac{1}{2} u^T (Q^T B Q) u + \tfrac{1}{3}\sigma \|u\|_2^3

\Longrightarrow use the secular equation to find u.

If S is the Krylov space generated by \{B^i g\}_{i=0}^{j-1} \Longrightarrow Q^T B Q = T, tridiagonal \Longrightarrow can factor T + \lambda I to solve the secular equation even if j is large.

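A sketch of the subspace reduction: since Q has orthonormal columns, ‖Qu‖ = ‖u‖ and the reduced problem has exactly the same cubic form, so any dense solver for the full-space model (for instance the cubic_model_min sketch given earlier) can be applied to it. The Krylov basis below is built naively by QR of the power basis, which is adequate only for small j and generic g; a Lanczos process, which makes Q^T B Q tridiagonal, would be used in practice. Names are mine, not from the talk.

    import numpy as np

    def cubic_model_min_subspace(g, B, sigma, j, solve_full):
        """Minimize m(s) over span{g, Bg, ..., B^{j-1} g}; solve_full(g, B, sigma)
        must return the global minimizer of the (reduced, dense) cubic model."""
        K = np.column_stack([np.linalg.matrix_power(B, i) @ g for i in range(j)])
        Q, _ = np.linalg.qr(K)                    # orthonormal basis; ||Qu|| = ||u||
        u = solve_full(Q.T @ g, Q.T @ B @ Q, sigma)
        return Q @ u
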
Average-case performance of ARC variants

Preliminary numerical experience, using Matlab.

[Figure: performance profile of iteration counts on 131 CUTEr problems, plotting the fraction of problems for which each method is within a factor \alpha of the best, for \alpha \in [1, 5]; compared: ARC with the g, s and s/\sigma stopping rules (3 failures each) and a trust-region method (8 failures).]

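The plot above is a standard performance profile in the sense of Dolan and Moré; as a reminder of how such a curve is computed, here is a minimal sketch working from a problems-by-solvers table of iteration counts (the data and names below are made up).

    import numpy as np

    def performance_profile(counts, alphas):
        """counts: (n_problems, n_solvers) array of iteration counts, np.inf for failures.
        Returns, per solver and per alpha, the fraction of problems on which that solver's
        count is within a factor alpha of the best count for the problem."""
        ratios = counts / counts.min(axis=1, keepdims=True)
        return np.array([[np.mean(ratios[:, s] <= a) for a in alphas]
                         for s in range(counts.shape[1])])

    counts = np.array([[12, 15, 30], [8, np.inf, 9], [20, 18, 25]], float)   # toy data
    print(performance_profile(counts, alphas=[1.0, 1.5, 2.0, 3.0]))
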
Worst-case performance of ARC_S

How many function and gradient evaluations are needed to ensure \|g_k\| \le \epsilon?

- If H is Lipschitz continuous on the iterates' path and \|(B_k - H_k) s_k\| = O(\|s_k\|^2), then ARC_S requires at most

    \left\lceil \kappa_S \cdot \epsilon^{-3/2} \right\rceil   function evaluations.

- To also ensure, for the same k as above, that -\lambda_{\min}(Q_k^T B_k Q_k) \le \epsilon, where Q_k is an orthogonal basis matrix of the minimization subspace, the ARC_S algorithm requires at most

    \kappa_{\rm curv} \cdot \epsilon^{-3}   function evaluations.      [cf. Nesterov & Polyak]

ARC_S: first-order worst-case bound

If H is Lipschitz continuous on the iterates' path and \|(B_k - H_k) s_k\| = O(\|s_k\|^2), then ARC_S requires at most

    \left\lceil \kappa_S \cdot \epsilon^{-3/2} \right\rceil   function evaluations.

Proof sketch. For all successful k,

    f(x_k) - f(x_{k+1}) \ge \eta_1 \bigl[f(x_k) - m_k(s_k)\bigr] \ge \frac{\eta_1}{6}\,\sigma_k \|s_k\|^3,

    \|s_k\| \ge C \|g_{k+1}\|^{1/2}   and   \sigma_k \ge \sigma_{\min} > 0.

\Longrightarrow

    f(x_0) - f_{\rm low} \ge \sum_{k \in S_j} \bigl[f(x_k) - f(x_{k+1})\bigr] \ge \frac{\eta_1 \sigma_{\min} C^3}{6} \sum_{k \in S_j} \|g_{k+1}\|^{3/2}.

While \|g_k\| \ge \epsilon,

    f(x_0) - f_{\rm low} \ge \frac{\eta_1 \sigma_{\min} C^3}{6}\, |S_j|\, \epsilon^{3/2}
    \quad\Longrightarrow\quad
    |S_j| \le \frac{6\bigl(f(x_0) - f_{\rm low}\bigr)}{\eta_1 \sigma_{\min} C^3}\, \epsilon^{-3/2},

where S_j denotes the set of successful iterations among the first j.

ARC_S: the worst-case first-order bound is sharp

Unidimensional example: x_0 = 0, \eta = \eta(\tau) such that \frac{1}{3} + \eta = \frac{1}{3-2\tau}; for k \ge 0,

    x_{k+1} = x_k + \left(\frac{1}{k+1}\right)^{\frac{1}{3}+\eta},

    f_0 = \tfrac{2}{3}\,\zeta(1+3\eta),   f_{k+1} = f_k - \tfrac{2}{3}\left(\frac{1}{k+1}\right)^{1+3\eta},

    g_k = -\left(\frac{1}{k+1}\right)^{\frac{2}{3}+2\eta},   H_k = 0   and   \sigma_k = 1.

Use Hermite interpolation on [x_k, x_{k+1}] to construct f such that f(x_k) = f_k, g(x_k) = g_k and H(x_k) = H_k.

    \Longrightarrow   k = \left\lceil \epsilon^{-\frac{3}{2}+\tau} \right\rceil   such that |g_k| \le \epsilon.

• Here, s_k is the global minimizer of the cubic model m_k(s) over s \in \mathbb{R}^n.

ARC_S: the first-order bound is sharp ...

Given x_0, for any \epsilon > 0 and \tau > 0, ARC_S applied to the f below takes precisely

    \left\lceil \epsilon^{-\frac{3}{2}+\tau} \right\rceil   function evaluations

to generate |g(x_k)| \le \epsilon.

[Figures: the objective function f and its gradient g.]

ARC_S: the first-order bound is sharp ...

- f \in C^2(\mathbb{R}), bounded below by zero.
- The Hessian H is bounded above and Lipschitz continuous on the path of the iterates.
- f(x_k) \to 0 = \inf_{x \in \mathbb{R}} f(x) as k \to \infty.

[Figures: the Hessian of f and the third derivative of f.]

A general class of methods and objectives

Class of methods M.\alpha:  x_{k+1} = x_k + s_k, k \ge 0, where

    (H_k + \lambda_k I) s_k = -g_k   with \lambda_k \ge 0 and H_k + \lambda_k I \succeq 0,
    \|s_k\| \le \kappa_s   and   \lambda_k \le \kappa_\lambda \|s_k\|^\alpha,   for some \alpha \in [0, 1].

Class of objectives A.\alpha: f \in C^2 bounded below; g globally Lipschitz continuous and H \alpha-Hölder continuous on the path of the iterates, \alpha \in [0, 1]; \alpha = 1: H Lipschitz continuous. A.1 \subset A.\alpha.

Properties of the class M.\alpha: f \in A.\alpha and M \in M.\alpha \Longrightarrow

    \|s_k\| \ge C\,\|g_{k+1}\|^{1/(1+\alpha)},
    \|g_{k+1}\| \le c\,\|g_k\|^{1+\alpha}   \Longrightarrow   lower bound on the step.

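A sketch of one standard way to obtain the step lower bound (under the A.\alpha and M.\alpha assumptions, with L_\alpha the Hölder constant of H): since (H_k + \lambda_k I)s_k = -g_k gives g_k + H_k s_k = -\lambda_k s_k,

    \|g_{k+1}\| \le \|g(x_k+s_k) - g_k - H_k s_k\| + \|g_k + H_k s_k\|
                \le \frac{L_\alpha}{1+\alpha}\,\|s_k\|^{1+\alpha} + \lambda_k \|s_k\|
                \le \Bigl(\frac{L_\alpha}{1+\alpha} + \kappa_\lambda\Bigr)\,\|s_k\|^{1+\alpha},

using the Hölder continuity of H for the Taylor remainder and \lambda_k \le \kappa_\lambda\|s_k\|^\alpha; hence

    \|s_k\| \ge C\,\|g_{k+1}\|^{1/(1+\alpha)}   with   C = \Bigl(\frac{L_\alpha}{1+\alpha} + \kappa_\lambda\Bigr)^{-1/(1+\alpha)}.
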
Examples of methods in M.\alpha

Class of methods M.\alpha:  x_{k+1} = x_k + s_k, k \ge 0, where

    (H_k + \lambda_k I) s_k = -g_k   with \lambda_k \ge 0 and H_k + \lambda_k I \succeq 0,
    \|s_k\| \le \kappa_s   and   \lambda_k \le \kappa_\lambda \|s_k\|^\alpha.

Step calculation and sufficient decrease: make use of the model

    m_k(s) = f_k + g_k^T s + \tfrac{1}{2} s^T (H_k + \beta_k I) s,   with \beta_k = \beta_k(s) \ge 0 and \beta_k \le \lambda_k.

Examples of methods in M.\alpha (applied to functions in A.\alpha):

- Newton's method: \lambda_k = 0, \beta_k = 0 and \alpha \in [0, 1].
- Regularization methods: \lambda_k = \sigma_k \|s_k\|^\alpha, \beta_k = \frac{2}{2+\alpha}\,\sigma_k \|s_k\|^\alpha
  \Longrightarrow \alpha = 1: cubic regularization.

Examples of methods in M.\alpha ...

Examples of methods in M.\alpha (applied to functions in A.\alpha):

- Goldfeld-Quandt-Trotter: \beta_k = 0 and

    \lambda_k = 0                                                       when \lambda_{\min}(H_k) \ge R_k \|g_k\|^{\alpha/(1+\alpha)},
    \lambda_k = -\lambda_{\min}(H_k) + R_k \|g_k\|^{\alpha/(1+\alpha)}   otherwise,

  with R_k > 0.

- Trust-region methods, when \lambda_k is at least uniformly bounded above; \beta_k = 0.

Remarks:

- \lambda_k \le \kappa_\lambda \|s_k\|^\alpha \Longrightarrow \lambda_k + \lambda_{\min}(H_k) \le \kappa \max\{|\lambda_{\min}(H_k)|, \|g_k\|^{\alpha/(1+\alpha)}\}.
- Choose \sigma_k \ge \sigma_{\min} > 0 and R_k \ge R_{\min} > 0.

(Order) optimality of regularization methods

Theorem: Let M \in M.\alpha. Then there exists a function f^M \in A.\alpha such that M takes (at least)

    \epsilon^{-\frac{2+\alpha}{1+\alpha}+\tau}

iterations/function evaluations to generate \|g_k\| \le \epsilon, for any arbitrarily small \tau > 0.

\Longrightarrow The (2+\alpha)-regularization method is optimal for the class M.\alpha when applied to functions in A.\alpha, as its complexity upper bound coincides in order with this lower bound.

Extension to examples with finite minimizers: possible.

Construction of the function f^M

Assume {f_k} given; x_0 = 0 and, for k \ge 0,

    g_k = -\left(\frac{1}{k+1}\right)^{t},   for some t \in (0, 1];

    \kappa_h |g_k|^{\alpha/(1+\alpha)} \ge H_k \ge -\kappa_h |g_k|^{\alpha/(1+\alpha)};

    x_{k+1} - x_k = s_k = -\frac{g_k}{H_k + \lambda_k},

with H_k + \lambda_k > 0 and \lambda_k \ge 0 satisfying the M.\alpha properties.

Use Hermite interpolation and let f^M(x) = p_k(x - x_k) + f_{k+1}, for x \in [x_k, x_{k+1}] and k \ge 0, where p_k is a 5th-degree polynomial satisfying

    p_k(0) = f_k - f_{k+1}   and   p_k(s_k) = 0;
    p_k'(0) = g_k            and   p_k'(s_k) = g_{k+1};
    p_k''(0) = H_k           and   p_k''(s_k) = H_{k+1}.

A lot more details ...

ARC: improved complexity for structured problems

f has bounded level sets and, at a local minimizer x_*, H(x_*) \succ 0
\Longrightarrow ARC has a Q-quadratic local rate when sufficiently close to x_*
\Longrightarrow \log|\log\epsilon| iteration complexity asymptotically, near x_*.

Further, one can estimate \delta such that \|g_{k+1}\| \le \delta \|g_k\|^2 on N := N(x_*) \cap \{x : \|g(x)\| \le 1/\delta\}, on which quadratic convergence takes place.

If there is no x \notin N such that \epsilon \le \|g(x)\| \le 1/\delta, then \|g_k\| \le \epsilon requires at most

    \left\lceil \kappa_1 \cdot \delta^{3/2} + \kappa_2 \cdot \log|\log\epsilon| \right\rceil   function evaluations.

Example: the Rosenbrock function.

- Improved efficiency bounds for nonconvex problems based on the gradient's "phases" \longrightarrow additive complexity bounds.

Second-order optimality complexity bounds

... are also tight for ARC and trust-region methods.

- O(\epsilon^{-3}) evaluations for ARC and trust-region methods to ensure both \|g_k\| \le \epsilon and \lambda_{\min}(H_k) \ge -\epsilon.
- This bound is tight for each method.

[Figures: the gradient g and the Hessian H of the example.]

First-order finite-difference and derivative-free ARC

- Forward gradient differences to construct an approximate Hessian B_k \longrightarrow O\bigl(n(\epsilon^{-3/2} + |\log\epsilon|)\bigr) gradient evaluations and O\bigl(\epsilon^{-3/2}\bigr) function evaluations for \|g_k\| \le \epsilon.
- Central-difference scheme for approximating g_k, plus function finite-differences for B_k with the same stepsize (attention to the termination criterion) \longrightarrow O\bigl(n^2(\epsilon^{-3/2} + |\log\epsilon|)\bigr) function evaluations for \|g_k\| \le \epsilon.

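For concreteness, minimal sketches of the two finite-difference ingredients mentioned above (a Hessian approximation from forward gradient differences, and a central-difference gradient). The fixed stepsize h is purely illustrative; in the actual method the stepsize is tied to the accuracy and termination requirements.

    import numpy as np

    def hess_from_grad_fd(grad, x, h=1e-5):
        """Approximate Hessian from forward differences of the gradient (n extra gradient calls)."""
        x = np.asarray(x, float)
        n = x.size
        g0 = grad(x)
        B = np.empty((n, n))
        for i in range(n):
            e = np.zeros(n); e[i] = h
            B[:, i] = (grad(x + e) - g0) / h
        return 0.5 * (B + B.T)               # symmetrize

    def grad_central_fd(f, x, h=1e-5):
        """Approximate gradient from central differences (2n function calls)."""
        x = np.asarray(x, float)
        n = x.size
        g = np.empty(n)
        for i in range(n):
            e = np.zeros(n); e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        return g
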
Evaluation complexity of constrained problems

Consider the general constrained nonlinear programming problem

    \min_{x \in \mathbb{R}^n} f(x)   subject to   c_E(x) = 0,\ c_I(x) \ge 0,

where f and c_{E,I} : \mathbb{R}^n \to \mathbb{R}^{m,p} are smooth and nonconvex.

Complexity of computing an (approximate) KKT point? (The question is not restricted to cubic regularization algorithms.)

A detour: minimizing nonsmooth composite functions

    \min_{x \in \mathbb{R}^n} f(x) + h(c(x)),

where f : \mathbb{R}^n \to \mathbb{R} and c : \mathbb{R}^n \to \mathbb{R}^m are smooth and nonconvex, and h : \mathbb{R}^m \to \mathbb{R} is nonsmooth but convex (e.g. h = \|\cdot\|).

[considered by Nesterov (2006, 2007) and CGT (2011)]

Minimizing a nonsmooth composite function

    \min_{x \in \mathbb{R}^n} f(x) + h(c(x))

First-order method: compute a step s_k by solving the (convex) problem

    \min_{\|s\| \le \Delta_k} \; l(x_k, s) = f(x_k) + g(x_k)^T s + h\bigl(c(x_k) + J(x_k) s\bigr),

for some trust-region radius \Delta_k (also possible using quadratic regularization).

Main result: assume f, c, h are globally Lipschitz continuous. Then the "algorithm" takes at most O(\epsilon^{-2}) problem evaluations to achieve \Psi(x_k) \le \epsilon, where \Psi(x_k) is the first-order criticality measure

    \Psi(x_k) = l(x_k, 0) - \min_{\|s\| \le 1} l(x_k, s).

A first-order algorithm for EC-NLO

Consider now

    \min_{x \in \mathbb{R}^n} f(x)   subject to   c(x) = 0.

Idea for a first-order algorithm:

- get feasible (if possible) by minimizing \|c(x)\|;
- track the trajectory

    T(t) = \{x \in \mathbb{R}^n : c(x) = 0 \text{ and } f(x) = t\},

  for decreasing values of t from some t_0 (corresponding to the first feasible iterate).

A first-order algorithm for EC-NLO ...

A Short-Step Steepest-Descent (SSSD) algorithm:

- feasibility: apply nonsmooth composite minimization to \min_{x \in \mathbb{R}^n} \|c(x)\|
  \Longrightarrow at most O(\epsilon^{-2}) function evaluations;
- tracking: successively apply one (successful) step of nonsmooth composite minimization to

    \min_{x \in \mathbb{R}^n} \Phi(x) = \|c(x)\| + |f(x) - t|,

  and decrease t (proportionally to the decrease in \Phi(x))
  \Longrightarrow at most O(\epsilon^{-2}) problem evaluations.

A complexity result for EC-NLO

Assume that f, its gradient g, the constraints c and the Jacobian J are globally Lipschitz continuous, and that f is bounded below and above in an \epsilon-neighbourhood of feasibility. Then the SSSD algorithm takes at most

    O(\epsilon^{-2})   problem evaluations

to find an iterate x_k with either

    \|c(x_k)\| \le \epsilon   and   \|J(x_k)^T y + g(x_k)\| \le \epsilon, for some y,

or

    \|c(x_k)\| > \kappa_f \epsilon   and   \|J(x_k)^T z\| \le \epsilon, for some z,

for some user-defined \kappa_f > 0.

- Also applies to inequality-constrained problems: replace \|c(x)\| by \|\min(c(x), 0)\|.

Evaluation complexity of constrained problems ...

A (first-order) exact penalty method for EC-NLO: to generate an approximate KKT point (within \epsilon), it requires at most

- O(\epsilon^{-2}) problem evaluations when the penalty parameter is bounded,
- O(\epsilon^{-4}) problem evaluations otherwise.

\longrightarrow Use a first-order trust-region or quadratic-regularization method for minimizing the composite nonsmooth penalty function

    \Phi_\rho(x) = f(x) + \rho \|c(x)\|.

- Also applies to inequality-constrained problems: replace \|c(x)\| by \|\min(c(x), 0)\|.

Evaluation complexity of constrained problems ...

Second-order methods for nonlinear (nonconvex) equality- and inequality-constrained smooth problems:

\longrightarrow ARC variants for problems with convex constraints and nonconvex objective have been considered: at most O(\epsilon^{-3/2}) problem evaluations are required for approximate first-order optimality.
\longrightarrow General (nonconvex) constraints? (work in progress)

Remark: often, the subproblem solution does not require additional problem evaluations \longrightarrow the complexity bounds ignore the cost of solving the subproblem. Then the evaluation cost of constrained optimization is the same as that of unconstrained optimization.

Conclusions and work in progress

- Algorithm design profits from complexity analysis.
- Problem-dimension dependence of complexity bounds \longrightarrow Jarre's example.
- Cubic regularization: the next generation of optimization software?
- Function-evaluation complexity of second-order methods for nonconvex constrained problems.

Some references:

• CGT, On the complexity of steepest descent, Newton's and regularized Newton's methods for unconstrained optimization. SIAM Journal on Optimization, 2010.
• CGT, Optimal Newton-type methods for nonconvex smooth optimization, 2011 (Optimization Online).
• CGT, Adaptive cubic regularization methods for unconstrained optimization, Parts I and II. Mathematical Programming, 2010, 2011.