Interior-Point Theory for Convex Optimization

Robert M. Freund

May, 2014

© 2014 Massachusetts Institute of Technology. All rights reserved.

1 Background

The material presented herein is based on the following two research texts: Interior-Point Polynomial Algorithms in Convex Programming by Yurii Nesterov and Arkadii Nemirovskii, SIAM 1994, and A Mathematical View of Interior-Point Methods in Convex Optimization by James Renegar, SIAM 2001.

2 Barrier Scheme for Solving Convex Optimization

Our problem of interest is
\[ P: \ \text{minimize}_x \ c^T x \quad \text{s.t.} \ x \in S , \]
where $S$ is some closed convex set, and we denote the optimal objective value by $V^*$. Let $f(\cdot)$ be a barrier function for $S$, namely $f(\cdot)$ satisfies: (a) $f(\cdot)$ is strictly convex on its domain $D_f := \operatorname{int} S$, and (b) $f(x) \to \infty$ as $x \to \partial S$. The idea of the barrier method is to dissuade the algorithm from computing points too close to $\partial S$, effectively eliminating the complicating factors of dealing with $\partial S$. For every value of $\mu > 0$ we create the barrier problem:
\[ P_\mu: \ \text{minimize}_x \ \mu c^T x + f(x) \quad \text{s.t.} \ x \in D_f . \]
Note that $P_\mu$ is effectively unconstrained, since the boundary of the feasible region will never be encountered. The solution of $P_\mu$ is denoted $z(\mu)$:
\[ z(\mu) := \arg\min_x \{ \mu c^T x + f(x) : x \in D_f \} . \]

Intuitively, as $\mu \to \infty$, the impact of the barrier function on the solution of $P_\mu$ should become less and less, so we should have $c^T z(\mu) \to V^*$ as $\mu \to \infty$. Presuming this is the case, the barrier scheme tries to use Newton's method to solve for approximate solutions $x^i$ of $P_{\mu^i}$ for an increasing sequence of values $\mu^i \to \infty$. To be more specific about how the barrier scheme might work, let us assume that at each iteration we have some value $x \in D_f$ that is an approximate solution of $P_\mu$ for a given value $\mu > 0$. (A precise definition of "is an approximate solution of $P_\mu$" will be developed later.) We then increase the barrier parameter $\mu$ by a multiplicative factor $\alpha > 1$: $\hat\mu \leftarrow \alpha\mu$. We then take a Newton step at $x$ for the problem $P_{\hat\mu}$ to obtain a new point $\hat x$ that we would like to be an approximate solution of $P_{\hat\mu}$. If so, we can continue the scheme inductively. We typically use $g(\cdot)$ and $H(\cdot)$ to denote the gradient and Hessian of $f(\cdot)$. Note that the Newton iterate for $P_{\hat\mu}$ has the formula:
\[ \hat x \leftarrow x - H(x)^{-1}(\hat\mu c + g(x)) . \]
The general algorithmic scheme is presented in Algorithm 1.

3 Some Plain Facts

Let $f(\cdot): \mathbb{R}^n \to \mathbb{R}$ be a twice-differentiable function. We typically use $g(\cdot)$ and $H(\cdot)$ to denote the gradient and Hessian of $f(\cdot)$. Here are four facts about integrals and derivatives:

Fact 3.1 $g(y) = g(x) + \int_0^1 H(x + t(y-x))(y-x)\, dt$


Algorithm 1 General Barrier Scheme

Initialize. Initialize with $\mu^0 > 0$ and $x^0 \in D_f$ that is "an approximate solution of $P_{\mu^0}$." $i \leftarrow 0$. Define $\alpha > 1$.

At iteration $i$:

1. Current values. $\mu \leftarrow \mu^i$, $x \leftarrow x^i$.

2. Increase $\mu$ and take Newton step. $\hat\mu \leftarrow \alpha\mu$, $\hat x \leftarrow x - H(x)^{-1}(\hat\mu c + g(x))$.

3. Update values. $\mu^{i+1} \leftarrow \hat\mu$, $x^{i+1} \leftarrow \hat x$.
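To make the scheme concrete, here is a minimal Python sketch of Algorithm 1. The callables `grad` and `hess` (standing in for $g(\cdot)$ and $H(\cdot)$) and the starting data are assumptions supplied by the user; there is no stopping rule, since the precise meaning of "approximate solution" is only developed later.

```python
import numpy as np

def barrier_scheme(c, grad, hess, x0, mu0=1.0, alpha=1.1, num_iters=100):
    """General barrier scheme (Algorithm 1): repeatedly increase mu by the
    factor alpha > 1 and take one Newton step for the new barrier problem."""
    x, mu = np.asarray(x0, dtype=float).copy(), mu0
    for _ in range(num_iters):
        mu = alpha * mu                                  # mu-hat <- alpha * mu
        # Newton step for h(x) = mu * c^T x + f(x):
        #   x-hat <- x - H(x)^{-1} (mu * c + g(x))
        x = x - np.linalg.solve(hess(x), mu * c + grad(x))
    return x, mu
```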


Fact 3.2 Let $h(t) := f(x + tv)$. Then

(i) $h'(t) = g(x+tv)^T v$, and

(ii) $h''(t) = v^T H(x+tv)\, v$.

Fact 3.3 $f(y) = f(x) + g(x)^T(y-x) + \frac12 (y-x)^T H(x)(y-x) + \int_0^1 \int_0^t (y-x)^T \left[ H(x+s(y-x)) - H(x) \right](y-x)\, ds\, dt$

Fact 3.4 $\displaystyle\int_0^r \left( \frac{1}{(1-at)^2} - 1 \right) dt = \frac{ar^2}{1-ar}$

This follows by observing that $\int \frac{1}{(1-at)^2}\, dt = \frac{1}{a(1-at)}$.

We also present five additional facts that we will need in our analyses.

Fact 3.5 Suppose $f(\cdot)$ is a convex function on $\mathbb{R}^n$, $S \subset \mathbb{R}^n$ is a compact convex set, and suppose $x \in \operatorname{int} S$ satisfies $f(x) \le f(y)$ for all $y \in \partial S$. Then $f(\cdot)$ attains its global minimum at a point of $S$ (and, since $f(\cdot)$ is convex on $\mathbb{R}^n$, that point is a global minimizer of $f(\cdot)$).

Fact 3.6 Let $\|v\| := \sqrt{v^T v}$ be the Euclidean norm. Let $\lambda_1 \le \ldots \le \lambda_n$ be the ordered eigenvalues of the symmetric matrix $M$, and define $\|M\| := \max\{\|Mx\| : \|x\| \le 1\}$. Then $\|M\| = \max_i |\lambda_i| = \max\{|\lambda_n|, |\lambda_1|\}$.

Fact 3.7 Suppose $A, B$ are symmetric and $A + B = \theta I$ for some $\theta \in \mathbb{R}$. Then $AB = BA$. Furthermore, if $A \succ 0$ and $B \succ 0$, then $A^\alpha B^\beta = B^\beta A^\alpha$ for all $\alpha, \beta \ge 0$. To see why this is true, decompose $A = PDP^T$ where $P$ is orthonormal ($P^T = P^{-1}$) and $D$ is diagonal. Then $B = P(\theta I - D)P^T$, whereby
\[ A^\alpha B^\beta = PD^\alpha P^T P(\theta I - D)^\beta P^T = PD^\alpha(\theta I - D)^\beta P^T = P(\theta I - D)^\beta D^\alpha P^T = P(\theta I - D)^\beta P^T PD^\alpha P^T = B^\beta A^\alpha . \]

Fact 3.8 Suppose $\lambda_n \ge \ldots \ge \lambda_1 > 0$. Then $\max_i |\lambda_i - 1| \le \max\{\lambda_n - 1,\ 1/\lambda_1 - 1\}$.

Fact 3.9 Suppose $a, b, c, d > 0$. Then
\[ \min\left\{\frac{a}{b}, \frac{c}{d}\right\} \le \frac{a+c}{b+d} \le \max\left\{\frac{a}{b}, \frac{c}{d}\right\} . \]
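As a quick numerical sanity check (an aside, not part of the development), the sketch below verifies Facts 3.1 and 3.4 for the barrier $f(x) = -\sum_j \ln x_j$, whose gradient and Hessian are $g(x) = -1/x$ (componentwise) and $H(x) = \operatorname{diag}(1/x^2)$.

```python
import numpy as np

# Log barrier f(x) = -sum(log x_j): g(x) = -1/x (componentwise), H = diag(1/x^2).
g = lambda x: -1.0 / x
H = lambda x: np.diag(1.0 / x ** 2)

def trap(vals, dt):
    """Trapezoid rule along axis 0 for uniformly spaced samples."""
    return dt * (vals[1:-1].sum(axis=0) + 0.5 * (vals[0] + vals[-1]))

x = np.array([1.0, 2.0, 0.5])
y = np.array([1.5, 1.0, 0.8])
ts = np.linspace(0.0, 1.0, 20001)

# Fact 3.1: g(y) = g(x) + integral_0^1 H(x + t(y-x)) (y-x) dt.
vals = np.array([H(x + t * (y - x)) @ (y - x) for t in ts])
print(np.allclose(g(x) + trap(vals, ts[1] - ts[0]), g(y)))            # True

# Fact 3.4: integral_0^r (1/(1-at)^2 - 1) dt = a r^2 / (1 - a r), here with ar < 1.
a, r = 0.6, 0.9
ss = np.linspace(0.0, r, 20001)
vals = 1.0 / (1.0 - a * ss) ** 2 - 1.0
print(np.isclose(trap(vals, ss[1] - ss[0]), a * r ** 2 / (1.0 - a * r)))  # True
```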

4 Self-Concordant Functions and Properties

Let $f(\cdot)$ be a strictly convex twice-differentiable function defined on the open set $D_f := \operatorname{dom} f(\cdot)$, and let $\bar D_f := \operatorname{cl} D_f$. Consider $x \in D_f$. We will often abbreviate $H_x := H(x)$ for the Hessian at $x$. Here we assume that $H_x \succ 0$, whereby $H_x$ can be used to define the norm
\[ \|v\|_x := \sqrt{v^T H_x v} , \]
which is the "local norm" at $x$. Notice that
\[ \|v\|_x = \sqrt{v^T H_x v} = \|H_x^{\frac12} v\| , \]
where $\|w\| = \sqrt{w^T w}$ is the standard Euclidean ($L_2$) norm. Let
\[ B_x(x, 1) := \{ y : \|y - x\|_x < 1 \} . \]
This is called the open Dikin ball at $x$, after the Russian mathematician I. I. Dikin.

Definition 4.1 $f(\cdot)$ is said to be (strongly nondegenerate) self-concordant if for all $x \in D_f$ we have $B_x(x,1) \subset D_f$, and for all $y \in B_x(x,1)$ we have
\[ 1 - \|y-x\|_x \ \le\ \frac{\|v\|_y}{\|v\|_x} \ \le\ \frac{1}{1 - \|y-x\|_x} \]
for all $v \ne 0$. Let SC denote the class of all such functions.
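To make the local-norm machinery concrete, the following sketch (an illustration assuming the log barrier $f(x) = -\sum_j \ln x_j$, which is shown to be self-concordant below) evaluates $\|v\|_x$ and checks both inequalities of Definition 4.1 at randomly drawn points.

```python
import numpy as np

rng = np.random.default_rng(0)
H = lambda x: np.diag(1.0 / x ** 2)           # Hessian of f(x) = -sum(log x_j)
local_norm = lambda v, x: np.sqrt(v @ H(x) @ v)

x = rng.uniform(0.5, 2.0, size=5)
d = rng.standard_normal(5)
y = x + 0.9 * d / local_norm(d, x)            # so ||y - x||_x = 0.9 < 1
v = rng.standard_normal(5)

r = local_norm(y - x, x)
ratio = local_norm(v, y) / local_norm(v, x)
print(1 - r <= ratio <= 1 / (1 - r))          # True: Definition 4.1 holds
print(np.all(y > 0))                          # True: B_x(x,1) lies inside D_f
```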

Remark 1 The following are the most-used self-concordant functions:

(i) $f(x) = -\ln(x)$ for $x \in D_f = \{x \in \mathbb{R} : x > 0\}$,

(ii) $f(X) = -\ln\det(X)$ for $X \in D_f = \{X \in S^{k\times k} : X \succ 0\}$, and

(iii) $f(x) = -\ln\left(x_1^2 - \sum_{j=2}^n x_j^2\right)$ for $x \in D_f := \{x : \|(x_2,\ldots,x_n)\| < x_1\}$.

Before showing that these functions are self-concordant, let us see how we can combine self-concordant functions to obtain other self-concordant functions.

Proposition 4.1 (self-concordance under addition/intersection) Suppose that $f_i(\cdot) \in \mathrm{SC}$ with domain $D_i := D_{f_i}$ for $i = 1,2$, and suppose that $D := D_1 \cap D_2 \ne \emptyset$. Define $f(\cdot) = f_1(\cdot) + f_2(\cdot)$. Then $D_f = D$ and $f(\cdot) \in \mathrm{SC}$.

Proof: Consider $x \in D = D_1 \cap D_2$. Let $B_x^i(c, r)$ denote the Dikin ball centered at $c$ with radius $r$ defined by $f_i(\cdot)$, and let $\|\cdot\|_{x,i}$ denote the norm induced at $x$ by the Hessian $H_i(x)$ of $f_i(\cdot)$, for $i = 1,2$. Since $x \in D_i$ we have $B_x^i(x,1) \subset D_i$, and $\|v\|_x^2 = \|v\|_{x,1}^2 + \|v\|_{x,2}^2$ because $H(x) = H_1(x) + H_2(x)$. Therefore if $\|y-x\|_x < 1$ it follows that $\|y-x\|_{x,1} < 1$ and $\|y-x\|_{x,2} < 1$, whereby $y \in B_x^i(x,1) \subset D_i$ for $i = 1,2$, and hence $y \in D_1 \cap D_2 = D$. Also, for any $v \ne 0$, using Fact 3.9 we have
\[ \frac{\|v\|_y^2}{\|v\|_x^2} = \frac{\|v\|_{y,1}^2 + \|v\|_{y,2}^2}{\|v\|_{x,1}^2 + \|v\|_{x,2}^2} \le \max\left\{ \frac{\|v\|_{y,1}^2}{\|v\|_{x,1}^2},\, \frac{\|v\|_{y,2}^2}{\|v\|_{x,2}^2} \right\} \le \max\left\{ \left(\frac{1}{1-\|y-x\|_{x,1}}\right)^2, \left(\frac{1}{1-\|y-x\|_{x,2}}\right)^2 \right\} \le \left(\frac{1}{1-\|y-x\|_x}\right)^2 . \]
The virtually identical argument can also be applied to prove the "$\ge$" inequality of the definition of self-concordance, by replacing "max" with "min" above and applying the other inequality of Fact 3.9.

Proposition 4.2 (self-concordance under affine transformation) Let $A \in \mathbb{R}^{m\times n}$ satisfy $\operatorname{rank} A = n \le m$. Suppose that $f(\cdot) \in \mathrm{SC}$ with domain $D_f \subset \mathbb{R}^m$, and define $\hat f(\cdot)$ by $\hat f(x) = f(Ax - b)$. Then $\hat f(\cdot) \in \mathrm{SC}$ with domain $\hat D := \{x : Ax - b \in D_f\}$.

Proof: Consider $x \in \hat D$ and $s = Ax - b$. Letting $g(s)$ and $H(s)$ denote the gradient and Hessian of $f(\cdot)$ at $s$, and $\hat g(x)$ and $\hat H(x)$ the gradient and Hessian of $\hat f(\cdot)$ at $x$, we have $\hat g(x) = A^T g(s)$ and $\hat H(x) = A^T H(s) A$. Suppose that $\|y-x\|_x < 1$. Then defining $t := Ay - b$ we have
\[ 1 > \|y-x\|_x = \sqrt{(y-x)^T A^T H(s) A (y-x)} = \|t - s\|_s , \]
whereby $t \in D_f$ and so $y \in \hat D$. Therefore $B_x(x,1) \subset \hat D$. Also, for any $v \ne 0$, we have
\[ \frac{\|v\|_y}{\|v\|_x} = \frac{\sqrt{v^T A^T H(t) A v}}{\sqrt{v^T A^T H(s) A v}} = \frac{\|Av\|_t}{\|Av\|_s} \le \frac{1}{1 - \|s-t\|_s} = \frac{1}{1 - \|y-x\|_x} . \]
The exact same argument can also be applied to prove the "$\ge$" inequality of the definition of self-concordance.

Proposition 4.3 The three functions defined in Remark 1 are self-concordant.

Proof: We will prove that $f(X) := -\ln\det(X)$ is self-concordant on its domain $\{X \in S^{k\times k} : X \succ 0\}$. When $k = 1$, this is the logarithmic barrier function. Although it is true, we will not prove that $f(x) = -\ln(x_1^2 - \sum_{j=2}^n x_j^2)$ is a self-concordant barrier for the interior of the second-order cone $Q^n := \{x : \|(x_2,\ldots,x_n)\| \le x_1\}$, as this proof is arithmetically uninspiring.

Before we get started with the proof, we have to expand and amend our notation a bit. If we consider two vectors $x, y$ in the $n$-dimensional space $V$ (which we typically identify with $\mathbb{R}^n$), we use the standard inner product $x^T y$ to define the standard norm $\|v\| := \sqrt{v^T v}$. When we consider the vector space $V$ to be $S^{k\times k}$ (the set of symmetric matrices of order $k$), we need to define the standard inner product of two symmetric $k \times k$ matrices $X$ and $Y$ and also the standard norm on $V = S^{k\times k}$. The standard inner product in this case is the trace inner product:
\[ X \bullet Y := \operatorname{Tr}(XY) = \sum_{i=1}^k (XY)_{ii} = \sum_{i=1}^k \sum_{j=1}^k X_{ij} Y_{ij} . \]
The trace inner product has the following properties, which are elementary to establish:

1. $A \bullet BC = AB \bullet C$

2. $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$, whereby $A \bullet B = B \bullet A$

The trace inner product is used to define the standard norm:
\[ \|X\| := \sqrt{X \bullet X} . \]
This norm is also called the Frobenius norm of the matrix $X$. Notice that the Frobenius norm of $X$ is not the operator norm of $X$. For the remainder of this proof we use the trace inner product and the Frobenius norm, and the reader should ignore any mental mention of operator norms in this proof.

To prove $f(X) := -\ln\det(X)$ is self-concordant, let $X \succ 0$ be given, and let $Y \in B_X(X,1)$ and $V \in S^{k\times k}$ be given. We need to verify three statements:

1. $Y \succ 0$,

2. $\dfrac{\|V\|_Y}{\|V\|_X} \le \dfrac{1}{1 - \|Y-X\|_X}$, and

3. $\dfrac{\|V\|_Y}{\|V\|_X} \ge 1 - \|Y-X\|_X$.

To get started, direct expansion yields the following second-order expansion of $f(X)$:
\[ f(X + \Delta X) \approx f(X) - X^{-1} \bullet \Delta X + \tfrac12\, \Delta X \bullet X^{-1} \Delta X X^{-1} , \]
and indeed it is easy to derive:

• $g(X) = -X^{-1}$, and

• $H(X)\Delta X = X^{-1} \Delta X X^{-1}$.

It therefore follows that
\[ \|\Delta X\|_X = \sqrt{\operatorname{Tr}(\Delta X X^{-1} \Delta X X^{-1})} = \sqrt{\operatorname{Tr}(X^{-\frac12}\Delta X X^{-\frac12}\, X^{-\frac12}\Delta X X^{-\frac12})} = \sqrt{\operatorname{Tr}\left(\left[X^{-\frac12}\Delta X X^{-\frac12}\right]^2\right)} . \quad (1) \]
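As an aside, these formulas are easy to confirm numerically. The sketch below (an illustrative check, not part of the proof) compares $g(X) = -X^{-1}$ against a finite difference, and the local norm in (1) against an eigenvalue computation, for a random $X \succ 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
M = rng.standard_normal((k, k))
X = M @ M.T + k * np.eye(k)                          # a random X > 0 in S^{kxk}
D = rng.standard_normal((k, k)); D = (D + D.T) / 2   # a symmetric direction

f = lambda X: -np.log(np.linalg.det(X))
Xinv = np.linalg.inv(X)

# g(X) = -X^{-1}: check the directional derivative g(X) . D against a
# central finite difference of f.
eps = 1e-6
num = (f(X + eps * D) - f(X - eps * D)) / (2 * eps)
print(np.isclose(num, np.trace(-Xinv @ D), atol=1e-5))               # True

# (1): ||D||_X^2 = Tr(D X^{-1} D X^{-1}) equals the sum of squared
# eigenvalues of X^{-1/2} D X^{-1/2}.
w, Q = np.linalg.eigh(X)
Xmh = Q @ np.diag(w ** -0.5) @ Q.T                   # X^{-1/2}
lam = np.linalg.eigvalsh(Xmh @ D @ Xmh)
print(np.isclose(np.trace(D @ Xinv @ D @ Xinv), np.sum(lam ** 2)))   # True
```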

Now define two auxiliary matrices:
\[ F := X^{-\frac12} Y X^{-\frac12} \quad \text{and} \quad S := X^{-\frac12} V X^{-\frac12} . \]
Note that
\[ \|S\| = \sqrt{\operatorname{Tr}(X^{-\frac12} V X^{-\frac12}\, X^{-\frac12} V X^{-\frac12})} = \|V\|_X . \quad (2) \]
Furthermore, let us write $F = QDQ^T$ where $Q$ is orthonormal ($Q^T = Q^{-1}$) and the diagonal matrix $D$ is comprised of the eigenvalues of $F$, and let $\lambda$ denote the vector of eigenvalues, with minimum and maximum $\lambda_{\min}$ and $\lambda_{\max}$.

To prove item (1.) above, we observe:
\[ 1 > \|Y - X\|_X^2 = \operatorname{Tr}(X^{-\frac12}(Y-X)X^{-\frac12}\, X^{-\frac12}(Y-X)X^{-\frac12}) = \operatorname{Tr}((F-I)(F-I)) = \operatorname{Tr}(Q(D-I)Q^T Q(D-I)Q^T) = \operatorname{Tr}((D-I)(D-I)) = \sum_{j=1}^k (\lambda_j - 1)^2 = \|\lambda - e\|_2^2 , \quad (3) \]
where $e = (1, \ldots, 1)$. Since the last quantity above is less than 1, it follows that $\lambda > 0$, and hence $F \succ 0$ and therefore $Y \succ 0$, establishing (1.).

In order to establish (2.) and (3.) we will need the following:
\[ \|F^{-\frac12} S F^{-\frac12}\| \le \frac{1}{\lambda_{\min}} \|S\| \quad \text{and} \quad \|F^{-\frac12} S F^{-\frac12}\| \ge \frac{1}{\lambda_{\max}} \|S\| . \quad (4) \]

To prove (4), we proceed as follows:
\[ \begin{aligned} \|F^{-\frac12} S F^{-\frac12}\| &= \sqrt{\operatorname{Tr}(QD^{-\frac12}Q^T S QD^{-\frac12}Q^T\, QD^{-\frac12}Q^T S QD^{-\frac12}Q^T)} \\ &= \sqrt{\operatorname{Tr}(D^{-1} Q^T S Q D^{-1} Q^T S Q)} \\ &\le \frac{1}{\sqrt{\lambda_{\min}}} \sqrt{\operatorname{Tr}(Q^T S Q D^{-1} Q^T S Q)} \\ &= \frac{1}{\sqrt{\lambda_{\min}}} \sqrt{\operatorname{Tr}(D^{-1} Q^T S S Q)} \\ &\le \frac{1}{\lambda_{\min}} \sqrt{\operatorname{Tr}(Q^T S S Q)} \\ &= \frac{1}{\lambda_{\min}} \sqrt{\operatorname{Tr}(SS)} = \frac{1}{\lambda_{\min}} \|S\| . \end{aligned} \]
The other inequality of (4) follows by substituting $\lambda_{\max}$ for $\lambda_{\min}$ and switching $\ge$ for $\le$ in the above chain of equalities and inequalities.

We now have:
\[ \begin{aligned} \|V\|_Y^2 &= \operatorname{Tr}(VY^{-1}VY^{-1}) \\ &= \operatorname{Tr}(X^{-\frac12}VX^{-\frac12}\; X^{\frac12}Y^{-1}X^{\frac12}\; X^{-\frac12}VX^{-\frac12}\; X^{\frac12}Y^{-1}X^{\frac12}) \\ &= \operatorname{Tr}(SF^{-1}SF^{-1}) \\ &= \operatorname{Tr}(F^{-\frac12}SF^{-\frac12}\, F^{-\frac12}SF^{-\frac12}) \\ &= \|F^{-\frac12}SF^{-\frac12}\|^2 \le \frac{1}{\lambda_{\min}^2}\|S\|^2 = \frac{1}{\lambda_{\min}^2}\|V\|_X^2 , \end{aligned} \]
where the last inequality follows from (4) and the last equality from (2). Therefore
\[ \frac{\|V\|_Y}{\|V\|_X} \le \frac{1}{\lambda_{\min}} \le \frac{1}{1 - |1 - \lambda_{\min}|} \le \frac{1}{1 - \|e - \lambda\|_2} = \frac{1}{1 - \|Y - X\|_X} , \]
where the last equality is from (3). This proves (2.).

To prove (3.), use the same equalities as above and the second inequality of (4) to obtain:
\[ \|V\|_Y^2 = \|F^{-\frac12}SF^{-\frac12}\|^2 \ge \frac{1}{\lambda_{\max}^2}\|V\|_X^2 , \]
and therefore $\frac{\|V\|_Y}{\|V\|_X} \ge \frac{1}{\lambda_{\max}}$. If $\lambda_{\max} \le 1$ it follows directly that $\frac{1}{\lambda_{\max}} \ge 1 \ge 1 - \|Y-X\|_X$, while if $\lambda_{\max} > 1$ we have:
\[ \|Y - X\|_X = \|\lambda - e\|_2 \ge \lambda_{\max} - 1 , \]
from which it follows that $\lambda_{\max}\|Y-X\|_X \ge \|Y-X\|_X \ge \lambda_{\max} - 1$, and so $\lambda_{\max}(1 - \|Y-X\|_X) \le 1$. From this it then follows that $\frac{\|V\|_Y}{\|V\|_X} \ge \frac{1}{\lambda_{\max}} \ge 1 - \|Y-X\|_X$, thus completing the proof of (3.).

Our next result is rather technical, as it shows further properties of changes in Hessian matrices under self-concordance. Recall from Fact 3.6 that the operator norm of a matrix $M$ is defined using the Euclidean norm, namely $\|M\| := \max\{\|Mx\| : \|x\| \le 1\}$.

Lemma 4.1 Suppose that $f(\cdot) \in \mathrm{SC}$ and $x \in D_f$. If $\|y - x\|_x < 1$, then

(i) $\|H_x^{-\frac12} H_y H_x^{-\frac12}\| \le \left(\dfrac{1}{1-\|y-x\|_x}\right)^2$,

(ii) $\|H_x^{\frac12} H_y^{-1} H_x^{\frac12}\| \le \left(\dfrac{1}{1-\|y-x\|_x}\right)^2$,

(iii) $\|I - H_x^{-\frac12} H_y H_x^{-\frac12}\| \le \left(\dfrac{1}{1-\|y-x\|_x}\right)^2 - 1$, and

(iv) $\|I - H_x^{\frac12} H_y^{-1} H_x^{\frac12}\| \le \left(\dfrac{1}{1-\|y-x\|_x}\right)^2 - 1$.

Proof: Let $Q := H_x^{-\frac12} H_y H_x^{-\frac12}$, and observe that $Q \succ 0$ with eigenvalues $\lambda_n \ge \ldots \ge \lambda_1 > 0$. From Fact 3.6 we have
\[ \sqrt{\|Q\|} = \sqrt{\lambda_n} = \max_w \sqrt{\frac{w^T Q w}{w^T w}} = \max_v \sqrt{\frac{v^T H_y v}{v^T H_x v}} = \max_v \frac{\|v\|_y}{\|v\|_x} \le \frac{1}{1 - \|y-x\|_x} \]
(where the third equality uses the substitution $v = H_x^{-\frac12} w$), and squaring yields the first assertion. Similarly, we have
\[ \frac{1}{\sqrt{\|Q^{-1}\|}} = \sqrt{\lambda_1} = \min_w \sqrt{\frac{w^T Q w}{w^T w}} = \min_v \sqrt{\frac{v^T H_y v}{v^T H_x v}} = \min_v \frac{\|v\|_y}{\|v\|_x} \ge 1 - \|y-x\|_x \]
(where the third equality again uses the substitution $v = H_x^{-\frac12} w$), and squaring and rearranging yields the second assertion. Next observe
\[ \|I - Q\| = \max_i |\lambda_i - 1| \le \max\{\lambda_n - 1,\ 1/\lambda_1 - 1\} \le \left(\frac{1}{1-\|y-x\|_x}\right)^2 - 1 , \]
where the first inequality is from Fact 3.8 and the second inequality follows from the two equation streams above, thus showing the third assertion of the lemma. Finally, we have
\[ \|I - Q^{-1}\| = \max_i |1/\lambda_i - 1| \le \max\{1/\lambda_1 - 1,\ \lambda_n - 1\} \le \left(\frac{1}{1-\|y-x\|_x}\right)^2 - 1 , \]
where the first inequality is from Fact 3.8 and the second inequality follows from the two equation streams above, thus showing the fourth assertion of the lemma.

The next result states that we can bound the error in the quadratic approximation of $f(\cdot)$ inside the Dikin ball $B_x(x,1)$.

Proposition 4.4 Suppose that $f(\cdot) \in \mathrm{SC}$ and $x \in D_f$. If $\|y-x\|_x < 1$, then
\[ \left| f(y) - \left( f(x) + g(x)^T(y-x) + \tfrac12 (y-x)^T H_x (y-x) \right) \right| \le \frac{\|y-x\|_x^3}{3(1 - \|y-x\|_x)} . \]

Proof: Let $L$ denote the left-hand side of the inequality to be proved. From Fact 3.3 we have
\[ \begin{aligned} L &= \left| \int_0^1 \int_0^t (y-x)^T [H(x+s(y-x)) - H(x)](y-x)\, ds\, dt \right| \\ &= \left| \int_0^1 \int_0^t (y-x)^T H_x^{\frac12} \left[ H_x^{-\frac12} H(x+s(y-x)) H_x^{-\frac12} - I \right] H_x^{\frac12} (y-x)\, ds\, dt \right| \\ &\le \|y-x\|_x^2 \int_0^1 \int_0^t \left\| H_x^{-\frac12} H(x+s(y-x)) H_x^{-\frac12} - I \right\| ds\, dt \\ &\le \|y-x\|_x^2 \int_0^1 \int_0^t \left( \left(\frac{1}{1 - s\|y-x\|_x}\right)^2 - 1 \right) ds\, dt \qquad \text{(from Lemma 4.1)} \\ &= \|y-x\|_x^2 \int_0^1 \frac{\|y-x\|_x\, t^2}{1 - t\|y-x\|_x}\, dt \qquad \text{(from Fact 3.4)} \\ &\le \frac{\|y-x\|_x^3}{1 - \|y-x\|_x} \int_0^1 t^2\, dt = \frac{\|y-x\|_x^3}{3(1 - \|y-x\|_x)} . \end{aligned} \]

Recall Newton's method to minimize $f(\cdot)$. At $x \in D_f$ we compute the Newton step:
\[ n(x) := -H(x)^{-1} g(x) \]
and compute the Newton iterate:
\[ x^+ := x + n(x) = x - H(x)^{-1} g(x) . \]
When $f(\cdot) \in \mathrm{SC}$, Newton's method has some very wonderful properties, as we now show.

Theorem 4.1 Suppose that $f(\cdot) \in \mathrm{SC}$ and $x \in D_f$. If $\|n(x)\|_x < 1$, then
\[ \|n(x^+)\|_{x^+} \le \left( \frac{\|n(x)\|_x}{1 - \|n(x)\|_x} \right)^2 . \]

Proof: We will prove this by establishing the following two results, which together give the theorem:

(a) $\|n(x^+)\|_{x^+} \le \dfrac{\|H_x^{-\frac12} g(x^+)\|}{1 - \|n(x)\|_x}$, and

(b) $\|H_x^{-\frac12} g(x^+)\| \le \dfrac{\|n(x)\|_x^2}{1 - \|n(x)\|_x}$.

First we prove (a):
\[ \begin{aligned} \|n(x^+)\|_{x^+}^2 &= g(x^+)^T H(x^+)^{-1} H(x^+) H(x^+)^{-1} g(x^+) \\ &= g(x^+)^T H_x^{-\frac12} H_x^{\frac12} H(x^+)^{-1} H_x^{\frac12} H_x^{-\frac12} g(x^+) \\ &\le \|H_x^{\frac12} H(x^+)^{-1} H_x^{\frac12}\| \, \|H_x^{-\frac12} g(x^+)\|^2 \\ &\le \left(\frac{1}{1 - \|x^+ - x\|_x}\right)^2 \|H_x^{-\frac12} g(x^+)\|^2 \qquad \text{(from Lemma 4.1)} \\ &= \left(\frac{1}{1 - \|n(x)\|_x}\right)^2 \|H_x^{-\frac12} g(x^+)\|^2 , \end{aligned} \]
which proves (a). To prove (b), observe first that
\[ \begin{aligned} g(x^+) &= g(x^+) - g(x) + g(x) = g(x^+) - g(x) - H_x n(x) \\ &= \int_0^1 H(x + t(x^+ - x))(x^+ - x)\, dt - H_x n(x) \qquad \text{(from Fact 3.1)} \\ &= \int_0^1 \left[ H(x + t\,n(x)) - H_x \right] n(x)\, dt \\ &= \int_0^1 \left[ H(x + t\,n(x)) - H_x \right] H_x^{-\frac12} H_x^{\frac12}\, n(x)\, dt . \end{aligned} \]
Therefore
\[ H_x^{-\frac12} g(x^+) = \int_0^1 \left[ H_x^{-\frac12} H(x + t\,n(x))\, H_x^{-\frac12} - I \right] H_x^{\frac12}\, n(x)\, dt , \]
which then implies
\[ \begin{aligned} \|H_x^{-\frac12} g(x^+)\| &\le \int_0^1 \left\| H_x^{-\frac12} H(x + t\,n(x))\, H_x^{-\frac12} - I \right\| \|H_x^{\frac12} n(x)\|\, dt \\ &\le \|H_x^{\frac12} n(x)\| \int_0^1 \left( \left( \frac{1}{1 - t\|n(x)\|_x} \right)^2 - 1 \right) dt \qquad \text{(from Lemma 4.1)} \\ &= \|n(x)\|_x \cdot \frac{\|n(x)\|_x}{1 - \|n(x)\|_x} \qquad \text{(from Fact 3.4)} , \end{aligned} \]

which proves (b).

Theorem 4.2 Suppose that $f(\cdot) \in \mathrm{SC}$ and $x \in D_f$. If $\|n(x)\|_x \le \frac14$, then $f(\cdot)$ has a minimizer $z$, and
\[ \|z - x^+\|_x \le \frac{3\|n(x)\|_x^2}{(1 - \|n(x)\|_x)^3} . \]

Proof: First suppose that $\|n(x)\|_x \le 1/9$, and define $Q_x(y) := f(x) + g(x)^T(y-x) + \frac12(y-x)^T H_x (y-x)$. Let $y$ satisfy $\|y-x\|_x \le 1/3$. Then from Proposition 4.4 we have
\[ |f(y) - Q_x(y)| \le \frac{\|y-x\|_x^3}{3(1-1/3)} \le \frac{\|y-x\|_x^2}{9(2/3)} = \frac{\|y-x\|_x^2}{6} , \]
and therefore
\[ \begin{aligned} f(y) &\ge f(x) + g(x)^T H_x^{-1} H_x^{\frac12} H_x^{\frac12}(y-x) + \tfrac12\|y-x\|_x^2 - \tfrac16\|y-x\|_x^2 \\ &\ge f(x) - \|n(x)\|_x\|y-x\|_x + \tfrac13\|y-x\|_x^2 \\ &= f(x) + \tfrac13\|y-x\|_x\left(-3\|n(x)\|_x + \|y-x\|_x\right) . \end{aligned} \]
Now if $\tilde y \in \partial S := \partial\{y : \|y-x\|_x \le 3\|n(x)\|_x\}$, it follows that $f(\tilde y) \ge f(x)$. So, by Fact 3.5, $f(\cdot)$ has a global minimizer $z \in S$, and so $\|z-x\|_x \le 3\|n(x)\|_x$.

Now suppose that $\|n(x)\|_x \le 1/4$. From Theorem 4.1 we have
\[ \|n(x^+)\|_{x^+} \le \left(\frac{1/4}{1-1/4}\right)^2 = 1/9 , \]
so $f(\cdot)$ has a global minimizer $z$ and $\|z-x^+\|_{x^+} \le 3\|n(x^+)\|_{x^+}$. Therefore
\[ \|z-x^+\|_x \le \frac{\|z-x^+\|_{x^+}}{1-\|x-x^+\|_x} = \frac{\|z-x^+\|_{x^+}}{1-\|n(x)\|_x} \ \text{(from Definition 4.1)} \ \le \frac{3\|n(x^+)\|_{x^+}}{1-\|n(x)\|_x} \le \frac{3\|n(x)\|_x^2}{(1-\|n(x)\|_x)^3} \ \text{(from Theorem 4.1)} . \]

Last of all in this section, we present a more traditional result about the convergence of Newton's method for a self-concordant function. This result is not used elsewhere in our development, and is included only to relate the results herein to the more traditional theory of Newton's method. The proof of this result is left as a somewhat-challenging exercise.

Theorem 4.3 Suppose that $f(\cdot) \in \mathrm{SC}$, $x \in D_f$, and $f(\cdot)$ has a minimizer $z$. If $\|x - z\|_z < \frac14$, then $\|x^+ - z\|_z < 4\|x - z\|_z^2$.
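The quadratic decay of $\|n(x)\|_x$ promised by Theorem 4.1 is easy to observe in practice. The sketch below (an assumed test function for illustration) runs pure Newton steps on $f(x) = c^T x - \sum_j \ln x_j$, which is self-concordant on $x > 0$ with minimizer $x_j = 1/c_j$, starting inside the region where the theorems apply.

```python
import numpy as np

# Assumed test function: f(x) = c^T x - sum(log x_j), with g = c - 1/x and
# H = diag(1/x^2); the unique minimizer is x_j = 1/c_j.
c = np.array([1.0, 3.0, 0.5])
g = lambda x: c - 1.0 / x
H = lambda x: np.diag(1.0 / x ** 2)

x = 0.8 / c                                   # start close enough to converge
for it in range(6):
    n = -np.linalg.solve(H(x), g(x))          # Newton step n(x) = -H(x)^{-1} g(x)
    dec = np.sqrt(n @ H(x) @ n)               # local norm ||n(x)||_x
    print(f"iter {it}: ||n(x)||_x = {dec:.2e}")
    x = x + n                                 # Newton iterate x+ = x + n(x)
```

Running this prints a sequence roughly $0.35,\ 0.069,\ 2.8\times 10^{-3},\ 4.4\times 10^{-6},\ldots$, i.e., the quadratic convergence of the decrement once $\|n(x)\|_x < 1/4$.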

5 Self-Concordant Barriers

We begin with another definition.

Definition 5.1 $f(\cdot)$ is a $\vartheta$-(strongly nondegenerate self-concordant)-barrier if $f(\cdot) \in \mathrm{SC}$ and
\[ \vartheta = \vartheta_f := \max_{x \in D_f} \|n(x)\|_x^2 < \infty . \]

Note that $\|n(x)\|_x^2 = (-g(x))^T H(x)^{-1} H(x) H(x)^{-1} (-g(x)) = g(x)^T H(x)^{-1} g(x)$, so we can equivalently define
\[ \vartheta_f := \max_{x \in D_f} g(x)^T H(x)^{-1} g(x) \quad \text{or} \quad \vartheta_f := \max_{x \in D_f} n(x)^T H(x)\, n(x) . \]
The quantity $\vartheta_f$ is called the complexity value of the barrier $f(\cdot)$. Let SCB denote the class of all such functions. The following property is very important.

Theorem 5.1 Suppose that $f(\cdot) \in \mathrm{SCB}$ and $x, y \in D_f$. Then $g(x)^T(y - x) < \vartheta_f$.

Proof: Define $\phi(t) := f(x + t(y-x))$, whereby $\phi'(t) = g(x+t(y-x))^T (y-x)$ and $\phi''(t) = (y-x)^T H(x+t(y-x))(y-x)$. We want to prove that $\phi'(0) < \vartheta_f$. If $\phi'(0) \le 0$ there is nothing further to prove, so we can assume that $\phi'(0) > 0$, whereby from convexity it also follows that $\phi'(t) > 0$ for all $t \in [0,1]$. (Notice that $\phi(1) = f(y)$, and so $t = 1$ is in the domain of $\phi(\cdot)$.) Let $t \in [0,1]$ be given and let $v = x + t(y-x)$. Then
\[ \begin{aligned} \phi'(t) &= g(v)^T (y-x) = g(v)^T H_v^{-1} H_v^{\frac12} H_v^{\frac12} (y-x) = -n(v)^T H_v^{\frac12} H_v^{\frac12} (y-x) \\ &\le \|H_v^{\frac12} n(v)\|\, \|H_v^{\frac12}(y-x)\| = \|n(v)\|_v \|y-x\|_v \le \sqrt{\vartheta_f}\, \|y-x\|_v . \end{aligned} \]
Also $\phi''(t) = (y-x)^T H_v (y-x) = \|y-x\|_v^2$, whereby
\[ \frac{\phi''(t)}{\phi'(t)^2} \ge \frac{\|y-x\|_v^2}{\vartheta_f \|y-x\|_v^2} = \frac{1}{\vartheta_f} . \]
It follows that
\[ \frac{1}{\vartheta_f} \le \int_0^1 \frac{\phi''(t)}{\phi'(t)^2}\, dt = \left. -\frac{1}{\phi'(t)} \right|_0^1 = \frac{1}{\phi'(0)} - \frac{1}{\phi'(1)} , \]
and we have
\[ \frac{1}{\phi'(0)} \ge \frac{1}{\vartheta_f} + \frac{1}{\phi'(1)} > \frac{1}{\vartheta_f} , \]
which proves the result.
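As a concrete instance of Theorem 5.1 (our own example, not from the source texts): for the barrier $f(x) = -\sum_{j=1}^n \ln x_j$ on the positive orthant, $g(x) = -(1/x_1, \ldots, 1/x_n)$, so for any $x, y > 0$,
\[ g(x)^T(y-x) = \sum_{j=1}^n \left(1 - \frac{y_j}{x_j}\right) < n , \]
since each $y_j/x_j > 0$; and indeed $n$ is exactly $\vartheta_f$ for this barrier, as will follow from Theorem 5.2 and Remark 2 below. The bound is approached as $y \to 0$.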

The next two results show how the complexity value $\vartheta$ behaves under addition/intersection and under affine transformation.

Theorem 5.2 (self-concordant barriers under addition/intersection) Suppose that $f_i(\cdot) \in \mathrm{SCB}$ with domain $D_i := D_{f_i}$ and complexity value $\vartheta_i := \vartheta_{f_i}$ for $i = 1,2$, and suppose that $D := D_1 \cap D_2 \ne \emptyset$. Define $f(\cdot) = f_1(\cdot) + f_2(\cdot)$. Then $f(\cdot) \in \mathrm{SCB}$ with domain $D$, and $\vartheta_f \le \vartheta_1 + \vartheta_2$.

Proof: Fix $x \in D$ and let $g, g_1, g_2$ and $H, H_1, H_2$ denote the gradients and Hessians of $f(\cdot), f_1(\cdot), f_2(\cdot)$ at $x$, whereby $g = g_1 + g_2$ and $H = H_1 + H_2$. Define $A_i = H^{-\frac12} H_i H^{-\frac12}$ for $i = 1,2$. Then $A_i \succ 0$ and $A_1 + A_2 = I$, so $A_1, A_2$ commute and $A_1^{\frac12}, A_2^{\frac12}$ commute, from Fact 3.7. Also define $u_i = A_i^{-\frac12} H^{-\frac12} g_i$ for $i = 1,2$. We have
\[ \begin{aligned} g^T H^{-1} g &= g_1^T H^{-1} g_1 + g_2^T H^{-1} g_2 + 2 g_1^T H^{-1} g_2 \\ &= u_1^T A_1 u_1 + u_2^T A_2 u_2 + 2 u_1^T A_1^{\frac12} A_2^{\frac12} u_2 \\ &= u_1^T [I - A_2] u_1 + u_2^T [I - A_1] u_2 + 2 u_1^T A_2^{\frac12} A_1^{\frac12} u_2 \qquad \text{(from Fact 3.7)} \\ &= u_1^T u_1 + u_2^T u_2 - \left( u_1^T A_2 u_1 + u_2^T A_1 u_2 - 2 u_1^T A_2^{\frac12} A_1^{\frac12} u_2 \right) \\ &= g_1^T H^{-\frac12} A_1^{-1} H^{-\frac12} g_1 + g_2^T H^{-\frac12} A_2^{-1} H^{-\frac12} g_2 - \|A_2^{\frac12} u_1 - A_1^{\frac12} u_2\|^2 \\ &\le g_1^T H^{-\frac12} H^{\frac12} H_1^{-1} H^{\frac12} H^{-\frac12} g_1 + g_2^T H^{-\frac12} H^{\frac12} H_2^{-1} H^{\frac12} H^{-\frac12} g_2 \\ &= g_1^T H_1^{-1} g_1 + g_2^T H_2^{-1} g_2 \le \vartheta_1 + \vartheta_2 , \end{aligned} \]
thereby showing that $\vartheta_f \le \vartheta_1 + \vartheta_2$.

Theorem 5.3 (self-concordant barriers under affine transformation) Let $A \in \mathbb{R}^{m\times n}$ satisfy $\operatorname{rank} A = n \le m$. Suppose that $f(\cdot) \in \mathrm{SCB}$ with complexity value $\vartheta_f$ and domain $D_f \subset \mathbb{R}^m$, and define $\hat f(\cdot)$ by $\hat f(x) = f(Ax-b)$. Then $\hat f(\cdot) \in \mathrm{SCB}$ and $\vartheta_{\hat f} \le \vartheta_f$.

Proof: Fix $x \in \hat D$ and define $s = Ax - b$. Letting $g$ and $H$ denote the gradient and Hessian of $f(\cdot)$ at $s$, and $\hat g$ and $\hat H$ the gradient and Hessian of $\hat f(\cdot)$ at $x$, we have $\hat g = A^T g$ and $\hat H = A^T H A$. Then
\[ \hat g^T \hat H^{-1} \hat g = g^T A (A^T H A)^{-1} A^T g = g^T H^{-\frac12} \left[ H^{\frac12} A (A^T H^{\frac12} H^{\frac12} A)^{-1} A^T H^{\frac12} \right] H^{-\frac12} g \le g^T H^{-\frac12} H^{-\frac12} g = g^T H^{-1} g \le \vartheta_f , \]
since the bracketed matrix $H^{\frac12} A (A^T H^{\frac12} H^{\frac12} A)^{-1} A^T H^{\frac12}$ is a projection matrix (and hence $\preceq I$).

Remark 2 The complexity values of the three most-used barriers are as follows:

1. $\vartheta_f = 1$ for the barrier $f(x) = -\ln(x)$ defined on $D_f = \{x : x > 0\}$,

2. $\vartheta_f = k$ for the barrier $f(X) = -\ln\det(X)$ defined on $D_f = \{X \in S^{k\times k} : X \succ 0\}$, and

3. $\vartheta_f = 2$ for the barrier $f(x) = -\ln\left(x_1^2 - \sum_{j=2}^n x_j^2\right)$ defined on $D_f = \{x : \|(x_2,\ldots,x_n)\| < x_1\}$.

Proof: Item (1.) follows from item (2.), so we first prove (2.). Recall the use of the trace inner product $X \bullet Y = \operatorname{Tr}(XY)$ and the Frobenius norm $\|X\| := \sqrt{X \bullet X}$ as discussed early in the proof of Proposition 4.3. Also recall that for $f(X) = -\ln\det(X)$ we have $g(X) = -X^{-1}$ and $H(X)\Delta X = X^{-1}\Delta X X^{-1}$. Therefore the Newton step at $X$, denoted by $n(X)$, is the solution of the equation
\[ X^{-1}[n(X)]X^{-1} = X^{-1} , \]
and it follows that $n(X) = X$. Therefore, using (1) we have
\[ \|n(X)\|_X^2 = n(X) \bullet X^{-1} n(X) X^{-1} = X \bullet X^{-1} X X^{-1} = X \bullet X^{-1} = \operatorname{Tr}(XX^{-1}) = \operatorname{Tr}(I) = k , \]
and therefore $\vartheta_f = \max_{X \succ 0} \|n(X)\|_X^2 = k$, which proves (2.) and hence (1.).

In order to prove (3.) we amend our notation a bit, letting $Q^n = \{(t,x) \in \mathbb{R}^1 \times \mathbb{R}^{n-1} : \|x\| \le t\}$. For $(t,x) \in \operatorname{int} Q^n$ we have $\|x\| < t$, and mechanically we can derive:
\[ g(t,x) = \frac{1}{t^2 - x^T x} \begin{pmatrix} -2t \\ 2x \end{pmatrix} , \qquad H(t,x) = \frac{1}{(t^2 - x^T x)^2} \begin{pmatrix} 2t^2 + 2x^T x & -4t x^T \\ -4t x & 2(t^2 - x^T x) I + 4 x x^T \end{pmatrix} , \]
and the Hessian inverse is given by
\[ H(t,x)^{-1} = \begin{pmatrix} \dfrac{t^2 + x^T x}{2} & t x^T \\ t x & \dfrac{t^2 - x^T x}{2}\, I + x x^T \end{pmatrix} . \]
Directly plugging in yields $g(t,x)^T H(t,x)^{-1} g(t,x) = 2$, from which it follows that $\vartheta_f = \max_{(t,x) \in \operatorname{int} Q^n} g(t,x)^T H(t,x)^{-1} g(t,x) = 2$.

Finally, we present a result that shows that "one half of the Dikin ball" approximates the shape of a relevant portion of $D_f$ to within a factor of $\vartheta_f$.

Proposition 5.1 Suppose that $f(\cdot) \in \mathrm{SCB}$ with complexity value $\vartheta_f$, let $x \in D_f$, and define the following three sets:

(a) $S_1 := \{y : \|y-x\|_x < 1,\ g(x)^T(y-x) \ge 0\}$,

(b) $S_2 := \{y : y \in D_f,\ g(x)^T(y-x) \ge 0\}$, and

(c) $S_3 := \{y : \|y-x\|_x < 4\vartheta_f + 1,\ g(x)^T(y-x) \ge 0\}$.

Then $S_1 \subset S_2 \subset S_3$.

The proof of Proposition 5.1 is no more involved than other results herein. It is not included because it is not necessary for subsequent results; it is stated here to show the power and beauty of self-concordant barriers.

Corollary 5.1 If $z$ is the minimizer of $f(\cdot)$, then
\[ B_z(z, 1) \subset D_f \subset B_z(z, 4\vartheta_f + 1) . \]
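The constants in Remark 2 are easy to check numerically. The sketch below (an illustration at randomly drawn test points, using the explicit formulas above) evaluates $g^T H^{-1} g$ for the one-dimensional log barrier and for the second-order-cone barrier; for $-\ln\det$, the computation reduces to $\operatorname{Tr}(I) = k$ as in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

# (1) f(x) = -ln x on x > 0: g = -1/x and H = 1/x^2, so g^2 / H = 1 at every x.
x = rng.uniform(0.1, 10.0)
print((-1.0 / x) ** 2 * x ** 2)          # 1.0  (= vartheta)

# (2) f(X) = -ln det X: n(X) = X, so ||n(X)||_X^2 = X . X^{-1} = Tr(I) = k.

# (3) f(t,x) = -ln(t^2 - x^T x) on ||x|| < t: g^T H^{-1} g = 2 at a random
# interior point, using the explicit formulas for g and H^{-1} above.
xx = rng.standard_normal(4)
t = np.linalg.norm(xx) + 1.0             # ensures (t, xx) is interior
d = t ** 2 - xx @ xx
g = np.concatenate(([-2.0 * t], 2.0 * xx)) / d
Hinv = np.block([[np.array([[(t ** 2 + xx @ xx) / 2.0]]), t * xx[None, :]],
                 [t * xx[:, None], (d / 2.0) * np.eye(4) + np.outer(xx, xx)]])
print(g @ Hinv @ g)                      # 2.0  (= vartheta)
```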

6 The Barrier Method and its Analysis

Our original problem of interest is
\[ P: \ \text{minimize}_x \ c^T x \quad \text{s.t.} \ x \in S , \]
whose optimal objective value we denote by $V^*$. Let $f(\cdot)$ be a self-concordant barrier on $D_f = \operatorname{int} S$. For every $\mu > 0$ we create the barrier problem:
\[ P_\mu: \ \text{minimize}_x \ \mu c^T x + f(x) \quad \text{s.t.} \ x \in D_f . \]
The solution of this problem for each $\mu$ is denoted $z(\mu)$:
\[ z(\mu) := \arg\min_x \{ \mu c^T x + f(x) : x \in D_f \} . \]
Intuitively, as $\mu \to \infty$, the impact of the barrier function on the solution of $P_\mu$ should become less and less, so we should have $c^T z(\mu) \to V^*$ as $\mu \to \infty$. Presuming this is the case, the barrier scheme will use Newton's method to solve for approximate solutions $x^i$ of $P_{\mu^i}$ for an increasing sequence of values $\mu^i \to \infty$. To be more specific about how the barrier scheme might work, let us assume that at each iteration we have some value $x \in D_f$ that is an approximate solution of $P_\mu$ for a given value $\mu > 0$. (We will define "an approximate solution of $P_\mu$" shortly.) We then increase the barrier parameter $\mu$ by a multiplicative factor $\alpha > 1$: $\hat\mu \leftarrow \alpha\mu$. We then take a Newton step at $x$ for the problem $P_{\hat\mu}$ to obtain a new point $\hat x$ that we would like to be an approximate solution of $P_{\hat\mu}$. If so, we can continue the scheme inductively.

Let $x \in D_f$ be given, and let us compute the Newton step for $P_\mu$ at $x$. The objective function of $P_\mu$ is
\[ h_\mu(x) := \mu c^T x + f(x) , \]
whereby we have:

(a) $\nabla h_\mu(x) = \mu c + g(x)$, and

(b) $\nabla^2 h_\mu(x) = H(x) = H_x$.

Therefore the Newton step for $h_\mu(\cdot)$ at $x$ is
\[ n_\mu(x) := -H_x^{-1}(\mu c + g(x)) = n(x) - \mu H_x^{-1} c , \]
and the new iterate is
\[ \hat x := x + n_\mu(x) = x - H_x^{-1}(\mu c + g(x)) . \]

Remark 3 Notice that $h_\mu(\cdot) \in \mathrm{SC}$, since membership in SC has only to do with Hessians, and $h_\mu(\cdot)$ and $f(\cdot)$ have the same Hessian. However, membership in SCB depends also on gradients, and $h_\mu(\cdot)$ and $f(\cdot)$ have different gradients, whereby it will typically be true that $h_\mu(\cdot) \notin \mathrm{SCB}$ (unless $c = 0$).

We now define what we mean for $y$ to be an "approximate solution" of $P_\mu$.

Definition 6.1 Let $\gamma \in [0,1)$ be given. We say that $y \in D_f$ is a $\gamma$-approximate solution of $P_\mu$ if
\[ \|n_\mu(y)\|_y \le \gamma . \]
Notice that this definition is equivalent to $\|n(y) - \mu H_y^{-1} c\|_y \le \gamma$. Essentially, the definition states that $y$ is a $\gamma$-approximate solution of $P_\mu$ if the Newton step for $P_\mu$ at $y$ is small (measured using the local norm at $y$).

The following theorem gives an explicit optimality gap bound for $y$ if $y$ is a $\gamma = 1/4$-approximate solution of $P_\mu$.

Theorem 6.1 Suppose $\gamma \le \frac14$, and $y \in D_f$ is a $\gamma$-approximate solution of $P_\mu$. Then
\[ c^T y \le V^* + \frac{\vartheta_f}{\mu}\left(\frac{1}{1-\delta}\right) \quad \text{where} \ \delta = \gamma + \frac{3\gamma^2}{(1-\gamma)^3} . \]


Proof: From Theorem 4.2 (applied to $h_\mu(\cdot)$, which is in SC) we know that $z(\mu)$ exists, and furthermore
\[ \|y - z(\mu)\|_y = \|y + n_\mu(y) - z(\mu) - n_\mu(y)\|_y \le \|y^+ - z(\mu)\|_y + \|n_\mu(y)\|_y \le \frac{3\gamma^2}{(1-\gamma)^3} + \gamma = \delta . \]
From basic first-order optimality conditions we know that $z(\mu)$ satisfies $\mu c + g(z(\mu)) = 0$, and from Theorem 5.1 we have
\[ -\mu c^T (w - z(\mu)) = g(z(\mu))^T (w - z(\mu)) < \vartheta_f \quad \text{for all } w \in D_f . \]
Rearranging, we have
\[ c^T w + \frac{\vartheta_f}{\mu} > c^T z(\mu) \quad \text{for all } w \in D_f , \]
whereby $V^* + \frac{\vartheta_f}{\mu} \ge c^T z(\mu)$. Now for notational convenience denote $z := z(\mu)$ and observe
\[ \begin{aligned} c^T y &= c^T z + c^T (y - z) \\ &\le V^* + \frac{\vartheta_f}{\mu} + c^T H_z^{-\frac12} H_z^{\frac12} (y - z) \\ &\le V^* + \frac{\vartheta_f}{\mu} + \|H_z^{-\frac12} c\|\, \|y - z\|_z \\ &\le V^* + \frac{\vartheta_f}{\mu} + \sqrt{c^T H_z^{-1} c}\; \frac{\|y-z\|_y}{1 - \|y-z\|_y} \\ &\le V^* + \frac{\vartheta_f}{\mu} + \sqrt{(g(z)/\mu)^T H_z^{-1} (g(z)/\mu)}\; \frac{\delta}{1-\delta} \\ &= V^* + \frac{\vartheta_f}{\mu} + \frac{\sqrt{g(z)^T H_z^{-1} g(z)}}{\mu}\; \frac{\delta}{1-\delta} \\ &\le V^* + \frac{\vartheta_f}{\mu} + \frac{\vartheta_f}{\mu}\; \frac{\delta}{1-\delta} \\ &= V^* + \frac{\vartheta_f}{\mu}\left(1 + \frac{\delta}{1-\delta}\right) = V^* + \frac{\vartheta_f}{\mu(1-\delta)} . \end{aligned} \]
The last inequality above follows from the fact (which we will not prove) that $\vartheta_f \ge 1$ for any $f(\cdot) \in \mathrm{SCB}$. Note that with $\gamma = 1/9$ we have $1/(1-\delta) \le 6/5$, whereby $c^T y \le V^* + 1.2\,\vartheta_f/\mu$.
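(To see this last constant by direct arithmetic: with $\gamma = 1/9$, we get $3\gamma^2/(1-\gamma)^3 = (1/27)(729/512) = 27/512 \approx 0.0527$, so $\delta \approx 0.1111 + 0.0527 = 0.1638$ and $1/(1-\delta) \approx 1.196 \le 6/5$.)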

Theorem 6.2 Let $\beta := \frac14$, $\gamma := \frac19$, and $\alpha := \dfrac{\sqrt{\vartheta}+\beta}{\sqrt{\vartheta}+\gamma}$. Suppose $x$ is a $\gamma$-approximate solution of $P_\mu$. Define $\hat\mu := \alpha\mu$, and let $\hat x$ be the Newton iterate for $P_{\hat\mu}$ at $x$, namely $\hat x := x - H(x)^{-1}(\hat\mu c + g(x))$. Then:

1. $x$ is a $\beta$-approximate solution of $P_{\hat\mu}$, and

2. $\hat x$ is a $\gamma$-approximate solution of $P_{\hat\mu}$.

Proof: To prove (1.) we have:
\[ \begin{aligned} \|n_{\hat\mu}(x)\|_x = \|n_{\alpha\mu}(x)\|_x &= \|H(x)^{-1}(\alpha\mu c + g(x))\|_x \\ &= \left\| \alpha\left[ H(x)^{-1}(\mu c + g(x)) \right] + (1-\alpha) H(x)^{-1} g(x) \right\|_x \\ &\le \alpha \|H(x)^{-1}(\mu c + g(x))\|_x + (\alpha-1)\|H(x)^{-1} g(x)\|_x \\ &\le \alpha\gamma + (\alpha-1)\|n(x)\|_x \\ &\le \alpha\gamma + (\alpha-1)\sqrt{\vartheta_f} = \beta . \end{aligned} \]
To prove (2.) we invoke Theorem 4.1:
\[ \|n_{\hat\mu}(\hat x)\|_{\hat x} \le \frac{\|n_{\hat\mu}(x)\|_x^2}{(1 - \|n_{\hat\mu}(x)\|_x)^2} \le \frac{\beta^2}{(1-\beta)^2} = \frac19 = \gamma . \]

Applying Theorem 6.2 inductively, we obtain the basic barrier method for self-concordant barriers as presented in Algorithm 2. The complexity of this scheme is presented below.

Algorithm 2 Barrier Method

Initialize. Define $\gamma := 1/9$, $\beta := 1/4$, $\alpha := \dfrac{\sqrt{\vartheta}+\beta}{\sqrt{\vartheta}+\gamma}$. Initialize with $\mu^0 > 0$ and $x^0 \in D_f$ that is a $\gamma$-approximate solution of $P_{\mu^0}$. $i \leftarrow 0$.

At iteration $i$:

1. Current values. $\mu \leftarrow \mu^i$, $x \leftarrow x^i$.

2. Increase $\mu$ and take Newton step. $\hat\mu \leftarrow \alpha\mu$, $\hat x \leftarrow x - H(x)^{-1}(\hat\mu c + g(x))$.

3. Update values. $\mu^{i+1} \leftarrow \hat\mu$, $x^{i+1} \leftarrow \hat x$.

Theorem 6.3 Let $\varepsilon > 0$ be the desired optimality tolerance, and define
\[ J := \left\lceil 9\sqrt{\vartheta}\, \ln\left( \frac{6\vartheta}{5\mu^0\varepsilon} \right) \right\rceil . \]
Then by iteration $J$ of the barrier method, the current iterate $x$ satisfies $c^T x \le V^* + \varepsilon$.

Proof: With the given values of $\gamma, \beta, \alpha$, the quantity $\delta := \gamma + \frac{3\gamma^2}{(1-\gamma)^3}$ satisfies $\frac{1}{1-\delta} \le 6/5$, and
\[ 1 - \frac{1}{\alpha} = \frac{1}{7.2\sqrt{\vartheta} + 1.8} \ge \frac{1}{9\sqrt{\vartheta}} . \]
After $J$ iterations the current iterate $x$ is a $\gamma$-approximate solution of $P_\mu$ where $\mu = \alpha^J \mu^0$. Therefore
\[ \ln \mu^0 - \ln \mu = J \ln\left(\frac{1}{\alpha}\right) \le J\left(\frac{1}{\alpha} - 1\right) . \]
(Note that the inequality above follows from the fact that $\ln t \le t - 1$ for $t > 0$, which itself is a consequence of the concavity of $\ln(\cdot)$.) Therefore
\[ \ln \mu \ge \ln \mu^0 + J\left(1 - \frac{1}{\alpha}\right) \ge \ln \mu^0 + \frac{J}{9\sqrt{\vartheta}} \ge \ln \mu^0 + \ln\left(\frac{6\vartheta}{5\mu^0\varepsilon}\right) \ge \ln\left(\frac{\vartheta}{(1-\delta)\varepsilon}\right) . \]
Therefore $\mu \ge \frac{\vartheta}{(1-\delta)\varepsilon}$, and from Theorem 6.1 we have
\[ c^T x \le V^* + \frac{\vartheta}{\mu(1-\delta)} \le V^* + \varepsilon . \]
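For concreteness, here is a Python sketch of Algorithm 2 applied to a small linear program $\min\{c^T x : Ax \le b\}$ with the barrier $f(x) = -\sum_i \ln(b_i - a_i^T x)$, for which $\vartheta \le m$ by Theorems 5.2 and 5.3 and Remark 2. The problem data and starting point are assumptions for illustration; in particular, the initialization below is a crude heuristic stand-in for a formal start-up phase (see "getting started" in Section 7.3).

```python
import numpy as np

# Assumed LP data: minimize c^T x  s.t.  A x <= b  (optimum at (0, 2)).
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [2.0, -1.0]])
b = np.array([2.0, 0.0, 0.0, 3.0])
c = np.array([-1.0, -2.0])
theta = float(A.shape[0])        # vartheta <= m for this log barrier

def grad(x):                     # g(x) = sum_i a_i / (b_i - a_i^T x)
    return A.T @ (1.0 / (b - A @ x))

def hess(x):                     # H(x) = A^T diag(1/s_i^2) A, s = b - A x
    s = b - A @ x
    return A.T @ np.diag(1.0 / s ** 2) @ A

gamma, beta = 1.0 / 9.0, 1.0 / 4.0
alpha = (np.sqrt(theta) + beta) / (np.sqrt(theta) + gamma)

x, mu = np.array([0.5, 0.5]), 0.1
# Crude initialization: Newton steps on P_{mu0} until x is (approximately)
# a gamma-approximate solution of P_{mu0}.
for _ in range(20):
    x = x - np.linalg.solve(hess(x), mu * c + grad(x))

for i in range(200):             # Algorithm 2 proper
    mu = alpha * mu                                      # mu-hat <- alpha * mu
    x = x - np.linalg.solve(hess(x), mu * c + grad(x))   # Newton step for P_mu-hat

print(x, c @ x, 1.2 * theta / mu)   # iterate, objective, gap bound (Theorem 6.1)
```

With these (assumed) data the iterate tracks the central path toward the optimal vertex $(0,2)$, and after 200 iterations the printed Theorem 6.1 gap bound $1.2\,\vartheta/\mu$ is on the order of $10^{-4}$.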

7 Remarks and other Matters

7.1 Nesterov-Nemirovskii definition of self-concordance

The original definition of self-concordance, due to Nesterov and Nemirovskii, is different from the one presented here in Definition 4.1. Let $f(\cdot)$ be a function defined on an open set $D_f \subset \mathbb{R}^n$. For every $x \in D_f$ and every vector $h \in \mathbb{R}^n$ define the univariate function
\[ \phi_{x,h}(\alpha) := f(x + \alpha h) . \]
Then Nesterov and Nemirovskii's definition of self-concordance is as follows.

Definition 7.1 The function $f(\cdot)$ defined on the domain $D_f \subset \mathbb{R}^n$ is (strongly nondegenerate) self-concordant if:

(a) $H(x) \succ 0$ for all $x \in D_f$,

(b) $f(x) \to \infty$ as $x \to \partial D_f$, and

(c) for all $x \in D_f$ and $h \in \mathbb{R}^n$, $\phi_{x,h}(\cdot)$ satisfies $\left|\phi'''_{x,h}(0)\right| \le 2\left[\phi''_{x,h}(0)\right]^{3/2}$.

Definition 7.1 roughly states that the third derivative of $f(\cdot)$ is bounded by an appropriate function of the second derivative. This definition assumes $f(\cdot)$ is three-times differentiable, unlike Definition 4.1. It turns out that when $f(\cdot)$ is three-times differentiable, Definition 7.1 implies the properties of Definition 4.1 and vice versa; thus the two definitions are equivalent. We prefer Definition 4.1 because proofs of later properties are far easier to establish. The key advantage of Definition 7.1 is that it leads to an easier way to validate the self-concordance of most self-concordant functions, especially $-\ln\det(X)$ and $-\ln(t^2 - x^T x)$ for the semidefinite and second-order cones, respectively.
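As a worked instance of property (c) (our own example): for $f(x) = -\ln(x)$ on $x > 0$ and any $h \in \mathbb{R}$, we have $\phi_{x,h}(\alpha) = -\ln(x + \alpha h)$, so
\[ \phi''_{x,h}(0) = \frac{h^2}{x^2} \quad \text{and} \quad \phi'''_{x,h}(0) = -\frac{2h^3}{x^3} , \]
whereby $|\phi'''_{x,h}(0)| = 2|h|^3/x^3 = 2\left[\phi''_{x,h}(0)\right]^{3/2}$; that is, the logarithmic barrier satisfies the Nesterov-Nemirovskii inequality with equality.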

7.2 Universal Barrier

One natural question is: given a convex set $S \subset \mathbb{R}^n$ with nonempty interior, does there exist a self-concordant barrier $f_S(\cdot)$ whose domain is $D_f = \operatorname{int} S$? Furthermore, if the answer is yes, does there exist a barrier whose complexity value is, say, $O(n)$? It turns out that the answers to these two questions are both "yes," but the proofs are incredibly difficult.

Let $S \subset \mathbb{R}^n$ be a closed convex set with nonempty interior. For each point $x \in \operatorname{int} S$, define the "local polar of $S$ relative to $x$":
\[ P_S(x) := \{ w \in \mathbb{R}^n : w^T(y - x) \le 1 \ \text{for all} \ y \in S \} . \]
Basic geometry suggests that as $x \to \partial S$, the volume of $P_S(x)$ should approach $\infty$. Now define the function:
\[ f_S(x) := \ln\left(\operatorname{volume}(P_S(x))\right) . \]
It turns out that $f_S(\cdot)$ is a self-concordant function, but this is quite hard to prove. Furthermore, there is a universal constant $\bar C$ such that for any dimension $n$ and any closed convex set $S$ with nonempty interior, it is true that
\[ \vartheta_{f_S} \le \bar C \cdot n . \]
This is a remarkable and very deep result. However, it is computationally useless: not only are the function values of $f_S(\cdot)$ extremely difficult to compute, but gradient and Hessian information is also extremely difficult to compute.

7.3 Other Matters

Other topics, not treated herein, include: (i) getting started, (ii) other formats for convex optimization, (iii) $\vartheta$-logarithmic homogeneous barriers for cones, (iv) primal-dual methods, (v) computational practice, and (vi) other self-concordant functions and self-concordant calculus.

7.4 Exercises

1. Prove that if $\|n(x)\|_x > 1/4$, then upon setting
\[ y = x + \frac{1}{5\|n(x)\|_x}\, n(x) , \]
we have $f(y) \le f(x) - 1/37.5$.

2. Prove Theorem 4.3. To get started, observe that $x^+ - z = x - z - H_x^{-1}(g(x) - g(z))$ (since $z$ is a minimizer and hence $g(z) = 0$), and then use Fact 3.1 to show that
\[ x^+ - z = H_x^{-\frac12} \int_0^1 H_x^{-\frac12} \left[ H_x - H(x + t(z-x)) \right] H_x^{-\frac12} H_x^{\frac12} (x - z)\, dt . \]
Next multiply each side by $H_x^{\frac12}$ and take norms, and then invoke Lemma 4.1. This should help get you started.

3. Use Theorem 4.3 to show that under the hypothesis of the theorem, the sequence of Newton iterates starting with $x$, namely $x^1 = x^+$, $x^2 = (x^1)^+$, $\ldots$, satisfies
\[ \|x^i - z\|_z < \frac14 \left( 4\|x - z\|_z \right)^{2^i} . \]

4. Complete the proof of Proposition 4.1 by proving the "$\ge$" inequality in the definition of self-concordance, using the suggestions at the end of the proof earlier in the text.

5. Complete the proof of Proposition 4.2 by proving the "$\ge$" inequality in the definition of self-concordance, using the suggestions at the end of the proof earlier in the text.

6. The proof of Proposition 4.3 uses the inequalities presented in (4), but only the left-most inequality is proved in the text herein. Prove the right-most inequality by substituting $\lambda_{\max}$ for $\lambda_{\min}$ and switching $\ge$ for $\le$ in the chain of equalities and inequalities in the text.