On the Superlinear Convergence of the Variable Metric Proximal Point Algorithm Using Broyden and BFGS Matrix Secant Updating

J.V. Burke
Maijian Qian

Submitted to Mathematical Programming August 27, 1996; revised September 1997

Abstract
In previous work, the authors provided a foundation for the theory of variable metric proximal point algorithms for general monotone operators on a Hilbert space. In particular, they developed conditions for the global, linear, and super-linear convergence of their proposed algorithm. This paper focuses attention on two matrix secant updating strategies for the finite dimensional case: the Broyden and BFGS updates. The BFGS update is considered for application in the symmetric case, e.g., convex programming applications, while the Broyden update can be applied to general monotone operators. Subject to the linear convergence of the iterates and a quadratic growth condition on the inverse of the operator at the solution, both updates are shown to yield the super-linear convergence of the variable metric proximal point iterates. These results are applied to obtain conditions under which the Chen-Fukushima variable metric proximal point algorithm is super-linearly convergent when implemented with the BFGS update.
Keywords: maximal monotone operator, proximal point methods, variable metric, global convergence, super-linear convergence.
Abbreviated title: Variable Metric PPA II.
AMS(MOS) subject classifications (1991): primary 90C25; secondary 49J45, 47H05, 49M45.
Department of Mathematics, Box # 354350, University of Washington, Seattle, Washington 98195-4350. This author's research is supported by the National Science Foundation Grant No. DMS-9303772.
Department of Mathematics, California State University, Fullerton, CA 92634.
1 Introduction

In [3], we introduced the variable metric proximal point algorithm (VMPPA) for general monotone operators and established a basic convergence theory. The algorithm builds on the classical proximal point algorithm and can be viewed as a Newton-like method for solving inclusions of the form $0 \in T(z)$ where $T$ is a maximal monotone operator on a Hilbert space. In this paper, attention is restricted to the finite dimensional setting and we consider two quasi-Newton updating strategies for generating the Newton-like iterates: the BFGS and Broyden updates. The BFGS update is appropriate for application to convex programming and the Broyden update is suitable for other applications such as mini-max problems. In the final section, we show how these results can be used to establish the super-linear convergence of the Chen-Fukushima VMPPA for convex programming when the BFGS update is employed.

Recently, the drive to develop a VMPPA in the context of finite-valued, finite dimensional convex programming has been joined by several authors [1, 3, 5, 7, 11, 12, 13, 14, 17, 18]. In convex programming, the goal is to derive a variable metric method for minimizing the Moreau-Yosida regularization of a convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$:

$$f_\lambda(z) := \min_{y \in \mathbb{R}^n} \Big\{ f(y) + \frac{1}{2\lambda}\|y - z\|^2 \Big\} \qquad (1)$$

(in the finite-valued case, $f$ cannot take the value $+\infty$). Here the operator $T$ is the convex subdifferential of the function $f$, denoted $\partial f$. The inclusion $0 \in \partial f(z)$ identifies $z$ as a point at which $f$ attains its global minimum value. It is well known that the sets of points $z$ yielding the minimum values of $f$ and $f_\lambda$ coincide, and that the function $f_\lambda$ is continuously differentiable with Lipschitz continuous derivative even if the function $f$ is neither differentiable nor finite-valued.
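For intuition, the regularization (1) can be evaluated numerically. The sketch below is our own illustration, not from the paper: it computes the Moreau-Yosida regularization of the nondifferentiable function $f(y) = |y|$ by brute-force minimization over a grid and checks it against the known closed form (the Huber function); the helper names are hypothetical.

```python
import numpy as np

def moreau_envelope(f, z, lam, grid):
    """Evaluate f_lam(z) = min_y { f(y) + ||y - z||^2 / (2*lam) }
    by brute-force minimization over a 1-D grid (illustration only)."""
    vals = f(grid) + (grid - z) ** 2 / (2.0 * lam)
    return vals.min()

f = np.abs                 # f(y) = |y|: convex, nondifferentiable at 0
lam = 0.5
grid = np.linspace(-5.0, 5.0, 200001)

def huber(z, lam):
    """Closed-form Moreau envelope of |.|: a C^1 (Huber) function."""
    return z * z / (2.0 * lam) if abs(z) <= lam else abs(z) - lam / 2.0

for z in (-2.0, -0.3, 0.0, 0.3, 2.0):
    assert abs(moreau_envelope(f, z, lam, grid) - huber(z, lam)) < 1e-6
```

The grid minimizer matches the proximal point $P(\lambda, z)$ of (1), and the resulting envelope is smooth even though $f$ is not.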
The challenge is to derive a super-linearly convergent method that does not require precise values for either $f_\lambda$ or its derivative, and does not require excessively strong smoothness hypotheses on the function $f$ (we only consider quotient or Q-rates of convergence). In this paper, we concentrate on matrix secant updating strategies for the VMPPA in the operator setting. Previously, only applications to finite-valued convex programming have been considered.

In the convex programming case, matrix secant updating strategies are also considered in [1], [5], [11], [12], and [14]. In [1], Bonnans, Gilbert, Lemarechal, and Sagastizabal consider an algorithmic pattern modeled on the approach suggested by Qian in [18]. In this regard, the quasi-Newton updates are applied to the function $f$ instead of $f_\lambda$. This approach allows one to circumvent the technical difficulties associated with varying the value of $\lambda$ in $f_\lambda$. The authors provide an adaptation of Dennis and Moré's landmark characterization of super-linear convergence for Newton-like methods in nonlinear programming [8], and super-linear convergence results are established for the PSB, DFP, and BFGS updates. These results come at the cost of assuming that the function $f_\lambda$ possesses certain smoothness properties. Specifically, it is assumed that the function $f_\lambda$ is continuously differentiable at a unique solution with Lipschitz continuous derivative, the derivative is itself strongly directionally differentiable, and the directional derivative operator is positive definite. In addition, the results for the BFGS update require that the directional derivative satisfy a kind of second-order approximation property. The approach that we take is to impose smoothness hypotheses on the operator $T^{-1}$ instead of $T$. In the context of convex programming, this allows us to avoid such strong hypotheses on the function $f_\lambda$. In particular, our convergence results do not require $f$ to be finite-valued.

In [5], Chen and Fukushima employ a bundle strategy for approximating $f_\lambda$ and its gradient. In addition, they employ a line search based on the function $f$ instead of $f_\lambda$. This is a very important innovation in practice since the evaluation of approximations to $f_\lambda$ can be costly. They establish an analog of Dennis and Moré's characterization theorem under the assumption that $f$ is strongly convex, $f_\lambda$ is strongly twice differentiable at the unique solution, the derivative approximations and their inverses are bounded, and the error in the approximation of $f_\lambda$ is little-oh of the square of the norm of the approximation to $\nabla f_\lambda$. They also consider the BFGS update, but do not show that it yields super-linear convergence. This fact is established in the final section of this paper.

In [11], Lemarechal and Sagastizabal consider a scalar quasi-Newton update (a multiple of the identity) and establish super-linear convergence by showing that the scalars converge to zero under the hypotheses that $f$ is locally continuously differentiable with Lipschitz continuous derivative and $f$ satisfies a quadratic growth condition near the solution set. In [12], Lemarechal and Sagastizabal establish another analog of the Dennis and Moré characterization theorem under a hypothesis that requires the derivative approximation for $\partial f_\lambda$ to be small in a certain sense. The authors then consider the SR1 update and a scalar quasi-Newton update which they label a poor man's update. A rate of convergence result is not established for the SR1 update. The poor man's update is derived by combining the reversal quasi-Newton formula (derived in [12]) with the scalar update formula that appears in [11]. A super-linear convergence result for this update is derived along the lines of [11] by showing that the scalar converges to zero.

In [14], Mifflin, Sun, and Qi obtain the first super-linear convergence result for a variable metric proximal point algorithm using the BFGS matrix secant update in the setting of finite dimensional, finite-valued convex programming. Their proposed algorithm uses a line search based on approximations to the function $f_\lambda$ and requires that the function $f$ be strongly convex with $\nabla f_\lambda$ Frechet differentiable at the unique global solution to the convex program. In addition, the main super-linear convergence result [14, Theorem 5.3] assumes that the iterates satisfy a certain approximation property involving the Hessian $\nabla^2 f_\lambda$. On the other hand, in our analysis, we prove that the iterates satisfy this approximation property (Lemma 8). The smoothness assumptions used by Mifflin, Sun, and Qi are in the spirit of the ones employed in our analysis.

We reiterate that the goal of this paper is to establish the super-linear convergence of the VMPPA as presented in [3] using the Broyden and BFGS updates. This algorithm applies to general monotone operators on $\mathbb{R}^n$ and is not confined to applications in convex programming. In the context of convex programming, our convergence hypotheses differ from those used in [1, 5, 11, 12] since they are imposed on the operator $(\partial f)^{-1}$ rather than $\partial f$. This difference is significant since it allows us to handle the non-finite-valued case, i.e., constrained convex programs. Nonetheless, our convergence results do apply to standard
VMPPA theory for convex programming. By way of illustration, we apply our results to the Chen-Fukushima VMPPA in the final section of this paper. Further details concerning the application of the VMPPA in the context of convex programming can be found in [2] along with some preliminary numerical results.

The paper is organized as follows. In Section 2, we recall the basic features of the VMPPA. In Section 3, we study the Broyden and BFGS updating strategies for the VMPPA and provide conditions under which super-linear convergence is achieved. In Section 4, we consider the Chen-Fukushima VMPPA for finite-valued convex programming. We begin by establishing a linear convergence result for this algorithm based on hypotheses consistent with the analysis provided in Section 3. We then apply the results of Section 3 to provide conditions under which the Chen-Fukushima algorithm is locally super-linearly convergent when applied with the BFGS matrix secant updating strategy. This is a new convergence result for the Chen-Fukushima algorithm.

A word about our notation is in order. We denote the closed unit ball in $\mathbb{R}^n$ by $\mathbb{B}$. Then the ball with center $a$ and radius $r$ is denoted by $a + r\mathbb{B}$. Given a set $Z \subset \mathbb{R}^n$ and an element $z \in \mathbb{R}^n$, the distance of $z$ to $Z$ is $\mathrm{dist}(z, Z) = \inf\{\|z - z'\| : z' \in Z\}$. Given a multi-function (also referred to as a mapping or an operator depending on the context) $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ (here the double arrow $\rightrightarrows$ is used to signify the fact that $T$ is a multi-function), the graph of $T$, $\mathrm{graph}(T)$, is the subset of the product space $\mathbb{R}^n \times \mathbb{R}^n$ defined by $\mathrm{graph}\,T = \{(z, w) \in \mathbb{R}^n \times \mathbb{R}^n \mid w \in T(z)\}$. The domain of $T$ is the set $\mathrm{dom}\,T := \{z \in \mathbb{R}^n \mid T(z) \neq \emptyset\}$. The identity mapping will be denoted by $I$. The inverse of an operator $T$ is defined by $T^{-1}(w) := \{z \in \mathbb{R}^n \mid (z, w) \in \mathrm{graph}\,T\}$. Given a lower semi-continuous convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, the conjugate of $f$ is defined by $f^*(z^*) = \sup_{z \in \mathbb{R}^n}\{\langle z, z^*\rangle - f(z)\}$, and the subdifferential of $f$ is the multi-function defined by $\partial f(z) = \{y \in \mathbb{R}^n : f(z') \geq f(z) + \langle y, z' - z\rangle \text{ for all } z' \in \mathbb{R}^n\}$.
2 The Variable Metric Proximal Point Algorithm

We say that the multi-function $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is monotone if for every $(z, w)$ and $(z', w')$ in $\mathrm{graph}(T)$ we have $\langle z - z', w - w'\rangle \geq 0$. The monotone operator $T$ is said to be maximal if its graph is not properly contained in the graph of any other monotone operator. Recall that the proximal point algorithm for solving the inclusion $0 \in T(z)$, where $T$ is maximal monotone, generates a sequence $\{z^k\}$ satisfying the approximation rule

$$z^{k+1} \approx (I + \lambda_k T)^{-1}(z^k)$$

for a given sequence of positive scalars $\{\lambda_k\}$. In the case of convex programming, the function $f_\lambda$ (the Moreau-Yosida regularization of $f$) is continuously differentiable [15] with $\nabla f_\lambda(z) = -\lambda^{-1}w(\lambda, z)$, where $w(\lambda, z) = P(\lambda, z) - z$ and $P(\lambda, z)$ is the unique solution to the minimization problem in (1). The proximal point iteration then has the form

$$z^{k+1} = z^k + w^k, \quad \text{where } w^k \approx -\lambda_k \nabla f_{\lambda_k}(z^k);$$

that is, it is the method of steepest descent with unit step size applied to the function $\lambda_k f_{\lambda_k}$, with $\lambda_k$ varying between iterations. The algorithm for a general maximal monotone operator $T$ can be formally derived from this iteration by replacing $\partial f$ by $T$ and $-\lambda_k \nabla f_{\lambda_k}$ by the operator

$$D_{\lambda_k} = [(I + \lambda_k T)^{-1} - I]. \qquad (2)$$

The operator $D_{\lambda_k}$ yields the analog of the direction of steepest descent. The proximal point algorithm takes the form

$$z^{k+1} = z^k + w^k, \quad \text{where } w^k \approx D_{\lambda_k}(z^k).$$

A Newton-like variation on this iteration yields the VMPPA.
The Variable Metric Proximal Point Algorithm: Let $z^0 \in H$ and $\lambda_0 \geq 1$ be given. Having $z^k$, set

$$z^{k+1} := z^k + H_k w^k,$$

where $w^k \approx D_{\lambda_k}(z^k)$, and choose $\lambda_{k+1} \geq 1$.
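A minimal numerical sketch of this iteration (our own illustration with hypothetical names, assuming the linear monotone operator $T(z) = Az$ so that the resolvent in (2) reduces to a linear solve):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # T(z) = A z with A positive definite
lam = 1.0

def D(z, lam):
    """D_lam(z) = (I + lam*T)^{-1}(z) - z, eq. (2); for linear T this is
    the solution of (I + lam*A) p = z, minus z."""
    p = np.linalg.solve(np.eye(2) + lam * A, z)
    return p - z

z = np.array([4.0, -3.0])
H = np.eye(2)                       # H_k = I recovers the classical PPA
for k in range(60):
    z = z + H @ D(z, lam)           # z^{k+1} = z^k + H_k w^k, w^k = D(z^k)

assert np.linalg.norm(z) < 1e-8     # iterates approach the solution of 0 = A z
```

With $H_k = I$ this is exactly the classical proximal point algorithm; the variable metric versions studied below replace $I$ by secant approximations.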
In this article we focus on local convergence properties. For this we require the following approximation criterion from [3]:

$$(L) \qquad \|w^k - D_{\lambda_k}(z^k)\| \leq \epsilon_k \|w^k\| \quad \text{with} \quad \sum_{k=0}^{\infty} \epsilon_k < +\infty.$$
This approximation criterion differs slightly from the corresponding condition in [3], where it is only assumed that $\lim_k \epsilon_k = 0$. We require the stronger condition on the $\epsilon_k$'s for our super-linear convergence results. Unfortunately, criterion (L) is impractical from the perspective of implementation. However, in [3, Proposition 1], it is shown that the approximation criterion
$$(L') \qquad \mathrm{dist}(0, S_k(w^k)) \leq \frac{\epsilon_k}{\lambda_k}\|w^k\| \quad \text{with} \quad \sum_{k=0}^{\infty} \epsilon_k < +\infty,$$

where $S_k(w) = T(z^k + w) + \frac{1}{\lambda_k} w$, implies that criterion (L) is satisfied. It is not necessary to locate an element of least norm in $S_k(w^k)$ in order to assure that (L') is satisfied.

Before leaving this section, we recall from [20] a few properties of the operators $D_{\lambda_k}$ and $P_{\lambda_k} := D_{\lambda_k} + I$ that are essential in the analysis to follow.
Proposition 1 [20, Proposition 1]

a) The operator $D_{\lambda_k}$ can be expressed as $D_{\lambda_k} = -(I + T^{-1}\tfrac{1}{\lambda_k})^{-1}$, and for any $z \in H$, $-\tfrac{1}{\lambda_k} D_{\lambda_k}(z) \in T(P_{\lambda_k}(z))$.

b) For any $z, z' \in H$, $\langle P_{\lambda_k}(z) - P_{\lambda_k}(z'), D_{\lambda_k}(z) - D_{\lambda_k}(z')\rangle \leq 0$.

c) For any $z, z' \in H$, $\|P_{\lambda_k}(z) - P_{\lambda_k}(z')\|^2 + \|D_{\lambda_k}(z) - D_{\lambda_k}(z')\|^2 \leq \|z - z'\|^2$.

Remark An important consequence of Part c) above is that the operators $P_{\lambda_k}$ and $D_{\lambda_k}$ are non-expansive. We make free use of this fact in subsequent sections.
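These properties are easy to observe numerically. The following sketch (ours, assuming the linear operator $T(z) = Az$ with $A$ positive semi-definite, so $P$ is the linear resolvent) checks the inequality of Part c) on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T                     # positive semi-definite => T(z) = A z is monotone
lam = 0.7
R = np.linalg.inv(np.eye(3) + lam * A)   # resolvent P_lam as a matrix

for _ in range(100):
    z, zp = rng.standard_normal(3), rng.standard_normal(3)
    dP = R @ (z - zp)           # P(z) - P(z')
    dD = dP - (z - zp)          # D(z) - D(z') since D = P - I
    # firm non-expansiveness: ||dP||^2 + ||dD||^2 <= ||z - z'||^2
    assert dP @ dP + dD @ dD <= (z - zp) @ (z - zp) + 1e-12
```

In particular both $\|dP\|$ and $\|dD\|$ are at most $\|z - z'\|$, i.e., $P_{\lambda}$ and $D_{\lambda}$ are non-expansive.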
3 Matrix Secant Updating

In this section, we consider specific matrix secant updating rules for the matrices $H_k$ and examine the local convergence behavior of a sequence $\{z^k\}$ generated by the VMPPA stated in Section 2. In [3, Theorem 19], we studied conditions yielding the global linear convergence of this sequence. Here we assume the linear convergence of the iterates and show that the Broyden and BFGS updating rules guarantee local super-linear convergence under the assumption that the operator $T^{-1}$ satisfies the quadratic growth condition.
Definition 2 We say that an operator $\Psi : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is Lipschitz continuous at a point $\bar{w}$ (with modulus $\alpha \geq 0$) if the set $\Psi(\bar{w})$ is nonempty and there is a $\tau > 0$ such that

$$\Psi(w) \subset \Psi(\bar{w}) + \alpha\|w - \bar{w}\|\,\mathbb{B} \quad \text{whenever } \|w - \bar{w}\| \leq \tau.$$

We say that $\Psi$ is differentiable at a point $\bar{w}$ if $\Psi(\bar{w})$ consists of a single element $\bar{z}$ and there is a continuous linear transformation $J : \mathbb{R}^n \to \mathbb{R}^n$ such that for some $\tau > 0$,

$$\emptyset \neq \Psi(w) - \bar{z} - J(w - \bar{w}) \subset o(\|w - \bar{w}\|)\,\mathbb{B} \quad \text{whenever } \|w - \bar{w}\| \leq \tau,$$

and we write $J = \nabla\Psi(\bar{w})$. Finally, we say that the operator $\Psi$ satisfies the quadratic growth condition at $\bar{w}$ if $\Psi$ is differentiable at $\bar{w}$ and there are constants $K \geq 0$ and $\tau > 0$ such that

$$\Psi(w) - \Psi(\bar{w}) - \nabla\Psi(\bar{w})(w - \bar{w}) \subset K\|w - \bar{w}\|^2\,\mathbb{B} \quad \text{whenever } \|w - \bar{w}\| \leq \tau.$$
Remarks

1) Rockafellar [20, Theorem 2] was the first to use Lipschitz continuity to establish rates of convergence for the proximal point algorithm.

2) When the set $\Psi(\bar{w})$ is restricted to be a singleton $\{\bar{z}\}$, the differentiability of $\Psi$ at $\bar{w}$ implies the Lipschitz continuity of $\Psi$ at $\bar{w}$. Moreover, one can take $\alpha(\tau) \to \|J\|$ as $\tau \to 0$. This observation is verified in [20, Proposition 4].

3) This notion of differentiability corresponds to the usual notion of differentiability in the case when $\Psi$ is single-valued.

4) It follows from the definition of monotonicity that if $T$ is a maximal monotone operator, then the operator $\nabla T(x)$ is positive semi-definite, if it exists.

5) In [3, Example 6], we give an example of a convex function $f$ for which $(\partial f)^{-1}$ is Lipschitz continuous but not differentiable. In [3, Example 7], we show that it is possible to choose $f$ so that $(\partial f)^{-1}$ is differentiable at the origin, but does not satisfy the quadratic growth condition there.
The quadratic growth condition is indeed a strong smoothness property. However, in the case of convex programming, this condition is weaker than the standard hypothesis used for establishing the rapid local convergence of optimization algorithms and, more pointedly, is weaker than the conditions typically employed for the local analysis of variable metric proximal point algorithms [1, 5, 11, 12]. Specifically, for convex programs, the operator $T$ is the subdifferential of the essential objective function $f$. In this case, the assumption that the standard second-order sufficiency condition is satisfied at the solution to the convex program implies that the operator $(\partial f)^{-1}$ satisfies the quadratic growth condition at the origin (see [20, Proposition 2] and [3, Theorem 8]). Thus, it is not surprising that we require a condition of this type in our local convergence analysis.
3.1 Statement of Updates and Convergence Results

We now consider the BFGS and Broyden update formulas for the matrices $H_k$. The BFGS update is suitable for convex programming applications since it preserves both symmetry and positive definiteness. Broyden's update is suitable for mini-max problems since in many of these applications the Jacobian $\nabla(T^{-1})$ is non-symmetric when it exists.
Symmetric Updating with BFGS: Let $H_0 \in \mathbb{R}^{n \times n}$ be any positive definite symmetric matrix and for $k \geq 0$, set $y^k = w^k - w^{k+1}$ and $s^k = z^{k+1} - z^k$. If $\langle y^k, s^k\rangle \leq 0$, set $H_{k+1} = H_k$; otherwise, set

$$H_{k+1} = H_k + \frac{(s^k - H_k y^k)s^{kT} + s^k(s^k - H_k y^k)^T}{\langle y^k, s^k\rangle} - \frac{\langle s^k - H_k y^k, y^k\rangle\, s^k s^{kT}}{\langle y^k, s^k\rangle^2}. \qquad (3)$$

Non-symmetric Updating with Broyden's Formula: Let $H_0 = I$ and for $k \geq 0$, set $y^k = w^k - w^{k+1}$ and $s^k = z^{k+1} - z^k$. If $s^{kT} H_k y^k = 0$, set $H_{k+1} = I$; otherwise, set

$$H_{k+1} = H_k + \frac{(s^k - H_k y^k)s^{kT} H_k}{s^{kT} H_k y^k}. \qquad (4)$$
Remarks

1. The updating formula in (3) is the formula for updating the inverse in BFGS updating. The condition $y^{k-1\,T} s^{k-1} > 0$ in the BFGS update not only ensures the existence of the inverse, but also ensures that the updates are positive definite. The corresponding formula for the direct approximation of $\nabla D_{\lambda_k}(z^k)$ (when it exists) is given by

$$B_{k+1} = B_k - \frac{B_k s^k s^{kT} B_k}{s^{kT} B_k s^k} + \frac{y^k y^{kT}}{y^{kT} s^k}, \qquad (5)$$

where $B_k = H_k^{-1}$ for all $k \geq 0$.

2. The updating formula in (4) is the formula for updating the inverse of Broyden's update. The condition $s^{k-1\,T} H_{k-1} y^{k-1} \neq 0$ is satisfied if and only if the inverse Broyden updates are well-defined and nonsingular, in which case

$$A_{k+1} = A_k + \frac{(y^k - A_k s^k)s^{kT}}{s^{kT} s^k}, \qquad (6)$$

where $A_k = H_k^{-1}$ for all $k \geq 0$.
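As a consistency check (our own sketch, not part of the paper), one can verify numerically that the inverse-form updates (3) and (4) agree with inverting the direct updates (5) and (6); the agreement is an instance of the Sherman-Morrison formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)       # H_k: symmetric positive definite
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s <= 0:                    # enforce the curvature condition <y, s> > 0
    y = -y

# Inverse BFGS update (3):
r = s - H @ y
ys = y @ s
H_next = H + (np.outer(r, s) + np.outer(s, r)) / ys \
           - (r @ y) * np.outer(s, s) / ys**2

# Direct BFGS update (5) applied to B_k = H_k^{-1}:
B = np.linalg.inv(H)
B_next = B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / ys
assert np.allclose(H_next, np.linalg.inv(B_next))    # (3) inverts (5)

# Broyden: inverse form (4) versus direct form (6) on A_k = H_k^{-1}:
A = np.linalg.inv(H)
H_br = H + np.outer(s - H @ y, s @ H) / (s @ H @ y)  # (4)
A_br = A + np.outer(y - A @ s, s) / (s @ s)          # (6)
assert np.allclose(H_br, np.linalg.inv(A_br))        # (4) inverts (6)
```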
We require the following hypotheses in our convergence analysis:

(H1) The operator $T^{-1}$ satisfies the quadratic growth condition at the origin with $J := \nabla(T^{-1})(0)$ and $T^{-1}(0) = \{\bar{z}\}$.

(H2) The approximation criterion (L) is satisfied at every iteration.

(H3) The sequence $\{z^k\}$ converges linearly to $\bar{z}$.

(H4) There is an iteration index $\bar{k}$ such that $\lambda_k = \lambda > 2\|J\|$ for all $k \geq \bar{k}$.
Theorem 3 (Symmetric Updating) Let $\{z^k\}$ be any sequence generated by the variable metric proximal point algorithm using the symmetric updating strategy, and suppose that the hypotheses (H1)-(H4) are all satisfied. If $J$ is symmetric, then

(i) for all $k$ sufficiently large, $y^{kT} s^k > 0$,

(ii) the sequences $\{\|H_k\|\}$ and $\{\|H_k^{-1}\|\}$ are bounded,

(iii) $H_k = B_k^{-1}$ for all $k$ sufficiently large, and

(iv) the sequence converges to $\bar{z}$ at a super-linear rate.
Theorem 4 (Non-Symmetric Updating) Let $\{z^k\}$ be any sequence generated by the variable metric proximal point algorithm using the non-symmetric updating strategy, and suppose that the hypotheses (H1)-(H4) are all satisfied. If it is further assumed that there exists $\bar{k} > 0$ such that $\lambda_k = \lambda > 3\|J\|$ for all $k \geq \bar{k}$, then

(i) there is a $\hat{k} \geq \bar{k}$ such that for all $k \geq \hat{k}$ we have $s^{k-1\,T} H_{k-1} y^{k-1} \neq 0$ and $H_k$ is updated using Broyden's formula,

(ii) the sequences $\{\|H_k\|\}$ and $\{\|H_k^{-1}\|\}$ are bounded, and

(iii) the sequence converges to $\bar{z}$ at a super-linear rate.
Our proofs of these results are based on an extension of the Dennis-Moré characterization theorem for superlinear convergence [8]. This result is stated below for the reader's convenience.

Theorem 5 [3, Theorem 20, Super-Linear Convergence] Let $\{z^k\}$ be any sequence generated by the variable metric proximal point algorithm satisfying criterion (L) for all $k$. Suppose that the operator $T^{-1}$ is differentiable at the origin with $T^{-1}(0) = \{\bar{z}\}$ and $\nabla T^{-1}(0) = J$. If $\lim_k D_{\lambda_k}(z^k) = 0$, then $\{z^k\}$ converges to the solution $\bar{z}$ super-linearly if and only if

$$\frac{\big\|[I - (I + \tfrac{1}{\lambda_k} J) H_k^{-1}](z^{k+1} - z^k)\big\|}{\|z^{k+1} - z^k\|} \to 0 \quad \text{as } k \to \infty.$$
3.2 Convergence Proofs

Several technical lemmas are required to prepare the way for the proofs of Theorems 3 and 4. We begin with three lemmas that depend only on the structure of the algorithm and not on the specific choice of the updates $\{H_k\}$.
Lemma 6 Under hypotheses (H2) and (H3), there are positive numbers $L_0, \ldots, L_6$ such that for all $k$ sufficiently large

(a) $\|w^k\| \leq L_0\|s^k\|$ and $\|w^k\| \leq L_0\|D_{\lambda_k}(z^k)\|$,

(b) $L_1\|w^k\| \leq \|z^k - \bar{z}\| \leq L_2\|w^k\|$,

(c) $L_3\|z^k - \bar{z}\| \leq \|s^k\| \leq L_4\|z^k - \bar{z}\|$,

(d) $\|D_{\lambda_k}(z^k)\| \leq L_5\|s^k\|$, and

(e) $\|s^{k+1}\| \leq L_6\|s^k\|$.

If it is further assumed that hypotheses (H1) and (H4) hold, then for all $k$ sufficiently large

(f) $\|D_{\lambda_k}(z^k) - D_{\lambda_{k+1}}(z^{k+1})\| \geq \tfrac{1}{2}\|s^k\|$.
Proof The second inequality of (c) follows from the linear convergence of $z^k$ to $\bar{z}$. The first inequality in (c) also follows from linear convergence, since there exists $0 < \theta < 1$ such that $\|z^{k+1} - \bar{z}\| \leq \theta\|z^k - \bar{z}\|$, so that

$$\|z^k - \bar{z}\| \leq \frac{1}{1 - \theta}\|s^k\| \qquad (7)$$

for all $k$ sufficiently large. By (H2) and Proposition 1(c), we have

$$\|w^k\| \leq \frac{1}{1 - \epsilon_k}\|D_{\lambda_k}(z^k)\| \leq \frac{1}{1 - \epsilon_k}\|z^k - \bar{z}\|, \qquad (8)$$

which proves the first inequality in (b). The second inequality in (b) follows immediately from [3, Lemma 14, Part (ii)]. To see (a), just combine (7) and (8). The relation (d) follows from Proposition 1(c) and the first inequality in (c), while (e) follows from the linear convergence of $\{z^k\}$ and both inequalities in (c). Thus, (a)-(e) have been established.

We now show Part (f). By (H4), $\lambda_k = \lambda$ for all $k$ large; write $D_\lambda = D_{\lambda_k}$. Clearly, $\|D_\lambda(z^k)\| \to 0$ by (H3) and (a). Hence (H1) implies that for all $k$ sufficiently large

$$T^{-1}(-\tfrac{1}{\lambda} D_\lambda(z^k)) - \bar{z} - J(-\tfrac{1}{\lambda} D_\lambda(z^k)) \subset o(\|D_\lambda(z^k)\|)\,\mathbb{B},$$

or

$$(I + T^{-1}\tfrac{1}{\lambda})(-D_\lambda(z^k)) - \bar{z} + (I + \tfrac{1}{\lambda} J) D_\lambda(z^k) \subset o(\|D_\lambda(z^k)\|)\,\mathbb{B}.$$

But $D_\lambda(z^k) = -(I + T^{-1}\tfrac{1}{\lambda})^{-1}(z^k)$, hence $z^k \in (I + T^{-1}\tfrac{1}{\lambda})(-D_\lambda(z^k))$. This yields

$$z^k - \bar{z} \in -(I + \tfrac{1}{\lambda} J) D_\lambda(z^k) + o(\|D_\lambda(z^k)\|)\,\mathbb{B}. \qquad (9)$$

From (9), Proposition 1(c), and (d), we conclude that

$$z^k - \bar{z} + (I + \tfrac{1}{\lambda} J) D_\lambda(z^k) \in o(\|D_\lambda(z^k)\|)\,\mathbb{B} \subset o(\|s^k\|)\,\mathbb{B}.$$

Hence, for all $k$ large,

$$s^k \in -(I + \tfrac{1}{\lambda} J)(D_\lambda(z^{k+1}) - D_\lambda(z^k)) + [o(\|s^k\|) + o(\|s^{k+1}\|)]\,\mathbb{B}.$$

Therefore, by (e) and (H4), we have

$$\|s^k\| < \tfrac{3}{2}\|D_\lambda(z^{k+1}) - D_\lambda(z^k)\| + o(\|s^k\|)$$

for all $k$ large, since $\|I + \tfrac{1}{\lambda} J\| < \tfrac{3}{2}$. This establishes (f).
Lemma 7 If hypotheses (H1)-(H4) hold, then $y^{kT} s^k > 0$ for all $k$ sufficiently large.

Proof By (H4), $\lambda_k = \lambda$ and $D_{\lambda_k} = D_\lambda$ for all $k$ large. Letting $z = z^{k+1}$ and $z' = z^k$ in Proposition 1(b) and recalling that $P_\lambda = I + D_\lambda$ yields

$$[(I + D_\lambda)(z^{k+1}) - (I + D_\lambda)(z^k)]^T[D_\lambda(z^k) - D_\lambda(z^{k+1})] \geq 0,$$

or

$$[z^{k+1} - z^k + D_\lambda(z^{k+1}) - D_\lambda(z^k)]^T[D_\lambda(z^k) - D_\lambda(z^{k+1})] \geq 0.$$

Hence, by (f) of Lemma 6,

$$s^{kT}[D_\lambda(z^k) - D_\lambda(z^{k+1})] \geq \|D_\lambda(z^k) - D_\lambda(z^{k+1})\|^2 \geq \tfrac{1}{4}\|s^k\|^2$$

for all $k$ sufficiently large. Therefore

$$s^{kT} y^k \geq \tfrac{1}{4}\|s^k\|^2 + s^{kT}(w^k - D_\lambda(z^k)) - s^{kT}(w^{k+1} - D_\lambda(z^{k+1})) \geq \tfrac{1}{4}\|s^k\|^2 - \|s^k\|\big[\|w^k - D_\lambda(z^k)\| + \|w^{k+1} - D_\lambda(z^{k+1})\|\big]. \qquad (10)$$

By (H2) and Lemma 6(a),

$$\|w^k - D_\lambda(z^k)\| \leq \epsilon_k\|w^k\| \leq \epsilon_k L_0\|s^k\|. \qquad (11)$$

Hence, by (10), (11), and Lemma 6(e),

$$s^{kT} y^k \geq \tfrac{1}{4}\|s^k\|^2 - \|s^k\|\big[\epsilon_k L_0\|s^k\| + \epsilon_{k+1} L_0\|s^{k+1}\|\big] \geq \big(\tfrac{1}{4} - \epsilon_k L_0 - \epsilon_{k+1} L_0 L_6\big)\|s^k\|^2.$$

Now, for all $k$ sufficiently large, $L_0(\epsilon_k + \epsilon_{k+1} L_6) < \tfrac{1}{4}$; hence, for such $k$, $y^{kT} s^k > 0$.
Lemma 8 If (H1)-(H4) hold, then there are positive numbers $L_7$, $L_8$, and $L_9$ such that

$$\frac{\|y^k - (I + \tfrac{1}{\lambda} J)^{-1} s^k\|}{\|s^k\|} \leq L_7\|s^k\| + L_8\epsilon_k + L_9\epsilon_{k+1} \qquad (12)$$

for all $k$ sufficiently large. In particular,

$$\sum_{k=0}^{\infty} \frac{\|y^k - (I + \tfrac{1}{\lambda} J)^{-1} s^k\|}{\|s^k\|} < \infty. \qquad (13)$$

Proof By the quadratic growth condition (H1), the argument establishing (9) in the proof of Lemma 6 shows that there is an $M_1 > 0$ such that, for all $k$ sufficiently large,

$$(I + \tfrac{1}{\lambda} J)^{-1} s^k - y^k \in (D_\lambda(z^k) - w^k) - (D_\lambda(z^{k+1}) - w^{k+1}) + M_1\|s^k\|^2\,\mathbb{B}.$$

By (11) and Lemma 6(a) and (e),

$$\|y^k - (I + \tfrac{1}{\lambda} J)^{-1} s^k\| \leq (L_0\epsilon_k + L_0 L_6\epsilon_{k+1} + M_1\|s^k\|)\|s^k\|.$$

Therefore (12) holds for all $k$ large. Since $z^k \to \bar{z}$ linearly, Lemma 6(c) implies that $\sum_{k=0}^{\infty}\|s^k\| < \infty$. Thus, (12) and the hypothesis that $\sum_{k=1}^{\infty}\epsilon_k < \infty$ imply (13).

The proof of Theorem 3 now follows easily from the following result due to Byrd and Nocedal [4].
Theorem 9 (Byrd and Nocedal, 1989) Let $\{B_k\}$ be generated by the BFGS formula

$$B_{k+1} = B_k - \frac{B_k s^k s^{kT} B_k}{s^{kT} B_k s^k} + \frac{y^k y^{kT}}{y^{kT} s^k},$$

where $B_1$ is symmetric and positive definite, and where $y^{kT} s^k > 0$ for all $k$. Furthermore, assume that $\{s^k\}$ and $\{y^k\}$ are such that

$$\|y^k - G s^k\| \leq \epsilon_k\|s^k\| \qquad (14)$$

for some symmetric and positive definite matrix $G$, and for some sequence $\{\epsilon_k\}$ with the property $\sum_{k=1}^{\infty}\epsilon_k < \infty$. Then

$$\lim_{k \to \infty} \frac{\|(B_k - G)s^k\|}{\|s^k\|} = 0,$$

and the sequences $\{\|B_k\|\}$, $\{\|B_k^{-1}\|\}$ are bounded.
Proof of Theorem 3 By Lemma 7 and (H4), $\lambda_k = \lambda > 2\|J\|$ and $y^{kT} s^k > 0$ for all $k$ large. Since $\lambda > \|J\|$ and $J$ is symmetric, $(I + \tfrac{1}{\lambda} J)^{-1}$ is symmetric and positive definite. Also, Lemma 8 implies that (14) is satisfied with $G = (I + \tfrac{1}{\lambda} J)^{-1}$. Consequently, by Theorem 9, both $\{\|B_k\|\}$ and $\{\|B_k^{-1}\|\}$ are bounded and

$$\frac{\|(B_k - (I + \tfrac{1}{\lambda} J)^{-1})s^k\|}{\|s^k\|} \to 0, \quad \text{or equivalently,} \quad \frac{\|(I - (I + \tfrac{1}{\lambda} J)B_k)s^k\|}{\|s^k\|} \to 0.$$

Therefore, $z^k \to \bar{z}$ at a super-linear rate by Theorem 5.

The proof of Theorem 4 uses two more technical lemmas.
Lemma 10 [9, Lemma 8.2.5] Let $s \in \mathbb{R}^n$ be nonzero, $E \in \mathbb{R}^{n \times n}$, and let $\|\cdot\|_F$ denote the Frobenius norm. Then

$$\left\|E\Big(I - \frac{s s^T}{s^T s}\Big)\right\|_F = \left(\|E\|_F^2 - \Big(\frac{\|Es\|}{\|s\|}\Big)^2\right)^{1/2} \leq \|E\|_F - \frac{1}{2\|E\|_F}\Big(\frac{\|Es\|}{\|s\|}\Big)^2.$$
Lemma 11 Let $A_0, A_1, \ldots, A_k$ be generated by the Broyden update formula. Then for any matrix $G$ we have

$$\|A_{k+1} - G\| \leq \|A_0 - G\| + \sum_{j=0}^{k}\frac{\|y^j - G s^j\|}{\|s^j\|}. \qquad (15)$$

Proof We only need to show that

$$\|A_{k+1} - G\| \leq \|A_k - G\| + \frac{\|y^k - G s^k\|}{\|s^k\|}. \qquad (16)$$

To see this, we have

$$A_{k+1} - G = A_k - G + \frac{(y^k - A_k s^k)s^{kT}}{s^{kT} s^k} = A_k - G + \frac{(G s^k - A_k s^k)s^{kT}}{s^{kT} s^k} + \frac{(y^k - G s^k)s^{kT}}{s^{kT} s^k} = (A_k - G)\Big(I - \frac{s^k s^{kT}}{s^{kT} s^k}\Big) + \frac{(y^k - G s^k)s^{kT}}{s^{kT} s^k}.$$

Since $I - \frac{s^k s^{kT}}{s^{kT} s^k}$ is a projection matrix, we have $\big\|I - \frac{s^k s^{kT}}{s^{kT} s^k}\big\| = 1$. Also, for any vectors $u, v \in \mathbb{R}^n$, $\|u v^T\| = \|u\|\|v\|$. Therefore

$$\|A_{k+1} - G\| \leq \|A_k - G\|\left\|I - \frac{s^k s^{kT}}{s^{kT} s^k}\right\| + \frac{\|y^k - G s^k\|\|s^k\|}{\|s^k\|^2} = \|A_k - G\| + \frac{\|y^k - G s^k\|}{\|s^k\|}.$$
Proof of Theorem 4 Set $G = (I + \tfrac{1}{\lambda} J)^{-1}$. Then, by the Banach Lemma,

$$\|I - G\| \leq \sum_{i=1}^{\infty}\Big(\frac{\|J\|}{\lambda}\Big)^i < \frac{1}{2}. \qquad (17)$$

By (13), there is a $k_0 \geq \bar{k}$ such that

$$\sum_{j=k_0}^{\infty}\frac{\|y^j - G s^j\|}{\|s^j\|} \leq \frac{1}{2} - \|I - G\|. \qquad (18)$$

To show (i), we need only show that there exists a $\hat{k} \geq k_0$ such that the matrices $A_k$ defined in (6) are nonsingular for all $k \geq \hat{k}$. If we cannot take $\hat{k} = k_0$, then there is a $\hat{k} > k_0$ such that $H_{\hat{k}} = I$ (i.e., $s^{\hat{k}-1\,T} H_{\hat{k}-1} y^{\hat{k}-1} = 0$). We claim that for this choice of $\hat{k}$ the matrices $A_k$ are non-singular for all $k \geq \hat{k}$. To see this, note that for all $k \geq \hat{k}$,

$$\|A_k - I\| \leq \|I - G\| + \|A_k - G\| \leq 2\|I - G\| + \sum_{j=\hat{k}}^{k-1}\frac{\|y^j - G s^j\|}{\|s^j\|} < \frac{1}{2} + \|I - G\| < 1$$

by Lemma 11, (18), and (17). Therefore, $A_k$ is non-singular for all $k \geq \hat{k}$ with

$$\|A_k\| \leq 2 \quad \text{and} \quad \|A_k^{-1}\| \leq \frac{1}{\frac{1}{2} - \|I - G\|},$$

which also verifies (ii).

We now show (iii). Set $E_k := A_k - G$ and $\delta_k := \frac{\|y^k - G s^k\|}{\|s^k\|}$, and recall that for any vectors $u, v \in \mathbb{R}^n$, $\|u v^T\|_F = \|u\|\|v\|$. Let $k \geq \hat{k}$. From (16), we have

$$\|E_{k+1}\|_F \leq \left\|E_k\Big(I - \frac{s^k s^{kT}}{s^{kT} s^k}\Big)\right\|_F + \frac{\|y^k - G s^k\|\|s^k\|}{\|s^k\|^2} = \left\|E_k\Big(I - \frac{s^k s^{kT}}{s^{kT} s^k}\Big)\right\|_F + \delta_k.$$

By Lemma 10,

$$\|E_{k+1}\|_F \leq \|E_k\|_F - \frac{1}{2\|E_k\|_F}\Big(\frac{\|E_k s^k\|}{\|s^k\|}\Big)^2 + \delta_k,$$

or

$$\Big(\frac{\|E_k s^k\|}{\|s^k\|}\Big)^2 \leq 2\|E_k\|_F(\|E_k\|_F - \|E_{k+1}\|_F + \delta_k) \leq L(\|E_k\|_F - \|E_{k+1}\|_F + \delta_k),$$

where $L$ bounds $2\|E_k\|_F$ (recall that $\|A_k\| \leq 2$). This means

$$\sum_{k=\hat{k}}^{N}\Big(\frac{\|E_k s^k\|}{\|s^k\|}\Big)^2 \leq L\Big(\|E_{\hat{k}}\|_F - \|E_{N+1}\|_F + \sum_{k=\hat{k}}^{N}\delta_k\Big),$$

and so

$$\sum_{k=\hat{k}}^{\infty}\Big(\frac{\|E_k s^k\|}{\|s^k\|}\Big)^2 < \infty.$$

In particular, $\frac{\|(A_k - G)s^k\|}{\|s^k\|} \to 0$, or equivalently, $\frac{\|[I - (I + \frac{1}{\lambda} J)H_k^{-1}]s^k\|}{\|s^k\|} \to 0$. Therefore, $z^k \to \bar{z}$ at a super-linear rate by Theorem 5, and (iii) is established.
$\bar{k} \geq 0$ such that for all $k \geq \bar{k}$,

$$\|(H_k - (I + \tfrac{1}{\lambda} J))w^k\| \leq \beta_k\|w^k\|, \qquad (29)$$

$$\epsilon_k < \tfrac{1}{2}, \qquad (30)$$

and

$$\alpha + \beta_k + \Big(1 + \frac{\|J\|}{\lambda}\Big)\epsilon_k \leq \sigma(1 - \epsilon), \qquad (31)$$

where $\epsilon = \sup_{k \geq \bar{k}}\epsilon_k$ and $\sigma$ is the parameter used in Step 4 of the algorithm. Then there exists an index $\hat{k}$ such that for all $k \geq \hat{k}$ we have $t_k = 1$ and

$$\|z^{k+1} - \bar{z}\| \leq \sigma\|z^k - \bar{z}\|;$$

that is, the convergence rate is linear.
The proof of Theorem 15 requires the following technical lemma.

Lemma 16 Under the hypotheses of Theorem 15 we have

(a) $(\partial f)^{-1}(-\tfrac{1}{\lambda} D_\lambda(z^k)) - \bar{z} - J(-\tfrac{1}{\lambda} D_\lambda(z^k)) \subset o(\|D_\lambda(z^k)\|)\,\mathbb{B}$ for all $k$ sufficiently large, and

(b) if $t_k = 1$, then

$$\|(I + \tfrac{1}{\lambda} J)H_k^{-1}(z^{k+1} - (I + H_k D_\lambda)(z^k))\| \leq \epsilon_k\Big(1 + \frac{\|J\|}{\lambda}\Big)\|w^k\|$$

and $(I + \tfrac{1}{\lambda} J)H_k^{-1}(z^{k+1} - (I + H_k D_\lambda)(z^k)) \in O(\epsilon_k\|D_\lambda(z^k)\|)\,\mathbb{B}$.
Proof First note that conditions (29) and (31) imply that $\{\|H_k w^k\|/\|w^k\|\}$ is bounded. Hence the conditions in Theorem 12 and Corollary 13 are all satisfied. In particular, $z^k \to \bar{z}$ with $D_\lambda(z^k) \to 0$ and $w^k \to 0$. Let $\tau > 0$ be such that

$$(\partial f)^{-1}(v) - Jv - \bar{z} \subset o(\|v\|)\,\mathbb{B} \qquad (32)$$

whenever $\|v\| < \tau$. Let $k_1$ be such that $\tfrac{1}{\lambda}\|D_\lambda(z^k)\| \leq \tau$ whenever $k > k_1$. When $k > k_1$, the inclusion (32) implies that

$$(\partial f)^{-1}(-\tfrac{1}{\lambda} D_\lambda(z^k)) - \bar{z} + J(\tfrac{1}{\lambda} D_\lambda(z^k)) \subset o(\|D_\lambda(z^k)\|)\,\mathbb{B},$$

which proves (a).

We now show (b). If $t_k = 1$, then $z^{k+1} = z^k + H_k w^k$. Hence

$$H_k^{-1}(z^{k+1} - (I + H_k D_\lambda)(z^k)) = H_k^{-1}(H_k w^k - H_k D_\lambda(z^k)) = w^k - D_\lambda(z^k).$$

By Proposition 14, $\|w^k - D_\lambda(z^k)\| \leq \epsilon_k\|w^k\|$. Therefore,

$$\|(I + \tfrac{1}{\lambda} J)H_k^{-1}(z^{k+1} - (I + H_k D_\lambda)(z^k))\| \leq \Big(1 + \frac{\|J\|}{\lambda}\Big)\|w^k - D_\lambda(z^k)\| \leq \epsilon_k\Big(1 + \frac{\|J\|}{\lambda}\Big)\|w^k\| \leq \frac{\epsilon_k}{1 - \epsilon_k}\Big(1 + \frac{\|J\|}{\lambda}\Big)\|D_\lambda(z^k)\|,$$

or equivalently,

$$(I + \tfrac{1}{\lambda} J)H_k^{-1}(z^{k+1} - (I + H_k D_\lambda)(z^k)) \in O(\epsilon_k\|D_\lambda(z^k)\|)\,\mathbb{B}.$$
Proof of Theorem 15: Conditions (29) and (31) imply that $\{\|H_k w^k\|/\|w^k\|\}$ is bounded. Hence the conditions in Theorem 12 and Corollary 13 are all satisfied. In particular, $z^k \to \bar{z}$ with $D_\lambda(z^k) \to 0$ and $w^k \to 0$. Let $k_1 \geq \bar{k}$ be such that Part (a) of Lemma 16 holds for all $k \geq k_1$. Then, by Lemma 16(a), Proposition 14(ii), and (30), there must exist a $\tilde{k} > k_1$ such that for $k > \tilde{k}$,

$$\left\|(\partial f)^{-1}(-\tfrac{1}{\lambda} D_\lambda(z^k)) - \bar{z} + J(\tfrac{1}{\lambda} D_\lambda(z^k))\right\| \leq \alpha\|w^k\|. \qquad (33)$$

Since $z^k \to \bar{z}$ and $\|w^k\| \to 0$, there must be a $\hat{k} > \tilde{k}$ such that $\|w^{\hat{k}}\|$ passes the acceptance test in Step 4 of the algorithm, i.e., $t_{\hat{k}} = 1$ and $z^{\hat{k}+1} = z^{\hat{k}} + H_{\hat{k}} w^{\hat{k}}$. Let $\tilde{z}^{\hat{k}+1} := (I + H_{\hat{k}} D_\lambda)(z^{\hat{k}})$, or equivalently, $\tilde{z}^{\hat{k}+1} = z^{\hat{k}} - H_{\hat{k}}(I + (\partial f)^{-1}\tfrac{1}{\lambda})^{-1}(z^{\hat{k}})$ (by Proposition 1(a)). Therefore,

$$z^{\hat{k}} \in (I + (\partial f)^{-1}\tfrac{1}{\lambda})[H_{\hat{k}}^{-1}(z^{\hat{k}} - \tilde{z}^{\hat{k}+1})] = H_{\hat{k}}^{-1}(z^{\hat{k}} - \tilde{z}^{\hat{k}+1}) + (\partial f)^{-1}[\tfrac{1}{\lambda} H_{\hat{k}}^{-1}(z^{\hat{k}} - \tilde{z}^{\hat{k}+1})].$$

By re-arranging this inclusion and using $H_{\hat{k}}^{-1}(z^{\hat{k}} - \tilde{z}^{\hat{k}+1}) = -D_\lambda(z^{\hat{k}})$, we obtain

$$z^{\hat{k}+1} - \bar{z} = z^{\hat{k}} - \bar{z} + (z^{\hat{k}+1} - z^{\hat{k}}) \in \big[(\partial f)^{-1}(-\tfrac{1}{\lambda} D_\lambda(z^{\hat{k}})) - \bar{z} + J(\tfrac{1}{\lambda} D_\lambda(z^{\hat{k}}))\big] + \big[I - (I + \tfrac{1}{\lambda} J)H_{\hat{k}}^{-1}\big](z^{\hat{k}+1} - z^{\hat{k}}) + (I + \tfrac{1}{\lambda} J)H_{\hat{k}}^{-1}(z^{\hat{k}+1} - \tilde{z}^{\hat{k}+1}).$$

Now consider the three terms in the sum on the right-hand side of this inclusion. By (33), the first of these terms is bounded in norm by $\alpha\|w^{\hat{k}}\|$. The definition of $\beta_k$ in (29) bounds the second by $\beta_{\hat{k}}\|w^{\hat{k}}\|$ (since $z^{\hat{k}+1} - z^{\hat{k}} = H_{\hat{k}} w^{\hat{k}}$), and Lemma 16(b) bounds the third by $(1 + \frac{\|J\|}{\lambda})\epsilon_{\hat{k}}\|w^{\hat{k}}\|$. Therefore,

$$\|z^{\hat{k}+1} - \bar{z}\| \leq \Big(\alpha + \beta_{\hat{k}} + \big(1 + \tfrac{\|J\|}{\lambda}\big)\epsilon_{\hat{k}}\Big)\|w^{\hat{k}}\|. \qquad (34)$$

But then

$$\|w^{\hat{k}+1}\| \leq \frac{1}{1 - \epsilon_{\hat{k}+1}}\|D_\lambda(z^{\hat{k}+1})\| \leq \frac{1}{1 - \epsilon}\|z^{\hat{k}+1} - \bar{z}\| \leq \frac{\alpha + \beta_{\hat{k}} + (1 + \frac{\|J\|}{\lambda})\epsilon_{\hat{k}}}{1 - \epsilon}\|w^{\hat{k}}\| \leq \sigma\|w^{\hat{k}}\|,$$

where the four inequalities follow from Proposition 14(ii), Proposition 1(c), (34), and (31), respectively. Hence $t_{\hat{k}+1} = 1$. Proceeding as above, we find that this implies that $t_k = 1$ for all $k \geq \hat{k}$. Therefore,

$$\|z^{k+1} - \bar{z}\| \leq \sigma(1 - \epsilon)\|w^k\| \leq \sigma\|D_\lambda(z^k)\| \leq \sigma\|z^k - \bar{z}\|$$

for all $k \geq \hat{k}$, where the first inequality follows from (34) and (31), the second from Proposition 14(ii), and the third from Proposition 1(c).
4.3 Super-Linear Convergence with BFGS Updating

Now that we have established conditions for the linear convergence of the Chen-Fukushima algorithm, Theorem 3 can be used to show the local super-linear convergence of the method when the BFGS updating rule is used to generate the matrices $H_k$. In order to do this, we need to show that hypotheses (H1)-(H4) are satisfied. This requires a slight modification to the BFGS updating scheme given at the beginning of Section 3.1.

BFGS Updating for the Chen-Fukushima Algorithm: Choose $0 < \hat{\epsilon} < 0.1\sigma$, where $\sigma$ is defined in the algorithm. For $k = 0$, choose $H_0 = \hat{H}_0 = I$. For $k \geq 1$, set $y^k = w^k - w^{k+1}$, $s^k = z^{k+1} - z^k$, and

$$\hat{H}_{k+1} = \hat{H}_k + \frac{(s^k - \hat{H}_k y^k)s^{kT} + s^k(s^k - \hat{H}_k y^k)^T}{\langle y^k, s^k\rangle} - \frac{\langle s^k - \hat{H}_k y^k, y^k\rangle\, s^k s^{kT}}{\langle y^k, s^k\rangle^2}$$

if $y^{kT} s^k > 0$; otherwise, set $\hat{H}_{k+1} = \hat{H}_k$. Set $H_{k+1} = \hat{H}_{k+1}$ if

$$\|(I - \hat{H}_{k+1})w^{k+1}\| \leq (\sigma - 2\hat{\epsilon})\|w^{k+1}\|;$$

otherwise set $H_{k+1} = I$.
Remark The updating strategy is somewhat peculiar in that the inverse Hessian approximations $\hat{H}_k$ are never restarted, even when they are not being used. This is an artifact of our convergence proof, which relies on Theorem 9 of Byrd and Nocedal.
Theorem 17 Suppose that the following conditions are satisfied:

(i) The operator $(\partial f)^{-1}$ satisfies the quadratic growth condition at the origin with $(\partial f)^{-1}(0) = \{\bar{z}\}$ and $\nabla(\partial f)^{-1}(0) = J$.

(ii) $\lambda > \hat{\epsilon}^{-1}\|J\|$.

(iii) In Step 1 of the algorithm, in addition to (24), the termination criterion (25) is also satisfied on each outer iteration with the parameters $\epsilon_k$ chosen to satisfy $\sum_{k=0}^{\infty}\epsilon_k < \infty$.

Then

(a) for all $k$ sufficiently large, $y^{kT} s^k > 0$,

(b) the sequences $\{\|H_k\|\}$ and $\{\|H_k^{-1}\|\}$ are bounded,

(c) $H_k = \hat{H}_k$ for all $k$ sufficiently large, and

(d) the sequence $\{z^k\}$ converges at a super-linear rate to $\bar{z}$, the unique solution to (P).
Proof Hypotheses (H1) and (H4) follow from hypotheses (i) and (ii), respectively. Hypothesis (H2) follows from (iii) and Proposition 14(ii). If we can show that (31) is satisfied for all $k$ large, then (H3) will follow from Theorem 15. To this end, note that the BFGS updating strategy for the Chen-Fukushima algorithm guarantees that
\[
\bigl\|(H_k-I)w^k\bigr\| \le \frac{\epsilon-\hat\epsilon}{2}\,\|w^k\|
\]
for all $k$, and, by hypothesis, $c$ is chosen so that $\frac{1}{c}\|J\| < \frac{\epsilon-\hat\epsilon}{2}$. Therefore,
\[
\bigl\|\bigl(H_k-(I+\tfrac{1}{c}J)\bigr)w^k\bigr\|
\le \bigl\|(H_k-I)w^k\bigr\| + \tfrac{1}{c}\|J\|\,\|w^k\|
\le (\epsilon-\hat\epsilon)\,\|w^k\|,
\]
so that $\gamma_k + (\epsilon-\hat\epsilon) < \epsilon - \frac{\hat\epsilon}{2}$ once $\gamma_k < \frac{\hat\epsilon}{2}$. Since $\gamma_k \to 0$, condition (31) is eventually satisfied with $\delta = \frac{\hat\epsilon}{2}$. Therefore, hypotheses (H1)-(H4) are satisfied. We now show that $H_k = \hat H_k$ for all $k$ large. Since (H1)-(H4) are satisfied, Lemma 8
and Theorem 9 imply that $\{\|\hat H_k\|\}$ and $\{\|\hat H_k^{-1}\|\}$ are bounded and
\[
\frac{\bigl\|\bigl(\hat H_k^{-1}-(I+\tfrac{1}{c}J)^{-1}\bigr)s^k\bigr\|}{\|s^k\|}\to 0,
\]
or equivalently,
\[
\frac{\bigl\|\bigl(I-(I+\tfrac{1}{c}J)\hat H_k^{-1}\bigr)s^k\bigr\|}{\|s^k\|}\to 0.
\]
Hence, the boundedness of $\{\hat H_k\}$ implies that
\[
\frac{\bigl\|\hat H_k s^k-s^k-\tfrac{1}{c}\hat H_k J\hat H_k^{-1}s^k\bigr\|}{\|s^k\|}
=\frac{\bigl\|\hat H_k\bigl(I-(I+\tfrac{1}{c}J)\hat H_k^{-1}\bigr)s^k\bigr\|}{\|s^k\|}\to 0.
\]
Therefore, there is a sequence $\beta_k\to 0$ such that
\[
\bigl\|\hat H_k s^k-s^k-\tfrac{1}{c}\hat H_k J\hat H_k^{-1}s^k\bigr\|\le\beta_k\,\|s^k\|,
\]
which in turn implies that
\[
\bigl\|(I-\hat H_k)s^k\bigr\|
\le\Bigl(\tfrac{1}{c}\bigl\|\hat H_k J\hat H_k^{-1}\bigr\|+\beta_k\Bigr)\|s^k\|
\le\frac{\epsilon-\hat\epsilon}{2}\,\|s^k\| \tag{35}
\]
for all $k$ sufficiently large, since $\beta_k\to 0$ and $\frac{1}{c}\|J\|<\frac{\epsilon-\hat\epsilon}{2}$. Now $H_k\ne\hat H_k$ implies that $s^k=-w^k$ and
\[
\bigl\|(I-\hat H_k)s^k\bigr\| > \frac{\epsilon-\hat\epsilon}{2}\,\|s^k\|.
\]
By (35) this cannot occur for $k$ sufficiently large; therefore, eventually $H_k=\hat H_k$. The super-linear convergence of the iterates now follows from Theorem 3.
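To make the mechanism behind conclusion (d) concrete, the script below runs a plain inverse-BFGS iteration with an exact line search on a small strongly convex quadratic. This is not the variable metric proximal point algorithm itself; it is only an illustration of the Dennis-More directional-convergence effect exploited above, and the matrix $A$, starting point, and iteration count are arbitrary choices.

```python
import numpy as np

# Inverse-BFGS iteration on f(z) = 0.5 z^T A z (minimizer z* = 0), with an
# exact line search, which is available in closed form for a quadratic.
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])   # symmetric positive definite
z = np.array([1.0, 1.0, 1.0])
H = np.eye(3)                     # inverse Hessian approximation
g = A @ z
errors = [np.linalg.norm(z)]
for _ in range(10):
    if np.linalg.norm(g) < 1e-14:
        break                     # already at the minimizer
    d = -H @ g
    t = (g @ H @ g) / (d @ A @ d)     # exact minimizer of f along d
    z_new = z + t * d
    g_new = A @ z_new
    s, y = z_new - z, g_new - g
    ys = y @ s                        # positive here since A is SPD
    if ys > 1e-14:
        r = s - H @ y                 # inverse-BFGS secant update
        H = (H + (np.outer(r, s) + np.outer(s, r)) / ys
               - (r @ y) * np.outer(s, s) / ys ** 2)
    z, g = z_new, g_new
    errors.append(np.linalg.norm(z))
# with H0 = I and exact line searches the iterates coincide with conjugate
# gradients, so on a quadratic the error collapses within n = 3 steps,
# rather than decreasing at a fixed linear rate
```

The successive error ratios `errors[k+1] / errors[k]` shrink toward zero (up to rounding), which is the numerical signature of the super-linear rate asserted in the theorem.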
References
[1] J.F. Bonnans, J.C. Gilbert, C. Lemaréchal, and C. Sagastizábal. A family of variable metric proximal methods. Mathematical Programming, 68:15-47, 1995.
[2] J.V. Burke and M. Qian. Application of a variable metric proximal point algorithm to convex programming. Preprint, Department of Mathematics, University of Washington, Seattle, WA, 1996.
[3] J.V. Burke and M. Qian. A variable metric proximal point algorithm for monotone operators. Preprint, Department of Mathematics, University of Washington, Seattle, WA, 1996.
[4] R.H. Byrd and J. Nocedal. A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numerical Analysis, 26:727-739, 1989.
[5] X. Chen and M. Fukushima. Proximal quasi-Newton methods for nondifferentiable convex optimization. Technical Report AMR 95/32, Department of Applied Mathematics, University of New South Wales, Sydney, New South Wales, Australia, 1995.
[6] M. Fukushima. A descent algorithm for nonsmooth convex minimization. Mathematical Programming, 30:163-175, 1984.
[7] M. Fukushima and L. Qi. A globally and superlinearly convergent algorithm for nonsmooth convex minimization. SIAM J. Optimization, 6:1106-1120, 1996.
[8] J.E. Dennis, Jr. and J.J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comp., 28:549-560, 1974.
[9] J.E. Dennis, Jr. and R.B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, NJ, 1983.
[10] C. Lemaréchal. Bundle methods in nonsmooth optimization. In C. Lemaréchal and R. Mifflin, editors, Nonsmooth Optimization. Pergamon Press, Oxford, 1978.
[11] C. Lemaréchal and C. Sagastizábal. An approach to variable metric bundle methods. In J. Henry and J.-P. Yvon, editors, IFIP Proceedings, Systems Modeling and Optimization, pages 144-162. Springer, Berlin, 1994.
[12] C. Lemaréchal and C. Sagastizábal. Variable metric bundle methods: from conceptual to implementable forms. Mathematical Programming, 76:393-410, 1997.
[13] R. Mifflin. A quasi-second-order proximal bundle algorithm. Mathematical Programming, 73:51-72, 1996.
[14] R. Mifflin, D. Sun, and L. Qi. Quasi-Newton bundle-type methods for nondifferentiable convex optimization. Technical Report AMR 96/21, Department of Applied Mathematics, University of New South Wales, Sydney, New South Wales, Australia, 1996.
[15] J.J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273-299, 1965.
[16] J.-S. Pang and L. Qi. Nonsmooth equations: motivation and algorithms. SIAM J. Optimization, 3:443-465, 1993.
[17] L. Qi and X. Chen. A preconditioning proximal Newton method for nondifferentiable convex optimization. Mathematical Programming, 76:411-430, 1997.
[18] M. Qian. The Variable Metric Proximal Point Algorithm: Theory and Application. Ph.D. thesis, University of Washington, Seattle, WA, 1992.
[19] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[20] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control and Optimization, 14:877-898, 1976.