SIAM J. CONTROL AND OPTIMIZATION, Vol. 20, No. 2, March 1982
© 1982 Society for Industrial and Applied Mathematics
0363-0129/82/2002-0006 $01.00/0
PROJECTED NEWTON METHODS FOR OPTIMIZATION PROBLEMS WITH SIMPLE CONSTRAINTS*

DIMITRI P. BERTSEKAS†
Abstract. We consider the problem min {f(x) | x ≥ 0} and propose algorithms of the form x_{k+1} = [x_k − α_k D_k ∇f(x_k)]^+, where [·]^+ denotes projection on the positive orthant, α_k is a stepsize chosen by an Armijo-like rule, and D_k is a positive definite symmetric matrix which is partly diagonal. We show that D_k can be calculated simply on the basis of second derivatives of f, so that the resulting Newton-like algorithm has a typically superlinear rate of convergence. With other choices of D_k, convergence at a typically linear rate is obtained. The algorithms are almost as simple as their unconstrained counterparts. They are well suited for problems of large dimension, such as those arising in optimal control, while being competitive with existing methods for low-dimensional problems. The effectiveness of the Newton-like algorithm is demonstrated via computational examples involving as many as 10,000 variables. Extensions to general linearly constrained problems are also provided. These extensions utilize a notion of an active generalized rectangle patterned after the notion of an active manifold used in manifold suboptimization methods. By contrast with these methods, many constraints can be added or subtracted from the binding set at each iteration without the need to solve a quadratic programming problem.
1. Introduction. We consider the problem

(1)  minimize f(x) subject to x ≥ 0,

where f : R^n → R is a continuously differentiable function, and the vector inequality x ≥ 0 is meant to be componentwise (i.e., for x = (x^1, x^2, …, x^n) ∈ R^n, we write x ≥ 0 if x^i ≥ 0 for all i = 1, …, n). This type of problem arises very often in applications; for example, when f is a dual functional relative to an original inequality constrained primal problem and x represents a vector of nonnegative Lagrange multipliers corresponding to the inequality constraints, and when f represents an augmented Lagrangian or exact penalty function taking into account other possibly nonlinear equality and inequality constraints. The analysis and algorithms that follow apply also, with minor modifications, to problems with rectangle constraints such as

(2)  minimize f(x) subject to b_1 ≤ x ≤ b_2,

where b_1 and b_2 are given vectors. Problems (1) and (2) are referred to as simply constrained problems, and their algorithmic solution is the primary subject of this paper. In view of the simplicity of the constraints, one would expect that solution of problem (1) is almost as easy as unconstrained minimization of f. This expectation is partly justified in that the first order necessary condition for a vector x̄ = (x̄^1, …, x̄^n) to be a local minimum of problem (1) takes the simple form

(3)  ∂f(x̄)/∂x^i = 0 if x̄^i > 0,  ∂f(x̄)/∂x^i ≥ 0 if x̄^i = 0,  i = 1, …, n.
* Received by the editors August 26, 1980, and in final revised form April 22, 1981.
† Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139. This work was supported in part by the National Science Foundation under grant ENG-79-06332 and in part by Alphatech Inc., through Department of Energy contract DE-AC01-79ET29243.
Furthermore the direct analog of the method of steepest descent takes the simple form

(4)  x_{k+1} = [x_k − α_k ∇f(x_k)]^+,

where α_k is a positive scalar stepsize and, for any vector z = (z^1, …, z^n) ∈ R^n, we denote

(5)  [z]^+ = (max{0, z^1}, …, max{0, z^n}).

The stepsize α_k may be chosen in a number of ways. In the original proposals of Goldstein [1] and Levitin and Poljak [2], α_k is taken to be a constant ᾱ (i.e., α_k = ᾱ for all k), and a convergence result is shown under the assumption that ᾱ is sufficiently small and ∇f is Lipschitz continuous. In general a proper value for ᾱ can be found only through experimentation. An alternative suggested by McCormick [3] is to choose α_k by function minimization along the arc of points x_k(α), α ≥ 0, where

x_k(α) = [x_k − α∇f(x_k)]^+.

Thus α_k is chosen so that

(6)  f[x_k(α_k)] = min_{α ≥ 0} f[x_k(α)].
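For concreteness, the projection (5) and the arc x_k(α) can be sketched in a few lines (a hypothetical NumPy illustration; the function names are ours, not the paper's):

```python
import numpy as np

def project(z):
    """[z]^+ of (5): componentwise max{0, z^i} (projection on the positive orthant)."""
    return np.maximum(z, 0.0)

def arc_point(x, grad, alpha):
    """x_k(alpha) = [x_k - alpha * grad f(x_k)]^+, the arc searched over in (6)."""
    return project(x - alpha * grad)
```

Note that the arc is piecewise linear in α, which is why f[x_k(α)] need not be smooth even when f is.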
Unfortunately the minimization above is very difficult to carry out, particularly for problems of large dimension, since f[x_k(α)] need not be differentiable, convex, or unimodal as a function of α even if f is convex. For most problems we prefer the Armijo-like stepsize rule, first proposed in Bertsekas [4], whereby α_k is given by

(7a)  α_k = β^{m_k} s,

where m_k is the first nonnegative integer m satisfying

(7b)  f(x_k) − f[x_k(β^m s)] ≥ σ ∇f(x_k)'[x_k − x_k(β^m s)].

Here the scalars s, β and σ are fixed and are chosen so that s > 0, β ∈ (0, 1) and σ ∈ (0, 1/2). In addition to being easily implementable and convergent, the algorithm (4), (7) has the advantage that, when it converges to a local minimum x* satisfying the standard second order sufficiency conditions for optimality (including strict complementarity), it identifies the binding constraints at x* in a finite number of iterations, in the sense that there exists k̄ such that

(8)  B(x*) = B(x_k)  ∀ k ≥ k̄,

where, for every x ≥ 0, B(x) denotes the set of indices of binding constraints at x,

(9)  B(x) = {i | x^i = 0}.
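The rule (7) can be sketched as follows (illustrative only; the parameter defaults and the function name are ours):

```python
import numpy as np

def armijo_arc_step(f, grad_f, x, s=1.0, beta=0.5, sigma=0.1, max_m=50):
    """Armijo-like rule (7): alpha_k = beta**m_k * s, with m_k the first m such that
    f(x_k) - f[x_k(beta**m s)] >= sigma * grad f(x_k)' (x_k - x_k(beta**m s))."""
    g = grad_f(x)
    fx = f(x)
    for m in range(max_m):
        alpha = beta**m * s
        x_new = np.maximum(x - alpha * g, 0.0)   # point on the projection arc
        if fx - f(x_new) >= sigma * g.dot(x - x_new):
            return alpha, x_new
    return alpha, x_new   # give up after max_m reductions (a safeguard, not part of the rule)
```

For f(x) = ½|x|² started at x = (1, 1), the unit trial stepsize already passes the test and the iterate projects straight onto the minimum at the origin.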
Minor modifications of the proofs given in [4] show that the results stated above hold also for the algorithm

(10)  x_{k+1} = x_k(α_k),  x_k(α) = [x_k − αD_k ∇f(x_k)]^+,

where D_k is a diagonal positive definite matrix, and α_k is chosen by (7), where now x_k(α) is given by the expression above.
For this it is necessary to assume that the diagonal elements d_k^i, i = 1, …, n of the matrices D_k satisfy

(11)  d ≤ d_k^i ≤ d̄  ∀ i = 1, …, n, k = 0, 1, …,

where d and d̄ are some positive scalars. While it is often possible to achieve substantial computational savings by proper diagonal scaling of ∇f as in (10), the resulting algorithm is typically characterized by a linear convergence rate [4], [22]. Any attempt to construct a superlinearly convergent algorithm must by necessity involve a nondiagonal scaling matrix D_k which is an adequate approximation of the inverse Hessian ∇²f(x_k)^{-1}, at least along a suitable subspace. At this point we find that the algorithms available at present are far more complicated than their unconstrained counterparts, particularly when the problem has large dimension. Thus the most straightforward extension of Newton's method is given by

(12)  x_{k+1} = x_k + α_k(x̄_k − x_k),

where x̄_k is a solution of the quadratic program

(13)  minimize ∇f(x_k)'(x − x_k) + ½(x − x_k)'∇²f(x_k)(x − x_k)
      subject to x ≥ 0,

and α_k is a stepsize parameter. There are convergence and superlinear rate of convergence results in the literature regarding this type of method (Levitin and Poljak [2], Dunn [5]) and its quasi-Newton versions (Garcia-Palomares and Mangasarian [6]); however, its effectiveness is strongly dependent upon the computational requirements of solving the quadratic program (13). For problems of small dimension, problem (13) can be solved rather quickly by standard pivoting or manifold suboptimization methods, but for large-dimensional problems the solution of the quadratic program (13) by standard methods can be very time consuming. Indeed there are large-scale quadratic programming problems arising in optimal control, the solution of which by pivoting methods is unthinkable. In any case the facility or lack thereof of solving the quadratic program (13) must be accounted for when comparing method (12) against other alternatives. Another possible approach for constructing superlinearly convergent algorithms for solving problem (1) stems from the original gradient projection proposal of Rosen [7] and is based on manifold suboptimization and active set strategies as in Gill and Murray [8], Goldfarb [9], Luenberger [10] and other sources (see Lenard [11] for an up-to-date performance evaluation of various alternatives). Methods of this type are quite efficient for problems of relatively small dimension, but are typically unattractive for large-scale problems with a large number of constraints binding at a solution. The main reason is that typically at most one constraint can be added to the active set at each iteration, so if, for example, 1,000 constraints are binding at the point of convergence and an interior starting point is selected, then the method will require at least 1,000 iterations (and possibly many more) to converge.
While several authors [8], [lo] have alluded to the possibility of bending the direction of search along the constraint boundary, the only specific proposal known to the author that has been made in the context of the manifold suboptimization approach is the one of McCormick [12] and it does not seem particularly attractive for large-scale problems. (The quasi-Newton methods proposed by Brayton and Cullum [13] incorporate bending but simultaneously require the solution of quadratic programming subproblems.)
Manifold suboptimization methods also require additional computational overhead in deciding which constraint to drop from the currently active set. For the apparently most successful strategies (Lenard [11]), which attempt to drop as many constraints as possible, this overhead can be significant and must be taken into account when comparing the manifold suboptimization approach with other alternatives. The algorithms proposed in this paper attempt to combine the basic simplicity of the steepest descent iteration (4), (7) with the sophistication and fast convergence of the constrained Newton's method (12), (13). They do not involve solution of a quadratic program, thereby avoiding the associated computational overhead, and there is no bound to the number of constraints that can be added to the currently active set, thereby bypassing a serious inherent limitation of manifold suboptimization methods. The basic form of the method is

(14)  x_{k+1} = x_k(α_k),

where

(15)  x_k(α) = [x_k − αD_k ∇f(x_k)]^+,

D_k is a positive definite symmetric matrix which is partly diagonal, and α_k is a stepsize determined by an Armijo-like rule similar to (7) that will be described later. The convergence and rate of convergence properties of this method are discussed in § 2. A key property of the method is that under mild assumptions it identifies the manifold of binding constraints at a solution in a finite number of iterations, in the sense of (8). This means that eventually the method is reduced to an unconstrained method on this manifold and brings to bear the extensive methodology and analysis relating to unconstrained minimization algorithms. In § 3 we discuss how the method (14), (15) can form the basis for constructing algorithms for general linearly constrained problems of the form

(16)  minimize f(x) subject to b_1 ≤ Ax ≤ b_2.

The main idea here is to view problem (16) locally as a simply constrained problem via a transformation of variables. For example, if the matrix A is square and invertible, problem (16) is equivalent to the problem

minimize h(y) ≜ f(A^{-1}y) subject to b_1 ≤ y ≤ b_2,

via the transformation y = Ax. A similar approach based on an active set strategy is employed when A is not square and invertible. The ideas are similar to those involved in manifold suboptimization methods, where a linear manifold is selected as a "local universe" for the purposes of the current iteration. In our algorithms we take a suitably chosen rectangle (i.e., a set described by upper and lower bounds on the variables) as a "local universe" instead of a manifold. Finally, in § 4 we provide results of computational experiments with large scale optimal control problems, some of which involve several thousand variables. Throughout the paper we emphasize Newton-like methods as prototypes for broad classes of superlinearly converging algorithms that fit the framework of the
paper. We often make positive definiteness assumptions on the Hessian matrix of f in order to avoid getting bogged down in technical details relating to modifications of Newton's method, such as those employed in unconstrained minimization [14]–[16] to account for the possibility that ∇²f is not positive definite. Quasi-Newton, approximate Newton and conjugate gradient versions of the Newton-like methods presented are possible, but the discussion of specific implementations is beyond the scope of the paper. More generally, it may be said that the nature of the algorithms proposed is such that almost every useful idea from unconstrained minimization can be fruitfully adapted within the constrained minimization framework considered here; however, the precise details of how this should be done may involve considerable further research and experimentation. The notation employed throughout the paper is as follows. All vectors are considered to be column vectors. A prime denotes transposition. The standard norm in R^n is denoted by |·|; i.e., for x = (x^1, …, x^n) we write |x| = [Σ_{i=1}^n (x^i)²]^{1/2}. The gradient and Hessian of a function f : R^n → R are denoted by ∇f and ∇²f respectively.
2. Algorithms for minimization subject to simple constraints. We consider first the problem min {f(x) | x ≥ 0} of (1). Any vector x̄ ≥ 0 satisfying the first order necessary condition (3) will be referred to as a critical point with respect to problem (1). We focus attention on iterations of the form

x_{k+1} = [x_k − α_k D_k ∇f(x_k)]^+,

where D_k is a positive definite symmetric matrix and α_k is chosen by search along the arc of points

x_k(α) = [x_k − αD_k ∇f(x_k)]^+,  α ≥ 0.
It is easy to construct examples (see Fig. 1) where an arbitrary positive definite choice of the matrix D_k leads to situations where it is impossible to reduce the value of the objective by suitable choice of the stepsize α (i.e., f[x_k(α)] ≥ f(x_k) for all α ≥ 0). The
following proposition identifies a class of matrices D_k for which an objective function reduction is possible. Define for all x ≥ 0,

(17)  I^+(x) = {i | x^i = 0, ∂f(x)/∂x^i > 0}.

We say that a symmetric n × n matrix D with elements d^{ij} is diagonal with respect to a subset of indices I ⊂ {1, 2, …, n} if

(18)  d^{ij} = 0  ∀ i ∈ I, j = 1, …, n, j ≠ i.
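In code, the index set and the structural condition on D read as follows (a sketch; the helper names are ours):

```python
import numpy as np

def binding_ascent_set(x, grad):
    """I^+(x): indices with x^i = 0 whose partial derivative of f is positive."""
    return {i for i in range(len(x)) if x[i] == 0.0 and grad[i] > 0.0}

def is_diagonal_wrt(D, I):
    """D is 'diagonal with respect to I' if every off-diagonal entry in a row
    or column indexed by I vanishes."""
    D = np.asarray(D)
    n = D.shape[0]
    return all(D[i, j] == 0.0 and D[j, i] == 0.0
               for i in I for j in range(n) if j != i)
```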
PROPOSITION 1. Let x ≥ 0 and let D be a positive definite symmetric matrix which is diagonal with respect to I^+(x), and denote

(19)  x(α) = [x − αD∇f(x)]^+,  α ≥ 0.

(a) The vector x is a critical point with respect to problem (1) if and only if x = x(α) for all α ≥ 0.
(b) If x is not a critical point with respect to problem (1), there exists a scalar ᾱ > 0 such that

(20)  f[x(α)] < f(x)  ∀ α ∈ (0, ᾱ].
Proof. Assume without loss of generality that for some integer r we have

I^+(x) = {r + 1, …, n}.

Then D has the form

(21)  D = ( D̄, 0; 0, diag(d^{r+1}, …, d^n) ),

where D̄ is positive definite and d^i > 0, i = r + 1, …, n. Denote

(22)  p = D∇f(x).

(a) Assume x is a critical point. Then using (3) and (17),

∂f(x)/∂x^i = 0  ∀ i = 1, …, r,   ∂f(x)/∂x^i > 0  ∀ i = r + 1, …, n.

These relations and the positivity of d^i, i = r + 1, …, n, imply that

p^i = 0  ∀ i = 1, …, r,   p^i > 0  ∀ i = r + 1, …, n.

Since x^i(α) = [x^i − αp^i]^+ and x^i = 0 for i = r + 1, …, n, it follows that x^i(α) = x^i for all i and α ≥ 0. Conversely, assume that x = x(α) for all α ≥ 0. Then we must have

p^i = 0 if x^i > 0,   p^i ≥ 0 if x^i = 0,   i = 1, …, n.
Now by definition of I^+(x) we have that if x^i = 0 and i ∉ I^+(x), then ∂f(x)/∂x^i ≤ 0. This together with the relations above implies

Σ_{i=1}^r p^i ∂f(x)/∂x^i ≤ 0,

while, since by (21), (22), (p^1, …, p^r)' = D̄(∂f(x)/∂x^1, …, ∂f(x)/∂x^r)' and D̄ is positive definite, we have Σ_{i=1}^r p^i ∂f(x)/∂x^i ≥ 0, and it follows that

p^i = ∂f(x)/∂x^i = 0  ∀ i = 1, …, r.

Since for i = r + 1, …, n, ∂f(x)/∂x^i > 0 and x^i = 0, we obtain that x is a critical point.

(b) For i = r + 1, …, n we have ∂f(x)/∂x^i > 0, x^i = 0 and, from (21), (22), p^i > 0. Since x^i(α) = [x^i − αp^i]^+, we obtain

(23)  x^i = x^i(α) = 0  ∀ α ≥ 0, i = r + 1, …, n.

Consider the sets of indices

(24)  I_1 = {i | x^i > 0, or x^i = 0 and p^i < 0},
(25)  I_2 = {i | x^i = 0 and p^i ≥ 0}.

For all α sufficiently small we have x^i(α) = x^i − αp^i for i ∈ I_1 and x^i(α) = x^i for i ∈ I_2, so for such α the arc x(α) moves from x along the direction −p̄, where p̄^i = p^i for i ∈ I_1 and p̄^i = 0 for i ∈ I_2. Since x is not a critical point, ∇f(x)'p̄ > 0 yields that −p̄ is a feasible descent direction at x, and there exists a scalar ᾱ > 0 for which the desired relation (20) is satisfied. Q.E.D.

Based on Proposition 1, we are led to the conclusion that the matrix D_k in the iteration

x_{k+1} = [x_k − α_k D_k ∇f(x_k)]^+
should be chosen diagonal with respect to a subset of indices that contains

I^+(x_k) = {i | x_k^i = 0, ∂f(x_k)/∂x^i > 0}.

Unfortunately, the set I^+(x_k) exhibits an undesirable discontinuity at the boundary of the constraint set, whereby, given a sequence {x_k} of interior points that converges to a boundary point x̄, the set I^+(x_k) may be strictly smaller than the set I^+(x̄). This causes difficulties in proving convergence of the algorithm and may have an adverse effect on its rate of convergence. (This phenomenon is quite common in feasible direction algorithms and is referred to as zigzagging or jamming.) For this reason we will employ certain enlargements of the sets I^+(x_k) with the aim of bypassing these difficulties. The algorithm that we describe utilizes a scalar ε > 0 (typically small), a fixed¹ diagonal positive definite matrix M (for example the identity), and two parameters β ∈ (0, 1) and σ ∈ (0, 1/2) that will be used in connection with an Armijo-like stepsize rule. An initial vector x_0 ≥ 0 is chosen, and at the kth iteration of the algorithm we have a vector x_k ≥ 0. Denote

(32)  w_k = |x_k − [x_k − M∇f(x_k)]^+|,   ε_k = min{ε, w_k}.

(k + 1)st iteration of the algorithm. We select a positive definite symmetric matrix D_k which is diagonal with respect to the set I_k^ε given by

(33)  I_k^ε = {i | 0 ≤ x_k^i ≤ ε_k, ∂f(x_k)/∂x^i > 0}.

Denote

(34)  p_k = D_k ∇f(x_k).

Then x_{k+1} is given by
¹ Actually the results that follow can be shown also for the case where M is changed from one iteration to the next in a way that its diagonal elements are bounded above and away from zero.
(35)  x_{k+1} = x_k(α_k),

where

(36)  x_k(α) = [x_k − αp_k]^+,   α_k = β^{m_k},

and m_k is the first nonnegative integer m such that²

(37)  f(x_k) − f[x_k(β^m)] ≥ σ [ β^m Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i + Σ_{i∈I_k^ε} (∂f(x_k)/∂x^i)(x_k^i − x_k^i(β^m)) ].

The stepsize rule (36), (37) (see Fig. 2) may be viewed as a combination of the Armijo-like rule (7) and the Armijo rule usually employed in unconstrained minimization (see, e.g., Polak [18]). When I_k^ε is empty, the right-hand side of (37) becomes σβ^m ∇f(x_k)'p_k and is identical to the corresponding expression of the Armijo
[Fig. 2. Illustration of the stepsize rule (36), (37), showing the unsuccessful trial stepsizes preceding the accepted one.]
rule in unconstrained optimization, while if I_k^ε = {1, 2, …, n}, then inequality (37) is identical with (7). Note that for all k we have

I_k^ε ⊃ I^+(x_k),  k = 0, 1, …,

so the matrix D_k is diagonal with respect to I^+(x_k). It is possible to show that for all m ≥ 0 the right-hand side of (37) is nonnegative, and that it is positive if and only if x_k is not a critical point. Indeed, since D_k is positive definite and diagonal with respect to I_k^ε, we have

(38)  Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i ≥ 0,

while for all i ∈ I_k^ε, in view of the fact that ∂f(x_k)/∂x^i > 0, we have p_k^i > 0 and hence

(39)  x_k^i − x_k^i(α) ≥ 0  ∀ α ≥ 0, i ∈ I_k^ε, k = 0, 1, ….
This shows that the right side of (37) is nonnegative. If x_k is not critical, then it is easily seen (compare also with the proof of Proposition 1(b)) that one of the inequalities

² The results that follow can also be proved if Σ_{i∉I_k^ε}(∂f(x_k)/∂x^i)p_k^i is replaced in (37) by γ_k Σ_{i∉I_k^ε}(∂f(x_k)/∂x^i)p_k^i, where γ_k = min{1, ᾱ_k} and ᾱ_k = sup{α | x_k^i − αp_k^i ≥ 0, ∀ i ∉ I_k^ε}. This modification makes (37) easier to satisfy.
(38) or (39) is strict for α > 0, so the right side of (37) is positive for all m ≥ 0. A slight modification of the proof of Proposition 1(b) also shows that if x_k is not a critical point, then (37) will be satisfied for all m sufficiently large, so the stepsize α_k is well defined and can be determined via a finite number of arithmetic operations. If x_k is a critical point then, by Proposition 1(a), we have x_k = x_k(α) for all α ≥ 0. Furthermore, the argument given in the proof of Proposition 1(a) shows that

p_k^i = 0  ∀ i ∉ I_k^ε,   x_k^i = x_k^i(α) = 0  ∀ α ≥ 0, i ∈ I_k^ε,

so both terms in the right side of (37) are zero. Since also x_k = x_k(α) for all α ≥ 0, it follows that (37) is satisfied for m = 0, thereby implying that

x_{k+1} = x_k(1) = x_k  if x_k is critical.
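One iteration of the method just described might be sketched as follows (a hypothetical NumPy illustration with M = I and a Newton-style choice of D_k of the kind used later in Proposition 4; the names and parameter defaults are ours, and f is assumed strictly convex so that the free Hessian block is invertible):

```python
import numpy as np

def projected_newton_step(f, grad_f, hess_f, x, eps=0.1, beta=0.5, sigma=0.1, max_m=30):
    """One iteration: indices in the enlarged set I_k^eps get a diagonal (unit)
    scaling, the remaining block uses a Newton direction, and the stepsize is
    chosen by the Armijo-like test (37)."""
    g = grad_f(x)
    w = np.linalg.norm(x - np.maximum(x - g, 0.0))   # w_k with M = I
    eps_k = min(eps, w)
    I = np.array([x[i] <= eps_k and g[i] > 0.0 for i in range(len(x))])
    free = ~I
    p = np.empty_like(x)
    p[I] = g[I]                                      # diagonal (unit) scaling on I_k^eps
    H = hess_f(x)[np.ix_(free, free)]                # Hessian block on the free indices
    p[free] = np.linalg.solve(H, g[free])            # Newton direction on the free block
    fx = f(x)
    for m in range(max_m):
        a = beta**m
        x_new = np.maximum(x - a * p, 0.0)           # point on the projection arc
        rhs = sigma * (a * g[free].dot(p[free]) + g[I].dot(x[I] - x_new[I]))
        if fx - f(x_new) >= rhs:                     # Armijo-like test (37)
            return x_new
    return x_new
```

On the quadratic f(x) = ½[(x¹ − 1)² + (x² + 2)²], whose constrained minimum is (1, 0), a single step from (0.5, 0.5) with unit stepsize lands exactly on the solution, consistent with the finite termination noted after Proposition 4.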
In conclusion, the algorithm is well defined, decreases the value of the objective function at each iteration k for which x_k is not a critical point, and essentially terminates if x_k is critical. We proceed to analyze its convergence and rate of convergence properties. To this end we will make use of the following two assumptions:

(A) The gradient ∇f is Lipschitz continuous on each bounded set of R^n; i.e., given any bounded set S ⊂ R^n, there exists a scalar L (depending on S) such that

(40)  |∇f(x) − ∇f(y)| ≤ L|x − y|  ∀ x, y ∈ S.
(B) There exist positive scalars λ_1, λ_2 and nonnegative integers q_1, q_2 such that

(41)  λ_1 w_k^{q_1} |z|² ≤ z'D_k z ≤ λ_2 w_k^{−q_2} |z|²  ∀ z ∈ R^n, k = 0, 1, …,

where

w_k = |x_k − [x_k − M∇f(x_k)]^+|.
Assumption (A) is not essential for the result of Proposition 2 that follows, but simplifies its proof. It is satisfied for just about every problem likely to appear in practice. For example, it is satisfied when f is twice differentiable, as well as when f is an augmented Lagrangian of the type used for inequality constrained problems involving twice differentiable functions. Assumption (B) is a condition of the type commonly utilized in connection with unconstrained minimization algorithms. When q_1 = q_2 = 0, relation (41) takes the form

(42)  λ_1|z|² ≤ z'D_k z ≤ λ_2|z|²  ∀ z ∈ R^n, k = 0, 1, …,

and simply says that the eigenvalues of D_k are uniformly bounded above and away from zero.

PROPOSITION 2. Under Assumptions (A) and (B) above, every limit point of a sequence {x_k} generated by iteration (35) is a critical point with respect to problem (1).

Proof. Assume the contrary, i.e., that there exists a subsequence {x_k}_K converging to a vector x̄ which is not critical. Since {f(x_k)} is decreasing and f is continuous, it follows that {f(x_k)} converges to f(x̄), and therefore

lim_{k→∞} {f(x_k) − f(x_{k+1})} = 0.
Since each of the sums in the right-hand side of (37) is nonnegative (cf. (38), (39)), we must have
Also, since x̄ is not critical and M is diagonal, we have |x̄ − [x̄ − M∇f(x̄)]^+| ≠ 0, so (41) implies that the eigenvalues of {D_k}_K are uniformly bounded above and away from zero. In view of the fact that D_k is diagonal with respect to I_k^ε, it follows that there exist positive scalars λ̄_1, λ̄_2 such that for all k ∈ K that are sufficiently large

λ̄_1 ∂f(x_k)/∂x^i ≤ p_k^i ≤ λ̄_2 ∂f(x_k)/∂x^i  ∀ i ∈ I_k^ε,

so (49) cannot hold. Therefore, from (48),

x̄^i > 0  and  ∂f(x̄)/∂x^i > 0.

Since we have [cf. (39)], for all k ∈ K for which i ∈ I_k^ε,

0 ≤ x_k^i − x_k^i(α_k) ≤ x_k^i,

it follows from (44) and (50) that

lim_{k→∞, k∈K} [x_k^i − x_k^i(α_k)] = 0.

Using the above relation, (45) and (50), we obtain (47). We will complete the proof by showing that {α_k}_K is bounded away from zero, thereby contradicting (47). Indeed, in view of (46), the subsequences {x_k}_K, {p_k}_K and {x_k(α)}_K, α ∈ [0, 1], are uniformly bounded, so by Assumption (A) there exists a scalar L > 0 such that for all t ∈ [0, 1], α ∈ [0, 1] and k ∈ K we have

(51)  |∇f[x_k + t(x_k(α) − x_k)] − ∇f(x_k)| ≤ Lt|x_k(α) − x_k|.

We have for all k ∈ K and α ∈ [0, 1]

(52)  f[x_k(α)] = f(x_k) + ∇f(x_k)'[x_k(α) − x_k] + ∫₀¹ {∇f[x_k + t(x_k(α) − x_k)] − ∇f(x_k)}'[x_k(α) − x_k] dt,
and finally, by using (51),

(53)  f[x_k(α)] ≤ f(x_k) + ∇f(x_k)'[x_k(α) − x_k] + (L/2)|x_k(α) − x_k|².

For i ∈ I_k^ε we have x_k^i(α) = [x_k^i − αp_k^i]^+ ≥ x_k^i − αp_k^i and p_k^i > 0, so 0 ≤ x_k^i − x_k^i(α) ≤ αp_k^i. It follows using (45) that

Σ_{i∈I_k^ε} (∂f(x_k)/∂x^i)[x_k^i − x_k^i(α)] ≥ 0.

Consider the sets

I_{1,k} = {i ∉ I_k^ε | ∂f(x_k)/∂x^i > 0},   I_{2,k} = {i ∉ I_k^ε | ∂f(x_k)/∂x^i ≤ 0}.

For all i ∈ I_{1,k} we must have x_k^i > ε_k, for otherwise we would have i ∈ I_k^ε. Since |x̄ − [x̄ − M∇f(x̄)]^+| ≠ 0, we must have lim inf_{k→∞, k∈K} ε_k > 0 and ε_k > 0 for all k. Let ε̄ > 0 be such that ε̄ ≤ ε_k for all k ∈ K, and let B be such that |p_k^i| ≤ B for all i and k ∈ K. Then for all α ∈ [0, ε̄/B] we have x_k^i(α) = x_k^i − αp_k^i for all i ∈ I_{1,k}, so it follows that

(54)  Σ_{i∈I_{1,k}} (∂f(x_k)/∂x^i)[x_k^i − x_k^i(α)] = α Σ_{i∈I_{1,k}} (∂f(x_k)/∂x^i) p_k^i.

Also, for all α ≥ 0 we have x_k^i − x_k^i(α) ≤ αp_k^i, and since ∂f(x_k)/∂x^i ≤ 0 for all i ∈ I_{2,k}, we obtain

(55)  Σ_{i∈I_{2,k}} (∂f(x_k)/∂x^i)[x_k^i − x_k^i(α)] ≥ α Σ_{i∈I_{2,k}} (∂f(x_k)/∂x^i) p_k^i.

Combining (54) and (55), we obtain

(56)  Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i)[x_k^i − x_k^i(α)] ≥ α Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i  ∀ α ∈ [0, ε̄/B].

For all α ≥ 0 we also have |x_k^i − x_k^i(α)| ≤ α|p_k^i|, i = 1, …, n. Furthermore, it is easily seen using Assumption (B) that there exists λ > 0 such that

(57)  |p_k|² ≤ λ ∇f(x_k)'p_k.

Combining the last two relations, we obtain for all α ≥ 0

|x_k(α) − x_k|² ≤ α²|p_k|² ≤ α²λ ∇f(x_k)'p_k.
We now combine (52), (53), (56) and (57) to obtain, for all α ∈ [0, ε̄/B] and k ∈ K,

(58)  f(x_k) − f[x_k(α)] ≥ α Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i + Σ_{i∈I_k^ε} (∂f(x_k)/∂x^i)[x_k^i − x_k^i(α)] − (λL/2)α² ∇f(x_k)'p_k.

Suppose α is chosen so that

(59)  (1 − σ)[ α Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i + Σ_{i∈I_k^ε} (∂f(x_k)/∂x^i)(x_k^i − x_k^i(α)) ] ≥ (λL/2)α² ∇f(x_k)'p_k,

or, equivalently, so that

(60)  0 < α ≤ ᾱ,

where, by the preceding bounds, ᾱ > 0 can be chosen independently of k ∈ K. Then we have from (58), (59), for all k ∈ K,

f(x_k) − f[x_k(α)] ≥ σ[ α Σ_{i∉I_k^ε} (∂f(x_k)/∂x^i) p_k^i + Σ_{i∈I_k^ε} (∂f(x_k)/∂x^i)(x_k^i − x_k^i(α)) ].

This means that if (60) is satisfied with β^m = α, then the inequality (37) of the Armijo-like rule will be satisfied. It follows from the way the stepsize is reduced that α_k satisfies

α_k ≥ min{1, βᾱ}  ∀ k ∈ K.

This contradicts (47) and proves the proposition. Q.E.D.

It is interesting to note that the argument of the last part of the proof above shows that if the level set {x | f(x) ≤ f(x_0), x ≥ 0} is bounded, then there exists a scalar ᾱ > 0 such that, for every α ∈ (0, ᾱ], the constant stepsize algorithm x_{k+1} = x_k(α) generates sequences {x_k} the limit points of which are critical points with respect to problem (1). We now focus attention on a local minimum x* satisfying the following second order sufficiency conditions. For all x ≥ 0 we denote by B(x) the set of indices of binding constraints at x, i.e.,

B(x) = {i | x^i = 0}.
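The constant-stepsize variant just noted is a one-line loop; a sketch (ᾱ is not computed here, we simply pass a small α, and the function name is ours):

```python
import numpy as np

def constant_step_gradient_projection(grad_f, x0, a, iters):
    """Constant-stepsize algorithm x_{k+1} = x_k(a) = [x_k - a grad f(x_k)]^+;
    for a in (0, a_bar], every limit point is a critical point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = np.maximum(x - a * grad_f(x), 0.0)
    return x
```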
(C) The local minimum x* of problem (1) is such that, for some δ > 0, f is twice continuously differentiable in the open sphere {x | |x − x*| < δ}, and ∇²f(x*) is positive definite on the subspace T = {z | z^i = 0, i ∈ B(x*)}.

∂f(x_k)/∂x^i > 0, i = r_k + 1, …, n, and D_k can be an arbitrary positive definite matrix. Suppose we choose D_k to be the inverse of the Hessian of f with respect to the indices i = 1, …, r_k; i.e., the elements [D_k^{-1}]^{ij} are

[D_k^{-1}]^{ij} = ∂²f(x_k)/∂x^i∂x^j,  i, j = 1, …, r_k.
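Forming such a partially diagonal Hessian amounts to zeroing the off-diagonal entries in the rows and columns of the remaining indices; a sketch (helper name ours):

```python
import numpy as np

def partially_diagonal_hessian(H, I):
    """Keep diagonal entries everywhere, zero the off-diagonal entries in every
    row/column whose index lies in the given set I, so the result is diagonal
    with respect to I while retaining the full block on the other indices."""
    Hk = np.array(H, dtype=float)
    for i in I:
        d = Hk[i, i]
        Hk[i, :] = 0.0
        Hk[:, i] = 0.0
        Hk[i, i] = d
    return Hk
```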
By Assumption (C), ∇²f(x*) is positive definite on T, so it follows from (76) that this choice is well defined and satisfies the assumption of Proposition 3 for k sufficiently large. Since the conclusion of this proposition asserts that the method eventually reduces to Newton's method restricted to the subspace T, a superlinear convergence rate result follows. This type of argument can be used to construct a number of Newton-like and quasi-Newton methods and prove corresponding convergence and rate of convergence results. We state one of the simplest such results, regarding a Newton-like algorithm which is well suited for problems where f is strictly convex and twice differentiable. Its proof follows simply from the preceding discussion and standard results on the unconstrained form of Newton's method, so it is left to the reader.

PROPOSITION 4. Let f be convex and twice continuously differentiable. Assume that problem (1) has a unique optimal solution x* satisfying Assumption (C), and that there exist positive scalars m_1, m_2 such that

m_1|z|² ≤ z'∇²f(x)z ≤ m_2|z|²  ∀ z ∈ R^n, x ≥ 0.

Assume also that in the algorithm (32)–(37) the matrix D_k is given by D_k = H_k^{-1}, where H_k is the matrix with elements H_k^{ij} given by

H_k^{ij} = ∂²f(x_k)/∂x^i∂x^j  if i = j, or i ∉ I_k^ε and j ∉ I_k^ε,
H_k^{ij} = 0  otherwise.

Then the sequence {x_k} generated by iteration (35) converges to x*, and the rate of convergence of {|x_k − x*|} is superlinear (at least quadratic if ∇²f is Lipschitz continuous in a neighborhood of x*). Note that, by making use of the result of Proposition 3, it follows that when f is a positive definite quadratic function, the algorithm of Proposition 4 solves problem (1) in a finite number of iterations, provided the unique solution x* satisfies Assumption (C). The algorithm of Proposition 4 also has the property that, for all k sufficiently large, the initial unity stepsize will be accepted by the Armijo rule. Our computational experience suggests that the unity stepsize is also acceptable for the great majority
of iterations even before the binding constraints at the solution are identified. We did observe, however, some cases where it was necessary to reduce the initial unity stepsize several times before a sufficient reduction in the objective function value was effected. The most typical situation where such a phenomenon can occur is when the scalar q_k defined by

q_k = min{1, ᾱ_k},   ᾱ_k = sup{α | x_k^i − αp_k^i ≥ 0, x_k^i > 0, i ∉ I_k^ε},

is small relative to unity. Under these circumstances, some nonbinding constraint that was not taken into account when forming the index set I_k^ε is encountered after a small movement along the arc {x_k(α) | α > 0}. As a result, it may occur that the objective function value increases as α is increased beyond q_k. A reasonable heuristic device to avoid a large number of function evaluations in such cases is to modify the line search so that, if at any iteration a fixed number r̄ of trial stepsizes 1, β, …, β^{r̄−1} fail to pass the Armijo rule test, then q_k is computed and used as the next trial stepsize. There is another (infrequent) situation where a unity initial stepsize may be inappropriate when far from convergence, and the Armijo rule may need a large number of stepsize reductions before determining an acceptable stepsize. This situation can arise when the sets of indices {i | x_k^i = 0, i ∉ I_k^ε} and {i | x_k^i = 0, i ∈ I_k^ε} are both nonempty.

Since the right-hand side of the Armijo-like test is positive whenever the current point is not critical, it is clear that the Armijo-like rule will yield a stepsize after a finite number of arithmetic operations. Taking into account the fact that the gradient of the transformed objective function is
∇h_k(y) = [(A_k^B)']^{-1} ∇f((A_k^B)^{-1} y),

and making use of (91), we can finally write iteration (95) in terms of the original variables as

(96)  x_{k+1} = (A_k^B)^{-1} [ A_k^B x_k − α_k D_k ((A_k^B)')^{-1} ∇f(x_k) ]^#,

where x_{k+1} = (A_k^B)^{-1} y_{k+1} and [·]^# denotes projection on the active rectangle. The algorithmic process by means of which x_{k+1} is obtained is illustrated in Fig. 3.
[Fig. 3. The process of obtaining x_{k+1}, shown in the original space and in the transformed space, with the contours of h_k drawn in the transformed space.]
It may appear that iteration (96) involves excessive computational overhead in view of the presence of the inverse (A_k^B)^{-1}. However, in many problems with special structure it is possible to compute this inverse very efficiently. For other problems we note that it is possible to organize the algorithm so that most of the indices in the sets B_k and B_{k+1} are common. In fact, at each iteration k, typically at most one nonbinding inequality not belonging to the current active rectangle B_k will become binding at the next iteration, the exception being the unlikely situation where more than one of the constraints (94) become simultaneously binding at y_{k+1}. In this case the matrices A_k^B and A_{k+1}^B need only differ by at most one row, and as a result the inverse (A_{k+1}^B)^{-1} can be obtained from (A_k^B)^{-1} by the Householder modification rule involving only O(n²) arithmetic operations (see Gill and Murray [8, p. 59]). Note also that if a number (say n_k) of the trivial constraints (89) participate in the formation of the active rectangle (90), then the inverse (A_k^B)^{-1} can be formed by matrix inversion of order (n − n_k). The reader who is familiar with manifold suboptimization methods, as described for example in Gill and Murray [8], will notice a strong similarity between the transformation process involved in these methods and the one employed above. The only essential difference is that in our method we use the active generalized rectangle X_k in place of the manifold of active constraints. The main advantage that algorithm (96) offers over manifold suboptimization alternatives is that as many as n new
constraints may become binding in a single iteration, while considerable flexibility is afforded in changing the active set of constraints. By contrast, in manifold suboptimization methods, barring exceptional circumstances, at most one new constraint will become binding in any single iteration, while dropping currently active constraints must be carefully controlled. Thus a fundamental limitation of these methods is substantially overcome, the capability of attaining superlinear convergence is maintained, and there is no need to solve a quadratic programming subproblem at each iteration. There are many issues relating to convergence, rate of convergence, active rectangle selection and implementation of the algorithm described in this section, but their discussion properly belongs to a separate paper. We provide instead a specific superlinearly convergent Newton-like implementation of algorithm (96) for the case where the constraint set is a simplex.

Example (minimization on a simplex). Consider the problem

(97)  minimize f(x) subject to x ≥ 0,  Σ_{i=1}^n x^i = 1,
where we assume that the function f : R^n → R is convex and twice continuously differentiable with everywhere positive definite Hessian matrix. Given a feasible vector x_k, let ī be an index such that

x_k^ī = max{x_k^1, …, x_k^n}.

We consider the transformation of variables defined by

y^i = x^i  ∀ i ≠ ī,   y^ī = Σ_{i=1}^n x^i,

thus implicitly forming an active rectangle consisting of the equation Σ_{i=1}^n x^i = 1 and the inequalities x^i ≥ 0, i ≠ ī. The inverse transformation is

x^i = y^i  ∀ i ≠ ī,   x^ī = y^ī − Σ_{i≠ī} y^i.

If we write this transformation as x = T_k y, where T_k is the appropriate matrix, the problem is transformed into

minimize h_k(y) ≜ f(T_k y)
subject to y^i ≥ 0  ∀ i ≠ ī,   y^ī = 1,   Σ_{i≠ī} y^i ≤ y^ī.

The last constraint is (by construction) inactive at the point y_k corresponding to x_k, so it will be ignored in the iteration of the Newton-like method of § 2. The first and second derivatives of the transformed objective function h_k with respect to the variables y^i at y_k are given by
The Newton-like iteration to be performed in the space of variables y is a slight variation of the one of § 2 (cf. Proposition 3) to account for the presence of the constraint y^ī = 1. It takes the form described below. Let

        p^i = [∂²h_k(y_k)/∂(y^i)²]^{−1},   i ≠ ī.

Let also

        I_k^+ = { i | i ≠ ī, 0 ≤ y_k^i ≤ ε_k, ∂h_k(y_k)/∂y^i > 0 },

where ε_k is chosen as in § 2 [cf. (81)], and form the matrix H_k with elements H_k^{ij}, i, j ≠ ī, given by

        H_k^{ij} = ∂²h_k(y_k)/(∂y^i ∂y^j)   if i ∉ I_k^+ and j ∉ I_k^+,
        H_k^{ij} = δ_{ij}/p^i                otherwise.

Let

        y_k^i(α) = [y_k^i − α(H_k^{−1} ∇h_k(y_k))^i]^+,   i ≠ ī, α ≥ 0.

Then y_{k+1} = y_k(α_k). The stepsize α_k is given by α_k = β^{m_k}, where m_k is the first nonnegative integer m such that

        h_k(y_k) − h_k(y_k(β^m)) ≥ σ [ β^m ∑_{i∉I_k^+} ∂h_k(y_k)/∂y^i · (H_k^{−1}∇h_k(y_k))^i + ∑_{i∈I_k^+} ∂h_k(y_k)/∂y^i · (y_k^i − y_k^i(β^m)) ]

and

        1 − ∑_{i≠ī} y_k^i(β^m) ≥ 0.

The vector x_{k+1} is then given by

        x_{k+1}^i = y_{k+1}^i   ∀ i ≠ ī,    x_{k+1}^ī = 1 − ∑_{i≠ī} y_{k+1}^i.
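The elimination-and-projection scheme above can be sketched in a few lines (a minimal illustration in our own notation, assuming gradient and Hessian callables; the active set I_k^+, the diagonal scaling, and the Armijo descent test are omitted, and the backtracking below enforces only nonnegativity of the eliminated component):

```python
import numpy as np

def newton_on_simplex(grad, hess, x0, beta=0.5, tol=1e-10, max_iter=100):
    """Sketch of a Newton method for min f(x) s.t. x >= 0, sum(x) = 1.

    At each iteration the largest component x^ibar is eliminated via the
    equality constraint, a projected Newton step is taken in the remaining
    variables y, and the step is cut back until 1 - sum(y) >= 0 so that
    the eliminated component stays nonnegative.  All names illustrative.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        ib = int(np.argmax(x))                      # index to eliminate
        keep = [i for i in range(n) if i != ib]
        g, H = grad(x), hess(x)
        # reduced gradient / Hessian of h(y) = f(T y), by the chain rule
        gy = g[keep] - g[ib]
        Hy = (H[np.ix_(keep, keep)] - H[np.ix_(keep, [ib])]
              - H[np.ix_([ib], keep)] + H[ib, ib])
        d = np.linalg.solve(Hy, gy)                 # Newton direction in y
        y, alpha = x[keep], 1.0
        while True:                                 # feasibility backtrack
            y_new = np.maximum(y - alpha * d, 0.0)
            if 1.0 - y_new.sum() >= 0.0 or alpha < 1e-12:
                break
            alpha *= beta
        x_new = np.empty(n)
        x_new[keep] = y_new
        x_new[ib] = 1.0 - y_new.sum()               # recover x^ibar
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For a positive definite quadratic with no nonnegativity constraint active at the solution, a single Newton step in the reduced variables already lands on the constrained minimum.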
As for Proposition 3, it is easily shown that this algorithm converges superlinearly to the unique (global) minimum of problem (97). The algorithm can be extended trivially to the case where, in addition to the nonnegativity constraints, there is a single equality or inequality constraint, as well as to the case where the constraint set consists of a Cartesian product of simplices. Similar algorithms can be written in
explicit form for problems with a large number of nonnegativity (or upper and lower bound) constraints and a small number of additional equality or inequality constraints. Newton-like algorithms of this type are particularly effective when the problem has special structure that facilitates the solution of the linear system of equations involved in implementing the basic iteration (cf. (98)).
4. Application in discrete-time optimal control: computational results. The algorithms of the paper are particularly well suited for discrete-time optimal control problems involving a discrete-time system of the form

(99)    x_{i+1} = f_i(x_i, u_i),   i = 0, 1, …, N − 1,

a cost functional of the form

(100)   G(x_N) + ∑_{i=0}^{N−1} g_i(x_i, u_i),

and simple constraints on the control vectors of the form

        b̲_i ≤ u_i ≤ b̄_i,   i = 0, 1, …, N − 1.

We assume that the functions f_i: R^{n+m} → R^n, g_i: R^{n+m} → R and G: R^n → R are twice continuously differentiable, and that N is a positive integer. Problems of this type are discussed, for example, in Varaiya [17], Polak [18], and Canon, Cullum and Polak [20]. They are often characterized by large dimension, particularly when they arise from discretization of continuous-time optimal control and calculus of variations problems. Each state vector x_i can be uniquely represented in terms of the control sequence u = {u_0, …, u_{N−1}} via the system equation (99) in the form

        x_i = φ_i(u),
where φ_i are the appropriate functions. The problem is then equivalent to

(101)   minimize J(u) = G[φ_N(u)] + ∑_{i=0}^{N−1} g_i[φ_i(u), u_i]
        subject to b̲_i ≤ u_i ≤ b̄_i,   i = 0, …, N − 1.
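The reduction to the control variables can be made concrete by a forward rollout that evaluates φ_i(u) and J(u) (a sketch with hypothetical callables standing in for f_i, g_i and G, not tied to any particular example):

```python
import numpy as np

def rollout_cost(f_list, g_list, G, x0, us):
    """Evaluate J(u) = G[phi_N(u)] + sum_i g_i[phi_i(u), u_i] by forward
    simulation of x_{i+1} = f_i(x_i, u_i); phi_i(u) is simply the state
    reached at stage i.  All argument names are illustrative.
    """
    x, J = np.asarray(x0, dtype=float), 0.0
    for f_i, g_i, u in zip(f_list, g_list, us):
        J += g_i(x, u)          # running cost g_i[phi_i(u), u_i]
        x = f_i(x, u)           # advance the dynamics to phi_{i+1}(u)
    return J + G(x)             # terminal cost at x_N = phi_N(u)
```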
It is well known (see Mitter [21], Polak [18]) that the unconstrained Newton direction −[∇²J(u)]^{−1}∇J(u) for this problem can be efficiently computed by means of the Riccati equation. An algorithm such as the one of Proposition 4 can also be similarly implemented via the Riccati equation. At each iteration k we first determine the set of indices I_k^+ [cf. (81)]. We then compute the Newton direction with respect to the control vector coordinates corresponding to indices i ∉ I_k^+ via the Riccati equation, while we compute the (diagonally) scaled steepest descent direction for the remaining coordinates corresponding to indices i ∈ I_k^+. The overall algorithm is thus very similar to the one used for the corresponding unconstrained problem. It is well suited for large-scale linear-quadratic problems with simple control constraints, for which pivoting methods are apparently very cumbersome and inefficient. Our computational example is of this type.
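The Riccati computation mentioned here can be sketched as follows for the purely linear-quadratic case (our own illustrative implementation of the standard backward recursion; the partition into I_k^+ and its complement, which the algorithm of Proposition 4 adds on top, is omitted, and the cost convention is assumed):

```python
import numpy as np

def lq_riccati_solution(A, B, Q, R, QN, x0, N):
    """Backward Riccati sweep for the unconstrained discrete-time LQ problem
    minimizing 0.5*sum(x'Qx + u'Ru) + 0.5*x_N'QN x_N over x_{i+1} = Ax_i + Bu_i.

    For a quadratic cost the minimizing control sequence coincides with the
    result of a single Newton step from any starting u, which is how the
    Newton direction is obtained in O(N) work instead of inverting the
    full (N*m x N*m) Hessian of J.
    """
    K = QN.copy()
    gains = []
    for _ in range(N):                              # backward recursion
        BtK = B.T @ K
        L = np.linalg.solve(R + BtK @ B, BtK @ A)   # feedback gain L_i
        K = Q + A.T @ K @ (A - B @ L)               # Riccati update
        gains.append(L)
    gains.reverse()
    xs, us = [np.asarray(x0, dtype=float)], []
    for i in range(N):                              # forward rollout u_i = -L_i x_i
        u = -gains[i] @ xs[-1]
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)
    return us, xs
```

Because J is quadratic in u for linear dynamics, the returned control sequence equals the result of a full Newton step from any starting trajectory, without ever forming ∇²J(u) explicitly.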
Consider the two-dimensional linear system
The initial state x_0 = (x_0^1, x_0^2) is given and the control constraints are

(103)   −1 ≤ u_i ≤ 1,   i = 0, 1, …, N − 1.
The problem is to minimize
where the matrix Q and the scalar R are given by
This problem arises from discretization of the continuous-time problem of minimizing
subject to
and
If the interval [0, T] is discretized into N intervals of length s = T/N
and the approximation
is used, then problem (102)-(105) is a discretized version of problem (106)-(110). We show in Table 1, for a variety of values of N and s, the number of iterations required by the method of Proposition 4 to obtain the exact solution for two initial states x_0 = (15, 5) and x_0 = (5, −10) and two initial control trajectories u^0 = 0 and u^0 = 1. In all runs we chose ε = 0.01, β = 0.5 and σ = . All computations were performed in double precision, and this was found essential for large values of N. The results demonstrate the ability of the method to identify the set of binding constraints in very few iterations. It is worth noting that, while the table gives results for N only up to 10,000, an incomplete set of experiments was run with N = 25,000, and very similar performance of the method was observed.
TABLE 1
(number of iterations to exact solution; number of binding constraints at solution)
REFERENCES

[1] A. A. GOLDSTEIN, Convex programming in Hilbert space, Bull. Amer. Math. Soc., 70 (1964), pp. 709-710.
[2] E. S. LEVITIN AND B. T. POLJAK, Constrained minimization problems, U.S.S.R. Comput. Math. Math. Phys., 6 (1966), pp. 1-50.
[3] G. P. MCCORMICK, Anti-zig-zagging by bending, Management Science, 15 (1969), pp. 315-319.
[4] D. P. BERTSEKAS, On the Goldstein-Levitin-Poljak gradient projection method, IEEE Trans. Automat. Control, AC-21 (1976), pp. 174-184.
[5] J. C. DUNN, Newton's method and the Goldstein step-length rule for constrained minimization problems, this Journal, 6 (1980), pp. 659-674.
[6] U. M. GARCIA-PALOMARES AND O. L. MANGASARIAN, Superlinearly convergent quasi-Newton algorithms for nonlinearly constrained optimization problems, Math. Prog., 11 (1976), pp. 1-13.
[7] J. B. ROSEN, The gradient projection method for nonlinear programming, Part I: linear constraints, J. Soc. Ind. Appl. Math., 8 (1960), pp. 181-217.
[8] P. E. GILL AND W. MURRAY, eds., Numerical Methods for Constrained Optimization, Academic Press, New York, 1974, Chs. 2, 3.
[9] D. GOLDFARB, Extension of Davidon's variable metric algorithm to maximization under linear inequality and equality constraints, SIAM J. Applied Math., 17 (1969), pp. 739-764.
[10] D. G. LUENBERGER, Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1973, Ch. 11.
[11] M. L. LENARD, A computational study of active set strategies in nonlinear programming with linear constraints, Math. Prog., 16 (1979), pp. 81-97.
[12] G. P. MCCORMICK, The variable reduction method for nonlinear programming, Management Science, 17 (1970), pp. 146-160.
[13] R. K. BRAYTON AND J. CULLUM, An algorithm for minimizing a differentiable function subject to box constraints and errors, J. Opt. Th. Appl., 29 (1979), pp. 521-558.
[14] W. MURRAY, Second derivative methods, in Numerical Methods for Unconstrained Optimization, W. Murray, ed., Academic Press, New York, 1972, pp. 57-71.
[15] R. FLETCHER AND T. L. FREEMAN, A modified Newton method for minimization, J. Opt. Th. Appl., 23 (1977), pp. 357-372.
[16] J. J. MORÉ AND D. C. SORENSEN, On the use of directions of negative curvature in a modified Newton method, Math. Prog., 16 (1979), pp. 1-20.
[17] P. P. VARAIYA, Notes on Optimization, Van Nostrand-Reinhold, New York, 1972.
[18] E. POLAK, Computational Methods in Optimization: A Unified Approach, Academic Press, New York, 1971.
[19] D. P. BERTSEKAS, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1981 (to appear).
[20] M. D. CANON, C. D. CULLUM AND E. POLAK, Theory of Optimal Control and Mathematical Programming, McGraw-Hill, New York, 1970.
[21] S. K. MITTER, Successive approximation methods for the solution of optimal control problems, Automatica, 3 (1966), pp. 135-149.
[22] J. C. DUNN, Global and asymptotic convergence rate estimates for a class of projected gradient processes, this Journal, 19 (1981), pp. 368-400.