A View of Unconstrained Minimization Algorithms That Do Not Require Derivatives

M.J.D. POWELL, AERE Harwell

Existing algorithms are examined, with particular attention given to their merits and defects, in order to identify the techniques which may give methods for minimization without derivatives that are better than the methods we have at present. "Quasi-Newton" and "conjugate direction" algorithms are studied, and it is noted that both these classes of methods have fundamental disadvantages. To overcome these disadvantages another class of algorithms suggests itself, called "B-conjugate" methods. Some comments are made on the construction of good algorithms within this class.

Key Words and Phrases: conjugate directions, minimization, optimization
CR Categories: 5.41

Copyright © 1975, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. A version of this paper was presented at Mathematical Software II, a conference held at Purdue University, West Lafayette, Indiana, May 29-31, 1974. Author's address: Computer Science and Systems Division, Building 8.9, AERE Harwell, Didcot, OX11 0RA, England.

1. INTRODUCTION

We use the notation F(x) for the function whose least value is to be calculated, where x is a vector whose components are the variables which are to be adjusted automatically by a minimization algorithm. We let n be the number of components of x. To define F(x) for a minimization algorithm, the user must provide a subroutine that calculates F(x) for any x. He must also provide a starting vector x^(1), say, and perhaps some other information, for example step-lengths and the accuracy required. Then most algorithms automatically construct a sequence of points x^(k) (k = 1, 2, 3, ...), which should converge to the required vector of variables. The algorithms that we consider are iterative, and we let x^(k) be the starting point of the kth iteration. Typically between n and 3n values of F(x) are required by each iteration. The algorithms we consider include safeguards which force the inequality

    F(x^(k+1)) < F(x^(k)),  k = 1, 2, 3, ...,    (1.1)

to be satisfied.

We assume that the true function F(x) has continuous first and second derivatives, although the user does not provide a program which computes their values. In this case it is almost always best to select the minimization algorithm from the general methods that are designed to work particularly well when F(x) is a quadratic function.


The reason is that if the steps {x^(k+1) - x^(k)} become small, then usually the local behavior of F(x) is similar to that of a quadratic function, so the algorithm should work well; if the steps {x^(k+1) - x^(k)} are large, then usually good progress is being made anyway. Some very useful algorithms of this type are described in the books by Brent [1] and Kowalik and Osborne [10], which fall into the two classes mentioned in the abstract, namely, "quasi-Newton" algorithms and "conjugate direction" algorithms. In addition a number of direct search methods have been developed for the case when derivatives need not exist or need not be smooth; for example, see Kowalik and Osborne [10], Parkinson and Hutchinson [12], and Wilde [17]. However, we do not discuss direct search methods, because they do not contribute to the main purpose of this paper, which is to encourage the investigation of B-conjugate algorithms.

The class of quasi-Newton methods is considered in Section 2. We note that for all algorithms in this class it is necessary for the kth iteration (k = 1, 2, 3, ...) to obtain an estimate of the first derivative vector of F(x) at x^(k) which has good relative accuracy, but this requirement becomes harder and harder to achieve as x^(k) approaches the required solution, because the first derivative vector tends to zero. A comparison of current algorithms shows that this is a real difficulty, because different careful attempts to meet it cause variations by factors of two or more in the total computing time. Therefore, instead of quasi-Newton methods, we would prefer our minimization algorithms to be independent of high accuracy in first derivative estimates.

Conjugate direction methods are studied in Section 3. They do not require accurate first derivative estimates. Instead they use only calculated function values to seek search directions which are conjugate with respect to the current second derivative matrix of the objective function. Although these methods are usually quite successful when n is less than about 10, they have some disadvantages. For instance, special devices are often needed to keep linear independence in the search directions, and the concept of conjugacy is not well defined unless the second derivative matrix of F(x) is constant.

"B-conjugate methods" are introduced in Section 4. They avoid most of the disadvantages of the quasi-Newton and conjugate direction algorithms, by maintaining an estimate of the second derivative matrix of F(x), and by using linearly independent search directions that are conjugate with respect to this estimate. The first algorithm of this type was proposed by Greenstadt [8]. We consider some techniques for revising the second derivative estimate.

In Section 5 there is a discussion of the prospects of B-conjugate methods. We note that many choices have to be made to define a particular algorithm and that most of these choices are unexplored. Although it is still too early to say that B-conjugate methods will provide general computer subroutines for minimization without derivatives that are preferable to existing subroutines, the prospects seem to be excellent.

2. QUASI-NEWTON METHODS

At the beginning of the kth iteration (k = 1, 2, 3, ...) of a quasi-Newton method, we require a starting point x^(k), a vector g^(k), and a symmetric matrix B^(k).


The vector g^(k) is an estimate of the gradient of F(x) at x^(k), and the matrix B^(k) is an estimate of the second derivative matrix of F(x) at x^(k). Sometimes the errors in B^(k) are quite large. For example, in many useful algorithms it is advantageous to force B^(k) to be positive definite, even though the true second derivative matrix may have negative eigenvalues at x^(k). To simplify the description and discussion that follows, we suppose in this section that B^(k) is positive definite on every iteration. The derivative estimates provide the quadratic approximation

    F(x^(k) + δ) ≈ F(x^(k)) + δ^T g^(k) + ½ δ^T B^(k) δ,    (2.1)

where the superscript T distinguishes a row vector from a column vector. The value of δ that minimizes the right-hand side of eq. (2.1) satisfies the equation

    g^(k) + B^(k) δ = 0.    (2.2)

Therefore some quasi-Newton methods define x^(k+1) by the equation

    x^(k+1) = x^(k) - [B^(k)]^{-1} g^(k).    (2.3)

However, because this choice of x^(k+1) may conflict with inequality (1.1), it is usual to let x^(k+1) be the vector

    x^(k+1) = x^(k) - λ^(k) [B^(k)]^{-1} g^(k),    (2.4)

where λ^(k) is a multiplying factor which is chosen to satisfy condition (1.1) and possibly another condition to ensure that B^(k+1) is positive definite. To determine the value of λ^(k), we seek a good estimate of the least value of the function of one variable

    φ(λ) = F{x^(k) - λ[B^(k)]^{-1} g^(k)},    (2.5)
by calculating only a few actual values of φ(λ). A description of a suitable method for adjusting λ is given by Gill et al. [5]. Next the gradient of F(x) at the point (2.4) is estimated. Usually either forward or central differences are employed, with the ith component of g^(k+1) (i = 1, 2, ..., n) being defined by the equation

    g_i^(k+1) = {F(x^(k+1) + h_i e_i) - F(x^(k+1))}/h_i    (2.6)

or by the equation

    g_i^(k+1) = {F(x^(k+1) + h_i e_i) - F(x^(k+1) - h_i e_i)}/2h_i,    (2.7)

where e_i is the ith coordinate vector. The choice of the step-lengths h_i (i = 1, 2, ..., n) is discussed later. An important and valuable feature of the methods used to define B^(k+1) is that they require no more values of the objective function. A survey of these methods is given by Powell [14]. The various available and successful choices of B^(k+1) all satisfy the equation

    B^(k+1) [x^(k+1) - x^(k)] = [g^(k+1) - g^(k)],    (2.8)
because, when F(x) is a quadratic function, this equation is also satisfied by the true second derivative matrix. For example, one of the most useful choices of B^(k+1) is given by the Broyden-Fletcher-Shanno formula [3]

    B^(k+1) = B^(k) - [B^(k) δ^(k) δ^(k)T B^(k)] / [δ^(k)T B^(k) δ^(k)] + [y^(k) y^(k)T] / [y^(k)T δ^(k)],    (2.9)

where δ^(k) and y^(k) are the differences

    δ^(k) = x^(k+1) - x^(k);    y^(k) = g^(k+1) - g^(k).    (2.10)
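To make the shape of such an iteration concrete, the following Python sketch estimates the gradient by forward differences, as in eq. (2.6), chooses the multiplier of eq. (2.4) by a crude backtracking rule that enforces inequality (1.1), and revises the second derivative estimate by formula (2.9). It is only an illustration of the structure described above, not the implementation studied by Gill et al. [5]; the function names, the fixed step-lengths h, and the backtracking rule are assumptions made for the example.

```python
import numpy as np

def fd_gradient(F, x, h):
    """Forward-difference gradient estimate, eq. (2.6)."""
    g = np.empty_like(x)
    Fx = F(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h[i]
        g[i] = (F(x + e) - Fx) / h[i]
    return g

def update_B(B, delta, y):
    """Revise the second derivative estimate by formula (2.9),
    with delta = x_new - x_old and y = g_new - g_old as in eq. (2.10)."""
    Bd = B @ delta
    return B - np.outer(Bd, Bd) / (delta @ Bd) + np.outer(y, y) / (y @ delta)

def quasi_newton_step(F, x, g, B, h):
    """One iteration: search direction -B^{-1} g as in eq. (2.4), a crude
    backtracking choice of the multiplier, then a new gradient and a new B."""
    d = -np.linalg.solve(B, g)
    lam, Fx = 1.0, F(x)
    while F(x + lam * d) >= Fx and lam > 1e-10:   # enforce inequality (1.1)
        lam *= 0.5
    x_new = x + lam * d
    g_new = fd_gradient(F, x_new, h)
    B_new = update_B(B, x_new - x, g_new - g)
    return x_new, g_new, B_new
```

In practice the careful line search of Gill et al. [5] and the positive definiteness safeguards mentioned above would replace the crude backtracking step used here.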

Thus the data that is needed to begin the next iteration is calculated.

The extensive numerical results given by Gill et al. [5] indicate that the class of quasi-Newton methods contains the best of the available algorithms for minimization without derivatives. However, each iteration of a quasi-Newton method uses at least n function values to estimate first derivatives, but it uses only about three or four function values in the line search that seeks the minimum of the function (2.5). Thus in large problems only a small proportion of the function evaluations are applied directly to the main problem of reducing the objective function. This seems to be a poor strategy, unless F(x) is almost quadratic, so we can be hopeful that even better algorithms will be developed.

Another deficiency of quasi-Newton methods is that usually the search direction -[B^(k)]^{-1} g^(k) in expression (2.4) gives fast convergence only if the direction of g^(k) is a good approximation to the direction of the true gradient of F(x) at x^(k). However, the true gradient should tend to zero as k tends to infinity, so the difficulties of calculating a suitable vector g^(k) become more and more severe. Therefore many algorithms switch from formula (2.6) to formula (2.7) when g^(k) becomes small, in order to obtain higher accuracy at the cost of almost doubling the number of function values per iteration. Thus the precision of the calculated values of F(x) is very important. To avoid these extra function values, Cullum [2] suggests the formula

    g_i^(k+1) = {F(x^(k+1) + h_i e_i) - F(x^(k+1)) - ½ h_i² B_ii^(k)}/h_i    (2.11)

instead of eq. (2.7), where B_ii^(k) is the ith diagonal element of B^(k). Another way of obtaining better accuracy in the differences (2.6) and (2.7) is to avoid the use of adaptive methods in the calculation of F(x). For example, if F(x) is a definite integral which is calculated by a numerical quadrature formula, and if the weights and abscissas of the quadrature formula are held constant, then the leading error term of the quadrature formula usually cancels out when the difference (2.6) or (2.7) is formed.

The choice of the step-length h_i in eqs. (2.6) and (2.7) also causes problems. The earliest quasi-Newton method (see Stewart [16]) includes a technique that chooses h_i automatically, and numerical results show that it works quite well. However, Gill and Murray [4] suggest that it is better to keep h_i (i = 1, 2, ..., n) constant throughout the calculation in order that the leading error terms in g^(k) and g^(k+1) cancel when y^(k) is calculated from expression (2.10). In my opinion it is preferable that the step-lengths be adjusted automatically so that people who do not understand the difficulties of numerical differentiation can apply the minimization subroutines successfully, without expert advice on the choice of h_i.
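The accuracy questions discussed in this section are easy to demonstrate on a one-dimensional example. The sketch below compares the forward difference (2.6), the central difference (2.7), and Cullum's corrected difference (2.11); the particular function, point, and step-length are assumptions chosen only for illustration, and the true second derivative stands in for the estimate B_ii^(k).

```python
import numpy as np

# Accuracy of the difference formulas (2.6), (2.7), and (2.11) on a hypothetical
# one-dimensional example.
F = lambda x: np.exp(x) - 2.0 * x**2
x, h = 1.0, 1e-3
true_grad = np.exp(x) - 4.0 * x
B_ii = np.exp(x) - 4.0                      # true second derivative, standing in for B_ii^(k)

forward   = (F(x + h) - F(x)) / h                          # eq. (2.6)
central   = (F(x + h) - F(x - h)) / (2.0 * h)              # eq. (2.7)
corrected = (F(x + h) - F(x) - 0.5 * h**2 * B_ii) / h      # eq. (2.11)

for name, g in [("forward", forward), ("central", central), ("corrected", corrected)]:
    print(name, abs(g - true_grad))   # the central and corrected estimates are far more accurate
```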

3. CONJUGATE DIRECTION METHODS

In order to define conjugate directions clearly, we begin by supposing that F(x) is a positive definite quadratic function, whose second derivative matrix is G. Then the n nonzero directions d_i (i = 1, 2, ..., n) are mutually conjugate if and only if the equations

    d_i^T G d_j = 0,  i ≠ j,    (3.1)

hold. Conjugate directions are important to minimization algorithms, because in the quadratic case the following construction calculates the vector of variables that minimizes F(x). Let x_0 be any starting vector. For i = 1, 2, ..., n we let x_i be the vector

    x_i = x_{i-1} + λ_i d_i,    (3.2)

where λ_i is the value of λ that minimizes the function of one variable

    φ_i(λ) = F(x_{i-1} + λ d_i).    (3.3)

Then x_n is the point at which F(x) is least. Many proofs of this statement have been published; for instance, see Kowalik and Osborne [10].
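The construction of eqs. (3.1)-(3.3) can be checked directly on a small quadratic. The following sketch is a toy illustration: the matrix G, the vector b, the conjugate pair of directions, and the starting point are all assumptions chosen for the example, and the exact line search uses the closed form that is available in the quadratic case.

```python
import numpy as np

# Quadratic F(x) = 0.5 x^T G x - b^T x, whose minimizer solves G x = b.
G = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # positive definite second derivative matrix
b = np.array([1.0, 2.0])

# Two directions that are mutually conjugate with respect to G, eq. (3.1).
d1 = np.array([1.0, 0.0])
d2 = np.array([-1.0, 4.0])                 # chosen so that d1^T G d2 = 0
assert abs(d1 @ G @ d2) < 1e-12

x = np.array([5.0, -3.0])                  # any starting vector x_0
for d in (d1, d2):
    lam = (b - G @ x) @ d / (d @ G @ d)    # exact line search along d, eqs. (3.2)-(3.3)
    x = x + lam * d

print(x, np.linalg.solve(G, b))            # after n = 2 searches, x is the minimizer
```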

In a conjugate direction method for minimizing a general function without calculating derivatives, we begin the kth iteration at the point x^(k) with search directions d_i^(k) (i = 1, 2, ..., n). Initially these directions are the coordinate directions, but they are modified on each iteration by some method that should tend to make them mutually conjugate with respect to the second derivative matrix at the solution, say x*. Therefore, following the construction of the preceding paragraph, the main operation of the kth iteration is to let x_0^(k) = x^(k), and for i = 1, 2, ..., n to define x_i^(k) to be the point

    x_i^(k) = x_{i-1}^(k) + λ_i^(k) d_i^(k),    (3.4)

where again λ_i^(k) is determined by a line search to minimize F(x_i^(k)). Thus if x^(k) is close to x*, and if the search directions are almost mutually conjugate, we expect x^(k+1) to be much better than x^(k) as an estimate of x*. Next the kth iteration obtains the directions d_i^(k+1) (i = 1, 2, ..., n), which may involve some more values of the objective function. Then a few extra function values may be needed to fix x^(k+1). Usually the value of F(x^(k+1)) is the least calculated value of the objective function, and always the inequality

    F(x^(k+1)) ≤ F(x_n^(k))    (3.5)

is satisfied. For example, most iterations of Powell's algorithm [13] use the formulas

    d_i^(k+1) = d_{i+1}^(k),  i = 1, 2, ..., n - 1;    d_n^(k+1) = Σ_{i=1}^{n} λ_i^(k) d_i^(k),    (3.6)
and x^(k+1) is obtained by a line search from x_n^(k) in the direction d_n^(k+1). Provided that λ_1^(k) is nonzero for all values of k, it may be proved that this method obtains the least value of a quadratic function in at most n iterations.

The conjugate direction methods avoid the two main drawbacks of the quasi-Newton methods, because they do not require estimates of g^(k) (k = 1, 2, 3, ...), and because most function values are applied to reduce the objective function. However, we note that they too have some disadvantages.

One disadvantage is that it is sometimes awkward to ensure that for all k the directions d_i^(k) (i = 1, 2, ..., n) have good linear independence properties. For example, if λ_1^(k) is small in comparison with λ_i^(k) (i = 2, 3, ..., n), then eq. (3.6) requires modification. In this case Powell's algorithm [13] makes the search direction d_1^(k+1) equal to d_1^(k), although this may postpone quadratic termination by one iteration. These modifications are certainly needed in general, but they can slow the method down, particularly when n is greater than about 10. To avoid this difficulty, Brent [1] suggests a different modification to Powell's algorithm, which requires the eigenvalues and vectors of an n × n symmetric matrix to be calculated after every n iterations. The extra work of the eigenproblems can cause the total computing time to be greater than before if each evaluation of F(x) requires comparatively little time. However, it usually gives a worthwhile reduction in the number of function values needed for the whole minimization calculation, so Brent's method is recommended for serious problems where the calculation of F(x) is quite long.

Another disadvantage is that conjugate directions may not be well determined when F(x) is not quadratic. We explain this remark by the example of minimizing Rosenbrock's function [15]
    F(x_1, x_2) = 100(x_1² - x_2)² + (1 - x_1)².    (3.7)

If the initial value of x_1 is negative, then the sequence of points x^(k) (k = 1, 2, 3, ...) usually passes close to (0, 0), and then the path to the solution at (1, 1) stays near the parabola x_2 = x_1². Thus it passes close to (½, ¼), and we consider the problem of estimating conjugate directions near this point. The direction of the parabola at (½, ¼) is (1, 1), so let us seek the direction conjugate to (1, 1) with respect to the symmetric matrix G(x), where G(x) is the true second derivative matrix of F(x) for some value of x. If x = (0.499, 0.25), then the required direction is (-0.3330, 1); if x = (0.5, 0.25), then the required direction is (0, 1); and if x = (0.501, 0.25), then the required direction is (0.1428, 1). Note that the direction is very sensitive to the value of x.

The difficulty occurs because the second derivative matrix of F(x) at (½, ¼) is almost singular. It is usual to have a line of singularities near the bottom of a curved valley; for example, the second derivative matrix of the function (3.7) is singular on the parabola x_2 = x_1² + 0.005. Therefore it is calamitous that minimization algorithms often generate sequences of points x^(k) (k = 1, 2, 3, ...) that follow curved valleys. Thus the aim of trying to obtain linearly independent conjugate directions, including search directions which allow moves along the floors of any valleys, is frequently ambiguous. These remarks make the justification for conjugate direction methods rather uncertain, except in regions of x-space where F(x) satisfies a strict convexity condition. However, no difficulties of this nature occur if F(x) has continuous second derivatives and if, instead of seeking conjugate directions, we seek a local approximation to the second derivative matrix. This brings us to the subject of B-conjugate methods.
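The sensitivity described above is easy to reproduce. The short sketch below is only a numerical check of the figures just quoted: it forms the true second derivative matrix of the function (3.7) at the three points mentioned and computes, for each, the direction conjugate to (1, 1).

```python
import numpy as np

def hessian(x1, x2):
    """True second derivative matrix of Rosenbrock's function (3.7)."""
    return np.array([[1200.0 * x1**2 - 400.0 * x2 + 2.0, -400.0 * x1],
                     [-400.0 * x1,                        200.0]])

u = np.array([1.0, 1.0])                    # direction of the parabola at (1/2, 1/4)
for x1 in (0.499, 0.5, 0.501):
    v = hessian(x1, 0.25) @ u
    ratio = -v[1] / v[0]                    # (ratio, 1) satisfies u^T G d = 0
    print(x1, round(ratio, 4))              # -0.333, 0.0, 0.1428: the conjugate direction swings wildly
```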

4. B-CONJUGATE METHODS

For the reasons given at the end of Section 3, we prefer to approximate second derivatives of F(x) instead of revising estimates of conjugate directions. As in Section 2, we let B^(k) be the approximate second derivative matrix at x^(k), but now we relax the condition that B^(k) is to be positive definite. We wish to avoid the strong dependence on first derivative estimates that is present in quasi-Newton methods. Therefore the main data that is required for the kth iteration of a B-conjugate method is only x^(k) and B^(k). We must use this data in a way that gives fast convergence when x^(k) is close to the solution x*, and when B^(k) is close to the true second derivative matrix G(x*). Therefore the first operation of a B-conjugate algorithm is to choose linearly independent search directions d_i^(k) (i = 1, 2, ..., n) that satisfy the conjugacy conditions

    d_i^(k)T B^(k) d_j^(k) = 0,  i ≠ j.    (4.1)
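One possible choice, given here only as an illustration and not as a recommendation made in the text, is to take the directions to be eigenvectors of the symmetric matrix B^(k); they satisfy eq. (4.1) and, being orthonormal, they are also strongly linearly independent.

```python
import numpy as np

def b_conjugate_directions(B):
    """Directions d_1, ..., d_n with d_i^T B d_j = 0 for i != j, eq. (4.1).
    Eigenvectors of the symmetric matrix B are one admissible choice: they are
    mutually orthogonal, and B d_j is a multiple of d_j, so the cross terms vanish."""
    _, vectors = np.linalg.eigh(B)
    return [vectors[:, j] for j in range(B.shape[0])]

B = np.array([[2.0, 1.0],
              [1.0, -3.0]])                # an indefinite second derivative estimate
d = b_conjugate_directions(B)
print(d[0] @ B @ d[1])                     # approximately zero: condition (4.1) holds
```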

There is much freedom in this choice, and different ways of fixing the freedom give different algorithms in the class of B-conjugate methods. Then we follow the main operation of a conjugate direction method; that is, we let x_0^(k) = x^(k), and for i = 1, 2, ..., n we define x_i^(k) by eq. (3.4), where λ_i^(k) is determined as before. Again we may calculate some extra function values to define x^(k+1), subject to condition (3.5), and to determine the data for the (k + 1)th iteration. Thus so far a B-conjugate method is the same as a conjugate direction method, the conjugate directions being related to B^(k) by the matrix equation

    [B^(k)]^{-1} = Σ_{i=1}^{n} z_i d_i^(k) d_i^(k)T,    (4.2)

for certain values of the multipliers z_i (i = 1, 2, ..., n).

However, the problem of calculating the new second derivative approximation B^(k+1) is better defined than seeking new conjugate directions. The remainder of this section discusses a method for obtaining B^(k+1) from B^(k) using only calculated values of F(x). We recall that in quasi-Newton methods the matrix B^(k+1) satisfies eq. (2.8), because this equation is obtained by the true second derivative matrix when F(x) is a quadratic function. We follow a similar approach now, because we seek conditions on B^(k+1) that would be satisfied by the true second derivative matrix in the quadratic case. For example, when the value of λ_i^(k) is determined for eq. (3.4), the function F(x_{i-1}^(k) + λ d_i^(k)) is usually calculated for three different values of λ. We fit a quadratic in λ to these function values and require the term d_i^(k)T B^(k+1) d_i^(k) to equal the second derivative of this quadratic, because this condition on B^(k+1) is correct when F(x) is a quadratic function. Another useful pattern of function values occurs when F(x) is known at the four corners of a rhombus, say when the function values F(a), F(a + p), F(a + q), and F(a + p + q) are calculated. In this case we force the condition

    p^T B^(k+1) q = F(a) - F(a + p) - F(a + q) + F(a + p + q)    (4.3)
to be satisfied, because it is obtained by the true second derivative matrix in the quadratic case. Schemes of this kind provide suitable conditions on B^(k+1) in terms of calculated values of F(x).

These conditions should make full use of the function values that are needed to obtain x_n^(k) from x_0^(k). Also, function values from previous iterations may be used, and it may be convenient to calculate F(x) at some extra points. This freedom allows many different algorithms in the class of B-conjugate methods. It is important to note that all these conditions on B^(k+1) can be expressed in the linear form

    Σ_{i,j} C_{ij}^(t) B_{ij}^(k+1) = μ_t,  t = 1, 2, ..., m,    (4.4)
where the numbers C_{ij}^(t), μ_t, and m are known. Also we require the matrix B^(k+1) to be symmetric. Further, we would like B^(k+1) to inherit any good information in B^(k) obtained by the (k - 1)th and earlier iterations. All these requirements are rather daunting when F(x) is a general function. Therefore we look for a method of calculating B^(k+1) that can be applied when F(x) is a general function and that meets the requirements when F(x) is a quadratic function. Then we can be hopeful of obtaining good convergence properties when x^(k) is close to the vector of variables that minimizes F(x).

One method of this type can be obtained by orthogonal projections. To describe it we employ the Euclidean matrix norm

    ||A|| = [Σ_{i,j} A_{ij}²]^{1/2}    (4.5)

in the linear space of n × n real matrices. With respect to this norm we let B^(k+1) be the matrix that is closest to B^(k), subject to eqs. (4.4) and to symmetry. In other words, B^(k+1) is the orthogonal projection of B^(k) onto the linear manifold of symmetric matrices that satisfy the conditions (4.4). This method works quite well when F(x) is a quadratic function, because the second derivative matrix of F(x), G say, is in this linear manifold. Therefore the fact that the projection is orthogonal implies the equation

    ||B^(k+1) - G||² = ||B^(k) - G||² - ||B^(k+1) - B^(k)||²,    (4.6)
showing that the new second derivative approximation is at least as good as the old one. Further, the new approximation is better unless B^(k) already satisfies the conditions (4.4) on B^(k+1). Although eq. (4.6) is promising, it does not guarantee that the second derivative approximations become good. It is necessary to choose the conditions (4.4) carefully. This point is considered further in Section 5.
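The projection described above reduces to a small linear system. The sketch below is a minimal illustration, assuming that the conditions are given in the bilinear form p^T B^(k+1) q = μ used in eqs. (4.3) and (4.4); the function name and the example data are hypothetical.

```python
import numpy as np

def project_B(B, conditions):
    """Orthogonal projection, in the norm (4.5), of the symmetric matrix B onto the
    symmetric matrices X that satisfy conditions of the form p^T X q = mu, which is
    the form taken by (4.3) and by the curvature conditions.  The projection is
    B + sum_t lambda_t S_t with S_t = (p q^T + q p^T)/2, where the multipliers solve
    the m x m linear system <S_s, S_t> lambda = mu - <S_t, B>."""
    S = [0.5 * (np.outer(p, q) + np.outer(q, p)) for p, q, _ in conditions]
    mu = np.array([m for _, _, m in conditions])
    gram = np.array([[np.sum(Si * Sj) for Sj in S] for Si in S])
    rhs = mu - np.array([np.sum(Si * B) for Si in S])
    lam = np.linalg.solve(gram, rhs)       # assumes the conditions are consistent and independent
    return B + sum(l * Si for l, Si in zip(lam, S))

# Example: impose a curvature condition d^T X d = 6 and a cross term d^T X h = 1 on B = I.
B = np.eye(2)
d, h = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(project_B(B, [(d, d, 6.0), (d, h, 1.0)]))   # [[6, 1], [1, 1]]
```

When F(x) is quadratic the true matrix G satisfies every such condition, which is exactly the property that gives eq. (4.6).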

5. DISCUSSION

Although the most successful algorithms now for minimization without derivatives are quasi-Newton and conjugate direction methods, we have noted major disadvantages in both these classes of methods. Difficulties occur in quasi-Newton methods because of the strong dependence on accurate first derivatives, and in conjugate direction methods the revision of the conjugate directions can be a very poorly defined problem. However, the estimation of second derivatives in quasi-Newton methods seldom impairs efficiency, and the fact that conjugate direction methods usually search along n independent directions on every iteration helps to avoid jamming away from the solution. Therefore the prospect that some B-conjugate methods may retain the advantages and lose the disadvantages of current algorithms is quite exciting.

There is so much choice within the class of B-conjugate methods that finding a good algorithm may take a long time, particularly because comparisons should be made with current methods that have been designed and programmed carefully. The main purpose of this section is to make some remarks intended to help the development of B-conjugate algorithms.

For the reasons given in Section 1, it seems important to require the algorithms to work well when F(x) is a quadratic function. There are two main approaches to this problem. One is to arrange that when F(x) is quadratic the algorithm terminates after a finite number of steps in exact arithmetic. The other is to design the method so that it has a superlinear rate of convergence. Quadratic termination can be achieved quite directly by using at least ½(n + 1)(n + 2) function values on every iteration. Algorithms of this type are proposed by Winfield [18] and by Mifflin [11]. Although these methods compare favorably with other techniques when n is small, they tend to require more work than the algorithms described in Section 2 when n is large and x^(1) is a poor starting approximation. Therefore I think it is best to study B-conjugate methods that restrict the number of new function values on each iteration to a small multiple of n. In this case our knowledge of quasi-Newton methods (see Powell [14], for instance) suggests that it is possible to achieve quadratic termination after at most n iterations. However, I do not know of a technique that gives quadratic termination which is well suited to B-conjugate methods, and it would be good to have one.

Instead of quadratic termination, it is feasible to obtain superlinear convergence in the quadratic case from eq. (4.6). A suitable method is to ensure that the conditions (4.4) cause ||B^(k+1) - B^(k)|| to be bounded away from zero if the inequality

    ||x^(k+1) - x*|| / ||x^(k) - x*|| ≥ c    (5.1)
is satisfied, where x* is the vector of variables that minimizes F(x), and c is any positive constant. The following algorithm has this property. On each iteration the search directions d_i^(k) (i = 1, 2, ..., n) satisfying condition (4.1) must also meet a strict linear independence condition, for example, the inequality

    |det(d_1^(k), d_2^(k), ..., d_n^(k))| / Π_{i=1}^{n} ||d_i^(k)|| ≥ ε,    (5.2)
where ε is a positive constant. Then, after applying eq. (3.4) for i = 1, 2, ..., n, we continue the sequence of points x_i^(k) (i = 1, 2, 3, ...) by defining x_{n+i}^(k) (i = 1, 2, ..., n) to be the vector

    x_{n+i}^(k) = x_{n+i-1}^(k) + λ_{n+i}^(k) d_i^(k),    (5.3)

where λ_{n+i}^(k) is calculated by a line search to minimize F(x_{n+i}^(k)). We let x^(k+1) be the point x_{2n}^(k). The advantage of using each search direction twice is that the calculated function values provide not only the quantities d_i^(k)T B^(k+1) d_i^(k) (i = 1, 2, ..., n), but also the cross-terms d_i^(k)T B^(k+1) h_i^(k) (i = 1, 2, ..., n), where h_i^(k) is the vector

    h_i^(k) = Σ_{j=i+1}^{n+i-1} λ_j^(k) d_{[j]}^(k),    (5.4)

[j] being the integer

    [j] = j,      j ≤ n;
    [j] = j - n,  j > n.    (5.5)
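A minimal sketch of one iteration of this double-sweep scheme is given below. It assumes the illustrative project_B function from the sketch at the end of Section 4, takes the eigenvectors of B as the B-conjugate directions, and replaces a proper line search by a three-point quadratic fit; all of these are assumptions made for the example rather than choices recommended by the text.

```python
import numpy as np

def quad_fit_search(F, x, d, s=1e-2):
    """Fit a quadratic in lambda to F(x + lambda*d) at lambda = -s, 0, s and return
    (step to the fitted minimum, slope at lambda = 0, curvature).  A crude stand-in
    for a proper line search; the trial step s is an assumption."""
    f0, fp, fm = F(x), F(x + s * d), F(x - s * d)
    slope = (fp - fm) / (2.0 * s)
    curv = (fp - 2.0 * f0 + fm) / s**2
    lam = -slope / curv if curv > 0 else -np.sign(slope) * s
    return lam, slope, curv

def double_sweep_iteration(F, x, B):
    """One iteration of the scheme of this section: search along each of the n
    B-conjugate directions twice, eqs. (3.4) and (5.3), collect the n curvature
    conditions d_i^T B' d_i and the n cross-term conditions d_i^T B' h_i, and
    project B onto them with the illustrative project_B of Section 4."""
    n = x.size
    vecs = np.linalg.eigh(B)[1]                  # one admissible B-conjugate set
    d = [vecs[:, j] for j in range(n)]
    points, conditions = [x.copy()], []
    for i in range(2 * n):                       # two sweeps through the directions
        di = d[i % n]
        lam, slope, curv = quad_fit_search(F, points[-1], di)
        if i < n:
            conditions.append((di, di, curv))    # curvature from the quadratic fit
        else:
            h = points[-1] - points[i - n + 1]   # the vector h_i of eq. (5.4)
            conditions.append((di, h, slope))    # cross-term condition d_i^T B' h_i
        points.append(points[-1] + lam * di)
    return points[-1], project_B(B, conditions)
```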

Thus m = 2n in eq. (4.4). B^(k+1) is calculated by the orthogonal projection technique described at the end of Section 4. To prove that this algorithm has superlinear convergence in the quadratic case, it may be shown that inequalities (5.1) and (5.2) imply the existence of a positive constant, δ say, such that the condition

    max_i |d_i^(k)T B^(k+1) h_i^(k)| / (||d_i^(k)|| ||h_i^(k)||) ≥ δ    (5.6)

is satisfied. However, the terms d_i^(k)T B^(k) h_i^(k) (i = 1, 2, ..., n) are all zero. Thus ||B^(k+1) - B^(k)|| is bounded away from zero, so eq. (4.6) implies that inequality (5.1) is satisfied for only a finite number of values of k. Since the constant c may be chosen to be arbitrarily small, superlinear convergence is obtained.

I programmed an elementary version of this algorithm and applied it to some numerical examples. It performed about as well as the Powell method [13]. However, because it occasionally causes eqs. (4.4) to be inconsistent, I switched to a different method, which unfortunately worked more slowly in practice, despite the fact that it also gives superlinear convergence. Therefore I am not able to recommend a particular B-conjugate algorithm at present.

The following open questions may help the choice of a good algorithm. Can a quadratic termination property be obtained conveniently? Should the freedom that is present in the choice of the directions d_i^(k) (i = 1, 2, ..., n), after satisfying condition (4.1), be used to provide an algorithm which is similar to the type of method described in Section 3? Can this freedom be used to reduce the routine work of each iteration to order n² operations? Is it worthwhile to use function values calculated before the kth iteration to obtain conditions on B^(k+1)? How should the points at which F(x) is calculated on the kth iteration be distributed? Are other methods for calculating B^(k+1) better than the orthogonal projection technique described at the end of Section 4?

Goldfarb [6] and Greenstadt [7] are relevant to this last question, for they consider projection methods for revising second derivative approximations with respect to a norm that is a generalization of eq. (4.5). They use norms of the form

    ||A||_W = [Σ_{i,j} (W^{-1/2} A W^{-1/2})_{ij}²]^{1/2},    (5.7)
where W is a positive definite matrix. Thus they obtain equations which express B^(k+1) in terms of B^(k) and W, and different choices of W give different useful formulas for B^(k+1). Especially interesting expressions occur when W = B^(k) and when W = B^(k+1), ignoring the fact that these matrices are not constant positive definite matrices. In these cases there is a correspondence with the usual quasi-Newton formulas, including eq. (2.9). Thus B-conjugate methods can include formulas for B^(k+1) which are similar to the expressions that are so successful in quasi-Newton methods.

Greenstadt [9] describes a B-conjugate algorithm that calculates B^(k+1) by a projection method based on the norm (5.7), where W is the matrix B^(k). In this method the choice of the directions d_i^(k) (i = 1, 2, ..., n) depends on an estimate of the first derivative vector of F(x) at x^(k).

ACKNOWLEDGMENT

I wish to acknowledge that my work on B-conjugate methods began by studying and trying to broaden the techniques of Greenstadt's two algorithms for minimization without derivatives [8, 9].

REFERENCES

1. BRENT, R.P. Algorithms for Minimization Without Derivatives. Prentice-Hall, Englewood Cliffs, N.J., 1973.
2. CULLUM, J. Modified rank one - no derivative unconstrained optimization method (MRIND). IBM Tech. Disclosure Bull. 14 (1972), 3732-3733.
3. FLETCHER, R. A new approach to variable metric algorithms. Computer J. 13 (1970), 317-322.
4. GILL, P.E., AND MURRAY, W. Quasi-Newton methods for unconstrained optimization. J. Inst. Math. Appl. 9 (1972), 91-108.
5. GILL, P.E., MURRAY, W., AND PITFIELD, R. The implementation of two revised quasi-Newton algorithms for unconstrained optimization. Rep. NAC 11, Nat. Physical Lab., Teddington, England, 1972.
6. GOLDFARB, D. A family of variable metric methods derived by variational means. Math. Computation 24 (1970), 23-26.
7. GREENSTADT, J. Variations on variable-metric methods. Math. Computation 24 (1970), 1-22.
8. GREENSTADT, J. A quasi-Newton method with no derivatives. Math. Computation 26 (1972), 145-166.
9. GREENSTADT, J. Improvements in a QNWD method. Rep. 320-3306, IBM Corp., Palo Alto, Calif., 1972.
10. KOWALIK, J., AND OSBORNE, M.R. Methods for Unconstrained Optimization Problems. Elsevier, New York, 1968.
11. MIFFLIN, R. A superlinearly convergent algorithm for minimization without evaluating derivatives. Rep. 65, Administrative Sci., Yale U., New Haven, Conn., 1974.
12. PARKINSON, J.M., AND HUTCHINSON, D. An investigation into the efficiency of variants on the simplex method. In Numerical Methods for Nonlinear Optimization, F.A. Lootsma (Ed.), Academic Press, London, 1972, ch. 8.
13. POWELL, M.J.D. An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer J. 7 (1964), 155-162.
14. POWELL, M.J.D. Recent advances in unconstrained optimization. Math. Progr. 1 (1971), 26-57.
15. ROSENBROCK, H.H. An automatic method for finding the greatest or the least value of a function. Computer J. 3 (1960), 175-184.
16. STEWART, G.W. A modification of Davidon's minimization method to accept difference approximations of derivatives. J. ACM 14, 1 (Jan. 1967), 72-83.
17. WILDE, D.J. Optimum Seeking Methods. Prentice-Hall, Englewood Cliffs, N.J., 1964.
18. WINFIELD, D. Function minimization by interpolation in a data table. J. Inst. Math. Appl. 12 (1973), 339-347.

Received June 1974; revised October 1974