On the Global Convergence of BFGS Method for Nonconvex Unconstrained Optimization Problems

Dong-Hui Li¹
Department of Applied Mathematics, Hunan University, Changsha, China 410082
e-mail: [email protected]

Masao Fukushima
Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
e-mail: [email protected]

April 7, 1999

Abstract
This paper is concerned with the open problem of whether the BFGS method with inexact line search converges globally when applied to nonconvex unconstrained optimization problems. We propose a cautious BFGS update and prove that the method with either a Wolfe-type or an Armijo-type line search converges globally if the function to be minimized has Lipschitz continuous gradients.

Key words: unconstrained optimization, BFGS method, global convergence

¹ Present address (available until October, 1999): Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan, e-mail: [email protected]

1 Introduction

The BFGS method is a well-known quasi-Newton method for solving unconstrained optimization problems. Because of its favorable numerical performance and fast theoretical convergence, it has become a method of choice for engineers and mathematicians interested in solving optimization problems. The local convergence theory of the BFGS method is well established [3, 4]. The study of its global convergence has also made good progress. In particular, for convex minimization problems, it has been shown that the iterates generated by BFGS converge globally if the exact line search or certain special inexact line searches are used [1, 2, 5, 8, 13, 14, 15]. By contrast, however, little is known concerning the global convergence of the BFGS method for nonconvex minimization problems. Indeed, so far, no one has proved global convergence of the BFGS method for nonconvex minimization problems, nor has anyone given a counterexample showing nonconvergence. Whether the BFGS method converges globally for a nonconvex function remains unanswered. This question has been raised many times and is currently regarded as one of the most fundamental open problems in the theory of quasi-Newton methods [7, 12]. Recently, the authors [10] proposed a modified BFGS method and established its global convergence for nonconvex unconstrained optimization problems. The authors [9] also proposed a globally convergent Gauss-Newton-based BFGS method for symmetric nonlinear equations, which contain unconstrained optimization problems as a special case. The results obtained in [9] and [10] provide positive evidence for the open problem. However, the original question still remains unanswered. The purpose of this paper is to study this problem further. We introduce a cautious update in the BFGS method and prove that the method with a Wolfe-type or Armijo-type line search converges globally if the function to be minimized has Lipschitz continuous gradients. Moreover, under appropriate conditions, we show that the cautious update eventually reduces to the ordinary update. In the next section, we present the BFGS method with cautious update. In Section 3, we prove global convergence and, under additional assumptions, superlinear

convergence of the algorithm. In Section 4, we report some numerical results with the algorithm.

Notation: For a real-valued function $f : \mathbb{R}^n \to \mathbb{R}$, $g(x)$ and $G(x)$ denote the gradient and Hessian matrix of $f$ at $x$, respectively. For simplicity, $g(x_k)$ and $G(x_k)$ are often denoted by $g_k$ and $G_k$, respectively. For a vector $x \in \mathbb{R}^n$, $\|x\|$ denotes its Euclidean norm.

2 Algorithm

Let $f : \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable. Consider the following unconstrained optimization problem:
$$\min\; f(x), \quad x \in \mathbb{R}^n. \qquad (2.1)$$

The ordinary BFGS method for (2.1) generates a sequence $\{x_k\}$ by the iterative scheme
$$x_{k+1} = x_k + \lambda_k p_k, \quad k = 0, 1, 2, \ldots,$$
where $p_k$ is the BFGS direction obtained by solving the linear equation
$$B_k p + g_k = 0. \qquad (2.2)$$

The matrix $B_k$ is updated by the BFGS formula
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \qquad (2.3)$$

where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$. A good property of the BFGS formula (2.3) is that $B_{k+1}$ inherits the positive definiteness of $B_k$ as long as $y_k^T s_k > 0$. The condition $y_k^T s_k > 0$ is guaranteed to hold if the stepsize $\lambda_k$ is determined by the exact line search
$$f(x_k + \lambda_k p_k) = \min_{\lambda > 0} f(x_k + \lambda p_k) \qquad (2.4)$$
or the Wolfe-type inexact line search
$$\begin{cases} f(x_k + \lambda_k p_k) \le f(x_k) + \sigma_1 \lambda_k g_k^T p_k, \\ g(x_k + \lambda_k p_k)^T p_k \ge \sigma_2 g_k^T p_k, \end{cases} \qquad (2.5)$$
where $0 < \sigma_1 < \sigma_2 < 1$ are constants. Another commonly used inexact line search is the Armijo-type line search: given constants $\sigma \in (0, 1)$ and $\rho \in (0, 1)$, let $\lambda_k = \max\{\rho^i \mid i = 0, 1, 2, \ldots\}$ satisfy
$$f(x_k + \lambda_k p_k) \le f(x_k) + \sigma \lambda_k g_k^T p_k. \qquad (2.6)$$
The Armijo-type line search (2.6), however, does not guarantee $y_k^T s_k > 0$, and hence $B_{k+1}$ is not necessarily positive definite even if $B_k$ is positive definite. In order to ensure the positive definiteness of $B_{k+1}$, the condition $y_k^T s_k > 0$ is sometimes used to decide whether $B_k$ is updated or not. More specifically, $B_{k+1}$ is determined by
$$B_{k+1} = \begin{cases} B_k - \dfrac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \dfrac{y_k y_k^T}{y_k^T s_k}, & \text{if } y_k^T s_k > 0, \\[2mm] B_k, & \text{otherwise.} \end{cases} \qquad (2.7)$$
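To make these formulas concrete, the following NumPy sketch performs one update of the form (2.3) under the skipping rule (2.7). It is an illustration only; the function name and the safeguard tolerance `eps` are our choices, not the paper's.

```python
import numpy as np

def bfgs_update_with_skip(B, s, y, eps=1e-8):
    """One BFGS update (2.3), guarded by the skipping rule (2.7).

    B: current symmetric positive definite approximation;
    s = x_{k+1} - x_k;  y = g_{k+1} - g_k.
    The update is skipped unless y^T s is safely positive (the computational
    variant of the test mentioned in the text), which preserves positive
    definiteness of B.
    """
    ys = float(y @ s)
    if ys <= eps:
        return B  # keep B_k unchanged, as in (2.7)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(y, y) / ys
```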

Computationally, the condition $y_k^T s_k > 0$ is often replaced by the condition $y_k^T s_k > \epsilon$, where $\epsilon > 0$ is a small constant. In this paper, we propose a cautious update rule similar to the above and establish a global convergence theorem for nonconvex problems. For the sake of motivation, we state a lemma due to Powell [14].

Lemma 2.1 (Powell [14]) If the BFGS method with line search (2.5) is applied to a continuously differentiable function $f$ that is bounded below, and if there exists a constant $M > 0$ such that the inequality
$$\frac{\|y_k\|^2}{y_k^T s_k} \le M \qquad (2.8)$$
holds for all $k$, then
$$\liminf_{k \to \infty} \|g(x_k)\| = 0. \qquad (2.9)$$

Notice that if $f$ is twice continuously differentiable and uniformly convex, then (2.8) always holds. Therefore, global convergence of the BFGS method follows from Lemma 2.1 immediately. However, in the case where $f$ is nonconvex, it seems difficult to guarantee (2.8). This may be one reason why global convergence of the BFGS method has not been proved. In [10], the authors proposed a modified BFGS method by using $\tilde{y}_k = C \|g_k\| s_k + (g_{k+1} - g_k)$ with a constant $C > 0$ instead of $y_k$ in the update formula (2.3). Global convergence of the modified BFGS method in [10] is proved without a convexity assumption on $f$ by means of Lemma 2.1, under the contrary assumption that $\{\|g_k\|\}$ is bounded away from zero. However, this method lacks the scale-invariance property that the original BFGS method enjoys. We now further study global convergence of the BFGS method for (2.1). Instead of modifying the method, we introduce a cautious update rule in the ordinary BFGS method. To be precise, we determine $B_{k+1}$ by
$$B_{k+1} = \begin{cases} B_k - \dfrac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \dfrac{y_k y_k^T}{y_k^T s_k}, & \text{if } \dfrac{y_k^T s_k}{\|s_k\|^2} \ge \epsilon \|g_k\|^\alpha, \\[2mm] B_k, & \text{otherwise,} \end{cases} \qquad (2.10)$$
where $\epsilon$ and $\alpha$ are positive constants. Now, we state the BFGS method with cautious update.

Algorithm 1

Step 0 Choose an initial point $x_0 \in \mathbb{R}^n$ and an initial symmetric positive definite matrix $B_0 \in \mathbb{R}^{n \times n}$. Choose constants $0 < \sigma_1 < \sigma_2 < 1$, $\alpha > 0$ and $\epsilon > 0$. Let $k := 0$.
Step 1 Solve the linear equation (2.2) to get $p_k$.
Step 2 Determine a stepsize $\lambda_k > 0$ by (2.5) or (2.6).
Step 3 Let the next iterate be $x_{k+1} := x_k + \lambda_k p_k$.
Step 4 Determine $B_{k+1}$ by (2.10).
Step 5 Let $k := k + 1$ and go to Step 1.
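As an illustration, here is a compact NumPy sketch of Algorithm 1 with the Armijo-type line search (2.6) and the cautious update (2.10). It is a minimal rendering of the steps above, not the authors' code; the default parameter values are placeholders.

```python
import numpy as np

def cbfgs(f, grad, x0, eps=0.1, alpha=3.0, sigma=0.1, rho=0.5,
          tol=1e-5, max_iter=1000):
    """Sketch of Algorithm 1 (cautious BFGS) with Armijo-type search (2.6).

    The BFGS update (2.10) is performed only when
        y^T s / ||s||^2 >= eps * ||g_k||^alpha,
    and B is kept unchanged otherwise.  Parameter names follow the paper;
    the numerical defaults are illustrative placeholders.
    """
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                 # Step 0: B_0 symmetric positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(B, -g)     # Step 1: solve B_k p = -g_k  (2.2)
        fx, gtp, lam = f(x), float(g @ p), 1.0
        while f(x + lam * p) > fx + sigma * lam * gtp:
            lam *= rho                 # Step 2: backtracking, lam = rho^i  (2.6)
        s = lam * p                    # Step 3: x_{k+1} = x_k + lam * p_k
        x_new, g_new = x + s, grad(x + s)
        y = g_new - g
        # Step 4: cautious update (2.10)
        if float(y @ s) >= eps * np.linalg.norm(g) ** alpha * float(s @ s):
            Bs = B @ s
            B = (B - np.outer(Bs, Bs) / float(s @ Bs)
                   + np.outer(y, y) / float(y @ s))
        x, g = x_new, g_new            # Step 5: k := k + 1
    return x
```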

Remark. It is not difficult to see from (2.10) that the matrix $B_k$ generated by Algorithm 1 is symmetric and positive definite for all $k$, which in turn implies that $\{f(x_k)\}$ is a decreasing sequence whichever line search (2.5) or (2.6) is used. Moreover, we have from (2.5) or (2.6)
$$-\sum_{k=0}^{\infty} g_k^T s_k < \infty, \qquad (2.11)$$
if $f$ is bounded below. In particular, we have
$$-\lim_{k \to \infty} \lambda_k g_k^T p_k = -\lim_{k \to \infty} g_k^T s_k = 0. \qquad (2.12)$$

3 Global Convergence

In this section, we prove global convergence of Algorithm 1 under the following assumption, which we assume throughout this section.

Assumption A: The level set
$$\Omega = \{ x \in \mathbb{R}^n \mid f(x) \le f(x_0) \}$$
is contained in a bounded convex set $D$. The function $f$ is continuously differentiable on $D$ and there exists a constant $L > 0$ such that
$$\|g(x) - g(y)\| \le L \|x - y\|, \quad \forall x, y \in D. \qquad (3.1)$$

Since $\{f(x_k)\}$ is a decreasing sequence, it is clear that the sequence $\{x_k\}$ generated by Algorithm 1 is contained in $\Omega$. For the sake of convenience, we define the index sets
$$K = \left\{ i \;\middle|\; \frac{y_i^T s_i}{\|s_i\|^2} \ge \epsilon \|g_i\|^\alpha \right\} \quad \text{and} \quad K_k = \{ i \in K \mid i \le k \}. \qquad (3.2)$$

Let $i_k$ be the number of indices $i \in K_k$. By means of $K$, we may rewrite (2.10) as
$$B_{k+1} = \begin{cases} B_k - \dfrac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \dfrac{y_k y_k^T}{y_k^T s_k}, & \text{if } k \in K, \\[2mm] B_k, & \text{otherwise.} \end{cases} \qquad (3.3)$$
Taking the trace on both sides of (3.3), we get for any $k$
$$\operatorname{tr}(B_{k+1}) = \operatorname{tr}(B_0) - \sum_{i \in K_k} \frac{\|B_i s_i\|^2}{s_i^T B_i s_i} + \sum_{i \in K_k} \frac{\|y_i\|^2}{y_i^T s_i}. \qquad (3.4)$$

Now we establish global convergence of Algorithm 1. We first show that Algorithm 1 converges globally if $K$ is finite.

Theorem 3.1 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1. If $K$ is finite, then we have
$$\lim_{k \to \infty} \|g_k\| = 0. \qquad (3.5)$$

Proof The assumption that $K$ is finite implies that there is an index $k_0$ such that $B_k = B_{k_0} \triangleq B$ holds for all $k \ge k_0$. By the positive definiteness of $B$, there are positive constants $m_1 \le M_1$ such that
$$m_1 \|p\|^2 \le p^T B p \le M_1 \|p\|^2, \quad m_1 \|p\|^2 \le p^T B^{-1} p \le M_1 \|p\|^2, \quad \forall p \in \mathbb{R}^n. \qquad (3.6)$$
If the Wolfe-type line search (2.5) is used, then we get from (3.1) and the second inequality of (2.5)
$$L \|s_k\|^2 \ge y_k^T s_k \ge -(1 - \sigma_2) g_k^T s_k = (1 - \sigma_2) \lambda_k^{-1} s_k^T B s_k \ge (1 - \sigma_2) \lambda_k^{-1} m_1 \|s_k\|^2, \quad \forall k \ge k_0,$$
where the last inequality follows from (3.6). This implies
$$\lambda_k \ge (1 - \sigma_2) m_1 L^{-1}, \quad \forall k \ge k_0.$$
Therefore, we get from (2.12)
$$g_k^T B^{-1} g_k = -g_k^T p_k \to 0.$$
This together with (3.6) implies (3.5).

Next, we consider the case where $\lambda_k$ is determined by the Armijo-type line search (2.6). Let $\bar{\lambda} = \limsup_{k \to \infty} \lambda_k$. If $\bar{\lambda} > 0$, then using an argument similar to the above, we get (3.5). Suppose that $\bar{\lambda} = 0$. This means $\lim_{k \to \infty} \lambda_k = 0$. Let $\bar{x}$ be an arbitrary accumulation point of $\{x_k\}$ and $\{x_k\}_{k \in K'}$ be a subsequence converging to $\bar{x}$. Since $p_k = -B^{-1} g_k$ for $k \ge k_0$, it follows that $\{p_k\}_{k \in K'} \to \bar{p} \triangleq -B^{-1} g(\bar{x})$. By the line search rule, when $k$ is sufficiently large, $\lambda'_k \triangleq \lambda_k / \rho$ does not satisfy (2.6). So, we have
$$f(x_k + \lambda'_k p_k) - f(x_k) - \sigma \lambda'_k g(x_k)^T p_k \ge 0.$$
Dividing both sides by $\lambda'_k$ and then taking the limit yield
$$(1 - \sigma) g(\bar{x})^T \bar{p} \ge 0,$$
which implies
$$-g(\bar{x})^T B^{-1} g(\bar{x}) \ge 0.$$
By the positive definiteness of $B$, we get $g(\bar{x}) = 0$. The proof is then complete. $\Box$

We proceed to showing global convergence of Algorithm 1 in the case where $K$ is infinite. We will deduce a contradiction from the assumption that there is a constant $\delta > 0$ such that
$$\|g_k\| \ge \delta \qquad (3.7)$$
holds for all $k$. Before establishing a global convergence theorem for Algorithm 1, we show some useful lemmas.

Lemma 3.1 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1. If (3.7) holds for all $k$, then there exists a constant $M_2 > 0$ such that the inequalities
$$\operatorname{tr}(B_{k+1}) \le M_2 i_k \qquad (3.8)$$
and
$$\sum_{i \in K_k} \frac{\|B_i s_i\|^2}{s_i^T B_i s_i} \le M_2 i_k \qquad (3.9)$$
hold for all $k$ sufficiently large.

Proof

It follows from (3.2) and (3.7) that
$$y_i^T s_i \ge \epsilon \delta^\alpha \|s_i\|^2 \qquad (3.10)$$
holds for all $i \in K$. This together with (3.1) implies that for any $i \in K$,
$$\frac{\|y_i\|^2}{y_i^T s_i} \le \frac{L^2}{\epsilon \delta^\alpha} \triangleq M'_2. \qquad (3.11)$$
This together with (3.4) yields inequality (3.8) with a suitable constant $M_2$. Moreover, since $\operatorname{tr}(B_{k+1}) > 0$ holds for any $k$, we get from (3.4) and (3.11)
$$\sum_{i \in K_k} \frac{\|B_i s_i\|^2}{s_i^T B_i s_i} \le \operatorname{tr}(B_0) + M'_2 i_k.$$
This yields inequality (3.9) with a suitable constant $M_2$. $\Box$

Lemma 3.2 Let Assumption A hold. If (3.7) holds for all $k$, then there exist positive constants $\beta_1$, $\beta_2$ and $\beta_3$ such that for any $k > 1$ there are at least $\lceil i_k / 2 \rceil$ indices $i \in K_k$ such that
$$\|B_i s_i\| \le \beta_1 \|s_i\|, \quad \beta_2 \|s_i\|^2 \le s_i^T B_i s_i \le \beta_3 \|s_i\|^2. \qquad (3.12)$$

Proof Notice that (3.7) implies that (3.10) and (3.11) hold for any $i \in K$. Therefore, in a way similar to the proof of Theorem 2.1 in [1], we can show the conclusion. $\Box$

Lemma 3.2 shows that when $K$ is infinite, if (3.7) holds for all $k$, then there exists an infinite index set $\tilde{K} \subseteq K$ such that
$$\|g_i\| = \|B_i p_i\| \le \beta_1 \|p_i\|, \quad \forall i \in \tilde{K} \qquad (3.13)$$
and
$$\|p_i\|^2 \le \beta_2^{-1} p_i^T B_i p_i = -\beta_2^{-1} g_i^T p_i \le \beta_2^{-1} \|g_i\| \, \|p_i\|, \quad \forall i \in \tilde{K}, \qquad (3.14)$$
and hence
$$\|p_i\| \le \beta_2^{-1} \|g_i\|, \quad \forall i \in \tilde{K}. \qquad (3.15)$$
Moreover, (3.13) and (3.14) imply
$$\|g_i\|^2 \le \beta_1^2 \|p_i\|^2 \le -\beta_1^2 \beta_2^{-1} g_i^T p_i, \quad \forall i \in \tilde{K}. \qquad (3.16)$$

We now prove global convergence of Algorithm 1 with the Armijo-type line search.

Theorem 3.2 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1 with $\lambda_k$ determined by the Armijo-type line search (2.6). Then
$$\liminf_{k \to \infty} \|g_k\| = 0. \qquad (3.17)$$

Proof By Theorem 3.1, it suffices to verify (3.17) when $K$ is infinite. Suppose that (3.17) does not hold. Then there is a constant $\delta > 0$ such that (3.7) holds for all $k$. Let the set $\tilde{K} \subseteq K$ be as specified in the paragraph preceding Theorem 3.2. Then $\tilde{K}$ contains infinitely many indices. Denote $\bar{\lambda} = \limsup_{k \in \tilde{K}, k \to \infty} \lambda_k = \lim_{k \in K', k \to \infty} \lambda_k$, where $K' \subseteq \tilde{K}$. Since $\{x_k\}_{k \in K'}$ is bounded, it follows from (3.15) that $\{p_k\}_{k \in K'}$ is also bounded. Without loss of generality, we assume that the sequences $\{x_k\}_{k \in K'}$ and $\{p_k\}_{k \in K'}$ converge to some vectors $\bar{x}$ and $\bar{p}$, respectively. It then follows from (2.12) that $\bar{\lambda} g(\bar{x})^T \bar{p} = 0$. By (3.16), it suffices to show that $\bar{\lambda} > 0$. We assume the contrary, $\bar{\lambda} = 0$. Then by the line search rule, for all $k \in K'$ sufficiently large, $\lambda'_k \triangleq \lambda_k / \rho$ does not satisfy (2.6). This means
$$f(x_k + \lambda'_k p_k) - f(x_k) \ge \sigma \lambda'_k g_k^T p_k. \qquad (3.18)$$
By the mean-value theorem, there is a $\theta_k \in (0, 1)$ such that $f(x_k + \lambda'_k p_k) - f(x_k) = \lambda'_k g(x_k + \theta_k \lambda'_k p_k)^T p_k$. Applying this to (3.18), we deduce
$$L \lambda'_k \|p_k\|^2 \ge \left( g(x_k + \theta_k \lambda'_k p_k) - g(x_k) \right)^T p_k \ge -(1 - \sigma) g_k^T p_k \ge (1 - \sigma) \beta_2 \|p_k\|^2,$$

where the first inequality follows from (3.1) and the last inequality follows from (3.14). The last inequality contradicts the assumption $\bar{\lambda} = 0$. The proof is then complete. $\Box$

We turn to showing global convergence of Algorithm 1 with the Wolfe-type line search. To this end, we show a useful lemma similar to Lemma 3.2 in [2].

Lemma 3.3 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1 with $\lambda_k$ determined by the Wolfe-type line search (2.5). If (3.7) holds for all $k$, then there exists a constant $m_3 > 0$ such that for all $k$ large enough
$$\prod_{i \in K_k} \lambda_i \ge m_3^{i_k}. \qquad (3.19)$$

Proof The formula (3.3) gives the recurrence relations (see e.g. (3.13) in [14])
$$\det B_{i+1} = \frac{y_i^T s_i}{s_i^T B_i s_i} \det B_i, \quad \forall i \in K \qquad (3.20)$$
and
$$\det B_{i+1} = \det B_i, \quad \forall i \notin K. \qquad (3.21)$$
Let $n_k$ be the largest index in $K_k$. Multiplying the relations (3.20) for $i \in K_k$ and (3.21) for $i \notin K_k$ yields
$$\det B_{n_k + 1} = \det B_0 \prod_{i \in K_k} \frac{y_i^T s_i}{s_i^T B_i s_i}. \qquad (3.22)$$

On the other hand, the second inequality of (2.5) implies that for each $i$,
$$y_i^T s_i \ge -(1 - \sigma_2) g_i^T s_i = (1 - \sigma_2) \lambda_i^{-1} s_i^T B_i s_i.$$
Then in a way similar to the proof of Lemma 3.2 in [2], we get (3.19) by applying the last inequality and (3.8) to (3.22). $\Box$

Now we prove global convergence of Algorithm 1 with the Wolfe-type line search.

Theorem 3.3 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1 with $\lambda_k$ determined by the Wolfe-type line search (2.5). Then (3.17) holds.

Proof By Theorem 3.1, it suffices to verify (3.17) when $K$ is infinite. Denote $K = \{k_1 < k_2 < \cdots\}$. Notice that (2.11) particularly implies
$$-\sum_{j=1}^{\infty} g_{k_j}^T s_{k_j} < \infty.$$
Since $B_{k_j} s_{k_j} = -\lambda_{k_j} g_{k_j}$, it follows that
$$\sum_{j=1}^{\infty} \lambda_{k_j} \|g_{k_j}\|^2 \, \frac{s_{k_j}^T B_{k_j} s_{k_j}}{\|B_{k_j} s_{k_j}\|^2} = -\sum_{j=1}^{\infty} g_{k_j}^T s_{k_j} < \infty. \qquad (3.23)$$
If (3.17) does not hold, then there exists a constant $\delta > 0$ such that (3.7) holds for all $k$. So, (3.23) implies
$$\sum_{j=1}^{\infty} \lambda_{k_j} \, \frac{s_{k_j}^T B_{k_j} s_{k_j}}{\|B_{k_j} s_{k_j}\|^2} < \infty. \qquad (3.24)$$
Therefore, for any $\eta > 0$, there exists an integer $j_0 > 0$ such that for any positive integer $q$,
$$\left( \prod_{j=j_0+1}^{j_0+q} \lambda_{k_j} \frac{s_{k_j}^T B_{k_j} s_{k_j}}{\|B_{k_j} s_{k_j}\|^2} \right)^{1/q} \le \frac{1}{q} \sum_{j=j_0+1}^{j_0+q} \lambda_{k_j} \frac{s_{k_j}^T B_{k_j} s_{k_j}}{\|B_{k_j} s_{k_j}\|^2} \le \frac{\eta}{q},$$
where the left-hand inequality follows from the arithmetic-geometric mean inequality. Thus
$$\left( \prod_{j=j_0+1}^{j_0+q} \lambda_{k_j} \right)^{1/q} \le \frac{\eta}{q} \left( \prod_{j=j_0+1}^{j_0+q} \frac{\|B_{k_j} s_{k_j}\|^2}{s_{k_j}^T B_{k_j} s_{k_j}} \right)^{1/q} \le \frac{\eta}{q^2} \sum_{j=j_0+1}^{j_0+q} \frac{\|B_{k_j} s_{k_j}\|^2}{s_{k_j}^T B_{k_j} s_{k_j}} \le \frac{\eta}{q^2} \sum_{j=1}^{j_0+q} \frac{\|B_{k_j} s_{k_j}\|^2}{s_{k_j}^T B_{k_j} s_{k_j}} \le \frac{(j_0 + q + 1) \, \eta}{q^2} \, M_2,$$

where the last inequality follows from (3.9). Letting $q \to \infty$ yields a contradiction, because Lemma 3.3 ensures that the left-hand side of the above inequality is bounded below by a positive constant. The proof is complete. $\Box$

Theorems 3.1, 3.2 and 3.3 show that there exists a subsequence of $\{x_k\}$ converging to a stationary point of (2.1). The following theorem shows that if additional conditions are assumed, then the whole sequence converges to a local optimal solution of (2.1).

Theorem 3.4 Let $f$ be twice continuously differentiable. Suppose that $s_k \to 0$. If there exists an accumulation point $x^*$ of $\{x_k\}$ at which $g(x^*) = 0$ and $G(x^*)$ is positive definite, then the whole sequence $\{x_k\}$ converges to $x^*$. If, in addition, $G$ is Hölder continuous and the parameters in the line searches satisfy $\sigma_1, \sigma \in (0, 1/2)$, then the convergence rate is superlinear.

Proof The assumptions particularly imply that $x^*$ is a strict local optimal solution of (2.1). Since $\{f(x_k)\}$ converges, it follows that $x^*$ is an isolated accumulation point of $\{x_k\}$. Then, by the assumption that $\{s_k\}$ converges to zero, the whole sequence $\{x_k\}$ converges to $x^*$. Hence $\{g_k\}$ tends to zero and, by the positive definiteness of $G(x^*)$, the matrices
$$A_k \triangleq \int_0^1 G(x_k + \tau s_k) \, d\tau$$

are uniformly positive definite for all $k$ large enough. Moreover, by the mean-value theorem, we have $y_k = A_k s_k$. Therefore, there is a constant $\bar{m} > 0$ such that $y_k^T s_k \ge \bar{m} \|s_k\|^2$, which implies that when $k$ is sufficiently large, the condition $y_k^T s_k / \|s_k\|^2 \ge \epsilon \|g_k\|^\alpha$ is always satisfied. This means that Algorithm 1 reduces to the ordinary BFGS method when $k$ is sufficiently large. The superlinear convergence of Algorithm 1 then follows from the related theory in [1, 2, 14]. $\Box$

Theorem 3.4 shows a strong convergence property of Algorithm 1. However, it seems difficult in practice to verify the condition $s_k \to 0$ for either of the two line searches used in the algorithm. In the following, we propose an extension of the Armijo-type line search to relax this condition. Let $\sigma_3 \in (0, 1)$ and $\sigma_4 > 0$ be given constants. We determine a stepsize $\lambda_k$

satisfying the inequality
$$f(x_k + \lambda_k p_k) \le f(x_k) + \sigma_3 \lambda_k g_k^T p_k - \sigma_4 \|\lambda_k p_k\|^2. \qquad (3.25)$$
The only difference between (3.25) and (2.6) lies in the term $-\sigma_4 \|\lambda_k p_k\|^2$. Since $p_k$ is a descent direction of $f$ at $x_k$ and $-\sigma_4 \|\lambda_k p_k\|^2 = o(\lambda_k)$ as $\lambda_k$ goes to zero, it is clear that (3.25) holds for all sufficiently small $\lambda_k > 0$. Therefore, we can find a $\lambda_k$ by a backtracking process similar to the Armijo-type line search. In a way similar to Theorems 3.1 and 3.2, it is also not difficult to prove global convergence of Algorithm 1 with line search (3.25). Moreover, (3.25) particularly implies $\sigma_4 \|s_k\|^2 \le f(x_k) - f(x_{k+1})$. It then follows from the descent property of $\{f(x_k)\}$ that $s_k \to 0$. Therefore, we can establish a theorem similar to Theorems 3.1 and 3.4. We state the global convergence theorem without proof.

Theorem 3.5 Let Assumption A hold and $\{x_k\}$ be generated by Algorithm 1 with $\lambda_k$ satisfying (3.25). Then (3.17) holds. If we further suppose that $f$ is twice continuously differentiable and there exists an accumulation point $x^*$ of $\{x_k\}$ at which $g(x^*) = 0$ and $G(x^*)$ is positive definite, then the whole sequence $\{x_k\}$ converges to $x^*$. If, in addition, $G$ is Hölder continuous at $x^*$ and $\sigma_3 \in (0, 1/2)$, then the convergence rate is superlinear.
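For illustration, a backtracking implementation of the modified line search (3.25) might look as follows; the parameter defaults and the cap on backtracking steps are our assumptions, not values prescribed by the paper.

```python
import numpy as np

def modified_armijo_step(f, x, g, p, sigma3=0.1, sigma4=1e-4, rho=0.5,
                         max_backtracks=60):
    """Backtracking search for a stepsize satisfying (3.25):

        f(x + lam*p) <= f(x) + sigma3*lam*g^T p - sigma4*||lam*p||^2.

    Since the extra term -sigma4*||lam*p||^2 is o(lam) as lam -> 0, the test
    is met for all sufficiently small lam, so backtracking terminates.
    """
    fx, gtp, lam = f(x), float(g @ p), 1.0
    for _ in range(max_backtracks):
        step = lam * p
        if f(x + step) <= fx + sigma3 * lam * gtp - sigma4 * float(step @ step):
            break
        lam *= rho
    return lam
```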

4 Numerical Experiments

This section reports some numerical experience with Algorithm 1. We tested the algorithm on the following three problems taken from [11].

Problem 1: Extended Powell singular function
$$f(x) = \sum_{i=1}^{n/4} \left\{ (x_{4i-3} + 10 x_{4i-2})^2 + 5 (x_{4i-1} - x_{4i})^2 + (x_{4i-2} - 2 x_{4i-1})^4 + 10 (x_{4i-3} - x_{4i})^4 \right\},$$
$x^* = (0, \ldots, 0)^T$.

Problem 2: Extended Rosenbrock function
$$f(x) = \sum_{i=1}^{n/2} \left\{ 100 (x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2 \right\},$$

$x^* = (1, \ldots, 1)^T$.

Problem 3: Extended Wood function
$$f(x) = \sum_{i=1}^{n/4} \left\{ 100 (x_{4i-2} - x_{4i-3}^2)^2 + (1 - x_{4i-3})^2 + 90 (x_{4i} - x_{4i-1}^2)^2 + (1 - x_{4i-1})^2 + 10 (x_{4i-2} + x_{4i} - 2)^2 + 0.1 (x_{4i-2} - x_{4i})^2 \right\},$$
$x^* = (1, \ldots, 1)^T$.
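For reference, the three test objectives can be written compactly in NumPy as below; this is a direct transcription of the formulas above (the gradients, which the method also needs, are omitted and can be derived by hand or by automatic differentiation).

```python
import numpy as np

def extended_powell(x):
    # Problem 1; n must be a multiple of 4.
    a, b, c, d = x[0::4], x[1::4], x[2::4], x[3::4]  # x_{4i-3}, ..., x_{4i}
    return float(np.sum((a + 10*b)**2 + 5*(c - d)**2
                        + (b - 2*c)**4 + 10*(a - d)**4))

def extended_rosenbrock(x):
    # Problem 2; n must be even.
    odd, even = x[0::2], x[1::2]                     # x_{2i-1}, x_{2i}
    return float(np.sum(100*(even - odd**2)**2 + (1 - odd)**2))

def extended_wood(x):
    # Problem 3; n must be a multiple of 4.
    a, b, c, d = x[0::4], x[1::4], x[2::4], x[3::4]
    return float(np.sum(100*(b - a**2)**2 + (1 - a)**2 + 90*(d - c**2)**2
                        + (1 - c)**2 + 10*(b + d - 2)**2 + 0.1*(b - d)**2))
```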

We applied Algorithm 1, which we call the CBFGS method (C stands for cautious), with Wolfe-type or Armijo-type line search to these problems and compared it with the ordinary BFGS method. We used the condition
$$\max\{\|g(x_k)\|, \|x_k - x^*\|\} \le 10^{-5}$$
as the stopping criterion. For each problem, we chose different initial points but the same initial matrix $B_0 = I$, i.e., the unit matrix. For each problem, the parameters common to the two methods were set identically. Specifically, we chose the parameters as follows. We set $\sigma_1 = 0.1$ and $\sigma_2 = 0.49$ in the Wolfe-type line search (2.5) and $\sigma = 0.1$ in the Armijo-type line search (2.6). As to the parameters $\alpha$ and $\epsilon$ in the cautious update (2.10), we first let
$$\alpha = \begin{cases} 0.01, & \text{if } \|g_k\| \ge 1, \\ 3, & \text{if } \|g_k\| < 1 \end{cases}$$
and $\epsilon = 0.1$. This choice is intended to make the cautious update closer to the original BFGS method. It is not difficult to see that the convergence theorems in Section 3 remain true if we choose $\alpha$ according to this rule. Indeed, more generally, even if $\alpha$ varies in an interval $[\alpha_1, \alpha_2]$ with $\alpha_1 > 0$, all the theorems in Section 3 hold true.

The results are shown in Tables 1-4, where P1, P2 and P3 stand for Problems 1, 2 and 3, respectively, and $k$ stands for the number of iterations. For the CBFGS method, `o' denotes the number of times the condition in the cautious update was not met, that is, the number of $k$'s such that $y_k^T s_k / \|s_k\|^2 < \epsilon \|g_k\|^\alpha$. For the BFGS method, `o' denotes the number of $k$'s such that $y_k^T s_k < 10^{-17}$. Note that, for the BFGS method with Wolfe-type line search, `o' is normally zero since $y_k^T s_k$ is always positive because of the second inequality in (2.5). In the tables, `Init' stands for the initial point. The tested initial points are $x_0 = (0, 0, \ldots, 0)^T$, $x_1 = (1, 1, \ldots, 1)^T$, $x_2 = (10, 10, \ldots, 10)^T$, $x_3 = (100, 100, \ldots, 100)^T$, $x_4 = -x_2$, $x_5 = -x_3$, $x_6 = (0, 100, 0, 100, \ldots)^T$ and $x_7 = -x_6$.

Next, to check the influence of $\alpha$ and $\epsilon$ on the CBFGS method, we solved Problem 1 with various values of $\alpha$ and $\epsilon$ starting from the same initial point $x_2 = (10, 10, \ldots, 10)^T$. The results are shown in Table 5.
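To indicate how the preceding sketches fit together, here is a hypothetical driver that runs the cautious BFGS sketch on the extended Rosenbrock problem from the initial point $x_2 = (10, \ldots, 10)^T$ with the stopping tolerance $10^{-5}$. The finite-difference gradient is a stand-in for the analytic one, and none of this reproduces the paper's actual experimental code.

```python
import numpy as np

def fd_grad(f, x, h=1e-6):
    """Central finite-difference gradient (an illustrative stand-in)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

x_init = 10.0 * np.ones(20)                  # the point x_2 above, with n = 20
sol = cbfgs(extended_rosenbrock,
            lambda x: fd_grad(extended_rosenbrock, x),
            x_init, tol=1e-5)
print(np.linalg.norm(sol - np.ones(20)))     # distance to x* = (1, ..., 1)^T
```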