Metropolis-Type Annealing Algorithms for Global Optimization in Rd

Report 1 Downloads 103 Views
(C) 1993 Society for Industrial and Applied Mathematics

SIAM J. CONTROL AND OPTIMIZATION Vol. 31, No. 1, pp. 111-131, January 1993

009

METROPOLIS-TYPE ANNEALING ALGORITHMS FOR GLOBAL OPTIMIZATION IN R a* SAUL B. GELFAND

AND

SANJOY K.

MITTER

Abstract. The convergence of a class of Metropolis-type Markov-chain annealing algorithms for global optimization of a smooth function U(. on Ea is established. No prior information is assumed as to what bounded region contains a global minimum. The analysis contained herein is based on writing the Metropolistype algorithm in the form of a recursive stochastic algorithm Xk+l Xk-ak(V U(Xk)+ k)+ bkWk, where Wk} is a standard white Gaussian sequence, {sck} are random variables, and ak A/k, b v/-/x/k log log k for k large. Convergence results for {Xk} are then applied from our previous work SIAM Journal on Control and Optimization, 29 (1991), pp. 999-1018]. Since the analysis of {Xk} is based on the asymptotic behavior of the related Langevin-type Markov diffusion annealing algorithm dY(t) -V U(Y(t)) dt+ c(t) dW(t), where W(. is a standard Wiener process and c(t)= x/-/v/g for large, this work demonstrates and exploits the close relationship between the Markov chain and diffusion versions of simulated annealing.

Key words, global optimization, random optimization, simulated annealing, stochastic gradient algorithms, Markov chains AMS(MOS) subject classifications. 65K10, 90C30, 60J60

1. Introduction. Let U(.) be a real-valued function on some set E. The global optimization problem is to find an element of the set S*= {x: U(x) Co (a constant depending only on 3 in probability. U(. )), then Y(t) S* as A discrete-time annealing algorithm for continuous optimization was suggested in [8], [18] and is based on simulating a recursive stochastic algorithm

- -

(1.3)

Xk+

Xk--ak(V U(Xk)+ k)+ bkWk.

Here U(. is again a smooth function on d, {k} is a sequence of d-valued random variables, { Wk} is a sequence of independent standard d-dimensional Gaussian random variables, and {ak}, {bk} are sequences of positive numbers with ak, bk 0 as k-. Algorithm (1.3) could arise from a discretization or numerical integration of the diffusion (1.1) so as to be suitable for implementation on a digital computer; in this case, k is due to the discretization error. Alternatively, algorithm (1.3) could arise by artificially adding decreasing white Gaussian noise (i.e., the bk Wk terms) to a stochastic gradient algorithm

Zk+I Zk ak(V U(Zk) + ),

(1.4)

which arises in a variety of optimization problems including adaptive filtering, identification and control; in this case, SCk is due to noisy or imprecise measurements of V U(.) (cf. [19]). We again use (1.3) instead of (1.4) for minimizing U(.) to avoid getting trapped in strictly local minima. In the following, we refer to (1.4) and (1.3) as standard and modified stochastic gradient algorithms, respectively. The asymptotic behavior of Xk as k- has been studied in [8], 18]. In 18] convergence results were obtained for a version of (1.3), which was modified to constrain the trajectories to lie in a compact set (and hence is only applicable to global optimization over a compact subset of d); in [8] results were obtained for global optimization over all of d. Also, in [18] convergence is obtained essentially only for the case where SCk =0; in [8] convergence is obtained for {SCk} with unbounded variance. This latter fact has important implications when V U(.) is not measured exactly. Our main result from [8] can be roughly stated as follows: If U(.) and {:k} are suitably behaved, ak =A/k and b2k B/k log log k for k large with B/A> Co (the same Co as above), and {Xk} is in probability (conditions are also given in 8] for tightness tight, then Xk 5;* as k of {Xk}). Our analysis in I-8] of the asymptotic behavior of Xk as k- is based on the behavior of the associated stochastic differential equation (SDE) (1.1). This is analogous to the well-known method of analyzing the asymptotic behavior of Zk as k-o based on the behavior of the associated ordinary differential equation (ODE)

-

(1.2) 19], [20]. It has also been suggested that continuous optimization might be performed by simulating a continuous-state Metropolis-type Markov chain [10]. This method has been applied to the restoration of noise corrupted images [16], [23]. In these works, Gaussian random field models are used so that the state space is unbounded. Although some numerical work has been performed with continuous-state Metropolis-type annealing algorithms, there has been very little theoretical analysis, and, furthermore,

METROPOLIS-TYPE ANNEALING ALGORITHMS

113

the analysis of the continuous-state case does not follow from the finite-state case in a straightforward way (especially for an unbounded state space). The only analysis of which we know is in [16], where a certain asymptotic stability property is established for a related algorithm and a particular cost function that arises in a problem of image restoration. In this paper, we demonstrate the convergence of a class of continuous-state Metropolis-type Markov-chain annealing algorithms for general cost functions. Our approach is to write such an algorithm in the form of a modified stochastic gradient algorithm (1.3) for suitable choice of :k, and to apply results from [8]. A convergence result is obtained for global optimization over all of d. Some care is necessary to formulate a Metropolis-type Markov chain with appropriate scaling. It turns out that writing the Metropolis-type annealing algorithm in the form (1.3) is more complicated than writing standard variations of gradient algorithms, which use some type of finite-difference estimate of V U(. ), in the form (1.4) (cf. [19]). Indeed, to the extent that the Metropolis-type annealing algorithm uses an estimate of V U(. ), it does so in a much more subtle manner than a finite-difference approximation, as is seen in the analysis. Since our convergence results for the Metropolis-type Markov-chain annealing algorithm are ultimately based on the asymptotic behavior of the Langevin-type Markov diffusion annealing algorithm, this paper demonstrates and exploits the close relationship between the Markov chain and diffusion versions of simulated annealing, which is particularly interesting in view of the fact that the development and analysis of these methods has proceeded more or less independently. We note that similar convergence results for other annealing algorithms based on the continuous-state Markov-chain sampling method (such as the "heat bath" method) can be obtained by a procedure similar to that used in this paper. It is important to note that, although we establish the convergence of the Metropolis-type Markov-chain annealing algorithm by effectively comparing it with the Langevin-type Markov diffusion annealing algorithm, the finite-time behavior of the algorithms may be quite different. Some indication of this arises in the analysis; see Remarks 1 and 2 in 4. The paper is organized as follows. In 2 we discuss appropriately modified versions of tightness and convergence results for modified stochastic gradient algorithms, as given in [8]. In 3 we present a class of continuous-state Metropolis-type annealing algorithms and state some convergence theorems. In 4 we prove the convergence theorems of 3, using the results of 2.

2. Modified stochastic gradient algorithms. In this section, we give convergence and tightness results for modified stochastic gradient algorithms of the type described in 1. The algorithms and theorems discussed below are a slight variation on the results of [8] and are appropriate for proving convergence and tightness for a class of 3 and 4). continuous state Metropolis-type annealing algorithms (see We use the following notation throughout the paper. Let V U(.), A U(.), and HU(. denote the gradient, Laplacian, and Hessian matrix of U(. ), respectively. Let [, (’, "), and (R) denote Euclidean norm, inner product, and outer product, respectively. For real numbers a and b, let a v b maximum {a, b}, a ^ b minimum {a, b}, a ]+= a v 0, and [a]-= a O. For a process {Xk} and a function f(. ), let En,,{f(Xk)} denote ^ conditional expectation, given X, =x, and let E,.,.,2,,,2{f(X)} denote conditional expectation, given Xl and X,2 x_ (more precisely, these are suitably fixed versions of the conditional expectation). Also, for a measure /x(.) and a function f(.), let

I"

X

114

S. B. GELFAND AND S. K. MITTER

I(f)=fdl. Finally, let N(m,R)(.) denote normal measure with mean m and covariance matrix R, and let I denote the identity matrix. 2.1. Convergence. In this section, we consider the convergence of the discrete-time

algorithm

(2.1)

Xk+l Xk-- ak(V U(Xk)+ k)+ bkWk.

Here U(.) is a smooth real-valued function on R d, {k} .is a sequence of Rd-valued random variables, { Wk} is a sequence of independent standard d-dimensional Gaussian random variables, and

A

ak

=--,

k large,

bk x/k log log k’

where A, B are positive constants. For k=0, 1,..., let k tr(Xo, Wo,’’’, Wk-1, o,’’’, :k-1). In the following, we consider the following conditions (a,/3 are constants whose values are specified

later). Condition 1. U(.) is a C 2 function from

lim

-

Ixl-

to [0, ) such that

(VU(x) IVU(x)[’ ]_])

1,

inf (IV U(x)[ 2- A U(x)) > -oo.

Condition 2. For e > 0, let

d’(x)

d

exp

-

dx,

Z

=

exp

-i

dx ko,

(2.2a) (2.2b)

E{]kJE]k} 0 such

VXk K, with probability one (w.p.1), VXk K, w.p.1.

We note that r concentrates on S*, the r and a simple characterization in terms of

global minima of U(. ). The existence of

HU(. is discussed in [15]. In [4] and [8], it was shown that there exists a constant Co, which plays a critical role in the convergence of (1.1) and (1.3), respectively (in [4] Co was denoted by Co). Co has an interpretation in terms of the action functional for the perturbed dynamical systems

(2.3) dY(t)=-VU(Y(t)) dt+e dW(t). Now, for 4(" an absolutely continuous function on Rd, the (normalized) action functional for (2.3) is given by I( t, x, y)

inf th(t)=y

1

( [4,(s)+V

METROPOLIS-TYPE ANNEALING ALGORITHMS

115

According to [4],

Co=23- sup (V(x,y)-2U(y)), x,y

So

where V(x, y) limt_o I(t, x, y), and So is the set of all the stationary points of U(. ), i.e., So={X: V U(x)=0}; see [4] for a further discussion of Co, including some examples. Let K1 c d, and let {X} denote the solution of (2.1) with Xo x. We say that {X" k >= 0, x K1} is tight if, given e > 0, there exists a compact K2 c d such that Po.x{Xk K2} > 1 e for all k -> 0 and x K1. Below is our theorem on the convergence of X as k-oo. THEOREM 1. Assume that Conditions 1-3 hold with a >-1 and fl >0. Let {Xk} be given by (2.1), and assume that {X: k >= O, x K} is tight for K a compact set. Then, for B/A > Co and any bounded continuous function f(. on d,

(2.4)

lim

Eo,,(f(Xk))= 7r(f)

uniformly for x in a compact set. Note that since 7r concentrates on S*, under the conditions of Theorem 1, we have that Xk- S* as k c in probability. Theorem 1 is the same as [8, Thm. 2], except there we assumed that (2.2) was valid for all k_>-0. However, examination of the proof of [8, Thm. 2] shows that we actually established that

(2.5)

lim k->eo

Eox.ko o{f(Xk)} 7r(f)

uniformly for Xo in a compact set and all x, only assuming that (2.2) is valid for all k-> ko. It is easy to show that (2.4) follows from (2.5) and the assumption that {X" k => 0, x K} is tight. 2.2. Tightness. In this section, we consider the tightness of the discrete-time algorithm

(2.6)

,

Here {Ok(" )} are

Xk+l Xk ak(bk(Xk) + rlk) + bktrk(Xk) Wk. Borel functions from d to R d, {O-k(" )} are Borel

functions from

Rd

{r/k} is a sequence of d-valued random variables, and { Wk}, {ak}, {bk} are as in 2.1. Below, we give sufficient conditions for the tightness of {X" k _-> 0, x K}, where K is a compact subset of d. Note that algorithm (2.6) is somewhat more general than algorithm (2.1). We consider this more general algorithm because it is sometimes convenient to write an algorithm in the form (2.6) (with bk(X) # V U(x) for some x, k) to verify tightness, and then to write the algorithm in the form (2.1) to verify conver-

to

gence. We give an example of this situation when we consider continuous-state Metropolis-type annealing algorithms in 3 and 4. Let (k-" o’(Xo, Wo,’’’, Wk-1, o,’’’, Tlk-1). We consider the following conditions (a,/3, 3’1, /2 are constants whose values are specified later). Condition 4. Let K be a compact subset of d. Then sup

Ig,, (x)l
0.

K be a compact subset of a. Then sup

Itrk(x)[ 0, and 0 0. Then it is i.e., q(x, y)= q(y, x), easy to show that the resulting stationary Markov chain has a Gibbs invariant measure with density (with respect to z) exp (-U(x)/T). Furthermore, if this measure is and the chain is/x-recurrent, then the chain, in fact, has a unique Gibbs invariant finite probability measure, and the transition probability measure converges to the Gibbs for all initial states [22, Thm. 7.1, p. 30]. measure (in the total variation norm) as k It is known that if a chain is /x-irreducible 2 and satisfies a certain condition due to Doeblin [6, Hyp. (D), p. 192], then it is tz-recurrent. In [7, Chap. 3], we use this theory to give some sufficient conditions for the ergodicity of general state Metropolis-type Markov chains when Z is a compact metric space and /x is a finite Borel measure. However, there has been almost no work on the convergence and asymptotic behavior is a compact of the nonstationary annealing chain when T 0, although, when metric space, we would expect the behavior to be similar to when Z is finite. We next specialize the general state Metropolis-type annealing algorithm (3.3), (3.4) to a d-dimensional Euclidean state space. This is the most important case and the one that has seen application [16], [23]. Actually, the Metropolis-type annealing chain that we consider is not exactly a specialization of the general state chain described above. Motivated by our desire to show convergence of the chain by writing it in the

a

If, for every x E and A A such that/x(A) > 0, Po,x {.-Jk=l {Xk A} 1, then {X} is/x-recurrent. If, for every xE and AA such that/x(A) > 0, Po,x (._J=l {X A}>0, then {X} is/x-irreducible.

118

-

s.B. GELFAND AND S. K. MITTER

form of the modified stochastic gradient algorithm (2.1), we are led to choosing a nonstationary Gaussian transition density

(3.5)

qk(x,y)=

(2,rrbkrak(x))d/2

exp

and a state-dependent temperature sequence

(3.6)

Tk(x)

bZktrZk(X) 2ak

(

cnst

trk(X)

)

rzk(x))

log log k

where

(3.7)

(1

ly-xl a bktrk(X)

(klXl) v 1,

60.

To understand these choices, suppose that x lies in some fixed compact set. Then, for k large enough,

(3.8)

qk(x,y)

1

(2rb)d/a

exp

and

(3.9)

Tk(X)

Tk

(

ly_-_xl: b ]

1

2a

The choice of the transition density (3.8) is clear, given that we want to write the chain in the form (2.1). The choice of the temperature schedule (3.9) is also clear if we view (2.1) as a sampled version of the associated diffusion (1.1) with sampling intervals ak kand sampling times k n=o an, since then we should have the corresponding sampled temperatures T(tk) ca(tk)/2. Indeed, it is straightforward to check that, if C B/A, then

Tk

bk c2( tk)

T( tk) as k->

2 2ak (recall that ak- A/k, bk B/k log log k, and c2(t)- C/log for large k, t). Finally, the reason that the Ixl dependence is needed in trk(x), and hence both (3.5), (3.6), is

that to establish tightness of the annealing chain by writing the chain in the form of (2.6), we need a condition similar to the following:,

I (x)l-->const Ixl,

Ixl large,

k fixed,

for suitable choice of k(" ). In other words, the annealing chain must generate a drift (toward the origin) at least proportional to the distance from the origin. This discussion leads us to the following continuous-state Metropolis-type Markov-chain annealing algorithm and convergence result. To establish convergence, we must assume, along with Conditions 1 and 2, the following condition. Condition 7. It holds that inf

li--

sup

>o Ixl- ly-xl 2.

Ix

U(x)

U(x)---constlx[ p

and HU(x)=

119

METROPOLIS-TYPE ANNEALING ALGORITHMS

Metropolis-Type Annealing Algorithm 1. Let {Xk} be a Markov chain with one-step transition probability at time k given by

(3 10)

e{x+

AIx

x}

f s(x, y) dN(x, bkr(x)i(y)+2

2

rk(X)la(X)

where

trk(X)

(3.11)

sk(x, y)=exp

(3.12)

(

(alx [) v 1, 2ak U(y)- U(x)]

o-(x)

b2

+)

and 3’ > 0 (rk(x) gives the correct normalization). THEOREM 3. Assume that Conditions 1, 2, and 7 hold, and also that

(3.13)

lim Ixl-,oo

.

IV U(x)l < [x----

Let {Xk} be the Markov chain with transition probability given by (3.10)-(3.12) and with < en, for B/A > Co and any bounded continuous function f(. on Nd,

0
0 (rk(x) gives the correct normalization). Note that if K is any fixed compact, Xk=x K, and k is very large, then (3.17) and (3.12) coincide. Note also that (3.17), like (3.12), only uses measurements of U(. (and not V U(. )). THEOREM 4. Assume that Conditions 1, 2, and 7 hold, and also that

(3.18)

.

Ix] Co and any bounded continuous function f(. on R

0 < 3"
2.

120

S. B. GELFAND AND S. K. MITTER

.

4. Proofs of Theorems 3 and 4. In the following, cl, c2,’’" denotes positive constants whose value may change from proof to proof. We need the following lemma. LEMMA 1. Assume that V(.) is a C 2 function from R a to Let

s(x, y) exp (-hi V(y) V(x)] +) and

(x, y) exp (-A[(V V(x), y x)]+), where A > O. Then

Is(x, y)- (x, Y)I - 1, 0 < fl < 2 % and 0 that 0 we Hence assume that Theorems < < )1 and 2 apply, and (recall T T2 T Theorem 3 follows. It remains to prove Lemmas 2 and 3. We use the following claim. CLAIM. Let u d with ]u 1. en (a) io.,w>, dS(0, )(w)= O(); w aS(0, )(w)= O(); (b) (c) w w dN(0, )(w)= O(). Proo Let Ul= u, and extend Ul to an ohonormal basis {u,..., Ud} for d. Then, by changing variables (rotation) and using the mean value theorem, we obtain every k

-

.

o,.w>, o,.w>,

that

(a) (b) (c)

Proof of Lemma 2(a).

Since K is compact and

a- 0, we can choose ko such that

a[Xk[ ko. Hence, using (4.7) and the fact that P{k Ik} P{k IXk} and Wk is independent of Xk, we have, for k >- ko " " and

Xk K (w.p.1), that

b E{(1 Srk) Wk[Xk} -V U(Xk)+-ak

(4.9)

=-V U(X)- bk E{ WkE{k [Xk, ak

-V U(Xk)-- bk E{ WP{k ak

-V U(Xk)-- b_._k E{ WkP{k

Wk}lXk}

11X, W}IXk}

l[Xk, Wk}}.

122

s.B. GELFAND AND S. K. MITTER

Henceforth, we assume that k -> ko and condition on Xk

X

K. Then, using (4.6), we

obtain that

I

E{#lX= x}= -V U(x)- b ws(x, x + bw) dN(O, I)(w).

(4.10)

ai

Let

2ak [(V U(x), y x)] +

k(X, y) exp

(4.11) and k(X, y)

b--k

)

Sk(X, y) S"k(X, y). Using the fact that HU(.

O(1) on a compact, we

obtain that, for any fixed 6 > 0,

sup

]HU(x + e(y x))

_-< Cl,

e(O,1)

for all

(4.12)

lY-xl < & Hence, using Lemma 1, ]k(X, Y)I 0, that

I

lwl’l (x, x + b w)l dN(O, I)(w) [w[’lgg(x x + b,w)[ dN(O, I)(w)

/b

Nc3a+c3exp(- c) =O(a). Now, expanding (4.10) and using (4.14) gives E{ X x}= -V U(x)

--

b

(x, x + b) dN(O, I)()

b [ (x, x + b) N(O, )()

(4.5

U(x- b

[ f(x, x + b g(o, (

+O(b)

(4.

U(xl

+O(b).

b

[

a(0, ( w exp

(V U(x), ) dN(O, I)(w)

123

METROPOLIS-TYPE ANNEALING ALGORITHMS

Clearly,

E{sCk IXk x} O(bk)

(4.17)

for x such that 7U(x)=0. Henceforth, we assume that VU(x)SO. Let V t)(x)= 7 U(x)/[7 U(x)]. Completing the square in the second integral in (4.16), we obtain that

{ (4.18)

x

x}

-v U(x) b [

wdN(O,I)(w)

ak d(VO(x),w)>=o

dN

+O(bk). Now V U(x)= O(1), and so 2

(4.19)

((ak) [VU(x) 12)

exp 2

=1+0

((ak]2] \-]

].

-

V U(x), I (w)

Substituting (4.19) into (4.18), using VU(x)=O(1) and ak/bk=O(1), and changing variables from W+2(ak/bk)VU(x) to W, gives

l{"lx’ x}= -V U(x)- b" ak

I

wdN(O,I)(w) v O(x),w)=O(ak/bk)

I - _O(, k/bk)

( ak ) +

fo

dN(O,I)(w)

O bk

wdN(O,I)(w) _- ko and Xg

-

{w@ w{x, wllx}+e(X)

I-

{WN wP{

E{((1-)W)@((1-g)W)IXg}+e(X)

lx, w}}+ e(X),

where

( b- IV U(Xk)[ + IV U(Xk)[2)

o

eg(Xg)

Henceforth, we assume that k >= ko and condition on Xk

X

K. Then, using (4.6), we

obtain that

x, (4.23)

I-

W(WSk(X,X+bkW dN(O,I)(w)

\ak/

Let gg(x, y) be given by (4.11) and gk(X,y)= Sk(X,y)--gk(X,y). Then, expanding (4.23) and using (4.14), gives w(R) WCk(X, X + bkW) dN(O, I)(w)

I

--( b-k]2 \ak/

(4.24)

f

w(R)wk(X,X+bkw) dN(O, I)(w)

W@WS"k(X,X+bkW) dN(O,I)(w)

I\ak/

(4.25)

125

METROPOLIS-TYPE ANNEALING ALGORITHMS

Clearly,

(4.26)

{:(R) Ix =x}=

o(b)

for x such that V U(x)=0. Henceforth, we assume that VU(x)S0. Let V(x)= V U(x)/IV U(x)l. Completing the square in the second integral in (4.25), we obtain that

w(R)wdN(O,I)(w) O(x),w)