Convergence Guarantees for Generalized Adaptive Stochastic Search Methods for Continuous Global Optimization

Rommel G. Regis
Mathematics Department, Saint Joseph’s University, Philadelphia, PA 19131, USA
[email protected]

June 23, 2010

Abstract. This paper presents some simple technical conditions that guarantee the convergence of a general class of adaptive stochastic global optimization algorithms. By imposing some conditions on the probability distributions that generate the iterates, these stochastic algorithms can be shown to converge to the global optimum in a probabilistic sense. The results also apply to global optimization algorithms that combine local and global stochastic search strategies, and to algorithms that combine deterministic and stochastic search strategies, which makes them applicable to a wide range of global optimization algorithms that are useful in practice. Moreover, this paper provides convergence conditions involving the conditional densities of the random vector iterates that are easy to verify in practice. It also provides convergence conditions for the special case when the iterates are generated by elliptical distributions such as the multivariate Normal and Cauchy distributions. These results are then used to prove the convergence of some practical stochastic global optimization algorithms, including an evolutionary programming algorithm. In addition, this paper introduces the notion of a stochastic algorithm being probabilistically dense in the domain of the function and shows that, under simple assumptions, this is equivalent to seeing any point in the domain with probability 1, which in turn is equivalent to almost sure convergence to the global minimum. Finally, some simple results on convergence rates are also proved.

Key words: global optimization; stochastic search; random search; global search; local search; convergence; elliptical distribution; evolutionary algorithm; evolutionary programming

1 Introduction

Let f be a deterministic function defined on a set D ⊆ R^d. A point x^* ∈ D such that f(x^*) ≤ f(x) for all x ∈ D is said to be a global minimum point (or global minimizer) of f on D. If f is continuous and D is compact, then a global minimum point of f on D is guaranteed to exist. However, a global minimizer may exist even if D is not compact or f is discontinuous on certain regions of D. We shall prove simple conditions that guarantee the convergence of a general class of adaptive stochastic algorithms for finding the global minimum of f on D. In particular, we prove a theorem that extends the result given on page 40 of Spall (2003) and derive consequences of this theorem for stochastic algorithms that are used in practice.

We mention that there is a substantial body of literature on stochastic methods for local optimization of noisy loss functions (e.g., see Chin (1997) or Spall (2003)). These methods are called stochastic approximation algorithms, and they typically use approximations of the gradient of the loss function. Examples include the standard finite-difference stochastic approximation (FDSA) algorithm (Kiefer and Wolfowitz 1952) and the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm (Spall 1992). However, the focus of this paper is on stochastic search methods for global optimization of a deterministic function. Many results have been provided on the convergence of stochastic search algorithms for global optimization (e.g., Solis and Wets 1981, Pintér 1996, Stephens and Baritompa 1998, Maryak and Chin 2001, Zabinsky 2003). However, many of these convergence conditions are cumbersome to verify for algorithms that are used in practice, and some of these convergence results apply only to a specific type of stochastic search algorithm. For example, Maryak and Chin (2001) showed that, under certain conditions, the SPSA algorithm converges in probability to the global optimum. In addition, many of these convergence conditions are usually verified only for the uniform distribution or its variants, and seldom for other distributions used in practical stochastic global optimization algorithms such as the Normal, Cauchy, or Triangular distributions. For example, the Pure Adaptive Search (PAS) algorithm of Zabinsky and Smith (1992) uses the uniform distribution on the improving level set in each iteration, whereas many heuristic optimization algorithms, such as evolution strategies and evolutionary programming algorithms, typically use the multivariate Normal distribution. Section 3 provides a simple and general framework that captures a wide range of stochastic global optimization algorithms used in practice. The main goal of this paper is to provide a set of convergence conditions for this general framework that are easy to verify and that apply to commonly used probability distributions.

2 Notations

Because the algorithms are stochastic, we treat the iterates as d-dimensional random vectors whose realizations are in D ⊆ R^d. Consider a stochastic algorithm whose iterates are given by the sequence of random vectors {Y_n}_{n≥1} defined on a probability space (Ω, B, P), where the random vector Y_n : (Ω, B) → (D, B(D)) represents the n-th function evaluation point. Here, Ω is the sample space, B is a σ-field of subsets of Ω, and B(D) are the Borel sets in D. For maximum generality, we also focus on adaptive algorithms. Here, adaptive means that Y_n possibly depends on the random vectors Y_1, ..., Y_{n−1} for all n > 1. Our use of the term adaptive is more general than the one used by Zabinsky (2003) in the Pure Adaptive Search (PAS) algorithm. In fact, in PAS, each Y_n has the uniform distribution on the improving level set {x ∈ D : f(x) < f(Y_i), i = 1, ..., n − 1}.

In practical global optimization algorithms, it is not uncommon to combine both deterministic and stochastic strategies for the selection of function evaluation points. For example, a practitioner might want to start with a set of predetermined function evaluation points that he or she believes would be good starting guesses for the location of the global minimum. If D is a closed hypercube, another possibility is to begin with an optimal space-filling experimental design. Moreover, one might also have a sequence of deterministically selected points in between sequences of randomly selected points. For example, after some stochastic search, one might refine the current best solution by running a deterministic gradient-based local minimization solver from it. To capture global optimization algorithms used in practice, we also allow for the possibility that some of the Y_n's are degenerate random vectors. Here, a degenerate random vector in R^d is one whose mass is concentrated at a single point in R^d. When a particular Y_n is degenerate, this essentially means that Y_n is deterministically selected. Note that in this case it is still possible that Y_n depends on the realizations of the previous random vectors in the sequence {Y_1, ..., Y_{n−1}}.

In practice, it can happen that a randomly generated point falls outside the domain D. For example, if Y is a multivariate Normal random vector with a positive definite covariance matrix, then its realization can theoretically be anywhere in R^d even if its mean vector is restricted to D. In this case, the runaway random point is typically replaced by a suitable point in D. More precisely, let Y : (Ω, B) → (R^d, B(R^d)) be a random vector and let D ⊆ R^d. It is useful to have a deterministic function ρ_D : R^d → D with the property that ρ_D(x) = x for all x ∈ D. We refer to such a function as an absorbing transformation for D since it “absorbs” any point of R^d into D. For example, if D is compact, ρ_D : R^d → D may be defined such that ρ_D(x) is a point in D with ‖x − ρ_D(x)‖ = inf_{y∈D} ‖x − y‖; that is, ρ_D(x) is a point of D that is as close as possible to x ∈ R^d. When D = [a, b] ⊆ R^d is a hypercube, this gives ρ_D(x) = min(max(a, x), b), where the max and min are taken componentwise. Another choice of ρ_D reflects a runaway point across the violated boundary of D.
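To make the absorbing transformation concrete, here is a minimal Python sketch for a hypercube domain D = [a, b] (the function names are ours; the paper does not prescribe an implementation). It implements the componentwise clipping map ρ_D(x) = min(max(a, x), b) and, as an alternative, a single reflection across a violated boundary:

```python
import numpy as np

def rho_clip(x, a, b):
    """Absorbing transformation for the hypercube D = [a, b]:
    clip each coordinate of x into [a_j, b_j]; points already in D
    are left unchanged, so rho_clip(x) = x for all x in D."""
    a, b, x = map(np.asarray, (a, b, x))
    return np.minimum(np.maximum(a, x), b)

def rho_reflect(x, a, b):
    """Alternative absorbing transformation: reflect a runaway
    coordinate once across the violated boundary; a final clip
    guards against reflections that overshoot the opposite side."""
    a, b, x = map(np.asarray, (a, b, x))
    y = np.where(x < a, 2 * a - x, x)
    y = np.where(y > b, 2 * b - y, y)
    return np.minimum(np.maximum(a, y), b)
```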

3 Main Convergence Results

Our goal is to provide simple conditions that guarantee the convergence of stochastic algorithms that fit the following framework, which generalizes the well-known Simple (or “Blind”) Random Search given on page 38 of Spall (2003):


Algorithm A: Generalized Adaptive Random Search (GARS) Framework

Inputs:
(1) The objective function f : D → R, where D ⊆ R^d.
(2) A deterministic absorbing transformation ρ_D : R^d → D, i.e., ρ_D(x) = x for all x ∈ D.
(3) A collection of intermediate random elements {Λ_{i,j} : (Ω, B) → (Ω_{i,j}, B_{i,j}) : i ≥ 1 and j = 1, 2, ..., k_i} that are used to determine the trial random vector iterates. These Λ_{i,j}'s could be random variables, random vectors, or other types of random elements defined on the same probability space (Ω, B, P).

For convenience, we define the following notation. Define E_0 = ∅, and for each n ≥ 1, define the collection of random elements

E_n := {Λ_{i,j} : i = 1, 2, ..., n; j = 1, 2, ..., k_i} = E_{n−1} ∪ {Λ_{n,1}, Λ_{n,2}, ..., Λ_{n,k_n}}.

Step 0. Set n = 1.

Step 1. Generate a realization of the random vector Y_n : (Ω, B) → (R^d, B(R^d)) as follows:
Step 1.1. For each j = 1, ..., k_n, generate a realization of the intermediate random element Λ_{n,j} : (Ω, B) → (Ω_{n,j}, B_{n,j}) according to some probability distribution.
Step 1.2. Set Y_n = Θ_n(E_n) for some deterministic function Θ_n. That is, Y_n is a deterministic function of the random elements in E_n. (Hence, Y_1, Y_2, ..., Y_n are not necessarily independent.)

Step 2. Set X_n = ρ_D(Y_n) and evaluate f(X_n).

Step 3. Increment n = n + 1 and go back to Step 1.
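A minimal sketch of the GARS loop may help fix ideas. Here the intermediate random elements Λ_{n,1}, ..., Λ_{n,k_n} of iteration n are drawn by a single `sampler` call and `theta` plays the role of Θ_n; all names are ours, and this is only one way to realize the framework:

```python
import numpy as np

def gars(f, rho_D, sampler, theta, n_iters, rng=None):
    """Skeleton of the GARS framework (Algorithm A).
    sampler(n, history, rng): draws the intermediate random elements
        Lambda_{n,1}, ..., Lambda_{n,k_n} for iteration n (Step 1.1);
    theta(n, history): the deterministic map Theta_n applied to all
        elements generated so far, i.e. E_n (Step 1.2)."""
    rng = rng or np.random.default_rng()
    history = []                # realizations of E_n, grown one iteration at a time
    x_best, f_best = None, np.inf
    for n in range(1, n_iters + 1):
        history.append(sampler(n, history, rng))   # Step 1.1
        y_n = theta(n, history)                    # Step 1.2: Y_n = Theta_n(E_n)
        x_n = rho_D(y_n)                           # Step 2: X_n = rho_D(Y_n)
        f_n = f(x_n)
        if f_n < f_best:                           # maintain the best point X_n^*
            x_best, f_best = x_n, f_n
    return x_best, f_best
```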

We refer to a stochastic algorithm that follows the GARS framework above as a GARS algorithm. In a GARS algorithm, the actual sequence of iterates (i.e., function evaluation points) is given by {X_n}_{n≥1}. Note that the realization of the random vector X_n is the same as that of Y_n whenever the realization of Y_n belongs to D. We can also define the sequence {X_n^*}_{n≥1}, where X_1^* = X_1 and, for n > 1, X_n^* = X_n if f(X_n) < f(X_{n−1}^*) while X_n^* = X_{n−1}^* otherwise. That is, {X_n^*}_{n≥1} is the sequence of best points encountered by the algorithm. We say that a GARS algorithm converges to the global minimum of f on D in probability (or almost surely) if {f(X_n^*)}_{n≥1} converges to f^* := inf_{x∈D} f(x) in probability (or almost surely). In this paper, we are interested in conditions that guarantee the convergence of a GARS algorithm in a probabilistic sense. In the convergence theorems below, we assume that the GARS algorithm under consideration is allowed to run indefinitely.

In Step 1 of the GARS framework, each Λ_{i,j} could be a random variable, a random vector, or any other type of random element. If Λ_{i,j} is a random variable, then Ω_{i,j} = R and B_{i,j} = B(R). If Λ_{i,j} is an m-dimensional random vector, then Ω_{i,j} = R^m and B_{i,j} = B(R^m). We allow for the possibility that the realizations of several intermediate random elements are generated before the realization of the trial random vector iterate Y_n is determined. Moreover, the realization of Y_n possibly depends on the realizations of the current and previous intermediate random elements {Λ_{i,j} : i = 1, 2, ..., n and j = 1, 2, ..., k_i} through a deterministic function Θ_n. The introduction of these intermediate random elements gives the framework the flexibility to capture practical stochastic global optimization algorithms.

The GARS framework also covers the special case where the Y_n's are generated independently. In this case, k_n = 1 and Y_n = Λ_{n,1} for all n ≥ 1, and the d-dimensional random vectors {Λ_{n,1}}_{n≥1} are independently generated. This shows that the GARS framework is indeed an extension of the well-known Simple Random (“Blind”) Search algorithm given on page 38 of Spall (2003). However, we are more interested in the general case where the random vectors {Y_n}_{n≥1} are possibly dependent. For example, Y_n may be defined by adding Normally distributed random perturbations with zero mean to the components of X_{n−1}^* (the best solution after n − 1 function evaluations).

There are many convergence results on random search (e.g., Pintér 1996, Solis and Wets 1981, Spall 2003, Zhigljavsky 1991). However, many of these convergence results (e.g., Solis and Wets 1981) are hard to verify in practice. In addition, except for the trivial case of the uniform distribution, previous results on random search did not include a verification of the convergence conditions for other commonly used distributions such as the multivariate Normal and Cauchy distributions.

The following theorem is an extension of the theorem on page 40 of Spall (2003). It presents a condition that guarantees the convergence of a GARS algorithm. The main differences between this theorem and the one given by Spall (2003) are: (i) the sampling in one iteration is not necessarily independent of the previous iterations; (ii) the convergence condition is only required on some subsequence of the iterations; and (iii) the objective function might have multiple global minima (not just multiple local minima) over the given domain. A convergence theorem that can handle (i) and (ii) is important because these conditions hold in many practical stochastic algorithms. Moreover, (iii) is important since, in the most general case, the objective function in a global optimization problem might have multiple global minima, possibly even infinitely many.

Before we present this theorem, we first introduce additional notation. Recall the definition of E_n in the GARS framework. For each n ≥ 0, we define σ(E_n) to be the σ-field generated by the random elements in E_n. We can think of σ(E_n) as representing all the information that can be derived from the random elements in E_n.

Theorem 1. Let f be a real-valued function on D ⊆ R^d such that f^* := inf_{x∈D} f(x) > −∞. Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 < L(ε) < 1 such that (Global Search Condition 1)

P[X_{n_k} ∈ D : f(X_{n_k}) < f^* + ε | σ(E_{(n_k)−1})] ≥ L(ε)    (1)

for some subsequence {n_k}_{k≥1}. Then f(X_n^*) → f^* almost surely (a.s.).

The proof is similar to the one given in Spall (2003), but we include it to illustrate that it is not necessary to assume independence among the X_n's and that the above global search condition need not hold for every X_n. In addition, we allow for the possibility that f has multiple global minimizers on D.

Proof. Fix ε > 0 and define S_ε := {x ∈ D : f(x) < f^* + ε}. By assumption, P[X_{n_k} ∈ S_ε | σ(E_{(n_k)−1})] ≥ L(ε) for any k ≥ 1. Now for each k ≥ 1, we have

P[X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_k} ∉ S_ε] = ∏_{i=1}^{k} P[X_{n_i} ∉ S_ε | X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_{i−1}} ∉ S_ε].

By conditioning on the random elements in E_{(n_i)−1}, it is easy to check that for each ε > 0, we have P[X_{n_i} ∈ S_ε | X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_{i−1}} ∉ S_ε] ≥ L(ε). Thus,

P[X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_k} ∉ S_ε] = ∏_{i=1}^{k} P[X_{n_i} ∉ S_ε | X_{n_1} ∉ S_ε, ..., X_{n_{i−1}} ∉ S_ε] = ∏_{i=1}^{k} (1 − P[X_{n_i} ∈ S_ε | X_{n_1} ∉ S_ε, ..., X_{n_{i−1}} ∉ S_ε]) ≤ (1 − L(ε))^k.


Observe that if i is the smallest index such that X_i ∈ S_ε, then X_i^* = X_i and X_n^* ∈ S_ε for all n ≥ i. Consequently, if X_{n_k}^* ∉ S_ε, then X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_k} ∉ S_ε. Hence, for each k ≥ 1, we have

P[f(X_{n_k}^*) − f^* ≥ ε] = P[f(X_{n_k}^*) ≥ f^* + ε] = P[X_{n_k}^* ∉ S_ε] ≤ P[X_{n_1} ∉ S_ε, X_{n_2} ∉ S_ε, ..., X_{n_k} ∉ S_ε] ≤ (1 − L(ε))^k,

and so, lim_{k→∞} P[f(X_{n_k}^*) − f^* ≥ ε] = 0, i.e., f(X_{n_k}^*) → f^* in probability. By a standard result in probability theory (e.g., see Resnick 1999, Theorem 6.3.1(b)), it follows that f(X_{n_{k(i)}}^*) → f^* almost surely (a.s.) as i → ∞ for some subsequence {n_{k(i)}}_{i≥1}. Next, since f^* > −∞ and {f(X_n^*)}_{n≥1} is monotonically nonincreasing, it follows that lim_{n→∞} f(X_n^*(ω)) exists for each underlying sample point ω. Finally, since the subsequence f(X_{n_{k(i)}}^*) → f^* a.s., it follows that f(X_n^*) → f^* a.s.

Note that the global search condition in the above theorem is expressed in terms of the random vectors {X_{n_k}}_{k≥1}, which are the images of {Y_{n_k}}_{k≥1} under the map ρ_D. It would be more convenient if the global search condition were expressed in terms of the random vectors {Y_{n_k}}_{k≥1}, since these are the ones that are randomly generated by the algorithm. The following corollary gives the result that we want.

Corollary 1. Let f be a real-valued function on D ⊆ R^d such that f^* := inf_{x∈D} f(x) > −∞. Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 < L(ε) < 1 such that (Global Search Condition 2)

P[Y_{n_k} ∈ D : f(Y_{n_k}) < f^* + ε | σ(E_{(n_k)−1})] ≥ L(ε)    (2)

for some subsequence {n_k}_{k≥1}. Then f(X_n^*) → f^* almost surely (a.s.).

Proof. Using the notation in the proof of Theorem 1, the event [X_{n_k} ∈ S_ε] contains [Y_{n_k} ∈ S_ε] for all k ≥ 1. Hence, for all k ≥ 1, we have P[X_{n_k} ∈ S_ε | σ(E_{(n_k)−1})] ≥ P[Y_{n_k} ∈ S_ε | σ(E_{(n_k)−1})] ≥ L(ε), and so, the result follows from Theorem 1.


Note that the global search conditions in (1) and (2) are only required on some subsequence of the iterations in order to guarantee almost sure convergence to the global minimum. This result is important since many practical global optimization algorithms use a combination of global and local search iterations. Local search iterations are meant to explore small regions of the search space in order to refine the current best solution, and they usually cannot satisfy (1) or (2). Moreover, it is also common to combine stochastic and deterministic search strategies in a single algorithm. A deterministic iteration corresponds to a degenerate random vector Y_n whose mass is concentrated at a single point in R^d. By imposing the above global search condition on only a subsequence of the iterations, we still obtain a theoretical guarantee of almost sure convergence to the global minimum.

The next theorem deals with the case when f has a unique global minimizer x^* over D. In this situation, the sequence of best solutions {X_n^*}_{n≥1} converges to x^* almost surely. This result was mentioned on p. 40 of Spall (2003) but a proof was not given. We include a proof of this theorem for the sake of completeness.

Theorem 2. Let f be a real-valued function on D ⊆ R^d and suppose that x^* is the unique global minimizer of f over D in the sense that f(x^*) = inf_{x∈D} f(x) > −∞ and inf_{x∈D, ‖x−x^*‖≥η} f(x) > f(x^*) for all η > 0. Furthermore, suppose that a GARS algorithm satisfies the property that f(X_n^*) → f(x^*) a.s. Then X_n^* → x^* a.s.

Proof. Fix ε > 0 and let f̃ := inf_{x∈D, ‖x−x^*‖≥ε} f(x). Since f(X_n^*) → f(x^*) a.s., there exists a set N ⊆ Ω with P(N) = 0 such that f(X_n^*(ω)) → f(x^*) for all ω ∈ N^c. By assumption, f̃ − f(x^*) > 0. Hence, for each ω ∈ N^c, there is an integer n_0 such that for all n ≥ n_0, we have f(X_n^*(ω)) − f(x^*) = |f(X_n^*(ω)) − f(x^*)| < f̃ − f(x^*), or equivalently, f(X_n^*(ω)) < f̃. If ‖X_n^*(ω) − x^*‖ ≥ ε, then f(X_n^*(ω)) ≥ inf_{x∈D, ‖x−x^*‖≥ε} f(x) = f̃, which is a contradiction. Hence, we must have ‖X_n^*(ω) − x^*‖ < ε. This shows that X_n^*(ω) → x^* for each ω ∈ N^c. Thus, X_n^* → x^* a.s.

The next theorem presents a convergence condition that is easier to verify than the one provided by Theorem 1 or Corollary 1 in the case where f is continuous at a global minimizer over its domain D.

Theorem 3. Let f be a real-valued function on D ⊆ R^d such that f^* := inf_{x∈D} f(x) > −∞. Moreover, let x^* be a global minimizer of f over D and suppose that f is continuous at x^*. Furthermore, suppose that a GARS algorithm satisfies the following property: For any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that (Global Search Condition 3)

P[Y_{n_k} ∈ B(z, δ) ∩ D | σ(E_{(n_k)−1})] ≥ ν(z, δ),    (3)

for some subsequence {n_k}_{k≥1}. Here, B(z, δ) is the open ball centered at z with radius δ. Then f(X_n^*) → f^* a.s. Moreover, if x^* is the unique global minimizer of f over D in the sense of Theorem 2, then X_n^* → x^* a.s.

Note that in the above theorem, we again allow for the possibility that f has multiple global minimizers over D. In this case, the hypothesis of the theorem only requires that f be continuous at one of these global minimizers.

Proof. Fix ε > 0 and k ≥ 1. Since f is continuous at x^*, there exists δ(ε) > 0 such that |f(x) − f(x^*)| < ε whenever ‖x − x^*‖ < δ(ε). Hence, the event [X_{n_k} ∈ D : f(X_{n_k}) < f(x^*) + ε] = [X_{n_k} ∈ D : |f(X_{n_k}) − f(x^*)| < ε] ⊇ [X_{n_k} ∈ D : ‖X_{n_k} − x^*‖ < δ(ε)], and so,

P[X_{n_k} ∈ D : f(X_{n_k}) < f(x^*) + ε | σ(E_{(n_k)−1})] ≥ P[X_{n_k} ∈ D : ‖X_{n_k} − x^*‖ < δ(ε) | σ(E_{(n_k)−1})] = P[X_{n_k} ∈ B(x^*, δ(ε)) ∩ D | σ(E_{(n_k)−1})] ≥ P[Y_{n_k} ∈ B(x^*, δ(ε)) ∩ D | σ(E_{(n_k)−1})] ≥ ν(x^*, δ(ε)) =: L(ε).

Clearly, L(ε) > 0 since δ(ε) > 0. By Theorem 1, it follows that f(X_n^*) → f^* a.s.

Theorem 3 essentially says that the algorithm must be able to reach any neighborhood of any point in D with a probability that is bounded away from zero over all random vector iterates in the subsequence. In the above proof, we only use Global Search Condition 3 at a global minimizer x^*. The reason for requiring the condition for all z ∈ D is that the location of the global minimizer x^* is not known; the algorithm is only guaranteed to reach x^* (in the most general case) if it is capable of reaching any neighborhood of any point in D when run indefinitely. Note also that the requirements of Theorem 3 are somewhat mild: only continuity of f at one of its global minimizers over the domain is required. Hence, convergence is still guaranteed on problems where the objective function is discontinuous in certain regions of the search space. This property is important in practice since many real-world optimization problems contain such discontinuities.

The next theorem presents a sufficient condition that guarantees the convergence of a GARS algorithm in terms of the infimum of the conditional densities of the trial random vector iterates.

It also applies to the case where f is continuous at a global minimizer of f over a compact set D. Again, f might have multiple global minimizers over D, and the theorem only requires that f be continuous at one of them.

Theorem 4. Let D be a subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that for each k ≥ 1, Y_{n_k} has a conditional density g_{n_k}(y | σ(E_{(n_k)−1})) satisfying the following condition: µ({y ∈ D : h(y) = 0}) = 0, where h(y) := inf_{k≥1} g_{n_k}(y | σ(E_{(n_k)−1})). Then f(X_n^*) → f^* a.s. Again, if x^* is the unique global minimizer of f over D in the sense of Theorem 2, then X_n^* → x^* a.s.

Proof. Fix δ > 0 and z ∈ D. For all k ≥ 1, we have

P[Y_{n_k} ∈ B(z, δ) ∩ D | σ(E_{(n_k)−1})] = ∫_{B(z,δ)∩D} g_{n_k}(y | σ(E_{(n_k)−1})) dy ≥ ∫_{B(z,δ)∩D} h(y) dy =: ν(z, δ).

Since h(y) is a nonnegative function on D, µ({y ∈ D : h(y) = 0}) = 0 and µ(B(z, δ) ∩ D) ≥ ψ_D(δ) > 0, it follows that ν(z, δ) > 0. By Theorem 3, we have f(X_n^*) → f^* a.s.

Theorem 4 basically requires that the infimum of the conditional densities of the random candidate vectors over some subsequence of iterations be strictly positive except on a set of Lebesgue measure zero. This result is useful since the requirement is easy to verify for many distributions used in practice.

Next, consider the situation where the random vector iterates are generated using different probability distributions for the different components of an iterate. More precisely, for each j = 1, ..., d, let v^{(j)} denote the j-th component of a point v ∈ R^d, i.e., v = (v^{(1)}, ..., v^{(d)})^T. Moreover, for a given random vector Y : (Ω, B) → (R^d, B(R^d)), let Y^{(j)} be the random variable that corresponds to the j-th component of Y. Also, for any D ⊆ R^d, define D^{(j)} := {y ∈ R : y = v^{(j)} for some v ∈ D}.

Theorem 5. Let D be a subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function defined on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that the following properties hold:


[A] For each k ≥ 1, the random variables Y_{n_k}^{(1)}, ..., Y_{n_k}^{(d)} are conditionally independent given the random elements in E_{(n_k)−1}; and

[B] For each k ≥ 1 and for each 1 ≤ j ≤ d, the random variable Y_{n_k}^{(j)} has a conditional density g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})), and h^{(j)}(u) := inf_{k≥1} g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})) satisfies µ({u ∈ D^{(j)} : h^{(j)}(u) = 0}) = 0. Here, µ is the Lebesgue measure on R.

Then f(X_n^*) → f^* a.s. In addition, if x^* is the unique global minimizer of f on D in the sense of Theorem 2, then X_n^* → x^* a.s.

Proof. By Property [A] above, for each k ≥ 1, Y_{n_k} has a conditional density given by

g_{n_k}(y | σ(E_{(n_k)−1})) = ∏_{j=1}^{d} g_{n_k}^{(j)}(y^{(j)} | σ(E_{(n_k)−1})).

As in Theorem 4, define h(y) := inf_{k≥1} g_{n_k}(y | σ(E_{(n_k)−1})), and note that

h(y) = inf_{k≥1} ∏_{j=1}^{d} g_{n_k}^{(j)}(y^{(j)} | σ(E_{(n_k)−1})) ≥ ∏_{j=1}^{d} ( inf_{k≥1} g_{n_k}^{(j)}(y^{(j)} | σ(E_{(n_k)−1})) ) = ∏_{j=1}^{d} h^{(j)}(y^{(j)}).

Now

{y ∈ D : h(y) = 0} ⊆ ∪_{j=1}^{d} {y ∈ D : h^{(j)}(y^{(j)}) = 0},

and so,

µ({y ∈ D : h(y) = 0}) ≤ µ(∪_{j=1}^{d} {y ∈ D : h^{(j)}(y^{(j)}) = 0}) ≤ ∑_{j=1}^{d} µ({y ∈ D : h^{(j)}(y^{(j)}) = 0}).

Next, we have

{y ∈ D : h^{(j)}(y^{(j)}) = 0} ⊆ D^{(1)} × ... × D^{(j−1)} × {v ∈ D^{(j)} : h^{(j)}(v) = 0} × D^{(j+1)} × ... × D^{(d)},

and so,

µ({y ∈ D : h^{(j)}(y^{(j)}) = 0}) ≤ µ(D^{(1)}) ··· µ(D^{(j−1)}) µ({v ∈ D^{(j)} : h^{(j)}(v) = 0}) µ(D^{(j+1)}) ··· µ(D^{(d)}) = 0.

Hence, µ({y ∈ D : h(y) = 0}) = 0, and by Theorem 4, it follows that f(X_n^*) → f^* a.s.

4 Applications Involving Commonly Used Distributions

Throughout this section, we assume that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0. In particular, we will show that this assumption holds when D is a compact (i.e., closed and bounded) hypercube in R^d. This special case is important since optimization over a compact hypercube is common in many real-world applications. Note that if D is a compact hyperrectangle, the optimization problem can easily be transformed into one where D = [0, 1]^d by an appropriate rescaling of the variables. We consider stochastic algorithms that follow the GARS framework in Section 3. Below, we consider some possibilities for the distributions of the random vectors in a subsequence {Y_{n_k}}_{k≥1} of the trial random vector iterates {Y_n}_{n≥1}.

4.1 Uniform Distributions

The simplest case is when we have a subsequence {Y_{n_k}}_{k≥1} of random vectors where each Y_{n_k} = Λ_{n_k,1} has the uniform distribution over D. Here,

g_{n_k}(y | σ(E_{(n_k)−1})) = 1/µ(D)  for all k ≥ 1,

and so, h(y) := inf_{k≥1} g_{n_k}(y | σ(E_{(n_k)−1})) = 1/µ(D). Clearly, h satisfies µ({y ∈ D : h(y) = 0}) = 0, and so, by Theorem 4, f(X_n^*) → f^* = inf_{x∈D} f(x) a.s.
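In code, this simplest case is just pure random search; a short sketch for a hypercube domain (assumptions: D = [a, b], names ours):

```python
import numpy as np

def pure_random_search(f, a, b, n_iters, rng=None):
    """Simplest GARS instance on D = [a, b]: every trial iterate
    Y_n = Lambda_{n,1} is uniform on D, so Theorem 4 applies with
    h(y) = 1/mu(D); no absorbing transformation is needed."""
    rng = rng or np.random.default_rng()
    a, b = np.asarray(a, float), np.asarray(b, float)
    x_best, f_best = None, np.inf
    for _ in range(n_iters):
        x = rng.uniform(a, b)
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x, fx
    return x_best, f_best
```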

4.2 Elliptical Distributions

Let u ∈ R^d, let V be a symmetric nonnegative definite d × d matrix, and let φ : [0, ∞) → R be a function. A d-dimensional random vector Y is said to have an elliptical distribution with parameters u, V and φ, written Y ∼ EC(u, V, φ), if its characteristic function has the form exp(it^T u) φ(t^T V t). Let Y : (Ω, B) → (R^d, B(R^d)) be a random vector that has an elliptical distribution. If Y has a density, then it has the form (Fang and Zhang 1990)

g(y) = γ [det(V)]^{−1/2} Ψ((y − u)^T V^{−1} (y − u)),  y ∈ R^d,    (4)

where u ∈ R^d, V is a symmetric and positive definite matrix, Ψ is a nonnegative function over the positive reals such that ∫_0^∞ y^{(d/2)−1} Ψ(y) dy < ∞, and γ is the normalizing constant given by

γ = Γ(d/2) / ( 2 π^{d/2} ∫_0^∞ y^{d−1} Ψ(y^2) dy ).    (5)
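As a numerical sanity check on (5): for the Normal choice Ψ(y) = e^{−y/2} discussed below, γ should reduce to the familiar constant (2π)^{−d/2}. A quick sketch using SciPy quadrature (assuming SciPy is available):

```python
import math
from scipy.integrate import quad

def gamma_const(psi, d):
    """Normalizing constant (5):
    gamma = Gamma(d/2) / (2 pi^{d/2} * integral_0^inf y^{d-1} Psi(y^2) dy)."""
    integral, _ = quad(lambda y: y ** (d - 1) * psi(y * y), 0, math.inf)
    return math.gamma(d / 2) / (2 * math.pi ** (d / 2) * integral)

d = 3
print(gamma_const(lambda y: math.exp(-y / 2), d))  # multivariate Normal: ~0.0635
print((2 * math.pi) ** (-d / 2))                   # (2 pi)^{-d/2} for comparison
```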

Elliptical distributions include some of the more important distributions used in the design of practical stochastic algorithms. For example, if we choose Ψ(y) = e^{−y/2} in the above definition, then we get a nondegenerate multivariate Normal distribution, which is widely used in evolutionary algorithms such as evolution strategies and evolutionary programming. Moreover, if we choose Ψ(y) = (1 + y)^{−(d+1)/2}, then we get the multivariate Cauchy distribution. An example of a stochastic algorithm that uses the Cauchy distribution is given in Yao et al. (1999). The following theorem shows that practical algorithms built on elliptical distributions with monotonically nonincreasing Ψ converge to the global minimum almost surely.

Theorem 6. Let D be a bounded subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function defined on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that for each k ≥ 1, we have Y_{n_k} = U_k + W_k, where U_k = Φ_k(E_{(n_k)−1}) for some deterministic function Φ_k and W_k is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density

q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),  w ∈ R^d,

where γ is defined in (5). For each k ≥ 1, let λ_k be the smallest eigenvalue of V_k. Furthermore, suppose that the following properties are satisfied: [P1] Ψ is monotonically nonincreasing; and [P2] inf_{k≥1} λ_k > 0. Then f(X_n^*) → f^* a.s. In addition, if x^* is the unique global minimizer of f on D in the sense of Theorem 2, then X_n^* → x^* a.s.

Before we prove the theorem, we note that in practice we typically have U_k = Φ_k(E_{(n_k)−1}) = X_{(n_k)−1}^*, the random vector representing the best solution found after (n_k) − 1 function evaluations. However, in the above theorem, U_k can be any random vector that is a function of the random elements in E_{(n_k)−1} whose realizations are in R^d but do not have to be in D. In addition, U_k may be any fixed vector in R^d.

Proof. For each k ≥ 1, observe that the conditional distribution of Y_{n_k} given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density

g_{n_k}(y | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ((y − u_k)^T V_k^{−1} (y − u_k)),  y ∈ R^d,

where u_k = Φ_k({λ_{i,j} : i = 1, 2, ..., (n_k) − 1; j = 1, 2, ..., k_i}) and the λ_{i,j}'s are the realizations of the random elements in E_{(n_k)−1}. Now for each k ≥ 1 and y ∈ D, we have

(y − u_k)^T V_k^{−1} (y − u_k) ≤ ‖y − u_k‖_2 ‖V_k^{−1} (y − u_k)‖_2 ≤ ‖y − u_k‖_2^2 ‖V_k^{−1}‖_2 ≤ diam(D)^2 / λ_k,

where diam(D) = sup_{x,y∈D} ‖x − y‖. Since D is bounded, diam(D) < ∞. Moreover, since Ψ is monotonically nonincreasing, we obtain

Ψ((y − u_k)^T V_k^{−1} (y − u_k)) ≥ Ψ(diam(D)^2 / λ_k).

Moreover, since the determinant of a matrix is the product of its eigenvalues, it follows that det(V_k) ≤ (λ_k^*)^d, where λ_k^* is the largest eigenvalue of V_k. Thus, for each y ∈ D,

g_{n_k}(y | σ(E_{(n_k)−1})) ≥ γ (λ_k^*)^{−d/2} Ψ(diam(D)^2 / λ_k) ≥ γ (sup_{k≥1} λ_k^*)^{−d/2} Ψ(diam(D)^2 / inf_{k≥1} λ_k) > 0.

Hence, we have

h(y) := inf_{k≥1} g_{n_k}(y | σ(E_{(n_k)−1})) ≥ γ (sup_{k≥1} λ_k^*)^{−d/2} Ψ(diam(D)^2 / inf_{k≥1} λ_k) > 0.

By Theorem 4, it follows that f(X_n^*) → f^* a.s.

It is easy to check that Ψ is monotonically nonincreasing for the multivariate Normal (Ψ(y) = e^{−y/2}) and multivariate Cauchy (Ψ(y) = (1 + y)^{−(d+1)/2}) distributions. Hence, Theorem 6 holds when these distributions are used. Finally, we note that Y_{n_k} in Theorem 6 could yield a realization outside of D. This is why it is important to apply an absorbing transformation ρ_D : R^d → D to ensure that the new iterate lies inside D.

Corollary 2. Let D be a bounded subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function defined on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that for each k ≥ 1, we have Y_{n_k} = X_{(n_k)−1}^* + W_k, where W_k is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is multivariate Normal with mean vector 0 and covariance matrix V_k. For each k ≥ 1, let λ_k be the smallest eigenvalue of V_k. If inf_{k≥1} λ_k > 0, then f(X_n^*) → f^* a.s. In addition, if x^* is the unique global minimizer of f on D in the sense of Theorem 2, then X_n^* → x^* a.s.

Proof. As mentioned earlier, the multivariate Normal distribution is a special case of the elliptical distribution. Moreover, note that

X_{(n_k)−1}^* = ∑_{i=1}^{(n_k)−1} X_i 1_{A_i},

where 1_{A_i} is an indicator function and A_i is the event defined by

A_i = [f(X_i) ≤ f(X_j) for all j = 1, ..., (n_k) − 1 and i is the smallest index with this property]
    = [f(ρ_D(Y_i)) ≤ f(ρ_D(Y_j)) for all j = 1, ..., (n_k) − 1 and i is the smallest index with this property].

(We write A_i for this event to avoid confusion with the collection of random elements E_i.) For each i = 1, 2, ..., (n_k) − 1, Y_i is a deterministic function of the random elements in E_i. Hence, X_{(n_k)−1}^* is a deterministic function of the random elements in E_{(n_k)−1}. Since inf_{k≥1} λ_k > 0, it follows that for all k ≥ 1, the matrix V_k is invertible and W_k has an elliptical distribution with conditional density given σ(E_{(n_k)−1}) by

q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),  w ∈ R^d,

where Ψ(y) = e^{−y/2} and γ = (2π)^{−d/2}. Clearly, Ψ(y) = e^{−y/2} is monotonically nonincreasing, and we have inf_{k≥1} λ_k > 0 by assumption. The conclusion now follows from Theorem 6.

4.3 Hypercube Domains

The next result shows that the condition on ψ_D(δ) in Theorems 4–6 is easily satisfied in the special case where D is a closed and bounded hypercube in R^d. Note that optimization over a closed hypercube is typical in many applications.

Proposition 1. Let D = [a, b] ⊆ R^d be a closed hypercube and let ℓ(D) be the length of one side of D. If 0 < δ ≤ ℓ(D)/2, then ψ_D(δ) = (δ/2)^d π^{d/2} / Γ(d/2 + 1). If δ > ℓ(D)/2, then ψ_D(δ) ≥ (ℓ(D)/4)^d π^{d/2} / Γ(d/2 + 1). Here, Γ is the gamma function defined by Γ(n) := ∫_0^∞ x^{n−1} e^{−x} dx.

Proof. First, consider the case where 0 < δ ≤ ℓ(D)/2. Let e_i be the i-th unit vector in R^d, i.e., e_i^{(i)} = 1 and e_i^{(j)} = 0 for all j ≠ i. Fix z ∈ D. Since δ ≤ ℓ(D)/2, there exist α_1, ..., α_d ∈ {−1, +1} such that z + δα_i e_i ∈ D for all i = 1, ..., d. Let D̃ be the hypercube determined by the points z, z + δα_1 e_1, ..., z + δα_d e_d. Note that D̃ ⊆ D, and so,

µ(B(z, δ) ∩ D) ≥ µ(B(z, δ) ∩ D̃) = (1/2^d) µ(B(z, δ)) = (δ/2)^d π^{d/2} / Γ(d/2 + 1).

Hence, ψ_D(δ) ≥ (δ/2)^d π^{d/2} / Γ(d/2 + 1). Finally, note that when z is a corner point of D, then µ(B(z, δ) ∩ D) = (δ/2)^d π^{d/2} / Γ(d/2 + 1).

Next, consider the case where δ > ℓ(D)/2. Again, fix z ∈ D. In this case, there exist α_1, ..., α_d ∈ {−1, +1} such that z + (ℓ(D)/2) α_i e_i ∈ D for all i = 1, ..., d. The proof now follows in a similar manner as before.
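The corner-point case in the proof is easy to check by Monte Carlo: place z at a corner of D = [0, 1]^d and compare the estimated µ(B(z, δ) ∩ D) with (δ/2)^d π^{d/2} / Γ(d/2 + 1). A sketch under these assumptions (names ours):

```python
import math
import numpy as np

def ball_volume(d, r):
    """Volume of a d-dimensional ball of radius r: pi^{d/2} r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def mc_corner_intersection(d, delta, n=200_000, seed=0):
    """Monte Carlo estimate of mu(B(z, delta) ∩ [0,1]^d) for the corner z = 0,
    assuming delta <= 1/2 so the far faces of the cube play no role."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-delta, delta, size=(n, d))   # bounding box of B(0, delta)
    in_ball = (pts ** 2).sum(axis=1) <= delta ** 2
    in_cube = (pts >= 0.0).all(axis=1)
    return (in_ball & in_cube).mean() * (2 * delta) ** d

d, delta = 3, 0.4
print(mc_corner_intersection(d, delta))   # Monte Carlo estimate
print(ball_volume(d, delta) / 2 ** d)     # (delta/2)^d pi^{d/2} / Gamma(d/2+1)
```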

4.4 Triangular Distributions

A random variable Y is said to have a triangular distribution with lower limit a, upper limit b, and mode c if its density function is given by

g(y) = 2(y − a) / ((b − a)(c − a))  if a ≤ y ≤ c,
g(y) = 2(b − y) / ((b − a)(b − c))  if c ≤ y ≤ b.

Now suppose D = [a, b] ⊆ R^d is a compact hypercube and that there is a subsequence {Y_{n_k}}_{k≥1} of random vectors where each Y_{n_k} has the following properties:

[A] The random variables Y_{n_k}^{(1)}, ..., Y_{n_k}^{(d)} (the components of Y_{n_k}) are conditionally independent given the random elements in E_{(n_k)−1}; and

[B] For each 1 ≤ j ≤ d, the random variable Y_{n_k}^{(j)} has a triangular distribution with lower limit a^{(j)}, upper limit b^{(j)}, and mode (x_{(n_k)−1}^*)^{(j)}. (Recall that x_{(n_k)−1}^* is the best solution found after (n_k) − 1 function evaluations.)

In this case, each Y_{n_k}^{(j)} has conditional density

g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})) = 2(u − a^{(j)}) / ( (b^{(j)} − a^{(j)}) ((x_{(n_k)−1}^*)^{(j)} − a^{(j)}) )  if a^{(j)} ≤ u ≤ (x_{(n_k)−1}^*)^{(j)},
g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})) = 2(b^{(j)} − u) / ( (b^{(j)} − a^{(j)}) (b^{(j)} − (x_{(n_k)−1}^*)^{(j)}) )  if (x_{(n_k)−1}^*)^{(j)} ≤ u ≤ b^{(j)}.

Let h^{(j)}(u) := inf_{k≥1} g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})). Now for each 1 ≤ j ≤ d, define

q^{(j)}(u) = 2(u − a^{(j)}) / (b^{(j)} − a^{(j)})^2  if a^{(j)} ≤ u ≤ (a^{(j)} + b^{(j)})/2,
q^{(j)}(u) = 2(b^{(j)} − u) / (b^{(j)} − a^{(j)})^2  if (a^{(j)} + b^{(j)})/2 ≤ u ≤ b^{(j)}.

It is easy to check that h^{(j)}(u) ≥ q^{(j)}(u) for all 1 ≤ j ≤ d and for all a^{(j)} ≤ u ≤ b^{(j)}. Since µ({u ∈ D^{(j)} : q^{(j)}(u) = 0}) = µ({u ∈ [a^{(j)}, b^{(j)}] : q^{(j)}(u) = 0}) = µ({a^{(j)}, b^{(j)}}) = 0, it follows that µ({u ∈ D^{(j)} : h^{(j)}(u) = 0}) = 0. Hence, by Theorem 5, f(X_n^*) → f^* a.s.
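A componentwise triangular mutation of this kind is directly available through Python's standard `random.triangular`; a minimal sketch of one trial iterate (names ours; realizations automatically stay in D, so no absorbing transformation is needed):

```python
import random

def triangular_trial(x_best, a, b, rng=random):
    """One trial iterate Y_{n_k}: independent triangular components on
    [a_j, b_j] with the mode at the j-th coordinate of the current best point."""
    return [rng.triangular(a_j, b_j, c_j)
            for a_j, b_j, c_j in zip(a, b, x_best)]

# example: two variables on [0, 1] with current best point (0.3, 0.7)
y = triangular_trial(x_best=[0.3, 0.7], a=[0.0, 0.0], b=[1.0, 1.0])
```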


5 Application to Practical Stochastic Search Algorithms

5.1 Localized Random Search

The following general algorithm for finding the global minimum of f over D ⊆ R^d is given on page 44 of Spall (2003), but a convergence proof was not provided in that book. Here, we consider a special case that involves the multivariate Normal distribution and provide a convergence proof using the results from the previous sections.

Algorithm B: Localized Random Search

Inputs:
(1) The objective function f : D → R, where D ⊆ R^d.
(2) A deterministic absorbing transformation ρ_D : R^d → D, i.e., ρ_D(x) = x for all x ∈ D.

Step 1. Pick an initial guess Y_1 either randomly or deterministically and set X_1 = ρ_D(Y_1). Set n = 2.

Step 2. Generate a new candidate iterate Y_n = X_{n−1}^* + Z_n, where X_{n−1}^* is the best solution after n − 1 function evaluations and Z_n is a random vector whose conditional distribution given σ(E_{n−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

Cov(Z_n) = diag( (σ_n^{(1)})^2, (σ_n^{(2)})^2, ..., (σ_n^{(d)})^2 ).

Here, E_1 = {Y_1} and E_{n−1} = {Y_1, Z_2, ..., Z_{n−1}} for n > 2.

Step 3. Set X_n = ρ_D(Y_n) and evaluate f(X_n).

Step 4. Increment n = n + 1 and go back to Step 2.

In the above algorithm, Step 2 is equivalent to

Y_n^{(j)} = (X_{n−1}^*)^{(j)} + Z_n^{(j)},  j = 1, ..., d.

(Recall that Y^{(j)} is the j-th component of the random vector Y.) Since Z_n has a Normal distribution and Cov(Z_n) is a diagonal matrix, it follows that the random variables Z_n^{(1)}, ..., Z_n^{(d)} are conditionally independent given σ(E_{n−1}) and each Z_n^{(j)} has a Normal distribution with mean 0 and standard deviation σ_n^{(j)}. That is, σ_n^T = (σ_n^{(1)}, ..., σ_n^{(d)}) is the vector of mutation standard deviations for the individual components of X_{n−1}^*.
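For concreteness, here is a compact sketch of Algorithm B for a hypercube domain, using componentwise clipping as ρ_D; the parameter names are ours, and keeping the standard deviations bounded away from zero along a subsequence is exactly the hypothesis of Corollary 3 below:

```python
import numpy as np

def localized_random_search(f, a, b, sigma, n_iters, rng=None):
    """Algorithm B on D = [a, b]: Y_n = X_{n-1}^* + Z_n with
    Z_n ~ N(0, diag(sigma^2)), and rho_D = componentwise clipping."""
    rng = rng or np.random.default_rng()
    a, b = np.asarray(a, float), np.asarray(b, float)
    x_best = rng.uniform(a, b)                              # Step 1: random initial guess in D
    f_best = f(x_best)
    for _ in range(n_iters - 1):
        y = x_best + rng.normal(0.0, sigma, size=a.shape)   # Step 2: Gaussian perturbation
        x = np.minimum(np.maximum(a, y), b)                 # Step 3: X_n = rho_D(Y_n)
        fx = f(x)
        if fx < f_best:                                     # update the best solution X_n^*
            x_best, f_best = x, fx
    return x_best, f_best
```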

Corollary 3. Let D be a bounded subset of R^d such that ψ_D(δ) > 0 for all δ > 0. Suppose the above Localized Random Search algorithm is applied to a real-valued function f on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Furthermore, suppose that there exists a subsequence {n_k}_{k≥1} such that inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0. (Equivalently, lim sup_{n→∞} min_{1≤j≤d} σ_n^{(j)} > 0.) Then f(X_n^*) → f^* a.s.

Proof. First, we check that the Localized Random Search algorithm above follows the GARS framework. We have k_1 = 1 and Λ_{1,1} = Y_1, and for n ≥ 2, we have k_n = 1 and Λ_{n,1} = Z_n. Moreover, as in the proof of Corollary 2, X_{n−1}^* is a deterministic function of the random elements in E_{n−1}, and so, Y_n is a deterministic function of the random elements in E_n = E_{n−1} ∪ {Z_n} for all n ≥ 2.

Now suppose that there exists a subsequence {n_k}_{k≥1} such that inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0. We have

Y_{n_k} = X_{(n_k)−1}^* + Z_{n_k}  for all k ≥ 1,

where Z_{n_k} is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is the Normal distribution with mean vector 0 and covariance matrix Cov(Z_{n_k}) = diag( (σ_{n_k}^{(1)})^2, ..., (σ_{n_k}^{(d)})^2 ). Define W_k = Z_{n_k} and V_k = Cov(W_k) for all k ≥ 1. Then

Y_{n_k} = X_{(n_k)−1}^* + W_k  for all k ≥ 1.

The eigenvalues of V_k are the variances (σ_{n_k}^{(1)})^2, ..., (σ_{n_k}^{(d)})^2 of the normal random perturbations for the different components of X_{(n_k)−1}^*. Hence, the smallest eigenvalue of V_k is λ_k := min_{1≤j≤d} (σ_{n_k}^{(j)})^2. Since inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0, it follows that inf_{k≥1} λ_k > 0. The result now follows from Corollary 2.

5.2 Evolutionary Algorithms

There are three main types of evolutionary algorithms for global optimization (Bäck 1996): genetic algorithms, evolution strategies, and evolutionary programming algorithms. Below, we apply one of the convergence results from the previous sections to a simple evolutionary programming algorithm where the selection of parents in each generation is done in a greedy manner. The results in this paper do not directly apply to a standard genetic algorithm for continuous optimization. However, it may be possible to extend some of the results in this paper to prove the convergence of more complex evolutionary programming algorithms and also evolution strategies; this topic is beyond the scope of this paper and will be the focus of future work.

Algorithm C: Evolutionary Programming with Greedy Parent Selection

Inputs:
(1) The objective function f : D → R, where D ⊆ R^d.
(2) A nonnegative fitness function F : D → R_+.
(3) The number of offspring in every generation, denoted by µ.
(4) A deterministic absorbing transformation ρ_D : R^d → D, i.e., ρ_D(x) = x for all x ∈ D.

Step 1. (Initialization) Set t = 0 and, for each i = 1, 2, ..., µ, generate Y_i according to some probability distribution whose realizations are in R^d, where Y_i possibly depends on Y_1, ..., Y_{i−1}, and set X_i = ρ_D(Y_i). Moreover, for each i = 1, 2, ..., µ, set P_i(0) = X_i. The set P(0) = {P_1(0), P_2(0), ..., P_µ(0)} = {X_1, X_2, ..., X_µ} is the initial parent population.

Step 2. (Evaluate the Initial Parent Population) For each i = 1, 2, ..., µ, evaluate the objective function value f(X_i) and the fitness value F(X_i).

Step 3. (Iterate) While the termination condition is not satisfied do:
Step 3.1. (Update Generation Counter) Reset t := t + 1.
Step 3.2. (Generate Offspring by Mutation) For each i = 1, 2, ..., µ, set Y_{tµ+i} = Mut(P_i(t − 1)) and X_{tµ+i} = ρ_D(Y_{tµ+i}). The set M(t) := {X_{tµ+1}, X_{tµ+2}, ..., X_{tµ+µ}} constitutes the offspring for the current generation.
Step 3.3. (Evaluate the Offspring) For each i = 1, 2, ..., µ, evaluate f(X_{tµ+i}) and the fitness value F(X_{tµ+i}).
Step 3.4. (Select New Parent Population) Select the parent population for the next generation: P(t) = Sel(P(t − 1) ∪ M(t)).
End.

In Step 3.2, the mutation operator is defined as follows: For each t ≥ 1 and i = 1, 2, ..., µ, Y_{tµ+i} = Mut(P_i(t − 1)) = P_i(t − 1) + Z_{tµ+i}, where Z_{tµ+i} is a random vector whose conditional distribution given σ(E_{tµ+i−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

Cov(Z_{tµ+i}) = diag( (σ_{tµ+i}^{(1)})^2, (σ_{tµ+i}^{(2)})^2, ..., (σ_{tµ+i}^{(d)})^2 ).

Here, E_{tµ+i−1} = {Y_1, ..., Y_µ, Z_{µ+1}, Z_{µ+2}, ..., Z_{tµ+i−1}}. Moreover, σ_{tµ+i}^T = (σ_{tµ+i}^{(1)}, ..., σ_{tµ+i}^{(d)}) is the vector of mutation standard deviations for the individual components of the parent vector P_i(t − 1). That is, for each j = 1, 2, ..., d, the conditional distribution of the random variable Z_{tµ+i}^{(j)} given σ(E_{tµ+i−1}) is a Normal distribution with mean 0 and standard deviation σ_{tµ+i}^{(j)}. Moreover, we set the algorithm parameter σ_{tµ+i}^{(j)} = √(F(P_i(t − 1))), as noted in Bäck (1993).

In Step 3.4, the selection of the parent solutions for the next generation is usually accomplished by probabilistic q-tournament selection as described in Bäck (1993). As q increases, this q-tournament selection procedure becomes more and more greedy. For simplicity, we assume that the selection of parent solutions proceeds in a completely greedy manner. That is, P(t) is simply the collection of µ solutions from P(t − 1) ∪ M(t) with the best fitness values. The more general case of q-tournament selection will be addressed in future work.

Corollary 4. Let D be a bounded subset of R^d such that ψ_D(δ) > 0 for all δ > 0. Suppose the above EP algorithm is applied to a real-valued function f on D such that f^* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x^* of f over D. Furthermore, suppose that the fitness function F : D → R_+ satisfies F(x) ≥ τ for all x ∈ D, for some τ > 0. Then f(X_n^*) → f^* a.s.

Proof. As before, we first check that the above EP algorithm follows the GARS framework. For n = 1, 2, ..., µ, we have k_n = 1 and Λ_{n,1} = Y_n. Moreover, for all n ≥ µ + 1, we have k_n = 1 and Λ_{n,1} = Z_n. Since the selection of the new parent population in Step 3.4 is done in a greedy manner, it follows that for each integer t ≥ 1 and i = 1, 2, ..., µ, P_i(t − 1) is a deterministic function of Y_1, Y_2, ..., Y_{tµ}. This also implies that for each integer t ≥ 1 and i = 1, 2, ..., µ, P_i(t − 1) is also a deterministic function of Y_1, Y_2, ..., Y_{tµ+i−1}. Hence, for each integer t ≥ 1 and i = 1, 2, ..., µ, we have Y_{tµ+i} = Φ_{(t−1)µ+i}(E_{tµ+i−1}) + Z_{tµ+i} for some deterministic function Φ_{(t−1)µ+i}. Note that this implies that Y_{µ+k} = Φ_k(E_{µ+k−1}) + Z_{µ+k} for all k ≥ 1, where Z_{µ+k} is a random vector whose conditional distribution given σ(E_{µ+k−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

Cov(Z_{µ+k}) = diag( (σ_{µ+k}^{(1)})^2, (σ_{µ+k}^{(2)})^2, ..., (σ_{µ+k}^{(d)})^2 ).

For each integer k ≥ 1 and j = 1, 2, ..., d, we have (σ_{µ+k}^{(j)})^2 = F(P_i(t − 1)) ≥ τ, where i and t are the unique integers such that i = 1, ..., µ, t ≥ 1 and k = (t − 1)µ + i. Define the subsequence {n_k}_{k≥1} by n_k = µ + k for all k ≥ 1. Then we have Y_{n_k} = Φ_k(E_{(n_k)−1}) + W_k for all k ≥ 1, where W_k = Z_{n_k}. Let V_k := Cov(W_k) and let λ_k be the smallest eigenvalue of V_k. Since the eigenvalues of V_k are (σ_{µ+k}^{(1)})^2, ..., (σ_{µ+k}^{(d)})^2, we have

λ_k = min_{1≤j≤d} (σ_{µ+k}^{(j)})^2 ≥ τ,

and so, inf_{k≥1} λ_k ≥ τ > 0. Moreover, the conditional distribution of W_k given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density

q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),  w ∈ R^d,

where Ψ(y) = e^{−y/2} and γ = (2π)^{−d/2}. Again, Ψ(y) = e^{−y/2} is monotonically nonincreasing, and so, the conclusion follows from Theorem 6.
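Putting Algorithm C, the mutation rule of Step 3.2, and the greedy selection together, the following condensed sketch illustrates the algorithm analyzed in Corollary 4. Names are ours; we select survivors by smallest objective value, which matches greedy selection by fitness under the assumption that better fitness corresponds to smaller f:

```python
import numpy as np

def ep_greedy(f, fitness, a, b, mu, n_gens, rng=None):
    """Algorithm C on D = [a, b]: EP with greedy (mu + mu) selection.
    Each coordinate of a parent is perturbed by a zero-mean Gaussian with
    standard deviation sqrt(fitness(parent)), as in Step 3.2; the fitness
    is assumed bounded below by some tau > 0 (hypothesis of Corollary 4)."""
    rng = rng or np.random.default_rng()
    a, b = np.asarray(a, float), np.asarray(b, float)
    clip = lambda y: np.minimum(np.maximum(a, y), b)         # rho_D
    parents = [rng.uniform(a, b) for _ in range(mu)]         # Step 1 (already in D)
    for _ in range(n_gens):                                  # Step 3
        offspring = [clip(p + rng.normal(0.0, np.sqrt(fitness(p)), size=a.shape))
                     for p in parents]                       # Steps 3.2-3.3
        pool = parents + offspring                           # Step 3.4: greedy selection of
        parents = sorted(pool, key=f)[:mu]                   # the mu best points
    return min(parents, key=f)
```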

6 Related Convergence Results

The purpose of this section is to explore some connections between the results in Section 3 and the results in a paper by Stephens and Baritompa (1998). Recall the notation in Section 3. Let D be a subset of R^d and let C(D) be the set of all continuous functions f : D → R. Moreover, let X_f^* be the set of all global minimizers of f : D → R. If D is compact, it is well known that X_f^* ≠ ∅ for all f ∈ C(D).

Now suppose that a stochastic global minimization algorithm applied to f : D → R generates the random vector iterates {X_n}_{n≥1}. The range of the sequence of random vectors {X_n}_{n≥1} is denoted by range({X_n}_{n≥1}), the set of subsequential limit points of {X_n}_{n≥1} is denoted by slp({X_n}_{n≥1}), and the closure of {X_n}_{n≥1} is denoted by cl({X_n}_{n≥1}). Note that range({X_n}_{n≥1}), slp({X_n}_{n≥1}) and cl({X_n}_{n≥1}) are random sets of points in D and cl({X_n}_{n≥1}) = range({X_n}_{n≥1}) ∪ slp({X_n}_{n≥1}). Following the definition in Stephens and Baritompa (1998), the algorithm is said to see the global minimum if cl({X_n}_{n≥1}) ∩ X_f^* ≠ ∅. In addition, the algorithm is said to see the point x ∈ D if x ∈ cl({X_n}_{n≥1}).

Stephens and Baritompa (1998) introduced the notion of a deterministic (or stochastic) sequential sampling algorithm as an algorithm where the next iterate (or sample point) depends only on what they call local information obtained from the previous iterates, and on an instance of a random element (in the stochastic case). Examples of local information include the best sample point, the maximum slope between sample points, or an interpolating polynomial through the sample points. Examples of nonlocal information include the Lipschitz constant, the level set associated with a function value, the number of local minima, or the global minimum itself. The GARS framework (Algorithm A, Section 3) is more general than the idea of a stochastic sequential sampling algorithm since it can incorporate nonlocal information through the parameters of the probability distributions in Step 1.1 or through the deterministic function Θ_n in Step 1.2. Moreover, in Step 1.2 of the GARS framework, the current iterate possibly depends not only on the previous iterates and their function values (as in stochastic sequential sampling algorithms) but also on the intermediate random elements that were used to generate the previous iterates. Hence, the results in this paper apply to a wider class of algorithms.

The following theorem is a special case of Theorem 3.3 in Stephens and Baritompa (1998) when restricted to continuous functions. Note that it also applies to GARS algorithms that satisfy the definition of a stochastic sequential sampling algorithm.

Theorem (Stephens and Baritompa 1998). For any probability p and for any stochastic sequential sampling algorithm,

P(algorithm sees the global minimum of g) ≥ p  ∀g ∈ C(D)

if and only if

P(x ∈ cl({X_n}_{n≥1})) ≥ p  ∀x ∈ D, ∀f ∈ C(D).

The above theorem provides a necessary and sufficient condition for an algorithm to see the global minimum of f with a specified probability. In particular, the special case where p = 1 yields a result similar to those we proved in Section 3. The above result provides a more general convergence criterion in that p could be less than 1. However, given a practical stochastic algorithm, it is typically not straightforward to check whether P(x ∈ cl({X_n}_{n≥1})) ≥ p for all x ∈ D and all f ∈ C(D) for a given value of p, especially when p is strictly between 0 and 1.

The next theorem says that for stochastic algorithms applied to continuous functions defined over compact domains, converging to the global minimum almost surely is equivalent to seeing the global minimum almost surely. Recall that the global search conditions in Section 3 imply convergence to the global minimum almost surely. Hence, these same conditions can be used to guarantee that the algorithm sees the global minimum almost surely.

Theorem 7. Let D be a compact subset of R^d and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of random vector iterates {X_n}_{n≥1} whose sequence of best points found so far is given by {X_n^*}_{n≥1}. Then f(X_n^*) → f^* := inf_{x∈D} f(x) > −∞ a.s. if and only if cl({X_n}_{n≥1}) ∩ X_f^* ≠ ∅ a.s. (i.e., the algorithm sees the global minimum of f a.s.).

Proof. First, assume that f(X_n^*) → f^* a.s. Then there exists a set N such that P(N) = 0 and f(X_n^*(ω)) → f^* for all ω ∈ N^c. Fix ω ∈ N^c. We wish to show that cl({X_n(ω)}_{n≥1}) ∩ X_f^* ≠ ∅. Suppose this is not the case. Let f̃ := inf_{x ∈ cl({X_n(ω)}_{n≥1})} f(x). Since cl({X_n(ω)}_{n≥1}) is a closed subset of the compact set D, it follows that cl({X_n(ω)}_{n≥1}) is also compact. Moreover, since f is a continuous function, it follows that f̃ = f(x̃) for some x̃ ∈ cl({X_n(ω)}_{n≥1}). By assumption, cl({X_n(ω)}_{n≥1}) ∩ X_f^* = ∅, and so, f̃ = f(x̃) > f^*. Next,

range({X_n^*(ω)}_{n≥1}) ⊆ range({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}).

Hence, f(X_n^*(ω)) ≥ f̃ > f^* for all n ≥ 1, and so, {f(X_n^*(ω))} cannot converge to f^*, which is a contradiction. Hence, cl({X_n(ω)}_{n≥1}) ∩ X_f^* ≠ ∅ for this ω ∈ N^c. Thus, the algorithm sees the global minimum of f a.s.

To prove the converse, assume that cl({X_n}_{n≥1}) ∩ X_f^* ≠ ∅ a.s. Then there exists a set N such that P(N) = 0 and cl({X_n(ω)}_{n≥1}) ∩ X_f^* ≠ ∅ for all ω ∈ N^c. Fix ω ∈ N^c and let x^* ∈ cl({X_n(ω)}_{n≥1}) ∩ X_f^*. Note that x^* ∈ range({X_n(ω)}_{n≥1}) or x^* ∈ slp({X_n(ω)}_{n≥1}), or both. Suppose x^* ∈ range({X_n(ω)}_{n≥1}). Then x^* = X_{ñ}(ω) for some integer ñ. Since x^* ∈ X_f^*, it follows that X_n^*(ω) = x^* for all n ≥ ñ. Hence, f(X_n^*(ω)) = f(x^*) = f^* for all n ≥ ñ, and so, f(X_n^*(ω)) → f^*.

On the other hand, suppose x^* ∈ slp({X_n(ω)}_{n≥1}). Then there exists a subsequence {X_{n_k}(ω)}_{k≥1} such that X_{n_k}(ω) → x^* as k → ∞. Since f is continuous, f(X_{n_k}(ω)) → f(x^*) = f^* as k → ∞. Moreover, 0 ≤ f(X_{n_k}^*(ω)) − f^* ≤ f(X_{n_k}(ω)) − f^* → 0 as k → ∞, and so, f(X_{n_k}^*(ω)) → f^* as k → ∞. Next, since f^* > −∞ and {f(X_n^*(ω))}_{n≥1} is monotonically nonincreasing, it follows that lim_{n→∞} f(X_n^*(ω)) exists, and so, lim_{n→∞} f(X_n^*(ω)) = f^*. In either case, f(X_n^*(ω)) → f^* for the given ω ∈ N^c. Thus, f(X_n^*) → f^* a.s.

Törn and Žilinskas (1989) proved that a deterministic global minimization algorithm converges to the global minimum of any continuous function on a compact set D ⊆ R^d if and only if the sequence of iterates of the algorithm is dense in D. Stephens and Baritompa (1998) extended this result to any deterministic sequential sampling algorithm on what they call a sufficiently rich class of functions (including continuous functions), and they also proved a stochastic version of their result. Theorem 8 below is another stochastic version of the result by Törn and Žilinskas (1989) that is different from the theorem proved by Stephens and Baritompa (1998).

Consider a sequence of random vectors {X_n}_{n≥1} whose realizations are in D ⊆ R^d. We say that {X_n}_{n≥1} is probabilistically dense in D with guarantee p if

P(range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅) ≥ p  ∀x ∈ D, ∀δ > 0.

As before, B(x, δ) is the open ball in R^d centered at x with radius δ. If the above condition holds when p = 1, we simply say that {X_n}_{n≥1} is probabilistically dense in D. The next theorem shows that the sequence of random vector iterates of a stochastic global minimization algorithm is probabilistically dense in D with guarantee p if and only if the algorithm sees any point x ∈ D (including the global minimum of any function f on D) with probability at least p. To prove the next theorem, we need the following lemma.

Lemma 1. Let {X_n}_{n≥1} be a sequence of random vectors whose realizations are in D ⊆ R^d. For any x ∈ D, the following events are equal:

[x ∈ cl({X_n}_{n≥1})] = [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0].

Proof. Since cl({X_n}_{n≥1}) = range({X_n}_{n≥1}) ∪ slp({X_n}_{n≥1}), we have [x ∈ cl({X_n}_{n≥1})] ⊆ [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Next, suppose ω ∈ [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Then range({X_n(ω)}_{n≥1}) ∩ B(x, δ) ≠ ∅ for all δ > 0. Note that either x ∈ range({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}) or x ∉ range({X_n(ω)}_{n≥1}). If x ∉ range({X_n(ω)}_{n≥1}), then there are infinitely many elements of range({X_n(ω)}_{n≥1}) contained in any open ball around x. This implies that there exists an integer n_1 such that X_{n_1}(ω) ∈ B(x, 1). Moreover, there exists an integer n_2 > n_1 such that X_{n_2}(ω) ∈ B(x, 1/2). In general, for any integer k > 1, there exists an integer n_k > n_{k−1} such that X_{n_k}(ω) ∈ B(x, 1/k). Clearly, X_{n_k}(ω) → x as k → ∞, and so, x ∈ slp({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}). In either case, we have ω ∈ [x ∈ cl({X_n}_{n≥1})], and so, [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0] ⊆ [x ∈ cl({X_n}_{n≥1})].

Theorem 8. Let D ⊆ R^d and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {X_n}_{n≥1} whose realizations are in D. Then {X_n}_{n≥1} is probabilistically dense in D with guarantee p if and only if P(x ∈ cl({X_n}_{n≥1})) ≥ p for all x ∈ D (i.e., the algorithm sees any point of D with probability at least p).

Proof. First, suppose that {X_n}_{n≥1} is probabilistically dense in D with guarantee p. Fix x ∈ D. From the previous lemma, we have [x ∈ cl({X_n}_{n≥1})] = [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Moreover, it is easy to check that

[range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0] = [range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1].

Hence, P(x ∈ cl({X_n}_{n≥1})) = P(range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1). For every integer k ≥ 1, define the event

S_k = [range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅].

Since {X_n}_{n≥1} is probabilistically dense in D with guarantee p, it follows that P(S_k) ≥ p for all k ≥ 1. Moreover, since S_k ⊇ S_{k+1} for all k ≥ 1, it follows that {P(S_k)}_{k≥1} is a nonincreasing sequence that is bounded below, and so, lim_{k→∞} P(S_k) exists. Hence,

P(x ∈ cl({X_n}_{n≥1})) = P(∩_{i=1}^{∞} S_i) = P(lim_{k→∞} ∩_{i=1}^{k} S_i) = P(lim_{k→∞} S_k) = lim_{k→∞} P(S_k) ≥ p.

To prove the converse, suppose that P (x ∈ cl({Xn }n≥1 )) ≥ p ∀x ∈ D. Fix x ∈ D and δ > 0. Note that

P (range({Xn }n≥1 ) ∩ B(x, δ) ≠ ∅) ≥ P (range({Xn }n≥1 ) ∩ B(x, δ̃) ≠ ∅ ∀δ̃ > 0) = P (x ∈ cl({Xn }n≥1 )) ≥ p,

where the above equality holds because of the previous lemma. Hence, {Xn }n≥1 is probabilistically dense in D with guarantee p.

Corollary 5. Let D ⊆ Rd and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {Xn }n≥1 whose realizations are in D. Moreover, suppose that Xf∗ ≠ ∅ and {Xn }n≥1 is probabilistically dense in D with guarantee p. Then P (algorithm sees the global minimum of f ) ≥ p.

Proof. Let x∗ ∈ Xf∗ . Then

P (algorithm sees the global minimum of f ) = P (cl({Xn }n≥1 ) ∩ Xf∗ ≠ ∅) ≥ P (x∗ ∈ cl({Xn }n≥1 )) = P (algorithm sees the point x∗ ) ≥ p,

where the last inequality follows from Theorem 8.

The next theorem (Theorem 9) shows that if a GARS algorithm satisfies the global search condition in Theorem 3 of Section 3, then the resulting sequence of iterates will be probabilistically dense in D, and so, by Corollary 5, the algorithm sees the global minimum of f with probability 1. Note that this end result can also be obtained from Theorems 3 and 7. Hence, Theorem 9 and Corollary 5 provide an alternative proof that any GARS algorithm satisfying the global search conditions in Theorem 3 also sees the global minimum of f with probability 1.

Theorem 9. Let f be a real-valued function on D ⊆ Rd . Suppose that a GARS algorithm applied to f satisfies Global Search Condition 3 from Theorem 3: For any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that P [Ynk ∈ B(z, δ) ∩ D | σ(Enk −1 )] ≥ ν(z, δ) for some subsequence {nk }k≥1 . Then {Xn }n≥1 is probabilistically dense in D, and consequently, P (x ∈ cl({Xn }n≥1 )) = 1 ∀x ∈ D.

Proof. Fix x ∈ D and δ > 0. We have

P (range({Xn }n≥1 ) ∩ B(x, δ) = ∅) = P (Xn ∉ B(x, δ) ∀n ≥ 1)
= P (Xn ∉ (B(x, δ) ∩ D) ∀n ≥ 1)   (since {Xn }n≥1 ⊆ D for a GARS algorithm)
≤ P (Yn ∉ (B(x, δ) ∩ D) ∀n ≥ 1)
≤ P (Ynk ∉ (B(x, δ) ∩ D) ∀k ≥ 1)
= P ( ⋂_{i=1}^{∞} [Yni ∉ (B(x, δ) ∩ D)] )
= P ( lim_{k→∞} ⋂_{i=1}^{k} [Yni ∉ (B(x, δ) ∩ D)] )
= lim_{k→∞} P ( ⋂_{i=1}^{k} [Yni ∉ (B(x, δ) ∩ D)] )
= lim_{k→∞} ∏_{i=1}^{k} P (Yni ∉ (B(x, δ) ∩ D) | Yn1 ∉ (B(x, δ) ∩ D), . . . , Yni−1 ∉ (B(x, δ) ∩ D))
= lim_{k→∞} ∏_{i=1}^{k} (1 − P (Yni ∈ (B(x, δ) ∩ D) | Yn1 ∉ (B(x, δ) ∩ D), . . . , Yni−1 ∉ (B(x, δ) ∩ D)))
≤ lim_{k→∞} ∏_{i=1}^{k} (1 − ν(x, δ)) = lim_{k→∞} (1 − ν(x, δ))^k = 0   (since 0 < ν(x, δ) < 1).

Hence, P (range({Xn }n≥1 ) ∩ B(x, δ) = ∅) = 0, and so, P (range({Xn }n≥1 ) ∩ B(x, δ) ≠ ∅) = 1 for all x ∈ D and for all δ > 0, i.e., {Xn }n≥1 is probabilistically dense in D. Moreover, by Theorem 8, P (x ∈ cl({Xn }n≥1 )) = 1 ∀x ∈ D.

Following the definition in Stephens and Baritompa (1998), we say that an algorithm localizes the global minimizers of a function f : D → R if its sequence of iterates {Xn }n≥1 satisfies ∅ ≠ slp({Xn }n≥1 ) ⊆ Xf∗ .

Theorem 10. Let D be a compact subset of Rd and suppose a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vectors {Xn }n≥1 whose realizations are in the compact set D. If {Xn }n≥1 is probabilistically dense in D and each of the random vectors in {Xn }n≥1 has a continuous probability distribution, then P (x ∈ slp({Xn }n≥1 )) = 1 for all x ∈ D and P (algorithm does not localize the global minimizers of f ) = 1.

Proof. From Theorem 8, we have P (x ∈ cl({Xn }n≥1 )) = 1 ∀x ∈ D. Fix x ∈ D. Note that

P (x ∈ cl({Xn }n≥1 )) = P (x ∈ slp({Xn }n≥1 )) + P (x ∉ slp({Xn }n≥1 ) and x = Xn for some integer n).

Now

P (x ∉ slp({Xn }n≥1 ) and x = Xn for some integer n) ≤ P (x = Xn for some integer n) = P ( ⋃_{n=1}^{∞} [x = Xn ] ) ≤ Σ_{n=1}^{∞} P (x = Xn ) = 0   (since each Xn has a continuous distribution).

Hence, P (x ∈ slp({Xn }n≥1 )) = P (x ∈ cl({Xn }n≥1 )) = 1. In addition, if x̃ ∈ D \ Xf∗ , then

P (algorithm does not localize the global minimizers of f ) = P (slp({Xn }n≥1 ) ⊈ Xf∗ ) + P (slp({Xn }n≥1 ) = ∅) = P (slp({Xn }n≥1 ) ⊈ Xf∗ ) ≥ P (x̃ ∈ slp({Xn }n≥1 )) = 1,

where the second equality holds because D is compact: every sequence in D has a convergent subsequence, so slp({Xn }n≥1 ) ≠ ∅ always and P (slp({Xn }n≥1 ) = ∅) = 0.

Consider any GARS algorithm that satisfies the global search condition from Theorem 3. From Theorem 9, the sequence of iterates {Xn }n≥1 will be probabilistically dense in D. Hence, in the case where each Xn has a continuous probability distribution, Theorem 10 implies that each point of D is a subsequential limit of the sequence of iterates of the algorithm with probability 1, and so, with probability 1 the algorithm does not localize the global minimizers of f . This suggests that the worst-case convergence rate of the algorithm will be slow (and this will be confirmed in the next section) since it essentially searches the whole domain to find the global minimum. However, if we focus on the sequence of best points {Xn∗ }n≥1 , the next theorem says that any subsequential limit of {Xn∗ }n≥1 is a global minimizer almost surely.

Theorem 11. Let D be a compact subset of Rd and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of iterates {Xn }n≥1 whose sequence of best points found so far {Xn∗ }n≥1 satisfies the condition f (Xn∗ ) −→ f ∗ := inf_{x∈D} f (x) > −∞ a.s. Then slp({Xn∗ }n≥1 ) ⊆ Xf∗ a.s.

Proof. Since f (Xn∗ ) −→ f ∗ a.s., there exists a set N such that P (N ) = 0 and f (Xn∗ (ω)) −→ f ∗ for all ω ∈ N c . Fix ω ∈ N c and let x̃ ∈ D \ Xf∗ . We wish to show that x̃ ∉ slp({Xn∗ (ω)}n≥1 ). From the proof of Theorem 7, there exists x∗ ∈ Xf∗ such that x∗ ∈ cl({Xn (ω)}n≥1 ). Since x̃ ∉ Xf∗ , we have f (x̃) − f (x∗ ) > 0. Now, since f is continuous over D, there is a δ > 0 such that whenever x ∈ B(x∗ , δ), we have

|f (x) − f (x∗ )| = f (x) − f (x∗ ) ≤ (1/2)(f (x̃) − f (x∗ )), or equivalently, f (x) ≤ (1/2)(f (x̃) + f (x∗ )).

Furthermore, since x∗ ∈ cl({Xn (ω)}n≥1 ), there is an integer ñ such that Xñ (ω) ∈ B(x∗ , δ). Note that for any integer n ≥ ñ, we have

f (Xn∗ (ω)) ≤ f (Xñ∗ (ω)) ≤ f (Xñ (ω)) ≤ (1/2)(f (x̃) + f (x∗ )) < f (x̃),

and so, lim sup_{n→∞} f (Xn∗ (ω)) < f (x̃). Suppose x̃ ∈ slp({Xn∗ (ω)}n≥1 ). Then there is a subsequence {Xn∗k (ω)}k≥1 that converges to x̃. Since f is continuous, it follows that f (Xn∗k (ω)) → f (x̃) as k → ∞, and so, lim sup_{n→∞} f (Xn∗ (ω)) ≥ f (x̃), which yields a contradiction. Hence, x̃ ∉ slp({Xn∗ (ω)}n≥1 ). This implies that D \ Xf∗ ⊆ D \ slp({Xn∗ (ω)}n≥1 ), or equivalently, slp({Xn∗ (ω)}n≥1 ) ⊆ Xf∗ . Thus, slp({Xn∗ }n≥1 ) ⊆ Xf∗ a.s.

7

Convergence Rates

In the GARS framework, we allowed for the possibility that the trial random vector iterates {Yn }n≥1 are dependent. Moreover, in the convergence results for the GARS framework, the global search conditions that are necessary for almost sure convergence to the global minimum are only required to hold for some subsequence of the iterations. Because of this, it is hard to carry out a convergence rate analysis for the GARS framework in full generality. However, in the special case of simple random search, where the Yn ’s are all independent and identically distributed random vectors, there have been some convergence analyses (e.g. Spall (2003)). Below, we provide a simple result that applies to the case where the Yn ’s are not necessarily independent and not necessarily identically distributed, but that assumes Global Search Condition 3 in Theorem 3 is satisfied by all the iterates.

Theorem 12. Let f be a real-valued function that has a unique global minimizer x∗ on a set D ⊆ Rd in the sense of Theorem 2 and let f be continuous at x∗ . Suppose that a GARS algorithm satisfies Global Search Condition 3 from Theorem 3 for all iterations: For any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that

P [Yn ∈ B(z, δ) ∩ D | σ(En−1 )] ≥ ν(z, δ).   (6)

Then P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗ , δ))^n .

Proof. By Theorem 3, Xn∗ −→ x∗ a.s. Moreover, the probability that the algorithm lands in the region B(x∗ , δ) ∩ D within n function evaluations is given by

P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n]
= 1 − P [X1 ∉ B(x∗ , δ) ∩ D, X2 ∉ B(x∗ , δ) ∩ D, . . . , Xn ∉ B(x∗ , δ) ∩ D]
= 1 − ∏_{i=1}^{n} P [Xi ∉ B(x∗ , δ) ∩ D | X1 ∉ B(x∗ , δ) ∩ D, . . . , Xi−1 ∉ B(x∗ , δ) ∩ D]
= 1 − ∏_{i=1}^{n} (1 − P [Xi ∈ B(x∗ , δ) ∩ D | X1 ∉ B(x∗ , δ) ∩ D, . . . , Xi−1 ∉ B(x∗ , δ) ∩ D]).

Since P [Xi ∈ B(x∗ , δ) ∩ D | σ(Ei−1 )] ≥ P [Yi ∈ B(x∗ , δ) ∩ D | σ(Ei−1 )] ≥ ν(x∗ , δ) for all i = 1, . . . , n, it follows that P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗ , δ))^n .
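The bound in Theorem 12 is easy to use numerically. The following minimal sketch (Python, with hypothetical values of ν and ξ) inverts the bound 1 − (1 − ν)^n ≥ ξ to obtain the iteration count formula used in the next paragraph.

```python
import math

def required_iterations(nu, xi):
    """Smallest n with 1 - (1 - nu)**n >= xi,
    i.e. n = ceil(log(1 - xi) / log(1 - nu)), from Theorem 12's bound."""
    assert 0.0 < nu < 1.0 and 0.0 < xi < 1.0
    # log1p(-nu) = log(1 - nu), computed accurately even for tiny nu.
    return math.ceil(math.log(1.0 - xi) / math.log1p(-nu))

# Hypothetical values: if every iterate hits B(x*, delta) with conditional
# probability at least nu = 0.001, then n = 4603 iterations suffice to land
# in the delta-neighborhood with probability at least xi = 0.99.
print(required_iterations(nu=1e-3, xi=0.99))  # -> 4603
```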


The above theorem provides a lower bound for the probability that the algorithm lands within a δ-neighborhood of the global minimizer within n iterations (or function evaluations). Now fix the neighborhood radius δ > 0 and the probability requirement 0 < ξ < 1. If we set n = ⌈log(1 − ξ)/ log(1 − ν(x∗ , δ))⌉, then P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. We then determine how large this particular n would be in some special cases.

Corollary 6. Let f be a real-valued function on D = [0, 1]^d such that f ∗ := inf_{x∈D} f (x) > −∞.

Moreover, suppose that f has a unique global minimizer x∗ on D in the sense of Theorem 2 and that f is continuous at x∗ . Consider a GARS algorithm and suppose that for each n ≥ 1, Yn has a conditional density gn (y | σ(En−1 )) satisfying the following condition: h(y) := inf_{n≥1} gn (y | σ(En−1 )) ≥ β for all y ∈ D, where β > 0 is a constant. By Theorem 4, Xn∗ −→ x∗ a.s. Fix the neighborhood radius 0 < δ ≤ 1/2 and the probability requirement 0 < ξ < 1. If we set

n = ⌈ log(1 − ξ) / log( 1 − (βπ^{d/2}/Γ(d/2 + 1)) (δ/2)^d ) ⌉,   (7)

then P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ.

Proof. Fix z ∈ D and 0 < δ ≤ 1/2. For any n ≥ 1, we have

P [Yn ∈ B(z, δ) ∩ D | σ(En−1 )] = ∫_{B(z,δ)∩D} gn (y | σ(En−1 )) dy ≥ ∫_{B(z,δ)∩D} h(y) dy ≥ ∫_{B(z,δ)∩D} β dy = β µ(B(z, δ) ∩ D) ≥ β ψD (δ) = (βπ^{d/2}/Γ(d/2 + 1)) (δ/2)^d =: ν(z, δ),

where ψD (δ) is defined in Theorem 4 and its value in this case is given by Proposition 1. Hence, (6) from Theorem 12 holds, and so, for the above choice of n, we get P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗ , δ))^n ≥ ξ.
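To get a feel for (7), the following sketch (with hypothetical values of δ and ξ) evaluates n across dimensions; working in log-space via math.lgamma avoids overflow of Γ(d/2 + 1). The rapid growth in d seen here is made precise in Proposition 2 below.

```python
import math

def hypercube_iterations(d, delta, xi, beta=1.0):
    """n from (7) for D = [0, 1]^d, where
    nu = beta * pi**(d/2) * (delta/2)**d / Gamma(d/2 + 1)."""
    log_nu = (math.log(beta) + 0.5 * d * math.log(math.pi)
              + d * math.log(delta / 2.0) - math.lgamma(d / 2.0 + 1.0))
    nu = math.exp(log_nu)  # ball-volume lower bound on the hit probability
    return math.ceil(math.log(1.0 - xi) / math.log1p(-nu))

# Hypothetical accuracy/confidence targets: delta = 0.1, xi = 0.99.
for d in (1, 2, 5, 10):
    print(d, hypercube_iterations(d, delta=0.1, xi=0.99))
# d = 1 -> 44, d = 2 -> 585, and the count blows up as d grows.
```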

Consider the special case of Corollary 6 where Yn has the uniform distribution on the closed hypercube D = [0, 1]^d for all n ≥ 1. Then Xn ≡ Yn for all n ≥ 1. In this case, h(y) = 1 for all y ∈ D and we can choose β = 1. Again, fix 0 < δ ≤ 1/2 and 0 < ξ < 1. If we set n equal to the value given by (7) with β = 1, then P [Xi ∈ B(x∗ , δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. Hence, this gives a value of n that guarantees, with probability at least ξ, that the well-known random (“blind”) search algorithm lands within a δ-neighborhood of the global minimizer. But how large is this value of n? That is, what is the complexity of n in terms of the problem dimension d? A minimal sketch of blind search is given below, and Proposition 2 then answers the complexity question.
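As a concrete illustration of the special case just described, here is a minimal sketch of blind random search on [0, 1]^d; the test function and all numerical values are hypothetical, and the iteration count is taken from (7) with d = 2, δ = 0.1, ξ = 0.99 (as computed in the sketch after Corollary 6).

```python
import random

def blind_search(f, d, n, seed=0):
    """Pure random ("blind") search on [0, 1]^d: the iterates Y_1, ..., Y_n
    are i.i.d. uniform, and we track the best point found so far (X_n*)."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n):
        y = [rng.random() for _ in range(d)]  # Y_n ~ Uniform([0, 1]^d)
        fy = f(y)
        if fy < best_f:
            best_x, best_f = y, fy
    return best_x, best_f

# Hypothetical test function with unique global minimizer x* = (0.5, 0.5).
sphere = lambda x: sum((t - 0.5) ** 2 for t in x)

# n = 585 comes from (7) with d = 2, delta = 0.1, xi = 0.99, beta = 1, so
# some iterate lands within distance 0.1 of x* with probability >= 0.99.
print(blind_search(sphere, d=2, n=585))
```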

Proposition 2. Consider the assumptions of Corollary 6. The value of n given in (7) grows exponentially in d.

Proof. From elementary calculus, it is easy to verify that −1/ log(1 − x) ≥ 1/(2x) for all x ∈ (0, 1/2]. Now, since 0 < δ ≤ 1/2 and d ≥ 1, it follows that ((δ/2) π^{1/2})^d ≤ (√π/4)^d and Γ(d/2 + 1) ≥ √π/2, and so,

(δ/2)^d π^{d/2} / Γ(d/2 + 1) ≤ 1/2.

Moreover, observe that 0 < β ≤ 1. (If β > 1, then gn (y | σ(En−1 )) in Corollary 6 cannot be a conditional density for each n ≥ 1.) Hence, (βπ^{d/2}/Γ(d/2 + 1))(δ/2)^d ≤ 1/2, and so, it follows that

−1 / log( 1 − (βπ^{d/2}/Γ(d/2 + 1))(δ/2)^d ) ≥ 2^d Γ(d/2 + 1) / (2βπ^{d/2} δ^d ) ≥ (√π/(4β)) (4/√π)^d .

Multiplying the above inequality by − log(1 − ξ) > 0, it follows that the value of n in (7) above satisfies

n ≥ log(1 − ξ) / log( 1 − (βπ^{d/2}/Γ(d/2 + 1))(δ/2)^d ) ≥ ( −(√π/(4β)) log(1 − ξ) ) (4/√π)^d .

Since −(√π/(4β)) log(1 − ξ) is a strictly positive constant and 4/√π > 2, this shows that the value of n given by (7) grows exponentially in d.

Hence, this proposition shows that the convergence guarantee degrades very quickly as the dimension of the problem increases. However, as pointed out by Spall (2003), the no free lunch theorems (Wolpert and Macready 1997) indicate that this simple GARS algorithm is no worse than any other algorithm when performance is averaged over the entire range of possible optimization problems. In practice though, some algorithms are tailored to perform better than others on certain classes of problems with specific characteristics.
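The lower bound in the proof can be tabulated directly; the short sketch below (β and ξ are hypothetical choices) prints c · (4/√π)^d, which at least doubles with each added dimension since 4/√π > 2.

```python
import math

# Proposition 2's bound: n >= c * (4 / sqrt(pi))**d, where
# c = -(sqrt(pi) / (4 * beta)) * log(1 - xi) > 0. With 4/sqrt(pi) ~ 2.26,
# the guaranteed iteration count at least doubles per added dimension.
beta, xi = 1.0, 0.99
c = -(math.sqrt(math.pi) / (4.0 * beta)) * math.log(1.0 - xi)
for d in (1, 2, 5, 10, 20):
    print(d, math.ceil(c * (4.0 / math.sqrt(math.pi)) ** d))
```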

8

Summary

We proved some results that guarantee almost sure convergence to the global minimum for a general class of adaptive stochastic search algorithms that follow the GARS (Generalized Adaptive Random Search) framework. The GARS framework is an extension of the well-known simple (“blind”) random search algorithm where the random iterates are not necessarily independent. In the GARS framework, we allow for the possibility that a number of intermediate random elements are first generated before the trial random vector iterate is computed. Moreover, if the trial iterate falls outside the domain of the problem, then it is mapped to the domain via an absorbing transformation. By imposing some conditions on the random vector iterates and the probability distributions that generate them (i.e., the global search conditions in Theorems 1 and 3 and subsequent theorems), the convergence of a GARS algorithm that satisfies these conditions is guaranteed. In addition, the global search condition only needs to be satisfied by a subsequence of the iterations in order to guarantee convergence. This makes the results applicable to a wide range of practical stochastic global optimization algorithms, including those that perform both local and global search and also including those that combine both stochastic and deterministic search strategies.

We also proved convergence results (Theorems 4 and 5) that involve simple requirements on the conditional densities of the trial random vector iterates that are easy to verify in practice. Moreover, in Theorem 6, we provided some simple conditions that guarantee convergence when using an elliptical distribution, such as the multivariate Normal or Cauchy distribution, to generate the trial random vector iterates. In Section 5, we provided convergence proofs for some practical stochastic global optimization algorithms, including an evolutionary programming algorithm with greedy parent selection. In Section 6, we explored some connections with the results by Stephens and Baritompa (1998). In particular, we showed that for stochastic global minimization algorithms applied to continuous functions defined over compact domains, converging almost surely to the global minimum is equivalent to seeing the global minimum as defined by Stephens and Baritompa (1998). Moreover, we introduced the notion of a sequence of random vector iterates being probabilistically dense in the domain and showed that this is also equivalent to seeing the global minimum with probability 1 under the usual assumptions. In addition, we proved that a GARS algorithm satisfying the global search condition in Theorem 3 generates a sequence of iterates that is probabilistically dense in the domain, and consequently, the algorithm sees any point of the domain (including the global minimizers) with probability 1. Finally, in Section 7, we proved some simple results on the convergence rate of a GARS algorithm.

Acknowledgements I would like to thank Prof. Shane Henderson from the School of Operations Research & Information Engineering at Cornell University for some helpful comments during the early stages of this paper. I would also like to thank the anonymous referees for their helpful comments and suggestions.


References

1. Bäck, T. Evolutionary Algorithms in Theory and Practice. Oxford University Press: New York; 1996.
2. Bäck, T., Rudolph, G., and Schwefel, H.-P. Evolutionary programming and evolution strategies: similarities and differences. In: Fogel, D.B. and Atmar, J.W. (Eds.), Proceedings of the Second Annual Conference on Evolutionary Programming. Evolutionary Programming Society: La Jolla, CA; 1993. pp. 11–22.
3. Chin, D.C. Comparative study of stochastic algorithms for system optimization based on gradient approximations. IEEE Transactions on Systems, Man, and Cybernetics - B 1997; 27; 244–249.
4. Fang, K.-T. and Zhang, Y.-T. Generalized Multivariate Analysis. Science Press: Beijing. Springer-Verlag: Berlin; 1990.
5. Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics 1952; 23(3); 462–466.
6. Maryak, J.L. and Chin, D.C. Global random optimization by simultaneous perturbation stochastic approximation. IEEE Transactions on Automatic Control 2008; 53(3); 780–783.
7. Pinter, J.D. Global Optimization in Action. Kluwer Academic Publishers: Dordrecht; 1996.
8. Resnick, S.I. A Probability Path. Birkhäuser: Boston; 1999.
9. Solis, F.J. and Wets, R.J.-B. Minimization by random search techniques. Mathematics of Operations Research 1981; 6(1); 19–30.
10. Spall, J.C. Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc.: New Jersey; 2003.
11. Spall, J.C. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 1992; 37; 332–341.
12. Stephens, C.P. and Baritompa, W. Global optimization requires global information. Journal of Optimization Theory and Applications 1998; 96(3); 575–588.


13. Wolpert, D.H. and Macready, W.G. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1997; 1(1); 67–82.
14. Yao, X., Liu, Y., and Lin, G. Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation 1999; 3(2); 82–102.
15. Zabinsky, Z.B. Stochastic Adaptive Search in Global Optimization. Kluwer Academic Publishers; 2003.
16. Zabinsky, Z.B. and Smith, R.B. Pure adaptive search in global optimization. Mathematical Programming 1992; 53; 323–338.
17. Zhigljavsky, A.A. Theory of Global Random Search. Kluwer Academic Publishers: Dordrecht; 1991.
18. Törn, A. and Zilinskas, A. Global Optimization. Springer-Verlag: Berlin, Germany; 1989.
