SIAM J. OPTIM. Vol. 9, No. 1, pp. 53–61

© 1998 Society for Industrial and Applied Mathematics

STOCHASTIC SIMULATION ON INTEGER CONSTRAINT SETS*

I. H. DINWOODIE†

Abstract. Bounds are given on the number of steps sufficient for convergence of simulation algorithms on domains of nonnegative integer constraint sets.

Key words. Markov chains, eigenvalues, annealing, integer optimization

AMS subject classifications. 60J20, 65K05, 90C10

PII. S1052623496313842

*Received by the editors December 18, 1996; accepted for publication (in revised form) January 5, 1998; published electronically October 30, 1998. http://www.siam.org/journals/siopt/9-1/31384.html

†Department of Mathematics, Tulane University, New Orleans, LA 70118 (ihd@math.tulane.edu).

1. Introduction. This article is concerned with convergence of Markov chains on nonnegative integer constraint sets and with applications to simulated annealing algorithms for optimization. Despite the lack of applicable results on its performance, the annealing algorithm is used for optimization of nonlinear functions on discrete domains. One application of the algorithm is finding modes of probability distributions on finite sets, a problem which arises in Bayesian statistics and image analysis (see [6] and [14]). It is used for other problems in combinatorial optimization as well, some of which are described in [13]. Here we are interested in domains of nonnegative integer lattice points on hyperplanes, which arise in integer optimization problems, image analysis, and statistics. Symmetric Markov chains on these domains were constructed in [4] using algebraic techniques. Whereas optimal cooling schedules have been widely studied [1], [7], [8], [9], we are interested in establishing clear bounds on the time required for a given accuracy $\delta > 0$ and reliability $\varepsilon > 0$. The main result is Theorem 3.1, which gives a sufficient number of steps in the algorithm in a computable form. We expect that the results can be improved as new technology in Markov chains becomes available. They are based on geometrical techniques from [3] and [9], which are also used in the more general and abstract study [10].

Let us establish some notation. Let $\mu > 0$ be a probability distribution on the set $S = \{x \in \mathbb{Z}_+^d : A(x) = b \in \mathbb{Z}_+^r\}$, where $A$ is a linear map or matrix with nonnegative integer entries such that $S$ is finite. We show first in section 2 how to simulate from $\mu$ using techniques from [4]; we then get convergence rates from eigenvalue estimates and apply these results in section 3 to the case where $\mu$ is chosen to put most of its mass where $f$ is small. Our Markov chain is homogeneous, which means that the parameter $\beta$ corresponding to the reciprocal of temperature is held fixed over time at a level which gives the desired stationary distribution.

2. The algorithm. We define a symmetric Markov chain on $S$ as follows. Let $\mathbb{Q}(\xi_1, \ldots, \xi_d)$ be the ring of polynomials in variables $\xi_1, \ldots, \xi_d$ with coefficients in the rational numbers $\mathbb{Q}$. If $x$ is a vector of nonnegative integers, define $X^x = \xi_1^{x_1} \cdots \xi_d^{x_d}$. Let $M = \{g_1, \ldots, g_m\}$, where $g_i \in \mathbb{Z}^d$ and $\{X^{g_i^+} - X^{g_i^-} : i = 1, \ldots, m\}$ is a Gröbner basis for the ideal $I_A$ in $\mathbb{Q}(\xi_1, \ldots, \xi_d)$ given by $I_A = \langle X^v - X^w : A(v) = A(w) \rangle$.


Recall that $g = g^+ - g^-$, where $g^+ = \max\{g, 0\}$ and $g^- = \max\{-g, 0\}$. Our ordering on monomials is purely lexicographic, based on the indeterminate ordering $\xi_1 > \xi_2 > \cdots > \xi_d$.

Now define a Markov chain on $S$ as follows. Let $x = (x_1, \ldots, x_d)$ denote an element of $S$, and let $N_x$ be the set of its neighbors in $S$, so $N_x = \{x \pm g_i : x \pm g_i \ge 0,\ g_i \in M\}$. Let $K(x, \cdot)$ be the probability vector

$$K(x, y) = \frac{1}{2m} \quad \text{for } y \in N_x.$$

Also, $K(x, x) = 1 - \sum_{y \in N_x} K(x, y)$ makes the vector sum to 1 and will be positive precisely when $|N_x| < 2m$. The transition can be realized by uniformly choosing an element $\pm g_i$ from among the $2m$ choices $\{\pm g_1, \ldots, \pm g_m\}$ and adding it to $x$ if the result is nonnegative. Then $K$ is symmetric, irreducible, and aperiodic. Recall that $\mu > 0$ is an arbitrary distribution on $S$. Let $K_\mu$ be the transition matrix given by

$$K_\mu(x, y) = K(x, y)\min\{\mu(y)/\mu(x), 1\}, \qquad x \ne y,$$

and the holding probability makes the matrix stochastic.
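The transition mechanism is straightforward to implement once the move set $M$ is in hand. The following Python sketch is illustrative and not part of the original algorithm statement: the names are ours, moves holds the vectors $g_i$, and mu may be any unnormalized weight function, since only the ratios $\mu(y)/\mu(x)$ enter.

```python
import random

def metropolis_step(x, moves, mu):
    """One step of K_mu: propose y = x + s*g with (g, s) uniform over the
    2m choices {+/-g_1, ..., +/-g_m}; hold if y leaves the nonnegative
    orthant (the moves already preserve A(x) = b), otherwise accept with
    probability min{mu(y)/mu(x), 1}."""
    g = random.choice(moves)
    s = random.choice((1, -1))
    y = tuple(xi + s * gi for xi, gi in zip(x, g))
    if min(y) < 0:
        return x                  # holding contributes to K(x, x)
    if random.random() < min(mu(y) / mu(x), 1.0):
        return y
    return x                      # Metropolis rejection
```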

Let the spectrum of $K_\mu$ be $1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_{|S|} > -1$, and let $\gamma = 1 - \lambda_2$. In the next section, $\mu$ will depend on a parameter $\beta > 0$ interpreted as the reciprocal of temperature. To estimate $\gamma$, observe that

$$\gamma = \inf_\phi \frac{\langle \phi, (I - K_\mu)\phi \rangle_\mu}{\|\phi - \langle \phi, 1 \rangle_\mu 1\|_\mu^2} = \inf_\phi \frac{\sum_{(x,y)} (\phi(x) - \phi(y))^2\, \mu(x) K_\mu(x, y)}{\sum_{(x,y)} (\phi(x) - \phi(y))^2\, \mu(x)\mu(y)} = \inf_\phi \frac{\sum_{e \in G} \phi(e)^2 Q(e)}{\sum_{f \in G_c} \phi(f)^2 Q_c(f)},$$
where $\phi$ ranges through nonconstant functions on the state space $S$; $G$ is the graph with edge $e = \{x, y\}$ if and only if $Q(e) = \mu(x)K_\mu(x, y)\ (= \mu(y)K_\mu(y, x)) > 0$; and $G_c$ is the complete graph with edges $f = \{x, y\}$ connecting all pairs of distinct points, with $Q_c(f) = \mu(x)\mu(y)$ and $\phi(f)^2 = (\phi(y) - \phi(x))^2$. This representation can be used as in [3] and [9] to bound $1/\gamma$ from above in the form of a Poincaré inequality, which we use in Lemma 2.1 below.

Define the following quantities. Let $\mu_{\max} = \max\{\mu(x)\}$, $\mu_{\min} = \min\{\mu(x)\}$, and $\rho = \mu_{\max}/\mu_{\min}$. The following eigenvalue estimate uses a result of [3], which is implicit in [9]. Recall that $m$ is the number of moves in the set $M$.

Lemma 2.1. Let $\gamma = 1 - \lambda_2$. Then $1/\gamma \le m\rho\,\mu_{\max}|S|^3$.

Proof. If $x > y$ are two points in $S$ ordered lexicographically, form a path from $x$ to $y$ by dividing the monomial difference $X^x - X^y$ by the Gröbner basis $\{X^{g_i^+} - X^{g_i^-} : 1 \le i \le m\}$, ordered in some arbitrary but fixed way. The multidegrees of the lead terms in the division give a path $p_{xy}$ which joins the endpoints $x$ and $y$ with decreasing (in lexicographic order) path segments to a common point in $S$. The number of edges in this path $p_{xy}$ is no greater than $\#x - 1$, where $\#x$ is the rank of $x$ in the set $S$ using lexicographic order (the smallest element of $S$ in lexicographic order has rank 1; the largest has rank $|S|$).


Define the measure of length $|p_{xy}|_Q = \sum_{e \in p_{xy}} Q(e)^{-1}$. Proposition 1 of [3] shows that

$$1/\gamma \le \max_e \sum_{p_{xy} \ni e} |p_{xy}|_Q\, \mu(x)\mu(y),$$

where $p_{xy}$ is the path from $x$ to $y$ constructed above and $e$ is an edge in the graph on $S$, say $\{a, a \pm g_\alpha\}$, with $Q(e) = \mu(a)K_\mu(a, a \pm g_\alpha) = \mu(a)K(a, a \pm g_\alpha)\min\{\mu(a \pm g_\alpha)/\mu(a), 1\}$. Then $Q(e)^{-1} \le 2m/\mu_{\min}$. For $x \in S$, the maximum number of edges in a path joining $x$ to $y < x$ is $\#x - 1$, so $|p_{xy}|_Q \le (\#x - 1)2m/\mu_{\min}$. With $e = \{a, a - g_\alpha\}$, the paths through $e$ can be partitioned into collections starting at the different maximal values $x \ge a$. There can be at most $\#x - 1$ paths starting from $x \ge a$ and going to some point less than $x$, and the length of each is at most $\#x - 1$. Thus

$$\sum_{p_{xy} \ni e} |p_{xy}|_Q\, \mu(x)\mu(y) \le (2m/\mu_{\min}) \sum_{x \ge a} (\#x - 1)^2 \mu_{\max}^2 \le (2m\,\mu_{\max}^2/\mu_{\min}) \sum_x (\#x - 1)^2 = (2m\,\mu_{\max}^2/\mu_{\min})\,(|S| - 1)|S|(2|S| - 1)/6 \le m\rho\,\mu_{\max}|S|^3.$$

The main contribution of the bound in Lemma 2.1 is the estimate for $\gamma$ when $\mu$ is uniform. Since the paths in the estimate joining two points do not change with $\mu$, they cannot yield the optimal result for a particular $\mu$ (see [9]). On the other hand, the technique gives a practical estimate of the path length $|p_{xy}|_Q$. In some examples, one can find a shorter path by going through intermediate points and using various orderings on the variables $\xi_1, \ldots, \xi_d$ for different path segments. This applies in particular to Example 3.3.

Finally, let $T_\mu$ be the Markov chain obtained by running $K_\mu$ a Poisson(1) number of steps each time, to avoid complications from negative eigenvalues. If we let $\Delta = K_\mu - I$ be the generator for $K_\mu$, we can define

(2.1) $$T_\mu = e^\Delta,$$

which is reversible with stationary distribution $\mu$. The spectrum of $e^\Delta$ is $\{\exp(\lambda_i - 1),\ i = 1, \ldots, |S|\}$. Recall that any reversible and irreducible kernel $T$ with stationary distribution $\mu > 0$ acts in two ways: multiplying the matrix $T$ with a column vector on the right gives a self-adjoint operator on $L^2(\mu)$ which leaves $1^\perp$ invariant, and multiplying $T$ with a row vector on the left gives a self-adjoint operator on $L^2(1/\mu)$ which leaves $\mu^\perp$ invariant. If $\langle \cdot, \cdot \rangle_{1/\mu}$ denotes the inner product in $L^2(1/\mu)$, then

$$|T^n(x, A) - \mu(A)| = \langle (\delta_x - \mu)T^n, (I_A - \mu(A))\mu \rangle_{1/\mu} \le \|\delta_x - \mu\|_{1/\mu}\, \|T^n\|_{1/\mu}\, \|(I_A - \mu(A))\mu\|_{1/\mu},$$

where $\|\delta_x - \mu\|_{1/\mu} \le 1/\sqrt{\mu(x)}$, $\|(I_A - \mu(A))\mu\|_{1/\mu} \le 1/2$, and $\|T^n\|_{1/\mu}$ is its largest eigenvalue as an operator restricted to $\mu^\perp$, which in our situation is $\exp(-n\gamma)$.

For two distributions $\mu$ and $\nu$ on $S$, define the total variation distance $\|\nu - \mu\| = \sup_{A \subset S} |\nu(A) - \mu(A)|$.

Proposition 2.1. Let $\varepsilon > 0$, and let $T_\mu$ be defined in (2.1). Then $\|T_\mu^n(x, \cdot) - \mu\| \le \varepsilon$ for all $n \ge -m\rho\,\mu_{\max}|S|^3 \log(\varepsilon\sqrt{\mu(x)})$.

Proof. Since $\|T_\mu^n(x, \cdot) - \mu\| \le (1/\sqrt{\mu(x)})\exp(-n\gamma)$ (see [3]), it is sufficient that $n \ge -\log(\varepsilon\sqrt{\mu(x)})/\gamma$, which by Lemma 2.1 follows if $n \ge -\log(\varepsilon\sqrt{\mu(x)})\,(m\rho\,\mu_{\max}|S|^3)$.
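In simulation, the Poissonization in (2.1) simply means drawing a Poisson(1) number of $K_\mu$-steps per iteration of $T_\mu$. A minimal sketch, building on the hypothetical metropolis_step above:

```python
import math
import random

def poisson1():
    """Poisson(1) variate by inversion: count how many uniforms can be
    multiplied together before the product drops below e^{-1}."""
    k, p, threshold = 0, random.random(), math.exp(-1.0)
    while p > threshold:
        p *= random.random()
        k += 1
    return k

def t_mu_step(x, moves, mu):
    """One step of T_mu = e^Delta: a Poisson(1) number of K_mu-steps."""
    for _ in range(poisson1()):
        x = metropolis_step(x, moves, mu)
    return x
```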


3. Application to simulated annealing. Let $f : S \to \mathbb{R}$, let $f_{\min} = \min\{f(x) : x \in S\} = f(x^*)$ and $f_{\max} = \max\{f(x) : x \in S\}$, and set $h = f_{\max} - f_{\min}$. Our optimization problem is to minimize $f$. Let $\mu_\beta$ be the Gibbs measure on $S$ given by $\mu_\beta(x) = e^{-\beta f(x)}|S|^{-1}/\phi(-\beta)$, where $\phi$ is the moment generating function for $f$ with respect to the uniform distribution on $S$:

(3.1) $$\phi(t) = \sum_x \frac{e^{tf(x)}}{|S|}.$$

Also, let $\phi_\beta(t) = E_\beta e^{tf} = \phi(t - \beta)/\phi(-\beta)$. Let $(\log\phi)^*$ denote the convex conjugate of the function $\log\phi$, given by $(\log\phi)^*(a) = \sup_{t \in \mathbb{R}}\{ta - \log\phi(t)\}$. Recall that $\mu_\beta$ converges, as $\beta$ gets large, to the uniform distribution on the points where $f$ attains its minimum; Lemma 3.1 below makes this precise.

Lemma 3.1. Let $\delta > 0$, let $\varepsilon > 0$, and set $a = f_{\min} + \delta$. Let $\beta > 0$ be sufficiently large that both $E_\beta(f) < a$ and $a(-\beta) - \log\phi(-\beta) \le \log(\varepsilon) + (\log\phi)^*(a)$. Then $\mu_\beta(\{x : f(x) > a\}) \le \varepsilon$.

Proof. Clearly $\mu_\beta(\{x : f(x) > a\}) \le e^{-ta}E_\beta e^{tf} = e^{-ta}\phi_\beta(t)$ for all $t \ge 0$, so

$$\begin{aligned}
\log\mu_\beta(\{x : f(x) > a\}) &\le -\sup_{t \ge 0}\{ta - \log\phi_\beta(t)\}\\
&= -\sup_{t \in \mathbb{R}}\{ta - \log\phi_\beta(t)\} \quad (\text{since } a > E_\beta(f))\\
&= -\sup_t\{(t - \beta)a - \log\phi(t - \beta)\} - a\beta - \log\phi(-\beta)\\
&= -(\log\phi)^*(a) - \beta a - \log\phi(-\beta).
\end{aligned}$$

For $\beta$ as in the hypothesis, the right side is at most $\log(\varepsilon)$, and the result follows.

Let $x = X_0, X_1, \ldots$ be the Markov chain defined in (2.1) with stationary distribution $\mu_\beta$, transition matrix $T_\beta = e^\Delta$ with $\Delta = K_\beta - I$, and probability measure $P_x^\beta$ on the sample space. Note that $\max\{\mu_\beta(x)\} = \mu_\beta(x^*) = \mu_{\beta,\max}$.

Theorem 3.1. Let $a = f_{\min} + \delta$ for $\delta > 0$, and let $\beta > 0$ be sufficiently large that $E_\beta(f) < a$ and $a(-\beta) - \log\phi(-\beta) \le \log(\varepsilon) + (\log\phi)^*(a)$. Then $P_x^\beta\{f(X_n) > a\} \le 2\varepsilon$ for all $n \ge -m\,\mu_{\beta,\max}e^{\beta h}|S|^3\log(\varepsilon\sqrt{\mu_\beta(x)})$.

Remark. For the uniform distribution, the result simplifies to $n \ge -m|S|^2\log(\varepsilon/\sqrt{|S|})$.

Proof. First,

$$P_x^\beta\{f(X_n) > a\} = P_x^\beta\{X_n \in f^{-1}(a, \infty)\} - \mu_\beta(f^{-1}(a, \infty)) + \mu_\beta(f^{-1}(a, \infty)) \le \|T_\beta^n(x, \cdot) - \mu_\beta\| + \mu_\beta(f^{-1}(a, \infty)).$$

By Proposition 2.1, $\|T_\beta^n(x, \cdot) - \mu_\beta\| \le \varepsilon$ if $n \ge -m\,\mu_{\beta,\max}e^{\beta h}|S|^3\log(\varepsilon\sqrt{\mu_\beta(x)})$, since the parameter $\rho = \max\{\mu_\beta\}/\min\{\mu_\beta\} = \exp(\beta(f_{\max} - f_{\min})) = \exp(\beta h)$. In addition, $\mu_\beta(f^{-1}(a, \infty)) \le \varepsilon$ if $\beta$ is sufficiently large (depending on $\delta$ and $\varepsilon$) by Lemma 3.1.

Theorem 3.1 suggests the following. If $\varepsilon = 1/2$ and the Markov chain starts at a point $x$ not far from $x^*$, so that $\mu_\beta(x) \approx \mu_\beta(x^*) \approx 1$, then a sufficient number of iterations is roughly $me^{\beta h}|S|^3$. This does not give the exact exponential rate of growth in $\beta$ of [9], but the bound leaves no unspecified constants and does not rely on detailed knowledge of the function $f$. It may be simpler and faster to get the same reliability by running several independent chains with moderate values of $\beta$ and observing the minimum over all these chains.
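The bound of Theorem 3.1 is directly computable once $m$, $|S|$, $h$, $\mu_{\beta,\max}$, and $\mu_\beta(x)$ are known or estimated. A small illustrative helper (the function and argument names are ours, not from the paper):

```python
import math

def sufficient_steps(m, S_size, beta, h, mu_max, mu_x, eps):
    """Evaluate the step count of Theorem 3.1:
    n >= -m * mu_max * exp(beta*h) * |S|^3 * log(eps * sqrt(mu_x))."""
    n = (-m * mu_max * math.exp(beta * h) * S_size ** 3
         * math.log(eps * math.sqrt(mu_x)))
    return max(0, math.ceil(n))

# Uniform case (beta = 0): mu_max = mu_x = 1/|S|, and the bound reduces
# to -m*|S|^2*log(eps/sqrt(|S|)) as in the remark; here about 29958.
print(sufficient_steps(m=1, S_size=100, beta=0.0, h=0.0,
                       mu_max=1 / 100, mu_x=1 / 100, eps=0.5))
```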


Example 3.1. Let $S = \{x \in \mathbb{Z}_+^2 : x_1 + x_2 = k - 1\}$ for an integer $k > 0$. Then $|S| = k$. Let $\beta = 0$, which corresponds to infinite temperature. Then $\mu_\beta$ is the uniform distribution on $S = \{(0, k-1), \ldots, (k-1, 0)\}$. A Gröbner basis consists of the difference $\xi_1 - \xi_2$, which corresponds to the vector $g_1 = (1, -1)$, and the Markov chain $K_0$ is essentially a reflecting random walk with second largest eigenvalue $\cos(\pi/k)$ [5, p. 389]. Then the chain $T_0 = e^\Delta$ has second largest eigenvalue $\exp(\cos(\pi/k) - 1) \approx \exp(-(\pi^2/2)/k^2)$. On the other hand, the bound of Lemma 2.1 gives $\gamma \ge |S|^{-2} = k^{-2}$ (with $m = 1$, $\rho = 1$, $\mu_{\max} = 1/k$), so the second eigenvalue $\exp(-\gamma) \le \exp(-1/k^2)$, which is not an unreasonable bound. For $\beta > 0$ and the objective function $f$, Theorem 3.1 says that a sufficient number of steps is $-e^{\beta h}k^2(k\mu_{\beta,\max})\log(\varepsilon\sqrt{\mu_\beta(x)})$. The dependence of $\beta$ on the desired $\delta > 0$ and $\varepsilon > 0$ is explained in Lemma 3.1 and requires some estimates of the moment generating function $\phi$.

Example 3.2. Consider the knapsack problem [11, p. 14] with general increasing utility functions $f_i : \mathbb{R}_+ \to \mathbb{R}_+$, $f_i(0) = 0$. The problem is to maximize $f_1(q_1) + \cdots + f_d(q_d)$, where $q_i$ represents a nonnegative integer quantity of object $i$, subject to the overall weight constraint $\langle w, q \rangle \le W$, where $w = (w_1, \ldots, w_d)$ is a vector of positive integer weights and $W$ is a positive integer. Add a slack variable $x_{d+1}$ so that its corresponding indeterminate $\xi_{d+1}$ is less than $\xi_d$; this ordering is important to get a simple Gröbner basis. For $x = (x_1, \ldots, x_d, x_{d+1}) \in \mathbb{Z}_+^{d+1}$, let $f(x) = -(f_1(x_1) + \cdots + f_d(x_d))$. Then the problem becomes one of minimizing $f(x)$ over $S$, where $S \subset \mathbb{Z}_+^{d+1}$ is defined by $w_1x_1 + \cdots + w_dx_d + x_{d+1} = W$. To apply Theorem 3.1, observe that $h = f_{\max} - f_{\min} = 0 + \max\{f_1(q_1) + \cdots + f_d(q_d)\} \le \sum_{i=1}^d f_i(W/w_i)$. Also, $|S|$ is the coefficient of $z^W$ in the generating function

$$g(z) = \frac{1}{1 - z}\prod_{i=1}^d \frac{1}{1 - z^{w_i}},$$

which is easily computed and satisfies $|S| \le (W + d)^d/d!$. By Schur's theorem [17, p. 90], $|S|$ is well approximated by $W^d/(d!\,w_1\cdots w_d)$ when $W$ is large compared to $d$. The set of $d$ moves on $S$ coming from a Gröbner basis is, in general, easy to describe: each corresponds to incrementing a coordinate 1 through $d$ by $\pm 1$ and adjusting the slack variable accordingly to maintain the equation $w_1x_1 + \cdots + w_dx_d + x_{d+1} = W$ (cf. Proposition 3.1). Theorem 3.1 says that a sufficient number of steps for the uniform distribution ($\beta = 0$) is roughly $-d|S|^2\log(\varepsilon/\sqrt{|S|})$, or essentially $dW^{2d}/(d!^2w_1^2\cdots w_d^2)$ if we ignore the logarithm. For $\beta > 0$, the quantity is roughly $-de^{\beta h}W^{3d}\log(\varepsilon)/(d!^3w_1^3\cdots w_d^3)$. This quantity is polynomial in $W$ for fixed $d$ but is not polynomial in both $W$ and $d$. The Ibarra–Kim theorem [15, p. 262] indicates that when the objective function is linear, there exists a dynamic programming "approximation" algorithm that is polynomial in both $d$ and $W$, which would in theory be superior. The annealing method has the advantage in terms of generality, since it can easily be applied with any objective function. Neither the annealing algorithm nor the exact dynamic programming approach is fully polynomial in all the parameters.
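The coefficient of $z^W$ in $g(z)$ can be extracted by multiplying the factors as truncated power series, which is the standard coin-change recursion. The following sketch (function name ours) computes $|S|$ this way.

```python
def fiber_size(weights, W):
    """Coefficient of z^W in (1/(1-z)) * prod_i 1/(1-z^{w_i}), i.e. the
    number of (x_1, ..., x_d, slack) >= 0 with sum w_i x_i + slack = W."""
    coeff = [1] * (W + 1)          # series 1/(1-z), truncated at z^W
    for w in weights:
        for k in range(w, W + 1):  # multiply by 1/(1-z^w) in place
            coeff[k] += coeff[k - w]
    return coeff[W]

print(fiber_size([2, 3], 10))      # 14 points with 2x1 + 3x2 + slack = 10
```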


In general, inequality constraints such as in the knapsack problem can be treated quite simply. Consider the situation where a state space $S_0$ is the intersection of a finite number of half-spaces, say $S_0 = \{x \in \mathbb{Z}_+^d : Ax \le b\}$, where $b \in \mathbb{Z}_+^r$ and the nonnegative integer matrix $A$ of rank $r$ is such that $|S_0| < \infty$. This includes the knapsack problem of Example 3.2. By adding $r$ slack variables, $S_0$ is equivalent to a finite state space $S$ in $\mathbb{Z}_+^{d+r}$ with equality constraints. For the algebra, we need $r$ new indeterminates $\psi_i$, which we order $\xi_1 > \cdots > \xi_d > \psi_1 > \cdots > \psi_r$. A Gröbner basis for the appropriate ideal is the generating set $\{\xi_i - \Psi^{A(e_i)},\ 1 \le i \le d\}$, with $e_i$ the $i$th basis element of $\mathbb{R}^d$, since the S-polynomials leave remainder 0 when divided by this set. These polynomials correspond to $d$ moves given by incrementing or decrementing each of the original coordinates, chosen uniformly from the $d$ choices, and adjusting the slack variables accordingly. This is then a special case of the Markov chain described in [2] for uniform generation within an arbitrary convex set of lattice points in $\mathbb{Z}^d$.

A path between points $x$ and $y$ can be constructed by moving within $S_0$ along the edges of a $d$-dimensional rectangle as follows. Let $z = \min\{x, y\}$ (taken coordinatewise), which belongs to $S_0$. Join $x$ to $z$ by decrementing the first coordinate, then the second, and so on; then join $y$ to $z$ in the same manner. The two segments form a path from $x$ to $y$ within $S_0$. This path can be somewhat shorter than the one that arises from the division algorithm. The number of steps in the path $p_{xy}$ is $\sum_i |x_i - y_i| \le 2\max\{\sum_i u_i : u \in S_0\}$. Let $a_i - 1$ bound the $i$th coordinate of elements in $S_0$, so $\max\{x_i : (x_1, \ldots, x_d) \in S_0\} \le a_i - 1$, $i = 1, \ldots, d$.

Proposition 3.1. Let $S_0$ be as above with $D = \max\{\sum_i x_i : x \in S_0\}$, and let $A = \prod_{i=1}^d a_i$. Let $T_\beta$ be the Markov transition operator on $S_0$ (or equivalently $S$) defined in (2.1). Then $\|T_\beta^n(x, \cdot) - \mu_\beta\| \le \varepsilon$ for all $n \ge -d2^{d+1}e^{\beta h}\mu_{\beta,\max}AD^2\log(\varepsilon\sqrt{\mu_\beta(x)})$.

Remark. When $\beta = 0$, $\mu_{\beta,\max} = |S|^{-1}$, so the bound is roughly $-d2^{d+1}D^2(A/|S|)\log(\varepsilon/\sqrt{|S|})$ for the uniform distribution. For a rectangular region, $A/|S| = 1$, and this specializes to a quantity on the order of $d2^{d+1}D^2$. $D$ measures the diameter of $S_0$ in the $L^1$ sense, so this is consistent with the results of [2], which suggest that for most convex regions in dimension $d$ the number of steps required is on the order of the squared diameter. The bounds of Theorem 3.1 and Proposition 3.1 are comparable when $|S|$ is comparable to $D$.

Proof. First we estimate the gap $\gamma = 1 - \lambda_2$ for $K_\beta$. With Proposition 1′ of [3], we see that $1/\gamma \le \max_e Q(e)^{-1}\sum_{p_{xy} \ni e} |p_{xy}|\,\mu_\beta(x)\mu_\beta(y)$, where $|p_{xy}| \le 2D$ denotes the number of edges in the path joining points $x$ and $y$. Now $Q(e)^{-1} \le 2d/\mu_{\beta,\min}$. To bound $\sum_{p_{xy} \ni e} 1$, consider the edge $e$ joining vertices $u = (u_1, u_2, \ldots, u_i, \ldots, u_d)$ and $u - e_i$ (here $e_i$ is the $i$th basis element in $\mathbb{R}^d$). We first count all the ordered pairs $(x, y) \in S_0 \times S_0$ such that their connecting path can traverse $e$; half this number will then bound the number of unordered pairs. The edge $e$ is either on the path from $x$ down to $z = \min\{x, y\}$ or on the path from $y$ down to $z$. Since the $i$th coordinate is changed along the edge $e$, the coordinates $1, \ldots, i-1$ remain fixed throughout the remaining part of the path; thus $u_j = \min\{x_j, y_j\}$, $j = 1, \ldots, i-1$. Also, since coordinates $i+1, \ldots, d$ have not yet been visited, $u_j = x_j$ for $j = i+1, \ldots, d$, or $u_j = y_j$ for $j = i+1, \ldots, d$, depending on whether the edge is in the segment from $x$ to $z$ or from $y$ to $z$. Partition the pairs $(x, y)$ whose connecting path goes through $e$ into $2^{i-1}$ groups, where a group is identified with a sequence of 0's and 1's of length $i - 1$: a 0 in the $j$th place indicates $x_j \ge y_j$, whereas a 1 indicates $x_j < y_j$, $1 \le j \le i-1$.


Now the size of each of these groups is at most $2a_1a_2\cdots a_{i-1}a_iDa_{i+1}\cdots a_d = 2AD$, as follows. For a particular sequence of 0's and 1's, the possible values of $(x, y)$ are constrained so that $x_j = u_j$ for indices $j \le i-1$ where there is a 1 (indicating $x_j < y_j$, and thus $x_j = \min(x_j, y_j) = u_j$), and similarly $y_j = u_j$ for indices $j \le i-1$ where there is a 0 (indicating $y_j \le x_j$, and thus $y_j = \min(x_j, y_j)$). For indices $i < j \le d$, either $(x_{i+1}, \ldots, x_d) = (u_{i+1}, \ldots, u_d)$ and the $y$ coordinates are unconstrained, or $(y_{i+1}, \ldots, y_d) = (u_{i+1}, \ldots, u_d)$ and the $x$ coordinates are unconstrained. Finally, the $i$th coordinates in both $x$ and $y$ can take at most $a_i$ values and also at most $\max\{x_i : x \in S_0\} + 1 \le D + 1$ values, so the number of such pairs is at most $2(a_1a_2\cdots a_{i-1} \times a_iD \times a_{i+1}\cdots a_d)$. Then, summing over all $2^{i-1}$ groups and dividing by 2 to get unordered pairs $\{x, y\}$, we get $\sum_{p_{xy} \ni e} 1 \le 2^{i-1}(2AD/2) \le 2^{d-1}AD$. Therefore

$$1/\gamma \le (2d/\mu_{\beta,\min})\,\mu_{\beta,\max}^2\,2D\,(2^{d-1}AD) = d\rho\,\mu_{\beta,\max}2^{d+1}AD^2 = d2^{d+1}e^{\beta h}\mu_{\beta,\max}AD^2.$$

Again using the basic bound $\|T_\beta^n(x, \cdot) - \mu_\beta\| \le (1/\sqrt{\mu_\beta(x)})\exp(-n\gamma)$, we see that $\|T_\beta^n(x, \cdot) - \mu_\beta\| \le \varepsilon$ for $n \ge -\log(\varepsilon\sqrt{\mu_\beta(x)})/\gamma$, which occurs if

$$n \ge -d2^{d+1}e^{\beta h}\mu_{\beta,\max}AD^2\log(\varepsilon\sqrt{\mu_\beta(x)}).$$
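In practice the $d$ moves require no Gröbner computation: one increments or decrements a single original coordinate, and the slack adjustment is implicit in checking $Ax \le b$. The following is a sketch of one step of the uniform ($\beta = 0$) walk on $S_0$, with illustrative names of our own (A is the constraint matrix as a list of rows, b the right-hand side):

```python
import random

def slack_walk_step(x, A, b):
    """One step of the walk on S0 = {x >= 0 : A x <= b}.  Proposals are
    x -> x +/- e_i with i uniform over the d coordinates; tracking the
    slack variables explicitly is unnecessary, since a proposal is
    accepted exactly when it stays in S0."""
    d = len(x)
    i = random.randrange(d)
    y = list(x)
    y[i] += random.choice((1, -1))
    feasible = y[i] >= 0 and all(
        sum(row[j] * y[j] for j in range(d)) <= br
        for row, br in zip(A, b))
    return tuple(y) if feasible else tuple(x)    # hold if y leaves S0
```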

If we apply Proposition 3.1 to the knapsack problem of Example 3.2, we can let $a_i = (W/w_i) + 1$. Then $A$ is approximately the product $W^d/(w_1\cdots w_d)$, and the quantity $|S| \approx W^d/(d!\,w_1\cdots w_d) \approx A/d!$ for $W$ large and fixed $d$. Since $D \le W/\min_i w_i$, we get the estimate $d2^{d+1}W^2/(\min_i w_i)^2\,(A/|S|)$, or roughly the quantity $d2^{d+1}d!\,W^2/(\min_i w_i)^2$. For fixed $d$ and weights $w_i$, this grows more slowly in $W$ than the estimate $dW^{2d}/(d!^2w_1^2\cdots w_d^2)$ from Example 3.2.

We mention finally that significant improvements are possible in situations where a certain block structure is present in the constraint set. Suppose that the constraint set $S \subset \mathbb{Z}_+^d$ can be described by $g$ independent constraints on disjoint sets of $d_i$ variables, $1 \le i \le g$, as follows. Let $A_i : \mathbb{R}^{d_i} \to \mathbb{R}^{r_i}$ be a matrix with nonnegative integer entries which defines a constraint set $S_i = \{x \in \mathbb{Z}_+^{d_i} : A_i(x) = b_i\}$, $i = 1, \ldots, g$. Then we assume that $S$ has the product form $S = S_1 \times \cdots \times S_g$. Let $b_i$ index the start of the variables for constraint $i$, so $b_i = d_1 + \cdots + d_{i-1} + 1$, $i = 1, \ldots, g$. Let $M_i$, of size $m_i$, be the set of moves corresponding to a Gröbner basis for the symmetric Markov chain on each of these sets $S_i$, $i = 1, \ldots, g$, considered separately. Let $K_i$ be the symmetric kernel on $S_i$ given by

(3.2) $$K_i(x, y) = \frac{1}{2m_i} \quad \text{if } y \ge 0 \text{ and } y = x \pm f \text{ for some } f \in M_i,$$

and $K_i$ vanishes otherwise. This means that one updates the $d_i$ coordinates corresponding to the constraints $A_i$ with moves from $M_i$. Take arbitrary positive weights $w_1, \ldots, w_g$ with $w_1 + \cdots + w_g = 1$, and form the irreducible kernel $K$ on $S$ by

(3.3) $$K(h_1 \otimes h_2 \otimes \cdots \otimes h_g)(x) = \sum_{i=1}^g w_i\, h_1 \otimes \cdots \otimes K_i(h_i) \otimes \cdots \otimes h_g(x),$$

which means that one chooses a block of coordinates $i$ with probability $w_i$ and then runs the chain $K_i$ in $S_i$ to update those $d_i$ coordinates while leaving the others unchanged. The eigenvectors of $K$ are the tensor products of those for the kernels $\{K_i,\ i = 1, \ldots, g\}$, with eigenvalues the corresponding averages of the individual eigenvalues, weighted by the family $\{w_i\}$. It is then a simple fact that the gap $\gamma$ for $K$, the difference between unity and the second largest eigenvalue, is $\gamma = \min\{w_i\gamma_i : i = 1, \ldots, g\}$, where $\gamma_i$ is one minus the second largest eigenvalue of the kernel $K_i$ on the set $S_i$. Below we consider a special case with uniform distribution on one-dimensional blocks.
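Sampling from the kernel (3.3) is a two-stage draw: a block index $i$ with probability $w_i$, then one step of $K_i$ on that block. A sketch under the assumption that each block supplies its own one-step transition function (all names are ours):

```python
import random

def block_step(x, blocks, weights, kernels):
    """One step of the product kernel K in (3.3).  blocks[i] lists the
    coordinate indices of S_i, kernels[i] maps the current sub-vector to
    the next one (one step of K_i), and weights sums to 1."""
    i = random.choices(range(len(blocks)), weights=weights)[0]
    x = list(x)
    sub = tuple(x[j] for j in blocks[i])
    for j, v in zip(blocks[i], kernels[i](sub)):
        x[j] = v                   # other blocks are left unchanged
    return tuple(x)
```

With one-dimensional blocks, uniform weights $w_i = 1/d$, and the single move $(1, -1)$ in each block, this reduces to the reflecting walk of Example 3.3 below.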


Example 3.3 (reflecting random walk in a box [2]). Consider the state space $S_0 = \{x \in \mathbb{Z}_+^d : 0 \le x_i \le a - 1\}$ for an integer $a > 1$. The reflecting random walk makes a transition from a lattice point by uniformly choosing one of the $2d$ neighbors and moving to the one selected if the candidate is within the box; the boundary points have some positive holding probability. Another description is that one uniformly chooses one of the $d$ dimensions or coordinates to update, and then runs the one-dimensional reflecting random walk one step on that coordinate in the space $\{0, 1, \ldots, a - 1\}$. The state space $S_0$ is equivalent for our purposes to $S = \{x \in \mathbb{Z}_+^{2d} : x_{i,1} + x_{i,2} = a - 1,\ i = 1, \ldots, d\}$, obtained by adding $d$ slack variables. Applying Lemma 2.1 directly to $S$ yields a spectral estimate that is not accurate for $d \ge 2$. Recall that the exact gap is $\gamma = (1 - \cos(\pi/a))/d \approx \pi^2/2a^2d$, so $1/\gamma$ is on the order of the square of the euclidean diameter of $S$ [2]. To get something comparable with our method, write $S = S_1 \times \cdots \times S_d$, where $S_i = \{x \in \mathbb{Z}_+^2 : x_{i,1} + x_{i,2} = a - 1\}$. The Markov chain on $S_i$ is run with moves $M_i = \{(1, -1)\}$. With uniform weights $w_i = 1/d$, the kernel (3.3) is exactly the reflecting walk in the box, and Lemma 2.1 gives $\gamma_i \ge |S_i|^{-2} = a^{-2}$. Then $K$, as a weighted average over the kernels $K_i$, has gap at least $1/a^2d$, and this result is of the right size in the parameters $a$ and $d$. We conclude from this example that in situations where a product structure exists on the state space (and also on the objective function, which we have taken to be constant in this example), the bound of Lemma 2.1 can be significantly improved.

4. Conclusions. Nonnegative integer constraint sets are difficult and interesting domains for optimization and simulation problems. The results in this paper give general bounds on the time required for a given accuracy in some problems of simulation on such domains without prior enumeration of the state space, and they are formulated in terms of computable quantities. In some interesting examples the bounds are quite accurate, in particular when the state space $S$ is low dimensional. An example where they would not be accurate is simulation on the set of multigraphs with given vertex degrees, which can be interpreted as simulation on symmetric nonnegative integer matrices with certain row and column sums. Here the dimension is on the order of the number of vertices, and we would not expect the results to be useful in this case. The basic technique of Lemma 2.1 for estimating eigenvalues is generally not powerful in high dimensions (see [2]), but it does not require the detailed properties of the objective function $f$ which appear in theoretical results on annealing. Furthermore, the techniques for estimating the spectral gap $\gamma$ and the path length between two elements of the state space may be adapted in particular situations to yield substantial improvements. Examples of this are shown for inequality constraints and when a product structure exists on the constraint set.
Improving the bounds of this paper and extending them to the nonreversible annealing algorithms treated in somewhat abstract terms in [12] and [16] would serve as interesting problems for further related research.


REFERENCES

[1] O. Catoni, Applications of sharp large deviations estimates to optimal cooling schedules, Ann. Inst. H. Poincaré Probab. Statist., 27 (1991), pp. 463–518.
[2] P. Diaconis and L. Saloff-Coste, Nash inequalities for finite Markov chains, J. Theoret. Probab., 9 (1996), pp. 459–510.
[3] P. Diaconis and D. Stroock, Geometric bounds for eigenvalues of Markov chains, Ann. Appl. Probab., 1 (1991), pp. 36–61.
[4] P. Diaconis and B. Sturmfels, Algebraic Algorithms for Sampling from Conditional Distributions, Technical Report 6, Department of Statistics, Stanford University, Stanford, CA, 1993.
[5] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, 2nd ed., John Wiley, New York, 1957.
[6] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Analysis and Machine Intelligence, 6 (1984), pp. 721–741.
[7] B. Gidas, Nonstationary Markov chains and convergence of the annealing algorithm, J. Stat. Phys., 39 (1985), pp. 73–131.
[8] B. Hajek, Cooling schedules for optimal annealing, Math. Oper. Res., 13 (1988), pp. 311–329.
[9] R. Holley and D. Stroock, Simulated annealing via Sobolev inequalities, Comm. Math. Phys., 115 (1988), pp. 553–569.
[10] S. Ingrassia, On the rate of convergence of the Metropolis algorithm and Gibbs sampler by geometric bounds, Ann. Appl. Probab., 4 (1994), pp. 347–389.
[11] E. L. Johnson, Integer Programming: Facets, Subadditivity, and Duality for Group and Semigroup Problems, CBMS-NSF Regional Conf. Ser. in Appl. Math., SIAM, Philadelphia, 1980.
[12] L. Miclo, Remarques sur l'hypercontractivité et l'évolution de l'entropie pour des chaînes de Markov finies, in Séminaire de Probabilités XXXI, J. Azéma, ed., Springer, New York, 1997, pp. 136–167.
[13] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed., Cambridge University Press, New York, 1992.
[14] B. Ripley, Stochastic Simulation, John Wiley, New York, 1987.
[15] A. Schrijver, Theory of Linear and Integer Programming, John Wiley, New York, 1986.
[16] A. Trouvé, Rough large deviation estimates for the optimal convergence speed exponent of generalized simulated annealing algorithms, Ann. Inst. H. Poincaré Probab. Statist., 32 (1996), pp. 299–348.
[17] H. Wilf, Generating Functionology, Academic Press, San Diego, 1988.