ASTRO-DF: A CLASS OF ADAPTIVE SAMPLING TRUST-REGION ALGORITHMS FOR DERIVATIVE-FREE SIMULATION OPTIMIZATION

SARA SHASHAANI∗, FATEMEH S. HASHEMI†, AND RAGHU PASUPATHY‡
Abstract. We consider unconstrained optimization problems in which only "stochastic" estimates of the objective function are observable as replicates from a Monte Carlo oracle. The Monte Carlo oracle is assumed to provide no direct observations of the function gradient. We present ASTRO-DF, a class of derivative-free trust-region algorithms in which a stochastic local interpolation model is constructed, optimized, and updated iteratively. Function estimation and model construction within ASTRO-DF are adaptive in the sense that the extent of Monte Carlo sampling is determined by continuously monitoring and balancing metrics of the sampling and structural errors within ASTRO-DF. Such balancing of errors is designed to ensure that the Monte Carlo effort within ASTRO-DF is sensitive to the algorithm trajectory, sampling more whenever an iterate is inferred to be close to a critical point and less when far away. We demonstrate the almost-sure convergence of ASTRO-DF's iterates to a first-order critical point when using linear or quadratic stochastic interpolation models. The question of using more complicated models, e.g., regression or stochastic kriging, in combination with adaptive sampling is worth further investigation and will benefit from the methods of proof presented here. We speculate that ASTRO-DF's iterates achieve the canonical Monte Carlo convergence rate, although a proof remains elusive.

Key words. derivative-free optimization, simulation optimization, trust-region

AMS subject classifications.
1. INTRODUCTION. We consider unconstrained simulation optimization (SO) problems, that is, optimization problems in continuous space where the objective function(s) can only be expressed implicitly via a Monte Carlo oracle. The Monte Carlo oracle is assumed not to provide any direct observations of the function derivatives. SO has recently gathered attention due to its versatile formulation, allowing the user to specify functions involved in an optimization problem implicitly, e.g., through a stochastic simulation. As a result, SO allows virtually any level of problem complexity to be embedded, albeit at the possible price of a computationally burdensome and slow Monte Carlo oracle. SO has seen wide recent adoption; see, for example, applications in telecommunication networks [30], traffic control [39], epidemic forecasting [37], and health care [1]. Recent editions of the Winter Simulation Conference (www.informs-sim.org) have dedicated an entire track to the SO problem and its various flavors. For a library of SO problems, see www.simopt.org and [43, 44].

The presence of a Monte Carlo oracle lends flexibility to the SO problem formulation, but it also brings with it a simply-stated complication: the lack of uniform deterministic error guarantees. Specifically, suppose f(x, n) is the Monte Carlo estimate of the unknown desired function value f(x) at the point x, and n represents the extent of Monte Carlo effort. Then, simple probability arguments reveal that deterministic guarantees of the sort |f(x, n) − f(x)| ≤ ε, ε > 0, do not hold irrespective of the size of n; instead, one has to be content with probabilistic precision guarantees of the form P{|f(x, n) − f(x)| > ε} ≤ α for n ≥ n_0(α). The analogous situation for function derivative estimation using Monte Carlo is worse. If the derivative estimate ∇̂f(x) := (∇̂_1 f(x), ∇̂_2 f(x), ..., ∇̂_q f(x)) is constructed using a central-difference approximation as

∇̂_i f(x) = (2c_n)^{-1} ( f(x + c_n e_i, n) − f(x − c_n e_i, n) ),   i = 1, 2, ..., q,

then, as in the function estimation context, no uniform guarantees on the accuracy of ∇̂f(x) are available in general. Furthermore, the rate at which ∇̂f(x) converges to ∇f(x) depends crucially on the choice of {c_n}, with the best possible rate O(n^{−1/3}) under generic Monte Carlo sampling being much slower than the corresponding O(n^{−1/2}) rate for function estimation. (See [4] for this and related results.) Most importantly, implementing such finite-difference derivative estimates within an SO algorithm is well recognized to be a delicate issue, easily causing instabilities. In any event, the lack of uniform deterministic guarantees in the SO context means that estimation error inevitably accumulates across the iterations of an algorithm, thereby threatening the consistency of the resulting iterates. Algorithms for solving SO have to somehow contend with such potential non-convergence due to mischance, either through the introduction of gain sequences as in stochastic approximation [33], or through appropriate sampling as in sample average approximation or retrospective approximation [32, 40].

A second complication within SO, one that it partially shares with black-box deterministic optimization contexts, is the lack of information about function structure. Structural properties such as convexity, uni-modality, and differentiability, if known to be present, can be exploited when designing optimization algorithms. Such properties, when appropriate, are usually assumed within the deterministic context, and an appropriate solution algorithm devised. In SO, however, structural assumptions about the underlying true objective and constraint functions, even if correct, may not provide as much leverage during algorithm development.

∗Department of Industrial Engineering, Purdue University (email: [email protected])
†Industrial and Systems Engineering, Virginia Tech (email: [email protected])
‡Department of Statistics, Purdue University (email: [email protected])
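As an aside, the two Monte Carlo error rates quoted above are easy to reproduce numerically. The following sketch estimates a function value and a central-difference gradient from a noisy oracle (the quadratic test function, the noise level, and the step-size choice are illustrative assumptions, not taken from the paper):

```python
import random

def oracle(x, n, rng):
    """Monte Carlo oracle: sample mean of n replicates of F(x) = x^2 + noise,
    so that f(x) = E[F(x)] = x^2."""
    return sum(x * x + rng.gauss(0.0, 1.0) for _ in range(n)) / n

def central_diff_grad(x, n, c_n, rng):
    """Central-difference derivative estimate built from two function estimates."""
    return (oracle(x + c_n, n, rng) - oracle(x - c_n, n, rng)) / (2.0 * c_n)

rng = random.Random(0)
x, n = 1.0, 10_000
c_n = n ** (-1.0 / 6.0)   # a step choice associated with the O(n^{-1/3}) rate
f_hat = oracle(x, n, rng)                   # error shrinks like O(n^{-1/2})
g_hat = central_diff_grad(x, n, c_n, rng)   # error shrinks more slowly
print(abs(f_hat - x * x), abs(g_hat - 2.0 * x))
```

The gradient error is visibly larger than the function-estimation error at the same budget, which is the practical content of the rate comparison above.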
This is because, due to the presence of stochastic error, the true objective and constraint functions are never directly observed, and making structural assumptions about their observed sample-paths is far more suspect. Remark 1. Another aspect that is unique to SO is noteworthy. Monte Carlo calls are typically the most compute-intensive operations within SO contexts. And, depending on the nature of the SO algorithm, a different number of Monte Carlo calls may be expended across iterations, e.g., constant in SA [33], varying but predetermined in RA [40], or random in SCSR [42]. This means that the elemental measure of effort in SO, the number of Monte Carlo oracle calls, may not have a simple relationship with the notion of "iterations" defined within the specific SO algorithm, forcing a need for more careful book-keeping. This is why iterative SO algorithms are well-advised to measure convergence and convergence rates not in terms of the number of iterations, as deterministic iterative structures do, but rather in terms of the total number of Monte Carlo calls. 1.1. ASTRO-DF and Overview of Contribution. Our particular focus in this research is the development of a class of algorithms for solving low- to moderate-dimensional SO problems that have no readily discernible structure. We are inspired by the analogous problem in the deterministic context that has spurred the development of a special and arguably very useful class of optimization methods called model-based trust-region derivative-free (TRO-DF) algorithms [22, 46, 21, 5]. TRO-DF algorithms are typified by two aspects: (i) they eschew the direct computation and use of derivatives for searching, and instead rely on constructed models of guaranteed accuracy in specified "trust regions"; (ii) the algorithmic search evolves by repeatedly constructing and optimizing a local model within a dynamic trust-region, explicitly restricting the
distance between the successive iterates returned by the algorithm. The aspect in (i) is particularly suited for adaptation to SO contexts, where direct derivative estimation can be delicate and unstable, requiring careful choice of step-sizes [4]; the aspect in (i) also aids efficiency, because models constructed in previous iterations can be re-used with some updating, and no effort is expended on explicit estimation of derivatives. The aspect in (ii) runs counter to efficiency, but is designed to lend stability through steps that are more circumspect.

We construct a family of adaptive sampling trust-region optimization derivative-free (ASTRO-DF) algorithms for the SO context. In their most rudimentary form, ASTRO-DF algorithms follow a familiar idea for iteratively estimating the first- and second-order critical points of a function. Given a current random iterate X_k that approximates the first-order critical point of interest, ASTRO-DF constructs a tractable "local" stochastic model using Monte Carlo observations of the objective function at carefully chosen points around X_k. The constructed model is then optimized within the local region in which it is constructed to obtain a candidate solution X̃_{k+1}. Next, the objective function is observed (using Monte Carlo) at X̃_{k+1} and compared against the value predicted by the model at X̃_{k+1}. If the observed decrease in function values from X_k to X̃_{k+1} exceeds the decrease predicted by the constructed model in a certain stochastic sense, the candidate X̃_{k+1} is accepted as the next iterate X_{k+1}. As a vote of confidence in the constructed model, the trust-region radius is then expanded by a factor. Otherwise, that is, if the observed decrease falls short of the predicted decrease (again, in a certain precise sense), the candidate X̃_{k+1} is rejected, the trust-region radius is shrunk, and the local model is updated in an attempt to improve accuracy. This iterative process then repeats to produce a sequence of iterates {X_k}.

Remark 2. Throughout this paper, we use the term "sampling" to refer to the act of obtaining replicates using multiple runs of the Monte Carlo oracle at a fixed point. This is not to be confused with sampling design points in the search region. So, when we say that the sample size is n, we mean that n units of Monte Carlo effort were expended to obtain the function estimate at a fixed point.

The above ideas for model construction, trust-region management, and candidate-point acceptance say nothing about how much Monte Carlo effort to expend. Since all observations for function estimation and model construction are based on Monte Carlo, the resulting accuracy estimates are at best probabilistic, leading us to the question of how much to sample. Too little Monte Carlo effort threatens convergence due to accumulated stochastic and deterministic errors, and too much Monte Carlo sampling means reduced overall efficiency. Identifying the correct Monte Carlo sampling trade-off is more than a theoretical question, and answering it adequately entails more than broad prescriptions on sampling rates. To produce good implementations, sampling prescriptions ought to be specific to the problem at hand, which usually means relying on inference based on the algorithm trajectory. To resolve the issue of how much to sample, we propose that a simple strategy called adaptive sampling be incorporated within derivative-free trust-region algorithms in the SO context. Recognizing that the error in function and model estimation can be decomposed orthogonally into error due to sampling and error due to structure (or bias), adaptive sampling seeks to ensure that "just adequate" Monte Carlo sampling is performed by balancing these errors.
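The rudimentary accept/reject loop described above can be sketched in a few lines. The sketch below is deterministic and uses a first-order Taylor model in place of the interpolation model; the test function, constants, and minimizer-in-a-ball step are illustrative assumptions, and the Monte Carlo estimation and adaptive sampling steps of the actual algorithm are elided:

```python
def trust_region_sketch(f, grad_f, x0, delta0=1.0, eta=0.1,
                        gamma_expand=1.25, gamma_shrink=0.5, iters=50):
    """Basic trust-region iteration: minimize a local linear model in the ball
    of radius delta, then accept or reject the candidate by comparing observed
    versus predicted decrease."""
    x, delta = x0, delta0
    for _ in range(iters):
        g = grad_f(x)                            # stands in for the model gradient
        norm_g = max(sum(gi * gi for gi in g) ** 0.5, 1e-12)
        s = [-delta * gi / norm_g for gi in g]   # minimizer of the linear model
        x_cand = [xi + si for xi, si in zip(x, s)]
        predicted = delta * norm_g               # decrease predicted by the model
        observed = f(x) - f(x_cand)              # decrease actually observed
        if observed / predicted >= eta:          # success: accept and expand
            x, delta = x_cand, gamma_expand * delta
        else:                                    # failure: reject and shrink
            delta = gamma_shrink * delta
    return x

f = lambda x: (x[0] - 3.0) ** 2 + x[1] ** 2
grad_f = lambda x: [2.0 * (x[0] - 3.0), 2.0 * x[1]]
x_star = trust_region_sketch(f, grad_f, [0.0, 0.0])
print(x_star)  # approaches the minimizer (3, 0)
```

The expand-on-success, shrink-on-failure management of delta is what restricts the distance between successive iterates while still allowing long steps when the model is trustworthy.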
For example, when constructing a local model, Monte Carlo sampling in ASTRO-DF is adaptive in the sense that sampling continues until a certain continuously monitored metric of model quality exceeds a metric of sampling variability. A similar rule is employed when estimating the objective function at a point for purposes of candidate acceptance. We believe that such adaptive sampling paves the way for efficiency because it reacts to the observed algorithm trajectory and, as we shall see, keeps the different sources of error within the algorithm in lock-step. The resulting algorithm remains practical because of the simplicity of the proposed adaptive sampling rule: sample until the estimated standard error falls below a certain specified power of the prevailing trust-region radius.

Remark 3. Adaptive sampling as an idea is not new and has been used with great success in other areas such as sequential confidence interval construction [20, 27] and SO on finite spaces [29]. Adaptive sampling, while invaluable as an implementation idea, introduces substantial complications when analyzing algorithm behavior. Akin to what happens during sequential sampling in the context of confidence interval construction [20, 27], the explicit dependence of the extent of Monte Carlo sampling on the algorithm trajectory causes systematic early stopping and consequent bias in the function estimates obtained within ASTRO-DF. In other words, when using adaptive sampling, E[f(x, n)] ≠ f(x) in general, since the sample size n is a stopping time that depends on f(x, n). Demonstrating that ASTRO-DF's iterates converge to a first-order critical point with probability one then entails demonstrating that the bias effects of adaptive sampling, especially when used within the derivative-free trust-region context, wear away asymptotically. We accomplish this by first (generically) characterizing a relationship between the moments of the adaptive sample size and the function estimates at stopping, and then showing that the errors induced in model construction, function estimation, and algorithm recursion remain in lock-step throughout ASTRO-DF's evolution.
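The adaptive sampling rule just described, sample until the estimated standard error falls below a power of the prevailing trust-region radius, can be sketched as follows (the oracle, the noise level, and the constants kappa and lambda are illustrative assumptions; the form of the rule mirrors the one appearing later in Algorithm 1):

```python
import math
import random

def adaptive_sample(replicate, delta, kappa=1.0, lam=10, rng=None):
    """Sample replicates of F(x) until the estimated standard error
    sigma_hat/sqrt(n) drops below kappa * delta^2 / sqrt(lam), subject to a
    minimum sample size lam. Returns the function estimate and sample size."""
    rng = rng or random.Random()
    obs = [replicate(rng) for _ in range(lam)]   # lower-bound sample
    while True:
        n = len(obs)
        mean = sum(obs) / n
        var = sum((y - mean) ** 2 for y in obs) / n
        if math.sqrt(var / n) <= kappa * delta ** 2 / math.sqrt(lam):
            return mean, n
        obs.append(replicate(rng))

rng = random.Random(1)
replicate = lambda r: 5.0 + r.gauss(0.0, 2.0)   # F(x) with f(x) = 5, sd = 2
m1, n1 = adaptive_sample(replicate, delta=1.0, rng=rng)
m2, n2 = adaptive_sample(replicate, delta=0.5, rng=rng)
print(n1, n2)   # the tighter trust region forces a much larger sample size
```

Because the sample size is a stopping time driven by the data themselves, the returned estimate is, as noted in Remark 3, biased in general; the analysis must show this bias washes out asymptotically.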
We note that ASTRO-DF, as presented here, assumes that a stochastic linear or stochastic quadratic interpolation model is constructed during the model-construction step of the algorithm. While such models are reasonable and have seen wide use in the analogous deterministic context, other possibly more powerful model construction techniques such as regression or stochastic kriging [3] should be considered in place of interpolation models, especially alongside adaptive sampling. Ongoing research investigates this question, and it is our belief that the proof techniques we present in this paper will carry over, albeit with some changes.

2. NOTATION AND PROBLEM STATEMENT. We use bold font for vectors, script font for sets, lower-case font for real numbers, and upper-case font for random variables. Hence {X_k} denotes a sequence of random vectors in IR^d, x = (x_1, x_2, ..., x_d) denotes a d-dimensional vector of real numbers, Y := {y^1, y^2, ..., y^p} denotes a set of p real vectors, and Y := {Y^1, Y^2, ..., Y^p} denotes a set of p random vectors. The set B(x; r) = {y ∈ IR^d : ‖y − x‖ ≤ r} is the closed ball of radius r > 0 with center x.

For a sequence of random vectors {X_k}, X_k → X (in distribution), X_k → X (in probability, written "p"), and X_k → X wp1 denote convergence in distribution, convergence in probability, and convergence with probability one (almost-sure convergence), respectively. For a sequence of real numbers {a_k}, we say a_k = o(1) if lim_{k→∞} a_k = 0; we say a_k = O(1) if {a_k} is bounded, that is, there exists a constant M > 0 such that |a_k| < M for large enough k. For sequences of real numbers {a_k}, {b_k}, we say that a_k ∼ b_k if lim_{k→∞} a_k/b_k = 1. For a sequence of random variables {X_k}, we say X_k = O_p(1) if {X_k} is stochastically bounded, that is, given ε > 0 there exists M(ε) ∈ IR such that P{X_k ∈ (−M(ε), M(ε))} ≥ 1 − ε for all k ≥ K(ε) ∈ N.
The SO problem we consider is formally stated as follows:

(2.1)    Problem P: minimize f(x) subject to x ∈ X,
where f : IR^d → IR is bounded from below and has Lipschitz continuous gradients, and X is a known convenient set such as IR^d. Furthermore, the function f(x) = E[F(x)] is the expectation of a random function F(x) that is observable through Monte Carlo. This means, for instance, that one can generate n identically distributed samples or replicates F_j(x), j = 1, 2, ..., n, of F(x) by "executing" the simulation n times at the point x. This leads to the Monte Carlo estimator F̄(x, n) = n^{-1} Σ_{j=1}^{n} F_j(x), having standard error estimated as σ̂_F(x, n)/√n, where σ̂_F²(x, n) = n^{-1} Σ_{j=1}^{n} (F_j(x) − F̄(x, n))². We assume that no direct observations of the gradient ∇f(·), e.g., through IPA [4, p. 214], are available through the Monte Carlo oracle. This means that methods seeking a gradient estimate need to resort to indirect methods such as finite differencing [4, p. 209], leading to biased estimators.

An algorithm for solving the above problem will be evaluated based on its ability to return a (random) sequence of iterates {X_k} converging in some rigorously defined probabilistic metric to a first- or second-order critical point of the function f. SO algorithms that return iterates {X_k} guaranteed to converge to a critical point with probability one will be called consistent. The phrase "canonical rate" is sometimes used to refer to the fastest achievable convergence rate under generic Monte Carlo sampling. In the current context, we will say that the random sequence of iterates {X_k} achieves the canonical Monte Carlo rate if √(W_k) ∇f(X_k) = O_p(1), where W_k is the total Monte Carlo effort expended after k iterations.

3. PRELIMINARIES. In this section, we list some key definitions and results that will be used throughout the rest of the document. For more details on the definitions, consult [22] and [38].

Definition 3.1. (Poised and Λ-Poised Sets) Given x ∈ X and ∆ > 0, let Y = {y^i ∈ B(x; ∆), i = 1, 2, ..., p} be a finite set and Φ(z) = (φ_1(z), φ_2(z), ..., φ_q(z)) be a polynomial basis on X ⊆ IR^d. Define

(3.1)    P(Φ, Y) = [ φ_1(y^1)  φ_2(y^1)  ···  φ_q(y^1)
                     φ_1(y^2)  φ_2(y^2)  ···  φ_q(y^2)
                        ⋮         ⋮       ⋱      ⋮
                     φ_1(y^p)  φ_2(y^p)  ···  φ_q(y^p) ].

Then, Y is said to be a "poised set" in B(x; ∆) if the matrix P(Φ, Y) is nonsingular. A poised set Y is said to be "Λ-poised" in B(x; ∆) if

Λ ≥ max_{j=1,...,p}  max_{z ∈ B(x;∆)} |ℓ_j(z)|,

where ℓ_j(z) are the Lagrange polynomials associated with Y.

Definition 3.2. (Polynomial Interpolation Models) Let f : X ⊆ IR^d → IR be a real-valued function, and let Y and Φ be as defined in Definition 3.1 with p = q. Suppose we can find α = (α_1, ..., α_p) such that

(3.2)    P(Φ, Y) α = ( f(y^1), ..., f(y^p) )^T.
(Such an α is guaranteed to exist if Y is poised.) Then the function m(z) : B(x; ∆) → IR given by

(3.3)    m(z) = Σ_{j=1}^{p} α_j φ_j(z)

is said to be a polynomial interpolation model of f on B(x; ∆). As a special case, m(z) is said to be a linear interpolation model of f on B(x; ∆) if Φ(z) := (φ_1, φ_2, ..., φ_p) = (1, z_1, z_2, ..., z_d), and a quadratic interpolation model of f on B(x; ∆) if Φ(z) := (φ_1, φ_2, ..., φ_p) = (1, z_1, z_2, ..., z_d, ½z_1², z_1 z_2, ..., ½z_2², ..., ½z_d²).

Definition 3.3. (Stochastic Interpolation Models) A model constructed as in Definition 3.2 but with sampled function estimates is called a stochastic interpolation model. Specifically, analogous to (3.2), suppose α̂ = (α̂_1, α̂_2, ..., α̂_p) is such that

P(Φ, Y) α̂ = ( F̄(y^1, n(y^1)), F̄(y^2, n(y^2)), ..., F̄(y^p, n(y^p)) )^T.

Then the stochastic function M(z) : B(x; ∆) → IR given as M(z) = Σ_{j=1}^{p} α̂_j φ_j(z) is said to be a stochastic polynomial interpolation model of f on B(x; ∆), where Φ(z) = (φ_1(z), φ_2(z), ..., φ_q(z)) and P(Φ, Y) are as in Definition 3.2.

Definition 3.4. (Fully-linear and Fully-quadratic Models) Given x ∈ X, a model m(z) : B(x; ∆) → IR, m ∈ C¹, is said to be a (κ_ef, κ_eg)-fully-linear model of f on B(x; ∆) if it has a Lipschitz continuous gradient with Lipschitz constant ν_gL^m, and there exist constants κ_ef, κ_eg (not dependent on z and ∆) such that

(3.4)    |f(z) − m(z)| ≤ κ_ef ∆²;    ‖∇f(z) − ∇m(z)‖ ≤ κ_eg ∆.

Similarly, m(z) : B(x; ∆) → IR, m ∈ C², is said to be a (κ_ef, κ_eg, κ_eh)-fully-quadratic model of f on B(x; ∆) if it has a Lipschitz continuous second derivative with Lipschitz constant ν_hL^m, and there exist constants κ_ef, κ_eg, κ_eh (not dependent on z, ∆) such that

(3.5)    |f(z) − m(z)| ≤ κ_ef ∆³;    ‖∇f(z) − ∇m(z)‖ ≤ κ_eg ∆²;    ‖∇²f(z) − ∇²m(z)‖ ≤ κ_eh ∆.
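As a concrete illustration of Definitions 3.2 and 3.4, the sketch below builds a linear interpolation model by solving the system (3.2) on a poised set (the test function, the sample set, and the use of numpy for the linear solve are illustrative assumptions):

```python
import numpy as np

def linear_interp_model(f, Y):
    """Build a linear interpolation model m(z) = alpha . (1, z_1, ..., z_d) by
    solving P(Phi, Y) alpha = (f(y^1), ..., f(y^p)) with p = d + 1 points."""
    Y = np.asarray(Y, dtype=float)
    P = np.hstack([np.ones((Y.shape[0], 1)), Y])   # rows Phi(y^i) = (1, y^i)
    alpha = np.linalg.solve(P, np.array([f(y) for y in Y]))
    return lambda z: alpha[0] + alpha[1:] @ np.asarray(z, dtype=float)

f = lambda y: y[0] ** 2 + 3.0 * y[1]               # smooth test function
x, delta = np.array([1.0, 1.0]), 0.5
Y = [x, x + [delta, 0.0], x + [0.0, delta]]        # a poised set in B(x; delta)
m = linear_interp_model(f, Y)
print(m([1.0, 1.0]), f([1.0, 1.0]))                # interpolation is exact here
print(abs(m([1.25, 1.25]) - f([1.25, 1.25])))      # O(delta^2) error in the ball
```

The model matches f exactly at the sample points, while the error at other points in the ball is of order ∆², consistent with the fully-linear bound (3.4).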
A linear interpolation model constructed using a poised set Y can be shown to be (κ_ef, κ_eg)-fully-linear; likewise, a quadratic interpolation model constructed using a poised set Y can be shown to be (κ_ef, κ_eg, κ_eh)-fully-quadratic.

Definition 3.5. (Cauchy Reduction) A step s is said to achieve a κ_fcd fraction of Cauchy reduction for m(·) on B(x; ∆), ∆ > 0, if

(3.6)    m(x) − m(x + s) ≥ (κ_fcd / 2) ‖∇m(x)‖ min{ ‖∇m(x)‖ / ‖∇²m(x)‖, ∆ },

where ∇m(x) and ∇²m(x) are the model gradient and the model Hessian at the point x. We adopt the convention ‖∇m(x)‖/‖∇²m(x)‖ = +∞ when ‖∇²m(x)‖ = 0. A Cauchy step with κ_fcd = 1 is obtained if the model is minimized along the steepest-descent direction within B(x; ∆) [22, p. 175].

The following results will be used at various points in the paper. The first of these (Theorem 3.6) is a seminal result that is routinely used in sequential sampling
contexts. The second result (Theorem 3.7), extensively used in our analysis, is a variation of Lemma 2 and Theorem 1 in [26]. While the postulates and the setting of Theorem 3.7 differ somewhat, a proof follows along the lines established in [26]. For this reason, we have chosen not to include a proof. Lemma 3.8, for which we provide a proof sketch in the Appendix, is a stochastic variant of a corresponding result that appears in Chapter 3 of [22].

Theorem 3.6 (Chow and Robbins, 1965). Suppose the random variables X_i, i = 1, 2, ... are iid with variance σ² < ∞, X̄_n = n^{-1} Σ_{i=1}^{n} X_i, σ̂_n² = n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)², and {a_n} is a sequence of positive constants such that a_n → a as n → ∞. If

N = inf{ n ≥ 1 : σ̂_n/√n ≤ d/a_n },

then d²N/(a²σ²) → 1 wp1 and σ̂_N/σ → 1 wp1 as d → 0.

Theorem 3.7. Suppose the random variables X_i, i = 1, 2, ... are iid with E[X_1] = 0, E[X_1²] = σ² > 0, and E[|X_1|^{4v}] < ∞ for some v ≥ 2. Let σ̂_n² = n^{-1} Σ_{i=1}^{n} (X_i − X̄_n)², where X̄_n = n^{-1} Σ_{i=1}^{n} X_i. If

N = inf{ n ≥ λ^γ : σ̂_n/√n ≤ κ/√λ },    γ ∈ (0, 1],

then the following hold.
(i) As λ → ∞, P{N < ∞} = 1 and N → ∞ wp1.
(ii) As λ → ∞ and for every s < v, E[N^s] ∼ σ^{2s} κ^{-2s} λ^s.
(iii) For every ε ∈ (0, 1), P{N ≤ σ²κ^{-2}λ(1 − ε)} = β λ^{-(v-1)γ}, where β is a constant that may depend only on v and the moments of X_1.
(iv) As λ → ∞, E[X̄_N²] ∼ κ² λ^{-1}.

Lemma 3.8. Let Y = {Y^1, Y^2, ..., Y^p} be a Λ-poised set on B(Y^1; ∆). Let m(z) be an interpolation model of f on B(Y^1; ∆), and let M(z) be the corresponding stochastic interpolation model of f on B(Y^1; ∆) constructed using the observations F̄(Y^i, N(Y^i)) = f(Y^i) + E^i for i = 1, 2, ..., p.
(i) For all z ∈ B(Y^1; ∆),

|M(z) − m(z)| ≤ Λ max_{i=1,2,...,p} |F̄(Y^i, N(Y^i)) − f(Y^i)|.

(ii) If M(z) is a stochastic linear interpolation model of f on B(Y^1; ∆), then there exist positive constants κ_egL1, κ_egL2 such that for z ∈ B(Y^1; ∆),

‖∇M(z) − ∇f(z)‖ ≤ κ_egL1 ∆ + κ_egL2 √( Σ_{i=2}^{d+1} (E^i − E^1)² ) / ∆.
(iii) If M(z) is a stochastic quadratic interpolation model of f on B(Y^1; ∆), then there exist positive constants κ_egQ1, κ_egQ2 such that

‖∇M(z) − ∇f(z)‖ ≤ κ_egQ1 ∆² + κ_egQ2 √( Σ_{i=2}^{(d+1)(d+2)/2} (E^i − E^1)² ) / ∆.

4. RELATED WORK. Much progress has been made in recent times on solving various flavors of the SO problem. The predominant solution methods in the simulation literature fall into two broad categories, called Stochastic Approximation (SA) and Sample-Average Approximation (SAA). SA and SAA have enjoyed a long history, with a mature theoretical and algorithmic literature. More recently, newer classes of algorithms that can be described as "stochastic versions" of iterative structures in the deterministic context have emerged.

4.1. SA and SAA. Virtually all stochastic approximation type methods are subsumed by the following generic form:

(4.1)
Xk+1 = ΠD (Xk − ak Yk ) ,
where Π_D(x) is the projection of the point x onto the set D, {a_k} is a user-chosen positive-valued scalar sequence, and Y_k is an estimator of the gradient ∇f(X_k) of the function f at the point X_k. When direct Monte Carlo observations of the objective function f are available, the most common expression for Y_k is either the central-difference approximation Y_k = (2c_k)^{-1}(F(X_k + c_k) − F(X_k − c_k)) or the forward-difference approximation Y_k = c_k^{-1}(F(X_k + c_k) − F(X_k)) of the gradient ∇f(X_k), where {c_k} is a positive-valued sequence and F : IR^d → IR is the observable estimator of the objective function f : IR^d → IR. The resulting recursion is the famous Kiefer-Wolfowitz process [31, 15]. More recent recursions include an estimated Hessian H_k(·) of the function f at the point X_k as part of Y_k:
Y_k = (2c_k)^{-1} H_k^{-1}(X_k) ( F(X_k + c_k) − F(X_k − c_k) ),
making the resulting recursion in (4.1) look closer to the classical Newton iteration in the deterministic context. The Hessian estimator H_k(·) has d² entries; hence, most methods that use (4.2) in (4.1) estimate H_k(·) either using a parsimonious design (e.g., [51, 52]) or construct it from the history of observed points. As can be seen in (4.1), the SA recursion is simply stated and implemented, and little has changed in its basic structure since 1951, when it was first introduced by Robbins and Monro [47] in the context of finding a zero of a "noisy" vector function. Instead, much of the research over the ensuing decades has focused on questions such as the convergence and convergence rates of SA-type algorithms, the effect of averaging on the consistency and convergence rates of the iterates, and efforts to choose the sequence {a_k} in an adaptive fashion. Some good entry points into the vast SA literature include [33, 36, 45, 41]. See [17, 16, 54] for recent attempts to address the persistent dilemma of choosing the gain sequence {a_k} to ensure good finite-time performance. SAA, in contrast to SA, is more a framework than an algorithm for solving SO problems. Instead of solving Problem P, SAA solves a "sample-path" Problem P_n (to optimality) to obtain a solution estimator X_n. Formally, in the unconstrained context, SAA seeks to solve (4.3)
Problem Pn : minimize f (x, n) subject to x ∈ X,
where f(x, n) is computed using a "fixed" sample of size n. SAA is attractive in that Problem P_n becomes a deterministic optimization problem, and SAA can bring to bear all of the advances in deterministic nonlinear programming methods [11] of the last few decades. SAA has been the subject of a tremendous amount of theoretical and empirical research over the last two decades. For example, the conditions that allow the transfer of structural properties from the sample-path to the limit function f(x) [32, Propositions 1, 3, 4]; the sufficient conditions for the consistency of the optimal value and solution of Problem P_n, assuming the numerical procedure in use within SAA can produce global optima [50, Theorem 5.3]; the consistency of the set of stationary points of Problem P_n [50, 6]; convergence rates for the optimal value [50, Theorem 5.7] and optimal solution [32, Theorem 12]; expressions for the minimum sample size m that provides probabilistic guarantees on the optimality gap of the sample-path solution [49, Theorem 5.18]; methods for estimating the accuracy of an obtained solution [35, 8, 9]; and quantifications of the trade-off between searching and sampling [48] have all been thoroughly studied. SAA is usually not implemented in the vanilla form P_n due to known issues relating to an appropriate choice of the sample size n. There have been recent advances [41, 24, 8, 9, 10] aimed at overcoming the issue of sample-size choice.

4.2. Stochastic TRO. Two algorithms that are particularly noteworthy competitors to what we propose here are STORM [19] and the recently proposed algorithm by Larson and Billups [34] (henceforth LB2014). While the underlying logic in both of these algorithms is very similar to that in ASTRO-DF, key differences arise in terms of what has been assumed about the quality of the constructed models and how such quality can be achieved in practice.
A key postulate that guarantees consistency in STORM, for example, is that the constructed models are of a certain specified quality (characterized through the notion of probabilistic full linearity) with a probability exceeding a fixed threshold. The authors provide a way to construct such models using function estimates constructed as sample means. Crucially, the prescribed sample means in STORM use a sample size that is derived using the Chebyshev inequality with an assumed upper bound on the variance. By contrast, the sample sizes in ASTRO-DF are determined adaptively by balancing squared bias and variance estimates for the function estimator. While this makes the sample size in ASTRO-DF a stopping time [12], thereby complicating proofs, such adaptive sampling enables ASTRO-DF to sample differentially across the search space, leading to efficiency. LB2014, like STORM, uses random models. Unlike STORM, however, the sequence of models constructed in LB2014 is assumed to be accurate (as measured by a certain rigorous notion) with a probability sequence that converges to one. A related version [13] of LB2014 addresses the issue of differing levels of (spatial) stochastic error through the use of weighted regression schemes, where the weights are chosen heuristically. Another noteworthy algorithm for the context we consider in this paper is VNSP, proposed by Deng and Ferris [23, 24]. VNSP uses a quadratic interpolation model within a trust-region optimization framework, and is derivative-free in the sense that only function estimates are assumed to be available. Model construction, inference, and improvement, along with (nondecreasing) sample-size updates, happen within a Bayesian framework with an assumed Gaussian conjugate prior. Convergence theory for VNSP is accordingly within a Bayesian setting. In the slightly more tangential context where unbiased gradient estimates are assumed to be available, a number of trust-region type algorithms have emerged
in the last decade or so. STRONG, or the Stochastic Trust-Region Response-Surface Method [18], for instance, is an adaptive sampling trust-region algorithm for solving SO problems that is in the spirit of what we propose here. A key feature of STRONG is local model construction through a design of experiments combined with a hypothesis-testing procedure. STRONG assumes that the errors in the derivative observations are additive and have a Gaussian distribution. Amos et al. [2] and Bastin et al. [7] are two other examples of algorithms that treat the setting where unbiased observations of the gradient are assumed to be available. (The former, in fact, assumes that unbiased estimates of the Hessian of the objective function are available.) Bastin et al. [7] is specific to the problem of estimation within mixed-logit models.

5. ASTRO-DF OVERVIEW AND ALGORITHM LISTING. ASTRO-DF is an adaptive sampling trust-region derivative-free algorithm whose essence is encapsulated within four repeating steps: (i) local stochastic model construction and certification through adaptive sampling; (ii) constrained optimization of the constructed model to identify the next candidate solution; (iii) re-estimation of the next candidate solution through adaptive sampling; and (iv) iterate and trust-region update based on a (stochastic) sufficient-decrease check. These stages appear as Steps 1-4 in Algorithm 1. In what follows, we describe each of these steps in further detail.

Step 0 of Algorithm 1 is the initialization step, where the iteration number k and the trust-region radius ∆_k are initialized. In Step 1 of Algorithm 1, a stochastic model of the function f(·) having specific properties is constructed in the trust region B(X_k; ∆_k) using Algorithm 2. The aim of Algorithm 2 is to construct a model of a specified quality within a trust region having radius smaller than a fixed multiple of the model-gradient norm. During the jth iteration of Algorithm 2, a poised set Y_k = {Y^1, Y^2, ..., Y^p} in the "candidate" trust region B(X_k; w^{j-1} ∆̃_k), having radius w^{j-1} ∆̃_k and center X_k := Y^1, is chosen; Monte Carlo function estimates are then obtained at each of the points in Y_k. Sampling at each point in Y_k is adaptive and continues (Steps 1(b) and 1(c)) until the estimated standard errors σ̂(Y^i, N(Y^i))/√N(Y^i) of the function estimates F̄(Y^i, N(Y^i)) drop below a slightly inflated square of the candidate trust-region radius. A linear (or quadratic) interpolation model is then constructed using the obtained function estimates in Step 1(d). (If a linear interpolation model is constructed, p = d + 1, and if a quadratic interpolation model is constructed, p = (d + 1)(d + 2)/2.) If the resulting model M_k^{(j)}(z), z ∈ B(X_k; w^{j-1} ∆̃_k), is such that the candidate trust-region radius w^{j-1} ∆̃_k is too large compared to the model gradient ∇M_k^{(j)}(X_k), that is, if ∆̃_k w^{j-1} > µ ‖∇M_k^{(j)}(X_k)‖, then the candidate trust-region radius is shrunk by a factor w and control is returned to Step 1(a). On the other hand, if the candidate trust-region radius is smaller than the product of µ and the model-gradient norm, then the resulting stochastic model is accepted, but over an updated incumbent trust-region radius given by Step 2. (Step 2 of Algorithm 2, akin to [21], updates the incumbent trust-region radius to the point in the interval [∆_k, w^{j-1} ∆_k] that is closest to β ‖∇M_k(X_k)‖.) We emphasize the following three issues pertaining to the model resulting from the application of Algorithm 2. (i) Owing to the nature of the chosen poised set Y_k, the (hypothetical) limiting model m_k(z) constructed from true function observations on Y_k would be fully-linear (or fully-quadratic) on the updated trust region B(X_k; ∆_k). Of course, the model m_k(z) is unavailable since true function evaluations are unobservable.
Algorithm 1 ASTRO-DF Main Algorithm
Algorithm Input: Initial guess x0 ∈ Rd, initial trust-region radius ∆0, maximum trust-region radius ∆max, initial sample size n0, sample-size lower-bound sequence {λk} such that k^{1+ϵ} = O(λk), success threshold η1 > 0, trust-region expansion constant γ1 > 1, trust-region contraction constant γ2 ∈ (0, 1), and adaptive sampling constant κoas.
STEP 0: Initialization. Set k = 0, Xk = x0, ∆k = ∆0.
STEP 1: Model Construction with Adaptive Sampling. Go to Algorithm 2 with initial trust-region radius ∆k. Obtain the model Mk and the updated trust-region radius ∆k.
STEP 2: Optimization. Approximate Sk = arg min_{‖s‖≤∆k} Mk(Xk + s).
STEP 3: Estimation with Adaptive Sampling. Estimate F̄(Xk + Sk, N(Xk + Sk)), where
(5.1) N(Xk + Sk) = max{ λk, min{ n : σ̂(Xk + Sk, n)/√n ≤ κoas ∆k²/√λk } }.
STEP 4: Update. Compute the success ratio ρ̂k as
ρ̂k = [ F̄(Xk, Nk) − F̄(Xk + Sk, N(Xk + Sk)) ] / [ Mk(Xk) − Mk(Xk + Sk) ].
If ρ̂k ≥ η1,
set Xk+1 = Xk + Sk, Nk+1 = N(Xk + Sk), and ∆̃k+1 = min{γ1 ∆k, ∆max}. Additionally, update Yk+1 = Yk \ {Ymax} ∪ {Xk+1}, where Ymax := arg max_{Yi∈Yk} ‖Xk+1 − Yi‖.
Else
set Xk+1 = Xk, Nk+1 = N(Xk), and ∆̃k+1 = γ2 ∆k. If Xk + Sk ≠ Ymax, then update Yk+1 = Yk \ {Ymax} ∪ {Xk + Sk}.
Increase k by one and go to Step 1.
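The adaptive sampling rule (5.1) amounts to sequential sampling: keep drawing replicates until the estimated standard error of the running mean drops below κoas ∆k²/√λk, subject to the sample size being at least λk. The sketch below is a minimal illustration of this stopping rule, not the authors' implementation; the function name and all argument values are hypothetical placeholders.

```python
import math

def adaptive_sample_mean(oracle, kappa, delta, lam, n0=2):
    """Sample oracle() until the estimated standard error of the sample mean
    drops below kappa * delta**2 / sqrt(lam), subject to a minimum sample
    size of max(n0, ceil(lam)), mimicking the rule (5.1)."""
    n_min = max(n0, math.ceil(lam))
    samples = [oracle() for _ in range(n_min)]
    while True:
        n = len(samples)
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / (n - 1)  # sample variance
        std_err = math.sqrt(var / n)                           # sigma_hat / sqrt(n)
        if std_err <= kappa * delta ** 2 / math.sqrt(lam):
            return mean, n
        samples.append(oracle())                               # one more replicate
```

Note how the tolerance tightens as the trust-region radius delta shrinks, so the sample size is a stopping time driven by the algorithm trajectory; the same structure, with constant κias and radius ∆̃k w^{j−1}, gives rule (5.2) of Algorithm 2.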
unavailable; and it makes no sense to talk about whether the constructed model Mk(·) is fully linear (or fully quadratic), since it is constructed from stochastic function estimates. (ii) By construction, the trust region resulting from the application of Algorithm 2 has a radius that is at most β times the model gradient norm ‖∇Mk(Xk)‖. (iii) The structure of adaptive sampling in Step 1 of Algorithm 2 is identical to that appearing for estimation in Step 3 of Algorithm 1. The adaptive sampling step simply involves sampling until the estimated standard error of the function estimate comes within a factor of the square of the incumbent trust-region radius. As our convergence proofs will reveal, balancing the estimated standard error against any lower power of the incumbent trust-region radius would threaten the consistency of ASTRO-DF's iterates. Let us now resume our discussion of Algorithm 1. In Step 1, Algorithm 1 executes Algorithm 2 to obtain a model Mk(z), z ∈ B(Xk; ∆k), whose limiting approximation is fully linear (or fully quadratic), as observed in (i) above. Step 2 in Algorithm 1 then approximately solves the constrained optimization problem Sk = arg min_{‖s‖≤∆k} Mk(Xk + s) to obtain a candidate point X̃k+1 = Xk + Sk satis-
Algorithm 2 ASTRO-DF Adaptive Model Improvement Algorithm
Algorithm Input: Initial trust-region radius ∆̃k, sample set Yk, trust-region contraction factor w ∈ (0, 1), trust-region and gradient balance constant µ, and gradient inflation constant β.
STEP 0: Initialize. Set j = 1.
STEP 1: Construct. Repeat
(a) Rename Xk as Y1, and improve Yk = {Y1, Y2, ..., Yp} by adding or removing points to make it a poised set in B(Xk; ∆̃k w^{j−1}).
(b) For all Yi, i = 1, 2, ..., p, estimate F̄(Yi, N(Yi)) such that
(5.2) N(Yi) = max{ λk, min{ n : σ̂(Yi, n)/√n ≤ κias (∆̃k w^{j−1})²/√λk } }.
(c) Construct a linear (or quadratic) interpolation model Mk^{(j)}(Xk + s).
(d) Increment j by one.
Until ∆̃k w^{j−1} ≤ µ ‖∇Mk^{(j)}(Xk)‖.
STEP 2: Update. Set Mk(Xk + s) = Mk^{(j)}(Xk + s), ∇Mk(Xk) = ∇Mk^{(j)}(Xk), ∇²Mk(Xk) = ∇²Mk^{(j)}(Xk), and ∆k = min{ ∆̃k, max{ β ‖∇Mk(Xk)‖, ∆̃k w^{j−1} } }.
fying the κfcd-Cauchy decrease defined in Assumption 4.

In preparation for checking whether the candidate solution X̃k+1 provides sufficient decrease, Step 3 of Algorithm 1 obtains Monte Carlo samples of the objective function at X̃k+1 until the estimated standard error σ̂(Xk + Sk, N(Xk + Sk))/√N(Xk + Sk) of F̄(Xk + Sk, N(Xk + Sk)) is smaller than a slightly inflated square of the trust-region radius, λk^{−1/2} κoas ∆k², subject to the sample size being at least as big as λk (see Remark 4). In Step 4 of Algorithm 1, the obtained function estimate is used to check whether the ratio ρ̂k of the observed to the predicted function decrease at the point X̃k+1 exceeds a fixed threshold η1. If ρ̂k exceeds the threshold η1, the candidate X̃k+1 is accepted as the new iterate Xk+1, the iteration is deemed successful, and the trust region is expanded. If ρ̂k falls below the specified threshold η1, the candidate X̃k+1 is rejected (though it may remain in the sample set), the iteration is deemed unsuccessful, and the trust region is shrunk. In either case, ∆̃k+1 is set as the initial trust-region radius for the next iteration, and algorithm control is returned to Step 1.

Remark 4. The sequence {λk}, appearing as the first argument of the "max" function in the expression for the adaptive sample size (in Step 3 of Algorithm 1), is standard in adaptive sampling contexts, e.g., [20, 27], and is intended to nullify the effects of mischance without explicitly participating in the limit. It will become evident through our proofs that the probability of the first argument in the expression for the adaptive sample size being binding decays to zero as k becomes large. In fact, the first argument λk in the expression for the adaptive sample size can be reduced to λk^γ for any γ ∈ (0, 1] with little effect on the ensuing proofs. We have chosen not to do
this in view of not making the user choose yet another parameter.

6. CONVERGENCE ANALYSIS OF ASTRO-DF. As noted in Section 5, the convergence behavior of ASTRO-DF depends crucially on the behavior of three error terms: (i) the stochastic sampling error, arising because function evaluations are obtained through Monte Carlo simulation; (ii) the model bias, arising from the choice of local model; and (iii) the stochastic interpolation error, arising because model prediction at unobserved points combines the model bias and the error in (i). (The analysis in the deterministic context involves only the error in (ii).) Accordingly, driving the errors in (i) and (ii) to zero sufficiently fast, while ensuring the fully-linear or fully-quadratic sufficiency of the expected model, guarantees almost-sure convergence. Driving the errors in (i) and (ii) to zero sufficiently fast is accomplished by forcing the sample sizes to increase across iterations at a sufficiently fast rate, something that we ensure by keeping the estimated standard error of all function estimates in lock-step with the square of the trust-region radius. The trust-region radius is in turn kept in lock-step with the model gradient through the model improvement Algorithm 2. Such a deliberate lock-step between the model error, the trust-region radius, and the model gradient is aimed at efficiency without sacrificing consistency.

In what follows, we provide a formal proof of the wp1 convergence of ASTRO-DF's iterates. Recall that we assume that the models constructed within Step 1 of Algorithm 1 are either linear or quadratic. Furthermore, we focus only on convergence to a first-order critical point of the function f. We first list six standing assumptions that are assumed to hold for the ensuing results. Additional assumptions will be made as and when required.

Assumption 1. The function f is differentiable and bounded from below.
Furthermore, it has Lipschitz continuous gradients; that is, there exists νgL > 0 such that ‖∇f(x) − ∇f(y)‖ ≤ νgL ‖x − y‖ for all x, y ∈ X ⊆ IRd.

Assumption 2. There exists a first-order critical point associated with Problem P; i.e., there exists x* such that ∇f(x*) = 0.

Assumption 3. The Monte Carlo oracle, when executed at Xk, generates independent and identically distributed random variates Fi(Xk) = f(Xk) + ξi, where ξ1, ξ2, ... is a martingale-difference sequence adapted to Fk such that supk E[|ξi|^{4v} | Fk] < ∞ for some v ≥ 2.

Assumptions 1-3 are arguably mild assumptions relating to the problem. The following two assumptions relate to the nature of the proposed algorithm ASTRO-DF.

Assumption 4. The minimizer obtained in the trust-region subproblem (Step 2 of Algorithm 1) satisfies a κfcd-Cauchy decrease with κfcd > 0; that is,

Mk(Xk) − Mk(Xk + Sk) ≥ (κfcd/2) ‖∇Mk(Xk)‖ min{ ‖∇Mk(Xk)‖/‖∇²Mk(Xk)‖, ∆k }.

Assumption 5. There exists a positive constant κbhm such that for every iteration k and every x ∈ X, the model Hessian ∇²Mk(x) satisfies

P{ limk→∞ ‖∇²Mk(x)‖ ≤ κbhm } = 1.

Assumption 6. The "lower-bound sequence" {λk} is chosen to satisfy k^{1+ϵ} = O(λk) for some ϵ > 0.
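Assumption 4 is satisfiable constructively: the exact minimizer of the model along the negative model gradient, restricted to the trust region (the Cauchy point), achieves at least half of the ideal model decrease. The sketch below, using NumPy, is illustrative only; the quadratic model here is an arbitrary stand-in, not one produced by ASTRO-DF's interpolation step.

```python
import numpy as np

def cauchy_step(g, H, delta):
    """Exact minimization of the quadratic model m(s) = g's + 0.5 s'Hs along
    s = -t*g, t >= 0, subject to ||s|| <= delta (the Cauchy point)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    t_max = delta / gnorm                 # step length that reaches the boundary
    gHg = g @ H @ g
    if gHg <= 0.0:
        t = t_max                         # model not convex along -g: go to boundary
    else:
        t = min(gnorm ** 2 / gHg, t_max)  # unconstrained minimizer, clipped
    return -t * g

def model_decrease(g, H, s):
    """Predicted decrease m(0) - m(s) for the quadratic model."""
    return -(g @ s + 0.5 * s @ H @ s)
```

One can check numerically that the resulting decrease is at least (1/2) ‖g‖ min{‖g‖/‖H‖, ∆}, i.e., the Cauchy decrease of Assumption 4 with κfcd = 1.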
Assumption 4 holds automatically for linear interpolation models, since the model Hessian ∇²Mk(x) = 0. For quadratic interpolation models, it can be shown that if Sk is chosen as Sk = −tC ∇Mk(Xk), where

tC = arg min_{α∈[0,∆k]} Mk(Xk − α∇Mk(Xk)),

then Sk satisfies a ½-Cauchy decrease. Steps Sk based on an inexact line search can also be accommodated so as to satisfy a κfcd-Cauchy decrease. (See Section 10.1 in [22] for additional details.) The constant κbhm appearing in Assumption 5 is an algorithm parameter; Assumption 5 can be enforced through a check that is executed each time the model is constructed or updated. Likewise, Assumption 6 imposes a (weak) minimum rate of increase on the sample sizes used for the estimation and model-construction operations within ASTRO-DF.

Remark 5. It is our view that the minimum rate of increase of the lower-bound sequence {λk} can be reduced to a logarithmic rate instead of what has been assumed in Assumption 6. In the notation of Theorem 3.7, this will require a large-deviation-type bound on the tail probability P{|X̄N| > t}, after assuming the existence of the moment-generating function of the random variables comprising X̄N. To the best of our knowledge, there currently exist no such results for fixed-width confidence-interval stopping, which is the context of Theorem 3.7.

6.1. Main Results. We start with a result demonstrating that when Assumptions 1-6 hold, the sequence of function estimates observed across the iterates is almost surely bounded from below; that is, mischance cannot lead ASTRO-DF's iterates to wander in an unbounded fashion.

Lemma 6.1. Let Assumptions 1-6 hold. Then

P{ limk→∞ F̄(Xk, Nk) = −∞ } = 0.
Proof. We know from Assumption 1 that f is bounded from below. Hence, for any c > 0, we can write

(6.1) P{ limk→∞ F̄(Xk, Nk) = −∞ } ≤ P{ |F̄(Xk, Nk) − f(Xk)| ≥ c i.o. }.

However,

(6.2) P{ |F̄(Xk, Nk) − f(Xk)| ≥ c } = E[ P{ |F̄(Xk, Nk) − f(Xk)| ≥ c | Fk } ]
      ≤ E[ c⁻² E[ (F̄(Xk, Nk) − f(Xk))² | Fk ] ],

where the last inequality follows from the Chebyshev bound [12]. Now invoke part (iv) of Theorem 3.7 with the sample-size expression in (5.1) to notice that

(6.3) E[ (F̄(Xk, Nk) − f(Xk))² | Fk ] ∼ κoas² ∆k⁴ λk⁻¹

as λk → ∞. Since the limit in (6.3) turns out to be uniform in k due to Assumption 3, we can write, for any δ > 0 and large enough k, that

(6.4) E[ (F̄(Xk, Nk) − f(Xk))² | Fk ] ≤ (1 + δ) κoas² ∆k⁴ λk⁻¹.
Since ∆k ≤ ∆max, we see from (6.2) and (6.4) that for large enough k,

(6.5) P{ |F̄(Xk, Nk) − f(Xk)| ≥ c } ≤ c⁻² (1 + δ) κoas² ∆max⁴ λk⁻¹.
The right-hand side of (6.5) is summable since k^{1+ϵ} = O(λk) for some ϵ > 0; we can thus invoke the first Borel-Cantelli lemma [12] and conclude that the right-hand side of (6.1) is zero.

Next, we state a theorem that plays a crucial role in proving the overall convergence of ASTRO-DF's iterates. Recall that in deterministic derivative-free trust-region algorithms, unlike trust-region algorithms where derivative observations are available, the trust-region radius necessarily needs to converge to zero for successful convergence. Theorem 6.2 states that this is indeed the case for ASTRO-DF. The proof rests on Lemma 6.1 and the sufficient Cauchy decrease guaranteed during Step 2 of Algorithm 1.

Theorem 6.2. Let Assumptions 1-6 hold. Then ∆k → 0 wp1 as k → ∞.

Proof. In what follows, let F̃(Xi(ω), Ni(ω)) denote the function estimate at the point Xi before entering the model improvement step during the ith iteration (equivalently, the function estimate at the candidate point at the end of the (i−1)th iteration), and F̄(Xi(ω), Ni(ω)) the function estimate upon exiting the model improvement step during the ith iteration. We can then write

(6.6) F̄(Xk(ω), Nk) = F̄(X1(ω), N1) + Σ_{i=1}^{k−1} (A1i + A2i),

where the summands are A1i = F̃(Xi+1(ω), Ni+1(ω)) − F̄(Xi(ω), Ni(ω)) and A2i = F̄(Xi+1(ω), Ni+1(ω)) − F̃(Xi+1(ω), Ni+1(ω)). We now make the following observations about A1i and A2i.

(a) If i is an unsuccessful iteration, then A1i = 0, since Xi(ω) = Xi+1(ω).

(b) If i is a successful iteration, we know by definition that ρ̂i ≥ η1. If we denote κefd = (2µ)⁻¹ η1 κfcd min{ (µκbhm)⁻¹, 1 }, then by Assumptions 4 and 5 we have

(6.7) A1i ≤ −η1 ( Mi(Xi(ω)) − Mi(Xi(ω) + Si(ω)) )
      ≤ −η1 (κfcd/2) ‖∇Mi(Xi(ω))‖ min{ ‖∇Mi(Xi(ω))‖ / ‖∇²Mi(Xi(ω))‖, ∆i(ω) }
      ≤ −κefd ∆i²(ω).

(c) For any given c > 0, (6.5) in the proof of Lemma 6.1 assures us that P{ |A2i| > c i.o. } = 0. This implies that, except on a set of measure zero, |A2i| ≤ c for all large enough i.

Now let D := {ω : limk→∞ ∆k(ω) ≠ 0} denote the set of sample paths for which the trust-region radius does not decay to zero, and, for contraposition, suppose that D has positive measure. Consider a sample path ω0 ∈ D. Since unsuccessful iterations are necessarily contracting iterations, we can find δ(ω0) > 0 and a subsequence of successful iterations {kj} along the sample path ω0 such that ∆kj ≥ δ(ω0). Observation (b) above then implies that

(6.8) A1kj ≤ −κefd δ²(ω0).
Now the iterations kj + 1, ..., k_{j+1} − 1 are all unsuccessful iterations, implying from observation (a) above that

(6.9) A1_{kj+ℓ} = 0, ℓ = 1, 2, ..., k_{j+1} − kj − 1.

Also, by observation (c) above, choosing c = (1/3) κefd δ²(ω0), we see that for large enough i,

(6.10) |F̄(Xi(ω), Ni(ω)) − F̃(Xi(ω), Ni(ω))| ≤ (2/3) κefd δ²(ω0).

We then write, for large enough j,

(6.11) Σ_{ℓ=kj}^{k_{j+1}−1} (A1ℓ + A2ℓ) = A1_{kj} + Σ_{ℓ=kj}^{k_{j+1}−1} A2ℓ
      = A1_{kj} + F̄(X_{k_{j+1}−1}(ω), N_{k_{j+1}−1}(ω)) − F̃(X_{kj+1}(ω), N_{kj+1}(ω))
      ≤ −(1/3) κefd δ²(ω0),

where the first equality follows from observation (a) above, the second equality follows from the definition of A2ℓ, and the inequality follows from (6.8) and (6.10). The inequality in (6.11), together with the fact that there is an entire sequence {kj} of successful iterations, implies that limk→∞ F̄(Xk(ω), Nk) = −∞, contradicting Lemma 6.1. The assertion of the theorem thus holds.

Relying on Theorem 6.2, we now show that the model gradient converges to the true gradient almost surely. This, of course, does not imply that the true gradient itself converges to zero; that fact will be established subsequently.

Lemma 6.3. Let Assumptions 1-6 hold. Then ‖∇Mk(Xk) − ∇f(Xk)‖ → 0 wp1 as k → ∞.

Proof. We assume that Algorithm 2, which constructs the stochastic model Mk, terminates in finite time. (We prove that this assumption holds through Lemma C.1 in the Appendix.) In Step 1 of Algorithm 2, let ∆̃k w^{jk−1} denote the trust-region radius over which the model is constructed. (Note that, due to Step 2 of Algorithm 2, ∆̃k w^{jk−1} may or may not equal the ending trust-region radius ∆k upon completion of k iterations of ASTRO-DF.) Then we know from part (ii) of Lemma 3.8 that

‖∇Mk(Xk) − ∇f(Xk)‖ ≤ κ1 (∆̃k w^{jk−1})^θ + κ2 (∆̃k w^{jk−1})⁻¹ √( Σ_{i=2}^p (Eki − Ek1)² ),

where Ek1 = F̄(Xk, N(Xk)) − f(Xk) is the error of the sampled function estimate at the center point of the trust region, and Eki = F̄(Yi, N(Yi)) − f(Yi), i = 2, ..., p, are the errors of the sampled function estimates at the interpolation points. (Note that p = d + 1 and θ = 1 for linear interpolation models, and p = (d + 1)(d + 2)/2 and θ = 2 for quadratic interpolation models. For the quantities κ1 and κ2, refer to part (ii) of Lemma 3.8.) For readability, we let Xk = Y1.

We know from Theorem 6.2 that ∆k → 0 wp1 as k → ∞, and hence ∆̃k w^{jk−1} → 0 wp1 as k → ∞. Also, √( Σ_{i=2}^p (Eki − Ek1)² ) ≤ Σ_{i=2}^p |Eki − Ek1|.
Considering these two observations, it suffices to show that, as k → ∞,

(6.12) (∆̃k w^{jk−1})⁻¹ Σ_{i=2}^p |Eki − Ek1| → 0 wp1.

Towards this, we write, for c > 0, large enough k, and some δ > 0,

(6.13) P{ (∆̃k w^{jk−1})⁻¹ Σ_{i=2}^p |Eki − Ek1| ≥ c } ≤ Σ_{i=2}^p E[ P{ |Eki − Ek1| ≥ c ∆̃k w^{jk−1} | Fk } ]
      ≤ 2(p − 1) c⁻² (∆̃k w^{jk−1})⁻² (1 + δ) κias² (∆̃k w^{jk−1})⁴ λk⁻¹
      ≤ 2(p − 1) c⁻² (1 + δ) κias² ∆k² λk⁻¹,

where the penultimate inequality follows from arguments identical to those leading to (6.5) in the proof of Lemma 6.1, after using the adaptive sample-size expression in (5.2). Since the right-hand side of (6.13) is summable, we can invoke the first Borel-Cantelli lemma [12] to conclude that (6.12) holds.

We now show that, for large enough k, the iterations of ASTRO-DF are successful with probability one. This result is important in that it implies that the model gradient and the trust-region radius remain in lock-step for large k, almost surely. The proof proceeds by dividing the model error into three components, each of which is shown to be controlled with probability one.

Theorem 6.4. Let Assumptions 1-6 hold. Then P{ |ρ̂k − 1| ≥ 1 − η1 i.o. } = 0 for any η1 ∈ (0, 1).

Proof. At the end of Step 1 of Algorithm 2, let mk^{(jk)}(z) be the interpolation model of f constructed on the poised set Yk. (Of course, we cannot construct mk^{(jk)}(·) explicitly because the true function values are unknown.) Then mk^{(jk)}(z) is a (κef, κeg)-fully-linear model of f on B(Xk; ∆̃k w^{jk−1}), and since ∆k ≥ ∆̃k w^{jk−1}, by the lemma in
[22, p. 200] we have that mk(·) is a (κef, κeg)-fully-linear model of f on B(Xk; ∆k). In addition, Algorithm 2 ensures that ∆k ≤ µ ‖∇Mk(Xk)‖. Assumption 4 on the Cauchy decrease in the trust-region subproblem then implies that

(6.14) Mk(Xk) − Mk(Xk + Sk) ≥ (κfcd/2) ‖∇Mk(Xk)‖ min{ ‖∇Mk(Xk)‖/‖∇²Mk(Xk)‖, ∆k }
      ≥ (κfcd/2) ‖∇Mk(Xk)‖ min{ ∆k/(µκbhm), ∆k } ≥ κmd ∆k²,

where κmd = (2µ)⁻¹ κfcd min{ (µκbhm)⁻¹, 1 }. Recall that an iteration is unsuccessful when

ρ̂k := [ F̄(Xk, Nk) − F̄(Xk + Sk, N(Xk + Sk)) ] / [ Mk(Xk) − Mk(Xk + Sk) ] < η1,

and that F̄(Xk, Nk) = Mk(Xk). Now, using Bonferroni's inequality and (6.14), we can write

(6.15) P{ ρ̂k < η1 } = P{ |1 − ρ̂k| ≥ 1 − η1 }
      ≤ P{ |F̄(Xk + Sk, N(Xk + Sk)) − Mk(Xk + Sk)| ≥ (1 − η1) κmd ∆k² }
      ≤ P{ Err1 ≥ η′ ∆k² } + P{ Err2 ≥ η′ ∆k² } + P{ Err3 ≥ η′ ∆k² },
where Err1 := |Mk(Xk + Sk) − mk(Xk + Sk)|, Err2 := |mk(Xk + Sk) − f(Xk + Sk)|, Err3 := |f(Xk + Sk) − F̄(Xk + Sk, N(Xk + Sk))|, and η′ = 3⁻¹ (1 − η1) κmd. (It is useful to interpret the three errors Err1, Err2, and Err3 on the right-hand side of (6.15) as the stochastic interpolation error, the deterministic model error, and the stochastic sampling error, respectively.) In what follows, we establish P{ ρ̂k < η1 i.o. } = 0 by demonstrating that each of the errors Err1, Err2, Err3 exceeding η′ ∆k² infinitely often has probability zero.

We first analyze the stochastic interpolation error probability P{ Err1 ≥ η′ ∆k² } appearing on the right-hand side of (6.15). Using part (i) of Lemma 3.8 and relabeling Xk as Y1 for readability, we write

(6.16) P{ Err1 > η′ ∆k² } ≤ P{ max_{Yi∈Yk, i=1,2,...,p} |F̄(Yi, N(Yi)) − f(Yi)| > η′ ∆k² }
      ≤ Σ_{i=1}^p P{ |F̄(Yi, N(Yi)) − f(Yi)| > η′ ∆k² }
      ≤ Σ_{i=1}^p E[ P{ |F̄(Yi, N(Yi)) − f(Yi)| > η′ ∆k² | Fk } ].

Now, using (6.16), arguments identical to those leading to (6.5) in the proof of Lemma 6.1, and the sample-size expression in Step 1(b) of Algorithm 2, we can say for large enough k and some δ > 0 that

(6.17) P{ Err1 > η′ ∆k² } ≤ p (η′ ∆k²)⁻² (1 + δ) κias² ∆k⁴ λk⁻¹ ≤ p η′⁻² (1 + δ) κias² λk⁻¹.

Since λk is chosen so that k^{1+ϵ} = O(λk) for some ϵ > 0, we see that (6.17) implies that P{ |F̄(Yi, N(Yi)) − f(Yi)| > η′ ∆k² i.o. } = 0. This in turn implies from (6.16) that

(6.18) P{ |Mk(Xk + Sk) − mk(Xk + Sk)| ≥ η′ ∆k² i.o. } = 0.

Next, we analyze the deterministic model error probability P{ Err2 ≥ η′ ∆k² } appearing on the right-hand side of (6.15). Since mk(·) is a fully-linear model of f on B(Xk; ∆k), if η1 is chosen so that η′ = (1/3)(1 − η1) κmd > κef, we have

(6.19) P{ |mk(Xk + Sk) − f(Xk + Sk)| ≥ η′ ∆k² i.o. } = 0.

Finally, we analyze the stochastic sampling error probability P{ Err3 ≥ η′ ∆k² } appearing on the right-hand side of (6.15). Using arguments identical to those leading to (6.5) in the proof of Lemma 6.1, it is seen that

(6.20) P{ |f(Xk + Sk) − F̄(Xk + Sk, N(Xk + Sk))| ≥ η′ ∆k² i.o. } = 0.

Conclude from (6.18), (6.19), and (6.20) that each of the errors Err1, Err2, Err3 exceeding η′ ∆k² infinitely often has probability zero, and so the assertion of Theorem 6.4 holds.

Lemma 6.5. If there exists a constant κlbg > 0 such that ‖∇Mk(Xk)‖ ≥ κlbg for all large k, then there exists a constant κlbd > 0 such that ∆k ≥ κlbd for all large k with probability one.
Proof. Let Kg(ω) > 0 be such that ‖∇Mk(Xk)(ω)‖ ≥ κlbg if k > Kg(ω). From Theorem 6.4, we know that there exists Ks(ω) > 0 such that k is a successful iteration (with probability one) if k > Ks(ω). Since every iteration k ≥ Ks(ω) is successful, we have ∆̃k+1(ω) = min{γ1 ∆k(ω), ∆max} ≥ ∆k(ω) for all k ≥ Ks(ω). For k − 1 ≥ max{Kg(ω), Ks(ω)}, consider the two cases below when Algorithm 2 starts.

Case I (∆̃k(ω) ≥ µ ‖∇Mk(Xk)(ω)‖): Since ∆̃k(ω) ≥ µ ‖∇Mk(Xk)(ω)‖, the inner loop of Algorithm 2 is entered, implying that the resulting trust-region radius satisfies ∆k ≥ min(β, µ) ‖∇Mk(Xk)‖ ≥ min(β, µ) κlbg.

Case II (∆̃k(ω) < µ ‖∇Mk(Xk)(ω)‖): In this scenario, the inner loop of Algorithm 2 is not entered, implying that ∆k = ∆̃k = γ1 ∆k−1; that is, the trust-region radius expands from the previous iteration.

Case I and Case II iterations are mutually exclusive and collectively exhaustive. Case I iterations imply, under the assumed postulates, that ∆k ≥ min(β, µ) κlbg; Case II iterations result in an expanded trust-region radius. Conclude from these assertions that ∆k(ω) ≥ min{ min(β, µ) κlbg, ∆_{max{Kg,Ks}}(ω) }.

We are now fully set up to demonstrate that ASTRO-DF's iterates converge to a first-order critical point with probability one.

Theorem 6.6. Let Assumptions 1-6 hold. Then ‖∇f(Xk)‖ → 0 wp1 as k → ∞.

Proof. Lemma 6.5 and Theorem 6.2 together imply that lim inf_{k→∞} ‖∇Mk(Xk)‖ = 0. This, along with Lemma 6.3, implies that lim inf_{k→∞} ‖∇f(Xk)‖ = 0 almost surely. We now use the lim-inf convergence just established to prove the assertion of Theorem 6.6 through a contrapositive argument. Suppose there is a subsequence of iterations {ti} such that ‖∇f(Xti)‖ > 2ϵ. Due to the lim-inf-type convergence just established, there exists another subsequence of iterations {li} such that ‖∇f(Xli)‖ < ϵ; furthermore, let the subsequences {li} and {ti} be chosen so that ϵ ≤ ‖∇f(Xk)‖ ≤ 2ϵ for k ∈ {li + 1, ..., ti − 1}.

By Lemma 6.3, we then have for large enough i that ϵ ≤ ‖∇Mk(Xk)‖ ≤ 2ϵ, where k ∈ {li + 1, ..., ti − 1}. Using Theorem 6.4, choose K1 ∈ N such that all iterations k ≥ K1 are successful. This implies that

(6.21) F̄(Xk, Nk) − F̄(Xk+1, Nk+1) ≥ η1 (κfcd/2) ‖∇Mk(Xk)‖ min{ ‖∇Mk(Xk)‖/κbhm, ∆k }
      ≥ η1 (κfcd/2) ϵ min{ ϵ/κbhm, ∆k }.

Also, by Theorem 6.2, there exists K2 ∈ N such that ∆k ≤ κbhm⁻¹ ϵ for all k ≥ K2. So, for k ≥ max{K1, K2}, we can use (6.21) to write

(6.22) ∆k ≤ 2 ( F̄(Xk, Nk) − F̄(Xk+1, Nk+1) ) (η1 ϵ κfcd)⁻¹.
Now,

(6.23) ‖Xti − Xli‖ ≤ Σ_{j=li}^{ti−1} ‖Xj − Xj+1‖ ≤ Σ_{j=li}^{ti−1} ∆j
      ≤ (2/(η1 ϵ κfcd)) Σ_{j=li}^{ti−1} ( F̄(Xj, Nj) − F̄(Xj+1, Nj+1) )
      = (2/(η1 ϵ κfcd)) ( F̄(Xli, Nli) − F̄(Xti, Nti) ),
where the second inequality in (6.23) follows from (6.22). We know that for k ≥ K1, F̄(Xk, Nk) is monotone decreasing and bounded from below (by Lemma 6.1), and therefore F̄(Xli, Nli) − F̄(Xti, Nti) → 0 wp1 as i → ∞. This, along with (6.23), implies that ‖Xti − Xli‖ → 0 wp1. Conclude, since the gradient function ∇f(·) is continuous, that ‖∇f(Xti) − ∇f(Xli)‖ → 0 wp1 as i → ∞. We have arrived at a contradiction, since our choice of the subsequences {ti} and {li} guarantees ‖∇f(Xti) − ∇f(Xli)‖ ≥ ϵ.

7. FURTHER REMARKS AND DISCUSSION. Over the last decade or so, derivative-free trust-region algorithms have deservedly enjoyed great attention and success in the deterministic optimization context. Analogous algorithms for the now widely prevalent and important Monte Carlo simulation optimization context, where only stochastic function oracles are available, are poorly studied. This paper develops a class of adaptive sampling trust-region derivative-free algorithms (called ASTRO-DF) for solving low- to moderate-dimensional simulation optimization problems. The key idea within ASTRO-DF is to endow a derivative-free trust-region algorithm with an adaptive sampling strategy for function estimation. The extent of such sampling at a visited point depends on the estimated proximity of the point to a solution, calculated by balancing the estimated standard error of the function estimate with a certain power of the incumbent trust-region radius. So, just as one might expect of efficient algorithms, Monte Carlo sampling in ASTRO-DF tends to be low during the early iterations compared to later iterations, when the visited points are more likely to be close to a first-order critical point. More importantly, however, the schedule of sampling is not predetermined (as in most stochastic approximation and sample-average approximation algorithms) but instead adapts to the prevailing algorithm trajectory and the needed precision of the function estimates. We have shown that ASTRO-DF's iterates exhibit global convergence to a first-order critical point with probability one.
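The interplay just described (adaptive sampling in lock-step with ∆k², a Cauchy-quality step, and the ρ̂k acceptance test) can be illustrated on a toy one-dimensional problem. The sketch below is a deliberately simplified caricature, not the ASTRO-DF implementation: the two-point "model", the parameter values, and the floor and cap on the radius are hypothetical choices made for brevity and bounded runtime.

```python
import math
import random

def noisy_f(x, rng):
    """Stochastic oracle for f(x) = x**2 with additive Gaussian noise."""
    return x * x + rng.gauss(0.0, 0.05)

def estimate(x, delta, lam, rng, kappa=1.0):
    """Adaptive sampling: replicate until the estimated standard error of the
    sample mean drops below kappa * delta**2 / sqrt(lam), cf. (5.1)/(5.2)."""
    tol = kappa * delta ** 2 / math.sqrt(lam)
    ys = [noisy_f(x, rng) for _ in range(max(2, math.ceil(lam)))]
    while True:
        n = len(ys)
        m = sum(ys) / n
        var = sum((y - m) ** 2 for y in ys) / (n - 1)
        if math.sqrt(var / n) <= tol:
            return m
        ys.extend(noisy_f(x, rng) for _ in range(n))  # grow sample in batches

def astro_df_toy(x, delta=1.0, iters=15, eta1=0.1, g1=1.5, g2=0.5, seed=1):
    """Caricature of the four repeating ASTRO-DF steps in one dimension."""
    rng = random.Random(seed)
    for k in range(iters):
        lam = (k + 1) ** 1.1                    # lower-bound sequence k**(1+eps)
        # Step 1: crude linear model from estimates at x and x + delta
        f0 = estimate(x, delta, lam, rng)
        f1 = estimate(x + delta, delta, lam, rng)
        grad = (f1 - f0) / delta
        # Step 2: Cauchy-like step to the trust-region boundary
        s = -delta if grad > 0 else delta
        pred = abs(grad) * delta                # model decrease M(x) - M(x + s)
        # Step 3: adaptive re-estimation at the candidate point
        fc = estimate(x + s, delta, lam, rng)
        # Step 4: success ratio and trust-region update
        rho = (f0 - fc) / pred if pred > 0 else 0.0
        if rho >= eta1:
            x, delta = x + s, min(g1 * delta, 2.0)
        else:
            delta = max(g2 * delta, 0.05)       # floor keeps sampling cost bounded
    return x, delta
```

With a fixed seed, the iterate tends to drift toward the critical point x* = 0 while the radius and the per-iterate sample sizes adjust automatically; early iterations sample little, and sampling intensifies only as the radius, and hence the required precision, shrinks.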
While the proofs we have presented are detailed, convergence follows from two key features of ASTRO-DF: (i) the stochastic interpolation models are constructed across iterates in such a way that the error in the stochastic interpolation model is guaranteed to remain in lock-step with (a certain power of) the trust-region radius; and (ii) the optimization within the trust-region step is performed so as to guarantee a Cauchy reduction. Remarkably, features (i) and (ii) together ensure that the sequence of trust-region radii necessarily converges to zero with probability one, and that the model gradient, the true gradient, and the trust-region radius all remain in lock-step, thus guaranteeing convergence to a first-order critical point with probability one. The key driver of efficiency in both (i) and (ii) is adaptive sampling, which makes all sample sizes within ASTRO-DF stopping times that depend explicitly on the algorithm trajectory.

Three other points are worthy of mention. (i) Our proofs have demonstrated global convergence to first-order critical points. Corresponding proofs of convergence to a second-order critical point can be obtained in an identical fashion by driving some measure of second-order stationarity, instead of the model gradient, to zero. We speculate that the corresponding proofs do not involve much more insight than what has been presented here for first-order convergence. (ii) The adaptive sampling ideas and the ensuing proofs we have presented in this paper are for the specific case of stochastic interpolation models. It seems to us, however, that the methods of proof presented in this paper can be
co-opted (with care) into other, potentially more powerful, model construction ideas such as regression [14] and kriging [3, 53]. Which of these ideas results in the best derivative-free trust-region algorithms (as measured by finite-time and asymptotic efficiency) remains to be seen. (iii) We have presented no proof that ASTRO-DF's iterates achieve the Monte Carlo canonical rate [4]. Demonstrating that ASTRO-DF's iterates achieve the canonical rate will rely on rate results for derivative-free trust-region algorithms in the deterministic context, some of which are only now appearing [25]. We speculate, however, that ASTRO-DF's iterates do enjoy the canonical rate, as our analogous work [28] in a different context has demonstrated.

Appendix A. Proof of part (i) of Lemma 3.8. We know that for all z ∈ B(Y1; ∆),

m(z) = Σ_{i=1}^p li(z) f(Yi);    M(z) = Σ_{i=1}^p li(z) F̄(Yi, N(Yi)),

where the li(z) are the Lagrange polynomials associated with the set Y. Since Y is Λ-poised in B(Y1; ∆), we know (see Chapter 3 in [22]) that

(A.1) Λ ≥ Λl = max_{i=1,2,...,p} max_{z∈B(Y1;∆)} |li(z)|.

Now write, for z ∈ B(Y1; ∆),

|M(z) − m(z)| = | Σ_{i=1}^p li(z) ( F̄(Yi, N(Yi)) − f(Yi) ) |
      ≤ p Λl max_{i∈{1,2,...,p}} |F̄(Yi, N(Yi)) − f(Yi)|
      ≤ p Λ max_{i∈{1,2,...,p}} |F̄(Yi, N(Yi)) − f(Yi)|,

where the last inequality follows from (A.1).

Appendix B. Proof of part (ii) of Lemma 3.8. If the model M(·) is a stochastic linear interpolation model, we see that for i = 1, 2, ..., p,

(B.1) (Yi − Y1)ᵀ ∇M(Y1) = M(Yi) − M(Y1) = f(Yi) − f(Y1) + Ei − E1.

Now re-trace the steps of the proof on pages 26 and 27 of [22], while carrying the additional term Ei − E1 appearing on the right-hand side of (B.1). If the model M(·) is a stochastic quadratic interpolation model, we write, for z ∈ B(Y1, ∆), M(z) = c + zᵀg + ½ zᵀHz = f(z) + ef(z); ∇M(z) = Hz + g = ∇f(z) + eg(z); ∇²M(z) = H = ∇²f(z) + eH(z). Now, subtracting the expression for M(z) from that for M(Yi), i = 1, ..., p + 1, we get

(B.2) (Yi − z)ᵀ g + ½ (Yi − z)ᵀ H (Yi − z) + (Yi − z)ᵀ Hz = f(Yi) − f(z) + Ei − ef(z).
Notice that the equation in (B.2) is identical to the corresponding equation on page 53 of [22] except for the extra term E^i appearing on the right-hand side of (B.2). Re-trace the steps on pages 53, 54, and 55 of [22].

Appendix C. Model Improvement Algorithm Termination. We demonstrate through the following result that the model improvement algorithm (Algorithm 2) terminates with probability one whenever the incumbent solution X_k is not a first-order critical point.

Lemma C.1. Suppose the incumbent solution $X_k \in \mathcal{X}$ during the kth iteration is not first-order critical, that is, $\nabla f(X_k) \neq 0$. Then Algorithm 2 terminates in a finite number of steps with probability one.

Proof. Set $\nabla f(X_k) = c \neq 0$. We prove the assertion by contradiction. First, notice that the contraction loop (Step 1) in Algorithm 2 is not entered if $\mu \|\nabla M_k(X_k)\| \ge \tilde\Delta_k$, in which case Algorithm 2 terminates trivially. Next, suppose $\mu \|\nabla M_k(X_k)\| < \tilde\Delta_k$ and that the contraction loop in Step 1 of Algorithm 2 is infinite. Let $\nabla M_k^{(j)}(X_k)$ denote the model gradient during the jth iteration of the contraction loop. Then $\mu \|\nabla M_k^{(j)}(X_k)\| < \tilde\Delta_k w^{j-1}$ for all $j \ge 1$. Since $w < 1$, this means that $\tilde\Delta_k w^{j-1} \to 0$ and therefore $\|\nabla M_k^{(j)}(X_k)\| \xrightarrow{w.p.1} 0$ as $j \to \infty$. Furthermore, due to the sampling rule in (5.2) and by Theorem 3.6, we have that $N(Y^{i(j)}) \to \infty$ as $j \to \infty$, where $N(Y^{i(j)})$ is the sample size at point $Y^{i(j)}$ after the jth iteration of the contraction loop. Now, if $E_k^{i(j)}$ denotes the error due to sampling at point $Y^{i(j)}$ after the jth iteration of the contraction loop, that is, $E_k^{i(j)} = \bar F(Y^{i(j)}, N(Y^{i(j)})) - f(Y^{i(j)})$, we can write for large enough k and some $\delta > 0$,
\[
(C.1) \quad P\Bigg\{ \frac{\sum_{i=1}^{p} \big|E_k^{i(j)} - E_k^{1(j)}\big|}{\tilde\Delta_k w^{j-1}} \ge c \,\Big|\, \mathcal{F}_k \Bigg\} \le \sum_{i=1}^{p} E\Big[ P\Big\{ \big|E_k^{i(j)} - E_k^{1(j)}\big| \ge c\,\tilde\Delta_k w^{j-1} \,\Big|\, \mathcal{F}_k \Big\} \Big] \le 2(p-1)c^{-2}\big(\tilde\Delta_k w^{j-1}\big)^{-2}(1+\delta)\kappa_{ias}^2 \Delta_k^4 \lambda_k^{-1}
\]
\[
(C.2) \quad \le 2(p-1)c^{-2}(1+\delta)\kappa_{ias}^2 \Delta_k^2 \lambda_k^{-1},
\]
where the penultimate inequality above follows from arguments identical to those leading to (6.5) in the proof of Lemma 6.1, after using the adaptive sample size expression in (5.2). Since the right-hand side of (C.1) is summable, we conclude by the first Borel-Cantelli lemma [12] that
\[
\big(\tilde\Delta_k w^{j-1}\big)^{-1} \sum_{i=2}^{p} \big|E_k^{i(j)} - E_k^{1(j)}\big| \xrightarrow{w.p.1} 0.
\]
This implies, from Lemma 3.8 and since Algorithm 2 maintains full linearity, that
\[
\big\|\nabla f(X_k) - \nabla M_k^{(j)}(X_k)\big\| \le \kappa_1 \big(\tilde\Delta_k w^{j-1}\big)^{\theta} + \kappa_2 \big(\tilde\Delta_k w^{j-1}\big)^{-1} \sum_{i=1}^{p} \big|E_k^{i(j)} - E_k^{1(j)}\big| \xrightarrow{w.p.1} 0
\]
as $j \to \infty$, where $\theta$, $\kappa_1$, and $\kappa_2$ are from Lemma 3.8. We have arrived at a contradiction, since we argued that $\|\nabla M_k^{(j)}(X_k)\| \xrightarrow{w.p.1} 0$ while $\|\nabla f(X_k)\| = \|c\| \neq 0$ by assumption.
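The termination argument above can be caricatured in a few lines of code. In the sketch below, the constants mu, w, and kappa and the model-gradient error are hypothetical stand-ins for the quantities in Lemma 3.8 and the sampling rule (5.2); this is an illustration of the mechanism, not the paper's implementation:

```python
# Hedged sketch: why the contraction loop in Algorithm 2 terminates when
# grad f(X_k) = c != 0. The model-gradient error is modeled as kappa*delta,
# mimicking a fully linear model whose accuracy (including sampling error)
# improves as the trust-region radius shrinks.

def contraction_loop(true_grad_norm, delta0, mu=1.0, w=0.5, kappa=10.0):
    """Contract delta until mu * ||grad M|| >= delta, returning the count."""
    delta, iterations = delta0, 0
    while True:
        # Model gradient norm: true norm perturbed by an O(delta) error.
        model_grad_norm = max(true_grad_norm - kappa * delta, 0.0)
        if mu * model_grad_norm >= delta:   # exit test of Step 1
            return iterations
        delta *= w                          # contract: delta_j = delta0 * w^j
        iterations += 1

# With c = ||grad f(X_k)|| > 0, the loop exits after finitely many
# contractions, since delta -> 0 while the model gradient norm approaches c.
steps = contraction_loop(true_grad_norm=0.3, delta0=1.0)
assert steps < 50
```

If instead the true gradient were zero, the exit test would never be met in this caricature, matching the lemma's restriction to non-critical incumbents.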
REFERENCES

[1] O. Alagoz, A. J. Schaefer, and M. S. Roberts. Optimization in organ allocation. In P. Pardalos and E. Romeijn, editors, Handbook of Optimization in Medicine. Kluwer Academic Publishers, 2009.
[2] B. D. Amos, D. R. Easterling, L. T. Watson, W. I. Thacker, B. S. Castle, and M. W. Trosset. Algorithm XXX: QNSTOP quasi-Newton algorithm for stochastic optimization. 2014.
[3] B. Ankenman, B. L. Nelson, and J. Staum. Stochastic kriging for simulation metamodeling. Operations Research, 58(2):371-382, 2010.
[4] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57. Springer Science & Business Media, 2007.
[5] A. S. Bandeira, K. Scheinberg, and L. N. Vicente. Convergence of trust-region methods based on probabilistic models. SIAM Journal on Optimization, 24(3):1238-1264, 2014.
[6] F. Bastin, C. Cirillo, and P. L. Toint. Convergence theory for nonconvex stochastic programming with an application to mixed logit. Mathematical Programming, 108:207-234, 2006.
[7] F. Bastin, C. Cirillo, and P. L. Toint. An adaptive Monte Carlo algorithm for computing mixed logit estimators. Computational Management Science, 3(1):55-79, 2006.
[8] G. Bayraksan and D. P. Morton. Assessing solution quality in stochastic programs. Mathematical Programming Series B, 108:495-514, 2007.
[9] G. Bayraksan and D. P. Morton. A sequential sampling procedure for stochastic programming. Operations Research, 59(4):898-913, 2011.
[10] G. Bayraksan and P. Pierre-Louis. Fixed-width sequential stopping rules for a class of stochastic programs. SIAM Journal on Optimization, 22(4):1518-1548, 2012.
[11] M. S. Bazaraa, H. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, New York, NY, 2006.
[12] P. Billingsley. Probability and Measure. Wiley, New York, NY, 1995.
[13] S. C. Billups, J. Larson, and P. Graf. Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM Journal on Optimization, 23(1):27-53, 2013.
[14] S. C. Billups, J. Larson, and P. Graf. Derivative-free optimization of expensive functions with computational error using weighted regression. SIAM Journal on Optimization, 23(1):27-53, 2013.
[15] J. Blum. Approximation methods which converge with probability one. Annals of Mathematical Statistics, 25(2):382-386, 1954.
[16] M. Broadie, D. M. Cicek, and A. Zeevi. An adaptive multidimensional version of the Kiefer-Wolfowitz stochastic approximation algorithm. In M. D. Rossetti, R. R. Hill, B. Johansson, A. Dunkin, and R. G. Ingalls, editors, Proceedings of the 2009 Winter Simulation Conference, pages 601-612. Institute of Electrical and Electronics Engineers: Piscataway, New Jersey, 2009.
[17] M. Broadie, D. M. Cicek, and A. Zeevi. General bounds and finite-time improvement for the Kiefer-Wolfowitz stochastic approximation algorithm. Operations Research, 59(5):1211-1224, 2011.
[18] K. Chang, L. J. Hong, and H. Wan. Stochastic trust-region response-surface method (STRONG): a new response-surface framework for simulation optimization. INFORMS Journal on Computing, 25(2):230-243, 2013.
[19] R. Chen, M. Menickelly, and K. Scheinberg. Stochastic optimization using a trust-region method and random models, 2015.
[20] Y. S. Chow and H. Robbins. On the asymptotic theory of fixed-width sequential confidence intervals for the mean. The Annals of Mathematical Statistics, pages 457-462, 1965.
[21] A. R. Conn, K. Scheinberg, and L. N. Vicente. Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points. SIAM Journal on Optimization, 20(1):387-415, 2009.
[22] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization, volume 8. SIAM, 2009.
[23] G. Deng and M. C. Ferris. Adaptation of the UOBYQA algorithm for noisy functions. In Proceedings of the 38th Conference on Winter Simulation, pages 312-319. Winter Simulation Conference, 2006.
[24] G. Deng and M. C. Ferris. Variable-number sample-path optimization. Mathematical Programming, 117(1-2):81-109, 2009.
[25] R. Garmanjani, D. Júdice, and L. N. Vicente. Trust-region methods without using derivatives: Worst case complexity and the non-smooth case. 2015.
[26] M. Ghosh and N. Mukhopadhyay. Sequential point estimation of the mean when the distribution is unspecified. Communications in Statistics - Theory and Methods, pages 637-652, 1979.
[27] M. Ghosh, N. Mukhopadhyay, and P. K. Sen. Sequential Estimation. Wiley Series in Probability and Statistics, 1997.
[28] F. S. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic recursions. In Proceedings of the 2014 Winter Simulation Conference, pages 3959-3970. IEEE Press, 2014.
[29] S. G. Henderson and B. L. Nelson, editors. Simulation, volume 13 of Handbooks in Operations Research and Management Science. Elsevier, 2006.
[30] Y. T. Hou, Y. Shi, and H. D. Sherali. Applied Optimization Methods for Wireless Networks. Cambridge University Press, 2014.
[31] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462-466, 1952.
[32] S. Kim, R. Pasupathy, and S. G. Henderson. A guide to SAA. Frederick Hillier's OR Series. Elsevier, 2014.
[33] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, NY, 2003.
[34] J. Larson and S. C. Billups. Stochastic derivative-free optimization using a trust region framework. Under review at Computational Optimization and Applications, 2014.
[35] W. K. Mak, D. P. Morton, and R. K. Wood. Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24:47-56, 1999.
[36] A. Mokkadem and M. Pelletier. A generalization of the averaging procedure: The use of two-time-scale algorithms. SIAM Journal on Control and Optimization, 49:1523, 2011.
[37] E. O. Nsoesie, R. J. Beckman, S. Shashaani, K. S. Nagaraj, and M. V. Marathe. A simulation optimization approach to epidemic forecasting. PLoS ONE, 8(6):e67164, 2013.
[38] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York, NY, 1970.
[39] C. Osorio and M. Bierlaire. A surrogate model for traffic optimization of congested networks: an analytic queueing network approach. Report TRANSP-OR, 90825:1-23, 2009.
[40] R. Pasupathy. On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Operations Research, 58:889-901, 2010.
[41] R. Pasupathy and S. Ghosh. Simulation optimization: A concise overview and implementation guide. INFORMS TutORials. INFORMS, 2013.
[42] R. Pasupathy, P. W. Glynn, S. G. Ghosh, and F. S. Hashemi. How much to sample in simulation-based stochastic recursions? 2014. Under review.
[43] R. Pasupathy and S. G. Henderson. A testbed of simulation-optimization problems. In Proceedings of the 38th Conference on Winter Simulation, pages 255-263. Winter Simulation Conference, 2006.
[44] R. Pasupathy and S. G. Henderson. SimOpt: A library of simulation optimization problems. In S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, editors, Proceedings of the 2011 Winter Simulation Conference. Institute of Electrical and Electronics Engineers: Piscataway, New Jersey, 2011.
[45] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[46] M. J. D. Powell. UOBYQA: unconstrained optimization by quadratic approximation. Mathematical Programming, 92(3):555-582, 2002.
[47] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
[48] J. Royset and R. Szechtman. Optimal budget allocation for sample average approximation. Operations Research, 2011. Under review.
[49] A. Ruszczynski and A. Shapiro, editors. Stochastic Programming. Handbooks in Operations Research and Management Science. Elsevier, New York, NY, 2003.
[50] A. Shapiro, D. Dentcheva, and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, PA, 2009.
[51] J. C. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45:1839-1853, 2000.
[52] J. C. Spall. Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc., Hoboken, NJ, 2003.
[53] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, NY, 1999.
[54] F. Yousefian, A. Nedić, and U. V. Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56-67, 2012.