Uniform Convergence of Sample Average Approximation with Adaptive Importance Sampling

Jeremy Staum
Andreas Wächter
Alvaro Maggiar
Mingbin Feng
July 22, 2015
Keywords: adaptive importance sampling, adaptive multiple importance sampling, sample average approximation, parametric integration, uniform convergence

Abstract

We study sample average approximations under adaptive importance sampling. Based on a Banach-space-valued martingale strong law of large numbers, we establish uniform convergence of the sample average approximation to the function being approximated. In the optimization context, we obtain convergence of the optimal value and optimal solutions of the sample average approximation.
1 Introduction
In stochastic optimization and simulation metamodeling, we are interested in optimizing or approximating a function g : X → R given by

    g(x) = ∫_Ξ G(x, ξ) dξ.    (1)

This general parametric integral includes the special cases

    g(x) = E[F(x, ξ)] = ∫_Ξ F(x, ξ) h(ξ) dξ,    (2)

where ξ is a random vector with density h, and

    g(x) = E_x[F(ξ)] = ∫_Ξ F(ξ) h(x, ξ) dξ,    (3)
where E_x denotes an expectation taken under the assumption that ξ is a random vector with density h(x, ·). The Monte Carlo method is applicable by sampling from a density φ on Ξ with the property that φ(ξ) > 0 for any ξ such that there exists an x ∈ X with G(x, ξ) ≠ 0. Then, when ξ is sampled from φ, for all x ∈ X, the Monte Carlo estimator G(x, ξ)/φ(ξ) has expectation g(x). Suppose that, for each i = 1, ..., N, ξ_i is sampled from density φ_i. We study the convergence of the sample average approximation ĝ_N given by

    ĝ_N(x) = (1/N) Σ_{i=1}^{N} G(x, ξ_i)/φ_i(ξ_i).    (4)
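For concreteness, here is a minimal Python sketch of the estimator in Equation (4). The integrand G and the Gaussian sampling densities are hypothetical choices made only so the example runs; the key feature is that the parameter of φ_{i+1} is allowed to depend on ξ_1, ..., ξ_i, which is the adaptive importance sampling studied in this paper.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def G(x, xi):
        # Hypothetical integrand G(x, xi), chosen only so the example runs.
        return np.exp(-(xi - x) ** 2)

    x_grid = np.linspace(-1.0, 1.0, 51)   # points of X at which g is approximated
    N = 1000
    sigma = 1.5
    mu = 0.0                              # parameter of the sampling density phi_i

    ratios = np.zeros((N, x_grid.size))
    for i in range(N):
        xi = rng.normal(mu, sigma)                    # xi_i ~ phi_i = N(mu, sigma^2)
        phi_val = norm.pdf(xi, loc=mu, scale=sigma)   # phi_i(xi_i)
        ratios[i] = G(x_grid, xi) / phi_val           # G(x, xi_i)/phi_i(xi_i), all x at once
        mu = 0.9 * mu + 0.1 * xi                      # phi_{i+1} may depend on xi_1, ..., xi_i

    g_hat = ratios.mean(axis=0)                       # Equation (4) evaluated on the grid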
The convergence of ĝ_N(x) to g(x) as sample size N → ∞ for a single fixed x is relatively elementary. Results about convergence of the sample average approximation ĝ_N to the function g are also valuable. Conclusions about mean integrated squared error, a figure of merit in simulation metamodeling, are supported by an analysis of convergence of ĝ_N to g in the L2 sense (Ankenman et al., 2010; Rosenbaum and Staum, 2013). Results about uniform convergence, i.e., convergence in the L∞ sense, guarantee the quality of the simulation metamodel evaluated at any point x. Uniform convergence of the sample average approximation is used to analyze the quality of approximate solutions to stochastic optimization problems (Shapiro et al., 2009, § 5.1). When the samples ξ_1, ..., ξ_N are independent and identically distributed, uniform convergence of the sample average approximation is obtained from a uniform law of large numbers (Shapiro et al., 2009, § 7.2.5). In this paper, we establish results about the convergence of the sample average approximation ĝ_N to g under adaptive importance sampling, which makes ξ_1, ..., ξ_N neither identically distributed nor independent.

Some earlier papers have addressed convergence of the sample average approximation under general sampling, i.e., when ξ_1, ..., ξ_N are not independent and identically distributed. Dai et al. (2000) proved results about convergence of solutions to sample average approximation problems when ξ_1, ..., ξ_N are neither identically distributed nor independent, but did not discuss uniform convergence of ĝ_N to g. Korf and Wets (2001) proved a theorem about epi-convergence of ĝ_N to g, from which convergence of solutions to sample average approximation problems follows. One of their assumptions is that {ξ_i}_{i=1}^∞ forms an ergodic process, which may not be true or easy to verify in all applications. Homem-de-Mello (2008) established results on uniform convergence of ĝ_N to g, and of solutions to sample average approximation problems, under general sampling. He also particularly considered Latin hypercube sampling and quasi-Monte Carlo sampling. His results were generalized by Xu (2010). These papers assume that X is compact and that F(·, ξ) is almost surely continuous. In particular, they assume that F(·, ξ) in Equation (2) is Lipschitz continuous (Homem-de-Mello, 2008) or H-calm (Xu, 2010), and place conditions on corresponding coefficients of continuity. By employing different mathematical tools, we are able to dispense with these assumptions, assuming instead that G(·, ξ) is almost surely bounded. We use a strong law of large numbers for martingales taking values in a Banach space (Hoffmann-Jørgensen and Pisier, 1976). The Banach space is a function space that contains the functions of x, such as g and G(·, ξ).
1.1 Importance Sampling
For the stochastic formulations of the problem, represented by Equations (2) or (3), estimating g(x) by sampling from a density other than h or h(x, ·) is called "importance sampling" (IS). Royset and Polak (2004) presented a result on uniform convergence of the sample average approximation when ξ_1, ..., ξ_N are independently sampled from an identical IS distribution. This was for a particular stochastic optimization problem of the form in Equation (2) in which the original and IS distributions were both normal, and the function F was bounded. We obtain uniform convergence of the sample average approximation under more general conditions on the distributions and the function F, and for adaptive importance sampling. In adaptive IS, ξ_1, ..., ξ_N are sampled from different IS densities φ_1, ..., φ_N, respectively, and the IS distribution φ_i for ξ_i can depend on ξ_1, ..., ξ_{i−1}.

We also wish to mention multiple importance sampling (MIS), although we have not proved results about it. The idea of MIS (Veach and Guibas, 1995; Owen and Zhou, 2000) is to regard the sampling of ξ_1, ..., ξ_N as stratified sampling from the mixture distribution (1/N) Σ_{j=1}^{N} φ_j, thus justifying the estimator

    Σ_{i=1}^{N} G(x, ξ_i) / Σ_{j=1}^{N} φ_j(ξ_i)    (5)
instead of the IS estimator in Equation (4). In some situations, the IS estimator has high variance due to the possibility of the denominator φ_i(ξ_i) being very small. An advantage of MIS is that if any term in the denominator Σ_{j=1}^{N} φ_j(ξ_i) is not small, then the denominator is not small.

In MIS, the IS densities φ_1, ..., φ_N are regarded as fixed; φ_i cannot depend on ξ_1, ..., ξ_{i−1}. Cornuet et al. (2012) proposed adaptive multiple importance sampling (AMIS), which allows for such dependence. The convergence of AMIS, i.e., consistency in the statistical sense, is addressed by Marin et al. (2014) using a weak law of large numbers for triangular arrays. AMIS procedures vary in the way that they learn from simulation output and reuse it. Consistency results for some variants of AMIS are yet to be obtained (Marin et al., 2014). We are also not aware of any uniform convergence results for AMIS. We hope the present work will further research on strong and uniform convergence of AMIS.
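A matching sketch of the MIS estimator in Equation (5), reusing the same hypothetical integrand as before; here the densities φ_1, ..., φ_N are fixed in advance, as MIS requires, and the pooled denominator is small only when ξ_i is unlikely under every one of them.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def G(x, xi):
        # Same hypothetical integrand as in the previous sketch.
        return np.exp(-(xi - x) ** 2)

    x_grid = np.linspace(-1.0, 1.0, 51)
    mus = np.linspace(-2.0, 2.0, 20)      # fixed centers of phi_1, ..., phi_N (here N = 20)
    sigma = 1.5

    xs = rng.normal(mus, sigma)           # xi_i ~ phi_i = N(mus[i], sigma^2)
    # Denominator of (5): sum_j phi_j(xi_i); it is small only if xi_i is unlikely
    # under every one of the N densities.
    denom = norm.pdf(xs[:, None], loc=mus[None, :], scale=sigma).sum(axis=1)
    g_mis = (G(x_grid[None, :], xs[:, None]) / denom[:, None]).sum(axis=0)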
1.2 Applications
In the remainder of the introduction, we explain possible uses in stochastic optimization and simulation metamodeling. Shapiro et al. (2009, § 5.5.3) warn that a single importance sampling distribution "usually does not work for different points x and consequently cannot be used for a whole optimization procedure." Homem-de-Mello and Bayraksan (2015, § 9.3.4) mention successful applications of IS to particular stochastic optimization problems, but assert that "a major difficulty that arises when employing IS methods in the context of iterative algorithms for more general stochastic optimization problems is that changes in x throughout an optimization routine can result in different IS distributions for different x." We believe that the techniques of MIS and AMIS will prove valuable in overcoming this difficulty, and that for the same reason, they will be valuable in simulation metamodeling.

Initial applications of MIS and AMIS have already taken place in stochastic optimization (Maggiar et al., 2015) and simulation metamodeling (Feng and Staum, 2015). These applications are in the setting of Equation (3), and share the feature that the function F is expensive to evaluate. This creates a motivation to re-use all previous evaluations of the function F. Suppose that at iterations i = 1, ..., N of a stochastic optimization or adaptive simulation metamodeling procedure, we sampled ξ_i from h(x_i, ·) and evaluated F(ξ_i). Then we are in a situation of AMIS where we estimate g(x) by

    Σ_{i=1}^{N} F(ξ_i) h(x, ξ_i) / Σ_{j=1}^{N} h(x_j, ξ_i).
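A minimal sketch of this reuse scheme, with a hypothetical expensive output F and Gaussian conditional densities h(x, ·) standing in for the application's densities; every past evaluation of F contributes to the estimate at a new point x.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    def F(xi):
        # Stand-in for an expensive simulation output F(xi); purely illustrative.
        return np.sin(xi) ** 2

    def h(x, xi):
        # Hypothetical density h(x, .): normal with mean x and unit variance.
        return norm.pdf(xi, loc=x, scale=1.0)

    x_iterates = np.array([0.0, 0.4, 0.7, 0.9, 1.0])   # x_1, ..., x_N visited by a procedure
    xs = rng.normal(x_iterates, 1.0)                    # xi_i sampled from h(x_i, .)
    F_vals = F(xs)                                      # expensive evaluations, done once

    def g_hat(x):
        # Reuse all evaluations: sum_i F(xi_i) h(x, xi_i) / sum_j h(x_j, xi_i).
        denom = h(x_iterates[None, :], xs[:, None]).sum(axis=1)
        return np.sum(F_vals * h(x, xs) / denom)

    print(g_hat(0.5))   # estimate of g at a point x that was never sampled from directly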
2 Main Results
To recapitulate with more mathematical detail: Let X be a subset of R^n and Ξ be a subset of R^d. Let G be a function from X × R^d to R whose support is contained in X × Ξ. We do not assume X is bounded or closed. Let (Ω, G, Q) be a probability space on which there is an infinite sequence of random vectors {ξ_i}_{i=1}^∞, each ξ_i being a G-measurable function from Ω to R^d. Define {F_i}_{i=1}^∞ as the natural filtration of this sequence, i.e., F_i contains the information in ξ_1, ..., ξ_i. Suppose that under Q, for every i ∈ N, the conditional distribution of ξ_i given F_{i−1} has a density φ_i. Let Ξ_i represent the support of φ_i; this subset of R^d can be random. We are concerned with uniform convergence as N → ∞ of the sample average approximation ĝ_N defined by Equation (4) to the function g defined by Equation (1).

Assumption 1. The function g is bounded on X.

The purpose of the following assumption is to ensure that the ratios in Equation (4) are finite.

Assumption 2. With probability one, for every i ∈ N, Ξ ⊆ Ξ_i.
It is standard to assume that F in Equation (2) is a Carathéodory function, i.e., F(x, ξ) is continuous in x and measurable in ξ (Shapiro et al., 2009, p. 156). We do not assume continuity in x.

Assumption 3. For all x ∈ X, G(x, ·) is a measurable function on R^d. With probability one, for every i ∈ N, G(·, ξ_i) is a bounded function on X.

It follows from Assumption 3 that for each N ∈ N, the sample average approximation ĝ_N is a random, bounded function. That is, it is a function from Ω to the Banach space L∞(X) of bounded functions on X, equipped with the sup norm ‖·‖_∞.

Assumption 4. There exist non-negative constants k and b such that, with probability one, for every i ∈ N, for all x ∈ X and ξ ∈ Ξ_i,

    |G(x, ξ)|/φ_i(ξ) ≤ k exp(b‖ξ‖_2).

Assumptions on the unconditional moment generating function of F(x, ξ) in Equation (2), for each x ∈ X, are common in this type of analysis (Dai et al., 2000; Homem-de-Mello, 2008; Xu, 2010). In Assumption 5, we focus on the moment generating function M_i of the conditional distribution of ‖ξ_i‖_2 given F_{i−1}, given by

    M_i(s) = E[exp(s‖ξ_i‖_2) | F_{i−1}] = ∫_{Ξ_i} exp(s‖ξ‖_2) φ_i(ξ) dξ.

This M_i is a random function because the density φ_i is random. We avoid involving x in a moment generating function; uniform convergence will be obtained by combining Assumption 5 with Assumption 4's growth bound in ‖ξ‖_2, which is uniform over x ∈ X.

Assumption 5. There exists p > 1 such that Σ_{i=1}^∞ i^{−p} E[M_i(pb)] < ∞, where b is as in Assumption 4.

Section 4.1 contains an example in which we verify that these assumptions hold. The proofs of the following theorems are deferred to Section 5.

Theorem 1. If Assumptions 1, 2, 3, 4, and 5 hold, then with probability one, lim_{N→∞} ‖ĝ_N − g‖_∞ = 0.

Next we consider the convergence of optimal solutions of the optimization problem

    min_{x ∈ X} g(x)    (6)

and its sample-average approximation

    min_{x ∈ X} ĝ_N(x).    (7)
Let ϑ_* and ϑ̂_N denote the optimal objective values of (6) and (7), respectively. Similarly, let S_* and Ŝ_N denote the set of optimal solutions of (6) and (7), respectively. Finally, we define the distance of a point x ∈ X to a set B ⊆ X as dist(x, B) = inf_{x′∈B} ‖x − x′‖_2 and the deviation of a set A ⊆ X from the set B as D(A, B) = sup_{x∈A} dist(x, B).
Theorem 2. Suppose that Assumptions 1, 2, 3, 4, and 5 hold, and that g is continuous. Further assume that there exists a compact set C ⊆ X such that S_* is non-empty and contained in C, and with probability one, for N large enough, Ŝ_N is non-empty and contained in C. Then, with probability one, lim_{N→∞} ϑ̂_N = ϑ_* and lim_{N→∞} D(Ŝ_N, S_*) = 0.
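To illustrate Theorems 1 and 2 numerically, the following sketch uses a toy instance with a known target, G(x, ξ) = cos(xξ)ϕ(ξ) with ϕ the standard normal density, so that g(x) = exp(−x²/2), together with an adaptive Gaussian sampling density whose mean depends on past samples. The printed quantity is the sup-norm error over a grid of x values, which Theorem 1 says tends to zero; all specific numerical choices are illustrative assumptions, not part of the theory.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)

    # Toy instance with a closed-form target: G(x, xi) = cos(x * xi) * stdnormal_pdf(xi),
    # so that g(x) = E[cos(x Z)] = exp(-x^2 / 2) for Z standard normal.
    def G(x, xi):
        return np.cos(x * xi) * norm.pdf(xi)

    x_grid = np.linspace(-3.0, 3.0, 121)
    g_true = np.exp(-x_grid ** 2 / 2)

    N = 20000
    running = np.zeros(x_grid.size)
    mu, sigma = 0.0, 2.0     # sigma > 1, so G/phi_i admits a bound of the form in Assumption 4
    for i in range(1, N + 1):
        xi = rng.normal(mu, sigma)                        # xi_i ~ phi_i = N(mu, sigma^2)
        running += G(x_grid, xi) / norm.pdf(xi, mu, sigma)
        mu = np.clip(0.5 * mu + 0.5 * xi, -1.0, 1.0)      # adaptive center, kept in a bounded set
        if i in (100, 1000, N):
            print(i, np.abs(running / i - g_true).max())  # sup-norm error over the grid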
3 Results When Some Parameters Converge
The parameter vector x in the parametric integral (1) may include some parameters that are of primary interest and other parameters that are of auxiliary interest or are mere nuisance parameters. We write x = (x′, y), where x′ is of primary interest and y is not. We provide results relevant to sample-average approximation and optimization over x′ alone, in the case where the sample-average approximations are constructed using a convergent sequence of random values of the y parameters. For example, these values of the y parameters may represent estimators of statistical parameters, stochastic processes that converge to a limiting random variable, decisions that are updated and converge over time, etc.

To be mathematically precise, let us assume that in the framework established in Section 2, X = X′ × Y, where X′ ⊆ R^{n′} and Y ⊆ R^m for some n′ and m that sum to n. There is a sequence of random vectors {Y_N}_{N=1}^∞, each Y_N being a G-measurable function from Ω to R^m. This sequence need not be adapted to the filtration {F_i}_{i=1}^∞. We analyze problems in which this sequence converges to a limiting random variable Y_*.

Assumption 6. There exists a random variable Y_* such that lim_{N→∞} ‖Y_N − Y_*‖_2 = 0 with probability one.

We study the convergence of the sample average approximations ĝ_N^Y : Ω → L∞(X′) given by

    ĝ_N^Y(x′) = (1/N) Σ_{i=1}^{N} G((x′, Y_N), ξ_i)/φ_i(ξ_i)

to the function g^Y : Ω → L∞(X′) given by g^Y(x′) = g((x′, Y_*)). We obtain this uniform convergence under the following further assumption.

Assumption 7. There exist a function g_y : R^m → R that is continuous at 0 and a function G_ξ : Ξ → R whose integral ∫_Ξ G_ξ(ξ) dξ exists and is finite, such that, for all x′ ∈ X′, y_1, y_2 ∈ Y, and ξ ∈ Ξ,

    |G((x′, y_1), ξ) − G((x′, y_2), ξ)| ≤ g_y(y_1 − y_2) G_ξ(ξ).    (8)
The proof of the following theorem is deferred to Section 5.3.

Theorem 3. If Assumptions 1, 2, 3, 4, 5, 6, and 7 hold, then with probability one, lim_{N→∞} ‖ĝ_N^Y − g^Y‖_∞ = 0.
Finally, we consider the optimization problem

    min_{x′ ∈ X′} g^Y(x′)    (9)

and its sample-average approximation

    min_{x′ ∈ X′} ĝ_N^Y(x′).    (10)

Let ϑ_*^Y and ϑ̂_N^Y denote the optimal objective values of (9) and (10), respectively. Let S_*^Y and Ŝ_N^Y denote the set of optimal solutions of (9) and (10), respectively. Section 4.2 contains an example that we fit into this framework, and for which we verify that the assumptions of our theorems hold. Theorem 4 follows from Theorem 3 in the same way that Theorem 2 follows from Theorem 1.
Theorem 4. Suppose Assumptions 1, 2, 3, 4, 5, 6, and 7 hold, and that for all y ∈ Y, g((·, y)) is a continuous function on X′. Further assume that there exists a compact set C ⊆ X′ such that, with probability one, S_*^Y is non-empty and contained in C and, for N large enough, Ŝ_N^Y is non-empty and contained in C. Then, with probability one, lim_{N→∞} ϑ̂_N^Y = ϑ_*^Y and lim_{N→∞} D(Ŝ_N^Y, S_*^Y) = 0.
4 Examples

4.1 Reliability-Based Optimal Design
To illustrate, let us consider the problem studied by Royset and Polak (2004), describe it in our framework of Section 2, and verify that our assumptions hold. This problem fits the form of Equation (2). The original density h = ϕ, the (d-variate) standard normal density. F(x, ξ) represents the conditional failure probability of a system with design x given a random scenario described by ξ. In this problem, the sampling is independent and identically distributed: all the sampling densities φ_i are normal with mean vector µ and covariance matrix σ²I. The sampling standard deviation σ > 1 is larger than the original standard deviation.

The conditional failure probability F is clearly bounded between 0 and 1, and Royset and Polak (2004) show that it is continuously differentiable in both arguments, under their assumptions. Therefore Assumption 3 holds. It also follows that Assumption 1 holds. Assumption 2 holds, because the support of φ_i is R^d.

For Assumption 4, we consider |G(x, ξ)|/φ_i(ξ) = |F(x, ξ)| h(ξ)/φ_i(ξ). Assumption 4 holds because F is bounded and the likelihood ratio satisfies

    h(ξ)/φ_i(ξ) = ϕ(ξ)/(σ^{−d} ϕ((ξ − µ)/σ))
                = σ^d exp( −(1/2)( ‖ξ‖_2² − ‖ξ − µ‖_2²/σ² ) )
                = σ^d exp( ( ‖µ‖_2² − 2⟨µ, ξ⟩ − (σ² − 1)‖ξ‖_2² ) / (2σ²) )
                ≤ σ^d exp( ( ‖µ‖_2² + 2‖µ‖_2 ‖ξ‖_2 ) / (2σ²) ).

In Assumption 5, we consider the moment generating function M_i of the conditional distribution of ‖ξ_i‖_2 given F_{i−1}. The sampling densities of ξ_i are the same for all i, and they are not random; let M = M_i denote the moment generating function of ‖ξ_i‖_2 for all i. For any p > 1, to show that the series Σ_{i=1}^∞ i^{−p} E[M_i(pb)] < ∞, it suffices to show that M(pb) < ∞. Represent ξ_i = µ + σZ_i, where Z_i is d-variate standard normal. Then ‖ξ_i‖_2 ≤ ‖µ‖_2 + σ‖Z_i‖_2. Because pb ≥ 0,

    M(pb) ≤ E[exp(pb(‖µ‖_2 + σ‖Z_i‖_2))] = exp(pb‖µ‖_2) M̃(σpb),

where M̃ is the moment generating function of ‖Z_i‖_2. The distribution of ‖Z_i‖_2² is chi-squared with d degrees of freedom; its density at t is t^{d/2−1} exp(−t/2)/(Γ(d/2) 2^{d/2}). Using the substitution t(u) = u², the density of ‖Z_i‖_2 at u is u^{d−2} exp(−u²/2) t′(u)/(Γ(d/2) 2^{d/2}) = u^{d−1} exp(−u²/2)/(Γ(d/2) 2^{d/2−1}). From this it can be seen that M̃(s) = E[exp(s‖Z_i‖_2)] is finite for all s, and M(pb) < ∞ follows.
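Reading off constants from the bound above (and using |F| ≤ 1), one admissible choice in Assumption 4 is k = σ^d exp(‖µ‖_2²/(2σ²)) and b = ‖µ‖_2/σ². A minimal numerical check of the likelihood-ratio bound follows; the values of µ, σ, and d are arbitrary illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(4)

    d = 3
    mu = np.array([0.5, -0.2, 1.0])    # illustrative IS mean vector
    sigma = 1.5                        # sampling standard deviation, sigma > 1

    h = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))         # original density (standard normal)
    phi = multivariate_normal(mean=mu, cov=sigma ** 2 * np.eye(d))   # IS density N(mu, sigma^2 I)

    # One admissible choice of constants in Assumption 4 (using |F| <= 1):
    k = sigma ** d * np.exp(mu @ mu / (2 * sigma ** 2))
    b = np.linalg.norm(mu) / sigma ** 2

    xi = phi.rvs(size=10000, random_state=rng)            # samples from the IS density
    ratio = h.pdf(xi) / phi.pdf(xi)                       # likelihood ratio h(xi)/phi_i(xi)
    bound = k * np.exp(b * np.linalg.norm(xi, axis=1))    # k * exp(b * ||xi||_2)
    print(bool(np.all(ratio <= bound)))                   # the bound holds for every sample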
4.2 Regression Models for Step Computation in Optimization Algorithm
As a further illustration in which the importance sampling is adaptive, let us consider the problem studied by Maggiar et al. (2015). We describe the problem in our framework of Sections 2–3 and verify that the assumptions of Theorem 4 are satisfied.
Their work addresses the minimization of the function ℱ : Y → R given by

    ℱ(y) = ∫_Ξ F(ξ) σ^{−d} ϕ((ξ − y)/σ) dξ,

where Y ⊂ R^d is a compact set, Ξ = R^d, and ϕ is the (d-variate) standard normal density. The integral is finite because F : R^d → R exhibits sub-exponential growth. That is, there exist positive constants k_F and b_F such that for all ξ ∈ R^d, |F(ξ)| ≤ k_F exp(b_F ‖ξ‖_2). In this context, F(ξ) is the output of a deterministic computer simulation run with input ξ. This output is subject to numerical noise, making F a non-smooth function. The objective function value ℱ(y) is obtained by smoothing F by convolution with the normal density with mean y and covariance matrix σ²I.

The derivative-free trust-region optimization algorithm proposed in Maggiar et al. (2015) computes steps based on a model m_{x′}(·; y). The model m_{x′}(·; y) built around a point y ∈ Y approximates the simulation output F(ξ) by m_{x′}(ξ; y). Here, x′ denotes the parameters in the model, such as the coefficients x′ = (b, g, Q) in a quadratic model given by m_{x′}(ξ; y) = b + ⟨g, ξ − y⟩ + (1/2)⟨ξ − y, Q(ξ − y)⟩. Let X′ = R^{n′}. Define r(x′, y, ξ) = (m_{x′}(ξ; y) − F(ξ))², the squared error of the model in approximating the simulation output F(ξ). Parameters for a model built around y can be found by solving the regression problem

    min_{x′ ∈ X′} ∫_Ξ r(x′, y, ξ) σ^{−d} ϕ((ξ − y)/σ) dξ.
The integral here has the form of Equation (1) with G((x′, y), ξ) = r(x′, y, ξ) σ^{−d} ϕ((ξ − y)/σ). The random vector ξ_i is sampled from φ_i, the normal density with mean vector T_i and covariance matrix σ²I, i.e., φ_i(ξ) = σ^{−d} ϕ((ξ − T_i)/σ). The random vector T_i is measurable given F_{i−1}: the center T_i of the sampling density φ_i is chosen in a way determined by how the derivative-free optimization procedure uses the simulation inputs and outputs (ξ_1, F(ξ_1)), ..., (ξ_{i−1}, F(ξ_{i−1})). Maggiar et al. (2015) guarantee that every center T_i generated by the algorithm lies in Y.

For any ω ∈ Ω, let {Y_N(ω)}_{N=1}^∞ be a subsequence of the sequence of iterates chosen by the algorithm, such that {Y_N(ω)}_{N=1}^∞ converges to a limit point Y_*(ω). Such a subsequence exists, due to compactness of Y; thus Assumption 6 holds. The parameters of a model built around an iterate Y_N are chosen to be an optimal solution of the sample-average approximation problem (10). The convergence analysis of the optimization algorithm in Maggiar et al. (2015) requires establishing that lim_{N→∞} D(Ŝ_N^Y, S_*^Y) = 0, i.e., the chosen model parameters converge to optimal model parameters for a model built around Y_*.

Maggiar et al. (2015) show that there exists a compact set C ⊆ X′ such that, with probability one, S_*^Y is non-empty and contained in C and, for N large enough, Ŝ_N^Y is non-empty and contained in C. It is also shown that r : X′ × Y × R^d → R has the following properties:

• For any ξ ∈ R^d, r(·, ·, ξ) is continuous on X′ × Y.

• The growth of r in ‖ξ‖_2 is sub-exponential, i.e., there exist k_r > 0 and a_r > 0 such that |r(x′, y, ξ)| ≤ k_r exp(a_r ‖ξ‖_2) for all x′ ∈ X′, y ∈ Y, and ξ ∈ R^d.

• There exist a function R_ξ : R^d → R and constants k_ξ > 0 and a_ξ > 0 with R_ξ(ξ) ≤ k_ξ exp(a_ξ ‖ξ‖_2), a function r_{x′} : R^{n′} → R that is continuous at x′ = 0, and a function r_y : R^d → R that is continuous at y = 0, such that

    |r(x′_1, y, ξ) − r(x′_2, y, ξ)| ≤ r_{x′}(x′_1 − x′_2) R_ξ(ξ)    (11a)
    |r(x′, y_1, ξ) − r(x′, y_2, ξ)| ≤ r_y(y_1 − y_2) R_ξ(ξ)    (11b)

  for all x′, x′_1, x′_2 ∈ X′, y, y_1, y_2 ∈ Y, and ξ ∈ R^d.
Assumption 7 is expressed in (11b). Because of the smoothness of the normal density ϕ and the continuity of r in x′ and y, the function g defined in Equation (1) is continuous. It is therefore bounded on the compact set C × Y, verifying Assumption 1. For the same reasons, the integrand G(·, ξ) is bounded for each ξ ∈ R^d, and Assumption 3 is verified. Assumption 2 holds, because the support of φ_i is R^d.

In Assumption 4, we consider |G((x′, y), ξ)|/φ_i(ξ) = r(x′, y, ξ) ϕ((ξ − y)/σ)/ϕ((ξ − T_i)/σ). Assumption 4 holds because r has sub-exponential growth, and the sub-exponential growth of the likelihood ratio is shown by

    ϕ((ξ − y)/σ)/ϕ((ξ − T_i)/σ) = exp( ( ‖ξ − T_i‖_2² − ‖ξ − y‖_2² ) / (2σ²) )
                                 = exp( ( 2⟨ξ, y − T_i⟩ + ‖T_i‖_2² − ‖y‖_2² ) / (2σ²) )
                                 ≤ exp( ( 2‖ξ‖_2 ‖y − T_i‖_2 + ‖T_i‖_2² ) / (2σ²) )
                                 ≤ exp( ( D_1² + 2D_2 ‖ξ‖_2 ) / (2σ²) ),

where D_1 = sup_{t∈Y} ‖t‖_2 and D_2 = sup_{t,y∈Y} ‖y − t‖_2 are finite because Y is bounded.

In Assumption 5, we consider the moment generating function M_i of the conditional distribution of ‖ξ_i‖_2 given F_{i−1}. Represent ξ_i = T_i + σZ_i where, given F_{i−1}, T_i is known and contained in Y, and Z_i is d-variate standard normal. Then ‖ξ_i‖_2 ≤ D_1 + σ‖Z_i‖_2. Much as in Section 4.1, this implies M_i(pb) ≤ exp(pbD_1) M̃(σpb), which is a finite constant. Because p > 1, it follows that Σ_{i=1}^∞ i^{−p} E[M_i(pb)] < ∞.
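A schematic one-dimensional sketch of this adaptive sampling and regression step follows. The simulation output F and the rule for choosing the centers T_i are hypothetical stand-ins, not the algorithm of Maggiar et al. (2015); the weights follow Equation (4) with φ_i centered at T_i, and minimizing the weighted squared model error reduces to a weighted least-squares problem for the quadratic-model coefficients.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)

    def F(xi):
        # Hypothetical stand-in for the deterministic but numerically noisy simulation output.
        return np.sin(3 * xi) + 0.01 * np.sign(np.sin(100 * xi))

    sigma = 0.3
    N = 200
    T = np.empty(N)        # sampling centers T_1, ..., T_N, chosen adaptively within Y = [-1, 1]
    xi = np.empty(N)
    t = 0.0
    for i in range(N):
        T[i] = np.clip(t, -1.0, 1.0)
        xi[i] = rng.normal(T[i], sigma)         # xi_i ~ phi_i = N(T_i, sigma^2)
        t = T[i] - 0.05 * np.sign(F(xi[i]))     # hypothetical rule for choosing the next center

    y_N = T[-1]                                 # current model center Y_N in Y
    # SAA weights from Equation (4): the density of the integrand around y_N divided by
    # the sampling density phi_i that actually generated xi_i.
    w = norm.pdf(xi, y_N, sigma) / norm.pdf(xi, T, sigma)

    # Minimizing the weighted sum of squared model errors over the quadratic-model
    # coefficients x' = (b, g, q) is a weighted least-squares problem.
    A = np.column_stack([np.ones(N), xi - y_N, 0.5 * (xi - y_N) ** 2])
    coef, *_ = np.linalg.lstsq(A * np.sqrt(w)[:, None], F(xi) * np.sqrt(w), rcond=None)
    print(coef)                                 # fitted (b, g, q) for the model around y_N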
5 Proofs

5.1 Proof of Theorem 1
For all x ∈ X and i, N ∈ N, define

    U_i(x) = G(x, ξ_i)/φ_i(ξ_i) − g(x)    and    V_N(x) = Σ_{i=1}^{N} U_i(x),

so that ĝ_N − g = V_N/N. The conclusion of the Banach-space-valued strong law of large numbers (Hoffmann-Jørgensen and Pisier, 1976) is that V_N/N → 0 with probability one. Because we chose the Banach space L∞(X), the conclusion is lim_{N→∞} ‖V_N/N‖_∞ = 0 with probability one, and this is just what we want in the conclusion of Theorem 1. The remainder of this section verifies that our setting satisfies the conditions for Theorem 2.2 of Hoffmann-Jørgensen and Pisier (1976) to imply that ‖V_N/N‖_∞ → 0 with probability one. The conditions are:

• V must be a martingale whose increments satisfy Chung's condition: that there exists p > 0 such that Σ_{i=1}^∞ i^{−p} E[‖U_i‖_∞^p] < ∞. This is verified in Propositions 1 and 2.

• The Banach space must be uniformly (1 + α)-smooth for some α ∈ [0, 1]. It is easily verified that L∞(X) is uniformly 1-smooth for X ⊆ R^n. Uniform 1-smoothness means that

    ρ(t) = sup_{f_1, f_2} { (1/2)( ‖f_1 + f_2‖_∞ + ‖f_1 − f_2‖_∞ − 2 ) : ‖f_1‖_∞ = 1, ‖f_2‖_∞ = t }

  is O(t) as t → 0. This holds because the supremum is bounded above by t.

Proposition 1. If Assumptions 1, 2, 3, 4, and 5 hold, then Chung's condition holds for p > 1 as in Assumption 5.
Proof. Applying Lemma 1 with p′ = p, we get E[‖U_i‖_∞^p] ≤ ‖g‖_∞^p + k^p E[M_i(pb)]. Therefore, Chung's condition is verified once we show that Σ_{i=1}^∞ i^{−p} ‖g‖_∞^p and k^p Σ_{i=1}^∞ i^{−p} E[M_i(pb)] are both finite. Because p > 1, it follows from Assumption 1 that Σ_{i=1}^∞ i^{−p} ‖g‖_∞^p < ∞. By Assumption 5, Σ_{i=1}^∞ i^{−p} E[M_i(pb)] < ∞.

Lemma 1. If Assumptions 1, 2, 3, and 4 hold, then for any p′ ≥ 1, for every i ∈ N,

    E[‖U_i‖_∞^{p′}] ≤ ‖g‖_∞^{p′} + k^{p′} E[M_i(p′b)].

Proof. We have ‖U_i‖_∞^{p′} ≤ ‖g‖_∞^{p′} + ‖U_i + g‖_∞^{p′}. By Assumption 4, with probability one, for every i ∈ N, ‖U_i + g‖_∞ ≤ k exp(b‖ξ_i‖_2). Therefore

    E[‖U_i + g‖_∞^{p′}] = E[ E[‖U_i + g‖_∞^{p′} | F_{i−1}] ] ≤ k^{p′} E[ E[exp(p′b‖ξ_i‖_2) | F_{i−1}] ] = k^{p′} E[M_i(p′b)].

Proposition 2. If Assumptions 1, 2, 3, 4, and 5 hold, then V is a martingale.

Proof. By construction, V is adapted to the filtration F. For any step N, E[‖V_N‖_∞] ≤ Σ_{i=1}^{N} E[‖U_i‖_∞]. Applying Lemma 1 with p′ = 1, we get E[‖U_i‖_∞] ≤ ‖g‖_∞ + k E[M_i(b)] for all i. It follows from Assumption 5 that there exists p > 1 such that each E[M_i(pb)] < ∞. Moment generating functions are strictly positive and log-convex. Thus, from 0 ≤ b < pb, it follows that M_i(b) ≤ M_i(pb) for all i. Therefore each E[M_i(b)] < ∞. Therefore E[‖V_N‖_∞] < ∞. For any step i and any x ∈ X, because φ_i is the conditional density of ξ_i given F_{i−1},

    E[U_i(x) | F_{i−1}] = ∫ (G(x, ξ)/φ_i(ξ)) φ_i(ξ) dξ − g(x) = 0.

That is, for any step i, E[U_i | F_{i−1}] = 0.
5.2 Proof of Theorem 2
The proof follows very closely the proof of Theorem 5.3 in Shapiro et al. (2009), but we include it here for completeness.

Lemma 2. Suppose Assumptions 1, 2, 3, 4, and 5 hold. Further assume that S_* is not empty and that, with probability one, Ŝ_N is non-empty for all N sufficiently large. Then lim_{N→∞} ϑ̂_N = ϑ_* with probability one.

Proof. We prove lim_{N→∞} ϑ̂_N = ϑ_* in the event that Ŝ_N is non-empty for all N sufficiently large and that lim_{N→∞} ‖ĝ_N − g‖_∞ = 0. This event has probability one by assumption and by Theorem 1. Let x_* be an optimal solution of (6). Because lim_{N→∞} ‖ĝ_N − g‖_∞ = 0, lim_{N→∞} ĝ_N(x_*) = g(x_*) = ϑ_*. Because ϑ̂_N is the optimal value of (7), ϑ̂_N ≤ ĝ_N(x_*) for all N. Therefore lim sup_{N→∞} ϑ̂_N ≤ ϑ_*. Define ϑ̂_inf = lim inf_{N→∞} ϑ̂_N. There exist a subsequence {N_i}_{i=1}^∞ of the natural numbers and a sequence {x_N}_{N=1}^∞ of points in X such that for every i = 1, ..., ∞, x_{N_i} ∈ Ŝ_{N_i}, and lim_{i→∞} ĝ_{N_i}(x_{N_i}) = ϑ̂_inf. Because lim_{N→∞} ‖ĝ_N − g‖_∞ = 0, we also have lim_{i→∞} g(x_{N_i}) = ϑ̂_inf. Because ϑ_* is the optimal value of (6), ϑ_* ≤ g(x_{N_i}) for all N_i. Therefore ϑ_* ≤ ϑ̂_inf. Overall, we have obtained lim sup_{N→∞} ϑ̂_N ≤ ϑ_* ≤ lim inf_{N→∞} ϑ̂_N.
Lemma 3. Suppose the assumptions of Theorem 2 hold. Then lim_{N→∞} D(Ŝ_N, S_*) = 0 with probability one.

Proof. We prove lim_{N→∞} D(Ŝ_N, S_*) = 0 in the event that lim_{N→∞} ‖ĝ_N − g‖_∞ = 0, lim_{N→∞} ϑ̂_N = ϑ_*, and Ŝ_N is non-empty and contained in C for all N sufficiently large. This event has probability one by Theorem 1, by Lemma 2, and by assumption. Consider any subsequence {N_i}_{i=1}^∞ of the natural numbers and sequence {x_N}_{N=1}^∞ of points in X such that for every i = 1, ..., ∞, x_{N_i} ∈ Ŝ_{N_i}. Because C is compact, the sequence {x_{N_i}}_{i=1}^∞ has a limit point. Consider any such limit point, and denote it as x_*. Consider any subsequence {N′_i}_{i=1}^∞ of {N_i}_{i=1}^∞ such that lim_{i→∞} x_{N′_i} = x_*. For any i,

    ϑ̂_{N′_i} − g(x_*) = ĝ_{N′_i}(x_{N′_i}) − g(x_*) = ( ĝ_{N′_i}(x_{N′_i}) − g(x_{N′_i}) ) + ( g(x_{N′_i}) − g(x_*) ).

By continuity of g, lim_{i→∞} (g(x_{N′_i}) − g(x_*)) = 0. We also have lim_{i→∞} (ĝ_{N′_i}(x_{N′_i}) − g(x_{N′_i})) = 0 because lim_{N→∞} ‖ĝ_N − g‖_∞ = 0. Therefore lim_{i→∞} ϑ̂_{N′_i} = g(x_*). We also have lim_{N→∞} ϑ̂_N = ϑ_*. Thus, g(x_*) = ϑ_*, which implies x_* ∈ S_*. In words: if x_* is a limit point of a sequence {x_{N_i}}_{i=1}^∞ of points that are optimal solutions of a sequence of sample-average approximation problems given by (7), then x_* is in S_*. Therefore lim sup_{N→∞} D(Ŝ_N, S_*) = lim sup_{N→∞} sup_{x∈Ŝ_N} dist(x, S_*) = 0.
5.3 Proof of Theorem 3
We have

    ‖ĝ_N^Y − g^Y‖_∞ ≤ ‖ĝ_N^Y − g((·, Y_N))‖_∞ + ‖g((·, Y_N)) − g((·, Y_*))‖_∞
                    ≤ ‖ĝ_N − g‖_∞ + sup_{x′∈X′} ∫ |G((x′, Y_N), ξ) − G((x′, Y_*), ξ)| dξ.

By Theorem 1, with probability one, lim_{N→∞} ‖ĝ_N − g‖_∞ = 0. By Assumptions 6 and 7, for any x′ ∈ X′,

    ∫ |G((x′, Y_N), ξ) − G((x′, Y_*), ξ)| dξ ≤ g_y(Y_N − Y_*) ∫ G_ξ(ξ) dξ,

lim_{N→∞} g_y(Y_N − Y_*) = 0 by continuity of g_y at 0, and ∫ G_ξ(ξ) dξ < ∞. Therefore

    lim_{N→∞} sup_{x′∈X′} ∫ |G((x′, Y_N), ξ) − G((x′, Y_*), ξ)| dξ = 0

with probability one.
6 Future Research
We provided theorems about uniform convergence of a sample average approximation to a parametric integral, and about convergence of solutions and optimal values, under adaptive importance sampling. We mention three directions for future research: extension to adaptive multiple importance sampling (AMIS), analysis of the rate of convergence, and generalization of the function spaces and norms involved in our setting.

We mentioned in Section 1.2 the value of AMIS in applications. However, our proof of uniform convergence, based on a Banach-space-valued strong law of large numbers (Hoffmann-Jørgensen and Pisier, 1976), appears to be inapplicable to the AMIS estimator in Equation (5). When proceeding from N to N + 1 samples, the denominator in Equation (5) changes for terms 1, ..., N in a way that interferes with finding a martingale. It may be useful to employ concepts such as mixingale triangular arrays (de Jong, 1996; Davidson and de Jong, 1997) and Banach-space-valued mixingales (Gan, 2002; Hu, 2004). Banach-space-valued mixingale triangular arrays might be most helpful, but to the best of our knowledge, the requisite convergence theorems have not yet been proved.

Because we relied on a strong law of large numbers, we proved a result about convergence, but nothing about the rate of convergence. Using large deviations techniques, one can obtain an exponential rate of convergence for the sample average approximation (Dai et al., 2000; Homem-de-Mello, 2008; Xu, 2010; Sun and Xu, 2012). The rate of convergence could be investigated in our setting, where the assumptions differ from the assumptions in the cited articles.

In our setting, the integral g is a function from X to R, and the integrand G is a function from X × Ξ to R, where X ⊆ R^n and Ξ ⊆ R^d. Other cases for the target space besides R are also interesting. One might have a multi-dimensional space for multi-objective optimization or metamodeling of multiple simulation outputs simultaneously. One might have an infinite-dimensional function space when the simulation output is a function of time or space. Similarly, Ξ could be infinite-dimensional if the underlying source of randomness is a stochastic process, such as Brownian motion, rather than a random vector. Also, X could be infinite-dimensional: for example, x could be a function of time representing the arrival rate of customers to a service system, or x could be the distribution function of a random variable. In stochastic control, X would contain stochastic processes representing controls. Moreover, the Banach-space-valued strong law of large numbers (Hoffmann-Jørgensen and Pisier, 1976) can also support an analysis of convergence for other Banach spaces, involving other norms. This could allow us to weaken some of our assumptions, such as boundedness of G(·, ξ_i) in Assumption 3.
Acknowledgments

We thank Tito Homem-de-Mello, Imry Rosenbaum, and Johannes Royset for discussions. The second author was supported in part by the National Science Foundation grant DMS-1216920.
References

Ankenman, B., Nelson, B. L., Staum, J., 2010. Stochastic kriging for simulation metamodeling. Operations Research 58 (2), 371–382.

Cornuet, J.-M., Marin, J.-M., Mira, A., Robert, C. P., 2012. Adaptive multiple importance sampling. Scandinavian Journal of Statistics 39, 798–812.

Dai, L., Chen, C. H., Birge, J. R., 2000. Convergence properties of two-stage stochastic programming. Journal of Optimization Theory and Applications 106 (3), 489–509.

Davidson, J., de Jong, R., 1997. Strong laws of large numbers for dependent heterogeneous processes: a synthesis of recent and new results. Econometric Reviews 16 (3), 251–279.

de Jong, R. M., 1996. A strong law of large numbers for triangular mixingale arrays. Statistics & Probability Letters 27 (1), 1–9.

Feng, M., Staum, J., 2015. Green simulation designs for repeated experiments. In: Yilmaz, L., Chan, W. K. V., Moon, I., Roeder, T. M. K., Macal, C., Rossetti, M. D. (Eds.), Proceedings of the 2015 Winter Simulation Conference. IEEE Press, forthcoming.

Gan, S., 2002. Strong laws of large numbers for B-valued L_q-mixingale sequences and the q-smoothness of Banach space. Theory of Probability & Its Applications 46 (4), 717–721.

Hoffmann-Jørgensen, J., Pisier, G., 1976. The law of large numbers and the central limit theorem in Banach spaces. Annals of Probability 4 (4), 587–599.

Homem-de-Mello, T., 2008. On rates of convergence for stochastic optimization problems under non-independent and identically distributed sampling. SIAM Journal on Optimization 19 (2), 524–551.

Homem-de-Mello, T., Bayraksan, G., 2015. Stochastic constraints and variance reduction techniques. In: Fu, M. C. (Ed.), Handbook of Simulation Optimization. Ch. 9, pp. 245–276.

Hu, Y., 2004. Complete convergence theorems for L_p-mixingales. Journal of Mathematical Analysis and Applications 290 (1), 271–290.

Korf, L., Wets, R. J.-B., 2001. Random LSC functions: An ergodic theorem. Mathematics of Operations Research 26 (2), 421–445.

Maggiar, A., Wächter, A., Dolinskaya, I. S., Staum, J., July 2015. A derivative-free trust-region algorithm for the optimization of functions smoothed via Gaussian convolution using multiple importance sampling. Working paper, Northwestern University.

Marin, J.-M., Pudlo, P., Sedki, M., May 2014. Consistency of the adaptive multiple importance sampling. arXiv:1211.2548v2.

Owen, A., Zhou, Y., 2000. Safe and effective importance sampling. Journal of the American Statistical Association 95 (449), 135–143.

Rosenbaum, I., Staum, J., 2013. Multi-level Monte Carlo metamodeling. In: Pasupathy, R., Kim, S.-H., Tolk, A., Hill, R., Kuhl, M. E. (Eds.), Proceedings of the 2013 Winter Simulation Conference. IEEE Press, pp. 509–520.

Royset, J. O., Polak, E., 2004. Implementable algorithm for stochastic optimization using sample average approximations. Journal of Optimization Theory and Applications 122 (1), 157–184.

Shapiro, A., Dentcheva, D., Ruszczyński, A., 2009. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia.

Sun, H., Xu, H., 2012. A note on uniform exponential convergence of sample average random functions. Journal of Mathematical Analysis and Applications 385, 698–708.

Veach, E., Guibas, L. J., 1995. Optimally combining sampling techniques for Monte Carlo rendering. In: Mair, S. G., Cook, R. (Eds.), SIGGRAPH '95: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques. ACM, pp. 419–428.

Xu, H., 2010. Uniform exponential convergence of sample average random functions under general sampling with applications in stochastic programming. Journal of Mathematical Analysis and Applications 368, 692–710.