Variance Reduction Algorithms for Parallel Replicated Simulation of Uniformized Markov Chains†

Simon Streltsov & Pirooz Vakili*
[email protected] & [email protected]
Manufacturing Engineering Department, Boston University, 15 St. Mary St., Boston, MA 02215

Submitted July 1994; revised April 1995

Abstract

We discuss the simulation of M replications of a uniformizable Markov chain simultaneously and in parallel (the so-called parallel replicated approach). Distributed implementation on a number of processors and parallel SIMD implementation on massively parallel computers are described. We investigate various ways of inducing correlation across replications in order to reduce the variance of estimators obtained from the M replications. In particular, we consider the adaptation of Stratified Sampling, Latin Hypercube Sampling, and Rotation Sampling to our setting. These algorithms can be used in conjunction with the Standard Clock simulation of uniformized chains at distinct parameter values and can potentially sharpen multiple comparisons of systems in that setting. Our investigation is primarily motivated by this consideration.

1 Introduction

Parallel/distributed approaches for discrete event simulation may be categorized as single-run and multiple-run methods. The majority of studies reported in the literature have been devoted to the development of single-run parallel/distributed simulation methods: different strategies have been devised to execute processes of a single simulation run on a number of processors in parallel. Multiple-run methods, on the other hand, correspond either to multiple statistical replications of one system, the so-called parallel replicated approach, or to the simulation of one system at multiple parameter settings.

† Draft version (http://cad.bu.edu/go/simon.html). The final version appeared in: Discrete Event Dynamic Systems: Theory and Applications, 6, 159-180 (1996).
* Research was supported in part by the National Science Foundation under grants EID-9212122 and DDM-9215368.

It is worth emphasizing that these two broad categories are designed to address very different computational issues that arise in discrete event simulation. Most single-run parallel/distributed methods are developed to address the high computational cost of simulating a large discrete event system where the size and complexity of the system renders its simulation on a sequential machine inefficient or infeasible (for an excellent review, see [14]). Single-run methods are also used in cases where a long simulation run of a system is required and the length of the run is the main source of the high computational cost (see, e.g., [18], [19]). By contrast, multiple-run methods address the computational issues that relate, for example, to cases where a large number of statistical replications of a system is required to achieve a desired level of accuracy (see, e.g., [16], [17], [20]). In recent years, another set of multiple-run methods have been developed where variants of a discrete event system (with different system parameter settings or different operational policies) are simulated concurrently on multiple processors (see, e.g., [4], [5], [6], [7], [24], [30], [31]). The objective of such a parametric simulation project may be response surface approximation, factor screening, or performance optimization, where the computational cost is due to the large number of parameter settings or operational policies at which the system needs to be simulated.

The approach described in this paper belongs to the category of multiple-run parallel/distributed simulation methods. Specifically, we describe a way of simulating M replications of a system in parallel. However, it is important to note that a primary motivation for our investigation is to explore effective ways of using the proposed parallel replicated approach in conjunction with a parametric parallel simulation approach called the Standard Clock (SC) method (see, e.g., [31]). In SC simulation, variants of a system at N parameter settings are simulated in parallel. This simulation environment allows for the dynamic comparison of system performance at multiple parameter settings, which is highly desirable for many performance analysis studies. Simulating M replications of the system at each of the N parameter settings simultaneously makes the comparisons significantly more effective.

Distributed implementation on a number of processors and parallel SIMD implementation on massively parallel computers are described. We investigate various ways of inducing correlation across replications in order to reduce the variance of estimators obtained from the M replications. In particular, we consider the adaptation of Stratified Sampling ([1], [8]), Latin Hypercube Sampling ([26], [28]) and Rotation Sampling ([11]) to our setting. We assume that the objective of the simulation project is to estimate the expected value of some cumulative cost/reward over a finite deterministic horizon. In our approach all trajectories (all replications and/or all variants) are time-synchronized, in the sense

that all trajectories proceed along the same simulated time (for other possible strategies, see, e.g., [24], [30]). This time-synchronization is well suited for the above setting, is a device we use to synchronize the computation across multiple processors, and is one of the ways in which we attempt to correlate sample performances: negatively for replications of a single system, and positively for samples of different systems. This time-synchronization, however, introduces certain inefficiencies in the computation (see [31] for a discussion of computational tradeoffs in the context of parallel implementation, and see [2] and [32] for a similar discussion in the context of serial implementation). One expects, or hopes, that in most cases the cost of introducing local computational inefficiencies would be offset by the benefits of, say, some reduction in the variance of the estimators, the possibility of parallel implementation, and an improved capability for comparing variants (as in [23]). In this paper, we focus on ways to correlate sample performances across parallel replications. The schemes we introduce occasionally increase the cost of computation when compared to independent replications; we do not evaluate the tradeoffs between increased computational cost and the potential benefits of variance reduction; this task requires a separate study.

The paper is organized as follows: preliminaries are given in Section 2. Section 3 describes parallel independent and parallel correlated simulation algorithms. We discuss conditions that guarantee variance reduction for some systems in Section 4 and introduce some heuristic dynamic correlated samplings in Section 5. Experimental results are presented in Section 6. We conclude in Section 7.

2 Preliminaries

In this section we present the basic setting of the paper and establish some notation. Let $X = \{X_t; t \geq 0\}$ be a continuous-time Markov chain (CTMC) on a (finite or) countable set $S$. Let $Q_{ij}$ be the rate of transitions from state $s_i$ to state $s_j$, and let $q_i = -Q_{ii}$ be the total rate of transitions out of $s_i$. All through this paper we assume that the chains simulated are uniformizable, i.e., $Q$ is bounded ($\sup_i q_i < \infty$). For uniformizable chains, the sojourn times of the chain in states can be uniformized by appropriately introducing extra fictitious transitions from states to themselves. The inter-event times can thus be made an independent and identically distributed (i.i.d.) sequence of exponential random variables, independent of the state process of the chain. This construction often leads to simplifications that can be exploited to design alternative simulation strategies which, in one way or another, improve upon straightforward event-scheduling simulation (see, e.g.,

[22], [27], [31]). We limit ourselves to those Markov chains that are "well-structured": we assume that all state transitions are caused by a finite number of events and that an event causes "similar" transitions in all states. More specifically, we consider the following model. Let $S$ denote, as above, the state space, and let $E = \{e_1, \ldots, e_K\}$ be the set of events, $K$ finite. To each event $e$ there corresponds a state transition rule
$$f_e : S \to S.$$
The triple $(S, E, \{f_e; e \in E\})$ defines the structure of the model. To introduce the element of time, we proceed as follows. Let $(\tau, \varepsilon) = \{(\tau_n, \varepsilon_n); n \geq 0\}$ be a marked Poisson process, where $\{\tau_n; n \geq 0\}$ is the sequence of "arrival" instances of a Poisson process $N^\lambda$ with rate $\lambda$, and $\{\varepsilon_n; n \geq 0\}$ is an i.i.d. sequence of discrete random variables, independent of the Poisson process $N^\lambda$, such that $\varepsilon_n \in E$ and $P(\varepsilon_n = e_i) = p_i$. Here $\lambda$ is the rate at which the clock ticks, $\tau_n$ is the $n$-th tick of the clock, and $\varepsilon_n$ is the type of event that occurs at the $n$-th tick of the clock. We denote the sequence of the states at the discrete epochs $\{\tau_n; n \geq 0\}$ by $\{Y_n; n \geq 0\}$, i.e., $Y_n = X_{\tau_n}$; this sequence, called a uniformized discrete-time Markov chain (DTMC) associated with $X$, is defined by $Y_0 = X_0$ and
$$Y_n = f_{\varepsilon_n} \circ f_{\varepsilon_{n-1}} \circ \cdots \circ f_{\varepsilon_1}(X_0);$$
the state process in continuous time is defined by
$$X_t = Y_{N^\lambda(t)}, \quad \text{for } t \geq 0.$$
It can easily be verified that the process $\{X_t; t \geq 0\}$ above is a uniformizable CTMC, where $Q_{ij} = \lambda \sum_{l=1}^{K} p_l \, I\{f_{e_l}(s_i) = s_j\}$ for $i \neq j$, and $\sup_i q_i \leq \lambda$ ($I\{\cdot\}$ is the indicator function). As an example of the above model, consider the following special case, which can be used as a model for a variety of systems of practical interest.

Example 1 Let $S \subseteq \mathbb{Z}_+^d$, where $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$ and $d \geq 1$ is an integer. Also let $E = \{e_1, \ldots, e_K\} \subset \mathbb{Z}^d$ and define $f_{e_i}$ by
$$f_{e_i}(x) = \begin{cases} x + e_i & \text{if } x + e_i \in S, \\ x & \text{otherwise.} \end{cases}$$
Thus, each event $e_i$ is identified with a vector in $\mathbb{Z}^d$ and a transition due to event $e_i$ corresponds to a vector addition with $e_i$ (informally, a move in the direction and size of $e_i$) if the result is within the state space; otherwise no transition occurs. In other words, this model can be viewed as a generalized random walk in $\mathbb{Z}^d$ (or a generalized birth-death process), where there are $K$ directions of movement; the movements that take the particle outside the allowed region are censored. Consider, for instance, an $M/M/1$ queueing system and let $e_1$ and $e_2$ represent, respectively, an arrival and a departure. To put things in the framework of the present model, set $S = \{0, 1, 2, \ldots\}$, $e_1 = 1$ and $e_2 = -1$. Almost all Markovian queueing networks (and a host of other systems of practical interest) can be similarly cast in the framework of the model described in this example.
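To make the construction of Example 1 concrete, the following Python sketch (the M/M/1 rates and all identifiers are illustrative assumptions, not taken from the paper) implements the censored transition rule $f_{e_i}$ and steps a single uniformized chain through a fixed number of clock ticks.

```python
import random

def make_transition_rule(in_state_space):
    """Return f_e with f_e(x) = x + e if x + e stays in S, and f_e(x) = x otherwise."""
    def f(x, e):
        y = tuple(xi + ei for xi, ei in zip(x, e))
        return y if in_state_space(y) else x
    return f

# Illustrative M/M/1 instance: S = {0, 1, 2, ...}, e1 = arrival (+1), e2 = departure (-1).
events = [(+1,), (-1,)]
arrival_rate, service_rate = 0.4, 0.6        # assumed rates, for illustration only
lam = arrival_rate + service_rate            # uniformization (clock) rate
probs = [arrival_rate / lam, service_rate / lam]
f = make_transition_rule(lambda y: y[0] >= 0)

def simulate_uniformized_chain(n_transitions, x0=(0,), rng=random):
    """Simulate the uniformized DTMC Y_0, Y_1, ..., Y_N driven by i.i.d. event marks."""
    x, path = x0, [x0]
    for _ in range(n_transitions):
        e = rng.choices(events, weights=probs)[0]   # event type at this clock tick
        x = f(x, e)
        path.append(x)
    return path

print(simulate_uniformized_chain(10))
```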

3 Parallel replicated simulation

In this section we describe our parallel replicated algorithms. Let $X^j = \{X_t^j; t \geq 0\}$ represent the $j$-th replication of the process defined in the previous section. Let $\varepsilon^j = \{\varepsilon_n^j; n \geq 1\}$ be the i.i.d. sequence of events that drive the $j$-th replication and let $Y^j = \{Y_n^j; n \geq 0\}$ be the sequence of states visited by the uniformized discrete-time chain associated with the $j$-th replication ($j = 1, \ldots, M$). We assume that all replications share the same sequence of event epochs $\{\tau_0, \tau_1, \tau_2, \ldots\}$ of the Poisson process $N^\lambda$. Given this choice, we only need to focus on the construction of the discrete-time processes $Y^j = \{Y_n^j; n \geq 0\}$ ($j = 1, \ldots, M$); the continuous-time processes $X^j$ are simply defined by $X_t^j = Y^j_{N^\lambda(t)}$.

More specifically, we consider the following model for the parallel simulation of multiple replications. Consider an $M$-dimensional vector of events $e \in \prod_{1}^{M} E$. The transition rule for the $M$ replications associated with this vector, i.e., $F_e : \prod_{1}^{M} S \to \prod_{1}^{M} S$, is defined componentwise:
$$F_e(y^1, \ldots, y^M) = (f_{e^1}(y^1), \ldots, f_{e^M}(y^M)).$$
The discrete-time process $\mathbf{Y} = \{\mathbf{Y}_n; n \geq 0\}$, defined on $\prod_{1}^{M} S$, is then given by
$$\mathbf{Y}_n = F_{\varepsilon_n} \circ F_{\varepsilon_{n-1}} \circ \cdots \circ F_{\varepsilon_1}(\mathbf{Y}_0),$$
where $\varepsilon_n = (\varepsilon_n^1, \ldots, \varepsilon_n^M)$.

Thus, at each transition epoch, our main task is to generate the components of the $M$-dimensional vector of events $\varepsilon_n$, i.e., to generate $M$ identically distributed discrete random variables $\varepsilon_n^1, \ldots, \varepsilon_n^M$. Given these random variables and the current states of the replications, the new states are then determined using the state transition rules at each replication. We treat the random vectors $\varepsilon_1, \varepsilon_2, \ldots$ as the primitive inputs to the simulation and use the degrees of freedom that exist in generating them to design different parallel replicated algorithms: the marginal distributions of the random variables $\varepsilon_n^j$ ($j = 1, \ldots, M$) are given; however, we are free to choose any joint distribution of $(\varepsilon_n^1, \ldots, \varepsilon_n^M)$ that produces the given marginals. In all cases, independent and correlated, we assume that the vectors $\varepsilon_1, \varepsilon_2, \ldots$ are sampled independently.
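Under this model, one synchronous transition of the M replications is simply a componentwise application of the individual transition rules to the sampled event vector. A minimal sketch follows (helper names are ours; the transition rule f is passed in, e.g., the M/M/1 rule from the sketch in Section 2, and the independent sampler corresponds to the IS scheme of the next subsection).

```python
import random

def parallel_step(states, event_vector, f):
    """Apply F_e componentwise: replication j moves according to its own event e^j."""
    return [f(x, e) for x, e in zip(states, event_vector)]

def sample_event_vector_independent(M, events, probs, rng=random):
    """Independent Sampling: the M components of the event vector are drawn independently."""
    return [rng.choices(events, weights=probs)[0] for _ in range(M)]
```

The correlated schemes below differ only in how the event vector is sampled; the state-update step itself is unchanged.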

3.1 Independent and correlated sampling

Generating $\varepsilon_n^1, \ldots, \varepsilon_n^M$ independently is one sampling scheme, which will be referred to as Independent (parallel replicated) Sampling (IS). Alternatively, we can correlate replications by introducing correlations across the components of $\varepsilon_n$, i.e., across $\varepsilon_n^1, \ldots, \varepsilon_n^M$. To do so we proceed as follows. Let
$$I_n^i = \sum_{j=1}^{M} I\{\varepsilon_n^j = e_i\} \qquad \text{for } i = 1, \ldots, K.$$
In words, $I_n^i$ is the number of events of type $e_i$ that are generated at the $n$-th transition across replications. The $I\{\varepsilon_n^j = e_i\}$'s ($j = 1, \ldots, M$) are $M$ i.i.d. Bernoulli random variables with parameter $p_i$. (In Independent Sampling, $(I_n^1, \ldots, I_n^K)$ has a multinomial distribution with parameters $M$ and $p_1, \ldots, p_K$.)

The basis of our approach to correlated sampling is to sample in such a way that the number of events of type $e_i$ generated at each transition, namely $I_n^i$, is "close" to the expected value of this random variable. In other words, we hope to reduce the variance of estimators of interest (some cumulative cost/reward evaluated from each trajectory and then averaged across replications) by reducing the variability of the number of events of each type generated at each transition. More specifically, let $E[I_n^i] = p_i M = \lfloor p_i M \rfloor + r_i$ ($\lfloor a \rfloor$ = integer part of $a$). (Note that $0 \leq r_i < 1$.) In all correlated sampling schemes to follow, $I_n^i \in \{\lfloor p_i M \rfloor - 1, \lfloor p_i M \rfloor, \lfloor p_i M \rfloor + 1\}$, i.e., the number of events of each type generated at each transition is equal either to the expected number of events of that type, or one more or one less than that. To make things more explicit, we introduce an example and use it throughout this section to illustrate the algorithms.

Example 2 Let $E = \{e_1, e_2, e_3\}$, $p_1 = 0.24$, $p_2 = 0.3$, $p_3 = 0.46$, and $M = 10$; in other words, the system is driven by the occurrences of three events, where the probability of observing event $e_i$ at a transition is $p_i$ ($i = 1, 2, 3$), and we plan to generate 10 replications in parallel. Note that $\lfloor p_1 M \rfloor = 2$, $\lfloor p_2 M \rfloor = 3$, $\lfloor p_3 M \rfloor = 4$, and $r_1 = 0.4$, $r_2 = 0$, $r_3 = 0.6$. As will be seen below, in all correlated samplings to follow, $I_n^1 \in \{2, 3\}$, $I_n^2 \in \{2, 3, 4\}$, and $I_n^3 \in \{4, 5\}$.

We are now prepared to introduce the correlated sampling schemes. The three schemes we describe below can be viewed as applications of Stratified Sampling, Latin Hypercube Sampling, and Rotation Sampling in our setting. We give brief descriptions of these sampling strategies before discussing their adaptations to our setting. All three algorithms have the following two steps:

1. Generate a vector $\beta = (\beta_1, \ldots, \beta_M)$ as specified by the algorithm.
2. Use a random permutation of the $\beta$'s to distribute these events randomly across replications, i.e., generate a random permutation $\pi_n$ of $\{1, \ldots, M\}$ and set $\varepsilon_n^j = \beta_{\pi_n(j)}$.

Since step two is shared between the three schemes, in what follows we only specify step one for each scheme.

Stratified Sampling and Correlated Sampling 1 (CS1)

Assume that in order to estimate the mean of a certain characteristic of a population, say, height, a sampling scheme is to be designed. Furthermore, assume that the population can be partitioned into $K$ strata where the variation within a stratum is smaller than the overall variation in the characteristic of the members. Let $n_i$ be the number of members of the $i$-th stratum. The underlying idea of Stratified Sampling is to design the sample in such a way that the number of sample members from each stratum is proportional to the size of the stratum, i.e., $n_i/(n_1 + \cdots + n_K)$ of the sample members, modulo rounding, are from stratum $i$. It can be shown that this sampling approach yields a lower variance estimator than when the whole population is sampled randomly. For a general discussion on Stratified Sampling see [3], Section 2.4 (in particular, see Example 2.4.2), and [8]. The strata in our context are the singletons $\{e_i\}$ ($i = 1, \ldots, K$). CS1, described below, is one method of Stratified Sampling to generate $M$ samples of $\varepsilon_n$, i.e., $\varepsilon_n^1, \ldots, \varepsilon_n^M$.

- Set the first $\lfloor p_1 M \rfloor$ components of $\beta$ equal to $e_1$, the next $\lfloor p_2 M \rfloor$ components equal to $e_2$, and so on. At this stage there are $r = r_1 + \cdots + r_K$ components of $\beta$ that are unspecified. To specify them, proceed as follows:
- Sample $r$ random variables $W_n^l$ ($l = 1, \ldots, r$) independently such that $P(W_n^l = e_i) = r_i / r$.
- Set the last $r$ components of $\beta$ equal to $W_n^1, \ldots, W_n^r$.

In the application of CS1 to Example 2, we note that $\lfloor p_1 M \rfloor + \lfloor p_2 M \rfloor + \lfloor p_3 M \rfloor = 9$ elements of $\beta$ are already specified; the last element, $W_n^1$, needs to be sampled, and it is $e_1$ with probability 0.4 and $e_3$ with probability 0.6.
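A sketch of CS1 in Python follows (one possible reading of the algorithm above; the residual draws $W_n^l$ are taken i.i.d. with probabilities $r_i/r$, and the final shuffle plays the role of the random permutation $\pi_n$ of step two).

```python
import math
import random

def sample_event_vector_cs1(M, events, probs, rng=random):
    """CS1: floor(p_i*M) guaranteed occurrences of each event type, r residual draws,
    then a random permutation to distribute the events across replications."""
    beta, residuals = [], []
    for e, p in zip(events, probs):
        base = math.floor(p * M)
        beta.extend([e] * base)              # the first floor(p_i*M) components set to e_i
        residuals.append(p * M - base)       # fractional part r_i
    r = round(sum(residuals))                # number of unspecified components (an integer)
    for _ in range(r):
        beta.append(rng.choices(events, weights=residuals)[0])   # P(W = e_i) = r_i / r
    rng.shuffle(beta)                        # step two: random assignment to replications
    return beta                              # beta[j] is the event driving replication j

# Example 2 from the text: p = (0.24, 0.30, 0.46), M = 10
print(sample_event_vector_cs1(10, ["e1", "e2", "e3"], [0.24, 0.30, 0.46]))
```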

Latin Hypercube Sampling and Correlated Sampling 2 (CS2)

Latin Hypercube Sampling was proposed in [26] as a variance reduction method for multi-dimensional Monte Carlo integration. To illustrate the basic idea, we consider the following simpler one-dimensional case. Let
$$\theta = \int_0^1 f(u)\,du = E[f(U)],$$
where $U$ is a uniform random number over $(0,1)$. A straightforward Monte Carlo approach to estimate $\theta$ is the following: let $U_1, \ldots, U_M$ be $M$ i.i.d. copies of $U$; the sample mean of $\{f(U_1), \ldots, f(U_M)\}$ is used to estimate $\theta$. In the Latin Hypercube Sampling approach, on the other hand, the interval $(0,1)$ is subdivided into $M$ equal subintervals $(\frac{j-1}{M}, \frac{j}{M})$ ($j = 1, \ldots, M$), and it is ensured that one sample is selected from each subinterval. More specifically, $V_j$ is sampled uniformly on $(\frac{j-1}{M}, \frac{j}{M})$ ($V_1, \ldots, V_M$ are sampled independently). Then $U_i$ is set equal to $V_{\pi(i)}$, where $\pi$ is a random permutation of $\{1, \ldots, M\}$; the same estimator as above is used to estimate $\theta$. The motivation is to spread the $M$ uniformly sampled points more evenly across the interval $(0,1)$ to, it is hoped, get a better estimate of $\theta$. In the multi-dimensional case, the above sampling scheme is used for each coordinate. (The reader can easily verify that in the one-dimensional case a random permutation is not necessary, whereas it is an essential ingredient of the multi-dimensional case for which LHS is intended.) See [26] and [28] for more detailed discussions.

The adaptation of the approach in our setting is as follows. Subdivide the unit interval $(0,1)$ into $M$ equal subintervals $(\frac{j-1}{M}, \frac{j}{M})$, $j = 1, \ldots, M$. Generate the $\beta_j$'s as follows:

- Generate $U_j$ uniformly on $(\frac{j-1}{M}, \frac{j}{M})$; $U_1, \ldots, U_M$ are sampled independently.
- Set $\beta_j = e_i$ if $p_1 + \cdots + p_{i-1} < U_j \leq p_1 + \cdots + p_i$, $i = 1, \ldots, K$.

Note that if, for all $U \in (\frac{j-1}{M}, \frac{j}{M})$, $p_1 + \cdots + p_{i-1} < U \leq p_1 + \cdots + p_i$ for some $i$, then sampling from such an interval is unnecessary and the result will always be $\beta_j = e_i$. It is easy to verify that for each event $e_i$, at least $\lfloor p_i M \rfloor - 1$ of the subintervals have such a property, and sampling from them is unnecessary. Therefore, one needs to sample only from those intervals that contain the points $p_1 + \cdots + p_i$, $i = 1, \ldots, K-1$. The number of such subintervals is less than $K$; therefore at most $K$ components of $\beta$ need to be generated by sampling. When applying CS2 in Example 2, without any sampling we are certain that there are two events of type $e_1$, two events of type $e_2$, and four events of type $e_3$; to determine the types of the remaining two components of $\beta$ we only need to sample from the subintervals $(0.2, 0.3)$ and $(0.5, 0.6)$.
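A sketch of CS2 follows (names are illustrative; for simplicity the sketch samples every subinterval, including those that the argument above shows need no actual sampling).

```python
import random

def sample_event_vector_cs2(M, probs, rng=random):
    """CS2: one uniform draw per subinterval ((j-1)/M, j/M), mapped through the
    event-probability CDF, then randomly permuted across replications."""
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    def event_index(u):
        # beta_j = e_i  if  p_1 + ... + p_{i-1} < u <= p_1 + ... + p_i
        return next((i for i, c in enumerate(cdf) if u <= c), len(cdf) - 1)
    beta = [event_index(rng.uniform((j - 1) / M, j / M)) for j in range(1, M + 1)]
    rng.shuffle(beta)            # distribute the event types randomly across replications
    return beta                  # beta[j] is the index of the event for replication j
```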

Rotation Sampling and Correlated Sampling 3 (CS3)

Consider the one-dimensional integral considered above, i.e., let $\theta = \int_0^1 f(u)\,du$. The same consideration which underlies the Latin Hypercube Sampling scheme, namely spreading the $M$ uniform random numbers $U_1, \ldots, U_M$ more evenly on $(0,1)$, motivates the Rotation Sampling strategy. In Rotation Sampling, however, the following approach is used. Assume the interval $(0,1)$ is wrapped around a circle with circumference 1. Select a random point on the circle; rotate this point by increments of $2\pi/M$ to determine $M - 1$ other points on the circle (hence the term Rotation Sampling). Map these points back to the interval $(0,1)$; the resulting points on the interval $(0,1)$ are $U_1, \ldots, U_M$; as before, use the sample mean of $\{f(U_1), \ldots, f(U_M)\}$ as an estimator of $\theta$. Rotation Sampling was introduced by Fishman and Huang (see [11]). For applications of RS to parallel replication of Markov chains see [9], [10].

In our setting we proceed as follows. Subdivide the unit interval $(0,1)$ into $M$ equal subintervals $(\frac{j-1}{M}, \frac{j}{M})$, $j = 1, \ldots, M$. Generate the $\beta_j$'s as follows:

- Generate $U$ uniformly on $(0,1)$. Set $U_j = \frac{U}{M} + \frac{j-1}{M}$, $j = 1, \ldots, M$.
- Set $\beta_j = e_i$ if $p_1 + \cdots + p_{i-1} < U_j \leq p_1 + \cdots + p_i$, $i = 1, \ldots, K$.

It is worth noting the similarities and differences between CS2 and CS3. In CS2, $U_j$ is uniformly distributed over $(\frac{j-1}{M}, \frac{j}{M})$ and the $U_j$'s are independent, while in CS3, $U_j$ is uniformly distributed over $(\frac{j-1}{M}, \frac{j}{M})$ but the $U_j$'s are highly dependent; in fact, $U_j = U_1 + \frac{j-1}{M}$ for $j = 1, \ldots, M$. A moment of reflection will reveal that the number of events that do not require actual sampling in CS3 is the same as that of CS2. A scheme can be developed to sample the remainder of the event types in a way that is equivalent to CS3, in the sense that they both produce the same joint distribution on $(I_n^1, \ldots, I_n^K)$. Just as in the CS2 case, when applying CS3 in Example 2, it can be easily verified that without any sampling we are certain that there are two events of type $e_1$, two events of type $e_2$, and four events of type $e_3$; to determine the types of the remaining two components of $\beta$ we only need to sample from the subintervals $(0.2, 0.3)$ and $(0.5, 0.6)$.
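The corresponding sketch of CS3 differs from the CS2 sketch only in that all M points are derived from a single uniform draw.

```python
import random

def sample_event_vector_cs3(M, probs, rng=random):
    """CS3: the shifted points U/M + j/M (one uniform U), mapped through the
    event-probability CDF, then randomly permuted across replications."""
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    u0 = rng.random()
    beta = [next((i for i, c in enumerate(cdf) if (u0 + j) / M <= c), len(cdf) - 1)
            for j in range(M)]   # U_j = u0/M + j/M, j = 0, ..., M-1
    rng.shuffle(beta)
    return beta
```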

Remark 1 It is important to note that in all three sampling schemes CS1, CS2, and CS3, the difference between the components of the vector of numbers of events of each type generated, namely $(I_n^1, \ldots, I_n^K)$, and the vector of expected values of these numbers, namely $(E[I_n^1], \ldots, E[I_n^K])$, is at most 1. Therefore, these schemes are highly similar; in particular, if $r_1 = \cdots = r_K = 0$ they are identical, and if $M \gg K$ their difference becomes insignificant. It is therefore reasonable to assume that the three schemes behave similarly in terms of the correlation they induce across replications. Hence, in what follows, we treat the three methods as almost equivalent in our setting.

3.2 Parallel implementation

As we mentioned before, our main motivation in introducing the above parallel replicated schemes is to use them in conjunction with the Standard Clock (SC) simulation of a number of systems at different parameter settings. The model used for SC simulation is the same model as the one introduced in Section 2 (see, e.g., [15], [31]). In SC simulation a single marked Poisson process, $(\tau, \varepsilon)$, is used to drive multiple systems; as a result, all systems encounter the same event at the same epoch (they respond to this event differently). In the parallel replicated approach, the Poisson epochs are the same for all replications; however, the markings across two arbitrary replications are in general different. To use these approaches in conjunction with each other, we proceed as follows: let $\varepsilon_1, \varepsilon_2, \ldots$ be the sequence of vectors of events generated by one of the schemes discussed above ($\varepsilon_n = (\varepsilon_n^1, \ldots, \varepsilon_n^M)$). Assume $N$ variations of a uniformized chain are to be simulated via SC. We can simulate $M$ replications for each variation: the sequence $\{\varepsilon_1^j, \varepsilon_2^j, \ldots\}$ drives the simulation of the $j$-th replication of all $N$ systems.

Our approach (our way of combining SC and the correlated sampling schemes, and in particular the structure of the correlated schemes introduced in the previous section) requires a centralized clock mechanism; specifically, a central mechanism is needed to generate the types of events (the vector $\beta$) and to distribute them randomly across replications. As a result, computations at all replications need to be synchronized. This structure of computation can be easily mapped to a SIMD implementation (in fact, it is particularly suited for such an implementation); however, it imposes certain restrictions when an implementation on a network of workstations is intended. In such a case, the implementation on the network is required to mimic a SIMD implementation.

More specifically, in a SIMD implementation on massively parallel computers, at each transition the vector $\varepsilon_n$ is generated at the front end. Each processing element (PE) of the computer simulates one replication of one of the $N$ variations of the system. Assume $PE_{ij}$ is the one that simulates the $j$-th replication of the $i$-th system. Then, at each transition, $\varepsilon_n^j$ is sent to $PE_{1j}, \ldots, PE_{Nj}$. All processors execute their state transitions in lock-step once they have received the events. In a distributed simulation on a network of workstations the same scheme as in the SIMD implementation is used: one workstation acts as the front end and the others as the PEs of a parallel computer.
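The lock-step structure can be emulated sequentially; the sketch below (all names are hypothetical, and no SIMD or message-passing library is involved) mimics the front end generating $\varepsilon_n$ once per transition and "broadcasting" its $j$-th component to the $j$-th replication of every variant.

```python
def run_lockstep(n_transitions, M, n_systems, initial_states, transition_rules,
                 sample_event_vector):
    """Emulate the front-end/PE scheme: one event vector per transition, with
    component j shared by the j-th replication of every system variant."""
    # states[i][j] is the current state of the j-th replication of variant i
    states = [[initial_states[i]] * M for i in range(n_systems)]
    for _ in range(n_transitions):
        eps = sample_event_vector(M)              # generated once at the "front end"
        for i in range(n_systems):                # each variant applies its own rule
            f = transition_rules[i]
            states[i] = [f(x, e) for x, e in zip(states[i], eps)]
    return states
```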

4 Variance reduction

In this section we consider the question of potential variance reduction due to the correlated sampling schemes introduced in the previous section; for some special cases we identify conditions that guarantee variance reduction for a class of performance measures. Similar to most other studies on variance reduction in simulation, a certain notion of monotonicity plays a crucial role in the discussion to follow. We begin by describing the basic setting.

Recall that our main strategy for introducing correlation across replications is to correlate the discrete-time processes $\{Y_0^j, Y_1^j, \ldots\}$, $j = 1, \ldots, M$; also recall that the latter is accomplished by correlating the events that drive the replications at each transition, i.e., by correlating $\varepsilon_n^1, \ldots, \varepsilon_n^M$. Therefore, to study the question of variance reduction we focus on the discrete-time processes $\{Y_0^j, Y_1^j, \ldots\}$, $j = 1, \ldots, M$.

Specifically, consider the following setting. Consider a deterministic horizon of $N$ transitions ($N$ is a deterministic integer); in other words, the parallel simulation is stopped after all replications encounter $N$ transitions. (The case of a random horizon specified by a random variable $N$ can be analyzed in a similar fashion if $N$ is independent of the state of the replications. This, for example, would be the case if we consider simulating the replications over the interval $[0, T]$ where $T$ is a deterministic value. In this case, $N$ would be a Poisson random variable with mean $\lambda T$.) Let $g : S \to \mathbb{R}$ be a real-valued function. ($g(x)$ represents the instantaneous cost/reward of being in state $x$.) Fix $Y_0^1 = Y_0^j \in S$, $j = 2, \ldots, M$, and let a cumulative cost/reward, denoted by $J$, be defined by
$$J = h(\varepsilon_1, \ldots, \varepsilon_N) = \sum_{n=0}^{N} g(Y_n).$$
Our objective is to estimate $\theta = E[J]$. Consider $M$ replications of the discrete-time chain, $\{Y_0^j, Y_1^j, \ldots, Y_N^j\}$, $j = 1, \ldots, M$, and let
$$\hat{\theta} = \frac{1}{M} \sum_{j=1}^{M} J^j$$
be the estimator of $\theta$ (here $J^j = \sum_{n=0}^{N} g(Y_n^j)$). Assume that a correlated sampling scheme (CS) is used to sample the vectors of event types $\varepsilon_1, \ldots, \varepsilon_N$.

Note that
$$\mathrm{Var}_{CS}(\hat{\theta}) = \frac{1}{M^2} \Big[ \sum_{j=1}^{M} \mathrm{Var}(J^j) + \sum_{i \neq j} \mathrm{Cov}(J^i, J^j) \Big].$$
Since the $J^j$'s are identically distributed and due to the symmetry of the sampling schemes, we have $\mathrm{Var}(J^i) = \mathrm{Var}(J^j)$ and $\mathrm{Cov}_{CS}(J^i, J^j) = \mathrm{Cov}_{CS}(J^1, J^2)$ for all $i \neq j$. Therefore,
$$\mathrm{Var}_{CS}(\hat{\theta}) = \mathrm{Var}_{Ind}(\hat{\theta}) + \frac{M-1}{M} \, \mathrm{Cov}_{CS}(J^1, J^2). \qquad (1)$$
Hence, the above identity implies that if $\mathrm{Cov}_{CS}(J^1, J^2) \leq 0$ then the variance of the CS estimator is no larger than that of independent sampling. We now identify conditions that ensure $\mathrm{Cov}_{CS}(J^1, J^2) \leq 0$ for two special cases.

As already mentioned, in almost all methods for establishing variance reduction, a certain notion of monotonicity plays a crucial role. In our setting we assume that there exists a certain natural order on the set of events $E$ under which this set becomes a totally ordered set. To clarify, we give an example. Consider the $M/M/1$ queueing system discussed in Section 2: $S = \{0, 1, 2, \ldots\}$, $E = \{e_1, e_2\}$, where $e_1$ and $e_2$ represent, respectively, the arrival event and the departure event, and these events are identified with $+1$ and $-1$ respectively, i.e., $e_1 = 1$ and $e_2 = -1$. In this case we say that $e_2$ is less than $e_1$ ($e_2 < e_1$); this order has a straightforward interpretation that is intimately connected to the natural order of the state space $S$: an arrival event increases the state of the system by 1, and a departure event decreases the state of the system by 1 or, if the state is equal to zero, leaves the state unchanged. Another way of saying the same thing is: for any state $x$, $f_{e_1}(x) > f_{e_2}(x)$; this is the sense in which we say $e_2 < e_1$; under this order the set of events is, of course, a totally ordered set.

As a slightly more general case, consider the random walk defined in Example 1 in Section 2. Assume that the random walk is on a set $S \subseteq \mathbb{Z}_+$ and, as we did in that example, that $E = \{e_1, \ldots, e_K\} \subset \mathbb{Z}$. In this case both sets $S$ and $E$ are totally ordered with the order inherited from the set of integers $\mathbb{Z}$.

We now return to the general setting. Assume that $S$ is a totally ordered set. To simplify notation, we use $\leq$ to denote the order on both sets. We say that the transition function $f : S \times E \to S$ is increasing in each coordinate if
$$x \leq y \;\Rightarrow\; f_{e_i}(x) \leq f_{e_i}(y), \qquad e_i \leq e_j \;\Rightarrow\; f_{e_i}(x) \leq f_{e_j}(x).$$
It is simple to verify that for the above random walk example (and also the $M/M/1$ queue), where $S, E \subset \mathbb{Z}$, $f$ is increasing in each coordinate.
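For the M/M/1 instance both implications of this definition can be checked by brute force over a truncated state range; the short sketch below is such a numerical sanity check (not part of the paper's argument).

```python
# M/M/1 transition rule: e1 = +1 (arrival), e2 = -1 (departure, censored at 0).
def f(x, e):
    y = x + e
    return y if y >= 0 else x

events_sorted = [-1, +1]     # e2 < e1 under the order described in the text
states = range(0, 50)        # truncated portion of the countable state space

# x <= y  implies  f_e(x) <= f_e(y), for each event e
assert all(f(x, e) <= f(y, e)
           for e in events_sorted for x in states for y in states if x <= y)
# e <= e'  implies  f_e(x) <= f_{e'}(x), for each state x
assert all(f(x, events_sorted[0]) <= f(x, events_sorted[1]) for x in states)
```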

Lemma 1 If $S$ and $E$ are totally ordered sets, $f : S \times E \to S$ is increasing in each coordinate, and $g : S \to \mathbb{R}$ is an increasing function, then $h : E^N \to \mathbb{R}$ defined above is increasing in each coordinate.

Proof. Let $\alpha = (\alpha_1, \ldots, \alpha_N)$ and $\alpha' = (\alpha'_1, \ldots, \alpha'_N) \in E^N$ be such that $\alpha_n \leq \alpha'_n$ for some $n$ ($1 \leq n \leq N$) and all other coordinates are equal. Clearly, $Y_i(\alpha) = Y_i(\alpha')$ for $i = 1, \ldots, n-1$. Since $Y_{n-1}(\alpha) = Y_{n-1}(\alpha')$ and $\alpha_n \leq \alpha'_n$, from the monotonicity of $f$ in the event coordinate it follows that
$$Y_n(\alpha) = f_{\alpha_n}(Y_{n-1}(\alpha)) \leq f_{\alpha'_n}(Y_{n-1}(\alpha')) = Y_n(\alpha').$$
Since $\alpha_{n+1} = \alpha'_{n+1}$, the monotonicity of $f$ in the state coordinate yields
$$Y_{n+1}(\alpha) = f_{\alpha_{n+1}}(Y_n(\alpha)) \leq f_{\alpha'_{n+1}}(Y_n(\alpha')) = Y_{n+1}(\alpha').$$
The same argument can be used to show that $Y_i(\alpha) \leq Y_i(\alpha')$ for $i = n+2, \ldots, N$. The statement of the lemma follows from the fact that $g$ is increasing in $Y$. $\Box$

A second component of the methods used for establishing variance reduction is to appeal to some appropriate property of the probability measure induced by the sampling scheme. For our purpose, we need to use a notion of negative dependence due to Lehmann [25].

Definition 1 A pair of random variables $(U, V)$ is negatively quadrant dependent (NQD) if
$$P(U \leq u, V \leq v) \leq P(U \leq u)\,P(V \leq v) \quad \text{for all } u, v.$$
Note that if $(U, V)$ are NQD, then $\mathrm{Cov}(U, V) \leq 0$. We are now prepared to state one of the results of this section.

Theorem 1 Under the assumptions of Lemma 1,
$$\mathrm{Var}_{CS}(\hat{\theta}) \leq \mathrm{Var}_{Ind}(\hat{\theta}). \qquad (2)$$

Proof. According to the assumptions of Lemma 1, the set of events is totally ordered; therefore we can assume $e_1 \leq e_2 \leq \cdots \leq e_K$ (re-label the events if necessary). Let the function $\phi : (0,1) \to E$ be defined by $\phi(u) = e_i$ if $p_1 + \cdots + p_{i-1} \leq u < p_1 + \cdots + p_i$. Then $\phi$ is an increasing function of $u$. We assume $\phi$ is used to generate the events; under this assumption $J : E^N \to \mathbb{R}$ can be viewed as a function on $(0,1)^N$ (identify $J$ with $J \circ \phi^N$, where $\phi^N$ applies $\phi$ componentwise). From Lemma 1, and the fact that $\phi$ is increasing, it follows that $J$ is increasing in each coordinate (as a function on $(0,1)^N$).

Let $V_n^1, \ldots, V_n^M$ be $M$ i.i.d. random variables uniformly distributed on $(0,1)$. Let $\pi_n$ be, as in Section 3, a random permutation of $\{1, \ldots, M\}$. Let $U_n^j = \frac{V_n^j + \pi_n(j) - 1}{M}$ and $\varepsilon_n^j = \phi(U_n^j)$. It can be easily verified that this sampling is the same as CS2 or Latin Hypercube Sampling. Moreover, it can be easily shown that $U_n^i$ and $U_n^j$ are NQD. Since $J$ is increasing in each coordinate as a function on $(0,1)^N$ and since the random variables used for sampling from $(0,1)^N$, i.e., the $U_n^j$'s, are pairwise NQD, the assumptions of Theorem 1 of Lehmann [25] (or Proposition 1 in [1]) are satisfied; this latter theorem implies $\mathrm{Cov}_{CS}(J^1, J^2) \leq 0$. The variance reduction result (2) follows from this fact and identity (1). The same arguments can be used to show that CS1 and CS3 also lead to variance reduction under the conditions of Lemma 1. $\Box$

Remark 2 Assume that $M$, the number of replications, and the $p_i$'s, the probabilities of events, are such that there is no need to generate any events by sampling, i.e., assume $r_i = 0$ for $i = 1, \ldots, K$. In such a case it is not necessary to apply $\phi$, and there is no need to order events monotonically, since any other scheme will result in the same number of events of each type at each transition. Therefore, in this case, the monotone generation of samples of the $J^j$'s is not necessary. As noted before, when $M \gg K$, we expect the effect of events that are actually sampled (there are fewer than $K$ of them) to be insignificant. In this case also the monotone generation of the $J^j$'s does not seem to be necessary, as long as a monotone ordering of events exists.

We now turn to another special case. Let $S = \mathbb{Z}^d$ ($d \geq 1$), $E = \{e_1, \ldots, e_K\} \subset \mathbb{Z}^d$, and $f_{e_i}(x) = x + e_i$ for all $x$ and all $e_i$. In other words, this is a random walk on $\mathbb{Z}^d$ with no boundaries. In this case
$$Y_n^j = \sum_{r=1}^{n} \varepsilon_r^j,$$
where $\varepsilon_r^j \in \mathbb{Z}^d$. Furthermore, assume that the cost is linear in the coordinates of the state vector, i.e., let
$$g(y) = \langle a, y \rangle = \sum_{i=1}^{d} a_i y_i$$
($y$ and $a$ are $d$-dimensional vectors). Then
$$J = \sum_{n=1}^{N} g(Y_n) = \sum_{n=1}^{N} \langle a, Y_n \rangle = \Big\langle a, \sum_{n=1}^{N} Y_n \Big\rangle = \Big\langle a, \sum_{n=1}^{N} \sum_{s=1}^{n} \varepsilon_s \Big\rangle = \Big\langle a, \sum_{s=1}^{N} (N - s + 1)\, \varepsilon_s \Big\rangle = \sum_{s=1}^{N} (N - s + 1) \langle a, \varepsilon_s \rangle.$$
In this case $E[J]$ can be calculated analytically and there is no need to estimate this quantity from simulation. The following theorem, therefore, is intended to clarify a property of the correlated sampling schemes in a simplified setting; moreover, the result provides some insight into cases where the random walk is on a subset of $\mathbb{Z}^d$ with boundaries and where sampling via simulation is required to estimate $E[J]$.
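The rearrangement of J above is purely algebraic and easy to check numerically; the sketch below (arbitrary illustrative weights and increments) compares the direct double sum with the weighted single sum.

```python
import random

def check_linear_cost_identity(N=20, d=3, seed=0):
    rng = random.Random(seed)
    a = [rng.uniform(-1, 1) for _ in range(d)]                            # cost weights
    eps = [[rng.choice([-1, 0, 1]) for _ in range(d)] for _ in range(N)]  # increments eps_1..eps_N
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    # direct form: J = sum_n <a, Y_n>  with  Y_n = eps_1 + ... + eps_n
    Y, J_direct = [0.0] * d, 0.0
    for e in eps:
        Y = [yi + ei for yi, ei in zip(Y, e)]
        J_direct += dot(a, Y)
    # rearranged form: J = sum_s (N - s + 1) <a, eps_s>  (s is 0-based below, so the weight is N - s)
    J_weighted = sum((N - s) * dot(a, eps[s]) for s in range(N))
    assert abs(J_direct - J_weighted) < 1e-9
    return J_direct

print(check_linear_cost_identity())
```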

Theorem 2 If $\{\varepsilon_1, \ldots, \varepsilon_N\}$ are sampled according to CS1, then
$$\mathrm{Var}_{CS1}\Big(\sum_{j=1}^{M} J^j\Big) \leq \mathrm{Var}_{Ind}\Big(\sum_{j=1}^{M} J^j\Big). \qquad (3)$$
The same holds for CS2 and CS3.

Proof. Note that when CS1 is used for the simulation of $\varepsilon$, $\mathrm{Cov}_{CS1}(\varepsilon_n^i, \varepsilon_m^j) = 0$ for all $n \neq m$. Then
$$\mathrm{Cov}_{CS1}(J^i, J^j) = \sum_{n=1}^{N} \sum_{m=1}^{N} (N - n + 1)(N - m + 1)\, \mathrm{Cov}_{CS1}(\langle a, \varepsilon_n^i \rangle, \langle a, \varepsilon_m^j \rangle) = \sum_{n=1}^{N} (N - n + 1)^2\, \mathrm{Cov}_{CS1}(\langle a, \varepsilon_n^i \rangle, \langle a, \varepsilon_n^j \rangle). \qquad (4)$$
Since stratified sampling is used to sample $\{\varepsilon_n^j(U); j = 1, \ldots, M\}$, we have
$$\mathrm{Var}_{CS1}\Big(\frac{1}{M} \sum_{j=1}^{M} \langle a, \varepsilon_n^j \rangle\Big) \leq \mathrm{Var}_{IS}\Big(\frac{1}{M} \sum_{j=1}^{M} \langle a, \varepsilon_n^j \rangle\Big).$$
(Let $h(U) = \langle a, \varepsilon_n \rangle$; then the above inequality follows from a basic result of Stratified Sampling: see, e.g., Theorem 1 of [8].) Therefore,
$$\sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} \mathrm{Cov}_{CS1}(\langle a, \varepsilon_n^i \rangle, \langle a, \varepsilon_n^j \rangle) \leq 0. \qquad (5)$$
As we mentioned before, due to the random allocation of events to replications via the permutation $\pi_n$, all covariances are equal; hence (5) implies
$$\mathrm{Cov}_{CS1}(\langle a, \varepsilon_n^i \rangle, \langle a, \varepsilon_n^j \rangle) \leq 0$$
for all $n$ and all $i \neq j$. It then follows from (4) that $\mathrm{Cov}_{CS1}(J^i, J^j) \leq 0$ for all $i \neq j$, and
$$\mathrm{Var}_{CS1}\Big(\sum_{j=1}^{M} J^j\Big) = \mathrm{Var}_{IS}\Big(\sum_{j=1}^{M} J^j\Big) + \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} \mathrm{Cov}_{CS1}(J^i, J^j) \leq \mathrm{Var}_{IS}\Big(\sum_{j=1}^{M} J^j\Big). \quad \Box$$

5 Dynamic Correlated Sampling

In this section we introduce some heuristic variants of the Correlated Sampling (CS) schemes presented in Section 3. Recall that in the CS algorithms discussed in Section 3, at each transition, all $M$ events that drive the replications are correlated (independently of the states of the replications). In this sense, these schemes can be viewed as static correlated sampling schemes. In the heuristic variants presented in this section, at each transition, the set of $M$ replications is partitioned into several subsets (according to criteria that will be described shortly) and events that belong to the same subset are correlated; events that belong to different subsets are generated independently of each other. The criteria for partitioning the $M$ replications depend on the states of the replications and hence the partitions potentially change at each transition. In this sense, the schemes described in this section are dynamic correlated sampling schemes.

To describe the basic idea of the approach, we consider the random walk on $\mathbb{Z}^d$ that was introduced in Section 2. Let $x \in S$ be a state of the system; define $E(x)$, the set of active events in $x$ (or the event list at $x$), by $e_i \in E(x)$ if $f_{e_i}(x) \neq x$; in other words, the set of active events in $x$ includes those events that cause a transition from $x$ to a state other than $x$. Recall that in a random walk on $\mathbb{Z}^d$, the discrete-time process of the state of the random walk for the $j$-th replication, $\{Y_0^j, Y_1^j, \ldots\}$, is defined by
$$Y_{n+1}^j = Y_n^j + f_{\varepsilon_{n+1}^j}(Y_n^j),$$
where, with a slight abuse of notation, $f_{\varepsilon_{n+1}^j}(Y_n^j)$ here denotes the state increment: $f_{\varepsilon_{n+1}^j}(Y_n^j) = \varepsilon_{n+1}^j$ if $\varepsilon_{n+1}^j \in E(Y_n^j)$, and $f_{\varepsilon_{n+1}^j}(Y_n^j) = 0$ otherwise. Therefore, clearly, if $E(Y_n^j) = E(Y_n^i)$ then the state increments, i.e., $f_{\varepsilon_{n+1}^j}(Y_n^j)$ and $f_{\varepsilon_{n+1}^i}(Y_n^i)$, are equally distributed.

In one of the schemes for dynamic correlated sampling, called full event-list matching, we use this similarity (equality in distribution of the increments) as a criterion for partitioning the replications at each transition. In other words, at the $n$-th transition, two replications, say $i$ and $j$, belong to the same subset of the partition if $E(Y_n^j) = E(Y_n^i)$. Once the partition is determined, one of the correlated sampling schemes CS1, CS2, or CS3 is applied to each subset of the partition. Samplings from different subsets are done independently of one another. Consider, for example, the $M/M/1$ queue discussed before. In this case the possible event lists are $\{e_1, e_2\}$ and $\{e_1\}$: the former for cases where the number of customers in the system is greater than zero, and the latter for the case where this number is zero. Thus, in a full event-list matching scheme, at each transition, all replications in which the number of customers is nonzero are correlated, on the one hand, and all those that have no customers in the system are correlated, on the other.

Remark 3 The application of Rotation Sampling (RS) to parallel replicated simulation of Markov chains introduced by Fishman in [9] and [10] can be viewed as a method of dynamic correlated sampling. The criterion for partitioning replications in this case is that they be in the same state. For example, in the above $M/M/1$ queue, those replications that have exactly the same number of customers will be correlated. As is pointed out in [9] and [10], such a scheme is effective when $M \gg |S|$, i.e., when the number of replications is considerably larger than the size of the state space. Otherwise, elements of the partition will be mostly singletons and the RS scheme will behave almost as independent sampling.

A concern similar to the one expressed in the above remark motivates us to consider schemes other than the full event-list matching described above: when the number of events, $K$, is reasonably "large", then the number of possible event lists becomes "very large"; unless the number of replications, $M$, is "very large", the size of the subsets of the partition becomes "very small", and this renders the full event-list matching scheme less effective. To address this issue we consider the following class of dynamic correlated sampling schemes.

Partial event-list matching (EL_k) scheme

Let $\{e_1, \ldots, e_K\}$ be an "appropriate" labeling of the events; this may require a re-labeling of the events. The choice of labeling is system-dependent and is a control variable. We partition the replications as follows: at the $n$-th transition, two replications $i$ and $j$ belong to the same subset of the partition if
$$E(Y_n^i) \cap \{e_1, \ldots, e_k\} = E(Y_n^j) \cap \{e_1, \ldots, e_k\}.$$
In other words, we have given some prominence to events $e_1, \ldots, e_k$, and two replications will be correlated as long as the parts of their event lists that intersect with events $e_1$ to $e_k$ match. Note that for $k = K$ this corresponds to full event-list matching. The parameter $k$ is also a control variable. Once the partition is selected, one of the correlated sampling schemes CS1, CS2, or CS3 is applied to each subset of the partition. As before, subsets are sampled independently.
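A sketch of one transition of the EL_k scheme follows (hypothetical helper names: active_events(x) returns the event list E(x), and sample_correlated(m) is any of CS1-CS3 applied to a group of m replications).

```python
from collections import defaultdict

def elk_partition_step(states, k, events, active_events, sample_correlated):
    """One transition of partial event-list matching: group replications whose event
    lists agree on the first k events, then correlate the events within each group."""
    groups = defaultdict(list)
    for j, x in enumerate(states):
        key = frozenset(active_events(x)) & frozenset(events[:k])
        groups[key].append(j)
    event_vector = [None] * len(states)
    for members in groups.values():
        drawn = sample_correlated(len(members))   # CS1/CS2/CS3 restricted to this subset
        for j, e in zip(members, drawn):
            event_vector[j] = e
    return event_vector                            # event_vector[j] drives replication j
```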

6 Simulation Experiments

In this section we provide a number of experimental results. The experiments are performed for $M/M/1$ queues and a number of Closed Jackson Networks (CJN). The performance measure, $\theta$, is defined to be the expected value of some sample performance evaluated over the duration of $N$ transitions (ticks of the clock). The measure of variance reduction is defined by
$$\mathrm{Var.Red.} = \frac{\mathrm{Var}(\hat{\theta}_{IS})}{\mathrm{Var}(\hat{\theta}_{CS})}.$$
To estimate this quantity, the numerator and the denominator are estimated by, respectively, the sample variances of $I$ independent batches of $M$ independent replications, and $I$ independent batches of $M$ correlated replications (a sketch of this batch-based estimate is given after the list of schemes below). In all the experiments in this section the total number of replications, independent and correlated, is equal to 10,000; in other words, $I \cdot M = 10{,}000$. The following schemes are considered:

- CS3, as a static correlated sampling method.
- Rotation Sampling (RS), as described by Fishman in [9], [10].
- Event-list matching correlated sampling, EL_k.

As already mentioned in the previous sections, we expect the results obtained for the CS3 sampling scheme to be also representative of the CS1 and CS2 schemes. Also, recall that RS sampling is a dynamic correlated sampling scheme where only those replications that are in the same state are correlated.
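As a sketch of the bookkeeping behind the Var.Red. estimate (run_batch is a placeholder returning the M sample performances of one batch simulated under the given sampling scheme):

```python
def estimate_var_reduction(run_batch, I, M):
    """Estimate Var(theta_hat_IS) / Var(theta_hat_CS) from I independent batches of
    M replications under each sampling scheme."""
    def batch_variance(scheme):
        means = []
        for _ in range(I):
            perf = run_batch(M, scheme)        # M sample performances of one batch
            means.append(sum(perf) / M)        # the batch estimator theta_hat
        mbar = sum(means) / I
        return sum((m - mbar) ** 2 for m in means) / (I - 1)
    return batch_variance("IS") / batch_variance("CS")
```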

Experiment 1 The first set of experiments relates to an $M/M/1$ queue. The sample performance is the time average of the queue length over a period of $N$ transitions. It is assumed that initially the queue is empty. We compare the RS and CS3 algorithms for different values of the traffic intensity ($\rho = 0.2, 0.5, 0.9$). The number of transitions is selected in such a way as to reflect the time it takes the system to reach "steady state". Specifically, $N$ is equal to, respectively, 1,000, 10,000, and 100,000 for $\rho = 0.2, 0.5$, and $0.9$. The results are reported in Tables 1-3. Note that RS on the whole performs better than CS3 when the traffic intensity is low; the reverse is true when the traffic intensity is high. This general trend is not surprising in light of our discussion of RS in the previous section.

Experiment 2 The second set of experiments relates to three Closed Jackson Networks (CJN). Each network consists of 6 exponential servers and infinite buffers. The events are departures from servers and arrivals to queues. Note that each departure from a server is simultaneously an arrival to a queue. A labeling of the events is selected, and the correspondence of labels to departures and arrivals, as well as the rates of these events, is given in tabular form (Tables 4, 6, 8). The sample performance in the experiments is the average number of customers served at all servers. We compare the CS3 and EL_k schemes.

Table 4 describes the first Closed Jackson Network, CJN1. This network has the special structure of a central server: all customers go from server 1 to servers 2-6, and after processing at these servers are routed back to server 1. In the dynamic correlated sampling EL_1, two replications are correlated if the intersections of their event lists with the set $\{e_1\}$ are equal. As can be seen in Table 5, EL_1 performs better than CS3 and the difference increases when the number of replications $M$ is increased.

CJN2 has a more general structure than CJN1 (Table 6), although it still has elements of a central server system (events from and to server 1 have large rates). For CJN2 we compare CS3 with several variations of EL_k (the first $k$ events from the list in Table 6 are used to define the partition). We see in Table 7 that the variance reduction of EL_1 is similar to that of the pure central server system (CJN1) and that the variance reduction of EL_k increases with increasing $k$. Again, the difference in variance reduction becomes significant only for large $M$.

The last system is CJN3, which does not have a central server structure (Table 8). We perform a number of experiments with 10 and 30 customers in the system, and two different values of $N$. The results are reported in Tables 9-12. It can be observed that the variance reduction due to EL_k is generally higher than that of CS3. However, the difference is not significant. As before, in all cases, a general trend of increase in variance reduction is observed as the number of replications, $M$, increases.

7 Conclusions

In this paper we introduce several schemes for simulating replications of a uniformizable Markov chain in parallel on a SIMD massively parallel computer or on a network of workstations. We propose a number of ways to induce correlation across replications in order to reduce the variance of estimators. Further work is needed to determine the effectiveness of some of the schemes introduced (in particular the dynamic parallel replicated schemes) in sufficiently general settings. It is also worth studying the tradeoffs between the increased cost of computation for the correlated parallel replicated schemes, when compared to independent parallel replication, and the potential benefits of the reduction in variance of the estimators. We have not considered random horizons as a stopping criterion. Considering such cases and investigating the implications of the results of [16] and [17] in our setting, in particular in the case of correlated sampling, is worthwhile. These are subjects for further study.

References

[1] Avramidis, A., Variance Reduction for Quantile Estimation via Correlation Induction, Proceedings of the Winter Simulation Conference, 572-576, 1992.


[2] Barnhart, C.M., Wieselthier, J.E., and Ephremides, A., Improvement in Simulation Efficiency by Means of the Standard Clock: A Quantitative Study, Proceedings of the IEEE Conference on Decision and Control, 2217-2223, 1993.
[3] Bratley, P.B., Fox, B.L., and Schrage, L.E., A Guide to Simulation, 2nd ed., New York: Springer-Verlag, 1987.
[4] Cassandras, C.G., and Strickland, S.G., On-line Sensitivity Analysis of Markov Chains, IEEE Trans. on Automatic Control, 34, 76-86, 1989.
[5] Cassandras, C.G., and Strickland, S.G., Observable Augmented Systems for Sensitivity Analysis of Markov and Semi-Markov Processes, IEEE Trans. on Automatic Control, 34, 1026-1037, 1989.
[6] Cassandras, C.G., Sample-Path-Based Continuous and Discrete Optimization of Discrete Event Systems: From Gradient Estimation to "Rapid Learning", Proceedings of the 11th International Conference on Analysis and Optimization of Systems, June 1994.
[7] Chen, C.H., and Ho, Y.C., An Approximation Approach of the Standard Clock Method for General Discrete Event Simulation, IEEE Trans. on Control Systems Technology, 3(3):309, 1995.
[8] Cheng, R.C.H., and Davenport, T., The Problem of Dimensionality in Stratified Sampling, Management Science, 35, 1278-1296, 1989.
[9] Fishman, G.S., Accelerated Accuracy in the Simulation of Markov Chains, Oper. Res., 31, 466-487, 1983.
[10] Fishman, G.S., Accelerated Convergence in the Simulation of Countably Infinite State Markov Chains, Oper. Res., 31, 1074-1089, 1983.
[11] Fishman, G.S., and Huang, B.D., Antithetic Variates Revisited, Communications of the ACM, 26, 964-971, 1983.
[12] Fox, B.L., Generating Markov-Chain Transitions Quickly: I, ORSA Journal on Computing, 2, 126-135, 1990.
[13] Fox, B.L., and Glynn, P.W., Discrete-Time Conversion for Simulating Finite-Horizon Markov Processes, SIAM Journal on Appl. Math., 50, 1457-1473, 1990.

[14] Fujimoto, R.M., Parallel Discrete Event Simulation, Communications of the ACM, 33, 30-53, 1990.
[15] Glasserman, P., and Vakili, P., Comparing Uniformized Markov Chains Simulated in Parallel, Probability in the Engineering and Informational Sciences, 8, 309-326, 1994.
[16] Glynn, P.W., and Heidelberger, P., Analysis of Parallel Replicated Simulations Under a Completion Time Constraint, ACM Transactions on Modeling and Computer Simulation, 1, 3-23, 1991.
[17] Glynn, P.W., and Heidelberger, P., Analysis of Initial Transient Deletion for Parallel Steady-State Simulations, SIAM Journal on Scientific and Statistical Computing, 13, 904-922, 1992.
[18] Greenberg, A.G., Lubachevsky, B.D., and Mitrani, I., Algorithms for Unboundedly Parallel Simulations, ACM Trans. Computer Systems, 9, 201-221, 1991.
[19] Greenberg, A.G., Schlunk, O., and Whitt, W., Using Distributed-Event Parallel Simulation to Study Departures from Many Queues in Series, Probability in the Engineering and Informational Sciences, 7, 159-186, 1993.
[20] Heidelberger, P., Discrete Event Simulations and Parallel Processing: Statistical Properties, SIAM J. Sci. Stat. Comput., 9, 1114-1132, 1988.
[21] Heidelberger, P., and Nicol, D.M., Simultaneous Parallel Simulations of Continuous Time Markov Chains at Multiple Parameter Settings, Proceedings of the Winter Simulation Conference, 602-607, 1991.
[22] Heidelberger, P., and Nicol, D.M., Conservative Parallel Simulation of Continuous Time Markov Chains Using Uniformization, IEEE Transactions on Parallel and Distributed Systems, 4, 906-921, 1993.
[23] Ho, Y.C., Sreenivas, R., and Vakili, P., Ordinal Optimization of Discrete Event Dynamic Systems, Journal of Discrete Event Dynamic Systems, 61-88, 1992.
[24] Hu, J.Q., Parallel Simulation of DEDS Via Event Synchronization, Journal of Discrete Event Dynamic Systems, 1995.
[25] Lehmann, E.L., Some Concepts of Dependence, Annals of Mathematical Statistics, 37, 1137-1153, 1966.

[26] McKay, M.D., Conover, W.J., and Beckman, R.J., A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code, Technometrics, 21, 239-245, 1979.
[27] Nicol, D.M., and Heidelberger, P., Parallel Simulation of Markovian Queueing Networks Using Adaptive Uniformization, Performance Evaluation Review, 21, 122-, 1993.
[28] Stein, M., Large Sample Properties of Simulation Using Latin Hypercube Sampling, Technometrics, 29, 143-151, 1987.
[29] Streltsov, S., and Vakili, P., Parallel Replicated Simulation of Markov Chains: Parallel Implementation and Variance Reduction, Proceedings of the Winter Simulation Conference, 430-436, 1993.
[30] Strickland, S.G., and Phelan, R.G., Massively Parallel SIMD Simulation of Markovian DEDS: Events vs. Time Synchronous Methods, Journal of Discrete Event Dynamic Systems, 1995.
[31] Vakili, P., Massively Parallel and Distributed Simulation of a Class of Discrete Event Systems: A Different Perspective, TOMACS, 2, 214-238, 1992.
[32] Wieselthier, J.E., Barnhart, C.M., and Ephremides, A., Standard Clock Simulation and Ordinal Optimization Applied to Admission Control in Integrated Communication Networks, Journal of Discrete Event Dynamic Systems, 1995.


Table 1: M/M/1 queue, ρ = 0.2, N = 1,000

M     RS     CS3
8     1.26   2.03
16    2.04   2.00
32    2.53   3.11
64    2.74   2.66
128   4.45   2.57

Table 2: M/M/1 queue, ρ = 0.5, N = 10,000

M     RS     CS3
8     1.13   1.48
16    1.41   1.99
32    1.99   2.34
64    3.68   2.81
128   4.38   2.69

Table 3: M/M/1 queue, ρ = 0.9, N = 100,000

M     RS     CS3
8     1.25   2.08
16    1.13   2.40
32    1.84   3.99
64    1.79   2.49
128   3.04   9.41

Table 4: Closed Jackson Network 1

event   from   to   event rate
1       1      4    0.400
2       2      1    0.200
3       1      2    0.100
4       3      1    0.100
5       1      3    0.100
6       4      1    0.050
7       1      5    0.400
8       5      1    1.000
9       6      1    0.400
10      1      6    0.200

Table 5: Closed Jackson Network 1, 10 Customers, N = 1,000

M     CS3    EL1
4     1.14   1.13
8     1.11   1.16
16    1.50   1.58
32    1.50   1.96
64    2.01   5.45
128   3.46   10.83

Table 6: Closed Jackson Network 2

event   from   to   event rate
1       1      2    0.50
2       6      4    0.13
3       5      1    0.10
4       5      2    0.25
5       1      3    0.04
6       6      2    0.15
7       1      4    0.60
8       4      1    0.01
9       4      2    0.05
10      2      3    0.20
11      3      4    0.55
12      2      1    0.80
13      3      5    0.50
14      3      1    0.45
15      1      6    0.01
16      6      1    0.30

Table 7: Closed Jackson Network 2, 10 Customers, N = 1,000

M     CS3    EL1    EL2    EL3    EL4
4     1.26   1.31   1.25   1.27   1.27
8     1.19   1.17   1.22   1.18   1.27
16    1.31   1.43   1.38   1.33   1.43
32    1.86   2.19   2.00   2.33   2.21
64    3.42   4.84   3.93   3.88   3.90
128   3.14   5.39   6.14   5.20   7.07

Table 8: Closed Jackson Network 3

event   from   to   event rate
1       1      2    0.30
2       6      4    0.03
3       5      1    0.10
4       1      3    0.50
5       6      2    0.01
6       1      4    0.20
7       4      1    0.05
8       4      2    0.30
9       2      3    0.20
10      3      4    0.05
11      2      1    0.20
12      3      5    0.30
13      5      2    0.20
14      3      1    0.10
15      1      6    0.05

Table 9: Closed Jackson Network 3, 10 Customers, N = 1,000

M     CS3    EL1    EL2    EL3    EL4
4     1.07   1.11   1.13   1.14   1.13
8     3.12   2.76   2.89   3.07   3.19
16    3.06   3.29   3.25   3.31   3.05
32    2.73   3.29   3.56   3.29   3.25
64    4.88   5.09   5.88   5.06   5.64
128   5.07   4.76   4.49   4.06   5.70

Table 10: Closed Jackson Network 3, 10 Customers, N = 10,000

M     CS3    EL1    EL2    EL3    EL4
16    1.26   1.36   1.33   1.33   1.29
32    1.73   1.96   1.80   2.00   1.90
64    3.20   4.09   3.77   3.60   3.62
128   2.93   6.00   5.33   5.94   6.08

Table 11: Closed Jackson Network 3, 30 Customers, N = 1,000

M     CS3    EL1    EL2    EL3    EL4
8     1.68   1.78   1.73   1.72   1.75
16    2.38   2.60   2.51   2.45   2.51
32    2.55   2.75   2.76   2.64   2.92
64    2.88   3.12   2.97   2.98   2.77
128   2.89   2.93   3.18   3.16   2.87

Table 12: Closed Jackson Network 3, 30 Customers, N = 10,000

M     CS3    EL1    EL2    EL3    EL4
4     1.35   1.34   1.29   1.32   1.35
8     1.64   1.63   1.61   1.62   1.62
16    2.87   3.25   2.90   2.88   2.76
32    2.57   3.09   3.73   2.94   2.54
64    3.56   5.46   4.92   4.42   4.52
128   3.66   4.84   4.72   3.72   -