Derivatives of the Stochastic Growth Rate

Report 4 Downloads 80 Views
Derivatives of the Stochastic Growth Rate David Steinsaltz Department of Statistics University of Oxford 1 South Parks Road Oxford OX1 3TG United Kingdom [email protected]

Shripad Tuljapurkar Biology Department Stanford University Stanford, CA 94305 USA

March 28, 2011

1

Carol Horvitz Biology Department University of Miami P.O. Box 249118 Coral Gables, FL 33124 USA

Abstract We consider stochastic matrix models for population driven by random environments which form a Markov chain. The top Lyapunov exponent a, which describes the long-term growth rate, depends smoothly on the demographic parameters (represented as matrix entries) and on the parameters that define the stochastic matrix of the driving Markov chain. The derivatives of a — the “stochastic elasticities” — with respect to changes in the demographic parameters were derived by Tuljapurkar (1990). These results are here extended to a formula for the derivatives with respect to changes in the Markov chain driving the environments. We supplement these formulas with rigorous bounds on computational estimation errors, and with rigorous derivations of both the new and the old formulas.

1

Introduction

Stochastic matrix models for structured populations are widely used in evolutionary biology, demographic forecasting, ecology, and population viability analysis (e.g., Tuljapurkar (1990); Lee and Tuljapurkar (1994); Morris and Doak (2002); Caswell (2001); Lande et al. (2003)). In these models, a discrete-time stochastic process drives changes in environmental conditions that determine the population’s stagetransition rates (survival, fertility, growth, regression and so on). Population dynamics are described by a product of randomly chosen population projection matrices. In most biological situations the population’s stage structure converges to a time-varying but stable structure Cohen (1977), and in the long run the population grows at a stochastic growth rate a that is not random and is the leading Lyapunov exponent of the random product of population projection matrices Furstenberg and Kesten (1960a); Cohen (1977); Lange (1979); Lange and Holmes (1981); Tuljapurkar and Orzack (1980). This growth rate a is of considerable biological interest, as a fitness measure for a stage-structured phenotype Tuljapurkar (1982), as a determinant of population viability and persistence Tuljapurkar and Orzack (1980); Morris and Doak (2002); Lande et al. (2003), and in a variety of invasion problems in evolution and epidemiology Metz et al. (1992). The map between environments and projection matrices describes how phenotype change depends on the environment — the phenotypic norm of response — and we are often interested in how populations respond to changes in, say, the mean or variance of the projection matrix elements. Such questions are answered by computing the derivatives of a with respect to changes in the projection matrices, using a formula derived by Tuljapurkar (1990). Tuljapurkar et al. (2003) called these derivatives stochastic elasticities, to contrast with the elasticity of the dominant eigenvalue of a fixed projection matrix to the elements of that matrix (Caswell, 2001). Stochastic elasticity has been used to examine evolutionary questions (Haridas and Tuljapurkar, 2005) and the effects of climate change (Morris et al., 2008). At the same time, a is also a function of the stochastic process that drives environments. Many processes, such as climate change (Boyce et al., 2006), will result in changes in the frequencies of, or the probabilities of transition between, environmental states. How is a affected by a change in the pattern and distribution of environments, rather than by a change in the population projection matrices? To answer this question, we consider a model in which the environment makes transitions among several discrete states, according to a Markov chain. Then what we want is the derivative of a with respect to changes in the transition probabilities of

2

this Markov chain. This derivative exists (at least away from the boundaries of the space of stochastic matrices), and in fact we know from Peres (1992) that a is an analytic function of the parameters of both the projection matrices and the parameters defining the stochastic matrix, in an open neighborhood of the set of stochastic matrices. In deterministic models (Caswell, 2001), the growth rate is represented as λ = er ; then sensitivities are derivatives of the form (∂λ/∂x) with respect to a parameter x whereas elasticities are proportional derivatives of the form (∂r/∂ log x). In stochastic models we compute derivatives of a, and these can be used to compute elasticities (as in Tuljapurkar et al. (2003)) or sensitivities. Our first contribution here is a new formula for computing the derivative of a with respect to changes in the transition probabilities of the environmental Markov chain, given in abstract form as equation (40). To obtain this result we show how an initial environmental state affects future growth, using coupling and importance sampling; this analysis may be of independent interest. Even with a formula in hand we must compute derivatives of a by numerical simulation which is subject to both sampling (Monte Carlo) error and bias. Our second contribution here is to show how one can bound these estimation errors. Our third contribution is a rigorous proof of the heuristically derived formula given by Tuljapurkar (1990) for the derivatives of a to the elements of the population projection matrices. In Section 2 of this paper we set out the model and assumptions, the approach to finding derivatives, along with necessary facts about the convergence of population structures and distributions. In Section 3 we discuss systematic and sampling errors and show how we can bound them. We illustrate this approach in Section 4 by presenting bounds (in Theorem 1) for simulation estimates of the stochastic growth rate a and (in Theorem 2) for the derivatives of a with respect to projection matrix elements. In Section 5 we define a measure of the effect of an initial environmental state on subsequent population growth and show how to estimate this measure using coupling arguments. Section 6 presents (in Theorem 4) the formula, algorithm, and error bounds for the derivative of a with respect to the elements of the Markov chain that drives environments. We end by discussing how these theorems can be applied and some related issues concerning parameter estimation in such models. Proofs are in the Appendix.

2

The Model

We consider a population whose individuals exist in K different stages (these may be, for example, ages, developmental stages or size classes). Newborns are in stage i = 1. The progression between stages occurs at discrete time intervals at rates that depend on the environment in each time interval. The environment et in period t is in one of M possible states; we denote the set of possible environments by M = {1, . . . , M }. Individuals in stage i at time t move to stage j at a rate Xet+1 (j, i). These rates are elements of a nonnegative population projection matrix, and at time t when the environment is et this matrix is denoted by Xet ; there are M such matrices, one for each environmental state. We assume that allocation of individuals to classes and the identification of environment states are certain. We also assume that the total number of individuals in the population is large enough that we can ignore sampling variation. Successive environments are chosen according to a Markov process with transition matrix P whose elements are P (e, e˘) and whose stationary distribution is ν = {ν(e)}. We follow the standard convention for Markov chains, that P (e, e˘)

3

represents the probability of a transition from state e to state e˘; note that this is the reverse of the convention used in matrix population models. To guarantee demographic weak ergodicity (Cohen 1977) we assume that (i) There exists some R > 0 such that any product Xe1 · · · XeR has all entries positive. This implies, in particular, that each population projection matrix is row-allowable and column-allowable. That is, every row and every column has at least one positive entry. (ii) The chain is ergodic (so transitive and aperiodic), and environments are in the stationary distribution of P , which will also be denoted by ν(e). T The population in year t is represented by a vector Nt ∈ RK + := {(nt (1), . . . , nt (K)) : nt (i) > 0}. The superscript T will always mean transpose; here it indicates that population vectors are column vectors. (There may be population classes early on that have 0 members; condition (ii) above forces all classes eventually to have positive membership, and so we assume without loss of generality that we start with all population classes occupied.) The population structure changes according to Nt+1 = Xet+1 Nt , and Nt = Xet Xet−1 · · · Xe1 N0 . (1) P The normalized population structure Nt /( i Nt (i)) does not converge to a fixed limit (as it would if the environment were constant) but it does converge in distribution. At each stage i, we define P Nt (i) −1 −1 P (2) a := lim t log Nt (i) = lim t log i t→∞ t→∞ N i 0 (i)

A fundamental result in this area of research is the Furstenberg-Kesten Theorem Furstenberg and Kesten (1960b), which tells us that the long-run growth rate exists and is not random. This a is called the “stochastic growth rate”. The proof of this fact is fairly straightforward, within the framework of modern Markov chain theory. We present a version in section 3.1 and a proof in the appendix, both for the reader’s convenience and because it illustrates some of the important basic ideas that we draw on throughout this work. The main result of this paper is formula (40), which represents the derivative of a with respect to an incremental change in the probability of transitioning from e to e˜. If we denote this derivative by Ae,˜e , then the change corresponding to shifting P in the direction of a matrix W is a linear functional X da(P + W ) = Ae,˜e We,˜e . d e,˜ e∈M

We note that, while the existence of such a derivative follows from the general theory of Peres (1992), the computable formula is new.  Counting the K 2 parameters in each matrix, there are at most M K 2 + M 2 parameters. While all parameters must be nonnegative, and all elements of the population projection matrices but the birth rates must be ≤ 1, the only universal P constraint is e∈M P (˘ e, e) = 1. (There may, however, be further constraints imposed, as some transitions may be impossible. If we are considering age-structured populations, the matrices are Leslie matrices, each with only 2K − 1 potentially nonzero parameters.) The sensitivities we examine are derivatives of a with respect

4

to these parameters. We will define a one-dimensional family of population matrices or environment transition matrices, and we refer to change “in the direction of” the derivative of this family. This will be made precise in the analyses that follow.

3 3.1

Technical background Contraction

We denote the K-dimensionalP column vector with 1’s in all places by 1. By default we use the L1 norm kxk := |xi | when x is a vector in RK , and write kXk := PK i,j=1 |Xi,j | when X is a K × K matrix. Note that kXk = kX1k when X is a nonnegative matrix. Our assumptions imply that there are positive constants kˆ and rˆ, such that for any environments e1 , . . . , em , log kXem · · · Xe1 k ≤ kˆ + mˆ r (3) We use the Hilbert projective metric ρ, described in Golubitsky et al. (1975) and in section 3.1 of Seneta (1981), defined by ρ(x, y) := log max x(i)/y(i) + log max y(i)/x(i). 1≤i≤K

1≤i≤K

(4)

T This isP a pseudometric on RK + that is a metric on S := {(x(1), . . . , x(K)) : x(i) > 0 and x(i) = 1}. The distance between two vectors is defined by the ray from the origin; that is, ρ(x, y) = ρ(x/kxk, y/kyk). It is straightforward to show — cf. Bushell (1973) — that i 1h min{x(i)} + min{y(i)} e−ρ(x,y) ρ(x, y) ≤ kx − yk ≤ eρ(x,y) − 1 2

for any x, y ∈ S. (The upper bound is shown by Bushell with respect to the Euclidean norm, but the same argument holds for any Lp norm.) Thus, convergence in the projective metric implies that the projections onto S converge in the standard norms. It is also straightforward to show that x(i) kxk max log ≤ ρ(x, y) + log . y(i) kyk

1≤i≤K

(5)

The Birkhoff contraction coefficient (cf. Seneta (1981)) is defined for nonnegative matrices X as ρ(Xu, Xv) τB (X) := sup . (6) ρ(u, v) u6=v∈S It is clear that τB is sub-multiplicative (that is, τB (XY ) ≤ τB (X)τB (Y )). We also have the formula (Theorem 3.12 of Seneta (1981)) ( X Xj` mini,j,k,` Xik if X > 0, 1 − φ(X)1/2 jk Xi` τB (X) = , where φ(X) = (7) 1/2 1 + φ(X) 0 otherwise. (This depends on the assumption that X is row-allowable. We will be applying this result only to cases in which X is positive, so our bound on τB will be strictly less than 1.) An immediate consequence is Lemma 3.1.

5

Lemma 3.1. For any fixed R, set r := maxe1 ,...,eR ∈M τB (XeR · · · Xe1 )1/R . Then for any e1 , . . . , en ∈ M, τB (Xen · · · Xe1 ) ≤ k1 rn , (8) where k1 := r1/R−1 . In particular, if R is chosen so that every product XeR · · · Xe1 is positive, then r < 1. We also know that for any u, u ˘ ∈ S, and v T any positive row vector,     u(i) vT u u(i) log min ≤ log T ≤ log max . u ˘(i) v u ˘ u ˘(i) Since the left-hand side is ≤ 0 and the right-hand side ≥ 0, so log v T u − log v T u ˘ ≤ ρ(u, u ˘).

(9)

Following Lemma 1 of Lange (1979), we define a compact subset U ⊂ S which is stable under the transformations u → Xe u/kXe uk for any e ∈ M and includes the vector 1/K, as well as all vectors of the form XeR · · · Xe1 y/kXeR · · · Xe1 yk where y ∈ S; and a compact subset V ⊂ ST which is stable under the transformations v T → v T Xe /kv T Xe k and includes the vector 1T /K as well as all vectors of the form y T XeR · · · Xe1 /ky T XeR · · · Xe1 k, where y ∈ S. Note that we could take U to be any Un , defined to be the union of all products Xe1 · · · Xen S. These are a decreasing sequence of compact sets, with Xe Un ⊂ Un+1 . whose limit U∞ is in general a fractal set, which is the support of the stationary distribution π. For practicality we are likely to work with a U which is a finite-stage approximation to U∞ . Since U is a compact subset of S, its diameter ∆ := supu,˘u∈U ρ(u, u ˘) is finite. Combining this with Lemma 3.1, there is a constant k2 = k1 ∆ such that ρ (Xem · · · Xe1 u, Xem · · · Xe1 u ˘) ≤ k1 rm ρ(u, u ˘ ) ≤ k2 r m .

(10)

(Cf. Theorem 2 of Lange and Holmes (1981).) Of course, the constants may be chosen so that the same relation holds for the transposed matrices, with uT , u ˘T ∈ V. We use then the definition ( ) ∆ := max

sup ρ(u, u ˘), sup ρ(v, v˘) . u,˘ u∈U

(11)

v,˘ v ∈U

We also define ∆U (u) := sup ρ(u, u ˘);

∆V (v) := sup ρ(v, v˘).

u ˘∈U

(12)

v ˘∈V

Since for any positive vectors u, u0 , u00 we have ρ(u+u0 , u00 ) ≤ max{ρ(u, u00 ), ρ(u0 , u00 )}, the maximum of ρ(u, u0 ) among vectors in a cone is taken between extreme points of the cone, we can compute ∆ as the maximum of all distances of the form ρ(XeR · · · Xe1 1, Xe0R · · · Xe01 1); and distances of the form ρ(1XeR · · · Xe1 , 1Xe0R · · · Xe01 ). It follows (as in Lemma 2 of Lange and Holmes (1981)) that for any u, u ˘∈U and environments e, e1 , . . . , em , ˘k log kXe Xem · · · Xe1 uk − log kXe Xem · · · Xe1 u ≤ k1 rm ρ(u, u ˘) kXem · · · Xe1 uk kXem · · · Xe1 u ˘k (13) ≤ k2 rm .

6

The same relation holds when the matrices X are replaced by their transposes, with uT , u ˘T ∈ V. Since kXk = kX1k, and 1/K is in U, it immediately follows that if e˘1 , . . . , e˘i are any other environments, log kXe Xem · · · Xe1 k − log kXe Xem · · · Xei+1 Xe˘i · · · Xe˘1 k ≤ k2 rm−i . (14) kXem · · · Xe1 k kXem · · · Xei+1 Xe˘i · · · Xe˘1 k We note that the results in Lange and Holmes (1981) depend only on the set of matrices X being compact, not on it being finite. In section 5 and beyond we will be letting the matrices Xe and/or the transition matrix P depend smoothly on a parameter x, which will take values either in [−x0 , x0 ] or [0, x0 ]. We may then choose the sets U and V and constants k1 , k2 , r such that the properties above — in particular, the stability of U and V and the bounds (10) and (13) — hold simultaneously for all values of the parameter. One slightly more complicated bound is proved in the appendix. Lemma 3.2. Given v, v˘, u, u ˘ ∈ RK + , and constants ku , kv ≤ kv and max | log uj /˘ uj | ≤ ku , we have T T ˘ log v u − log v u ≤ 8ku kv . v˘T u v˘T u ˘

1 4

with max | log vi /˘ vi | ≤

(15)

This condition is satisfied if 1 1 ˘) + log kvk/k˘ v k ≤ . kv := ρ(v, v˘) + log kvk/k˘ v k ≤ and ku := ρ(u, u 4 4

3.2

Time reversal and the stationary distribution

The transition matrix for the time-reversal of P will be denoted Pe, and is given by ν(˘ e) Pe(e, e˘) := P (˘ e, e). ν(e) A standard result (for example, see Theorem 6.5.1 of Grimmett and Stirzaker (2001)) tells us that if e1 , e2 , . . . , em form a stationary Markov chain with transition matrix P , for a fixed m, the reversed sequence em , em−1 , . . . , e1 is a Markov chain with transition matrix Pe. (We use the boldface e to represent a random environment — a random variable with values in M — and e to represent a particular environment or a realisation of a random variable e.) As described in Lange and Holmes (1981), if e1 , e2 , . . . forms a stationary Markov chain with transition probabilities P , there is a unique distribution π on U × M which is stable under the transformation (u, et ) 7→ (Xet u/kXet uk, et+1 ). That is, if the normalized population structure Nt /kNt k paired with et+1 is chosen from the distribution π, then the pair (Nt+1 /kNt+1 k, et+2 ) will also be in the distribution π. This Markov chain is naturally represented as an iterated function system; that is, a step in the chain is made by choosing randomly from a collection of possible functions, and applying it to the current position of the chain. In this case, the function is u 7→ Xe u/kXe uk, where e is the next random environment. Time reversal is a convenient approach for studying the asymptotic behavior of such systems, and more general random dynamical systems. For applications, see Barnsley and Elton (1988); Elton (1990); Steinsaltz (1999).

7

To understand how this works, suppose we extend our Markov chain of environments forward and backward in time to infinity: . . . , e−2 , e−1 , e0 , e1 , e2 , . . . . Starting at any s, positive or negative, moving forward we have a Markov chain with transition matrix P : es , es+1 , es+2 , . . . . Moving backward, es , es−1 , es−2 , . . . is a Markov chain with transition matrix Pe. Choose any population vector u0 ∈ U, and define for s > t, Xes Xes−1 · · · Xet u0 Us,t := . (16) kXes Xes−1 · · · Xet u0 k From (10), we know that for any t1 , t2 ≥ m, we have the bound ρ(Us,−t1 , Us,−t2 ) ≤ k2 rm+s , which means that Us,−t )∞ t=0 is a Cauchy sequence. We may denote its limit by Us,−∞ . We note four important features of this backward iterated sequence: (i) If the environments all shift one to the left, so that we start at es+1 instead of es , the distribution remains exactly the same. Thus, the distribution of Us,−∞ is the same for any s. (ii) If we start our population Markov chain in the state (U0,−∞ , e1 ), then the next state will be   Xe1 U0,−∞ , e2 = (U1,−∞ , e2 ) . kXe1 U0,−∞ k Thus, the distribution of (U0,−∞ , e1 ) is a stationary distribution π for the chain. (iii) For each fixed t, U0,−t has the same distribution as Ut,0 =

Xet Xet−1 · · · Xe0 u0 , kXet Xet−1 · · · Xe0 u0 k

(17)

which is a realization of the normalized population vector Nt /kNt k, when it is started at the vector u0 . Thus, the population-environment Markov chain (Nt /kNt k) converges in distribution to the same stationary distribution π, no matter what the initial condition. (iv) We will use frequently the fact that (U0,−t , e1 ) may be considered an approximate sample from the distribution π, with an error that is bounded by  ρ(U0,−t , U0,−∞ ) = ρ Xe0 · · · Xe−t u0 , Xe0 · · · Xe−t · U−t−1,−∞  ≤ ∆U (u0 )τB Xe0 · · · Xe−t , which is straightforward to bound (cf. section 3.1), independent of any information about U−t,−∞ . The same holds true, of course, if we reverse the matrix multiplication: Starting from any nonnegative K-dimensional row vector v0T , we define the sequence of row vectors v T Xe · · · Xe0 T ∈ V. V0,t := 0T t kv0 Xet · · · Xe0 k Then V0,t converges pointwise to a vector V0,∞ := limt→∞ V0,t . We denote the distribution of (V0,∞ , e−1 ) by π ˜ . As before, π ˜ is the stationary distribution for the Markov chain on V ×M, defined by taking (v T , e−t ) at time t to (v T Xe−t /kv T Xe−t k, e−(t+1) ) at time t + 1.

8

We also define the regular conditional distributions πe on U as follows: pick (U, e) from the distribution π, conditioned on e being e, and take πe to be the distribution of U . We define π ˜e similarly. In the case of i.i.d. environments, of course, π and π ˜ would be simply products of an independent population vector and the stationary environment with distribution ν.

4

Errors and How to Bound Them

A standard approach to estimating a — and a similar approach to the derivatives that we describe later on — is to choose a fixed starting vector u0 ∈ U, simulate sequences e0 (i), e1 (i), . . . , emi (i) independently from the stationary Markov chain with transition probabilities P (i = 1, . . . , J), and then compute # " kXemi (i) Xemi −1 (i) · · · Xe0 (i) u0 k am := E log kXemi −1 (i) · · · Xe0 (i) u0 k ≈

J 1 X log kXemi (i) Xem−1 (i) · · · Xe0 (i) u0 k J i=1

(18)

 − log kXemi −1 (i) · · · Xe0 (i) u0 k . It is important not only to know what would be an appropriate approximation to a or its derivatives in the sense of being asymptotically correct, but also to have rigorous bounds for the error arising from any finite simulation procedure, such as (18). There are two sources of error: systematic error, arising from the fact that (Xe0 , u0 ) is not exactly a sample from the distribution π; and sampling error, arising from the fact that we have estimated the expectation by averaging over a random sample. For obtaining an unbiased estimate there is no reason the sample Markov chain realizations should need to be independent. For instance, a standard approach would be to take a single realization of the chain e0 , e1 , e2 , . . . , and take et (i) = et , with mi = B + i for some “burn-in time” B. The problem with this approach is that it becomes more difficult to bound the errors; in principle, the convergence could be extremely slow for these kinds of sums. We leave the problem of bounding errors for these cumulative simulations to a future work.

4.1

Systematic error

By “systematic error” we mean the error in our estimate of a arising from the difference between the distribution we are aiming for and the distribution we are actually sampling from. The quantity we are trying to estimate may be represented as a = π[F ], the expectation of a certain function F with respect to the distribution PJ π. If we can simulate Z1 , . . . , ZJ from π, then a ˆJ := J −1 j=1 F (Zj ) is an unbiased estimator of π[F ], and will be consistent under modest assumptions on F and the independence of the samples. Suppose, though that what we have are not samples from π, but samples Zj∗ from a “similar” distribution π ∗ . Then we can bound the error by J −1 X ∗ ∗ |a − a ˆ| ≤ J F (Zj ) − π [F ] + π[F ] − π ∗ [F ] . (19) j=1

9

Here the first term on the right-hand side is the sampling error, and the second term is the bias, the expected value of systematic error. The problem is that the bounds we can obtain for the bias are likely to be crude, absent good computational tools for the distribution π. (And if we could compute analytically from π, we wouldn’t need to be simulating). An alternative is to couple the samples Zj∗ from the approximate distribution π 0 to exact samples Zj from the distribution π, we can break up the error in a slightly different way: J J J X X X F (Zj ) − J −1 F (Zj∗ ) |a − a ˆ| ≤ J −1 F (Zj ) − π[F ] + J −1 j=1

j=1

j=1

J J X X F (Zj ) − π[F ] + J −1 ≤ J −1 F (Zj ) − F (Zj∗ ) j=1

(20)

j=1

Bounds for the sampling error in (19) will generally also be bounds for the first term in (20). The second term in (20), on the other hand, which takes the place of the bias, is a random variable, computed from the samples Zj∗ . Its expectation is still a bound on the bias. The crucial fact is that the last line may be computable without knowing in detail what the “true” sample Zj is. A small disadvantage of this approach is that the systematic error varies with the sample. To achieve a particular fixed error bound we need an adaptive approach, whereby we successively extend our sequence of matrices until the error crosses the desired threshold. We note here that this approach to estimating the systematic error in simulations is essentially just a version of the Propp-Wilson algorithm (cf. Chapter 10 of H¨ aggstr¨ om (2002)). Unlike standard applications, though, the space is continuous, so the systematic error never reaches 0.

4.2

Sampling error

The sampling error is difficult to control with current techniques, because the distribution of the samples is so poorly understood — the very reason why we resort to the Monte Carlo approximation in the first place. The best we can do for a rigorous bound is to use Hoeffding’s inequality (see Hoeffding (1963)), relying on crude bounds on the terms in the expectation. Hoeffding’s inequality tells us that if X1 , . . . , XJ are i.i.d. random variables such that α ≤ Xi ≤ β almost surely, then for any z > 0,  X  1 2 2 P Xi − E[X] > z ≤ 2e−2Jz /(β−α) . (21) J This is essentially the same bound that we would estimate from the normal approximation if the standard deviation of X were (β − α)/2. Generally we will want to fix p0 , the confidence level, and compute the corresponding z, which will be r 1 z0 = (β − α) − log(p0 /2). (22) 2J Of course, the standard deviation will be smaller than this, but we do not know√how much smaller. An alternative approach then would be to use the bound 2τ (z J/ˆ σ ), where σ ˆ is the standard deviation of the simulated samples, and τ is the cumulative distribution function of the Student t distribution with J − 1 degrees

10

of freedom. This will be a smaller bound, in that sense “better”, but not precisely true for finite samples, to the extent that the sample distribution is not normal. The corresponding bound on the error, at probability level p0 , is σ ˆ z0 = √ t1−p0 /2 (J − 1), J

(23)

where tp (J − 1) is the p quantile of the Student T distribution with J − 1 degrees of freedom; that is, if T has this distribution then P {T > tp (J − 1)} = p. These asymptotic bounds are perfectly conventional in Markov chain Monte Carlo analysis (see, for example, Asmussen and Glynn (2007)), and there is no reason particularly to eschew them in this context. They are likely to be quite accurate, and superior to the rigorous bounds, and might also be applied to the setting where the expectations are estimated not from independent samples, but from a single run of the environment chain. We have nonetheless emphasised the Hoeffding-based rigorous bounds in the statements of our theorems for three reasons: First, they are likely to be less familiar, and the reader may require more guidance in applying them to the individual cases; second, because some residue of skepticism must remain for the asymptotic bounds, while these results may be applied to calculations that are otherwise analytically precise; and third, as a spur to further thought on the best ways to bound errors rigorously in these sorts of problems.

5 Growth Rate and Sensitivity to Projection Matrices We present here extensions of two known results. In these cases (and in later results) we start by defining an estimator that converges to the the quantity we desire, and follow that by bounds on the systematic and sampling errors, as well as an error bound for estimates from a simulation estimator. We state our results on error bounds in the form “The quantity Q may be approximated by the expectation of A, with systematic error bounded by B and sampling error bounded by C(J, p).” This means that if A1 , . . . , AJ are independent realizations of A, then the probability P that the true value of Q is not in the interval J −1 Ai ± [B + C(J, p)] is no bigger than p. When describing an adaptive bound on the systematic error, B will depend upon the particular simulation result. Again, the sampling error may be bounded either by a universally valid Hoeffding bound, based on known upper bounds on the samples, or by the Student t distribution, using the standard deviation estimated from the sample, which provides a generally much superior bound, but which can only be treated as an approximation.

5.1

Computing a

The stochastic growth rate a is commonly estimated by numerical simulation but, as discussed with examples by Caswell (2001), there is no general way to bound the errors in the estimated values. The following result provides suitable bounds. Theorem 5.1. Let u0 be any fixed element of U, and Ym := Xem Xem−1 · · · Xe1 , where e0 , e1 , . . . form a Markov chain with transition rates P . The stochastic growth rate may be approximated by the simulated expectation of log

kXem+1 Ym u0 k , kYm u0 k

11

(24)

with systematic error bounded by k2 rm and sampling error at level p on J samples bounded by   1/2 supu∈U maxe∈M kXe uk − log p log . (25) inf u∈U mine∈M kXe uk 2J When the simulated expectation is J kXem+1 (i)Ym (i)u0 k 1X log J i=1 kYm (i)u0 k

we may also bound the systematic error by J J  ∆U (u0 ) X  1X sup ρ Ym (i)u, Ym (i)˘ u ≤ τB Ym (i) , J i=1 u,˘u∈U J i=1

(26)

where ∆U (u0 ) is defined as in (12) and τB is defined as in (6).

5.2

Derivatives with respect to Projection Matrices

We need care in defining derivatives of a with respect to elements of the population projection matrices. As discussed in Tuljapurkar and Horvitz (2003) we must define how the matrix entries change, e.g., do we change fertility rates in a particular environment, or in all possible environments? Although the main formula here is known, Tuljapurkar’s (1990) derivation did not justify a crucial exchange of limits (between taking the perturbation to zero and time to infinity). We provide a rigorous proof (see Appendix) and of course the error bounds here are new. We will suppose that the matrices Xe depend smoothly on a parameter , so that we may define Xe0 := ∂Xe /∂, and we define the base matrices to be at  = 0. In some cases, the parametrization will be defined only for  ≥ 0, and in those cases we will understand the partial derivatives to be one-sided derivatives, and the limits lim→0 will be the one-sided limits lim↓0 . Theorem 5.2. Let Ue and Ve be independent random variables with distributions πe and π ˜e (the conditional stationary distributions defined in section 3.2). Then  T 0  X V X Ue a0 (0) = νe E eT e (27) Ve Xe Ue e∈M

Each term may be approximated by averaging samples of the form X e∈M

νe

V (m)T Xe0 U (m) , V (m)T Xe U (m)

(28)

where u0 , v0T are any fixed elements of U, V respectively, U (m) = X˜e1 · · · X˜em u0 and V (m)T = v0T Xem · · · Xe1 , e = ˜ e0 , ˜ e1 , . . . , ˜ em form a sample from the Markov chain Pe, and e = e0 , e1 , . . . , em form a sample from the Markov chain P . The systematic error may be bounded uniformly by 2(exp(4k2 rm ) − 1)|a0 (0)|,

12

(29)

while the sampling error at level p on J samples is bounded by  1/2  X  v T X 0 u v˘T X 0 u − log(p/2) e e˘ sup νe − . 2J v T Xe u v˘T Xe u ˘ u∈U ,v T ∈V

(30)

e∈M

u ˘∈U ,˘ v T ∈V

Suppose the simulated expectation is J 1 X X (V (m) (j))T Xe0 U (m) (j) νe (m) , J j=1 (V (j))T Xe U (m) (j) e∈M

where U (m) (j) = X˜e1 (j) · · · X˜em (j) u0 =: Yem (j)u0 and (V (m) (j))T = v0T Xem (j) · · · Xe1 (j) =: v0T Ym (j). Let  U(j) := Yem (j)U = Yem (j)u : u ∈ U ,  V(j) := VYm (j) = v T Ym (j) : v T ∈ V . Then we may also bound the systematic error by J X νe X J j=1

e∈M

sup u∈U (j) v T ∈V(j)

T 0 v Xe u (V (m) (j))T Xe0 U (m) (j) v T Xe u − (V (m) (j))T X U (m) (j) e

( ) ! J X νe X ≤ exp 2 sup ρ(u, U (m) (j)) + 2 sup ρ(v, V (m) (j)) − 1 J j=1 u∈U (j) v T ∈V(j) e∈M

×

v T |Xe0 |u v T Xe u

sup u∈U (j) v T ∈V(j)



J X νe X J j=1

e∈M

sup u∈U (j) v T ∈V(j)

n    o v T |Xe0 |u  em (j) + τB Ym (j)T exp 2∆ τ Y − 1 . B v T Xe u (31)

Note that the bound (29) is given as a proportion of the unknown a0 (0). It can be turned into an explicit bound by using an upper bound on a0 (0). We have the trivial bound X |v T Xe0 u| . (32) |a0 (0)| ≤ νe sup T u∈U v Xe u e∈M

v T ∈V

We recall that ∆ and τB may be bounded according to the formulas given in section 3.1. It may seem surprising that the vectors Ue and Ve are independent. If we think of the environments as a doubly infinite sequence with e0 = e, then the vector Ue in equation (27) depends only on the past (that is, ei with i < 0) and Ve depends only on the future (ei with i > 0). By the Markov property, these two are independent, when conditioned on e0 = e.

13

6

Environments and Coupling

Suppose we change the transition matrix P to a slightly different matrix P () , and want to compare population growth along environmental sequences generated by the original and the perturbed matrix. We expect that the perturbed environmental sequences will only occasionally deviate from the environment that we “would have had” in the original distribution of environments. Computing the derivative of a is then a matter of measuring the cumulative deviations due to these changes. These may be split into two parts: First, the process moves under P () to a state e˘, different from the state e that it would have moved to under P . Then there is a sequence of environments following on from e˘ that is different from the sequence that would have followed from e, until the Markov chain gradually “forgets” its starting point. The change to a new sequence of matrices induces two separate changes on the growing population: The new sequence accumulates a difference in magnitude on its way to stationarity; and it produces a different stationary distribution of unit vectors depending on the starting environment e˘ rather than e. In this section we examine the first effect. We fix the transition matrix P and compare the growth of total population size when starting from environment e to the growth starting from the stationary distribution ν. A standard method for doing this is coupling. For an outline of coupling techniques in MCMC, see Kendall (2005) and Roberts and Rosenthal (2004). We use coupling in two ways, corresponding to the two components of the Markov chain: the environment and the population vector. Fix environments e and e˘ (possibly the same). We define sequences e0 , e1 , . . . ; ˘ e0 , ˘ e1 , . . . ; and ¯ e0 , ¯ e1 , . . . : all three are Markov chains with transition probabilities P , but with e0 = e, ˘ e0 = e˘ and ¯ e0 having distribution ν (so that (¯ ei ) is stationary). We define the total population effect of starting in state e rather than e˘ as   e˘ζe := lim E [log kXet · · · Xe0 k] − E [log kX˘ et · · · X˘ e0 k] t→∞   ζe := lim E [log kXet · · · Xe0 k] − E [log kX¯et · · · X¯e0 k] (33) t→∞ X = νe˘ · e˘ζe . e˘∈M

This is one of the terms that will come into formula (40), the derivative of a with respect to shifting transition probabilities from e to e˘. Note that when the environments are i.i.d. — so P (e, e˘) = νe˘ — we have   kV T Xe k , e˘ζe = E log kV T Xe˘k where V T has the distribution π. Computing ζe depends on coupling the version of the Markov chain starting at e, to another version starting in the distribution ν. We define the coupling time τ to be the first time such that eτ = ¯ eτ ; after this time the chains follow identical trajectories. If we know the distribution of τ and of the sequences followed by the two chains from time 0 to τ , we can average the diferences in (33) to find ζ. The advantage of coupling is, first, that it reduces the variability of the estimates, and second, that we know from the simulation when the coupling time has been achieved, which gives bounds on the error. A suitable choice is Griffeath’s maximal

14

coupling (Griffeath, 1975) which we will apply in Pitman’s (Pitman, 1976) pathdecomposition representation. (The coupling is “maximal” in the sense of making the coupling time, and hence the variance of the estimate, as small as possible.) However we must be careful about sampling values of τ , because they may be large if the Markov chain mixes slowly. To deal with this we overweight large coupling times that generate a large contribution to ζ. Beginning with a fixed environment e, the procedure is as follows: (i) Define the sequence of vectors αt := P t (e, ·) − ν(·). We also define αt+ and αt− to be the vectors of respectively. Let pointwise positive and negative parts C(t) be a bound on log kXe1 · · · Xet k − log kXe˘1 · · · Xe˘t k , where the ei and e˘i are any environments. From (3) we know that 2kˆ + 2tˆ r is a possible choice for C(t). (ii) For pairs (t, e), where t is a positive integer and e∗ ∈ M, define a probability distribution   if e = e˘, t = 0; νe q(t, e˘) := 0 if e 6= e˘, t = 0;   + [αt−1 P ](˘ e) − αt+ (˘ e) otherwise. This is the distribution of the pair (τ, eτ ) for the maximally coupled chain. Define ∞ X X A := q(t, e˘)C(t), t=1 e˘∈M

and a probability distribution on N × M ˚ q (t, e˘) :=

q(t, e˘)C(t) . A

(iii) Average J independent realizations of the following random variable: Let (τ, e0 ) be chosen from the distribution ˚ q on N × M, and ˘ e0 independently from the distribution ν. From these starting states e0 and ˘ e0 , let (e0 , . . . , eτ , . . . , em ) and (˘ e0 , . . . , ˘ eτ , . . . , ˘ em ) be realizations of the coupled pair of Markov chains with transition probabilities P , conditioned on the coupling time being τ and eτ = ˘ eτ = ˘ e. These realizations are generated from independent inhomogeneous Markov chains running backward, with transition probabilities  α+ (x)P (x, y) , P ei−1 = x ei = y = P i−1 + z∈M αi−1 (z)P (z, y)  α− (x)P (x, y) P e˜i−1 = x ˘ ei = y = P i−1 − . z∈M αi−1 (z)P (z, y) We extend the chain past τ , requiring et = ˘ et for t > τ , as a realization of the Markov transition probabilities P , to obtain a total sample of predetermined length m. The random variable is then Z :=

A kXem · · · Xeτ · · · Xe0 k log . C(τ ) kXem · · · Xeτ +1 X˘eτ · · · X˘e0 k

(Note that the realizations corresponding to τ = 0 are identically 0. The possibility of τ = 0 has been included only to simplify the notation. In practice, we are free to condition on τ > 0.)

15

The change from q to ˚ q is an example of importance sampling (cf. Chapter V.1 in Asmussen and Glynn (2007)). We oversample the values of the random variable with high τ to reduce the variability of the estimate. The importance sampling makes Z(j) a bounded random variable, with bound A. Imagine that we had a source of perfect samples V T (j) from the distribution π ˜em , and define e Z(j) :=

kV T (j)Xem (j) · · · Xe0 (j) k A log . C(τ (j)) kV T (j)X˘em (j) · · · X˘e0 (j) k

Let Y (j) := Xem (j) · · · Xe0 (j) and Y˘ (j) := X˘em (j) · · · X˘e0 (j) . Then    e − Z(j) ≤ A∆V (1) τB Y (j) + τB Y˘ (j) . Z(j) C(τ (j)) e = ζe , we may use (22) to compute the bound Since E[Z]   r n   X 1 e > 2A − Z(j) log(p0 /2) ≤ p0 . P ζe − n−1   2J

(34)

(35)

j=1

Lemma 6.1. The limits defining the coefficients e˘ζe and ζe exist and are finite. We may approximate ζe by J kXeτ (j) · · · Xe0 (j) k 1X A log . J j=1 C(τ (j)) kXe˜τ (j) · · · Xe˜0 (j) k

(36)

If 0 < p0 ≤ 1, the probability is no more than p0 that the error in this estimation is larger than r J   1 X A∆V (1)  1 ˘ τB Y (j) + τB Y (j) + 2A − log(p0 /2), (37) J j=1 C(τ (j)) 2J It remains to bound A. From standard Markov chain theory, for any vector v ∈ S, kv T P t − ν T k ≤ Dξ t . (38) Setting setting Q := P − 1ν T , αtT is the e-th row of Qt , which may also be written as αt = 1Te (Qt )T , where 1e is the column vector with 1 in place e and 0 elsewhere. Thus kαt k ≤ Dξ t . If we use the bound C(t) = kˆ + tˆ r, then A=

∞ X

  kˆ + tˆ r kαt−1 k − kαt k

t=1

ˆ 1 k + rˆ = kkα

∞ X

kαt k

t=1

  Dˆ r ≤ Dξ kˆ + 1−ξ

16

(39)

7 Derivatives with respect to Environmental Transitions We are now ready to compute derivatives of a with respect to changes in the distribution of environments, P as determined by P . Complicating the notation slightly is the constraint {P : e˘ P (e, e˘) = 1 for each e}; thus, there can be no sense in speaking of the derivative with respect to changes in P (e, e˘) for some particular e, e˘. Instead, we mustP compute directional derivatives along the direction of some matrix W , in the plane e˘ We,˘e = 0. For the purposes of this result we write a() = a(P () ), where P () is a differentiable curve of M × M matrices, where the parameter  takes values either in a two-sided interval [−0 , 0 ], or a one-sided interval [0, 0 ]. Let W := ∂P () /∂, an M ×M matrix whose rows all sum to 0. The perturbations are such that P () retains the ergodicity and irreducibility of P . (The result should be the same whether  is positive or negative. If P is on the boundary of the set of possible values, one or the other sign may be impossible. Some choices of W may be impossible in both directions.) In the special case in which We,˘e = 1 and We,˘e = −1, with all other entries 0, we are computing the derivative corresponding to a small increase in the rate of transitioning from environment e to e˘, and a decrease in the rate of transitioning to e˘. In this section the matrices X1 , . . . , XM are assumed fixed. Theorem 7.1. The derivative of the stochastic growth rate is    X VeT Xe Xe˘Ue˘ 0 , a (0) = νe˘We˘,e ζe + E log kVeT Xe k

(40)

e,˘ e∈M

where Ue˘, VeT are independent random variables with distributions πe˘ and π ˜e respectively. The quantities ζe may be approximated, with error bounds, according to the algorithm described in section 6. The other part of the expression may be approximated by averaging samples of the form (m) (m)T X Xe Xe˜Ue˜ Ve , (41) νe˜We˜,e log (m)T kVe Xe k e,˜ e∈M where (m)

Ue˜

Ve(m)T

X˜e1 · · · X˜em u0 and kX˜e1 · · · X˜em u0 k v T Xe · · · Xe1 = 0T m , kv0 Xem · · · Xe1 k

=

and e = e0 , e1 , . . . , em is a Markov chain with transition matrix P , and e˜ = ˜ e0 , ˜ e1 , . . . , ˜ em is an independent Markov chain with transition probabilities Pe, and u0 ∈ U and v0T ∈ V. The systematic error may be bounded uniformly by 2k2 rm k(ν T |W |)k, while the sampling error at level p on J samples bounded by  1/2

− log p 2 ν T |W | sup max | log v T Xe u| . (42) 2J u∈U e∈M v T ∈V

17

Suppose the simulated expectation is J 1X X V (m,˜e)T (j)Xe˜Xe U (m,e) (j) νe˜We,˜e log , J j=1 kV (m,˜e)T (j)Xe˜k e˜,e∈M

where (m)

Ue˜

(j) =

Ve(m)T (j) =

(m) X˜e1 (j) · · · X˜em (j) u0 Ye (j)u0 =: e˜(m) and kX˜e1 (j) · · · X˜em (j) u0 k kYee˜ (j)u0 k (m) v0T Xem (j) · · · Xe1 (j) v T Ye (j) =: 0 (m) . T kv0 Xem (j) · · · Xe1 (j) k kv0T Ye (j)k

We may also bound the systematic error by  J 1X X (m) (m) νe˜|We˜,e | sup ρ(Yee˜ (j)u, Yee˜ (j)˘ u) J j=1 u,˘ u∈U e˜,e∈M

+

sup

ρ(v

v T ,˘ v T ∈V



T



Ye(m) (j)Xe , v˘T Ye(m) (j)Xe )

(43)

J    ∆X X (m) νe˜|We,˜e | τB Yee˜ (j) + τB XeT Ye(m) (j)T . J j=1 e˜,e∈M

We note that expressions like (40) are examples of what Br´emaud (1992) calls “ersatz derivatives”. In a rather different class of applications Br´emaud suggests applying maximal coupling.

8

Discussion

Our results provide analytical formulas and simulation estimators for the derivatives of stochastic growth rate with respect to the transition probability matrix or the population projection matrices. We have concentrated here on the theoretical results; although this may not be obvious, we have made considerable effort at brevity. Partly for this reason, we will present elsewhere numerical applications of these results. We expect that our results should carry over to integral population models (IPMs), given the strong parallels between the stochastic ergodicity properties of IPMs and matrix models (Ellner and Rees, 2007). Our results apply not only to stochastic structured populations but to any stochastic system in which a Lyapunov exponent of a product of random matrices determines stability or other dynamic properties. Examples include the net reproductive rate in epidemic models and some models of network dynamics. An obvious application of our results is to the analysis of optimal life histories, i.e., environment-to-projection matrix maps that maximize the stochastic growth rate. As discussed by McNamara (1997), this optimization problem translates into what is called an average reward problem in stochastic control theory, and so our results may be more generally useful in such control problems.

18

9

Acknowledgements

We thank NIA (BSR) for support under 1P01 AG22500. David Steinsaltz was supported by a New Dynamics of Ageing grant, a joint interdisciplinary program of the UK research councils.

19

Appendix Proofs of the theorems.

A.1

Proof of Lemma 3.2

We have T T v u v u ˘ log − log T = log v˘T u v˘ u ˘ = log

vi ui v˘j u ˘j P ˘i v˘j uj i,j vi u P P ˘i u ˘i + i<j (vi ui v˘j u ˘j + vj uj v˘i u ˘i ) i ui vi u P P ˘i u ˘i + i<j (vi u ˘i v˘j uj + vj u ˘j v˘i ui ) i ui vi u vi ui v˘j u ˘j + vj uj v˘i u ˘i ≤ max log i,j vi u ˘i v˘j uj + vj u ˘j v˘i ui P

i,j

For some choice of i, j, let α := log

ui , u ˘i

β := log

vi , v˘i

γ := log

uj , u ˘j

δ := log

vj . v˘j

Note that by (5), 1 |α| ≤ ρ(u, u ˘) + log kuk/k˘ uk ≤ , 4 1 v k ≤ , |β| ≤ ρ(v, v˘) + log kvk/k˘ 4 1 uk ≤ , |γ| ≤ ρ(u, u ˘) + log kuk/k˘ 4 1 uk ≤ . |δ| ≤ ρ(v, v˘) + log kuk/k˘ 4 Then we need to bound α+β α+β e + eγ+δ − eβ+δ − eα+γ + eγ+δ log e ≤ eβ+δ + eα+γ min{eα+β + eγ+δ , eβ+δ + eα+γ } ≤ eα+β + eγ+δ − eβ+δ − eα+γ , where we have used the relation | log x| ≤ max{x, 1/x} − 1. We now use the Taylor series expansion ex =

n X xk k=0

k!

+

20

Cxk+1 , (k + 1)!

(44)

(45)

where |C| ≤ max{ex , 1}. This turns the right-hand side of (45) into 2

∞ X 1 (α + β)k + (γ + δ)k − (β + δ)k − (α + γ)k k! k=2 ∞ k−1 X  1 X =2 α` β k−` + γ ` δ k−` − β ` δ k−` − α` γ k−` `!(k − `)! k=2 `=1  k−2 ∞ k−2 X X k − 2 X k − 2 1 ≤2 αβ α` β k−2−` + |γδ| γ ` δ k−2−` (k − 2)! ` ` k=0 `=0 `=0     k−2 k−2 X k−2 X k − 2 + |βδ| β ` δ k−2−` + |αγ| α` γ k−2−` . ` ` `=0

`=0

Applying the bounds (44) yields finally the upper bound in (15).

A.2

Estimating the stochastic growth rate

We prove here Theorem 5.1. The quantity we are trying to compute is a = E [log kXe U k] , where (U, e) is selected from the distribution π.

(46)

Let e0 , e1 , e2 , . . . be a realization of the stationary Markov chain with transition matrix P . Let Ym := Xem Xem−1 · · · Xe1 . Choose u0 ∈ U, and let U be a random variable with distribution πe0 . Then a = E[log kXem+1 Ym U k/kYm U k], which may be approximated by E[log kXem+1 Ym u0 k/kYm u0 k]. If we identify systematic error with bias, this is     kXem+1 Ym U k kXem+1 Ym u0 k − E log Errorsys = E log , kYm u0 k kYm U k since (Ym U/kYm U k, em ) also has the distribution π (if Ym and U are taken to be independent, conditioned on e0 ). Thus   kXem+1 Ym u0 k kXem+1 Ym U k Errorsys ≤ E log − log kYm u0 k kYm U k " # kX Y uk kX Y u ˘ k e m e m m+1 m+1 ≤ E sup log − log kYm uk kYm u ˘k u,˘ u∈U ≤ k2 rm , by (13). The corresponding bound on the sampling error may be computed from (22). For a particular choice of of e1 , . . . , em+1 and U we can also represent the random systematic error as log kXem+1 Ym u0 k − log kXem+1 Ym U k , kYm u0 k kYm U k which may be bounded by the summand in (26).

21

A.3

Estimating sensitivities: Matrix entries

We prove here Theorem 5.2. As discussed at the end of section 3.1, we may assume that the compact sets U and V are stable and satisfy the bounds of section 3.1 () simultaneously for all Xe . The stationary distributions corresponding to products () of the perturbed matrices are denoted π () and π ˜e , and the corresponding regular () () conditional distributions are πe and π ˜e 0 The derivative a (0) may be written as # "  ! () () () kX X · · · X u k kX X · · · X u k e e e 0 e e e 0 m m−1 0 m m−1 0 − lim E log lim −1 lim E log () () m→∞ m→∞ →0 kXem−1 · · · Xe0 u0 k kXem−1 · · · Xe0 u0 k # "  ! () () () kX˜e0 X˜e1 · · · X˜em u0 k kX˜e0 X˜e1 · · · Xem u0 k − lim E log = lim E log () () m→∞ m→∞ kX˜e1 · · · Xem u0 k kX˜e1 · · · X˜em u0 k = lim lim

→0 m→∞

m X

as,m () =: lim lim A(m, ), →0 m→∞

s=0

where   () () () kX˜e0 X˜e1 · · · X˜es X˜es+1 · · · X˜em u0 k −1 as,m () := E  log () () kX˜e1 · · · X˜es X˜es+1 · · · X˜em u0 k ()

− log

()

()

kX˜e0 X˜e1 · · · X˜es−1 X˜es · · · X˜em u0 k  ()

()

kX˜e1 · · · X˜es−1 X˜es · · · X˜em u0 k

 .

Here e0 , e1 , . . . is the stationary Markov chain with transition probabilities P , and ˜ e0 , ˜ e1 , . . . is the reverse stationary Markov chain, with transition probabilities Pe. By (13), for  > 0 sufficiently small   () |as,m ()| ≤ −1 k2 rs−1 sup E ρ(X˜es u, X˜es u) ≤ 2k2 Crs , u∈U

where C := r−1 max max

e∈M 1≤`≤K

 max u∈U

|v T X 0 |` |Xe0 u|` , max T e (Xe u)` v∈V (v Xe )`

 .

If we define Vi,j := Ui,j :=

()

()

()

()

()

()

1T X˜ei X˜ei+1 · · · X˜ej

k1T X˜ei X˜ei+1 · · · X˜ej k X˜ei X˜ei+1 · · · X˜ej u0 , kX˜ei X˜ei+1 · · · X˜ej u0 k

then for m ≤ n,  () () V (0, s − 1)T X˜es U (s + 1, m) V (0, s − 1)T X˜es U (s + 1, n) −1 as,m () − as,n () ≤  E log − log V (0, s − 1)T X˜es U (s + 1, m) V (0, s − 1)T X˜es U (s + 1, n)  () () T T V1,s−1 X˜es Us+1,n V1,s−1 X˜es Us+1,m + log T − log T V1,s−1 X˜es Us+1,m V1,s−1 X˜es Us+1,n

22

We note by (10) that ρ(Us+1,m , Us+1,n ) ≤ k2 rm−s ; and for  > 0 sufficiently small and any v ∈ V we have () (v T Xe )j max log ≤ 2C. 1≤j≤K, (v T Xe )j e∈M

It follows by Lemma 3.2 that  as,m () − as,n () ≤ min 16k2 Crm−s , 4k2 Crs .

(47)

Putting these together, we get, for all n ≥ m, n m X X A(m, ) − A(n, ) ≤ as,m () − as,n () |as,n ()| +



s=m+1 n X

s=0

s=m+1



X

2k2 Crs +

16k2 Crm−s +

1≤s≤m/2

X

4k2 Crs

m/2<s≤m

20k2 C m/2 r . 1−r

By Lemma A.1 we may exchange the order of the limits, to see that  m   () () () X kXem Xem−1 · · · Xes Xes−1 · · · Xe0 u0 k E log () () kXem−1 · · · Xes Xes−1 · · · Xe0 u0 k s=1   () () () kXem Xem−1 · · · Xes+1 Xes · · · Xe0 u0 k . − E log () () kXem−1 · · · Xes+1 Xes · · · Xe0 u0 k

a0 (0) = lim lim −1 m→∞ →0

(48)

This limit is the same for any choice of u0 , hence would also be the same if we replaced u0 by a random U , with any distribution on U. We choose U to have the distribution πe0 , independent of the rest of the Markov chain. By the invariance property of the distributions π, " # X m () () () kXem Xem−1 · · · Xes Xes−1 · · · Xe0 U k 0 −1 E log a (0) = lim lim  () () m→∞ →0 kXem−1 · · · Xes Xes−1 · · · Xe0 U k s=1 # " () () () kXem Xem−1 · · · Xes+1 Xes · · · Xe0 U k − E log () () kXem−1 · · · Xes+1 Xes · · · Xe0 U k (49)  m () () () X kXem Xem−1 · · · Xes Us−1 k −1 = lim lim  E log () () m→∞ →0 kXem−1 · · · Xes Us−1 k s=0  () () () kXem Xem−1 · · · Xes+1 Xes Us−1 k − log , () () kXem−1 · · · Xes+1 Xes Us−1 k where (Us−1 , es ) has distribution π. For m ≥ s ≥ 1 define functions (δ)

fs (, δ) :=

(δ)

(δ)

()

1T Xem Xem−1 · · · Xes+1 Xes Us−1 (δ)

(δ)

()

1T Xem−1 · · · Xes+1 Xes Us−1

23

,

where the denominator is understood to be 1 for s = m. The summand on the right of (49) may be written as lim −1 E [log fs (, ) − log fs (0, )] .

(50)

→0

By (13), −1 |log fs (, δ) − log fs (0, δ)| ≤ −1 k2 rs ρ(Xe() U, Xes U ) < 2Ck2 rs s for  in a neighborhood of 0, so the Bounded Convergence Theorem turns (50) into   h i ∂ log fs −1 (0, 0) . (51) E lim  log fs (, ) − log fs (0, ) = E →0 ∂ We have, by linearity of the matrix product and k · k, ()

∂ 1T Xem · · · Xes+1 ∂ Xes Us−1 ∂ log fs (0, 0) = ∂ kXem · · · Xes Us−1 k ()



=

∂ 1T Xem−1 · · · Xes+1 ∂ Xes Us−1 kXem−1 · · · Xes Us−1 k

  T 0   1 Xem ···Xes+1 Xes Us−1 − kXe ···Xe Us−1 k T

 

m Xe0 m Um−1

1 kXem Um−1 k

s

1T Xem−1 ···Xes+1 Xe0 s Us−1 kXem−1 ···Xes Us−1 k

 for 1 ≤ s ≤ m − 1,

for s = m.

Combining this with (49) yields the telescoping sum " # X m 1T Xet · · · Xe1 Xe0 0 U 0 a (0) = lim E T m→∞ 1 Xet · · · Xe1 Xe0 U t=1 " # m−1 X 1T Xet · · · Xe1 Xe0 0 U E T − 1 Xet · · · Xe1 Xe0 U t=1 " # T 0 1 Xem · · · Xe1 Xe0 U = lim E T , m→∞ 1 Xem · · · Xe1 Xe0 U where in the last line (U, e0 ) has the distribution π. Define VmT := 1T Xem · · · Xe1 /k1T Xem · · · Xe1 k. Then VmT converges in distribution to V , with distribution π ˜e0 (conditioned on e0 ), and so  T 0    X Vm Xe0 U VeT Xe0 Ue 0 a (0) = lim E = νe E , m→∞ VmT Xe0 U VeT Xe Ue e∈M

which is identical to (27). Now we estimate the error. We use the representation U :=

X˜e1 · · · X˜em U0 V T Xe · · · Xe1 and V T := 0T m , kX˜e1 · · · X˜em U0 k kV0 Xem · · · Xe1 k

where U0 and V0T are assumed to have distributions π˜em and π ˜em respectively. Then V (m)T X 0 U (m)   |V X 0 U | (m)T V T X 0 U e )+2ρ(U,U (m) ) −1 (m)T e (m) − T e ≤ 2 e2ρ(V,V V Xe U |V Xe U | V Xe U (52)   |V X 0 U | e 4k2 r m ≤2 e −1 , V Xe U

24

by (13). This implies the uniform bound on systematic error, and the bound on sampling error (30) follows from applying (22) to a trivial bound on the terms in the average. The simulated bound (31) also follows directly from (52). Lemma A.1. Let A(m, ) be a two-dimensional array of real numbers, indexed by m ∈ N and  > 0, with A(m, 0) := lim↓0 A(m, ) existing for each m, and A(∞, ) := limm→∞ A(m, ) existing for all  sufficiently small (independent of m). Suppose A satisfies lim lim sup sup A(m, ) − A(n, ) = 0. (53) M →∞

↓0

m,n>M

Then the two limits lim lim A(m, ) and lim lim A(m, )

m→∞ ↓0

↓0 m→∞

are equal; in particular, if one exists the other exists as well. Proof. Suppose A∗ := lim↓0 A(∞, ) exists. Then we need to show that A∗ = limm→∞ A(m, 0). Choose any δ > 0, and choose M such that lim sup sup A(m, ) − A(n, ) < δ. ↓0

m,n>M

This means that we may find 0 > 0 such that A(m, ) − A(n, ) < 2δ when 0 <  < 0 and m, n > M , and such that also |A(∞, ) − A∗ | < 2δ for all 0 <  < 0 . It follows, in particular, that A(m, )−A∗ < 4δ when m > M and 0 <  < 0 . Thus |A(m, 0) − A∗ | < 4δ for m > M , and consequently | lim supm→∞ A(m, 0) − A∗ | < 4δ. Since δ is arbitrary, it follows that limm→∞ A(m, 0) = A∗ . The converse result (starting from the assumption that limm→∞ A(m, 0) exists) follows identically.

A.4

Estimating sensitivities: Markov environments

We derive here the formula (40) by a combination of the coupling method and importance sampling. We use importance sampling for the actual computation, but coupling provides a more direct path to validating the crucial exchange of limits. As in the proof of Theorem 5.2, the error bounds are an obvious consequence of the formula (40) and the general formulas for errors described in section 4. Given two distributions q and q ∗ on {1, . . . , M }, we define a standard coupling between q and q ∗ . Suppose we are given a uniform random variable P ω on [0, 1]. Let M− := {e : qe < qe∗ } and M+ := {e : qe∗ < qe }. Let δ := e∈M− (qe∗ − qe ) = P ∗ ˘ on M, e˘+ on M+ , and e˘− on e∈M+ (qe − qe ). We define three random variables e M− , according to the following distributions: P{˘ e = e} = min{qe , qe∗ }/(1 − δ), P{˘ e+ = e} = (qe − qe∗ )+ /δ, P{˘ e− = e} = (qe∗ − qe )+ /δ. The joint distribution is irrelevant, but for definiteness we let them be independent. Then we define the coupled pair (e, e∗ ) to have the values (˘ e, e˘) if ω > δ; (˘ e+ , e˘− ) if ω ≤ δ.

25

(54)

We write $E_P$ for the expectation with respect to the distribution that makes $e_0,\dots,e_m$ a stationary Markov chain with transition matrix $P$. Define $\nu^{(\epsilon)}$ to be the stationary distribution corresponding to $P^{(\epsilon)}$, and define $\widetilde{P}^{(\epsilon)}$ to be the time-reversed chain of $P^{(\epsilon)}$. We define
$$g(m,\epsilon;u) := E_{P^{(\epsilon)}}\left[\log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right] - E_{P}\left[\log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right].$$
By the time-reversal property,
$$g(m,\epsilon;u) = E_{\widetilde{P}^{(\epsilon)}}\left[\log \frac{\|X_{e_0} X_{e_1} \cdots X_{e_m} u\|}{\|X_{e_1} \cdots X_{e_m} u\|}\right] - E_{\widetilde{P}}\left[\log \frac{\|X_{e_0} X_{e_1} \cdots X_{e_m} u\|}{\|X_{e_1} \cdots X_{e_m} u\|}\right].$$
For $\epsilon > 0$ we couple a sequence $e_0,\dots,e_m$ drawn from the distribution $\widetilde{P}$ to a sequence $e^{(\epsilon)}_0,\dots,e^{(\epsilon)}_m$ drawn from the distribution $\widetilde{P}^{(\epsilon)}$ as follows: we start by choosing $(e_0, e^{(\epsilon)}_0)$ according to the standard coupling of $(\nu, \nu^{(\epsilon)})$. Assuming now that we have produced sequences of length $i$, ending in $e_{i-1}$ and $e^{(\epsilon)}_{i-1}$, we produce $(e_i, e^{(\epsilon)}_i)$ according to the standard coupling of row $e_{i-1}$ of $\widetilde{P}$ to row $e^{(\epsilon)}_{i-1}$ of $\widetilde{P}^{(\epsilon)}$.
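The time-reversed chain used here can be computed directly from $P$ and its stationary distribution. A minimal sketch, assuming the usual reversal formula $\widetilde{P}(e,e') = \nu_{e'} P(e',e)/\nu_e$ (function names are ours):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P
    (left eigenvector for eigenvalue 1, normalized to sum to 1)."""
    w, v = np.linalg.eig(P.T)
    nu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return nu / nu.sum()

def time_reversal(P):
    """Time-reversed transition matrix: P_tilde[e, e'] = nu[e'] * P[e', e] / nu[e]."""
    nu = stationary_distribution(P)
    return nu[np.newaxis, :] * P.T / nu[:, np.newaxis]
```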

(To simplify the typography we will in some places write $e(i)$ and $e^{(\epsilon)}(i)$ interchangeably with $e_i$ and $e^{(\epsilon)}_i$.) Let $\delta = \delta(\epsilon)$ be the maximum of the total-variation distance between $\nu$ and $\nu^{(\epsilon)}$ and the total-variation distances between all corresponding pairs of rows. It is easy to see that there is a constant $c$ such that $\delta \le c\epsilon$ for $\epsilon$ sufficiently small. Let $\omega_1, \omega_2, \dots$ be an i.i.d. sequence of uniform random variables on $[0,1]$, and define two sequences of random times as follows: $T_0 := S_0 := -1$, and
$$T_{i+1} := \min\bigl\{t > S_i : \omega_t \le \delta\bigr\}, \qquad S_{i+1} := \min\bigl\{t > T_{i+1} : e_t = e^{(\epsilon)}_t\bigr\}.$$
Thus $e_t = e^{(\epsilon)}_t$ for all $S_i \le t < T_{i+1}$. Define for any $u_0 \in \mathcal{U}$ the random vector
$$U_t := \lim_{m \to \infty} \frac{X_{e(t)} \cdots X_{e(t+m)} u_0}{\|X_{e(t)} \cdots X_{e(t+m)} u_0\|},$$
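A rough simulation sketch of this coupled construction, reusing `sample_standard_coupling` from above (again with our own names; `P_tilde` and `P_tilde_eps` are the two time-reversed transition matrices as arrays of rows; the text draws the break times from a single $\delta$, while this sketch simply lets each standard-coupling step use its own uniform):

```python
def coupled_paths(P_tilde, P_tilde_eps, nu, nu_eps, m, rng):
    """Simulate coupled sequences e_0..e_m and e^(eps)_0..e^(eps)_m.

    Each step applies the standard coupling to the relevant rows, so the
    two paths agree except on excursions like the intervals [T_i, S_i)
    described in the text; we also return the times where they differ.
    """
    e, e_eps = sample_standard_coupling(nu, nu_eps, rng)
    path, path_eps = [e], [e_eps]
    for _ in range(m):
        e, e_eps = sample_standard_coupling(P_tilde[path[-1]],
                                            P_tilde_eps[path_eps[-1]], rng)
        path.append(e)
        path_eps.append(e_eps)
    differ = [t for t in range(m + 1) if path[t] != path_eps[t]]
    return path, path_eps, differ
```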

and define a version of $g$ conditioned on $T_1$ and $T_2$:
$$g(m,\epsilon;u;T_1,T_2) := E_{\widetilde{P}^{(\epsilon)}}\left[\log \frac{\|X_{e_0} X_{e_1} \cdots X_{e_m} u\|}{\|X_{e_1} \cdots X_{e_m} u\|} \,\Big|\, T_1, T_2\right] - E_{\widetilde{P}}\left[\log \frac{\|X_{e_0} X_{e_1} \cdots X_{e_m} u\|}{\|X_{e_1} \cdots X_{e_m} u\|} \,\Big|\, T_1, T_2\right].$$
Then for any $u \in \mathcal{U}$,
$$\nabla_W a(P) = a'(0) = \lim_{\epsilon \downarrow 0} \lim_{m \to \infty} \epsilon^{-1} E\bigl[g(m,\epsilon;u;T_1,T_2)\bigr]. \tag{55}$$
We also define
$$\gamma(\epsilon;T_1,T_2) := E\left[\log \frac{\|X_{e^{(\epsilon)}(0)} \cdots X_{e^{(\epsilon)}(S_1-1)} U_{S_1}\|}{\|X_{e^{(\epsilon)}(1)} \cdots X_{e^{(\epsilon)}(S_1-1)} U_{S_1}\|} - \log \frac{\|X_{e(0)} X_{e(1)} \cdots X_{e(S_1-1)} U_{S_1}\|}{\|X_{e(1)} \cdots X_{e(S_1-1)} U_{S_1}\|} \,\Big|\, T_1, T_2\right].$$

We break these expectations into their portions on three different events: (i) $\{T_1 > m\}$; (ii) $\{T_2 > m \ge T_1\}$; (iii) $\{m \ge T_2\}$.

On the event $\{T_1 > m\}$ we have $g(m,\epsilon;u;T_1,T_2) = 0$, and $T_1 - m$ is geometrically distributed with parameter $\delta$. By (13), $\gamma$ is bounded by $k_2 r^{T_1 - 1}$.

On the event $\{T_2 > m \ge T_1\}$: we have $e^{(\epsilon)}(i) = e(i)$ for $i < T_1$ and for $S_1 \le i \le m$. If $S_1 \le m$,
$$U_{S_1} = \frac{X_{e(S_1)} \cdots X_{e(m)} U_{m+1}}{\|X_{e(S_1)} \cdots X_{e(m)} U_{m+1}\|} = \frac{X_{e^{(\epsilon)}(S_1)} \cdots X_{e^{(\epsilon)}(m)} U_{m+1}}{\|X_{e^{(\epsilon)}(S_1)} \cdots X_{e^{(\epsilon)}(m)} U_{m+1}\|}.$$
Thus we may write
$$\begin{aligned}
\bigl|\gamma(\epsilon;T_1,T_2) - g(m,\epsilon;u;T_1,T_2)\bigr|
&\le \left| E\left[\log \frac{\|X_{e(0)} \cdots X_{e(T_1-1)} X_{e^{(\epsilon)}(T_1)} \cdots X_{e^{(\epsilon)}(m)} \breve{U}^{(\epsilon)}\|}{\|X_{e(1)} \cdots X_{e(T_1-1)} X_{e^{(\epsilon)}(T_1)} \cdots X_{e^{(\epsilon)}(m)} \breve{U}^{(\epsilon)}\|} \,\Big|\, T_1, T_2\right] \right. \\
&\qquad \left. - E\left[\log \frac{\|X_{e(0)} \cdots X_{e(T_1-1)} X_{e^{(\epsilon)}(T_1)} \cdots X_{e^{(\epsilon)}(m)} u\|}{\|X_{e(1)} \cdots X_{e(T_1-1)} X_{e^{(\epsilon)}(T_1)} \cdots X_{e^{(\epsilon)}(m)} u\|} \,\Big|\, T_1, T_2\right] \right| \\
&\quad + \left| E\left[\log \frac{\|X_{e(0)} X_{e(1)} \cdots X_{e(m)} \breve{U}\|}{\|X_{e(1)} \cdots X_{e(m)} \breve{U}\|} - \log \frac{\|X_{e(0)} \cdots X_{e(T_1-1)} X_{e(T_1)} \cdots X_{e(m)} u\|}{\|X_{e(1)} \cdots X_{e(T_1-1)} X_{e(T_1)} \cdots X_{e(m)} u\|} \,\Big|\, T_1, T_2\right] \right| \\
&\le 2 k_2 r^m,
\end{aligned} \tag{56}$$
where $\breve{U}^{(\epsilon)} = \breve{U} = U_{m+1}$ if $S_1 \le m$; otherwise
$$\breve{U}^{(\epsilon)} = \frac{X_{e^{(\epsilon)}(m+1)} \cdots X_{e^{(\epsilon)}(S_1-1)} U_{S_1}}{\|X_{e^{(\epsilon)}(m+1)} \cdots X_{e^{(\epsilon)}(S_1-1)} U_{S_1}\|}, \qquad \breve{U} = \frac{X_{e(m+1)} \cdots X_{e(S_1-1)} U_{S_1}}{\|X_{e(m+1)} \cdots X_{e(S_1-1)} U_{S_1}\|}.$$

On the event $\{T_2 \le m\}$: the same approach shows that
$$\bigl|\gamma(\epsilon;T_1,T_2) - g(m,\epsilon;u;T_1,T_2)\bigr| \le 2 k_2 r^{T_2 - 1}. \tag{57}$$
Combining these bounds, we obtain
$$\bigl|\gamma(\epsilon;T_1,T_2) - g(m,\epsilon;u;T_1,T_2)\bigr| \le k_2 r^{T_1 - 1} \mathbf{1}_{\{T_1 > m\}} + 2 k_2 r^m \mathbf{1}_{\{T_1 \le m\}} + 2 k_2 r^{T_2 - 1}. \tag{58}$$

Taking the expectation with respect to the distribution of $T_1$ and $T_2$, and using the fact that $T_1$ and $T_2 - S_1$ are independent and geometrically distributed with parameter $\delta$, we obtain
$$E\Bigl[\bigl|\gamma(\epsilon;T_1,T_2) - g(m,\epsilon;u;T_1,T_2)\bigr|\Bigr] \le \frac{k_2}{1-r}\,\delta r^{m-1} + 2 k_2\,\delta m r^m + \frac{2 k_2}{r(1-r)^2}\,\delta^2. \tag{59}$$
Since $\delta$ is bounded by a constant times $|\epsilon|$, we may find a constant $C$ such that (by the triangle inequality) for all $\epsilon$, positive integers $m$, and $u \in \mathcal{U}$,
$$\Bigl| E\bigl[\gamma(\epsilon;T_1,T_2)\bigr] - E\bigl[g(m,\epsilon;u;T_1,T_2)\bigr] \Bigr| \le E\Bigl[\bigl|\gamma(\epsilon;T_1,T_2) - g(m,\epsilon;u;T_1,T_2)\bigr|\Bigr] \le C\bigl(m r^m |\epsilon| + \epsilon^2\bigr). \tag{60}$$
This bound allows us to exchange the limits in (55):
$$\begin{aligned}
a'(0) &= \lim_{\epsilon \to 0} \lim_{m \to \infty} \epsilon^{-1} E\bigl[g(m,\epsilon;u;T_1,T_2)\bigr] = \lim_{\epsilon \to 0} \epsilon^{-1} E\bigl[\gamma(\epsilon;T_1,T_2)\bigr] \\
&= \lim_{m \to \infty} \lim_{\epsilon \to 0} \epsilon^{-1} E\bigl[g(m,\epsilon;u;T_1,T_2)\bigr] \\
&= \lim_{m \to \infty} \frac{d}{d\epsilon}\Big|_{\epsilon=0} E_{P^{(\epsilon)}}\left[\log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right].
\end{aligned} \tag{61}$$
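In computations, (60) tells us how large $m$ must be for a given $\epsilon$; a trivial helper (ours) makes the trade-off explicit:

```python
def coupling_error_bound(m, eps, C, r):
    """The bound C * (m * r**m * |eps| + eps**2) from (60) on the gap
    between E[gamma(eps)] and E[g(m, eps)]; C and r come from the model."""
    return C * (m * r**m * abs(eps) + eps**2)

# e.g. coupling_error_bound(200, 1e-3, C=1.0, r=0.9) is approximately 1e-6:
# the eps**2 term dominates once m * r**m has decayed away.
```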

Now we apply the method of importance sampling. We may assume without loss of generality that $W(e,e') = 0$ whenever $P(e,e') = 0$ (using the analyticity of $a$, and the fact that the formula (40) is nonsingular on the nonnegative orthant). For any function $Z : \mathcal{M}^{m+1} \to \mathbb{R}$,
$$E_{P^{(\epsilon)}}\bigl[Z(e_0,\dots,e_m)\bigr] = E_{P}\bigl[Z(e_0,\dots,e_m)\,F(\epsilon;e_0,\dots,e_m)\bigr],$$
where $F$ is the Radon-Nikodym derivative
$$F(\epsilon;e_0,\dots,e_m) = \frac{dP^{(\epsilon)}}{dP}(e_0,\dots,e_m) = \frac{\nu^{(\epsilon)}_{e_0}}{\nu_{e_0}} \prod_{i=0}^{m-1} \frac{P^{(\epsilon)}(e_i,e_{i+1})}{P(e_i,e_{i+1})}.$$
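This likelihood ratio is straightforward to compute along a simulated path; a minimal sketch (names ours):

```python
def importance_weight(path, P, P_eps, nu, nu_eps):
    """Radon-Nikodym weight F(eps; e_0, ..., e_m) of a path under the
    perturbed chain (P_eps, nu_eps) relative to the baseline (P, nu)."""
    w = nu_eps[path[0]] / nu[path[0]]
    for a, b in zip(path[:-1], path[1:]):
        w *= P_eps[a, b] / P[a, b]
    return w
```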

This allows us to rewrite
$$a'(0) = \lim_{m \to \infty} \frac{d}{d\epsilon}\Big|_{\epsilon=0} E_{P}\left[\frac{\nu^{(\epsilon)}_{e_0}}{\nu_{e_0}} \prod_{i=0}^{m-1} \frac{P^{(\epsilon)}(e_i,e_{i+1})}{P(e_i,e_{i+1})} \log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right]. \tag{62}$$

For any fixed $m$ there is an upper bound on $\epsilon^{-1}\bigl(F(\epsilon;e_0,\dots,e_m) - 1\bigr)$, so we may move the differentiation inside the expectation, to obtain
$$\begin{aligned}
a'(0) &= \lim_{m \to \infty} E_{P}\left[\frac{d}{d\epsilon}\Big|_{\epsilon=0} \frac{\nu^{(\epsilon)}_{e_0}}{\nu_{e_0}} \prod_{i=0}^{m-1} \frac{P^{(\epsilon)}(e_i,e_{i+1})}{P(e_i,e_{i+1})} \log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right] \\
&= \lim_{m \to \infty} E_{P}\left[(\nu_{e_0})^{-1} \frac{d\nu^{(\epsilon)}_{e_0}}{d\epsilon}\Big|_{\epsilon=0} \log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right] \\
&\quad + \lim_{m \to \infty} E_{P}\left[\sum_{i=0}^{m-1} \frac{W(e_i,e_{i+1})}{P(e_i,e_{i+1})} \log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|}\right].
\end{aligned} \tag{63}$$
The first limit is 0. To see this, rewrite it as a sum over possible values of $e_0$:
$$\lim_{m \to \infty} \sum_{e \in \mathcal{M}} \nu_e\, E_{P_e}\left[(\nu_e)^{-1} \frac{d\nu^{(\epsilon)}_e}{d\epsilon}\Big|_{\epsilon=0} \log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_e u\|}{\|X_{e_{m-1}} \cdots X_e u\|}\right] = \lim_{m \to \infty} \sum_{e \in \mathcal{M}} \frac{d\nu^{(\epsilon)}_e}{d\epsilon} E_{P_e}\left[\log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_1} X_e u\|}{\|X_{e_{m-1}} \cdots X_{e_1} X_e u\|}\right].$$

Since $\nu^{(\epsilon)}$ is a probability distribution, it must be that $\sum_{\tilde e = 1}^{M} \frac{d\nu^{(\epsilon)}_{\tilde e}}{d\epsilon} = 0$. Thus the expression in the limit becomes 0 if we replace the expectation by any constant independent of $e$; by Lemma A.2 it follows that the limit is 0. To compute the other limit, we sum over all possible pairs $(e_i, e_{i+1}) = (\tilde e, e)$. The summand becomes
$$\sum_{\tilde e, e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e)\, E\left[\log \frac{\|X_{e_m} X_{e_{m-1}} \cdots X_{e_0} u\|}{\|X_{e_{m-1}} \cdots X_{e_0} u\|} \,\Big|\, e_i = \tilde e,\ e_{i+1} = e\right]. \tag{64}$$
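Before analyzing (64) further, note that the surviving second limit in (63) is exactly the quantity one estimates by simulation in practice. A crude Monte-Carlo sketch (all names ours; the rigorous error control is that of section 4):

```python
import numpy as np

def second_term_estimate(X, P, W, nu, m, n_reps, rng):
    """Crude Monte-Carlo estimate of the second limit in (63).

    X is a list of nonnegative projection matrices, one per environment.
    Transitions with P = 0 are never sampled (and W = 0 there), so the
    ratio W/P is well defined along simulated paths.
    """
    M, d = len(X), X[0].shape[0]
    total = 0.0
    for _ in range(n_reps):
        e = [rng.choice(M, p=nu)]                 # stationary start
        for _ in range(m):
            e.append(rng.choice(M, p=P[e[-1]]))
        u = np.ones(d) / d                        # any u in U works in the limit
        for t in range(m):                        # normalized product X_{e_{m-1}} ... X_{e_0} u
            u = X[e[t]] @ u
            u /= u.sum()                          # 1-norm, valid on the nonnegative orthant
        log_growth = np.log((X[e[m]] @ u).sum())  # one-step log growth at time m
        score = sum(W[e[i], e[i + 1]] / P[e[i], e[i + 1]] for i in range(m))
        total += score * log_growth
    return total / n_reps
```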

In order to analyze (64), we need to consider the distribution of $e_0,\dots,e_m$ conditioned on $e_i = \tilde e$ and $e_{i+1} = e$. By the Markov property, this splits into two independent Markov chains: $e_{i+1},\dots,e_m$ is a Markov chain of length $m-i$ with transition probabilities $P$ and starting point $e$, while $e_i, e_{i-1},\dots,e_0$ is a Markov chain of length $i+1$ with transition probabilities $\widetilde{P}$ and starting point $\tilde e$. Define two independent infinite sequences $\tilde{e}_0, \tilde{e}_1, \dots$ and $e_0, e_1, \dots$, which are Markov chains with transitions $\widetilde{P}$ and $P$ respectively, beginning at $\tilde{e}_0 = \tilde e$ and $e_0 = e$. Define for $i \ge 1$, $\check{U}_i(\tilde e) := X_{\tilde e} X_{\tilde{e}_1} \cdots X_{\tilde{e}_i} u$ with $\check{U}_0(\tilde e) := \mathbf{1}$, and $\check{V}^T_i(e) := \mathbf{1}^T X_{e_{i-1}} \cdots X_{e_1} X_e$ with $\check{V}^T_0(e) := \mathbf{1}^T$. Also define
$$U_i(\tilde e) := \frac{\check{U}_i(\tilde e)}{\|\check{U}_i(\tilde e)\|}, \qquad V^T_i(e) := \frac{\check{V}^T_i(e)}{\|\check{V}^T_i(e)\|}.$$

Since $\|u\| = \mathbf{1}^T u$ for any nonnegative column vector $u$, the expression (63) becomes
$$\begin{aligned}
a'(0) &= \sum_{\tilde e, e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e) \lim_{m \to \infty} \sum_{i=0}^{m-1} \Bigl( E\bigl[\log\bigl(\check{V}^T_{i+1}(e)\,\check{U}_{m-i}(\tilde e)\bigr)\bigr] - E\bigl[\log\bigl(\check{V}^T_i(e)\,\check{U}_{m-i}(\tilde e)\bigr)\bigr] \Bigr) \\
&= \sum_{\tilde e, e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e) \lim_{m \to \infty} \sum_{i=0}^{m-1} \Bigl( E\bigl[\log \|\check{V}^T_{i+1}(e)\|\bigr] - E\bigl[\log \|\check{V}^T_i(e)\|\bigr] \\
&\hspace{11em} + E\bigl[\log\bigl(V^T_{i+1}(e)\,U_{m-i}(\tilde e)\bigr)\bigr] - E\bigl[\log\bigl(V^T_i(e)\,U_{m-i}(\tilde e)\bigr)\bigr] \Bigr) \\
&= \sum_{\tilde e \in \mathcal{M}} \nu(\tilde e) \lim_{m \to \infty} \sum_{e \in \mathcal{M}} W(\tilde e, e)\, E\bigl[\log \|\check{V}^T_m(e)\|\bigr] \\
&\quad + \sum_{e, \tilde e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e) \lim_{m \to \infty} \sum_{i=0}^{m-1} E\left[\log \frac{V^T_{i+1}(e)\,U_{m-i}(\tilde e)}{V^T_i(e)\,U_{m-i}(\tilde e)}\right].
\end{aligned} \tag{65}$$

In the last line we have used the fact that $\sum_{e \in \mathcal{M}} W(\tilde e, e) = 0$, which means that $\sum_{e \in \mathcal{M}} W(\tilde e, e)\, E\bigl[\log \|\check{V}^T_0(e)\|\bigr] = 0$ as well, since $\check{V}^T_0(e) = \mathbf{1}^T$ is independent of $e$. The same reasoning implies that if we define $\check{V}^T_i(\nu)$ to be the version of $\check{V}^T_i$ started in the stationary distribution (for instance, starting from realizations of $\check{V}^T_i(e)$, take $\check{V}^T_i(\nu)$ equal to $\check{V}^T_i(e)$ with probability $\nu_e$), then $\sum_{e \in \mathcal{M}} W(\tilde e, e)\, E\bigl[\log \|\check{V}^T_m(\nu)\|\bigr] = 0$. The first term on the right-hand side of (65) may then be written as
$$\sum_{\tilde e, e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e) \lim_{m \to \infty} E\left[\log \frac{\|\check{V}^T_m(e)\|}{\|\check{V}^T_m(\nu)\|}\right] = \sum_{\tilde e, e \in \mathcal{M}} \nu(\tilde e) W(\tilde e, e)\, \zeta_e. \tag{66}$$
To compute the second term, we note that $U_\infty(\tilde e) := \lim_{i \to \infty} U_i(\tilde e)$ exists, with distribution $\pi_{\tilde e}$, and $\rho\bigl(U_i(\tilde e), U_\infty(\tilde e)\bigr) \le k_2 r^i$; similarly, $V^T_\infty(e) := \lim_{i \to \infty} V^T_i(e)$ exists, with distribution $\tilde{\pi}_e$, and $\rho\bigl(V_i(e), V_{i+1}(e)\bigr) \le k_2 r^i$. Thus
$$\Bigl|\log\bigl(V^T_{i+1}(e)\,U_{m-i}(\tilde e)\bigr) - \log\bigl(V^T_i(e)\,U_{m-i}(\tilde e)\bigr)\Bigr| \le \rho\bigl(V_i(e), V_{i+1}(e)\bigr) \le k_2 r^i.$$
We break the sum on the right-hand side of (65) into three pieces:
$$\begin{aligned}
\sum_{0 \le i \le m-1} &E\Bigl[\log\bigl(V^T_{i+1}(e)\,U_{m-i}(\tilde e)\bigr) - \log\bigl(V^T_i(e)\,U_{m-i}(\tilde e)\bigr)\Bigr] \\
&= \sum_{0 \le i \le m/2} E\Bigl[\log\bigl(V^T_{i+1}(e)\,U_\infty(\tilde e)\bigr) - \log\bigl(V^T_i(e)\,U_\infty(\tilde e)\bigr)\Bigr] \\
&\quad + \sum_{0 \le i \le m/2} E\left[\log \frac{V^T_{i+1}(e)\,U_{m-i}(\tilde e)}{V^T_{i+1}(e)\,U_\infty(\tilde e)} - \log \frac{V^T_i(e)\,U_{m-i}(\tilde e)}{V^T_i(e)\,U_\infty(\tilde e)}\right] \\
&\quad + \sum_{m/2 < i \le m-1} E\Bigl[\log\bigl(V^T_{i+1}(e)\,U_{m-i}(\tilde e)\bigr) - \log\bigl(V^T_i(e)\,U_{m-i}(\tilde e)\bigr)\Bigr].
\end{aligned}$$
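As a computational aside, the quantity $\zeta_e$ appearing in (66) can itself be approximated by simulation. A crude sketch (ours; it uses independent paths for the two start states, where the definition above couples them through common realizations, so this version has the right mean but a higher variance):

```python
def zeta_estimate(e, X, P, nu, m, n_reps, rng):
    """Crude Monte-Carlo sketch of zeta_e = lim_m E[log ||V_m(e)|| / ||V_m(nu)||]."""
    def log_norm_V(start):
        path = [start]                       # chain e_0 = start, transitions P
        for _ in range(m - 1):
            path.append(rng.choice(len(X), p=P[path[-1]]))
        v, log_norm = np.ones(X[0].shape[0]), 0.0
        for t in reversed(range(m)):         # v^T = 1^T X_{e_{m-1}} ... X_{e_0}
            v = v @ X[path[t]]
            s = v.sum()
            log_norm += np.log(s)            # accumulate log of the 1-norm
            v /= s                           # renormalize to avoid overflow
        return log_norm
    diffs = [log_norm_V(e) - log_norm_V(rng.choice(len(X), p=nu))
             for _ in range(n_reps)]
    return float(np.mean(diffs))
```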