
On the Convergence Rates of Some Adaptive Markov Chain Monte Carlo Algorithms

Yves Atchadé* and Yizao Wang†

July 29, 2012

Abstract

This paper studies the mixing time of certain adaptive Markov Chain Monte Carlo (MCMC) algorithms. Under some regularity conditions, we show that the convergence rate of the Importance Resampling MCMC (IRMCMC) algorithm, measured in terms of the total variation distance, is $O(n^{-1})$, and by means of an example we establish that, in general, this algorithm does not converge at a faster rate. We also study the Equi-Energy sampler and establish that its mixing time is of order $O(n^{-1/2})$.

Keywords: adaptive Markov Chain Monte Carlo, mixing time, total variation distance, Importance-Resampling Algorithm, Equi-Energy Sampler.

MSC2010 subject classifications: Primary 65C05, 65C40; Secondary 60J05.

1 Introduction

Markov Chain Monte Carlo (MCMC) methods have been widely applied in many areas for more than 40 years [16]. In particular, they are successful when the target distribution π is available only up to a normalizing constant. To sample from such a distribution, various MCMC algorithms have been developed. A typical MCMC algorithm is designed using a transition kernel P that has π as its stationary distribution. See for example Meyn and Tweedie [20], Robert and Casella [21], Roberts and Rosenthal [22], and the references therein.

* Department of Statistics, the University of Michigan, 439 West Hall, 1085 S. University, Ann Arbor, MI 48109–1107, USA. Email: [email protected].
† Department of Mathematical Sciences, University of Cincinnati, 2815 Commons Way, ML#0025, Cincinnati, OH, 45221–0025, USA. Email: [email protected].

Constructing a transition kernel to sample efficiently from a given distribution, although conceptually easy, is rather difficult in practice. The difficulty is that generic algorithms such as the Metropolis–Hastings algorithm require a careful choice and tuning of the proposal kernel. The development of adaptive MCMC (AMCMC) methods stems partly from these difficulties. Instead of using a fixed Markov kernel P, at each round n an AMCMC algorithm selects a kernel $P_{\hat\theta_n}$ from a family of Markov kernels $\{P_\theta\}_{\theta\in\Theta}$, where the value (parameter) $\hat\theta_n$ is computed based on possibly all the samples generated up to time n, so that the transition kernel is automatically self-adapted. This approach is very appealing in practice, as it frees the user from parameter tuning and provides a better exploration–exploitation balance. As a consequence, AMCMC algorithms often yield smaller asymptotic errors in Monte Carlo estimation. The theoretical and methodological study of AMCMC has drawn the attention of many researchers lately. See for example the survey by Atchadé et al. [4] and the references therein.

In this paper, we investigate the convergence rates of two AMCMC algorithms: the Importance Resampling MCMC (IRMCMC) algorithm introduced by Atchadé [5], and the Equi-Energy (EE) sampler of Kou et al. [17]. The IRMCMC algorithm is also referred to as the interacting annealing algorithm [8]. For the EE sampler, we actually focus on a simplified version, which is sometimes referred to as the interacting tempering algorithm [11]. Throughout the paper we denote by $\{X_n\}_{n\in\mathbb N}$ the random process generated by either of these algorithms. A common feature is that in either case, the dynamics of $\{X_n\}_{n\in\mathbb N}$ is driven by a sequence of random measures $\hat\theta_n$ computed from an auxiliary chain $\{Y_n\}_{n\in\mathbb N}$.
Most of the theoretical results available so far focused on the convergence of marginal distributions $\mathcal L_{X_n} \Rightarrow \pi$, and on the law of large numbers:
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n f(X_i) = \pi(f) \quad \text{almost surely.}$$
See for example Andrieu et al. [2, 3], Atchadé [5, 6] and Fort et al. [10] (there is a mistake in the proof of [6], pointed out in [10]). Central limit theorems for such AMCMC algorithms have only been considered recently by Fort

et al. [11] and Bercu et al. [7, 8]. In short, introducing the auxiliary chain makes the stochastic process no longer Markov, which raises considerable technical difficulties. We point out that there are other classes of AMCMC algorithms, for which the parameters take values in finite-dimensional spaces (e.g. the adaptive Metropolis algorithm introduced by Haario et al. [13]); the analysis of such algorithms is beyond the scope of this paper.

In this paper, we study the convergence rate (or mixing time) of the IRMCMC and EE algorithms. That is, we provide upper bounds on the distance between $\mathcal L_{X_n}$ (the distribution of $X_n$) and the target distribution. This type of rate differs from, and complements, the rates of convergence obtained in central limit theorems: mixing-time results provide information on the burn-in time of the algorithm, whereas central limit theorems (such as those mentioned above) deal with the rate of convergence and fluctuations of Monte Carlo averages. Apart from Andrieu and Atchadé [1], who considered AMCMC with a finite-dimensional parameter, and to the best of our knowledge, the mixing time of AMCMC has not been investigated so far.

We show that the IRMCMC algorithm has a convergence rate of order $O(n^{-1})$. We also provide a simple example for which the convergence rate has lower bound $1/n$; that is, one should not expect a rate better than $O(n^{-1})$ in general. We also show that for the m-tuple IRMCMC (to be defined in Section 2.4), the convergence rate remains $O(n^{-1})$. For the EE sampler, under some regularity conditions, we show that the rate of convergence is $O(n^{-1/2})$ in terms of a slightly weaker norm than the total variation distance. Our results are qualitative, in that the constants in the rates are not explicit, but they clarify what is known about these algorithms and suggest in particular that a longer burn-in should be selected for the EE sampler. The rest of the paper is organized as follows.
The remainder of the introduction gives a general description of the algorithms considered in the paper and introduces some notation. Section 2 is devoted to the IRMCMC algorithm: the convergence rate is established in Section 2.1, and for the multiple IRMCMC in Section 2.4. Section 3 is devoted to the EE sampler: the convergence rate is established in Section 3.1, and a remark on the connection to parallel tempering is provided in Section 3.2.

1.1 A class of AMCMC algorithms

We describe the general framework of AMCMC considered in this paper. Let $\mathcal X$ denote a general state space. An AMCMC algorithm is a stochastic process $\{(X_n, Y_n)\}_{n\ge 0}$ on $\mathcal X\times\mathcal X$, designed such that the main chain $\{X_n\}$ converges to the target distribution π in a sense to be made precise later. Let $\mathcal P$ denote the set of all probability measures on $\mathcal X$, and $\{K_\theta,\ \theta\in\mathcal P\}$ a set of transition kernels on $\mathcal X$. Let P be a transition kernel on $\mathcal X$ with invariant probability measure π. Starting from P and $\{K_\theta,\ \theta\in\mathcal P\}$, we consider the family of transition kernels $P_\theta$ given by
$$P_\theta(x,\cdot) = (1-\epsilon)P(x,\cdot) + \epsilon K_\theta(x,\cdot), \quad \theta\in\mathcal P,\ x\in\mathcal X.$$
The dynamics of the AMCMC algorithms considered in this paper can be unified as follows: given $\mathcal F_n = \sigma(X_0,\dots,X_n,Y_0,\dots,Y_n)$, $X_{n+1}$ and $Y_{n+1}$ are conditionally independent, and for all bounded measurable functions $h:\mathcal X\to\mathbb R$,
$$\mathbb E_x\big(h(X_{n+1}) \mid \mathcal F_n, \{Y_k\}_{k\ge n+1}\big) = \mathbb E_x\big(h(X_{n+1}) \mid \mathcal F_n\big) = P_{\tilde\theta_n} h(X_n) \quad \text{almost surely,}$$
where $\tilde\theta_n(\cdot) \stackrel{\rm def}{=} n^{-1}\sum_{j=1}^n \delta_{Y_j}(\cdot)$ denotes the empirical measure associated with the auxiliary chain $\{Y_n\}_{n\ge 0}$. Each algorithm is determined by the choice of the kernels $K_\theta$. For IRMCMC,
$$K_\theta(x, A) = \frac{\int_A w(z)\,\theta(dz)}{\int_{\mathcal X} w(z)\,\theta(dz)},$$
where $w(z) = d\pi/d\pi_Y(z)$ (see Section 1.2 for our convention on π and $\pi_Y$), while for the EE sampler, the following choice is made:
$$K_\theta(x, A) = \mathbf 1_A(x) + \int_{\mathcal X}\left(1\wedge\frac{\pi(z)\pi_Y(x)}{\pi(x)\pi_Y(z)}\right)\big(\mathbf 1_A(z)-\mathbf 1_A(x)\big)\,\theta(dz).$$
In both cases, $\pi_Y$ is an auxiliary distribution chosen to be relatively close to π and easy to sample from (at least easier than π). We assume that the evolution of the auxiliary chain is independent of the main chain, in the sense that for every bounded measurable function $h:\mathcal X\to\mathbb R$,
$$\mathbb E\big(h(Y_{n+1}) \mid \mathcal F_n\big) = \mathbb E\big(h(Y_{n+1}) \mid Y_0,\dots,Y_n\big) \quad \text{almost surely.}$$
The description of the dynamics of the joint process $\{(X_n,Y_n)\}_{n\ge 0}$ is completed by specifying the dynamics of $Y_{n+1}\mid\mathcal F_n$, which is either a Markov chain with target distribution $\pi_Y$, or the main chain of another adaptive MCMC with target distribution $\pi_Y$ (not necessarily Markov).

The rationale of these algorithms can be viewed as follows. For $\theta = \theta_\star = \pi_Y$, the Markov chain with transition kernel $P_{\theta_\star}$ has nice mixing properties, due to the choice of $\pi_Y$. Unfortunately, however, it is not possible to implement the kernel $P_{\theta_\star}$ directly. The idea therefore is (i) to run an auxiliary chain $\{Y_n\}_{n\ge 0}$ with limiting distribution $\pi_Y$, so that the empirical measure $\tilde\theta_n$ converges to $\theta_\star$, and (ii) to sample $X_{n+1}$ from
$$P_{\tilde\theta_n}(X_n,\cdot) = (1-\epsilon)P(X_n,\cdot) + \epsilon K_{\tilde\theta_n}(X_n,\cdot),$$
which approximates $P_{\theta_\star}(X_n,\cdot)$ as $n\to\infty$.
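The unified dynamics above can be sketched in a few lines of illustrative Python: with probability 1−ε the chain moves by the fixed kernel P, and with probability ε it moves by the adaptive kernel built from the auxiliary history. This is only a minimal sketch, not the authors' implementation; the stand-in kernels `p_step` and `k_step` (a random walk and a uniform resample from the history) are assumptions chosen for concreteness.

```python
import random

def amcmc_step(x, aux_history, eps, p_step, k_step):
    """One transition of the unified AMCMC dynamics:
    x' ~ (1 - eps) * P(x, .) + eps * K_{theta_n}(x, .),
    where theta_n is built from the auxiliary chain's history."""
    if random.random() < 1 - eps:
        return p_step(x)              # fixed kernel P
    return k_step(x, aux_history)     # adaptive kernel K_theta

# Toy illustration (assumptions, not from the paper): P is a +/-1 random
# walk on the integers, and K_theta resamples a uniformly chosen past
# auxiliary state, an IRMCMC-like choice with equal weights.
p_step = lambda x: x + random.choice([-1, 1])
k_step = lambda x, hist: random.choice(hist)

random.seed(0)
x, ys = 0, [0]
for n in range(1000):
    ys.append(ys[-1] + random.choice([-1, 1]))   # auxiliary chain
    x = amcmc_step(x, ys, eps=0.1, p_step=p_step, k_step=k_step)
print(isinstance(x, int), len(ys))
```

The sketch makes the structural point of the section visible: the only coupling between the two chains is that the adaptive kernel reads the auxiliary history, never the other way around.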

1.2 Notation

We assume that the state space $\mathcal X$ is a Polish space equipped with a metric d, and $\mathcal B$ is the associated Borel σ-algebra. In addition, $(\mathcal X,\mathcal B)$ is a measure space with a reference σ-finite measure, which we denote for short by dx. Let π and $\pi_Y$ be probability measures on $(\mathcal X,\mathcal B)$. We assume that π and $\pi_Y$ are both absolutely continuous with respect to dx and, with a slight abuse of notation, we also use π and $\pi_Y$ to denote their respective densities. That is, we write $\pi(dx) = \pi(x)dx$, and similarly for $\pi_Y$. For a transition kernel Q, a measure ν and a function h, we write $\nu Q(\cdot) \stackrel{\rm def}{=} \int \nu(dz)Q(z,\cdot)$ and $Qh(\cdot) \stackrel{\rm def}{=} \int Q(\cdot,dz)h(z)$. We denote by $\hat\pi_{Y,n}$ the empirical measure associated with the auxiliary chain $\{Y_n\}_{n\in\mathbb N}$, defined by
$$\hat\pi_{Y,n}(\cdot) \stackrel{\rm def}{=} \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}(\cdot).$$
At times, we also use the notation $\tilde\theta_n(\cdot)$ to denote $\hat\pi_{Y,n}(\cdot)$. For functions $f:\mathcal X\to\mathbb R$, we write
$$\bar\pi_{Y,n}(f) \stackrel{\rm def}{=} \hat\pi_{Y,n}(f) - \pi_Y(f).$$
We let C denote a generic constant that does not depend on n, but may change from line to line.

2 Importance Resampling MCMC

We consider the importance-resampling Markov Chain Monte Carlo method described in Atchadé [5].

Algorithm 1 (IRMCMC). Fix $\epsilon\in(0,1)$. Pick arbitrary $X_0 = x_0$ and $Y_0 = y_0$. Let P be an arbitrary Markov kernel with stationary distribution π. At each round n, $X_n$ and $Y_n$ are conditionally independent given $\mathcal F_{n-1}$, and
$$X_n \mid \mathcal F_{n-1} \sim \begin{cases} P(X_{n-1},\cdot) & \text{w.p. } 1-\epsilon, \\ \hat\theta_{n-1}(\cdot) & \text{w.p. } \epsilon, \end{cases}$$
where $\hat\theta_n$ is the (randomly) weighted empirical distribution defined by
$$\hat\theta_n(\cdot) = \sum_{i=1}^n \frac{\tilde w(Y_i)}{\sum_{j=1}^n \tilde w(Y_j)}\,\delta_{Y_i}(\cdot) = \frac{\int_{\cdot} \tilde w(z)\,\hat\pi_{Y,n}(dz)}{\int_{\mathcal X} \tilde w(z)\,\hat\pi_{Y,n}(dz)}, \qquad (1)$$
with $\tilde w(y) \propto \pi(y)/\pi_Y(y) =: w(y)$, and $\hat\theta_0 = \delta_{y_0}$. We assume $|w|_\infty = \sup_{x\in\mathcal X}|w(x)| < \infty$.

Remark 1. The assumption that w is bounded is not too restrictive. Indeed, very often in practice the un-normalized density $\tilde\pi$ of π is a bounded function, and the auxiliary chain is given a stationary distribution $\pi_Y \propto \tilde\pi_Y$ with $\tilde\pi_Y = \tilde\pi^T$ for some $T\in(0,1)$. In this case, $\tilde w = \tilde\pi/\tilde\pi_Y$ is bounded, and thus so is w.

Equivalently, for any bounded function $f:\mathcal X\to\mathbb R$, $\mathbb E(f(X_{n+1}) \mid \mathcal F_n) = P_{\hat\theta_n} f(X_n)$ almost surely, where for every probability measure θ on $\mathcal X$, $P_\theta(x,\cdot)$ is defined by
$$P_\theta(x,\cdot) = (1-\epsilon)P(x,\cdot) + \epsilon\,\theta(\cdot). \qquad (2)$$

For the time being, we make no particular assumption on the dynamics of the auxiliary chain {Yn }n≥0 .
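Algorithm 1 can be sketched as follows. This is a hedged, illustrative sketch only: the random-walk Metropolis kernel `mh_step` standing in for P, the Gaussian target, and the tempered auxiliary target are all assumptions made for the toy run, not choices prescribed by the paper.

```python
import math, random

def irmcmc(n_iter, log_pi, log_pi_Y, mh_step, y_step, x0, y0, eps=0.1):
    """Sketch of Algorithm 1 (IRMCMC): with prob. 1-eps move X by the
    kernel P (here a Metropolis step, an assumption of this sketch);
    with prob. eps resample X from the weighted empirical distribution
    theta_hat_n, with un-normalized weights w(y) = pi(y) / pi_Y(y)."""
    x, y = x0, y0
    ys, logw = [], []
    for _ in range(n_iter):
        y = y_step(y)                          # auxiliary chain, target pi_Y
        ys.append(y)
        logw.append(log_pi(y) - log_pi_Y(y))   # log of importance weight
        if random.random() < eps:
            m = max(logw)                      # stabilize the exponentials
            wts = [math.exp(v - m) for v in logw]
            x = random.choices(ys, weights=wts)[0]   # draw from theta_hat_n
        else:
            x = mh_step(x, log_pi)                   # fixed kernel P
    return x

def mh_step(x, log_target, scale=1.0):
    # Random-walk Metropolis stand-in for the kernel P (assumption).
    prop = x + random.gauss(0, scale)
    if math.log(random.random()) < log_target(prop) - log_target(x):
        return prop
    return x

# Toy run: pi = N(0,1) and pi_Y proportional to pi**T with T = 0.5, so the
# weight w = pi / pi_Y is bounded, as Remark 1 requires.
T = 0.5
log_pi = lambda u: -0.5 * u * u
log_pi_Y = lambda u: T * log_pi(u)
random.seed(1)
x = irmcmc(2000, log_pi, log_pi_Y, mh_step,
           lambda y: mh_step(y, log_pi_Y), x0=0.0, y0=0.0)
print(isinstance(x, float))
```

Note that the tempered choice $\pi_Y \propto \pi^T$ in the toy run is exactly the construction of Remark 1 that keeps $|w|_\infty < \infty$.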

2.1 Convergence rate of IRMCMC

The following equivalent representation of Algorithm 1 is useful. Let the auxiliary chain be as above. Furthermore, let $\{Z_n\}_{n\ge 0}$ be a sequence of independent and identically distributed random variables with $\mathbb P(Z_1=1) = 1-\mathbb P(Z_1=0) = \epsilon$. We assume that $\{Z_n\}_{n\ge 0}$ and $\{Y_n\}_{n\ge 0}$ are independent, and that for each $n\ge 1$, $Z_n$ and $\mathcal F_{n-1}$ are independent. Then, at round n, we can introduce $Z_n$ and write the conditional distribution of $X_n$ given $Z_n,\mathcal F_{n-1}$ as
$$X_n \mid \mathcal F_{n-1}, Z_n \sim \begin{cases} P(X_{n-1},\cdot) & \text{if } Z_n = 0, \\ \hat\theta_{n-1}(\cdot) & \text{if } Z_n = 1. \end{cases}$$
Define
$$\tau_0 = 0, \qquad \tau_{i+1} = \min\{k > \tau_i : Z_k = 1\} \qquad\text{and}\qquad n^* = \max\{k : \tau_k \le n\}. \qquad (3)$$
Observe that at each time $\tau_k > 0$, conditioning on $Y_0, Y_1, \dots, Y_{\tau_k-1}$, $X_{\tau_k}$ is sampled from $\hat\theta_{\tau_k-1}$, independently of $X_0,\dots,X_{\tau_k-1}$. Furthermore, $Y_0,\dots,Y_n$ are independent of $\tau_1,\dots,\tau_{n^*}$. Therefore, we first focus on
$$\eta_n \stackrel{\rm def}{=} \mathbb P(X_{n+1}\in\cdot \mid Z_{n+1}=1) = \mathbb E\,\hat\theta_n(\cdot), \quad n\in\mathbb N. \qquad (4)$$
We first obtain a bound on the total variation distance $\|\eta_n-\pi\|_{\rm TV}$. Recall that, given two probability distributions μ and ν, the total variation distance $\|\mu-\nu\|_{\rm TV}$ is defined by
$$\|\mu-\nu\|_{\rm TV} = \frac12\sup_{|f|\le 1}|\mu(f)-\nu(f)|. \qquad (5)$$
For convenience, write
$$B_n \stackrel{\rm def}{=} |w|_\infty \sup_{|f|\le 1}\mathbb E\,\bar\pi_{Y,n}(f) + 2|w|_\infty^2 \sup_{|f|\le 1}\mathbb E\big(\bar\pi_{Y,n}(f)\big)^2, \quad n\in\mathbb N. \qquad (6)$$
Recall that throughout we assume $|w|_\infty < \infty$.

Lemma 1. For all $n\in\mathbb N$,
$$\|\eta_n-\pi\|_{\rm TV} \le B_n. \qquad (7)$$

Remark 2. A special case of $\|\eta_n-\pi\|_{\rm TV}$, when $w\equiv 1$ (equal weights), is the so-called Cesàro mixing time of $\{Y_n\}_{n\ge 0}$. See for example Levin et al. [18, Chapter 6.6].

The proof of Lemma 1 is postponed to the next subsection. We have no explicit requirement on $\pi_Y$, except that $w = \pi/\pi_Y$ is bounded. However, the convergence of $Y_1, Y_2, \dots$ to $\pi_Y$ is implicitly ensured when we further require $\sup_{|f|\le 1}\mathbb E\,\bar\pi_{Y,n}(f) + \sup_{|f|\le 1}\mathbb E\big(\bar\pi_{Y,n}(f)\big)^2 \to 0$ as $n\to\infty$. Indeed, these rates yield an upper bound on the convergence rate of $\mathcal L_{X_n}\Rightarrow\pi$, as shown in the following theorem. We set $B_0 = B_{-1} = 1$.

Theorem 1. Consider $\{X_n\}_{n\in\mathbb N}$ generated from Algorithm 1. Then,
$$\|\mathcal L_{X_n}-\pi\|_{\rm TV} \le \sum_{\ell=0}^n (1-\epsilon)^{n-\ell} B_{\ell-1}. \qquad (8)$$
Furthermore, for any bounded measurable function f,
$$\mathbb E\left[\frac{1}{\sqrt n}\sum_{i=1}^n \big(f(X_i)-\pi(f)\big)\right]^2 \le \frac{80\,\epsilon^{-2}|f|_\infty^2}{n} + 64\,\epsilon^{-2}|f|_\infty^2 + |f|_\infty^2\left(\frac{1}{\sqrt n}\sum_{k=0}^{n-1}\sqrt{B_k}\right)^2, \quad n\in\mathbb N. \qquad (9)$$

The proof of Theorem 1 is postponed to the next subsection.

Remark 3. The control of the total variation distance depends on the initial position of the auxiliary chain since, in general, $B_n$ depends on the initial position $Y_0 = y_0$. We omit this dependence throughout this paper. At the same time, it is obvious that the initial position $X_0 = x_0$ is irrelevant.

Remark 4. In Theorem 1, we do not impose any ergodicity assumption on the kernel P. In the case where P is, say, geometrically ergodic, one can improve (8) quantitatively by bounding the term $\|\eta_k P^{n-k}-\pi\|_{\rm TV}$ more effectively. For example, if P is uniformly ergodic with rate ρ, then (8) would become
$$\|\mathcal L_{X_n}-\pi\|_{\rm TV} \le \sum_{\ell=0}^n [\rho(1-\epsilon)]^{n-\ell} B_{\ell-1}.$$

A similar improvement can be formulated for (9). However, these improvements do not change the rate but only the constant in the corollary below. Besides, such improvements are not easily available if P is subgeometrically ergodic.

Now, as a corollary, we obtain an upper bound on the convergence rate of the IRMCMC algorithm under the following assumption.

A1 There exists a finite constant C such that for all measurable functions $f:\mathcal X\to\mathbb R$ with $|f|_\infty\le 1$,
$$\mathbb E\,\bar\pi_{Y,n}(f) \le \frac{C}{n} \qquad\text{and}\qquad \mathbb E\big(\bar\pi_{Y,n}(f)\big)^2 \le \frac{C}{n}. \qquad (10)$$

Remark 5. For example, if $\{Y_n\}_{n\in\mathbb N}$ is a Markov chain with transition kernel $P_Y$ and stationary distribution $\pi_Y$, then the first part of (10) holds if $P_Y$ is geometrically ergodic [22]. The second part of (10) essentially assumes that the sample variance of $\{f(Y_n)\}_{n\in\mathbb N}$ is bounded, which also holds for geometrically ergodic Markov chains under an appropriate moment condition on f (e.g. $\pi(|f|^{2+\epsilon}) < \infty$ with $\epsilon > 0$). Occasionally this condition can fail, as the sample variance might become infinite in the limit. See for example Häggström [14] and Häggström and Rosenthal [15].

Corollary 1. Consider the importance resampling MCMC (Algorithm 1). If Assumption (A1) holds, then there exists a finite constant C such that
$$\|\mathcal L_{X_n}-\pi\|_{\rm TV} \le \frac{C}{n}.$$
Furthermore, for any bounded measurable function f,
$$\mathbb E\left[\frac{1}{\sqrt n}\sum_{i=1}^n \big(f(X_i)-\pi(f)\big)\right]^2 \le C|f|_\infty^2, \quad n\in\mathbb N.$$

2.2 Proofs of Lemma 1 and Theorem 1

Proof of Lemma 1. Fix $n\ge 1$ and write $\hat\pi_Y \equiv \hat\pi_{Y,n}$. Rewrite $\eta_n(f)$ as
$$\eta_n(f) = \mathbb E\left[\sum_{j=1}^n \frac{w(Y_j)}{\sum_{l=1}^n w(Y_l)}\,f(Y_j)\right] = \mathbb E\left[\frac1n\sum_{j=1}^n w(Y_j)f(Y_j) + \left(1-\frac1n\sum_{j=1}^n w(Y_j)\right)\sum_{j=1}^n \frac{w(Y_j)f(Y_j)}{\sum_{l=1}^n w(Y_l)}\right] = \mathbb E\Big[\hat\pi_{Y,n}(wf) + \big(\pi_Y(w)-\hat\pi_{Y,n}(w)\big)\,\hat\theta_n(f)\Big],$$
where in the second term above we used the fact that $\pi_Y(w) = 1$. Since $\pi(f) = \pi_Y(wf)$,
$$\|\eta_n-\pi\|_{\rm TV} = \frac12\sup_{|f|\le1}\big(\eta_n(f)-\pi(f)\big) \le \frac12\sup_{|f|\le1}\mathbb E\,\bar\pi_{Y,n}(wf) + \frac12\sup_{|f|\le1}\mathbb E\big[\bar\pi_{Y,n}(w)\,\hat\theta_n(f)\big] \le \frac12\Big[|w|_\infty\sup_{|f|\le1}\mathbb E\,\bar\pi_{Y,n}(f) + \sup_{|f|\le1}\mathbb E\big[\bar\pi_{Y,n}(w)\big(\hat\theta_n(f)-\pi_Y(wf)\big)\big]\Big] \qquad (11)$$
(the suprema over the symmetric class $\{f : |f|\le 1\}$ absorb the signs). By the Cauchy–Schwarz inequality,
$$\sup_{|f|\le1}\mathbb E\big[\bar\pi_{Y,n}(w)\big(\hat\theta_n(f)-\pi_Y(wf)\big)\big] \le \Big[\mathbb E\big(\bar\pi_{Y,n}(w)\big)^2\Big]^{1/2}\times\sup_{|f|\le1}\Big[\mathbb E\big(\hat\theta_n(f)-\pi_Y(wf)\big)^2\Big]^{1/2}. \qquad (12)$$
The first factor is bounded by $|w|_\infty\sup_{|f|\le1}\big[\mathbb E(\bar\pi_{Y,n}(f))^2\big]^{1/2}$. For the second factor, observe that
$$\mathbb E\big(\hat\theta_n(f)-\pi_Y(wf)\big)^2 \le 2\,\mathbb E\big(\hat\theta_n(f)-\hat\pi_{Y,n}(wf)\big)^2 + 2\,\mathbb E\big(\hat\pi_{Y,n}(wf)-\pi_Y(wf)\big)^2, \qquad (13)$$
and
$$\mathbb E\big(\hat\theta_n(f)-\hat\pi_{Y,n}(wf)\big)^2 = \mathbb E\left[\sum_{j=1}^n \frac{w(Y_j)f(Y_j)}{\sum_{l=1}^n w(Y_l)} - \frac1n\sum_{j=1}^n w(Y_j)f(Y_j)\right]^2 = \mathbb E\Big[\big(1-\hat\pi_{Y,n}(w)\big)^2\,\hat\theta_n^2(f)\Big] \le \mathbb E\big(\pi_Y(w)-\hat\pi_{Y,n}(w)\big)^2 \le |w|_\infty^2\sup_{|g|\le1}\mathbb E\big(\bar\pi_{Y,n}(g)\big)^2. \qquad (14)$$

The above calculation holds for all f with $|f|\le 1$. Combining (11), (12), (13) and (14) yields the desired result. □

Proof of Theorem 1. Recall that $\tau_{n^*}$ is the last time $k$ before $n$ at which the main chain is sampled from $\hat\theta_{k-1}$. We can write
$$\|\mathcal L_{X_n}-\pi\|_{\rm TV} = \sup_{|f|\le1}\frac12\big|\mathbb E f(X_n)-\pi(f)\big| = \sup_{|f|\le1}\frac12\left|\sum_{k=0}^n \mathbb E\big(f(X_n);\,\tau_{n^*}=k\big)-\pi(f)\right| = \sup_{|f|\le1}\frac12\left|\sum_{k=0}^n \mathbb P(\tau_{n^*}=k)\big[\mathbb E\big(f(X_n)\mid\tau_{n^*}=k\big)-\pi(f)\big]\right| \le \sum_{k=0}^n \mathbb P(\tau_{n^*}=k)\,\sup_{|f|\le1}\frac12\big|\mathbb E\big(f(X_n)\mid\tau_{n^*}=k\big)-\pi(f)\big|. \qquad (15)$$
Observe that the conditional distribution of $X_n$ given $\tau_{n^*}=k\ge 1$ is $\eta_{k-1}P^{n-k}$ (set $\eta_0 = \delta_{Y_0}$). Then,
$$\sup_{|f|\le1}\frac12\big|\mathbb E\big(f(X_n)\mid\tau_{n^*}=k\big)-\pi(f)\big| = \sup_{|f|\le1}\frac12\big|\eta_{k-1}P^{n-k}(f)-\pi(f)\big| = \big\|\eta_{k-1}P^{n-k}-\pi\big\|_{\rm TV}.$$
Since $\pi P = \pi$, we have $\|\eta_{k-1}P^{n-k}-\pi\|_{\rm TV} \le \|\eta_{k-1}-\pi\|_{\rm TV} \le B_{k-1}$ by Lemma 1. Also, $\mathbb P(\tau_{n^*}=k) = \epsilon(1-\epsilon)^{n-k}$ for $k=1,\dots,n$ and $\mathbb P(\tau_{n^*}=0) = (1-\epsilon)^n$. Thus, (15) becomes (8).

To establish (9), we show that the partial sum $\sum_{k=1}^n (f(X_k)-\pi(f))$ admits a well-behaved martingale approximation. For a probability measure θ on $\mathcal X$, define
$$\pi_\theta(A) = \epsilon\sum_{j=0}^\infty (1-\epsilon)^j(\theta P^j)(A), \quad A\in\mathcal B.$$
Clearly, $\pi_\theta$ is a probability measure on $(\mathcal X,\mathcal B)$, and in fact for all $A\in\mathcal B$,
$$\pi_\theta P_\theta(A) = \epsilon\sum_{j=0}^\infty (1-\epsilon)^j \int (\theta P^j)(dz)\big((1-\epsilon)P(z,A)+\epsilon\,\theta(A)\big) = \epsilon\sum_{j=0}^\infty (1-\epsilon)^{j+1}(\theta P^{j+1})(A) + \epsilon\,\theta(A) = \pi_\theta(A).$$
This means that $\pi_\theta$ is invariant for the kernel $P_\theta$. It is also easy to check by induction that for any bounded measurable function f and $n\ge 1$,
$$P_\theta^n f(x) - \pi_\theta(f) = (1-\epsilon)^n P^n f(x) - \epsilon\sum_{j=n}^\infty (1-\epsilon)^j(\theta P^j)f. \qquad (16)$$
It then follows that
$$\|P_\theta^n(x,\cdot)-\pi_\theta\|_{\rm TV} \le 2(1-\epsilon)^n. \qquad (17)$$
As a result of (17), the function
$$g_\theta(x) = \sum_{j=0}^\infty \big(P_\theta^j f(x)-\pi_\theta(f)\big)$$
is well-defined with $|g_\theta|_\infty \le 2\epsilon^{-1}|f|_\infty$, and satisfies Poisson's equation
$$g_\theta(x) - P_\theta g_\theta(x) = f(x) - \pi_\theta(f), \quad x\in\mathcal X. \qquad (18)$$

In particular, we have $f(X_k)-\pi_{\hat\theta_{k-1}}(f) = g_{\hat\theta_{k-1}}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_k)$ almost surely. Using this, we write
$$\sum_{k=1}^n \big(f(X_k)-\pi(f)\big) = \sum_{k=1}^n \big(\pi_{\hat\theta_{k-1}}(f)-\pi(f)\big) + \sum_{k=1}^n \big(f(X_k)-\pi_{\hat\theta_{k-1}}(f)\big)$$
$$= \sum_{k=1}^n \big(\pi_{\hat\theta_{k-1}}(f)-\pi(f)\big) + \sum_{k=1}^n \big(g_{\hat\theta_{k-1}}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_{k-1})\big) + \sum_{k=1}^n \big(P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_{k-1})-P_{\hat\theta_k}g_{\hat\theta_k}(X_k)\big) + \sum_{k=1}^n \big(P_{\hat\theta_k}g_{\hat\theta_k}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_k)\big).$$
Using (18), the definition of $g_\theta$, and (16), it is easy to check that for any probability measures $\theta,\theta'$ and $x\in\mathcal X$,
$$P_\theta g_\theta(x)-P_{\theta'}g_{\theta'}(x) = \epsilon\int (\theta'-\theta)(dz)\left(\sum_{j=0}^\infty j(1-\epsilon)^j P^j f(z)\right).$$
This implies that the term $\sum_{k=1}^n \big(P_{\hat\theta_k}g_{\hat\theta_k}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_k)\big)$ is a telescoping sum, and we have
$$\left|\sum_{k=1}^n \big(P_{\hat\theta_k}g_{\hat\theta_k}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_k)\big)\right| = \left|\epsilon\big(\hat\theta_n-\hat\theta_0\big)\left(\sum_{j=0}^\infty j(1-\epsilon)^j P^j f\right)\right| \le \frac{2(1-\epsilon)}{\epsilon}\,|f|_\infty.$$
The term $\sum_{k=1}^n \big(P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_{k-1})-P_{\hat\theta_k}g_{\hat\theta_k}(X_k)\big)$ is also a telescoping sum, and we have
$$\left|\sum_{k=1}^n \big(P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_{k-1})-P_{\hat\theta_k}g_{\hat\theta_k}(X_k)\big)\right| \le 4\epsilon^{-1}|f|_\infty.$$
From the definition of $\pi_\theta$, notice that we can write
$$\sum_{k=1}^n \big(\pi_{\hat\theta_{k-1}}(f)-\pi(f)\big) = \sum_{k=1}^n \hat\theta_{k-1}\big(\bar f-\pi(\bar f)\big), \quad\text{where } \bar f(x) = \epsilon\sum_{j=0}^\infty (1-\epsilon)^j P^j f(x).$$
Thus,
$$\mathbb E\left[\sum_{k=1}^n \big(\pi_{\hat\theta_{k-1}}(f)-\pi(f)\big)\right]^2 \le \left(\sum_{k=1}^n \mathbb E^{1/2}\,\hat\theta_{k-1}^2\big(\bar f-\pi(\bar f)\big)\right)^2 \le |f|_\infty^2\left(\sum_{k=0}^{n-1}\sqrt{B_k}\right)^2,$$
where in the last inequality we used the fact that $\sup_{|f|_\infty\le1}\mathbb E\,\hat\theta_k^2\big(\bar f-\pi(\bar f)\big) \le B_k$, which is (13) and is proved as part of Lemma 1. Finally, we notice that
$$\sum_{k=1}^n \big(g_{\hat\theta_{k-1}}(X_k)-P_{\hat\theta_{k-1}}g_{\hat\theta_{k-1}}(X_{k-1})\big) =: \sum_{k=1}^n D_k$$
is a martingale with respect to $\{\mathcal F_n\}$, whence
$$\mathbb E\left(\sum_{k=1}^n D_k\right)^2 = \sum_{k=1}^n \mathbb E D_k^2 \le 4n\sup_\theta |g_\theta|_\infty^2 \le 16\,\epsilon^{-2}|f|_\infty^2\,n.$$
Using all the above, we obtain (9). □

2.3 An example on the lower bound

We provide an example where $O(n^{-1})$ is also a lower bound on the rate, for both $\|\eta_n-\pi\|_{\rm TV}$ and $\|\mathcal L_{X_n}-\pi\|_{\rm TV}$. This shows that the rate in our upper bound in Corollary 1 is optimal.

Example 1. Consider the simple case where $\mathcal X = \{\pm 1\}$ and $\pi = \pi_Y$. In this case, the weight function is uniform ($w\equiv 1$). Suppose the auxiliary chain $\{Y_n\}_{n\ge 0}$ has transition matrix
$$P_Y = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}, \quad\text{with } a,b\in(0,1).$$
The corresponding Markov chain has stationary distribution $\pi_Y = (a+b)^{-1}(b,a)$ and eigenvalues $\lambda_1 = 1$, $\lambda_2 = 1-a-b$. Suppose $a+b\ne 1$ and the chain starts at $Y_0 = -1$. By a straightforward calculation, $\mathbb P(Y_n = -1) = a/(a+b) + \big(b/(a+b)\big)\lambda_2^n$. Then,
$$\mathbb E\,\hat\pi_{Y,n}(\{-1\}) - \pi_Y(\{-1\}) \equiv \frac1n\sum_{i=1}^n \big(\mathbb P(Y_i=-1)-\pi_Y(\{-1\})\big) = \frac{b}{a+b}\,\frac1n\,\frac{\lambda_2-\lambda_2^{n+1}}{1-\lambda_2}.$$
It then follows from the definition that $\|\eta_n-\pi\|_{\rm TV} \ge C/n$. Furthermore, in (2) set $P(x,\cdot) = \pi(\cdot)$. That is, P is the best kernel we can put into the algorithm, in the sense that it takes one step to arrive at the stationary distribution (although this is too ideal to be practical). Now,
$$\mathbb P(X_n = -1) - \pi(\{-1\}) = (1-\epsilon)\pi(\{-1\}) + \epsilon\,\mathbb E\,\hat\pi_{Y,n}(\{-1\}) - \pi(\{-1\}) = \epsilon\big(\mathbb E\,\hat\pi_{Y,n}(\{-1\}) - \pi_Y(\{-1\})\big).$$
It then follows that $\|\mathcal L_{X_n}-\pi\|_{\rm TV} \ge C/n$.
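The exact $1/n$ behaviour in Example 1 can be checked numerically. The snippet below (an illustrative verification, not from the paper; the values of a and b are arbitrary assumptions) evaluates the bias of the empirical measure from the closed-form expression for $\mathbb P(Y_i=-1)$ and confirms that n times the bias approaches the constant $\frac{b}{a+b}\frac{\lambda_2}{1-\lambda_2}$.

```python
# Exact bias of the empirical measure in Example 1: for the two-state
# chain started at Y_0 = -1, P(Y_i = -1) = a/(a+b) + (b/(a+b)) * lam2**i
# with lam2 = 1 - a - b, so the bias of pi_hat_{Y,n}({-1}) is the average
# of (b/(a+b)) * lam2**i over i = 1..n, which decays exactly like C/n.
a, b = 0.3, 0.5          # arbitrary choice with a + b != 1
lam2 = 1.0 - a - b

def bias(n):
    return sum(b / (a + b) * lam2**i for i in range(1, n + 1)) / n

limit = b / (a + b) * lam2 / (1.0 - lam2)   # lim n * bias(n)
for n in (10, 100, 1000):
    print(n, n * bias(n))                   # n * bias(n) approaches `limit`
print(abs(1000 * bias(1000) - limit) < 1e-9)
```

The bias itself shrinks like $C/n$ while $n\cdot\text{bias}(n)$ stabilizes, which is exactly the lower-bound mechanism exploited in the example.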

2.4 Multiple IRMCMC

We discuss the importance-resampling MCMC algorithm in the form of multiple chains, and establish a convergence rate similar to that of Section 2.1.

Algorithm 2 (Multiple IRMCMC). We construct iteratively $m+1$ discrete-time stochastic processes $X^{(\ell)} \equiv \{X_n^{(\ell)}\}_{n\ge 0}$, $\ell = 0,\dots,m$, as follows. Fix $\epsilon\in(0,1)$. Let $X^{(0)}$ be a Markov chain with target distribution $\pi_0$ starting at $x_0$. Then iteratively, for each $\ell = 1,\dots,m$ with $X^{(\ell-1)}$ constructed, design $X^{(\ell)}$ starting from $x_\ell$, so that $X^{(\ell)}$ and $X^{(\ell-1)}$ interact as the main chain and the auxiliary chain, respectively, in Algorithm 1. Namely, let $P_\ell$ be a Markov kernel with stationary distribution $\pi_\ell$, and sample $X_{n+1}^{(\ell)}$ from $P_{\ell,\hat\theta_n^{(\ell-1)}}(X_n^{(\ell)},\cdot)$ with
$$P_{\ell,\theta}(x,\cdot) = (1-\epsilon)P_\ell(x,\cdot) + \epsilon\,\theta(\cdot) \qquad\text{and}\qquad \hat\theta_n^{(\ell-1)}(\cdot) = \sum_{i=1}^n \frac{w_\ell(X_i^{(\ell-1)})}{\sum_{j=1}^n w_\ell(X_j^{(\ell-1)})}\,\delta_{X_i^{(\ell-1)}}(\cdot),$$
with $w_\ell(x) = \pi_\ell(x)/\pi_{\ell-1}(x)$, $x\in\mathcal X$. Note that the $\ell$-th chain $X^{(\ell)}$ at time n depends on $\{X_k^{(\ell)}\}_{k=0,\dots,n-1,\ \ell=1,\dots,m-1}$. We assume that $\max_{\ell=1,\dots,m}|w_\ell|_\infty < \infty$.

In view of Theorem 1, it suffices to control
$$B_n^{(\ell)} \stackrel{\rm def}{=} \sup_{|f|\le 1}\mathbb E\,\bar\pi_{X^{(\ell)},n}(f) + \sup_{|f|\le 1}\mathbb E\big(\bar\pi_{X^{(\ell)},n}(f)\big)^2, \quad n\in\mathbb N, \qquad (19)$$
where this time $\bar\pi_{X^{(\ell)},n}(f) \stackrel{\rm def}{=} \hat\pi_{X^{(\ell)},n}(f) - \pi_\ell(f)$. In fact, it suffices to control $B_n^{(0)}$, which is the purpose of the following assumption.

A2 As $n\to\infty$, the initial Markov chain $\{X_n^{(0)}\}_{n\ge 0}$ satisfies $B_n^{(0)} \le C/n$.

Theorem 2. Consider the multiple IRMCMC (Algorithm 2), for which Assumption (A2) holds and $\max_{\ell=1,\dots,m}|w_\ell|_\infty < \infty$. Then for $\ell=1,\dots,m$, there exists a finite constant C such that
$$\big\|\mathcal L_{X_n^{(\ell)}} - \pi_\ell\big\|_{\rm TV} \le \frac{C}{n}, \qquad (20)$$
and for any bounded measurable function f,
$$\mathbb E\left[\frac{1}{\sqrt n}\sum_{i=1}^n \big(f(X_i^{(\ell)})-\pi_\ell(f)\big)\right]^2 \le C. \qquad (21)$$

Proof. Simply observe that (20) and (21) imply that $B_n^{(\ell)} \le C/n$ as $n\to\infty$. By Theorem 1, this implies in turn that (20) and (21) hold with $\ell$ replaced by $\ell+1$. Given (A2), the result follows by induction. □
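The recursive ladder of Algorithm 2 can be sketched as follows: chain ℓ resamples from the weighted history of chain ℓ−1, with weights $w_\ell = \pi_\ell/\pi_{\ell-1}$. Everything concrete here is an assumption for illustration: the Metropolis stand-in for each $P_\ell$, the Gaussian ladder of targets, and the temperatures.

```python
import math, random

def mh_step(x, log_target, scale=1.0):
    # Random-walk Metropolis stand-in for each kernel P_l (assumption).
    prop = x + random.gauss(0, scale)
    if math.log(random.random()) < log_target(prop) - log_target(x):
        return prop
    return x

def multi_irmcmc(n_iter, log_pis, eps=0.1):
    """Sketch of Algorithm 2: m+1 chains, where chain l moves by its own
    kernel w.p. 1-eps, and w.p. eps resamples from the history of chain
    l-1 weighted by w_l = pi_l / pi_{l-1} (computed in log space)."""
    m = len(log_pis) - 1
    xs = [0.0] * (m + 1)
    hist = [[] for _ in range(m + 1)]          # past states of each chain
    for _ in range(n_iter):
        xs[0] = mh_step(xs[0], log_pis[0])     # plain MCMC for chain 0
        hist[0].append(xs[0])
        for l in range(1, m + 1):
            if random.random() < eps:
                logw = [log_pis[l](z) - log_pis[l - 1](z)
                        for z in hist[l - 1]]
                mx = max(logw)
                wts = [math.exp(v - mx) for v in logw]
                xs[l] = random.choices(hist[l - 1], weights=wts)[0]
            else:
                xs[l] = mh_step(xs[l], log_pis[l])
            hist[l].append(xs[l])
    return xs

# Tempering ladder pi_l proportional to pi**T_l, T = (0.25, 0.5, 1.0);
# the last chain targets the distribution of interest (all illustrative).
temps = (0.25, 0.5, 1.0)
log_pis = [lambda u, t=t: -0.5 * t * u * u for t in temps]
random.seed(3)
xs = multi_irmcmc(1000, log_pis)
print(len(xs))
```

The induction in the proof of Theorem 2 mirrors this structure: each level inherits the $C/n$ rate from the level below it.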

3 Equi-Energy Sampler

In this section, we consider the following simplified EE sampler. Recall that the auxiliary chain $\{Y_n\}_{n\ge 0}$ evolves independently of the main chain $\{X_n\}_{n\ge 0}$.

Algorithm 3 (Equi-Energy sampler). Fix $\epsilon\in(0,1)$. Start with $X_0 = x_0$ and $Y_0 = y_0$. At each round n, generate
$$X_n \sim \begin{cases} P(X_{n-1},\cdot) & \text{w.p. } 1-\epsilon, \\ K_{\hat\theta_{n-1}}(X_{n-1},\cdot) & \text{w.p. } \epsilon, \end{cases}$$
where $\hat\theta_n = \hat\pi_{Y,n}$ is the empirical measure associated with $\{Y_n\}_{n\ge 0}$ and $K_\theta$ is defined by
$$K_\theta(x,A) = \mathbf 1_A(x) + \int_{\mathcal X}\left(1\wedge\frac{\pi(z)\pi_Y(x)}{\pi(x)\pi_Y(z)}\right)\big(\mathbf 1_A(z)-\mathbf 1_A(x)\big)\,\theta(dz).$$
In other words, for all non-negative functions $h:\mathcal X\to\mathbb R$ and $n\in\mathbb N$,
$$\mathbb E_x\big(h(X_{n+1}) \mid \mathcal F_n\big) = P_{\hat\theta_n} h(X_n) \quad\text{almost surely,} \qquad (22)$$
where for any probability measure θ on $\mathcal X$, $P_\theta$ is defined as
$$P_\theta(x,A) = (1-\epsilon)P(x,A) + \epsilon K_\theta(x,A). \qquad (23)$$
Recall that we write $\pi(dx) \equiv \pi(x)dx$, and similarly for $\pi_Y$, with a slight abuse of notation, and that $w(x) = \pi(x)/\pi_Y(x)$. We assume $|w|_\infty < \infty$.

The kernel $K_{\theta_\star}$ is the Independent Metropolis kernel with target π and proposal $\theta_\star = \pi_Y$. It is well known that under the assumption $|w|_\infty < \infty$ (recall Remark 1), the kernel $K_{\theta_\star}$ is uniformly ergodic [19], and this property is inherited by $P_{\theta_\star}$. That is, there exists $\rho\in(0,1)$ such that
$$\big\|P_{\theta_\star}^n(x,\cdot)-\pi(\cdot)\big\|_{\rm TV} \le C\rho^n, \quad n\ge 0. \qquad (24)$$
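One round of Algorithm 3 can be sketched as follows: the adaptive kernel $K_{\hat\theta_n}$ proposes a uniformly chosen past auxiliary state and accepts it with probability $1\wedge[\pi(z)\pi_Y(x)]/[\pi(x)\pi_Y(z)]$. This is a hedged sketch only; the Metropolis stand-in for P, the Gaussian target, and the tempered auxiliary target are assumptions, not the paper's prescriptions.

```python
import math, random

def ee_step(x, ys, eps, log_pi, log_pi_Y, p_step):
    """One round of the simplified Equi-Energy sampler (Algorithm 3):
    with prob. 1-eps move by the fixed kernel P; with prob. eps apply
    K_{theta_hat_n}: propose a uniform draw z from the auxiliary history
    and accept with prob. 1 ^ [pi(z) pi_Y(x)] / [pi(x) pi_Y(z)]."""
    if random.random() < 1 - eps:
        return p_step(x)
    z = random.choice(ys)                     # z ~ theta_hat_n (empirical)
    log_acc = (log_pi(z) + log_pi_Y(x)) - (log_pi(x) + log_pi_Y(z))
    if math.log(random.random()) < log_acc:
        return z                              # equi-energy jump accepted
    return x

def mh_step(x, log_target, scale=1.0):
    # Random-walk Metropolis stand-in for the kernel P (assumption).
    prop = x + random.gauss(0, scale)
    if math.log(random.random()) < log_target(prop) - log_target(x):
        return prop
    return x

# Toy run: pi = N(0,1), pi_Y proportional to pi**T with T = 0.5, so that
# w = pi / pi_Y is bounded as the section assumes.
T = 0.5
log_pi = lambda u: -0.5 * u * u
log_pi_Y = lambda u: T * log_pi(u)
random.seed(2)
x, y, ys = 0.0, 0.0, []
for n in range(2000):
    y = mh_step(y, log_pi_Y)                  # auxiliary chain
    ys.append(y)
    x = ee_step(x, ys, 0.1, log_pi, log_pi_Y,
                lambda u: mh_step(u, log_pi))
print(isinstance(x, float), len(ys))
```

Note how the acceptance ratio is exactly the one inside the integral defining $K_\theta$, evaluated at a draw from the empirical measure; as $n\to\infty$ the proposal distribution approaches $\pi_Y$ and the move becomes an Independent Metropolis step.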

3.1 Convergence rate of the EE sampler

We make the following assumptions.

A3 There exists a finite universal constant C such that for any measurable function $f:\mathcal X\to\mathbb R$ with $|f|_\infty\le 1$, any $n\ge 1$ and any $x>0$,
$$\mathbb P\left(\left|\frac{1}{\sqrt n}\sum_{j=1}^n \big(f(Y_j)-\pi_Y(f)\big)\right| > x\right) \le C\exp\left(-\frac{x^2}{C\sigma^2(f)}\right),$$
where $\sigma^2(f) \stackrel{\rm def}{=} \int_{\mathcal X} f^2(x)\,\pi_Y(dx)$.

A4 The function $w:\mathcal X\to\mathbb R$ is continuous (with respect to the metric on $\mathcal X$), and
$$\sup_{x\in\mathcal X}\frac{\phi(x)}{w^2(x)} < \infty, \qquad (25)$$
where $\phi(x) \stackrel{\rm def}{=} \pi_Y(\{z : w(z)\le w(x)\})$.

A5 The kernel P is such that if $f:\mathcal X\to\mathbb R$ is continuous, then $Pf$ is also continuous.

Remark 6. Deviation bounds as in (A3) are available under various conditions on the transition kernel. See for example Cattiaux and Guillin [9, Proposition 1.2].

Remark 7. Assumption (A4) is not restrictive. For example, consider $\mathcal X = \mathbb R$ and $\pi_Y \propto \pi^T$ for some $T\in(0,1)$. For the sake of simplicity, we focus on $x\in\mathbb R_+$ and define $\phi_+(x) \stackrel{\rm def}{=} \pi_Y(\{z>0 : w(z)\le w(x)\})$. Suppose the density π(x) decays asymptotically as $x^{-\alpha}$ for some $\alpha>1$ as $x\to\infty$. Then $\pi_Y(x)\sim x^{-T\alpha}$ and $w(x)\sim x^{(T-1)\alpha}$. Here and below, we write $a(x)\sim b(x)$ if $\lim_{x\to\infty} a(x)/b(x) = 1$. Assume further that $T\alpha>1$. Then $\phi_+(x)\sim (T\alpha-1)^{-1}x^{1-T\alpha}$ and
$$\frac{\phi_+(x)}{w^2(x)} \sim \frac{1}{T\alpha-1}\,x^{1+2\alpha-3T\alpha}.$$
Therefore, (25) holds if $T > (1+2\alpha)/(3\alpha)$.

Theorem 3. Consider the Equi-Energy sampler described above, and suppose that Assumptions (A3)–(A5) hold. Then there exists a constant C such that for all continuous functions $f:\mathcal X\to\mathbb R$ and $n\in\mathbb N$,
$$\big|\mathbb E\big(f(X_n)-\pi(f)\big)\big| \le \frac{C|f|_\infty}{\sqrt n}. \qquad (26)$$

Proof. Fix $n\ge 2$ and $1\le q\le n$. Fix $f:\mathcal X\to\mathbb R$ with $|f|_\infty = 1$. Then write
$$\mathbb E_x f(X_n) - P_{\theta_\star}^n f(x) = \mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - P_{\theta_\star}^n f(x)\big) - \mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - f(X_n)\big).$$
For the first term we can use (24) to get
$$\big|\mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - P_{\theta_\star}^n f(x)\big)\big| \le C\rho^{n-q},$$
for some finite constant C that does not depend on f. For the second term, we write
$$\mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - f(X_n)\big) = \mathbb E_x\left[\sum_{j=q}^{n-1}\big(P_{\theta_\star}^{n-j} f(X_j) - P_{\theta_\star}^{n-j-1} f(X_{j+1})\big)\right] = \sum_{j=q}^{n-1}\mathbb E_x\Big[P_{\theta_\star}^{n-j} f(X_j) - \mathbb E_x\big(P_{\theta_\star}^{n-j-1} f(X_{j+1}) \mid \mathcal F_j\big)\Big] = \sum_{j=q}^{n-1}\mathbb E_x\Big[P_{\theta_\star}^{n-j} f(X_j) - P_{\hat\theta_j} P_{\theta_\star}^{n-j-1} f(X_j)\Big] = \sum_{j=q}^{n-1} C_0\rho^{n-j-1}\,\mathbb E_x\Big[\big(P_{\theta_\star} - P_{\hat\theta_j}\big)\zeta_{n,j}(X_j)\Big], \qquad (27)$$
where in the last line we write
$$\zeta_{n,j}(x) = \frac{P_{\theta_\star}^{n-j-1}\big(f(x)-\pi(f)\big)}{C_0\,\rho^{n-j-1}}, \quad x\in\mathcal X,$$
with $C_0$ and ρ chosen as in (24). As a consequence of (24), $|\zeta_{n,j}|_\infty \le 1$; $\zeta_{n,j}$ is also continuous, by the continuity of f and Assumption (A5).

To simplify the notation, for any function $g:\mathcal X\to\mathbb R$, define
$$H_g(x,z) \stackrel{\rm def}{=} \alpha(x,z)\big(g(z)-g(x)\big), \quad x,z\in\mathcal X, \qquad (28)$$
where
$$\alpha(x,z) \stackrel{\rm def}{=} 1\wedge\frac{w(z)}{w(x)}. \qquad (29)$$
Thus, we can write
$$P_\theta g(x) - P_{\theta_\star} g(x) = \epsilon\int H_g(x,z)\big(\theta(dz)-\theta_\star(dz)\big).$$
Based on $g:\mathcal X\to\mathbb R$, we introduce the class of functions
$$\mathcal F_g \stackrel{\rm def}{=} \{z\mapsto H_g(x,z) : x\in\mathcal X\},$$
and the empirical process
$$\mathbb G_n(h) \stackrel{\rm def}{=} \frac{1}{\sqrt n}\sum_{j=1}^n \big(h(Y_j)-\pi_Y(h)\big), \quad h\in\mathcal F_g.$$
Therefore, the expectation term in (27) becomes
$$\mathbb E_x\Big[\big(P_{\theta_\star}-P_{\hat\theta_j}\big)\zeta_{n,j}(X_j)\Big] = \epsilon\,\mathbb E_x\left[\int H_{\zeta_{n,j}}(X_j,z)\big(\theta_\star(dz)-\hat\theta_j(dz)\big)\right] = -\epsilon\,\mathbb E_x\left[\frac1j\sum_{\ell=1}^j H_{\zeta_{n,j}}(X_j,Y_\ell) - \int H_{\zeta_{n,j}}(X_j,z)\,\theta_\star(dz)\right] = -\frac{\epsilon}{\sqrt j}\,\mathbb E_x\Big[\mathbb G_j\big(H_{\zeta_{n,j}}(X_j,\cdot)\big)\Big],$$
whence
$$\Big|\mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - f(X_n)\big)\Big| = \left|\sum_{j=q}^{n-1}\frac{C_0\,\epsilon\,\rho^{n-j-1}}{\sqrt j}\,\mathbb E_x\Big[\mathbb G_j\big(H_{\zeta_{n,j}}(X_j,\cdot)\big)\Big]\right| \le C_0\,\epsilon\sum_{j=q}^{n-1}\frac{\rho^{n-j-1}}{\sqrt j}\,\mathbb E_x\left(\sup_{h\in\mathcal F_{\zeta_{n,j}}}|\mathbb G_j(h)|\right).$$
We prove in Lemma 2 below that for any continuous function $g:\mathcal X\to\mathbb R$ with $|g|_\infty\le 1$,
$$\mathbb E_x\left(\sup_{h\in\mathcal F_g}|\mathbb G_n(h)|\right) \le C,$$
for some constant C that depends on neither n nor g. We conclude that
$$\Big|\mathbb E_x\big(P_{\theta_\star}^{n-q} f(X_q) - f(X_n)\big)\Big| \le C\sum_{j=q}^{n-1}\frac{\rho^{n-j-1}}{\sqrt j}.$$
Thus for any $1\le q\le n$,
$$\big|\mathbb E_x\big(f(X_n)\big)-\pi(f)\big| \le C\left(\rho^n + \rho^{n-q} + \sum_{j=q}^{n-1}\frac{\rho^{n-j-1}}{\sqrt j}\right) \le C n^{-1/2},$$
by choosing $q = n - \big\lfloor \frac{\log n}{-2\log\rho}\big\rfloor$. □

We rely on the following technical result on the auxiliary chain $\{Y_n\}_{n\ge 0}$.

Lemma 2. Suppose that Assumptions (A3) and (A4) hold, and let $g:\mathcal X\to\mathbb R$ be continuous with $|g|_\infty\le 1$. Then
$$\mathbb E_x\left(\sup_{h\in\mathcal F_g}|\mathbb G_n(h)|\right) \le C,$$
for a constant C that does not depend on n.

Proof. Throughout the proof, $n\ge 1$ is fixed. Assumption (A3) suggests the following metric on $\mathcal F_g$:
$$d(h_1,h_2) = \sigma(h_1-h_2) = \left(\int_{\mathcal X}\big(h_1(x)-h_2(x)\big)^2\,\pi_Y(dx)\right)^{1/2},$$
which has the following properties. For $x_1,x_2\in\mathcal X$, it is easy to check that
$$|H_g(x_1,z)-H_g(x_2,z)| \le 2|\alpha(x_1,z)-\alpha(x_2,z)| + |g(x_1)-g(x_2)|. \qquad (30)$$
It follows that
$$d\big(H_g(x_1,\cdot),H_g(x_2,\cdot)\big) \le \sqrt 2\,|g(x_1)-g(x_2)| + 2\sqrt 2\left(\int|\alpha(x_1,z)-\alpha(x_2,z)|^2\,\pi_Y(dz)\right)^{1/2}. \qquad (31)$$
This implies that the diameter of $\mathcal F_g$ is bounded by $\delta(\mathcal F_g) = 4\sqrt 2$. It also implies that, with respect to d, the empirical process $\{\mathbb G_n(h),\ h\in\mathcal F_g\}$ is separable. Indeed, for arbitrary $x\in\mathcal X$ and $h = H_g(x,\cdot)$, using the Polish assumption we can find a sequence $x_m\in\mathcal X$ ($x_m$ belonging to a countable subset of $\mathcal X$) such that $x_m\to x$ as $m\to\infty$. Setting $h_m = H_g(x_m,\cdot)$, it follows from (31) and the continuity of w and g that $h_m\to h$ in $(\mathcal F_g,d)$, and (30) easily shows that
$$\mathbb G_n(h_m)-\mathbb G_n(h) = \frac{1}{\sqrt n}\sum_{\ell=1}^n\big(H_g(x_m,Y_\ell)-H_g(x,Y_\ell)\big) + \sqrt n\,\pi_Y\big(H_g(x,\cdot)-H_g(x_m,\cdot)\big) \to 0$$
as $m\to\infty$, for all realizations of $\{Y_1,\dots,Y_n\}$. For any $h_1,h_2\in\mathcal F_g$, Assumption (A3) implies that for any $x>0$,
$$\mathbb P_x\big(|\mathbb G_n(h_1)-\mathbb G_n(h_2)| > x\big) \le C\exp\left(-\frac{x^2}{C\,d^2(h_1,h_2)}\right).$$
Then we apply van der Vaart and Wellner [23, Corollary 2.2.8] to conclude that for $h_0\in\mathcal F_g$, there exists a constant C independent of g such that
$$\mathbb E_x\left(\sup_{h\in\mathcal F_g}|\mathbb G_n(h)|\right) \le \mathbb E_x|\mathbb G_n(h_0)| + C\int_0^{\delta(\mathcal F_g)}\sqrt{1+\log D(\epsilon,\mathcal F_g,d)}\,d\epsilon < \infty,$$
where $D(\epsilon,\mathcal F_g,d)$ is the packing number of $\mathcal F_g$ with respect to d. Assumption (A3) shows that $\mathbb E_x|\mathbb G_n(h_0)| < \infty$. To control the entropy number, we further bound the right-hand side of (31). Without loss of generality, assume $x_1,x_2\in\mathcal X$ with $w(x_1) < w(x_2)$. If $w(x_1)\vee w(x_2) \le w(z)$, then $\alpha(x_1,z)-\alpha(x_2,z) = 0$. If $w(z)\le w(x_1)$, then
$$|\alpha(x_1,z)-\alpha(x_2,z)|^2 = \left|\frac{w(z)}{w(x_1)}-\frac{w(z)}{w(x_2)}\right|^2 \le \frac{1}{w(x_1)^2}\big(w(x_2)-w(x_1)\big)^2.$$
If $w(x_1)\le w(z)\le w(x_2)$, then
$$|\alpha(x_1,z)-\alpha(x_2,z)|^2 = \left|1-\frac{w(z)}{w(x_2)}\right|^2 \le \frac{1}{w(x_2)^2}\big(w(x_2)-w(x_1)\big)^2.$$
Thus
$$\int|\alpha(x_1,z)-\alpha(x_2,z)|^2\,\pi_Y(dz) \le \left(\frac{\phi(x_1)}{w(x_1)^2}+\frac{\phi(x_2)}{w(x_2)^2}\right)\big(w(x_2)-w(x_1)\big)^2 \le C\big(w(x_2)-w(x_1)\big)^2,$$
where $\phi(x) \stackrel{\rm def}{=} \pi_Y(\{z : w(z)\le w(x)\})$, and the last inequality follows from (A4). Together with (31), we conclude from this bound that
$$d\big(H_g(x_1,\cdot),H_g(x_2,\cdot)\big) \le C\big(|g(x_1)-g(x_2)| + |w(x_2)-w(x_1)|\big). \qquad (32)$$
Since $|g|_\infty\le 1$ and $w(x)\in[0,|w|_\infty]$, this implies that the ε-packing number of $(\mathcal F_g,d)$ is at most of order $\epsilon^{-2}$, so that
$$\int_0^{\delta(\mathcal F_g)}\sqrt{1+\log D(\epsilon,\mathcal F_g,d)}\,d\epsilon \le C\int_0^{\delta(\mathcal F_g)}\sqrt{1+\log(1/\epsilon)}\,d\epsilon < \infty.$$
This proves the lemma. □

3.2 Connection with Parallel Tempering

Our results suggest that the EE sampler mixes relatively slowly. A plausible reason for this slow mixing is the dependence on the entire sample path $\{Y_k\}_{0 \le k \le n}$. The EE sampler is closely related to the Parallel Tempering (PT) algorithm of [12], which suggests that it might be possible to exploit this connection to derive versions of the EE sampler with better mixing properties. Like the EE sampler, a 2-chain PT generates a stochastic process $\{(X_n, Y_n)\}_{n \ge 0}$ where, with probability $1-\epsilon$, $X_n$ is generated from $P(X_{n-1}, \cdot)$ and, with probability $\epsilon$, one proposes to swap the two chains. Thus PT is closely related to an EE-type algorithm in which the empirical measure $\widehat\pi_{Y,n}$ is replaced by the Dirac measure $\delta_{Y_n}$. However, we show that in general, this new algorithm does not maintain the correct stationary distribution. We hope that the discussion in this section will be helpful in conceptualizing new adaptive algorithms in the future.

The modified EE sampler is as follows. Let $\{Y_n\}_{n \ge 0}$ be the auxiliary chain with transition kernel $P_Y$ and stationary distribution $\pi_Y$. Let $\{X_n\}_{n \ge 0}$ be a chain satisfying the following assumption: for all continuous and bounded functions $f$,
\[
\mathbb{E}\left[ f(X_{n+1}) \mid X_0, \ldots, X_n, Y_0, \ldots, Y_n \right] = P_{\delta_{Y_n}} f(X_n), \quad n \in \mathbb{N},
\]
where $P_\theta$ is as in (23), and denote the stationary distribution of $P$ by $\pi_X$.

Remark 8. The difference from the EE sampler is that we replace $\widehat\pi_{Y,n}$ by $\delta_{Y_n}$. If, when $X_{n+1}$ moves to $Y_n$, we also make $Y_{n+1}$ move to $X_n$, then we allow swaps between the two chains. Such swaps are in the spirit of parallel tempering algorithms (see e.g. Geyer [12]).

A nice property of this algorithm is that $\{(X_n, Y_n)\}_{n \ge 0}$ is a Markov chain. Indeed, it has transition kernel
\[
P_{X,Y}(x, y; dz, dw) = P_{\delta_y}(x, dz)\, P_Y(y, dw). \tag{33}
\]
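To make the dynamics concrete, here is a minimal sketch of one X-update of this modified algorithm, using the decomposition $P_\theta = (1-\epsilon)P + \epsilon K_\theta$: with probability $1-\epsilon$ the chain moves by the base kernel $P$, and with probability $\epsilon$ it proposes a jump to the current auxiliary state $y$, accepted with probability $\alpha(x, y)$. The function names and the discrete-state usage below are illustrative choices of ours, not from the paper.

```python
import random

def modified_ee_step(x, y, base_step, pi_x, pi_y, eps=0.1):
    """One X-update of the modified EE sampler: with probability 1 - eps,
    move via the base MCMC kernel P(x, .); with probability eps, propose
    jumping to the current auxiliary state y, accepted with probability
    alpha(x, y) = min(1, pi_x(y) * pi_y(x) / (pi_x(x) * pi_y(y)))."""
    if random.random() < 1 - eps:
        return base_step(x)                        # draw from P(x, .)
    alpha = min(1.0, pi_x(y) * pi_y(x) / (pi_x(x) * pi_y(y)))
    return y if random.random() < alpha else x     # K_{delta_y} move

# Illustrative two-state setting: pi_X = (2/3, 1/3) on {+1, -1}, pi_Y uniform.
pi_x = {1: 2 / 3, -1: 1 / 3}.get
pi_y = {1: 0.5, -1: 0.5}.get
base_step = lambda z: z    # placeholder only; a real P would be pi_X-invariant
# With eps = 1, x = -1 and y = +1, alpha(-1, 1) = min(1, 2) = 1, so the
# jump to +1 is always accepted:
x_new = modified_ee_step(-1, 1, base_step, pi_x, pi_y, eps=1.0)
```

Note that `base_step` is never invoked when `eps=1.0`; in an actual implementation it would be, e.g., one Metropolis-Hastings step targeting $\pi_X$.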

This Markov chain may not have the desired stationary distribution. Let $\pi_{X,Y}$ denote the stationary distribution, and let $\pi_{X,Y}^{(i)}$, $i = 1, 2$, denote its two marginal distributions. Naturally, we wish $\pi_{X,Y}^{(1)} = \pi_X$ and $\pi_{X,Y}^{(2)} = \pi_Y$. By construction, the latter identity is always true. However, the former does not always hold. Since $P_\theta = (1-\epsilon)P + \epsilon K_\theta$ and $P(x, dz)\, P_Y(y, dw)$ has stationary distribution $\pi_X(dz)\, \pi_Y(dw)$, instead of (33) it suffices to focus on the transition kernel
\[
P_{X,Y}(x, y; dz, dw) = K_{\delta_y}(x, dz)\, P_Y(y, dw). \tag{34}
\]

Consider the simple case where both chains take values in $\{\pm 1\}$. Let the auxiliary chain have the following transition matrix and stationary distribution:
\[
P_Y = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}
\quad \text{and} \quad
\pi_Y = \left( \frac{b}{a+b}, \frac{a}{a+b} \right).
\]
Recall that in this case,
\[
K_{\delta_y}(x, z) = \alpha(x, y) \mathbf{1}_{\{z=y\}} + (1 - \alpha(x, y)) \mathbf{1}_{\{z=x\}},
\quad \text{with } \alpha(x, y) = 1 \wedge \frac{\pi_X(y)}{\pi_X(x)} \frac{\pi_Y(x)}{\pi_Y(y)}.
\]

Write $c = \alpha(1, -1)$ and $d = \alpha(-1, 1)$. Then, with the states ordered $(1,1), (1,-1), (-1,1), (-1,-1)$, one can write the transition matrix of $P_{X,Y}$ as follows:
\[
P_{X,Y} = \begin{pmatrix}
1-a & a & 0 & 0 \\
(1-c)b & (1-c)(1-b) & cb & c(1-b) \\
d(1-a) & da & (1-d)(1-a) & (1-d)a \\
0 & 0 & b & 1-b
\end{pmatrix}.
\]
For example, the table reads as
\[
P_{X,Y}(1, -1; 1, -1) = K_{\delta_{-1}}(1, 1)\, P_Y(-1, -1) = (1-c)(1-b).
\]
We solve $\pi_{X,Y} P_{X,Y} = \pi_{X,Y}$ and obtain
\[
\pi_{X,Y}^\top = \frac{1}{(a+b)\big((1-a-b)cd + ac + bd\big)}
\begin{pmatrix}
bd\big(b + (1-a-b)c\big) \\
abd \\
abc \\
ac\big(a + (1-a-b)d\big)
\end{pmatrix}.
\]
To see that $\pi_{X,Y}^{(1)}$ does not always equal $\pi_X$, take for example $a = b = 1/3$ and $\pi_X = (2/3, 1/3)$. In this case, $c = 1/2$, $d = 1$, and $\pi_{X,Y} = (3/8, 1/4, 1/8, 1/4)$, whence $\pi_{X,Y}^{(1)} = (5/8, 3/8) \neq \pi_X$.
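The counterexample is easy to verify numerically. The following sketch (exact rational arithmetic; variable names are ours, not the paper's) builds the kernel (34) for $a = b = 1/3$ and $\pi_X = (2/3, 1/3)$, checks that the displayed $\pi_{X,Y}$ is indeed stationary, and recovers the first marginal $(5/8, 3/8) \neq \pi_X$.

```python
from fractions import Fraction as F

a = b = F(1, 3)
pi_x = {1: F(2, 3), -1: F(1, 3)}
pi_y = {1: b / (a + b), -1: a / (a + b)}       # = (1/2, 1/2)

def alpha(x, y):
    """Metropolis-type acceptance probability alpha(x, y)."""
    return min(F(1), pi_x[y] * pi_y[x] / (pi_x[x] * pi_y[y]))

c, d = alpha(1, -1), alpha(-1, 1)              # c = 1/2, d = 1

P_Y = {(1, 1): 1 - a, (1, -1): a, (-1, 1): b, (-1, -1): 1 - b}

def K(x, y, z):
    """Jump kernel K_{delta_y}(x, z): move to y w.p. alpha, else stay at x."""
    return alpha(x, y) * (z == y) + (1 - alpha(x, y)) * (z == x)

states = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
# Kernel (34): P((x, y), (z, w)) = K_{delta_y}(x, z) * P_Y(y, w)
P = {(s, t): K(s[0], s[1], t[0]) * P_Y[(s[1], t[1])]
     for s in states for t in states}

# Closed-form stationary distribution from the text
norm = (a + b) * ((1 - a - b) * c * d + a * c + b * d)
pi = {(1, 1):  b * d * (b + (1 - a - b) * c) / norm,
      (1, -1): a * b * d / norm,
      (-1, 1): a * b * c / norm,
      (-1, -1): a * c * (a + (1 - a - b) * d) / norm}

# Stationarity pi * P = pi holds exactly, yet the X-marginal is wrong:
assert all(sum(pi[s] * P[(s, t)] for s in states) == pi[t] for t in states)
marginal = {1: pi[(1, 1)] + pi[(1, -1)], -1: pi[(-1, 1)] + pi[(-1, -1)]}
# marginal = (5/8, 3/8), which differs from pi_x = (2/3, 1/3)
```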

References

[1] Andrieu, C. and Atchadé, Y. F. (2007). On the efficiency of adaptive MCMC algorithms. Electron. Comm. Probab., 12:336–349 (electronic).

[2] Andrieu, C., Jasra, A., Doucet, A., and Del Moral, P. (2008). A note on convergence of the equi-energy sampler. Stoch. Anal. Appl., 26(2):298–312.

[3] Andrieu, C., Jasra, A., Doucet, A., and Del Moral, P. (2011). On nonlinear Markov chain Monte Carlo. Bernoulli, 17(3):987–1014.

[4] Atchadé, Y., Fort, G., Moulines, E., and Priouret, P. (2011). Adaptive Markov chain Monte Carlo: theory and methods. Book chapter in Bayesian Time Series Models. Cambridge University Press.

[5] Atchadé, Y. F. (2009). Resampling from the past to improve on MCMC algorithms. Far East J. Theor. Stat., 27(1):81–99.

[6] Atchadé, Y. F. (2010). A cautionary tale on the efficiency of some adaptive Monte Carlo schemes. Ann. Appl. Probab., 20(3):841–868.

[7] Bercu, B., Del Moral, P., and Doucet, A. (2009). A functional central limit theorem for a class of interacting Markov chain Monte Carlo methods. Electron. J. Probab., 14(73):2130–2155.

[8] Bercu, B., Del Moral, P., and Doucet, A. (2012). Fluctuations of interacting Markov chain Monte Carlo methods. Stochastic Process. Appl., 122(4):1304–1331.

[9] Cattiaux, P. and Guillin, A. (2008). Deviation bounds for additive functionals of Markov processes. ESAIM Probab. Stat., 12:12–29 (electronic).

[10] Fort, G., Moulines, E., and Priouret, P. (2011a). Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289.

[11] Fort, G., Moulines, E., Priouret, P., and Vandekerkhove, P. (2011b). A central limit theorem for adaptive and interacting Markov chains. Available at http://arxiv.org/abs/1107.2574.

[12] Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computer Sciences and Statistics: Proc. 23rd Symposium on the Interface.

[13] Haario, H., Saksman, E., and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–396.


[14] Häggström, O. (2005). On the central limit theorem for geometrically ergodic Markov chains. Probab. Theory Related Fields, 132(1):74–82.

[15] Häggström, O. and Rosenthal, J. S. (2007). On variance conditions for Markov chain CLTs. Electron. Comm. Probab., 12:454–464 (electronic).

[16] Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

[17] Kou, S. C., Zhou, Q., and Wong, W. H. (2006). Equi-energy sampler with applications in statistical inference and statistical mechanics. Ann. Statist., 34(4):1581–1652. With discussions and a rejoinder by the authors.

[18] Levin, D. A., Peres, Y., and Wilmer, E. L. (2009). Markov Chains and Mixing Times. American Mathematical Society, Providence, RI. With a chapter by James G. Propp and David B. Wilson.

[19] Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist., 24(1):101–121.

[20] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag London Ltd., London.

[21] Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, second edition.

[22] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv., 1:20–71 (electronic).

[23] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York.
