Efficient MCMC sampling in dynamic mixture models
Gabriele Fiorentini(a), Christophe Planas(b) and Alessandro Rossi(b)

August 2011

Abstract

We show how to improve the efficiency of MCMC sampling in dynamic mixture models by block-sampling the discrete latent variable. Two algorithms are proposed: the first is a multi-move extension of the single-move Gibbs sampler devised by Gerlach, Carter and Kohn (2000); the second is an adaptive Metropolis-Hastings scheme that performs well even when the number of discrete states is large. Three empirical examples illustrate the gain in efficiency achieved. We also show that visual inspection of the sample partial autocorrelations of the discrete latent variable helps to anticipate whether blocking can be effective.
Keywords: Adaptive Metropolis-Hastings, conditionally linear models, Gibbs sampling, state-space models.
∗ The authors are grateful to the participants of the Bayesian Econometrics workshop of the Rimini Centre for Economic Analysis 2011 for useful comments and suggestions. The ideas expressed here are those of the authors and do not necessarily reflect the position of the European Commission. (a) University of Florence, Department of Statistics, and Rimini Centre for Economic Analysis. (b) Joint Research Centre of the European Commission. E-mails: [email protected], [email protected], [email protected].
1 Introduction
This paper is about enhancing the efficiency of posterior simulations in dynamic mixture models. Thanks to their flexibility for handling nonlinearities, structural changes, and outliers (see Giordani, Kohn, and van Dijk, 2007), dynamic mixture models have enjoyed considerable success in the statistical literature. Applications can be found in a wide range of disciplines (see Scott, 2002, the textbooks by Fruhwirth-Schnatter, 2006, and Kim and Nelson, 1999). Many empirical works benefit from Bayesian methods for model estimation and inference. Bayesian dynamic mixture models typically involve three sources of randomness: a vector of parameters Θ, an unobserved continuous state X ≡ (X_1, · · ·, X_T), and a latent discrete process S ≡ (S_1, · · ·, S_T), where each variable S_t takes N possible values and T denotes the sample size. Letting small letters denote realizations, a sample {(θ^(n), x^(n), s^(n)), n = 1, 2, · · ·} from the joint distribution of (Θ, X, S) given the observations y ≡ (y_1, · · ·, y_T) can be obtained by Markov chain Monte Carlo (MCMC) methods. A plain Gibbs design relies on the following full conditional distributions (see Casella and George, 1992):

f(Θ|x, s, y), f(X|θ, s, y), f(S|θ, x, y)

where the sampling of θ^(n) from f(Θ|x, s, y) is model dependent, the continuous state x^(n) can be drawn jointly from f(X|s, θ, y) using for instance the Carter and Kohn (1994) smoother, and the sequence s^(n) can be sampled from f(S|θ, x, y) as explained in Kim and Nelson (1999, p. 212). Such a scheme however produces slowly mixing chains when X and S are strongly dependent given θ and y. In this case, it is more efficient to generate x and s jointly as in:

f(Θ|x, s, y), f(X|s, θ, y), f(S|θ, y)

Simulating from f(S|θ, y) is the focus of this paper. Gerlach, Carter and Kohn (2000), GCK hereafter, developed the gold standard algorithm, which draws s from f(S|θ, y) in O(T) operations.
Yet it is a single-move algorithm, so it may be inefficient when the S_t-variables are strongly conditionally dependent. In such a case the mixing properties of the algorithm can be improved by simulating s_t, · · ·, s_{t+h-1} jointly from f(S_t, · · ·, S_{t+h-1} | s_1, · · ·, s_{t-1}, s_{t+h}, · · ·, s_T, θ, y). We thus propose block extensions of the GCK single-move sampler. As Seewald (1992) puts it, blocking moves correlations from the MCMC sampler over to the random vector generator. In practice, however, it is important to know whether grouping is worth implementing. Conditions under which blocking improves efficiency and may hasten convergence have been studied
by Roberts and Sahu (1997) for Gaussian target distributions (see also Liu, Wong, and Kong, 1994, and Besag, Green, Higdon, and Mengersen, 1995). In general, blocking is expected to reduce the chain correlations when the variables within each group are conditionally dependent given the rest. We report evidence that blocking is effective when S has sizeable partial autocorrelations a posteriori. In Section 2 we extend GCK's single-move algorithm to multi-move sampling with blocks of length h. The number of operations required by this multi-move sampler is N^{h-1}/h times that of the single-move sampler. Hence, when N = 2 the double-move algorithm entails no additional computational cost. Computing time, however, becomes prohibitive for large values of N or h. We thus propose a Metropolis-Hastings (MH) block sampler that requires at most 2/N times the number of operations involved in GCK single-move sampling. While blocking is often considered expensive both in terms of program complexity and computational cost, our MH scheme runs faster than the single-move Gibbs sampler without being more difficult to implement. This MH algorithm is further enhanced by making the proposal distribution adaptive. By updating the proposal distribution from the past history of the chain, adaptive MH schemes aim at improving sampling efficiency. Because the dependence on past draws makes the chain lose the Markov property, standard results about chain convergence do not apply (see Tierney, 1994). Conditions that ensure convergence of adaptive MH algorithms have been studied in several papers (see for instance Haario, Saksman and Tamminen, 2001, Atchadé and Rosenthal, 2005, Nott and Kohn, 2005, Andrieu and Moulines, 2006, Roberts and Rosenthal, 2007). Current research focuses on the development of adaptive samplers for new applications (see Roberts and Rosenthal, 2009).
In the context of dynamic mixture models, Giordani and Kohn (2008) give the first adaptive sampler for discrete latent variables. Although appealing, their sampler is single-move and uses a proposal distribution that does not embody time series dynamics. The proposal distribution we put forward introduces dynamics, since our aim is to sample variables in blocks. Adaptation is made continuously at almost zero cost. In the Appendix we show chain convergence using results given in Giordani and Kohn (2010). In Section 3 we compare the samplers' performance in three empirical applications: to real US GDP, to the USD/GBP exchange rate, and to the Fama-French returns data. Overall, the adaptive MH algorithm appears to be the most efficient. Section 4 concludes; auxiliary material is given in the Appendix.
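The blocking diagnostic mentioned above, sizeable posterior partial autocorrelations of S, can be computed directly from stored draws of any S_t. Below is a sketch (our illustration, not code from the paper) that obtains sample partial autocorrelations via the Durbin-Levinson recursion:

```python
import random

def pacf(x, nlags):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    # sample autocorrelations rho(0), ..., rho(nlags)
    rho = [1.0] + [sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / (n * c0)
                   for k in range(1, nlags + 1)]
    phi_prev, out = [], []
    for k in range(1, nlags + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        pk = num / den          # partial autocorrelation at lag k
        out.append(pk)
        phi_prev = [phi_prev[j] - pk * phi_prev[k - 2 - j] for j in range(k - 1)] + [pk]
    return out

# toy draw history standing in for the MCMC output of one S_t variable
rng = random.Random(0)
history = [rng.gauss(0.0, 1.0) for _ in range(500)]
p = pacf(history, 5)
```

For near-independent draws, as here, the partial autocorrelations hover around zero; sizeable values at low lags suggest that blocking will pay off.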
2 Block-sampling

2.1 Model and notations
We consider the following general state-space model:

y_t = c_t + H_t x_t + G_t u_t
x_t = f_t + F_t x_{t-1} + Γ_t u_t   (2.1)
where u_t is a vector of independent standard Gaussian innovations. The possibly time-varying vectors and matrices c_t, H_t, G_t, f_t, F_t, and Γ_t are determined by the model parameters and by the latent discrete variable S_t. Hereafter, for any random variable Z with realization z, we write f(z|·) for the conditional probability or density evaluated at z. In addition we denote z_{k,t} ≡ (z_k, · · ·, z_t).
2.2 Gibbs
Without loss of generality, we follow the deterministic-updating Gibbs strategy and leave to the reader the possibility of implementing a random permutation Gibbs scheme (see Roberts and Sahu, 1997). GCK's single-move algorithm relies on the following factorization:

f(s_t | s_{1,t-1}, s_{t+1,T}, θ, y) ∝ f(y_{t+1,T} | y_{1,t}, s, θ) f(y_t | y_{1,t-1}, s_{1,t}, θ) f(s_t | s_{1,t-1}, s_{t+1,T}, θ)   (2.2)
Lemma 4 in GCK yields f(y_{t+1,T} | y_{1,t}, s, θ) in one step after an initial set of backward iterations, whereas evaluating f(y_t | y_{1,t-1}, s_{1,t}, θ) needs one recursion of the Kalman filter for each possible value of S_t. Overall N × T Kalman recursions are needed. Generalization to multi-move sampling is straightforward. To present the idea we first consider a double-move sampler. Similarly to (2.2) we can write:

f(s_t, s_{t+1} | s_{1,t-1}, s_{t+2,T}, θ, y) ∝ f(y_{t+2,T} | y_{1,t+1}, s, θ) f(y_{t+1} | y_{1,t}, s_{1,t+1}, θ) × f(y_t | y_{1,t-1}, s_{1,t}, θ) f(s_t, s_{t+1} | s_{1,t-1}, s_{t+2,T}, θ)   (2.3)
In equation (2.3) above, GCK's Lemma 4 still applies to evaluate f(y_{t+2,T} | y_{1,t+1}, s, θ) in one step, whereas f(y_{t+1} | y_{1,t}, s_{1,t+1}, θ) and f(y_t | y_{1,t-1}, s_{1,t}, θ) need N² Kalman filter evaluations. Overall the 2-term blocking requires N²/2 × T operations. If the discrete variable only takes two states, then single-move and double-move sampling lead to the same number of operations. The generalization to h-term blocking is such that:

f(s_{t,t+h-1} | s_{1,t-1}, s_{t+h,T}, θ, y) ∝ f(y_{t+h,T} | y_{1,t+h-1}, s, θ) ∏_{ℓ=0}^{h-1} f(y_{t+ℓ} | y_{1,t+ℓ-1}, s_{1,t+ℓ}, θ) × f(s_{t,t+h-1} | s_{1,t-1}, s_{t+h,T}, θ)   (2.4)
In general the number of Kalman filter iterations is N^h/h × T. Therefore block sampling via full conditionals is not feasible when N or h is large. For these cases we propose an adaptive MH strategy.
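To fix ideas on the N^h/h cost, the sketch below (ours; the log-weight function is a toy stand-in for the Kalman-filter evaluations in (2.4)) draws one block from its full conditional by enumerating all N^h configurations:

```python
import math
import random
from itertools import product

def sample_block_gibbs(N, h, log_weight, rng):
    """Draw (s_t, ..., s_{t+h-1}) from its full conditional by enumerating
    all N**h configurations; log_weight returns the unnormalised
    log-probability of one configuration."""
    paths = list(product(range(N), repeat=h))
    logs = [log_weight(p) for p in paths]
    m = max(logs)                                   # stabilise the exponentials
    weights = [math.exp(v - m) for v in logs]
    u = rng.random() * sum(weights)
    acc = 0.0
    for path, w in zip(paths, weights):
        acc += w
        if u <= acc:
            return path
    return paths[-1]

# toy weight: favour blocks whose neighbouring states agree
rng = random.Random(0)
draw = sample_block_gibbs(2, 3, lambda p: float(sum(p[i] == p[i + 1] for i in range(2))), rng)
```

In the paper each `log_weight` call would itself require h Kalman-filter recursions, which is why the enumeration becomes prohibitive for large N or h.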
2.3 Metropolis-Hastings
To be viable, an MH block sampling scheme needs a proposal density whose acceptance rate remains appreciable when the block length increases. Since f(s_{t,t+h-1} | s_{1,t-1}, s_{t+h,T}, θ, y) ∝ f(s|θ, y), the proposal density for a generic block, say q(s_{t,t+h-1}), implicitly embodies an approximation to the joint full conditional distribution of the S sequence. The difficulty in building a reasonable approximation comes from the huge dimension of the state space, namely N^T, so simplifying assumptions are needed. We concentrate on the class of discrete probability distributions that satisfy the Markov property:

q(s_1, s_2, · · ·, s_T) = q(s_1) ∏_{t=1}^{T-1} q(s_{t+1} | s_t)
where q(s_{t+1} | s_t) represents the probability of moving from S_t = s_t to S_{t+1} = s_{t+1}. By allowing the transition probability q(s_{t+1} | s_t) to be time-varying, we benefit from the flexibility of non-homogeneous Markov processes. Under the Markov assumption, the MH proposal distribution for a given block, q(s_{t,t+h-1} | s_{1,t-1}, s_{t+h,T}), simplifies to q(s_{t,t+h-1} | s_{t-1}, s_{t+h}). It thus factorizes into the product of marginal and one-step transition probabilities, as we detail below. Three cases must be distinguished: the initial, the inner, and the final block.

Initial block S_1, · · ·, S_h: The importance weight attached to s_1, · · ·, s_h is:

q(s_{1,h} | s_{h+1}) = ∏_{j=1}^{h} q(s_j | s_{j+1})   (2.5)

with backward transition probabilities given by:

q(s_j | s_{j+1}) = q(s_{j+1} | s_j) q(s_j) / q(s_{j+1})   (2.6)
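As an illustration (ours, with a hypothetical container layout), equation (2.6) converts stored forward transitions and marginals into backward probabilities; the check below confirms they sum to one:

```python
def backward_transition(q_marg, q_trans, j, s_j, s_next):
    """q(s_j | s_{j+1}) of (2.6): q_trans[t][i][k] holds q(S_{t+1}=k | S_t=i)
    and q_marg[t][i] holds q(S_t = i); time indices start at 0."""
    return q_trans[j][s_j][s_next] * q_marg[j][s_j] / q_marg[j + 1][s_next]

# two-state example with time-varying transitions; the marginals are
# propagated forward so that (2.6) defines a proper backward distribution
q_trans = [[[0.9, 0.1], [0.3, 0.7]], [[0.6, 0.4], [0.2, 0.8]]]
q_marg = [[0.5, 0.5]]
for t in range(2):
    q_marg.append([sum(q_marg[t][i] * q_trans[t][i][k] for i in range(2)) for k in range(2)])
total = sum(backward_transition(q_marg, q_trans, 0, s, 1) for s in range(2))
```

The consistency between marginals and transitions matters: if q_marg were not propagated from q_trans, the backward weights would not normalise.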
Inner block S_t, · · ·, S_{t+h-1}: The probability attached to inner blocks can be computed as:

q(s_{t,t+h-1} | s_{t-1}, s_{t+h}) = ∏_{j=0}^{h-1} q(s_{t+j} | s_{t-1}, s_{t+j+1})   (2.7)

where q(s_{t+j} | s_{t-1}, s_{t+j+1}) is such that:

q(s_{t+j} | s_{t-1}, s_{t+j+1}) = q(s_{t+j+1} | s_{t+j}) q(s_{t+j} | s_{t-1}) / q(s_{t+j+1} | s_{t-1})   (2.8)
and the term q(s_{t+j} | s_{t-1}) can be calculated using the Chapman-Kolmogorov formula:

q(s_{t+j} | s_{t-1}) = Σ_{s_t=1}^{N} · · · Σ_{s_{t+j-1}=1}^{N} ∏_{ℓ=0}^{j} q(s_{t+ℓ} | s_{t+ℓ-1}),   j > 0
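In practice the Chapman-Kolmogorov sum need not be enumerated: propagating a point mass through the j+1 one-step transition matrices gives the same result in O(jN²) operations. A sketch (ours; `trans_seq[step]` is assumed to hold the transition matrix from time t-1+step to t+step):

```python
def chapman_kolmogorov(trans_seq, s_prev, j):
    """Return the vector q(S_{t+j} = . | S_{t-1} = s_prev) by iterating
    dist <- dist * P over the j+1 transition matrices in trans_seq."""
    N = len(trans_seq[0])
    dist = [1.0 if i == s_prev else 0.0 for i in range(N)]   # point mass at s_prev
    for step in range(j + 1):
        P = trans_seq[step]
        dist = [sum(dist[i] * P[i][k] for i in range(N)) for k in range(N)]
    return dist

# non-homogeneous two-state example: q(S_{t+1} = . | S_{t-1} = 0)
P0 = [[0.9, 0.1], [0.2, 0.8]]
P1 = [[0.5, 0.5], [0.4, 0.6]]
dist = chapman_kolmogorov([P0, P1], 0, 1)
```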
Ending block S_{T-k+1}, · · ·, S_T, where k = T − [T/h]h is the length of the last block, [ · ] denoting the integer part. The weight attached to this last block is given by:

q(s_{T-k+1,T} | s_{T-k}) = ∏_{j=1}^{k} q(s_{T-j+1} | s_{T-j})   (2.9)
Factorizations different from (2.5)-(2.9) are possible; here we use the most convenient one for recursive sampling. Let n refer to the MCMC iteration with output s^(n), and let B represent the length of the burn-in period of, for instance, the GCK single-move sampler. The marginal probabilities q(s_t), t = 1, · · ·, h+1, and the transition probabilities q(s_{t+1} | s_t), t = 1, · · ·, T−1, involved in
(2.5)-(2.9) are estimated from the burn-in period according to:

q(s_t) = (1/B) Σ_{n=1}^{B} 1(s_t^(n) = s_t)

q(s_{t+1} | s_t) = Σ_{n=1}^{B} 1(s_{t+1}^(n) = s_{t+1}, s_t^(n) = s_t) / Σ_{n=1}^{B} 1(s_t^(n) = s_t)   (2.10)
where 1(·) denotes the indicator function. At iteration n+1, the multi-move MH sampler proposes a candidate s*_t, · · ·, s*_{t+h-1} with acceptance probability:

α = min{1, [f(s*_{t,t+h-1} | s^(n+1)_{1,t-1}, s^(n)_{t+h,T}, y, θ) q(s^(n)_{t,t+h-1} | s^(n+1)_{t-1}, s^(n)_{t+h})] / [f(s^(n)_{t,t+h-1} | s^(n+1)_{1,t-1}, s^(n)_{t+h,T}, y, θ) q(s*_{t,t+h-1} | s^(n+1)_{t-1}, s^(n)_{t+h})]}   (2.11)
where f(s_{t,t+h-1} | s_{1,t-1}, s_{t+h,T}, y, θ) is given in (2.4). The main steps of the MH algorithm are summarized below:

(i) Burn-in: For n = 1, · · ·, B, generate s^(n) using the GCK single-move sampler and use this output to calculate the marginal and transition probabilities as in (2.10). For n = B+1, · · ·, G proceed as follows:

(ii) Initial block S_1, · · ·, S_h: sample backward s*_ℓ by setting s*_ℓ = min{i : Σ_{m=1}^{i} q(S_ℓ = m | s_{ℓ+1}) > U}, where s_{ℓ+1} = s^(n)_{ℓ+1} if ℓ = h, s_{ℓ+1} = s*_{ℓ+1} if ℓ = h−1, · · ·, 1, and U denotes a draw from the uniform distribution on (0, 1). The term q(s_ℓ | s_{ℓ+1}) is given in (2.6). If s*_{1,h} = s^(n)_{1,h}, accept the candidate. Otherwise, compute the acceptance probability α in (2.11), draw U, accept the candidate block if U < α, and reject it otherwise.

(iii) Inner block S_t, · · ·, S_{t+h-1}: sample backward s*_{t+ℓ} by setting s*_{t+ℓ} = min{i : Σ_{m=1}^{i} q(S_{t+ℓ} = m | s^(n+1)_{t-1}, s_{t+ℓ+1}) > U}, with s_{t+ℓ+1} = s^(n)_{t+ℓ+1} if ℓ = h−1 and s_{t+ℓ+1} = s*_{t+ℓ+1} for ℓ = h−2, · · ·, 0. The term q(s_{t+ℓ} | s_{t-1}, s_{t+ℓ+1}) is given in (2.8). Accept or reject the candidate as in (ii).

(iv) Ending block S_{T-k+1}, · · ·, S_T: sample forward s*_ℓ by setting s*_ℓ = min{i : Σ_{m=1}^{i} q(S_ℓ = m | s_{ℓ-1}) > U}, with s_{ℓ-1} = s^(n+1)_{T-k} for ℓ = T−k+1 and s_{ℓ-1} = s*_{ℓ-1} for ℓ = T−k+2, · · ·, T. Accept or reject the candidate as in (ii).
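Step (i) tunes the proposal from burn-in output; a direct implementation of the frequency estimates (2.10) might look as follows (our sketch; `draws` holds B stored state sequences of length T):

```python
def estimate_proposal(draws, N):
    """Frequency estimates (2.10) of the marginal and transition
    probabilities from burn-in draws."""
    B, T = len(draws), len(draws[0])
    q_marg = [[0.0] * N for _ in range(T)]
    q_trans = [[[0.0] * N for _ in range(N)] for _ in range(T - 1)]
    for s in draws:
        for t in range(T):
            q_marg[t][s[t]] += 1.0 / B
        for t in range(T - 1):
            q_trans[t][s[t]][s[t + 1]] += 1.0
    # normalise the transition counts row by row
    for t in range(T - 1):
        for i in range(N):
            tot = sum(q_trans[t][i])
            if tot > 0.0:
                q_trans[t][i] = [c / tot for c in q_trans[t][i]]
    return q_marg, q_trans

draws = [[0, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0]]
q_marg, q_trans = estimate_proposal(draws, 2)
```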
This MH algorithm involves two entire runs of the Kalman filter, noticeably fewer than the N^h/h runs required by Gibbs block-sampling. A speed-up occurs when the candidate block equals the one previously sampled, since in this case there is no need to compute the acceptance probability α. This happens when the data are highly informative about the discrete latent variable for a given time-span, i.e. f(S_t = i_1, · · ·, S_{t+h-1} = i_h | θ, y) is close to one for some vector (i_1, · · ·, i_h). Typical examples are outliers and structural breaks. Notice that even when the evaluation of α is not required, the Kalman filter must be iterated until the end of the block in order to evaluate the density mass attached to the innovations of the successive segment.
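The backward draws in steps (ii)-(iv) are plain inverse-CDF categorical draws. A minimal sketch (ours; `cond_probs` is a toy stand-in for the conditional probabilities of (2.6) or (2.8)):

```python
import random

def inv_cdf_draw(probs, u):
    """min{i : cumulative probability up to i exceeds u}."""
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if cum > u:
            return i
    return len(probs) - 1

def sample_block_backward(cond_probs, h, boundary_state, rng):
    """Draw a block of length h backwards, as in step (ii): each state is
    drawn given the previously sampled state to its right."""
    right = boundary_state
    block = [0] * h
    for l in range(h - 1, -1, -1):
        block[l] = inv_cdf_draw(cond_probs(l, right), rng.random())
        right = block[l]
    return block

rng = random.Random(7)
# degenerate toy conditionals that always select state 1
block = sample_block_backward(lambda l, r: [0.0, 1.0], 3, 0, rng)
```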
2.4 Adaptive Metropolis-Hastings
The proposal distribution can be further enhanced by using all available draws instead of tuning it on the burn-in period only. To generate a candidate block at iteration n+1, we follow Giordani and Kohn (2010) and consider a mixture such as:

q̃_n(s_{t,t+h-1}) = δ q_0(s_{t,t+h-1}) + (1 − δ) q_n(s_{t,t+h-1})   (2.12)
where 0 < δ < 1. For simplicity the dependence on the block-adjacent variables S_{t-1}, S_{t+h} is omitted in the notation. The distribution q_0(s_{t,t+h-1}) remains constant over the iterations and assigns a strictly positive weight to all possible paths. We shall use the uniform distribution q_0(s_{t,t+h-1}) = 1/N^h. Adaptation is made through q_n(s_{t,t+h-1}), which depends on all previous draws 1, 2, · · ·, n−1 except the last one. This adaptive component is built from marginal and transition probabilities as in (2.5), (2.7) and (2.9), but the probabilities now evolve with the iterates according to:

q_n(s_t) = Σ_{k=1}^{n-1} 1(s_t^(k) = s_t) / (n − 1)

q_n(s_{t+1} | s_t) = Σ_{k=1}^{n-1} 1(s_{t+1}^(k) = s_{t+1}, s_t^(k) = s_t) / Σ_{k=1}^{n-1} 1(s_t^(k) = s_t)
that can also be expressed recursively as:

q_n(s_t) = [(n − 2) q_{n-1}(s_t) + 1(s_t^(n-1) = s_t)] / (n − 1)   (2.13)

q_n(s_{t+1} | s_t) = [(n − 2) q_{n-1}(s_{t+1} | s_t) q_{n-1}(s_t) + 1(s_{t+1}^(n-1) = s_{t+1}, s_t^(n-1) = s_t)] / [(n − 2) q_{n-1}(s_t) + 1(s_t^(n-1) = s_t)]   (2.14)
These recursions run on-line at almost zero cost. In general, under adaptation the chain induced by the MH kernel loses the usual Markov property. One must thus verify that the desired ergodic properties of the chain hold. Giordani and Kohn (2010) give sufficient conditions which ensure that the proposed adaptive scheme converges to the correct target distribution. We show in Appendix A that the adaptive MH scheme defined above satisfies these conditions.
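As an illustration (ours), the recursions (2.13)-(2.14) can be coded in a few lines and checked against the batch frequencies they must reproduce:

```python
def update_marginal(q_prev, n, hit):
    """(2.13): q_n(s_t) from q_{n-1}(s_t); hit is 1(s_t^(n-1) = s_t)."""
    return ((n - 2) * q_prev + hit) / (n - 1)

def update_transition(qt_prev, qm_prev, n, hit_pair, hit_state):
    """(2.14): update the transition probability; qm_prev is q_{n-1}(s_t),
    hit_pair and hit_state are the two indicators. Applied once the
    denominator is positive, i.e. once the state s_t has been visited."""
    return ((n - 2) * qt_prev * qm_prev + hit_pair) / ((n - 2) * qm_prev + hit_state)

# recursive estimate of q_n(s_t = 1) versus the batch frequency 3/5
draws = [1, 0, 1, 1, 0]
q = 0.0
for n in range(2, len(draws) + 2):       # q_n uses draws 1, ..., n-1
    q = update_marginal(q, n, 1 if draws[n - 2] == 1 else 0)
```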
3 Efficiency comparison
To compare the performance of the various samplers, two measures are considered: the inefficiency factor (IF) and the relative inefficiency factor (RIF). Let z^(1), · · ·, z^(G) denote a sample drawn from a Markov chain that approximates the stationary distribution of a random variable z. The inefficiency factor is estimated by:

IF = 1 + 2 Σ_{k=1}^{M} w_k ρ(k)   (3.1)
where ρ(k), k = 1, 2, · · ·, denotes the autocorrelation function of z^(1), · · ·, z^(G), weighted with the lag window w_k. If the z's were independent then IF = 1, so IF measures the efficiency loss with respect to an independent sample. For two competing algorithms A and B with inefficiency factors IF_A and IF_B, the ratio IF_A/IF_B gives the factor by which the number of iterations of sampler A must be increased to achieve sampler B's precision in estimating posterior means. In order to take computing time into account, we define the relative inefficiency factor as:

RIF = (Time_A / Time_B) × (IF_A / IF_B)   (3.2)
where Time_A and Time_B denote the computing time of each sampler. Thus RIF measures the factor by which sampler A's run-time must be increased to achieve the same precision as sampler B; values larger than one point to a greater efficiency of scheme B. We consider three empirical examples where the discrete latent variable controls either the
growth or the volatility of the series. The posterior distribution of the unobserved variables and model parameters is inferred via Gibbs sampling with full conditional distributions f(X|s, θ, y) and f(Θ|s, x, y). To draw s from f(S|θ, y), we consider the previously detailed Gibbs and MH samplers with various block lengths. The inefficiency factors are computed on one million draws using a Parzen window of length M = 500, after a burn-in of 20,000 iterations. Both IF and RIF are reported for the transition probability parameters and for the six S_t-variables for which GCK's single-move algorithm yields the largest inefficiency factor. Transition probability parameters are especially relevant because their full conditional distribution depends on sufficient statistics that are functions of S (see Appendix B-D). Finally, computing time is measured as the time spent to simulate an entire S-sequence.
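The estimator (3.1) is straightforward to code. The sketch below (our illustration) uses the Parzen lag window, as in the applications; for an i.i.d. chain the estimate is close to one:

```python
import random

def parzen_weight(k, M):
    """Parzen lag-window weight for lag k with truncation M."""
    z = k / M
    if z <= 0.5:
        return 1.0 - 6.0 * z**2 + 6.0 * z**3
    if z <= 1.0:
        return 2.0 * (1.0 - z)**3
    return 0.0

def inefficiency_factor(x, M):
    """IF = 1 + 2 sum_k w_k rho(k), equation (3.1), from a chain x."""
    G = len(x)
    mean = sum(x) / G
    c0 = sum((v - mean) ** 2 for v in x) / G
    out = 1.0
    for k in range(1, min(M, G - 1) + 1):
        ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(G - k)) / G
        out += 2.0 * parzen_weight(k, M) * ck / c0
    return out

rng = random.Random(1)
v = inefficiency_factor([rng.gauss(0.0, 1.0) for _ in range(2000)], 10)
```

Values well above one quantify the loss relative to independent sampling; dividing two such estimates (and scaling by run times) gives the RIF of (3.2).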
Example 1: Markov switching growth model for US real GDP. We analyze the quarterly US real GDP series between 1980-I and 2009-IV. The data are handled in logarithms and multiplied by 100. The model decomposes the observed series y_t into a trend p_t and a cycle c_t according to:

y_t = p_t + c_t
p_t = p_{t-1} + m(S_t) + V_p^{1/2} a_{pt}
m(S_t) = µ [S_t + (1 − S_t) δ]
δ > 0 and r > 0.
Condition GK1 is immediate since the distributions π(s_{t,t+h-1}) and q_n(s_{t,t+h-1}) are defined on the discrete space ℵ, so they are bounded above by 1. Furthermore, the constant term q_0(s_{t,t+h-1}) equals 1/N^h, so the two ratios in GK1 are bounded above. Condition GK2 needs more attention. The weight that the adaptive proposal attaches to a given block is a function of the marginal and transition probabilities and can be written as:

q_n(s_{t,t+h-1}) = ∏_{k=0}^{h} q_n(s_{t+k} | s_{t+k-1}) / [Σ_{s_t} · · · Σ_{s_{t+h-1}} ∏_{k=0}^{h} q_n(s_{t+k} | s_{t+k-1})]   (A.1)
We first show that the marginal and transition probabilities q_n(s_t) and q_n(s_{t+1} | s_t) involved in equation (A.1) settle down asymptotically. The recursion for marginal probabilities (2.13) yields:

q_n(s_t) − q_{n+1}(s_t) = (1/n) [q_n(s_t) − 1(s_t^(n) = s_t)] = O(n^{-1})
whereas the one for transition probabilities (2.14) implies:

| q_n(s_{t+1} | s_t) − q_{n+1}(s_{t+1} | s_t) | = | [q_n(s_{t+1} | s_t) 1(s_t^(n) = s_t) − 1(s_{t+1}^(n) = s_{t+1}, s_t^(n) = s_t)] / [(n − 1) q_n(s_t) + 1(s_t^(n) = s_t)] | ≤ 1 / [(n − 1) q_n(s_t)]
Giordani and Kohn (2010, Lemma 1 in the Appendix) show that Condition GK1 implies uniform ergodicity, i.e. T_n(s_{t,t+h-1}, s_{t,t+h-1}) ≥ ε π(s_{t,t+h-1}) for some ε > 0. This ensures that q_n(s_t) is
strictly positive asymptotically and thus q_n(s_{t+1} | s_t) − q_{n+1}(s_{t+1} | s_t) = O(n^{-1}). Let us write q_n(s_{t+k} | s_{t+k-1}) − q_{n+1}(s_{t+k} | s_{t+k-1}) = ε_k with ε_k = O(n^{-1}), and let C_n denote the numerator of q_n(s_{t,t+h-1}) in equation (A.1). We have:

C_n − C_{n+1} = ∏_{k=0}^{h} q_n(s_{t+k} | s_{t+k-1}) − ∏_{k=0}^{h} q_{n+1}(s_{t+k} | s_{t+k-1})
= ∏_{k=0}^{h} q_n(s_{t+k} | s_{t+k-1}) − ∏_{k=0}^{h} (q_n(s_{t+k} | s_{t+k-1}) − ε_k)
= ε_0 ∏_{k=1}^{h} q_n(s_{t+k} | s_{t+k-1}) + ε_1 ∏_{k=0, k≠1}^{h} q_n(s_{t+k} | s_{t+k-1}) + · · · + ε_h ∏_{k=0}^{h-1} q_n(s_{t+k} | s_{t+k-1}) + O(n^{-2})
so C_n − C_{n+1} = ε_C = O(n^{-1}). Let D_n denote the denominator of q_n(s_{t,t+h-1}). Since D_n = Σ_{s_t} · · · Σ_{s_{t+h-1}} C_n, we also have D_n − D_{n+1} = ε_D = O(n^{-1}). Hence:

q_n(s_{t,t+h-1}) − q_{n+1}(s_{t,t+h-1}) = C_n/D_n − C_{n+1}/D_{n+1} = [C_n(D_n − ε_D) − (C_n − ε_C) D_n] / [D_n (D_n − ε_D)] = [−ε_D C_n + ε_C D_n] / [D_n (D_n − ε_D)] = O(n^{-1})
which proves Condition GK2.
B Prior and full conditional distributions for Example 1
We assume the following independent prior distributions: ϕ ∼ β(2.63, 2.63), (τ − τ_l)/(τ_u − τ_l) ∼ β(2.27, 7.54) with τ_l = 3 and τ_u = 70, µ ∼ N(.75, 1.0) × I_[0.25,1.2], δ ∼ N(0, 1.0) × I_[−2.0,0.5], p_00 ∼ β(3, 3), p_11 ∼ β(10, 2), V_c ∼ IG2(3.71, 6), V_p ∼ IG2(0.23, 6), where β(·,·) is the beta density, N(·,·) the normal, and IG2(·,·) the inverse gamma-2 (see Bauwens et al., 1999, p. 292). The index set I_R imposes the support R. The mean cycle amplitude and periodicity are set a priori to 0.5 and 18.5 quarters, with standard deviations 0.2 and 8.6 respectively, so as not to
be excessively assertive. The cycle periodicity is bounded between 3 and 70 quarters. GDP annual growth is centered around an annual mean of 3.0% in expansion phases and 0% in recessions, with a standard deviation of 1.0 percentage point. Beta distributions are used for the transition probability parameters p_00 and p_11. The prior for p_00 is consistent with a mean recession duration of two quarters, and that for p_11 with expansions lasting six quarters on average. The six degrees of freedom of the IG2 prior distribution for the variance parameters make mean and standard deviation equal a priori, so as to be rather vague about the variances. We detail below the sampling of the model parameters given the rest of the variables plus the observations, focusing first on the full conditionals p(ϕ|τ, V_c, c^T), p(τ|ϕ, V_c, c^T) and p(V_c|ϕ, τ, c^T). The first two verify:

p(ϕ|τ, V_c, c^T) ∝ p(c_1, c_2|ϕ, τ, V_c) ∏_{t=3}^{T} p(c_t|c^{t-1}, ϕ, τ, V_c) p(ϕ)

and

p(τ|ϕ, V_c, c^T) ∝ p(c_1, c_2|ϕ, τ, V_c) ∏_{t=3}^{T} p(c_t|c^{t-1}, ϕ, τ, V_c) p(τ)
The sampling of ϕ and τ from the distributions above is made with the adaptive rejection Metropolis scheme proposed by Gilks et al. (1995, 1997). For V_c, textbook formulae yield (see Bauwens, Lubrano and Richard, 1999, p. 304): p(V_c|ϕ, τ, c^T) = IG2(s_c*, v_c*) with:

s_c* = s_c0 + (c_1 c_2) Σ_c^{-1} (c_1 c_2)′ + Σ_{t=3}^{T} (c_t − 2ϕ cos(2π/τ) c_{t-1} + ϕ² c_{t-2})²
v_c* = v_c0 + T

where s_c0, v_c0 are the hyperparameters in the prior p(V_c) = IG2(s_c0, v_c0), and Σ_c is the variance-covariance matrix of (c_1, c_2) given ϕ and τ re-scaled by the innovation variance V_c, i.e. Σ_c ≡ V[(c_1, c_2)|ϕ, τ, V_c]/V_c. Regarding µ, δ, and V_p given p^T and S, the first full conditional p(µ|δ, V_p, p^T, S) takes the form N(µ*, V_µ*) × I_[0.25,1.2], where µ* and V_µ* are such that:
à Vµ? = à µ? =
T 1 X 1 + (St + (1 − St )δ)2 Vµ0 Vp t=2
!−1
! T µ0 1 X + ∆pt (St + (1 − St )δ) Vµ? Vµ0 Vp t=2
and µ0 and Vµ0 represent the mean and variance parameters of the µ-prior. The full conditional for δ, i.e. p(δ|µ, Vp , pT , S), is similarly obtained as truncated normal N (δ ? , Vδ? ) × I[−2.0,0.5] with: Ã Vδ? = Ã δ? =
T 1 1 X + (1 − St )2 µ2 Vδ0 Vp t=2
!−1
! T X 1 δ0 + (∆pt − µSt )(1 − St )µ Vδ? Vδ0 Vp t=2
where δ_0 and V_δ0 are the mean and variance of the δ-prior. The full conditional distribution of V_p is an IG2(s_p*, v_p*) with:

s_p* = s_p0 + Σ_{t=2}^{T} (∆p_t − µ[S_t + (1 − S_t)δ])²
v_p* = v_p0 + T − 1

where s_p0, v_p0 are the hyperparameters in the prior p(V_p) = IG2(s_p0, v_p0). Finally, the full conditional distribution of the transition probabilities p_00 and p_11, i.e. p(p_00, p_11|S), is such that:

p(p_00, p_11|S) ∝ p(S_1|p_00, p_11) ∏_{t=2}^{T} p(S_t|S_{t-1}, p_00, p_11) p(p_00, p_11)
∝ p(S_1|p_00, p_11) p_00^{a_00−1} (1 − p_00)^{b_00−1} p_11^{a_11−1} (1 − p_11)^{b_11−1}
with:

p(S_1|p_00, p_11) = [(1 − p_11)(1 − S_1) + (1 − p_00)S_1] / (2 − p_00 − p_11)
a_00 = α_0 + Σ_{t=2}^{T} (1 − S_{t-1})(1 − S_t)
b_00 = λ_0 + Σ_{t=2}^{T} (1 − S_{t-1}) S_t
a_11 = α_1 + Σ_{t=2}^{T} S_{t-1} S_t
b_11 = λ_1 + Σ_{t=2}^{T} S_{t-1} (1 − S_t)
and α_i and λ_i are the hyperparameters of the beta-prior distribution for p_ii, i = 0, 1. The terms p_ii^{a_ii−1}(1 − p_ii)^{b_ii−1}, i = 0, 1, correspond to the kernel of a beta density with parameters a_ii and b_ii. To sample p_00 and p_11 we resort to an MH scheme with proposal β(a_00, b_00) for p_00 and β(a_11, b_11) for p_11. The acceptance probability depends on the weight that the candidates and the previously sampled values for p_00 and p_11 put on S_1.
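For a two-state chain this update can be sketched as follows (our illustration; `alpha` and `lam` hold the beta hyperparameters (α_i, λ_i), here set to the paper's β(3, 3) and β(10, 2) priors). The beta proposals match the full conditional up to the p(S_1|p_00, p_11) factor, which therefore drives the acceptance ratio:

```python
import random

def p_S1(S1, p00, p11):
    """Stationary probability of the initial state S_1."""
    return ((1.0 - p11) * (1.0 - S1) + (1.0 - p00) * S1) / (2.0 - p00 - p11)

def update_transition_probs(S, p_old, alpha, lam, rng):
    """One MH update of (p00, p11) given a 0/1 sequence S."""
    a00 = alpha[0] + sum((1 - S[t - 1]) * (1 - S[t]) for t in range(1, len(S)))
    b00 = lam[0] + sum((1 - S[t - 1]) * S[t] for t in range(1, len(S)))
    a11 = alpha[1] + sum(S[t - 1] * S[t] for t in range(1, len(S)))
    b11 = lam[1] + sum(S[t - 1] * (1 - S[t]) for t in range(1, len(S)))
    cand = (rng.betavariate(a00, b00), rng.betavariate(a11, b11))
    # only the initial-condition term survives in the acceptance ratio
    ratio = p_S1(S[0], cand[0], cand[1]) / p_S1(S[0], p_old[0], p_old[1])
    return cand if rng.random() < min(1.0, ratio) else p_old

rng = random.Random(3)
S = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
p = update_transition_probs(S, (0.5, 0.5), (3.0, 10.0), (3.0, 2.0), rng)
```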
C Prior and full conditional distributions for Example 2
We assume the following independent prior distributions: φ ∼ N(0, 4) × I_(−1,1), p_00 ∼ β(4.6, 2.3), p_11 ∼ β(4.6, 2.3), (α_c − α_c^l)/(α_c^u − α_c^l) ∼ β(1.6, 4.0) with α_c^l = 0 and α_c^u = 0.2, V_c ∼ IG2(0.008, 6), and V_p ∼ IG2(0.004, 6). The prior distribution of φ is set as in Engle and Kim (1999). The prior distributions of α_c, V_c and V_p are close to the Engle and Kim (1999) results, whereas the prior means of p_00 and p_11 are set to two-thirds. We now describe the Gibbs sampling of the model parameters. The full conditional p(φ|α_c, V_c, c^T, S) is known up to the initial condition:

p(φ|α_c, V_c, c^T, S) ∝ p(c_1|φ, α_c, V_c, S) ∏_{t=2}^{T} p(c_t|c_{t-1}, φ, α_c, V_c, S) p(φ)

since, conditionally on c_1, S_1, α_c, V_c, the product ∏_{t=2}^{T} p(c_t|c_{t-1}, φ, α_c, V_c, S) p(φ) is truncated normal with
hyperparameters φ*, V_φ* such that:

V_φ* = [1/V_φ0 + (1/V_c) Σ_{t=2}^{T} c_{t-1}² / (S_{t-1} + α_c(1 − S_{t-1}))]^{-1}

φ* = V_φ* [φ_0/V_φ0 + (1/V_c) Σ_{t=2}^{T} c_t c_{t-1} / ((S_{t-1} + α_c(1 − S_{t-1}))^{1/2} (S_t + α_c(1 − S_t))^{1/2})]
To sample φ we thus resort to an MH scheme with proposal N(φ*, V_φ*) I_(−1,1) and acceptance probability given by:

min{1, [(1 − φ_c²)/(1 − φ_o²)]^{1/2} exp{− [c_1² / (2 V_c (S_1 + (1 − S_1)α_c))] (φ_o² − φ_c²)}}
where φ_c is the candidate and φ_o the previously sampled value. The kernel of the full conditional of α_c given c^T, S, φ, V_c verifies:

p(α_c|c^T, S, φ, V_c) ∝ p(c^T|α_c, S, φ, V_c) p(α_c)
which is straightforward to evaluate. Use is made of the ARMS algorithm of Gilks et al. (1995, 1997). The full conditional distributions of V_c and V_p are inverse gamma-2 with scale parameters:

s_c* = s_c0 + c_1²(1 − φ²)/(S_1 + (1 − S_1)α_c) + Σ_{t=2}^{T} (c_t/(S_t + (1 − S_t)α_c)^{1/2} − φ c_{t-1}/(S_{t-1} + (1 − S_{t-1})α_c)^{1/2})²

s_p* = s_p0 + Σ_{t=2}^{T} ∆p_t²
and degrees of freedom v_c* = v_c0 + T and v_p* = v_p0 + T − 1, as in Example 1. The transition probabilities are drawn as previously.
D Prior and full conditional distributions for Example 3
We assume the following independent prior distributions: ω ∼ β(2, 16), and V_λ ∼ IG2(48, 6). The prior distribution of V_λ allows large jumps in volatility. The full conditional distribution of ω is such that:

p(ω|K_1) ∝ ∏_{t=1}^{T} p(K_{1t}|ω) p(ω)
As the prior for ω is conjugate we get p(ω|K_1) = β(a*, b*) with hyperparameters:

a* = a_0 + Σ_{t=1}^{T} K_{1t}
b* = b_0 + T − Σ_{t=1}^{T} K_{1t}
where a_0 and b_0 are the hyperparameters of the ω-prior. Finally, from (3.5) the full conditional distribution p(V_λ|λ^T, K_1) is an IG2 density with hyperparameters:

s_λ* = s_λ0 + Σ_{t=2}^{T} ∆λ_t²
v_λ* = v_λ0 + Σ_{t=2}^{T} K_{1t}

where s_λ0, v_λ0 are the V_λ-prior hyperparameters.
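The conjugate update for ω amounts to counting the jump indicators; a small sketch (ours):

```python
def omega_posterior(K1, a0, b0):
    """Beta posterior hyperparameters: a* = a0 + sum K1t, b* = b0 + T - sum K1t."""
    s = sum(K1)
    return a0 + s, b0 + len(K1) - s
```

With the paper's β(2, 16) prior and four observed indicators [1, 0, 1, 1], the posterior is β(5, 17).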
References

Andrieu C. and Moulines E. (2006), "On the ergodicity properties of some adaptive MCMC algorithms", The Annals of Applied Probability, 16, 3, 1462-1505.

Atchadé Y. and Rosenthal J. (2005), "On adaptive Markov chain Monte Carlo algorithms", Bernoulli, 11, 815-828.

Bauwens L., Lubrano M. and Richard J. (1999), Bayesian Inference in Dynamic Econometric Models, Oxford University Press.

Besag J., Green P., Higdon D., and Mengersen K. (1995), "Bayesian computation and stochastic systems" (with discussion), Statistical Science, 10, 3-66.

Carter C. and Kohn R. (1994), "On Gibbs sampling for state space models", Biometrika, 81, 541-553.

Carter C. and Kohn R. (1997), "Semiparametric Bayesian inference for time series with mixed spectra", Journal of the Royal Statistical Society, B, 59, 1, 255-268.

Casella G. and George E.I. (1992), "Explaining the Gibbs sampler", The American Statistician, 46, 3, 167-174.

Engle C. and Kim C.-J. (1999), "The long-run US/UK real exchange rate", Journal of Money, Credit, and Banking, 31, 3, 335-356.

Fama E.F. and French K.R. (1993), "Common risk factors in the returns on stocks and bonds", Journal of Financial Economics, 33, 1, 3-56.

Fruhwirth-Schnatter S. (2006), Finite Mixture and Markov Switching Models, New York, Springer.

Gerlach R., Carter C., and Kohn R. (2000), "Efficient Bayesian inference for dynamic mixture models", Journal of the American Statistical Association, 95, 819-828.

Gilks W., Best N. and Tan K. (1995), "Adaptive rejection Metropolis sampling with Gibbs sampling", Applied Statistics, 44, 4, 455-472.

Gilks W., Neal R., Best N. and Tan K. (1997), "Corrigendum: Adaptive rejection Metropolis sampling", Applied Statistics, 46, 4, 541-542.

Giordani P. and Kohn R. (2008), "Efficient Bayesian inference for multiple change-point and mixture innovation models", Journal of Business & Economic Statistics, 26, 1, 66-77.

Giordani P. and Kohn R. (2010), "Adaptive independent Metropolis-Hastings by fast estimation of mixtures of normals", Journal of Computational & Graphical Statistics, 19, 2, 243-259.

Giordani P., Kohn R., and van Dijk D. (2007), "A unified approach to nonlinearity, structural change, and outliers", Journal of Econometrics, 137, 112-133.

Haario H., Saksman E., and Tamminen J. (2001), "An adaptive Metropolis algorithm", Bernoulli, 7, 223-242.

Hamilton J.D. (1989), "A new approach to the economic analysis of nonstationary time series and the business cycle", Econometrica, 57, 2, 357-384.

Kim C.-J. and Nelson C.R. (1999), State-Space Models with Regime Switching: Classical and Gibbs Sampling Approaches with Applications, MIT Press, Cambridge.

Kim C.-J., Shephard N., and Chib S. (1998), "Stochastic volatility: likelihood inference and comparison with ARCH models", The Review of Economic Studies, 65, 3, 361-393.

Liu J.S., Wong W.H., and Kong A. (1994), "Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes", Biometrika, 81, 1, 27-40.

Nott D.J. and Kohn R. (2005), "Adaptive sampling for Bayesian model selection", Biometrika, 92, 4, 747-763.

Roberts G.O. and Rosenthal J. (2007), "Coupling and ergodicity of adaptive MCMC", Journal of Applied Probability, 44, 458-475.

Roberts G.O. and Rosenthal J. (2009), "Examples of adaptive MCMC", Journal of Computational & Graphical Statistics, 18, 2, 349-367.

Roberts G.O. and Sahu S.K. (1997), "Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler", Journal of the Royal Statistical Society, B, 59, 2, 291-337.

Scott S.L. (2002), "Bayesian methods for hidden Markov models: recursive computing in the 21st century", Journal of the American Statistical Association, 97, 457, 337-351.

Seewald W. (1992), "Discussion on Parameterization issues in Bayesian inference" (by S.E. Hills and A.F.M. Smith). In Bayesian Statistics 4 (eds J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith), pp. 241-243. Oxford: Oxford University Press.

Tierney L. (1994), "Markov chains for exploring posterior distributions", Annals of Statistics, 22, 4, 1701-1762.