
On Monte Carlo methods for Bayesian multivariate regression models with heavy-tailed errors

Vivekananda Roy, Department of Statistics, Iowa State University
James P. Hobert, Department of Statistics, University of Florida

December 2009

Abstract. We consider Bayesian analysis of data from multivariate linear regression models whose errors have a distribution that is a scale mixture of normals. Such models are used to analyze data on financial returns, which are notoriously heavy-tailed. Let π denote the intractable posterior density that results when this regression model is combined with the standard non-informative prior on the unknown regression coefficients and scale matrix of the errors. Roughly speaking, the posterior is proper if and only if n ≥ d + k, where n is the sample size, d is the dimension of the response, and k is the number of covariates. We provide a method of making exact draws from π in the special case where n = d + k, and we study Markov chain Monte Carlo (MCMC) algorithms that can be used to explore π when n > d + k. In particular, we show how the Haar PX-DA technology studied in Hobert and Marchev (2008) can be used to improve upon Liu's (1996) data augmentation (DA) algorithm. Indeed, the new algorithm that we introduce is theoretically superior to the DA algorithm, yet equivalent to DA in terms of computational complexity. Moreover, we analyze the convergence rates of these MCMC algorithms in the important special case where the regression errors have a Student's t distribution. We prove that, under conditions on n, d, k, and the degrees of freedom of the t distribution, both algorithms converge at a geometric rate. These convergence rate results are important from a practical standpoint because geometric ergodicity guarantees the existence of central limit theorems, which are essential for the calculation of valid asymptotic standard errors for MCMC based estimates.

Key words and phrases. Data augmentation algorithm, Drift condition, Geometric ergodicity, Markov chain, Minorization condition, Robust multivariate regression

1 Introduction

Let y_1, y_2, . . . , y_n be d-dimensional random vectors satisfying the linear regression model

\[
y_i = \beta^T x_i + \varepsilon_i\,, \tag{1}
\]

where β is a k × d matrix of unknown regression coefficients, the x_i's are known k × 1 covariate vectors, and the d-dimensional error vectors, ε_1, . . . , ε_n, are iid with common density

\[
f_H(\varepsilon) = \int_0^{\infty} \frac{\delta^{d/2}}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\Big\{-\frac{\delta}{2}\,\varepsilon^T\Sigma^{-1}\varepsilon\Big\}\, dH(\delta)\,,
\]

where H(·) is the distribution function of some non-negative random variable. In practice, H will be fixed, but, for the time being, we leave it unspecified because many of our results hold true for any H. The density f_H is a multivariate scale mixture of normals and it belongs to the class of elliptically symmetric distributions. It can be made heavy-tailed by choosing H appropriately. In particular, when H is the distribution function corresponding to a Gamma(ν/2, ν/2) random variable, f_H becomes the multivariate Student's t density with ν > 0 degrees of freedom, which, aside from a normalizing constant, is given by (1 + ν^{-1} ε^T Σ^{-1} ε)^{-(d+ν)/2}. These heavy-tailed error distributions are often used when modelling financial data (see, e.g., Affleck-Graves and McDonald, 1989; Bradley and Taqqu, 2003; Cademartori, Romo, Campos and Galea, 2003; Fama, 1965; Zhou, 1993).

Let y denote the n × d matrix whose ith row is y_i^T, let X stand for the n × k matrix whose ith row is x_i^T, and, finally, let ε represent the n × d matrix whose ith row is ε_i^T. Using this notation, we can state the n equations in (1) more succinctly as y = Xβ + ε. We assume throughout the paper that X has full column rank. The density of y is given by

\[
f(y \mid \beta,\Sigma) = \prod_{i=1}^{n}\int_0^{\infty}\frac{\delta^{d/2}}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\Big\{-\frac{\delta}{2}\,\big(y_i - \beta^T x_i\big)^T\Sigma^{-1}\big(y_i - \beta^T x_i\big)\Big\}\, dH(\delta)\,. \tag{2}
\]
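For readers who want to experiment numerically, the scale-mixture representation translates directly into a simulation recipe: draw δ from H and then draw the error from N_d(0, Σ/δ). The sketch below is ours, not part of the paper; it uses Python with NumPy and assumes the Gamma(ν/2, ν/2) mixing distribution, so the simulated errors are multivariate Student's t.

```python
import numpy as np

def simulate_t_regression(X, beta, Sigma, nu, rng=None):
    """Simulate y (n x d) from model (1) with Student's t errors.

    Each error uses the scale mixture: delta_i ~ Gamma(nu/2, rate nu/2),
    then eps_i ~ N_d(0, Sigma / delta_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    d = Sigma.shape[0]
    # numpy's gamma takes a scale parameter, so rate nu/2 corresponds to scale 2/nu
    delta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T
    z = rng.standard_normal((n, d))
    eps = (z @ L.T) / np.sqrt(delta)[:, None]          # row i ~ N_d(0, Sigma/delta_i)
    return X @ beta + eps                              # y_i^T = x_i^T beta + eps_i^T

# Small example: n = 30, d = 2, k = 3, nu = 5 (all values illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 2))])
beta_true = rng.standard_normal((3, 2))
Sigma_true = np.array([[1.0, 0.3], [0.3, 0.5]])
y = simulate_t_regression(X, beta_true, Sigma_true, nu=5, rng=rng)
```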

We consider a Bayesian analysis using the standard non-informative prior for multivariate location-scale problems given by π(β, Σ) ∝ |Σ|^{-(d+1)/2}. Of course, whenever an improper prior is used, one must check that the resulting posterior is proper. Fernandez and Steel (1999) provide necessary and sufficient conditions for propriety (see Section 2). Roughly speaking, these authors show that the posterior is proper if and only if n ≥ d + k. Assuming it is proper, the posterior density is characterized by

\[
\pi(\beta,\Sigma \mid y) \propto f(y \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\,.
\]

Unfortunately, this posterior density is intractable in the sense that posterior expectations, which are required for Bayesian inference, cannot be computed in closed form. In the special case where n = d + k, we provide a simple algorithm for making exact draws from π(β, Σ|y), which allows one to use classical Monte Carlo methods (based on iid simulations) to approximate posterior expectations. Unfortunately, this method breaks down when n > d + k, and in these cases we must resort to Markov chain Monte Carlo (MCMC) methods. The first MCMC method developed for this problem was the data augmentation (DA) algorithm of Liu (1996). Our main result in this paper is a proof that, in the important special case where the ε_i's have a multivariate Student's t distribution, the Markov chain that drives Liu's (1996) algorithm converges at a geometric rate. This result is important from a practical standpoint because it guarantees the existence of central limit theorems (CLTs) for ergodic averages, which in turn allows for the calculation of valid asymptotic standard errors for the MCMC estimates of posterior expectations (Flegal, Haran and Jones, 2008; Jones, Haran, Caffo and Neath, 2006; Jones and Hobert, 2001). We also present an alternative MCMC algorithm, called the Haar PX-DA algorithm, that is computationally equivalent to Liu's (1996) algorithm, but is theoretically superior.

The remainder of this paper is organized as follows. In Section 2, we present an alternative proof of the sufficiency half of Fernandez and Steel's (1999) propriety result, which yields a simple method of making exact draws from the posterior, π(β, Σ|y), in the special case where n = d + k. In Section 3, we consider MCMC methods for simulating from π(β, Σ|y) in cases where n > d + k. In particular, we describe the DA algorithm of Liu (1996) and we show how the Haar PX-DA technology studied in Hobert and Marchev (2008) can be used to improve upon it. Finally, in Section 4 we investigate the rate of convergence of these two MCMC algorithms in the important special case where the regression errors have a Student's t distribution. Our main result is that, for certain configurations of (n, d, k, ν), both of the algorithms converge at a geometric rate.

2 An Alternative Proof of Propriety and an Exact Sampling Algorithm

As mentioned in the introduction, Fernandez and Steel (1999) studied the propriety of the posterior density that results when the multivariate regression likelihood, f(y|β, Σ), is combined with the standard non-informative prior, π(β, Σ). In order to state their result precisely, define

\[
c(y) = \int_{W}\int_{\mathbb{R}^{dk}} f(y \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\, d\beta\, d\Sigma\,,
\]

where W denotes the space of d × d positive definite matrices. (Note that dependence on X is being suppressed in the notation.) For a fixed data set, y ∈ R^{dn}, the posterior distribution is proper if and only if c(y) < ∞. (Note that c(y) is always strictly positive.) Of course, when c(y) is finite, the posterior density is given by

\[
\pi(\beta,\Sigma \mid y) = f(y \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\,/\,c(y)\,.
\]

Fernandez and Steel (1999) sketched a proof of the following result.

Proposition 1. Let φ denote Lebesgue measure on R^{dn}. If n ≥ d + k, then there exists a set S ⊂ R^{dn} (that depends on X) with φ(S) = 0 such that, if y ∉ S, then c(y) < ∞. Conversely, if n < d + k, then c(y) = ∞.

In words, Proposition 1 says that, if there are too few data vectors (n < d + k), then the posterior is improper, and, if there is "enough" data (n ≥ d + k), then, with probability one, the observed data vector will result in a proper posterior. In this section, we provide the details of an alternative proof of the sufficiency part of Proposition 1 that leads to a method of making exact draws from the posterior density in the special case where n = d + k. Our proof is based on the standard missing data model that includes both the regression data, y, and the missing observations from the mixing distribution H. Suppose that, conditional on (β, Σ), {(y_i, q_i)}_{i=1}^n are iid pairs such that

\[
y_i \mid q_i, \beta, \Sigma \sim N_d\big(\beta^T x_i,\ \Sigma/q_i\big) \qquad\text{and}\qquad q_i \mid \beta, \Sigma \sim H(\cdot)\,.
\]

Now write the vector of missing data as q = (q_1, . . . , q_n) and assume (for convenience) that the probability measure associated with H has a density with respect to Lebesgue measure, call it h. Then we can write the joint density of these iid pairs as

\[
f(y, q \mid \beta, \Sigma) = \prod_{i=1}^{n} f(y_i \mid q_i, \beta, \Sigma)\, h(q_i)\,, \tag{3}
\]

and the corresponding y-marginal is given by

\[
f(y \mid \beta, \Sigma) = \int_{\mathbb{R}^n_+} f(y, q \mid \beta, \Sigma)\, dq
= \prod_{i=1}^{n}\int_0^{\infty}\frac{\delta^{d/2}}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\Big\{-\frac{\delta}{2}\big(y_i - \beta^T x_i\big)^T\Sigma^{-1}\big(y_i - \beta^T x_i\big)\Big\}\, h(\delta)\, d\delta\,, \tag{4}
\]

which is the same as (2). This shows that the marginal density of y has not been altered by the introduction of q into the model. Using (4) and Fubini's theorem, we have

\[
c(y) = \int_{W}\int_{\mathbb{R}^{dk}}\bigg[\int_{\mathbb{R}^n_+} f(y, q \mid \beta, \Sigma)\, dq\bigg]\pi(\beta,\Sigma)\, d\beta\, d\Sigma
= \int_{\mathbb{R}^n_+}\int_{W}\int_{\mathbb{R}^{dk}} f(y, q \mid \beta, \Sigma)\,\pi(\beta,\Sigma)\, d\beta\, d\Sigma\, dq\,, \tag{5}
\]

where R_+ := (0, ∞). As we will see, this representation simplifies things since two of the three integrals on the right-hand side of (5) can be evaluated in closed form.

Alternative proof of the sufficiency part of Proposition 1. First, note that

\[
\begin{aligned}
f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma)
&= \Bigg[\frac{1}{(2\pi)^{\frac{nd}{2}}|\Sigma|^{\frac n2}}\prod_{i=1}^{n} q_i^{\frac d2}\exp\Big\{-\frac{q_i}{2}(\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)\Big\}\,h(q_i)\Bigg]\,|\Sigma|^{-\frac{d+1}{2}}\\
&= \Bigg[\frac{|Q|^{-\frac d2}}{(2\pi)^{\frac{nd}{2}}|\Sigma|^{\frac{n+d+1}{2}}}\exp\Big\{-\frac12\sum_{i=1}^{n} q_i(\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)\Big\}\Bigg]\Bigg[\prod_{i=1}^{n} h(q_i)\Bigg]\,,
\end{aligned}
\]

where Q denotes an n × n diagonal matrix whose ith diagonal element is q_i^{-1}. Now

\[
\begin{aligned}
\sum_{j=1}^{n} q_j(\beta^T x_j - y_j)^T\Sigma^{-1}(\beta^T x_j - y_j)
&= \sum_{j=1}^{n} q_j\,\mathrm{tr}\big[(\beta^T x_j - y_j)^T\Sigma^{-1}(\beta^T x_j - y_j)\big]\\
&= \sum_{j=1}^{n} q_j\,\mathrm{tr}\big[\Sigma^{-1}(\beta^T x_j - y_j)(\beta^T x_j - y_j)^T\big]\\
&= \mathrm{tr}\Big[\Sigma^{-1}\sum_{j=1}^{n} q_j(\beta^T x_j - y_j)(\beta^T x_j - y_j)^T\Big]\\
&= \mathrm{tr}\Big[\Sigma^{-1}\big(\beta^T X^T - y^T\big)Q^{-1}\big(\beta^T X^T - y^T\big)^T\Big]\\
&= \mathrm{tr}\Big[\Sigma^{-1}\big(\beta^T X^T Q^{-1} X\beta - \beta^T X^T Q^{-1} y - y^T Q^{-1} X\beta + y^T Q^{-1} y\big)\Big]\\
&= \mathrm{tr}\Big[\Sigma^{-1}\Big\{\big(\beta^T - \mu^T\big)\Omega^{-1}\big(\beta^T - \mu^T\big)^T - \mu^T\Omega^{-1}\mu + y^T Q^{-1} y\Big\}\Big]\\
&= \mathrm{tr}\Big[\Sigma^{-1}\big(\beta^T - \mu^T\big)\Omega^{-1}\big(\beta^T - \mu^T\big)^T\Big] + \mathrm{tr}\Big[\Sigma^{-1}\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)\Big]\,,
\end{aligned}
\]

where Ω = (X^T Q^{-1} X)^{-1} and µ = (X^T Q^{-1} X)^{-1} X^T Q^{-1} y. Thus, as a function of β,

\[
f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma) \propto \exp\Big\{-\frac12\,\mathrm{tr}\Big[\Sigma^{-1}\big(\beta^T - \mu^T\big)\Omega^{-1}\big(\beta^T - \mu^T\big)^T\Big]\Big\}\,.
\]

Using results from Appendix A concerning the matrix normal density, we have

\[
\int_{\mathbb{R}^{dk}}\exp\Big\{-\frac12\,\mathrm{tr}\Big[\Sigma^{-1}\big(\beta^T - \mu^T\big)\Omega^{-1}\big(\beta^T - \mu^T\big)^T\Big]\Big\}\, d\beta^T = (2\pi)^{\frac{kd}{2}}\,|\Sigma|^{\frac k2}\,|\Omega|^{\frac d2}\,. \tag{6}
\]

Thus, we have shown that

\[
\int_{\mathbb{R}^{dk}} f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\, d\beta^T
= \frac{|\Omega|^{\frac d2}\,|Q|^{-\frac d2}}{(2\pi)^{\frac{d(n-k)}{2}}\,|\Sigma|^{\frac{n-k+d+1}{2}}}
\exp\Big\{-\frac12\,\mathrm{tr}\Big[\Sigma^{-1}\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)\Big]\Big\}\prod_{i=1}^{n} h(q_i)\,. \tag{7}
\]

Now recall the assumption that n ≥ d + k. We first establish that, for φ-almost all y, the matrix y^T Q^{-1} y − µ^T Ω^{-1} µ is positive definite. We use the following representation:

\[
\begin{aligned}
y^T Q^{-1} y - \mu^T\Omega^{-1}\mu
&= y^T Q^{-1} y - y^T Q^{-1} X (X^T Q^{-1} X)^{-1} X^T Q^{-1} y\\
&= y^T Q^{-\frac12}\Big[I - Q^{-\frac12} X (X^T Q^{-1} X)^{-1} X^T Q^{-\frac12}\Big] Q^{-\frac12} y\,.
\end{aligned}
\]

Since I − Q^{-1/2} X (X^T Q^{-1} X)^{-1} X^T Q^{-1/2} is idempotent, the matrix above is positive semi-definite. We can now establish positive definiteness by showing that its determinant is nonzero. To this end, let Λ be the n × (d + k) augmented matrix (X : y). Then,

\[
\Lambda^T Q^{-1}\Lambda =
\begin{pmatrix} X^T \\ y^T \end{pmatrix} Q^{-1} \begin{pmatrix} X & y \end{pmatrix}
=
\begin{pmatrix} X^T Q^{-1} X & X^T Q^{-1} y \\ y^T Q^{-1} X & y^T Q^{-1} y \end{pmatrix}.
\]

Therefore,

\[
|\Lambda^T Q^{-1}\Lambda| = |X^T Q^{-1} X|\,\big|y^T Q^{-1} y - y^T Q^{-1} X (X^T Q^{-1} X)^{-1} X^T Q^{-1} y\big|
= |X^T Q^{-1} X|\,\big|y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big|
= |\Omega|^{-1}\big|y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big|\,. \tag{8}
\]

Let S denote the set of ys that lead to linear dependencies among the columns of Λ, and note that S has φ-measure zero. Throughout the remainder of this proof, we assume that y ∉ S. Now since y ∉ S and n ≥ d + k, Λ has full column rank, which implies that |Λ^T Q^{-1}Λ| > 0. It now follows from (8) that |y^T Q^{-1} y − µ^T Ω^{-1}µ| > 0 and hence y^T Q^{-1} y − µ^T Ω^{-1}µ is positive definite. We may now apply the results from Appendix A concerning the inverse Wishart distribution to conclude that

\[
\int_{W}\int_{\mathbb{R}^{dk}} f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\, d\beta^T d\Sigma
= \frac{|\Omega|^{\frac d2}\Big[\prod_{j=1}^{d}\Gamma\big(\tfrac12(n-k+1-j)\big)\Big]\Big[\prod_{i=1}^{n} h(q_i)\Big]}
{\pi^{\frac{d(2(n-k)-d+1)}{4}}\,|Q|^{\frac d2}\,\big|y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big|^{\frac{n-k}{2}}}\,. \tag{9}
\]

Now suppose that n = d + k. Then, using (8), we have

\[
\int_{W}\int_{\mathbb{R}^{dk}} f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\, d\beta^T d\Sigma
= \frac{\Big[\prod_{i=1}^{n} h(q_i)\Big]\Big[\prod_{j=1}^{d}\Gamma\big(\tfrac12(d+1-j)\big)\Big]}
{\pi^{\frac{d(d+1)}{4}}\,\big|\Lambda^T Q^{-1}\Lambda\big|^{\frac d2}\,|Q|^{\frac d2}}
= \frac{\Big[\prod_{j=1}^{d}\Gamma\big(\tfrac12(d+1-j)\big)\Big]\Big[\prod_{i=1}^{n} h(q_i)\Big]}
{\pi^{\frac{d(d+1)}{4}}\,|\Lambda|^{d}}\,.
\]

Note that, as a function of q, the expression above is proportional to \prod_{i=1}^{n} h(q_i). Hence,

\[
\int_{\mathbb{R}^n_+}\int_{W}\int_{\mathbb{R}^{dk}} f(y, q \mid \beta,\Sigma)\,\pi(\beta,\Sigma)\, d\beta^T d\Sigma\, dq
= \frac{\prod_{j=1}^{d}\Gamma\big(\tfrac12(d+1-j)\big)}{\pi^{\frac{d(d+1)}{4}}\,|\Lambda|^{d}}\,, \tag{10}
\]

which is finite. It follows from (5) that c(y) < ∞, so the posterior is proper when n = d + k.

The calculations above also show that, when n = d + k, the marginal posterior density of q satisfies π(q | y) ∝ \prod_{i=1}^{n} h(q_i); that is, a posteriori, the q_i are iid with density h. Combining this fact with the conditional densities identified in the proof, an exact draw from π(β, Σ|y) can be made by first drawing q_1, . . . , q_n iid from h, then drawing Σ from π(Σ|q, y), which is the inverse Wishart density appearing above, and finally drawing β from π(β|Σ, q, y), which is matrix normal. Unfortunately, this simplification is not available when n > d + k. In this case, we must resort to MCMC algorithms and this is the topic of the next section.
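To make the exact-sampling recipe concrete, here is a hedged sketch (ours, not from the original text) for the Student's t case, i.e., h = Gamma(ν/2, ν/2). It draws the q_i iid from h, then Σ from the inverse Wishart conditional, then β from the matrix normal conditional. We map the paper's IW_d(m, Θ) parameterization onto scipy.stats.invwishart by passing y^T Q^{-1} y − µ^T Ω^{-1} µ (that is, Θ^{-1}) as the scale argument; that correspondence is our reading of the two density conventions.

```python
import numpy as np
from scipy.stats import invwishart

def exact_draw(y, X, nu, rng=None):
    """One exact draw from pi(beta, Sigma | y) when n = d + k (Student's t errors).

    Assumes h = Gamma(nu/2, nu/2), so that pi(q | y) is the product of the h's and
    the q_i can be drawn iid from h.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = y.shape
    k = X.shape[1]
    assert n == d + k, "the exact-sampling simplification requires n = d + k"

    # Step 1: q_i iid from h = Gamma(nu/2, rate nu/2)  (numpy's gamma takes a scale)
    q = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    Qinv = np.diag(q)                                   # Q has 1/q_i on its diagonal
    Omega = np.linalg.inv(X.T @ Qinv @ X)
    mu = Omega @ X.T @ Qinv @ y                         # k x d
    A = y.T @ Qinv @ y - mu.T @ (X.T @ Qinv @ y)        # = y'Q^{-1}y - mu' Omega^{-1} mu

    # Step 2: Sigma | q, y ~ IW_d(n - k, A^{-1}); scipy's scale argument is A itself
    Sigma = np.atleast_2d(invwishart.rvs(df=n - k, scale=A, random_state=rng))

    # Step 3: beta^T | Sigma, q, y ~ N_{d,k}(mu^T, Sigma, Omega)
    Ls, Lo = np.linalg.cholesky(Sigma), np.linalg.cholesky(Omega)
    betaT = mu.T + Ls @ rng.standard_normal((d, k)) @ Lo.T
    return betaT.T, Sigma                               # beta is k x d
```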

3 The DA and Haar PX-DA Algorithms

Liu (1996) developed a DA algorithm for exploring π(β, Σ|y) that is based on the missing data model from the previous section. The algorithm simulates a Markov chain, {(β_m, Σ_m)}_{m=0}^∞, with Markov transition density

\[
k(\beta',\Sigma' \mid \beta,\Sigma) = \int_{\mathbb{R}^n_+} \pi(\beta',\Sigma' \mid q, y)\,\pi(q \mid \beta,\Sigma, y)\, dq\,.
\]

Note that, while the Markov transition density does depend on the data, y, and the matrix X, these quantities are fixed, so this dependence is suppressed in the notation. Simulating this Markov chain is straightforward. Indeed, given the value of the current state, (β_m, Σ_m), we move to the next state, (β_{m+1}, Σ_{m+1}), by first drawing q from π(q|β_m, Σ_m, y) and then using this new value of q to draw (β_{m+1}, Σ_{m+1}) from π(β, Σ|q, y). Of course, we already know from the previous section how to draw from π(β, Σ|q, y) by exploiting the factorization π(β, Σ|q, y) = π(β|Σ, q, y) π(Σ|q, y). As for drawing from π(q|β, Σ, y), note that

\[
\pi(q \mid \beta,\Sigma, y) = \frac{\pi(q,\beta,\Sigma \mid y)}{\int_{\mathbb{R}^n_+}\pi(q,\beta,\Sigma \mid y)\, dq} \propto f(y, q \mid \beta,\Sigma)\,.
\]

Hence, it follows from (3) that, conditional on (β, Σ, y), the elements of q are independent and q_i has density proportional to

\[
\pi(q_i \mid \beta,\Sigma, y) \propto h(q_i)\, q_i^{d/2}\exp\Big\{-\frac{q_i}{2}\,(\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)\Big\}\,. \tag{12}
\]

Putting all of this together, a single iteration of the DA algorithm uses the current state (β, Σ) to produce the new state (β', Σ') through the following three steps:

1. Draw q_1, q_2, . . . , q_n independently with q_i drawn from (12).
2. Draw Σ' ∼ IW_d(n − k, (y^T Q^{-1} y − µ^T Ω^{-1} µ)^{-1}), where Q, µ and Ω are computed from the q drawn in step 1.
3. Draw β'^T ∼ N_{d,k}(µ^T, Σ', Ω).

No matter what the form of h, (12) is just a univariate density which can be sampled using rejection sampling. Moreover, in the important special case where h(q_i) is a Gamma(ν/2, ν/2) density, sampling from π(q_i|β, Σ, y) is especially simple since

\[
q_i \mid \beta,\Sigma, y \sim \mathrm{Gamma}\bigg(\frac{\nu+d}{2},\ \frac{\nu + (\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)}{2}\bigg)\,.
\]

Recall from the previous section that an exact draw from the posterior density π(β, Σ|y) can (at least theoretically) be made by drawing sequentially from π(q|y), π(Σ|q, y) and π(β|Σ, q, y). Note that one iteration of the DA algorithm also involves simulating from π(Σ|q, y) and π(β|Σ, q, y), but, instead of a draw from π(q|y), the DA algorithm only requires n independent univariate draws. It is interesting that, if we are willing to settle for a Markov chain that converges to π(β, Σ|y) instead of an exact draw from π(β, Σ|y), then the complicated draw from π(q|y) can be replaced by the much simpler task of making n univariate draws from (12).
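The three steps are easy to code. The following sketch is ours, not the paper's; it implements one DA update for the Student's t case and, apart from the q-update in step 1, reuses the same conditional draws as the exact-sampling sketch in Section 2 (again passing Θ^{-1} as scipy's inverse Wishart scale, a mapping we assume rather than one stated in the paper).

```python
import numpy as np
from scipy.stats import invwishart

def da_update(beta, Sigma, y, X, nu, rng):
    """One iteration of Liu's (1996) DA algorithm for Student's t errors (h = Gamma(nu/2, nu/2))."""
    n, d = y.shape
    k = X.shape[1]

    # Step 1: q_i | beta, Sigma, y ~ Gamma((nu + d)/2, rate (nu + delta_i)/2), independently,
    # where delta_i = (y_i - beta^T x_i)^T Sigma^{-1} (y_i - beta^T x_i).
    resid = y - X @ beta                                # n x d
    delta = np.einsum("ij,jk,ik->i", resid, np.linalg.inv(Sigma), resid)
    q = rng.gamma(shape=(nu + d) / 2.0, scale=2.0 / (nu + delta))

    # Steps 2 and 3: draw (Sigma', beta') from pi(beta, Sigma | q, y).
    Qinv = np.diag(q)
    Omega = np.linalg.inv(X.T @ Qinv @ X)
    mu = Omega @ X.T @ Qinv @ y
    A = y.T @ Qinv @ y - mu.T @ (X.T @ Qinv @ y)        # = y'Q^{-1}y - mu' Omega^{-1} mu
    Sigma_new = np.atleast_2d(invwishart.rvs(df=n - k, scale=A, random_state=rng))
    Ls, Lo = np.linalg.cholesky(Sigma_new), np.linalg.cholesky(Omega)
    beta_new = (mu.T + Ls @ rng.standard_normal((d, k)) @ Lo.T).T
    return beta_new, Sigma_new
```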


The basic theory of DA algorithms implies that the Markov chain {(β_m, Σ_m)}_{m=0}^∞ is reversible with respect to the posterior density π(β, Σ|y); i.e., for all (β, Σ), (β', Σ') ∈ R^{dk} × W, we have

\[
k(\beta',\Sigma' \mid \beta,\Sigma)\,\pi(\beta,\Sigma \mid y) = k(\beta,\Sigma \mid \beta',\Sigma')\,\pi(\beta',\Sigma' \mid y)\,.
\]

It follows immediately that the posterior density is invariant for the chain; i.e.,

\[
\int_{W}\int_{\mathbb{R}^{dk}} k(\beta,\Sigma \mid \beta',\Sigma')\,\pi(\beta',\Sigma' \mid y)\, d\beta'\, d\Sigma' = \pi(\beta,\Sigma \mid y)\,.
\]

Moreover, Roy (2008) shows that this Markov chain is Harris ergodic; that is, irreducible, aperiodic and Harris recurrent. Harris ergodicity implies that the chain converges to its stationary distribution and that ergodic averages based on the DA algorithm converge almost surely to their population counterparts, which are, of course, posterior expectations with respect to π(β, Σ|y).

Over the last decade, several authors have shown that it is possible to drastically improve the convergence behavior of DA algorithms by adding a computationally simple "extra step" (see, e.g., Hobert and Marchev, 2008; Liu and Wu, 1999; Meng and van Dyk, 1999; van Dyk and Meng, 2001). Indeed, suppose that R(q, dq') is a Markov transition function on R^n_+ that is reversible with respect to π(q | y). Consider adding an extra step to the DA algorithm where, after q is drawn in the first step, we move to a new value, q' ∼ R(q, ·), before drawing new values of β and Σ. To be more specific, let {(β̃_m, Σ̃_m)}_{m=0}^∞ be a new Markov chain that proceeds from the current state (β̃, Σ̃) to the next state (β̃', Σ̃') via the following four steps:

1. Draw q_1, q_2, . . . , q_n independently with q_i drawn from (12).
2. Draw q' ∼ R(q, ·).
3. Draw Σ' ∼ IW_d(n − k, (y^T Q'^{-1} y − µ'^T Ω'^{-1} µ')^{-1}).
4. Draw β'^T ∼ N_{d,k}(µ'^T, Σ', Ω'),

where Q' is a diagonal matrix with the reciprocals of the q'_i on the diagonal, and µ' and Ω' are just µ and Ω with Q' in place of Q. A routine calculation shows that the reversibility of R with respect to π(q|y) implies that the new chain, {(β̃_m, Σ̃_m)}_{m=0}^∞, is reversible with respect to the target (posterior) density, π(β, Σ|y).
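Step 2 leaves R unspecified at this point; the specific R constructed later in this section via the Liu and Wu (1999) group recipe multiplies q componentwise by a random scalar g. The sketch below is ours and assumes the Gamma(ν/2, ν/2) mixing density, in which case the density of g (proportional to π(gq | y)χ(g) with respect to the Haar measure dg/g) works out to a Gamma(nν/2, rate (ν/2)Σ_i q_i) distribution; that closed form is our own derivation from the fact, shown below, that π(gq | y) is proportional to ∏_i h(gq_i) as a function of g, and is not a formula stated in the surviving text.

```python
import numpy as np

def haar_pxda_extra_step(q, nu, rng):
    """Extra step of the Haar PX-DA algorithm for the Student's t case (a sketch).

    Draw g from the density proportional to pi(g*q | y) * chi(g) with respect to dg/g;
    with h = Gamma(nu/2, nu/2) this reduces to Gamma(n*nu/2, rate (nu/2) * sum(q)),
    an assumed closed form (our derivation).
    """
    n = q.shape[0]
    g = rng.gamma(shape=n * nu / 2.0, scale=2.0 / (nu * q.sum()))
    return g * q   # q' = g q; then carry out steps 3 and 4 with q' in place of q
```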

Moreover, the results in Hobert and Marchev (2008) imply that the new algorithm is at least as efficient as the DA algorithm, and, if the new chain satisfies an additional regularity condition, then it converges at least as fast as the DA chain. To make this more precise, we need to introduce some notation. Let L^2 denote the space of real-valued functions that are square integrable with respect to the target density; that is,

\[
L^2 = \Big\{ h : \mathbb{R}^{dk}\times W \to \mathbb{R} \;:\; \int_{W}\int_{\mathbb{R}^{dk}} h^2(\beta,\Sigma)\,\pi(\beta,\Sigma \mid y)\, d\beta\, d\Sigma < \infty \Big\}\,.
\]

Also, let L^2_0 denote the subspace of functions in L^2 that satisfy ∫_W ∫_{R^{dk}} h(β, Σ) π(β, Σ|y) dβ dΣ = 0. The space L^2_0 is a Hilbert space with inner product defined as

\[
\langle h_1, h_2\rangle = \int_{W}\int_{\mathbb{R}^{dk}} h_1(\beta,\Sigma)\, h_2(\beta,\Sigma)\,\pi(\beta,\Sigma \mid y)\, d\beta\, d\Sigma\,.
\]

Of course, the corresponding norm is ‖h‖ = √⟨h, h⟩. Let K : L^2_0 → L^2_0 and K̃ : L^2_0 → L^2_0 denote the usual Markov operators defined by the DA chain and the new chain, respectively. In particular, K maps h ∈ L^2_0 to

\[
(Kh)(\beta,\Sigma) := \int_{W}\int_{\mathbb{R}^{dk}} h(\beta',\Sigma')\, k(\beta',\Sigma' \mid \beta,\Sigma)\, d\beta'\, d\Sigma'\,,
\]

and K̃ acts similarly. It is well known that K is a positive operator; that is, ⟨Kh, h⟩ ≥ 0 for any h ∈ L^2_0 (Liu, Wong and Kong, 1994). The results in Hobert and Marchev (2008) can be used to show that K − K̃ is also a positive operator, and this implies that the new chain is at least as efficient as the DA chain. To be specific, let h ∈ L^2 and define τ^2 to be the asymptotic variance in the CLT for h̄_m := m^{-1}Σ_{i=0}^{m-1} h(β_i, Σ_i) if such a CLT exists, and ∞ otherwise. Define τ̃^2 similarly using m^{-1}Σ_{i=0}^{m-1} h(β̃_i, Σ̃_i) in place of h̄_m. The positivity of K − K̃ implies that τ̃^2 ≤ τ^2 (Mira and Geyer, 1999).

Let ‖K‖ and ‖K̃‖ denote the (operator) norms of K and K̃, so, for example,

\[
\|K\| = \sup_{h\in L^2_0,\ \|h\|=1} \|Kh\|\,.
\]

In general, the closer the norm of a Markov operator is to 0, the faster the corresponding Markov chain converges (see, e.g., Liu, Wong and Kong, 1995). Results in Hobert and Rosenthal (2007) imply that, if K − K̃ is also a positive operator, then ‖K̃‖ ≤ ‖K‖, so the new chain converges at least as fast as the DA chain.

We will construct a specific R using a recipe of Liu and Wu (1999) that involves group actions and (left) Haar measure. The results in Hobert and Marchev (2008) imply that the corresponding K − K̃ is indeed a positive operator, and hence the alternative algorithm, which we call the Haar PX-DA algorithm, is at least as good as the DA algorithm in terms of efficiency and convergence rate.

Let G be the multiplicative group R_+ where group composition is defined as multiplication. This is a unimodular group, so the left and right Haar measures are the same and are given by ϱ(dg) = dg/g, where dg denotes Lebesgue measure on R_+. Allow G to act on the space R^n_+ through component-wise multiplication; i.e., gq = (gq_1, gq_2, . . . , gq_n). Note that, if g ∈ G and h : R^n_+ → R is an integrable function (with respect to Lebesgue measure), then

\[
\int_{\mathbb{R}^n_+} h(z)\, dz = g^{n}\int_{\mathbb{R}^n_+} h(gz)\, dz\,.
\]

This shows that Lebesgue measure on R^n_+ is relatively invariant with multiplier χ(g) = g^n. (See Eaton (1989) for details on the group theory we are using here.) To construct the Haar PX-DA algorithm using the group structure introduced above, we must first demonstrate that there is a probability density (with respect to Haar measure) that is proportional to π(gq | y) χ(g). In other words, we must show that m(q) = ∫_G π(gq | y) χ(g) ϱ(dg) < ∞ for all q ∈ R^n_+. Note that we can re-express π(q | y) as follows:

\[
\pi(q \mid y) \propto \bigg|\frac{X^T Q^{-1} X}{q_\cdot}\bigg|^{-\frac d2}
\bigg|\frac{y^T Q^{-1} y - y^T Q^{-1} X (X^T Q^{-1} X)^{-1} X^T Q^{-1} y}{q_\cdot}\bigg|^{-\frac{n-k}{2}}
\prod_{i=1}^{n}\Big(\frac{q_i}{q_\cdot}\Big)^{\frac d2} h(q_i)\,,
\]

where q_· = Σ_{i=1}^n q_i. The first two factors and the ratios q_i/q_· are unchanged when q is replaced by gq. Therefore, as a function of g, π(gq | y) = c ∏_{i=1}^n h(gq_i), where c is a constant.

Hence, using the change of variable g = 1/σ, we have

\[
m(q) = c\int_0^{\infty}\bigg[\prod_{i=1}^{n} h(gq_i)\bigg] g^{n}\,\frac{dg}{g}
= c\int_0^{\infty}\bigg[\prod_{i=1}^{n}\frac{1}{\sigma}\, h\Big(\frac{q_i}{\sigma}\Big)\bigg]\frac{d\sigma}{\sigma}\,. \tag{13}
\]

We now use a simple Bayesian statistical argument to show that m(q) is finite. Indeed, suppose that W is a random variable from the scale family σ^{-1} h(w/σ) and, as a prior density on the scale parameter σ, we use 1/σ. Then the posterior density of σ given w is proper because

\[
\int_0^{\infty}\frac{1}{\sigma}\, h\Big(\frac{w}{\sigma}\Big)\,\frac{d\sigma}{\sigma} = \frac{1}{w}\int_0^{\infty} h(u)\, du = \frac{1}{w} < \infty\,.
\]

A similar argument shows that the integral in (13) is finite, so m(q) < ∞ for all q ∈ R^n_+. Hence there is a probability density with respect to ϱ that is proportional to π(gq | y)χ(g), and the Liu and Wu (1999) recipe takes R(q, ·) to be the transition that draws g from this density and sets q' = gq. This R is reversible with respect to π(q | y), and the four-step algorithm described above with this choice of R is what we call the Haar PX-DA algorithm.

4 Convergence Rates of the DA and Haar PX-DA Algorithms

We now specialize to the case where h is the Gamma(ν/2, ν/2) density, so that the regression errors have a multivariate Student's t distribution with ν degrees of freedom, and we study the convergence rates of the DA and Haar PX-DA chains. Recall that the DA chain is geometrically ergodic if its m-step transition law converges to the posterior distribution in total variation at a geometric rate; that is, if the total variation distance after m steps, started from (β, Σ), is bounded above by M(β, Σ)ρ^m for some function M and some ρ ∈ [0, 1). The standard route to establishing geometric ergodicity is to verify a drift condition together with an associated minorization condition. For the drift condition, we must find a function V : R^{dk} × W → [0, ∞), a λ ∈ [0, 1) and an L < ∞ such that, for all (β, Σ) ∈ R^{dk} × W,

\[
(PV)(\beta,\Sigma) \le \lambda\, V(\beta,\Sigma) + L\,,
\]

where (PV)(β, Σ) denotes the conditional expectation of V at the next state of the DA chain given that the current state is (β, Σ). The corresponding small set is C = {(β, Σ) : V(β, Σ) ≤ l}, where l > 2L/(1 − λ). For the associated minorization condition, we must find a density function u : R^{dk} × W → [0, ∞) and an ε ∈ (0, 1] such that, for all (β', Σ') ∈ C and all (β, Σ) ∈ R^{dk} × W,

\[
k(\beta,\Sigma \mid \beta',\Sigma') \ge \varepsilon\, u(\beta,\Sigma)\,.
\]

If we can establish these two conditions, then the Markov chain is geometrically ergodic (Meyn and Tweedie (1993, Chapter 15); Jones and Hobert (2004)). Moreover, Rosenthal's (1995) Theorem 12 provides a computable upper bound on M(β, Σ)ρ^m that involves the functions and constants from the drift and minorization conditions. We begin with the drift condition.

Lemma 1. Let V(β, Σ) = Σ_{i=1}^n (y_i − β^T x_i)^T Σ^{-1} (y_i − β^T x_i) and assume that ν + d > 2. Then the Markov chain underlying the DA algorithm satisfies

\[
(PV)(\beta,\Sigma) \le \frac{n+d-k}{\nu+d-2}\, V(\beta,\Sigma) + \frac{n\nu(n+d-k)}{\nu+d-2}\,.
\]

Proof. First, by Fubini's theorem, we have

\[
\begin{aligned}
(PV)(\beta',\Sigma') &= \int_{W}\int_{\mathbb{R}^{dk}} V(\beta,\Sigma)\, k(\beta,\Sigma \mid \beta',\Sigma')\, d\beta\, d\Sigma\\
&= \int_{\mathbb{R}^n_+}\int_{W}\int_{\mathbb{R}^{dk}} V(\beta,\Sigma)\,\pi(\beta,\Sigma \mid q, y)\,\pi(q \mid \beta',\Sigma', y)\, d\beta\, d\Sigma\, dq\\
&= \int_{\mathbb{R}^n_+}\int_{W}\int_{\mathbb{R}^{dk}} V(\beta,\Sigma)\,\pi(\beta \mid \Sigma, q, y)\,\pi(\Sigma \mid q, y)\,\pi(q \mid \beta',\Sigma', y)\, d\beta\, d\Sigma\, dq\\
&= \int_{\mathbb{R}^n_+}\bigg\{\int_{W}\bigg[\int_{\mathbb{R}^{dk}} V(\beta,\Sigma)\,\pi(\beta \mid \Sigma, q, y)\, d\beta\bigg]\pi(\Sigma \mid q, y)\, d\Sigma\bigg\}\pi(q \mid \beta',\Sigma', y)\, dq\,. \tag{14}
\end{aligned}
\]

The inner-most integral in (14) can be viewed as an expectation with respect to π(β | Σ, q, y), which is a matrix normal density. The integral in curly brackets can then be viewed as an expectation with respect to π(Σ | q, y), which is an inverse Wishart density. And, finally, the outer-most integral is an expectation with respect to π(q | β', Σ', y), which is a product of univariate gamma densities. Hence, we write

\[
(PV)(\beta',\Sigma') = E\Big[ E\Big\{ E\big[ V(\beta,\Sigma) \mid \Sigma, q, y \big] \,\Big|\, q, y \Big\} \,\Big|\, \beta',\Sigma', y \Big]\,.
\]

Starting with the innermost expectation, we have

\[
\begin{aligned}
E\big[V(\beta,\Sigma) \mid \Sigma, q, y\big]
&= E\bigg[\sum_{i=1}^{n} y_i^T\Sigma^{-1}y_i - \sum_{i=1}^{n}\big(y_i^T\Sigma^{-1}\beta^T x_i + x_i^T\beta\Sigma^{-1}y_i\big) + \sum_{i=1}^{n} x_i^T\beta\Sigma^{-1}\beta^T x_i \,\bigg|\, \Sigma, q, y\bigg]\\
&= \sum_{i=1}^{n} y_i^T\Sigma^{-1}y_i - \sum_{i=1}^{n} E\big[y_i^T\Sigma^{-1}\beta^T x_i + x_i^T\beta\Sigma^{-1}y_i \mid \Sigma, q, y\big] + \sum_{i=1}^{n} E\big[x_i^T\beta\Sigma^{-1}\beta^T x_i \mid \Sigma, q, y\big]\,.
\end{aligned}
\]

Recall that β^T | Σ, q, y ∼ N_{d,k}(µ^T, Σ, Ω). Hence, E[β^T | Σ, q, y] = µ^T and results in Arnold (1981, chap 17) show that E[βΣ^{-1}β^T | Σ, q, y] = dΩ + µΣ^{-1}µ^T. It follows that

\[
\begin{aligned}
E\big[V(\beta,\Sigma) \mid \Sigma, q, y\big]
&= \sum_{i=1}^{n} y_i^T\Sigma^{-1}y_i - \sum_{i=1}^{n}\big(y_i^T\Sigma^{-1}\mu^T x_i + x_i^T\mu\Sigma^{-1}y_i\big) + \sum_{i=1}^{n} x_i^T\big(d\Omega + \mu\Sigma^{-1}\mu^T\big)x_i\\
&= \sum_{i=1}^{n} (y_i - \mu^T x_i)^T\Sigma^{-1}(y_i - \mu^T x_i) + d\sum_{i=1}^{n} x_i^T\Omega\, x_i\,.
\end{aligned}
\]

Now, recall that

\[
\Sigma \mid q, y \sim \mathrm{IW}_d\Big(n-k,\ \big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}\Big)\,.
\]

Results in Mardia, Kent and Bibby (1979, p. 85) imply that

\[
E\big[\Sigma^{-1} \mid q, y\big] = (n-k)\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}\,.
\]

Therefore,

\[
\begin{aligned}
E\Big\{E\big[V(\beta,\Sigma) \mid \Sigma, q, y\big] \,\Big|\, q, y\Big\}
&= \sum_{i=1}^{n}(y_i - \mu^T x_i)^T E\big[\Sigma^{-1} \mid q, y\big](y_i - \mu^T x_i) + d\sum_{i=1}^{n} x_i^T\Omega\, x_i\\
&= (n-k)\sum_{i=1}^{n}(y_i - \mu^T x_i)^T\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}(y_i - \mu^T x_i) + d\sum_{i=1}^{n} x_i^T\Omega\, x_i\,. \tag{15}
\end{aligned}
\]

Note that we were able to compute the first two conditional expectations exactly. Unfortunately, we cannot compute the expectation of (15) with respect to π(q | β', Σ', y) in closed form. We instead compute the expectation of a simple upper bound on (15). First, note that

\[
\begin{aligned}
y^T Q^{-1} y - \mu^T\Omega^{-1}\mu
&= y^T Q^{-1} y + \mu^T\Omega^{-1}\mu - 2\mu^T\Omega^{-1}\mu\\
&= \sum_{i=1}^{n} q_i y_i y_i^T + \mu^T\big(X^T Q^{-1} X\big)\mu - 2\mu^T X^T Q^{-1} y\\
&= \sum_{i=1}^{n} q_i y_i y_i^T + \mu^T\Big(\sum_{i=1}^{n} q_i x_i x_i^T\Big)\mu - 2\mu^T\sum_{i=1}^{n} q_i x_i y_i^T\\
&= \sum_{i=1}^{n} q_i\,(y_i - \mu^T x_i)(y_i - \mu^T x_i)^T\,.
\end{aligned}
\]

So,

\[
\begin{aligned}
(y_i - \mu^T x_i)^T\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}(y_i - \mu^T x_i)
&= (y_i - \mu^T x_i)^T\Big(\sum_{j=1}^{n} q_j (y_j - \mu^T x_j)(y_j - \mu^T x_j)^T\Big)^{-1}(y_i - \mu^T x_i)\\
&= \frac{1}{q_i}\,(y_i - \mu^T x_i)^T\Big(\sum_{j=1}^{n}\frac{q_j}{q_i}(y_j - \mu^T x_j)(y_j - \mu^T x_j)^T\Big)^{-1}(y_i - \mu^T x_i)\,.
\end{aligned}
\]

Now,

\[
\frac{1}{q_i}\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big) = \sum_{j=1}^{n}\frac{q_j}{q_i}\,(y_j - \mu^T x_j)(y_j - \mu^T x_j)^T\,,
\]

which is positive definite. Also,

\[
\sum_{j=1}^{n}\frac{q_j}{q_i}(y_j - \mu^T x_j)(y_j - \mu^T x_j)^T - (y_i - \mu^T x_i)(y_i - \mu^T x_i)^T
= \sum_{j\neq i}\frac{q_j}{q_i}(y_j - \mu^T x_j)(y_j - \mu^T x_j)^T\,,
\]

which is positive semidefinite. Hence, we may apply Lemma 3 from Appendix B to conclude that

\[
(y_i - \mu^T x_i)^T\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}(y_i - \mu^T x_i) \le \frac{1}{q_i}\,.
\]

Therefore,

\[
(n-k)\sum_{i=1}^{n}(y_i - \mu^T x_i)^T\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}(y_i - \mu^T x_i) \le (n-k)\sum_{i=1}^{n}\frac{1}{q_i}\,.
\]

We now use a similar argument to bound the term d Σ_{i=1}^n x_i^T Ω x_i. Since X has full column rank, q_i^{-1} X^T Q^{-1} X = Σ_{j=1}^n (q_j/q_i) x_j x_j^T is positive definite. Furthermore, it is clear that Σ_{j≠i} (q_j/q_i) x_j x_j^T is positive semidefinite. Another application of Lemma 3 yields

\[
q_i\, x_i^T\Omega\, x_i = x_i^T\Big(\frac{1}{q_i}\, X^T Q^{-1} X\Big)^{-1} x_i = x_i^T\Big(\sum_{j=1}^{n}\frac{q_j}{q_i}\, x_j x_j^T\Big)^{-1} x_i \le 1\,.
\]

It follows that

\[
d\sum_{i=1}^{n} x_i^T\Omega\, x_i \le d\sum_{i=1}^{n}\frac{1}{q_i}\,.
\]

Putting all of this together, we have

\[
\begin{aligned}
E\Big\{E\big[V(\beta,\Sigma) \mid \Sigma, q, y\big] \,\Big|\, q, y\Big\}
&= (n-k)\sum_{i=1}^{n}(y_i - \mu^T x_i)^T\big(y^T Q^{-1} y - \mu^T\Omega^{-1}\mu\big)^{-1}(y_i - \mu^T x_i) + d\sum_{i=1}^{n} x_i^T\Omega\, x_i\\
&\le (n+d-k)\sum_{i=1}^{n}\frac{1}{q_i}\,.
\end{aligned}
\]

Now recall that π(q | β', Σ', y) is a product of n univariate gamma densities with

\[
q_i \mid \beta',\Sigma', y \sim \mathrm{Gamma}\bigg(\frac{\nu+d}{2},\ \frac{\nu + (\beta'^T x_i - y_i)^T\Sigma'^{-1}(\beta'^T x_i - y_i)}{2}\bigg)\,.
\]

So, as long as ν + d > 2, we have

\[
E\big(q_i^{-1} \mid \beta',\Sigma', y\big) = \frac{(y_i - \beta'^T x_i)^T\Sigma'^{-1}(y_i - \beta'^T x_i) + \nu}{\nu+d-2}\,.
\]

Finally,

\[
(PV)(\beta',\Sigma') \le \frac{n+d-k}{\nu+d-2}\bigg[n\nu + \sum_{i=1}^{n}(y_i - \beta'^T x_i)^T\Sigma'^{-1}(y_i - \beta'^T x_i)\bigg]
= \frac{n+d-k}{\nu+d-2}\, V(\beta',\Sigma') + \frac{n\nu(n+d-k)}{\nu+d-2}\,,
\]

and this completes the proof.

Here is a formal statement of the required minorization condition.

Lemma 2. Fix l > 0 and let C = {(β, Σ) : V(β, Σ) ≤ l}. For all (β', Σ') ∈ C and all (β, Σ) ∈ R^{dk} × W, we have

\[
k(\beta,\Sigma \mid \beta',\Sigma') \ge \varepsilon\, u(\beta,\Sigma)\,,
\]

where the density u(β, Σ) is given by

\[
u(\beta,\Sigma) = \int_{\mathbb{R}^n_+} \pi(\beta,\Sigma \mid q, y)\,\prod_{i=1}^{n}\frac{g(q_i)}{\int_0^{\infty} g(t)\, dt}\; dq
\]

and ε = (∫_0^∞ g(t) dt)^n. The function g(·) is given by

\[
g(t) = \Gamma\Big(\frac{\nu+d}{2},\ \frac{\nu}{2};\ t\Big)\, I_{(0,\,q^*)}(t) + \Gamma\Big(\frac{\nu+d}{2},\ \frac{\nu+l}{2};\ t\Big)\, I_{(q^*,\,\infty)}(t)\,,
\]

where q^* = [(ν + d)/l] log(1 + l/ν) and Γ(a, b; x) denotes the Gamma(a, b) density evaluated at the point x.

Proof. For i = 1, 2, . . . , n, define

\[
C_i = \big\{(\beta,\Sigma) : (y_i - \beta^T x_i)^T\Sigma^{-1}(y_i - \beta^T x_i) \le l\big\}\,.
\]

Clearly, C ⊂ C_i for each i. Recall that π(q | β, Σ, y) is a product of n univariate gamma densities. Hence,

\[
\begin{aligned}
\inf_{(\beta,\Sigma)\in C} \pi(q \mid \beta,\Sigma, y)
&= \inf_{(\beta,\Sigma)\in C} \prod_{i=1}^{n}\Gamma\Big(\frac{\nu+d}{2},\ \frac{\nu + (\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)}{2};\ q_i\Big)\\
&\ge \prod_{i=1}^{n}\inf_{(\beta,\Sigma)\in C}\Gamma\Big(\frac{\nu+d}{2},\ \frac{\nu + (\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)}{2};\ q_i\Big)\\
&\ge \prod_{i=1}^{n}\inf_{(\beta,\Sigma)\in C_i}\Gamma\Big(\frac{\nu+d}{2},\ \frac{\nu + (\beta^T x_i - y_i)^T\Sigma^{-1}(\beta^T x_i - y_i)}{2};\ q_i\Big)\,.
\end{aligned}
\]

Then, by Hobert's (2001) Lemma 1, it straightforwardly follows that, for any (β', Σ') ∈ C,

\[
\pi(q \mid \beta',\Sigma', y) \ge \prod_{i=1}^{n} g(q_i)\,.
\]

Therefore, if (β', Σ') ∈ C, we have

\[
\begin{aligned}
k(\beta,\Sigma \mid \beta',\Sigma') &= \int_{\mathbb{R}^n_+} \pi(\beta,\Sigma \mid q, y)\,\pi(q \mid \beta',\Sigma', y)\, dq\\
&\ge \int_{\mathbb{R}^n_+} \pi(\beta,\Sigma \mid q, y)\bigg[\prod_{i=1}^{n} g(q_i)\bigg] dq\\
&= \bigg[\int_0^{\infty} g(t)\, dt\bigg]^{n}\int_{\mathbb{R}^n_+} \pi(\beta,\Sigma \mid q, y)\,\prod_{i=1}^{n}\frac{g(q_i)}{\int_0^{\infty} g(t)\, dt}\; dq\,,
\end{aligned}
\]

and the proof is complete.
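Rosenthal's (1995) bound requires the numerical values of ε and of the drift constants. Because g is a piecewise gamma density, the normalizing integral in Lemma 2 reduces to two gamma distribution functions, so ε = [F_{a,ν/2}(q*) + 1 − F_{a,(ν+l)/2}(q*)]^n with a = (ν + d)/2. The sketch below is ours; SciPy's gamma distribution is parameterized by shape and scale, so the rates ν/2 and (ν + l)/2 enter as reciprocal scales, and the example values of n, d, ν and l are purely illustrative.

```python
import numpy as np
from scipy.stats import gamma

def minorization_epsilon(n, d, nu, l):
    """epsilon = (int_0^infty g(t) dt)^n for the g of Lemma 2."""
    a = (nu + d) / 2.0
    b1, b2 = nu / 2.0, (nu + l) / 2.0                   # the two gamma rates
    q_star = ((nu + d) / l) * np.log1p(l / nu)
    mass = gamma.cdf(q_star, a, scale=1.0 / b1) + gamma.sf(q_star, a, scale=1.0 / b2)
    return mass ** n

# Illustrative call; l should be chosen larger than 2L/(1 - lambda) from Lemma 1.
print(minorization_epsilon(n=20, d=2, nu=25, l=60.0))
```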

Lemmas 1 and 2 immediately yield the following result.

Theorem 1. The Markov chain underlying the DA algorithm is geometrically ergodic if n < ν + k − 2.

A similar drift and minorization type analysis of the Haar PX-DA algorithm is much more complicated due to the extra step (Roy, 2008). Fortunately, such an analysis is unnecessary because Hobert and Marchev's (2008) Theorem 4 shows that the Haar PX-DA algorithm inherits the geometric ergodicity of the DA algorithm upon which it is based. Hence, we have the following result:

Corollary 1. The Markov chain underlying the Haar PX-DA algorithm is geometrically ergodic if n < ν + k − 2.

Theorem 1 and Corollary 1 are generalizations of results in Marchev and Hobert (2004), who studied the multivariate location-scale model that is the special case of model (1) in which k = 1 and there are no covariates.
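In practice, the geometric ergodicity guaranteed by Theorem 1 and Corollary 1 is what justifies reporting Monte Carlo standard errors for posterior estimates computed from either chain (Flegal, Haran and Jones, 2008; Jones, Haran, Caffo and Neath, 2006). As a minimal illustration of that use (ours, not part of the paper), a batch-means estimate of the standard error of an ergodic average can be computed as follows; the output list referenced in the comment is hypothetical.

```python
import numpy as np

def batch_means_se(draws, n_batches=30):
    """Batch-means estimate of the Monte Carlo standard error of the mean of `draws`.

    Asymptotically valid when a Markov chain CLT holds, e.g. under geometric ergodicity.
    """
    draws = np.asarray(draws, dtype=float)
    m = len(draws) // n_batches                          # batch length
    batch_means = draws[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    var_hat = m * batch_means.var(ddof=1)                # estimates the CLT asymptotic variance
    return np.sqrt(var_hat / (m * n_batches))

# Example: standard error for the posterior mean of beta[0, 0], where `da_samples`
# is a (hypothetical) list of (beta, Sigma) pairs produced by the DA sampler.
# se = batch_means_se([b[0, 0] for b, S in da_samples])
```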

Appendices

A Distributional facts

Matrix Normal Distribution. Suppose Z is an r × c random matrix with density

\[
f_Z(z) = \frac{1}{(2\pi)^{\frac{rc}{2}}\,|A|^{\frac c2}\,|B|^{\frac r2}}
\exp\Big\{-\frac12\,\mathrm{tr}\big[A^{-1}(z-\theta)B^{-1}(z-\theta)^T\big]\Big\}\,,
\]

where θ is an r × c matrix, and A and B are r × r and c × c positive definite matrices. Then Z is said to have a matrix normal distribution and we denote this by Z ∼ N_{r,c}(θ, A, B) (Arnold, 1981, Chapter 17). Simulating a random matrix from the N_{r,c}(θ, A, B) distribution is easy. For i = 1, . . . , r, draw Z_i ∼ N_c(0, B) and then set

\[
Z = A^{1/2}\begin{pmatrix} Z_1^T \\ Z_2^T \\ \vdots \\ Z_r^T \end{pmatrix} + \theta\,.
\]
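A direct transcription of this recipe into NumPy (ours, for illustration; a Cholesky factor of A serves as the square root, which yields the same distribution):

```python
import numpy as np

def rmatnorm(theta, A, B, rng=None):
    """Draw Z ~ N_{r,c}(theta, A, B): stack rows Z_i ~ N_c(0, B), premultiply by a square root of A, add theta."""
    rng = np.random.default_rng() if rng is None else rng
    r, c = theta.shape
    LA, LB = np.linalg.cholesky(A), np.linalg.cholesky(B)   # A = LA LA^T, B = LB LB^T
    rows = rng.standard_normal((r, c)) @ LB.T               # each row ~ N_c(0, B)
    return LA @ rows + theta
```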

Inverse Wishart Distribution. Suppose W is a p × p random positive definite matrix with density

\[
f_W(w) = \frac{|w|^{-\frac{m+p+1}{2}}\exp\big\{-\frac12\,\mathrm{tr}\big(\Theta^{-1}w^{-1}\big)\big\}}
{2^{\frac{mp}{2}}\,\pi^{\frac{p(p-1)}{4}}\,|\Theta|^{\frac m2}\,\prod_{i=1}^{p}\Gamma\big(\frac12(m+1-i)\big)}\,,
\]

where m ≥ p and Θ is a p × p positive definite matrix. Then W is said to have an inverse Wishart distribution and this is denoted by W ∼ IW_p(m, Θ).

B A matrix result

Lemma 3. If P is a positive definite matrix and, for some vector x, the matrix P − xx^T is positive semidefinite, then x^T P^{-1} x ≤ 1.

Proof. Calculating the determinant of the matrix

\[
\begin{pmatrix} P & x \\ x^T & 1 \end{pmatrix}
\]

twice, using the Schur complements of P and of 1, we get the following identity:

\[
|P|\,\big(1 - x^T P^{-1} x\big) = |P - xx^T|\,.
\]

This implies that

\[
x^T P^{-1} x = 1 - \frac{|P - xx^T|}{|P|}\,.
\]

Since P is positive definite and P − xx^T is positive semidefinite, we know that |P| > 0 and |P − xx^T| ≥ 0. Hence, |P − xx^T|/|P| ≥ 0 and it follows that x^T P^{-1} x ≤ 1.

Acknowledgments

The authors thank two reviewers for helpful comments and suggestions. The second author's research was supported by NSF grants DMS-05-03648 and DMS-08-05860.

References

Affleck-Graves, J. and McDonald, B. (1989). Nonnormalities and tests of asset pricing theories. Journal of Finance, 44 889–908.

Arnold, S. F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley, New York.

Bradley, B. O. and Taqqu, M. S. (2003). Financial risk and heavy tails. In Handbook of Heavy-Tailed Distributions in Finance. Elsevier, Amsterdam, 35–103.

Cademartori, D., Romo, C., Campos, R. and Galea, M. (2003). Robust estimation of systematic risk using the t distribution in the Chilean stock markets. Applied Economics Letters, 10 447–453.

Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia.

Fama, E. F. (1965). The behavior of stock market prices. Journal of Business, 38 34–105.

Fernandez, C. and Steel, M. F. J. (1999). Multivariate Student-t regression models: Pitfalls and inference. Biometrika, 86 153–167.

Flegal, J. M., Haran, M. and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23 250–260.

Hobert, J. P. (2001). Discussion of "The art of data augmentation" by D. A. van Dyk and X.-L. Meng. Journal of Computational and Graphical Statistics, 10 59–68.

Hobert, J. P. and Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics, 36 532–554.

Hobert, J. P. and Rosenthal, J. S. (2007). Norm comparisons for data augmentation. Advances and Applications in Statistics, 7 291–302.

Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101 1537–1547.

Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16 312–334.

Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. The Annals of Statistics, 32 784–817.

Liu, C. (1996). Bayesian robust multivariate linear regression with incomplete data. Journal of the American Statistical Association, 91 1219–1227.

Liu, C. and Rubin, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 5 19–39.

Liu, C., Rubin, D. B. and Wu, Y. N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika, 85 755–770.

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81 27–40.

Liu, J. S., Wong, W. H. and Kong, A. (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society, Series B, 57 157–169.

Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94 1264–1274.

Marchev, D. and Hobert, J. P. (2004). Geometric ergodicity of van Dyk and Meng's algorithm for the multivariate Student's t model. Journal of the American Statistical Association, 99 228–238.

Mardia, K., Kent, J. and Bibby, J. (1979). Multivariate Analysis. Academic Press, London.

Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika, 86 301–320.

Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.

Mira, A. and Geyer, C. J. (1999). Ordering Monte Carlo Markov chains. Tech. Rep. No. 632, School of Statistics, University of Minnesota.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2 13–25.

Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90 558–566.

Roy, V. (2008). Theoretical and methodological developments for Markov chain Monte Carlo algorithms for Bayesian regression. Ph.D. thesis, Department of Statistics, University of Florida.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics, 10 1–50.

Zhou, G. (1993). Asset-pricing tests under alternative distributions. Journal of Finance, 48 1927–1942.
