Analysis of MCMC algorithms for Bayesian linear regression with Laplace errors

Hee Min Choi and James P. Hobert
Department of Statistics, University of Florida

January 2013

Abstract. Let π denote the intractable posterior density that results when the standard default prior is placed on the parameters in a linear regression model with iid Laplace errors. We analyze the Markov chains underlying two different Markov chain Monte Carlo algorithms for exploring π. In particular, it is shown that the Markov operators associated with the data augmentation (DA) algorithm and a sandwich variant are both trace-class. Consequently, both Markov chains are geometrically ergodic. It is also established that for each i ∈ {1, 2, 3, . . . }, the ith largest eigenvalue of the sandwich operator is less than or equal to the corresponding eigenvalue of the DA operator. It follows that the sandwich algorithm converges at least as fast as the DA algorithm.

AMS 2000 subject classifications. Primary 60J27; secondary 62F15.

Abbreviated title. MCMC algorithms for Bayesian linear regression.

Key words and phrases. Asymmetric Laplace distribution, Data augmentation algorithm, Eigenvalues, Geometric convergence rate, Markov chain, Markov operator, Monte Carlo, Sandwich algorithm, Trace-class operator.


1 Introduction

Let {Y_i}_{i=1}^n be independent random variables such that

Y_i = x_i^T β + σ ε_i ,    (1)

where x_i ∈ R^p is a vector of known covariates associated with Y_i, β ∈ R^p is a vector of unknown regression coefficients, and σ ∈ (0, ∞) is an unknown scale parameter. The errors, {ε_i}_{i=1}^n, are assumed to be iid from the Laplace distribution with scale equal to two, so the common density is d(ε) = e^{−|ε|/2}/4. (The reason for the peculiar scaling will be made clear later.) The Laplace distribution is often used as a heavy-tailed alternative to the Gaussian distribution when the data contain outliers (see, e.g., West, 1984). We consider a Bayesian model with an improper prior on (β, σ²) that takes the form π(β, σ²) = (σ²)^{−(a+1)/2} I_{R_+}(σ²), where R_+ := (0, ∞) and a is a hyper-parameter. The standard default prior can be recovered by taking a = 1. Of course, whenever one deals with an improper prior, it must be established that the corresponding posterior is proper. The joint density of the data is given by

f(y|β, σ²) = (1 / (4^n σ^n)) exp{ −(1/(2σ)) Σ_{i=1}^n |y_i − x_i^T β| } ,

where y = (y_1, . . . , y_n). By definition, the posterior density is proper if

m(y) := ∫_{R_+} ∫_{R^p} f(y|β, σ²) π(β, σ²) dβ dσ² < ∞ .

As usual, let X denote the n × p matrix whose ith row is x_i^T, and also let C(X) denote the column space of X. Results in Fernández and Steel (1999) imply that when a = 1, the posterior is proper if and only if X has full column rank and y ∉ C(X). The following result, which is proven in the Appendix, is a generalization.

Proposition 1. The posterior is proper (that is, m(y) < ∞) if and only if X has full column rank, a > −n + p + 1, and y ∉ C(X).

Assume now that m(y) < ∞ so that the posterior density, which we denote by π(β, σ²|y), is well-defined. The posterior is intractable in the sense that posterior expectations cannot be computed in closed form. In this paper, we analyze a pair of Markov chain Monte Carlo (MCMC) algorithms for this problem. One is a data augmentation (DA) algorithm that is based on a representation of the Laplace density as a scale mixture of normals with respect to the inverse gamma distribution. The other is a variant of the DA algorithm that requires one additional simulation step at each iteration. In order to formally state the algorithms, we must introduce a bit more notation. (See Section 2 for derivations.) When we write W ∼ IG(α, γ), we mean that W has density proportional to w^{−α−1} e^{−γ/w} I_{R_+}(w), where α and γ are strictly positive parameters. Similarly, when we write W ∼ Inverse Gaussian(µ, λ), we mean that W has density given by

√( λ / (2πw³) ) exp{ −λ(w − µ)² / (2µ²w) } I_{R_+}(w) ,

where µ and λ are strictly positive parameters. Given z ∈ R^n_+, let Q be an n × n diagonal matrix whose ith diagonal element is z_i^{−1}. Also, define Ω = (X^T Q^{−1} X)^{−1} and θ = (X^T Q^{−1} X)^{−1} X^T Q^{−1} y. We now state the DA algorithm. Let Φ = {(β_m, σ²_m)}_{m=0}^∞ be a Markov chain with state space X = R^p × R_+ whose dynamics are defined (implicitly) through the following three-step procedure for moving from the current state, (β_m, σ²_m) = (β, σ²), to (β_{m+1}, σ²_{m+1}).

Iteration m + 1 of the DA algorithm:

1. Draw Z_1, . . . , Z_n independently with

   Z_i ∼ Inverse Gaussian( σ / (2|y_i − x_i^T β|), 1/4 ) ,

   and call the observed value z = (z_1, . . . , z_n)^T.

2. Draw

   σ²_{m+1} ∼ IG( (n − p + a − 1)/2 , ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/2 ) .

3. Draw β_{m+1} ∼ N_p( θ, σ²_{m+1} Ω ).



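For readers who would like to experiment with the sampler, here is a minimal sketch of one DA iteration in Python with numpy. This is our illustration rather than code from the paper; the function name da_iteration and all variable names are hypothetical. The inverse Gaussian draw uses numpy's Wald distribution, which has the same (µ, λ) parameterization as above, and the IG draw is taken as the reciprocal of a gamma variate.

import numpy as np

rng = np.random.default_rng(0)

def da_iteration(beta, sigma2, X, y, a):
    # One iteration of the DA algorithm: (beta, sigma2) is the current state,
    # X is the n x p design matrix, y the response vector, a the prior hyper-parameter.
    n, p = X.shape
    resid = np.abs(y - X @ beta)

    # Step 1: Z_i ~ Inverse Gaussian(sigma / (2|y_i - x_i' beta|), 1/4).
    # (resid is nonzero with probability one, as noted in the text below.)
    z = rng.wald(np.sqrt(sigma2) / (2.0 * resid), 0.25)

    # Q is diagonal with entries 1/z_i, so Q^{-1} = diag(z).
    XtQinv = X.T * z                          # X' Q^{-1}
    Omega = np.linalg.inv(XtQinv @ X)         # (X' Q^{-1} X)^{-1}
    theta = Omega @ (XtQinv @ y)              # (X' Q^{-1} X)^{-1} X' Q^{-1} y

    # Step 2: sigma^2 ~ IG((n - p + a - 1)/2, (y'Q^{-1}y - theta'Omega^{-1}theta)/2),
    # drawn as rate / Gamma(shape, 1), i.e. the reciprocal of a Gamma(shape, rate) variate.
    shape = (n - p + a - 1.0) / 2.0
    rate = (y @ (z * y) - theta @ (XtQinv @ y)) / 2.0
    sigma2_new = rate / rng.gamma(shape)

    # Step 3: beta ~ N_p(theta, sigma^2_new * Omega).
    beta_new = rng.multivariate_normal(theta, sigma2_new * Omega)
    return beta_new, sigma2_new, z

Iterating this function from any starting value produces draws whose invariant density is π(β, σ²|y), provided the propriety conditions of Proposition 1 hold.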

Because β is drawn from a (full rank) multivariate normal distribution, |y_i − x_i^T β| is non-zero with probability one. Thus, from a practical (simulation) standpoint, we can ignore the case where |y_i − x_i^T β| = 0. On the other hand, in our theoretical analysis, we will have to deal with this possibility (see Section 2).

Before stating the second algorithm, we provide some background on the so-called sandwich algorithm (Hobert and Marchev, 2008). A generic DA algorithm has two steps. The first step involves simulating the latent data, given the current value of the parameters, and in the second step, a new parameter vector is drawn, given the current value of the latent data. (The DA algorithm stated above has three steps, but this is only because the simulation of the new parameter is broken down into two steps.) In a sandwich algorithm, the latent data that are drawn at the first step are “tweaked” in a random way before the new parameter vector is drawn. This random tweak, which is typically extremely inexpensive computationally, often results in huge gains in efficiency. See, for example, van Dyk and Meng (2001) and Hobert, Roy and Robert (2011). We now describe a sandwich version of the DA algorithm stated above. Let Φ̃ = {(β̃_m, σ̃²_m)}_{m=0}^∞ be a second Markov chain on X whose dynamics are defined through the following four-step procedure for moving from the current state, (β̃_m, σ̃²_m) = (β, σ²), to (β̃_{m+1}, σ̃²_{m+1}).

Iteration m + 1 of the sandwich algorithm:

1. Draw Z′_1, . . . , Z′_n independently with

   Z′_i ∼ Inverse Gaussian( σ / (2|y_i − x_i^T β|), 1/4 ) ,

   and call the observed value z′ = (z′_1, . . . , z′_n)^T.

2. Draw

   g ∼ IG( (2n + a − 1)/2 , Σ_{i=1}^n 1/(8 z′_i) ) ,

   and set z = (g z′_1, . . . , g z′_n)^T.

3. Draw

   σ̃²_{m+1} ∼ IG( (n − p + a − 1)/2 , ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/2 ) .

4. Draw β̃_{m+1} ∼ N_p( θ, σ̃²_{m+1} Ω ).
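In code, the only change relative to the DA sketch above is the extra draw of g between Steps 1 and 3; here is a minimal sketch in the same hypothetical notation (ours, not the paper's).

import numpy as np

rng = np.random.default_rng(1)

def sandwich_step(z_prime, a):
    # The random "tweak" of the latent data: g ~ IG((2n + a - 1)/2, sum_i 1/(8 z'_i)),
    # drawn as rate / Gamma(shape, 1). The rescaled vector z = g * z' is then used in
    # Steps 3 and 4, which are identical to Steps 2 and 3 of da_iteration above.
    n = z_prime.shape[0]
    shape = (2.0 * n + a - 1.0) / 2.0
    rate = np.sum(1.0 / (8.0 * z_prime))
    g = rate / rng.gamma(shape)
    return g * z_prime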



Note that the amount of computer time required to perform one iteration of the sandwich algorithm is only slightly larger than that required for the DA algorithm. Indeed, simulation of a single inverted gamma variate is computationally cheap relative to all the other simulation that is required to run one iteration of the DA algorithm. When a = 1, the two algorithms described above can be derived as special cases of algorithms introduced by Roy and Hobert (2010), who considered (a multivariate version of) model (1) with errors from a generic scale mixture of normals. These authors were mainly concerned with the particular situation in which the errors in model (1) have a Student's t distribution.

Our focus in this paper is on the theoretical properties of the Markov chains underlying the DA and sandwich algorithms given above. In order to describe our main result, we must introduce the operators associated with the Markov chains Φ and Φ̃. First, let k(β, σ² | β̂, σ̂²) denote the Markov transition density (Mtd) of Φ, that is, the conditional density of (β_{m+1}, σ²_{m+1}) = (β, σ²) given that (β_m, σ²_m) = (β̂, σ̂²).

Analogously, let k̃ denote the Mtd of Φ̃. (See Section 2 for the precise form of k.) Let L_0^2 be the space of real-valued functions with domain X that are square integrable and have mean zero with respect to the posterior density, π(β, σ²|y). This is a Hilbert space in which the inner product of h, h′ ∈ L_0^2 is defined as

⟨h, h′⟩ = ∫_{R_+} ∫_{R^p} h(β, σ²) h′(β, σ²) π(β, σ²|y) dβ dσ² ,

and the corresponding norm is, of course, given by ∥h∥ = ⟨h, h⟩^{1/2}. The Mtds k and k̃ define operators on L_0^2, the spectra of which contain a great deal of information about the convergence behavior of the corresponding Markov chains. (For an introduction to these ideas, see Hobert et al. (2011).) Indeed, let


K : L_0^2 → L_0^2 denote the Markov operator that maps h ∈ L_0^2 to

(Kh)(β̂, σ̂²) = ∫_{R_+} ∫_{R^p} h(β, σ²) k(β, σ² | β̂, σ̂²) dβ dσ² ,

and, analogously, let K̃ denote the Markov operator associated with the sandwich algorithm.

Because the operators K and K̃ are both self-adjoint and positive, the spectra of K and K̃ are both subsets of the interval [0, 1] (Hobert and Marchev, 2008; Liu, Wong and Kong, 1994). If, in addition to being self-adjoint and positive, K (or K̃) is also compact, then its spectrum consists solely of eigenvalues (which are all strictly less than one) and the point {0} (which may or may not be an eigenvalue). Finally, if the sum of the eigenvalues is finite, then the operator is called trace-class (see, e.g., Conway, 1990, p. 267). Here is our main result, which is proven in Section 3.

Theorem 1. The Markov operators K and K̃ are both trace-class. Moreover, letting {λ_i}_{i=1}^∞ and {λ̃_i}_{i=1}^∞ denote the ordered eigenvalues of K and K̃, respectively, we have that 0 ≤ λ̃_i ≤ λ_i < 1 for all i ∈ N, and λ̃_i < λ_i for at least one i ∈ N.

Theorem 1 has important implications, both theoretical and practical. Indeed, since the norm of a positive, self-adjoint, compact Markov operator is equal to its largest eigenvalue, Theorem 1 implies that

0 ≤ λ̃_1 = ∥K̃∥ ≤ ∥K∥ = λ_1 < 1 .    (2)

Since a reversible Markov chain is geometrically ergodic if and only if the norm of its operator is less than one (Roberts and Rosenthal, 1997), (2) implies that the Markov chains underlying the DA and sandwich algorithms are both geometrically ergodic. This fact is extremely important from a practical standpoint because geometric ergodicity guarantees the existence of the central limit theorems that form the basis of all the standard methods of calculating valid asymptotic standard errors for MCMC-based estimators (see, e.g., Flegal, Haran and Jones, 2008; Jones, Haran, Caffo and Neath, 2006). On the theoretical side, the norm of a self-adjoint Markov operator represents its asymptotic rate of convergence, with smaller values associated with faster convergence (see, e.g., Rosenthal, 2003). Thus, (2) also implies that the sandwich algorithm converges at least as fast as the DA algorithm.
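As a concrete illustration of how such a CLT is used in practice (ours, not the paper's), a batch-means estimator turns the correlated output of either algorithm into an asymptotically valid Monte Carlo standard error for a posterior mean.

import numpy as np

def batch_means_se(draws, n_batches=30):
    # Monte Carlo standard error of the sample mean of a scalar functional of the chain
    # (e.g., one coordinate of beta), based on non-overlapping batch means.
    draws = np.asarray(draws, dtype=float)
    m = len(draws) // n_batches                 # common batch length
    means = draws[:m * n_batches].reshape(n_batches, m).mean(axis=1)
    return means.std(ddof=1) / np.sqrt(n_batches)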

The remainder of this paper is organized as follows. The DA and sandwich algorithms are derived in Section 2, and then analyzed in Section 3, which contains a proof of Theorem 1. In Section 4, we show that our propriety result (Proposition 1) can be easily extended to the so-called Bayesian quantile regression model. Finally, the Appendix contains our proof of Proposition 1.

2 Derivations of the DA and sandwich algorithms

In this section, we derive the DA and sandwich algorithms that are stated in the Introduction. Again, when a = 1, these algorithms are special cases of those in Roy and Hobert (2010). While the derivation for the a ≠ 1 case is similar to the a = 1 case, we feel that it is worthwhile to give the details here for completeness, and because some of the results developed in this section are used again later in the paper.

We begin by introducing one latent random variable for each observation. Indeed, let {(Y_i, Z_i)}_{i=1}^n be independent random pairs such that Y_i | Z_i = z_i ∼ N(x_i^T β, σ²/z_i) and, marginally, Z_i ∼ IG(1, 1/8). A straightforward calculation (using the fact that the inverse Gaussian density integrates to one) shows that the marginal density of Y_i is (4σ)^{−1} e^{−|y_i − x_i^T β|/(2σ)}, which is precisely the density of Y_i under the original model. Therefore, if we let z = (z_1, . . . , z_n)^T and we denote the joint density of {(Y_i, Z_i)}_{i=1}^n by f(y, z; β, σ), then

∫_{R^n_+} f(y, z; β, σ) dz = f(y; β, σ) ,    (3)

showing that {Z_i}_{i=1}^n are indeed latent variables.
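This mixture representation is easy to check by simulation; the following sketch (our illustration, with arbitrary values standing in for x_i^T β and σ) compares draws generated through the IG(1, 1/8) mixture with direct draws from the Laplace density of the Introduction.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n_draws = 1.3, 0.7, 200_000      # hypothetical x_i' beta and sigma

# Mixture representation: Z ~ IG(1, 1/8), then Y | Z = z ~ N(mu, sigma^2 / z).
z = (1.0 / 8.0) / rng.gamma(1.0, size=n_draws)
y_mix = rng.normal(mu, sigma / np.sqrt(z))

# Direct representation: Y = mu + sigma * eps, where eps has density e^{-|eps|/2}/4,
# i.e. a Laplace distribution with scale parameter 2.
y_direct = mu + sigma * rng.laplace(0.0, 2.0, size=n_draws)

# The two samples should agree in distribution; compare a few quantiles.
print(np.quantile(y_mix, [0.1, 0.5, 0.9]))
print(np.quantile(y_direct, [0.1, 0.5, 0.9]))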

We now use the latent data to construct the DA algorithm. Combining the latent data model with the prior, π(β, σ²), yields the augmented posterior density defined as

π(β, σ², z|y) = f(y, z; β, σ²) π(β, σ²) / m(y) .

It follows immediately from (3) that

∫_{R^n_+} π(β, σ², z|y) dz = π(β, σ²|y) ,

which is our target posterior density. The Mtd of the DA algorithm is defined as

k( β, σ² | β̂, σ̂² ) = ∫_{R^n_+} π(β, σ²|z, y) π(z|β̂, σ̂², y) dz ,

where π(β, σ²|z, y) and π(z|β, σ², y) are conditional densities associated with π(β, σ², z|y), so, for example, π(z|β, σ², y) := π(β, σ², z|y)/π(β, σ²|y). Note that k maps X × X into (0, ∞). By construction, π(β, σ²|y) is an invariant density for this Mtd. Moreover, because k is strictly positive, it is straightforward to show that the corresponding Markov chain, Φ, is Harris ergodic; that is, irreducible, aperiodic, and Harris recurrent.

We now establish that the DA algorithm stated in the Introduction does indeed simulate the Markov chain defined by k. We begin by showing that the first of the three steps draws from π(z|β, σ², y). Indeed, π(z|β, σ², y) ∝ π(β, σ², z|y) ∝ f(y, z; β, σ²) π(β, σ²), which is given by

[ ∏_{i=1}^n ( z_i^{1/2} / ( (2π)^{1/2} (σ²)^{1/2} ) ) exp{ −z_i (y_i − x_i^T β)² / (2σ²) } ] [ ∏_{i=1}^n (1/(8 z_i²)) e^{−1/(8 z_i)} ] (σ²)^{−(a+1)/2} .    (4)

Thus, conditional on (β, σ², y), {Z_i}_{i=1}^n are independent, and the conditional density of z_i given (β, σ², y) is proportional to

z_i^{−3/2} exp{ −z_i (y_i − x_i^T β)² / (2σ²) − 1/(8 z_i) } .

When |y_i − x_i^T β| > 0, this is an (unnormalized) inverse Gaussian density with µ = σ/(2|y_i − x_i^T β|) and λ = 1/4.

On the other hand, if |y_i − x_i^T β| = 0, then it is an (unnormalized) inverse gamma density with α = 1/2 and γ = 1/8. Hence, in either case, the conditional density of z_i given (β, σ², y) takes the form

(1/(8π))^{1/2} z_i^{−3/2} exp{ −z_i (y_i − x_i^T β)² / (2σ²) + |y_i − x_i^T β| / (2σ) − 1/(8 z_i) } .    (5)

As noted in the Introduction, the case in which |y_i − x_i^T β| = 0 is irrelevant from a simulation standpoint because β is drawn from a (full rank) multivariate normal distribution, so |y_i − x_i^T β| > 0 with probability one.

It remains to show that the second and third steps of the algorithm result in a draw from π(β, σ²|z, y). In fact, we will show that the second step yields a draw from π(σ²|z, y), while the third step results in a draw from π(β|σ², z, y). As a function of β, (4) is proportional to

exp{ −(1/(2σ²)) Σ_{i=1}^n z_i (y_i − x_i^T β)² } .

However,

Σ_{i=1}^n z_i (y_i − x_i^T β)² = (y − Xβ)^T Q^{−1} (y − Xβ)
 = β^T (X^T Q^{−1} X) β − 2 β^T X^T Q^{−1} y + y^T Q^{−1} y
 = (β − θ)^T Ω^{−1} (β − θ) + [ y^T Q^{−1} y − θ^T Ω^{−1} θ ] ,    (6)

where, as in the Introduction, Q is an n × n diagonal matrix whose ith diagonal element is z_i^{−1}, Ω = (X^T Q^{−1} X)^{−1} and θ = (X^T Q^{−1} X)^{−1} X^T Q^{−1} y. Thus, β | σ², z, y ∼ N_p(θ, σ² Ω). Finally, it is clear that the conditional density of σ² given (z, y) is proportional to ∫_{R^p} f(y, z; β, σ²) π(β, σ²) dβ, and it follows from (6) that this quantity is itself proportional to

(σ²)^{−(n−p+a−1)/2 − 1} exp{ −( y^T Q^{−1} y − θ^T Ω^{−1} θ ) / (2σ²) } ,

which is the kernel of the IG distribution from the second step of the DA algorithm. We note that y ∉ C(X) implies that y^T Q^{−1} y > θ^T Ω^{−1} θ (see, e.g., Roy and Hobert, 2010).

We end this section with an explanation of why the sandwich algorithm is valid. Results in Hobert and Marchev (2008) show that we can form a sandwich algorithm by adding the extra step z → z′ = gz = (g z_1, . . . , g z_n)^T, as long as the random variable g is drawn from a density (with respect to Lebesgue measure) that is proportional to π(gz|y) g^{n−1} I_{R_+}(g). Note that

π(z|y) = ∫_{R_+} ∫_{R^p} π(β, σ², z|y) dβ dσ²
 ∝ |X^T Q^{−1} X|^{−1/2} [ y^T Q^{−1} y − y^T Q^{−1} X (X^T Q^{−1} X)^{−1} X^T Q^{−1} y ]^{−(n−p+a−1)/2} ∏_{i=1}^n z_i^{−3/2} e^{−1/(8 z_i)} .

Hence,

π(gz|y) g^{n−1} I_{R_+}(g) ∝ g^{−(2n+a+1)/2} exp{ −(1/g) Σ_{i=1}^n 1/(8 z_i) } I_{R_+}(g) .

So we must draw

g ∼ IG( (2n + a − 1)/2 , Σ_{i=1}^n 1/(8 z_i) ) .

Again, it is easy to see that the Markov chain Φ̃ is Harris ergodic.

3 Proof of Theorem 1

We begin by proving that the operator K is trace-class. As in Khare and Hobert (2011), we establish this result by showing that the Mtd k satisfies the following condition:

∫_{R^p} ∫_{R_+} k( β, σ² | β, σ² ) dσ² dβ < ∞ .    (7)

It follows from (5) that

π(z|β, σ², y) = ∏_{i=1}^n (1/(8π))^{1/2} z_i^{−3/2} exp{ −z_i (y_i − x_i^T β)²/(2σ²) + |y_i − x_i^T β|/(2σ) − 1/(8 z_i) }
 = ∏_{i=1}^n (1/(8π))^{1/2} z_i^{−3/2} exp{ −z_i (y_i − x_i^T β)²/(2σ²) + (1/(6 z_i)) (6 z_i |y_i − x_i^T β|)/(2σ) − 1/(8 z_i) }
 ≤ ∏_{i=1}^n (1/(8π))^{1/2} z_i^{−3/2} exp{ −z_i (y_i − x_i^T β)²/(2σ²) + (1/(12 z_i)) [ 9 z_i² (y_i − x_i^T β)²/σ² + 1 ] − 1/(8 z_i) }
 = [ ∏_{i=1}^n (1/(8π))^{1/2} z_i^{−3/2} exp{ −1/(24 z_i) } ] exp{ (1/(4σ²)) Σ_{i=1}^n z_i (y_i − x_i^T β)² }
 = [ ∏_{i=1}^n (1/(8π))^{1/2} z_i^{−3/2} exp{ −1/(24 z_i) } ] exp{ (β − θ)^T Ω^{−1} (β − θ)/(4σ²) } exp{ ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/(4σ²) } ,    (8)

where the inequality holds because |x| ≤ (x² + 1)/2, and the last equality is due to (6). Now note that

π(β|σ², z, y) exp{ (β − θ)^T Ω^{−1} (β − θ)/(4σ²) } = (2π)^{−p/2} |σ² Ω|^{−1/2} exp{ −(β − θ)^T Ω^{−1} (β − θ)/(4σ²) } ,

so that

∫_{R^p} π(β|σ², z, y) exp{ (β − θ)^T Ω^{−1} (β − θ)/(4σ²) } dβ = 2^{p/2} .    (9)

Similarly,

π(σ²|z, y) exp{ ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/(4σ²) }
 = [ ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/2 ]^{(n−p+a−1)/2} / Γ( (n−p+a−1)/2 ) × (σ²)^{−(n−p+a−1)/2 − 1} exp{ −( y^T Q^{−1} y − θ^T Ω^{−1} θ )/(4σ²) } ,

so that

∫_{R_+} π(σ²|z, y) exp{ ( y^T Q^{−1} y − θ^T Ω^{−1} θ )/(4σ²) } dσ² = 2^{(n−p+a−1)/2} .    (10)

Also,

∏_{i=1}^n [ ∫_{R_+} (1/(8π))^{1/2} z_i^{−3/2} exp{ −1/(24 z_i) } dz_i ] = 3^{n/2} .    (11)

Finally, combining (8), (9), (10) and (11), we have

∫_{R^p} ∫_{R_+} k( β, σ² | β, σ² ) dσ² dβ = ∫_{R^p} ∫_{R_+} ∫_{R^n_+} π(β, σ²|z, y) π(z|β, σ², y) dz dσ² dβ ≤ 2^{(n+a−1)/2} 3^{n/2} .

Hence, (7) is satisfied. Now, because the extra step of the sandwich algorithm was constructed using a group action and the Haar measure on that group, as described in Section 4 of Hobert and Marchev (2008), Theorem 1 of Khare and Hobert (2011) immediately implies that K̃ is trace-class, and that 0 ≤ λ̃_i ≤ λ_i < 1 for all i ∈ N. The fact that there is at least one i ∈ N such that 0 ≤ λ̃_i < λ_i < 1 follows from Theorem 2 of Khare and Hobert (2011). The proof is complete.
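As an aside, the one-dimensional integral appearing in each factor of (11) is easy to confirm numerically; a quick check (ours, using scipy, and not part of the proof):

import numpy as np
from scipy.integrate import quad

# Each factor in (11): integral of (1/(8*pi))^{1/2} z^{-3/2} exp(-1/(24 z)) over (0, inf).
integrand = lambda z: np.sqrt(1.0 / (8.0 * np.pi)) * z ** (-1.5) * np.exp(-1.0 / (24.0 * z))
value, _ = quad(integrand, 0.0, np.inf)
print(value, np.sqrt(3.0))   # both are approximately 1.7320508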

4 Posterior propriety for Bayesian quantile regression

In this section, we consider an alternative version of model (1) in which the Laplace errors are replaced by errors from the asymmetric Laplace density given by

d(ε; r) = r(1 − r) [ e^{(1−r)ε} I_{R_−}(ε) + e^{−rε} I_{R_+}(ε) ] ,

where r ∈ (0, 1) is fixed and R_− := (−∞, 0]. This density has rth quantile equal to zero, and when r = 1/2, we recover the Laplace errors. (This correspondence is the reason for the peculiar scaling in the Introduction.) The joint density of the data is now given by

f_{X,r}(y|β, σ²) = ( r^n (1 − r)^n / σ^n ) exp{ −(1/σ) Σ_{i=1}^n ( y_i − x_i^T β ) [ r − I_{R_−}( y_i − x_i^T β ) ] } .

As noted by Yu and Moyeed (2001), the maximum likelihood estimator of β under this fully parametric model is given by

argmin_{β ∈ R^p} Σ_{i=1}^n ρ_r( Y_i − x_i^T β ) ,

where ρ_r(u) = u [ r − I_{R_−}(u) ]. Of course, this also happens to be the standard nonparametric estimator of β(r) when the conditional quantile function of Y given X = x takes the form Q(r|X = x) = x^T β(r) (see, e.g., Koenker, 2005). For this reason, the fully parametric model with the asymmetric Laplace errors is sometimes used as the basis of a Bayesian quantile regression model (see, e.g., Kozumi and Kobayashi, 2011; Yu and Moyeed, 2001; Yuan and Yin, 2010). In particular, Kozumi and Kobayashi (2011) put a proper prior on (β, σ²), and developed a three-variable Gibbs sampler to explore the resulting intractable posterior density. (See Khare and Hobert (2012) for a theoretical analysis of that Gibbs sampler.) If, instead of the proper prior used by Kozumi and Kobayashi (2011), we use the improper prior π(β, σ²) = (σ²)^{−(a+1)/2} I_{R_+}(σ²), then the posterior is proper if

m_r(y, X, a) := ∫_{R_+} ∫_{R^p} f_{X,r}(y|β, σ²) π(β, σ²) dβ dσ² < ∞ .
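To make the connection between the check function ρ_r and the asymmetric Laplace density concrete, here is a small sketch (ours, for illustration only); note that setting r = 1/2 gives log d(ε; 1/2) = log(1/4) − |ε|/2, the Laplace density of the Introduction.

import numpy as np

def rho(u, r):
    # Check (quantile-loss) function: rho_r(u) = u * (r - I(u <= 0)).
    u = np.asarray(u, dtype=float)
    return u * (r - (u <= 0))

def asym_laplace_logpdf(eps, r):
    # log d(eps; r) = log(r (1 - r)) - rho_r(eps), matching the density displayed above.
    return np.log(r * (1.0 - r)) - rho(eps, r)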

The following result allows us to reuse Proposition 1 to establish necessary and sufficient conditions for posterior propriety under the more general model with asymmetric Laplace errors.

Lemma 1. Fix X, y and r ∈ (1/2, 1). For all (β, σ²) ∈ R^p × R_+,

(4r)^n (1 − r)^n f_{X*,1/2}(y*|β, σ²) ≤ f_{X,r}(y|β, σ²) ≤ (4r)^n (1 − r)^n f_{X**,1/2}(y**|β, σ²) ,

where X* = 2rX, y* = 2ry, X** = 2(1 − r)X and y** = 2(1 − r)y. On the other hand, if r ∈ (0, 1/2), then the same inequalities hold when (X*, y*) and (X**, y**) are reversed.

Proof. We provide details for only one of the four inequalities, as the others are completely analogous. If r ∈ (1/2, 1), then r > 1 − r, and for each i ∈ {1, 2, . . . , n},

exp{ −(1/σ) ( y_i − x_i^T β ) [ r − I_{R_−}( y_i − x_i^T β ) ] }
 = exp{ (1/(2σ)) 2(1 − r)( y_i − x_i^T β ) } I_{R_−}( y_i − x_i^T β ) + exp{ −(1/(2σ)) 2r( y_i − x_i^T β ) } I_{R_+}( y_i − x_i^T β )
 ≤ exp{ (1/(2σ)) 2(1 − r)( y_i − x_i^T β ) } I_{R_−}( y_i − x_i^T β ) + exp{ −(1/(2σ)) 2(1 − r)( y_i − x_i^T β ) } I_{R_+}( y_i − x_i^T β )
 = exp{ −(1/(2σ)) 2(1 − r) | y_i − x_i^T β | } .

Therefore,

f_{X,r}(y|β, σ²) = ( r^n (1 − r)^n / σ^n ) ∏_{i=1}^n exp{ −(1/σ) ( y_i − x_i^T β ) [ r − I_{R_−}( y_i − x_i^T β ) ] }
 ≤ ( 4^n r^n (1 − r)^n / σ^n ) (1/4^n) ∏_{i=1}^n exp{ −(1/(2σ)) 2(1 − r) | y_i − x_i^T β | }
 = (4r)^n (1 − r)^n f_{X**,1/2}(y**|β, σ²) .

Of course, multiplication of X by a positive constant does not alter its rank. Furthermore, y ∉ C(X) is equivalent to y* ∉ C(X*) and to y** ∉ C(X**). Hence, it follows immediately from Lemma 1 that Proposition 1 remains valid when we replace the Laplace errors by asymmetric Laplace errors. We state this as another Proposition.

Proposition 2. For any fixed r ∈ (0, 1), m_r(y, X, a) < ∞ if and only if X has full column rank, a > −n + p + 1 and y ∉ C(X).

Lastly, we note that Yu and Moyeed (2001) considered a restriction of the model described above in which the scale parameter, σ, is known and equal to 1, and the prior on β is flat. In this case, the posterior is proper when

m_r(y, X) := ∫_{R^p} f_{X,r}(y|β, 1) dβ < ∞ .

Yu and Moyeed (2001) claimed in their Theorem 1 that the posterior density is always proper. This is actually false. In fact, arguments similar to those used in the Appendix can be used in conjunction with Lemma 1 (with σ = 1) to show that m_r(y, X) < ∞ if and only if X has full column rank.

Appendix

In this section, we provide a proof of Proposition 1. Before presenting the proof, we state a useful equation that will be used a couple of times within the proof. Let ∥·∥ denote the usual Euclidean norm, and let M^− denote a generalized inverse of the matrix M. Setting β̂ = (X^T X)^− X^T y and P = X(X^T X)^− X^T, we have

Σ_{i=1}^n ( y_i − x_i^T β )² = (β − β̂)^T (X^T X)(β − β̂) + ∥(I − P) y∥² .

Of course, if rank(X) = p, then (X^T X)^− = (X^T X)^{−1}.

Proof of Proposition 1. We begin with necessity. Observe that

f(y|β, σ²) π(β, σ²) = c (σ²)^{−n/2} exp{ −(1/2) Σ_{i=1}^n | y_i − x_i^T β | / σ } (σ²)^{−(a+1)/2}
 ≥ c (σ²)^{−(a+n+1)/2} exp{ −(1/2) Σ_{i=1}^n (1/2) [ ( y_i − x_i^T β )²/σ² + 1 ] }
 = c′ exp{ −(β − β̂)^T (X^T X)(β − β̂)/(4σ²) } exp{ −∥(I − P) y∥²/(4σ²) } (σ²)^{−(a+n+1)/2} ,    (12)

where c and c′ are constants (in β and σ²) and the inequality follows from the fact that |x| ≤ (x² + 1)/2. Now by the spectral decomposition, we can write X^T X = Γ Λ Γ^T, where Γ is orthogonal and Λ is diagonal with its elements equal to the eigenvalues of X^T X. If X is not of full column rank, then X^T X has at least one eigenvalue equal to zero. Then by a change of variables with u = Γ^T (β − β̂), it follows that the integral of the right-hand side of (12) with respect to β diverges. Hence, the posterior is improper whenever rank(X) < p.


Now assume that rank(X) = p so that (X^T X)^− = (X^T X)^{−1}. Since

∫_{R^p} exp{ −(β − β̂)^T (X^T X)(β − β̂)/(4σ²) } dβ = (2π)^{p/2} (2σ²)^{p/2} |X^T X|^{−1/2} ,

we have

m(y) = ∫_{R_+} ∫_{R^p} f(y|β, σ²) π(β, σ²) dβ dσ² ≥ c′′ ∫_{R_+} (σ²)^{−(a+n−p−1)/2 − 1} exp{ −∥(I − P) y∥²/(4σ²) } dσ² .

But this integral diverges unless a > −n + p + 1 and ∥(I − P) y∥ > 0. Hence, the proof of necessity is complete.

Now for the sufficiency part. A routine calculation reveals that

∫_{R_+} exp{ −(1/2) Σ_{i=1}^n | y_i − x_i^T β | / σ } (σ²)^{−(n+a+1)/2} dσ² = 2^{a+n} Γ(a + n − 1) [ Σ_{i=1}^n | y_i − x_i^T β | ]^{−(n+a−1)} ,

and it follows that

m(y) = c 2^{a+n} Γ(a + n − 1) ∫_{R^p} [ Σ_{i=1}^n | y_i − x_i^T β | ]^{−(n+a−1)} dβ
 = c′′′ ∫_{R^p} [ ( Σ_{i=1}^n | y_i − x_i^T β | )² ]^{−(n+a−1)/2} dβ
 ≤ c′′′ ∫_{R^p} [ Σ_{i=1}^n ( y_i − x_i^T β )² ]^{−(n+a−1)/2} dβ
 = c′′′ ∫_{R^p} [ (β − β̂)^T (X^T X)(β − β̂) + ∥(I − P) y∥² ]^{−[(n−p+a−1)+p]/2} dβ
 = ( c′′′ / ∥(I − P) y∥^{n+a−1} ) ∫_{R^p} [ 1 + (1/α)(β − β̂)^T Σ^{−1} (β − β̂) ]^{−(α+p)/2} dβ ,    (13)

where c′′′ = c 2^{a+n} Γ(a + n − 1), α = n − p + a − 1 and Σ^{−1} = (n − p + a − 1)(X^T X)/∥(I − P) y∥². But the integrand in (13) is the kernel of a p-variate Student's t density with α degrees of freedom, and location and scale (matrix) equal to β̂ and Σ, respectively. Hence, m(y) < ∞ and the proof is complete.

Acknowledgment. The authors thank two anonymous reviewers for helpful comments and suggestions. The second author was supported by NSF Grant DMS-11-06395.

References

Conway, J. B. (1990). A Course in Functional Analysis. 2nd ed. Springer, New York.

Fernández, C. and Steel, M. F. J. (1999). Multivariate Student-t regression models: Pitfalls and inference. Biometrika, 86 153–167.

Flegal, J. M., Haran, M. and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23 250–260.

Hobert, J. P. and Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics, 36 532–554.

Hobert, J. P., Roy, V. and Robert, C. P. (2011). Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modelling. Statistical Science, 26 332–351.

Jones, G. L., Haran, M., Caffo, B. S. and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101 1537–1547.

Khare, K. and Hobert, J. P. (2011). A spectral analytic comparison of trace-class data augmentation algorithms and their sandwich variants. The Annals of Statistics, 39 2585–2606.

Khare, K. and Hobert, J. P. (2012). Geometric ergodicity of the Gibbs sampler for Bayesian quantile regression. Journal of Multivariate Analysis, 112 108–116.

Koenker, R. (2005). Quantile Regression. No. 38 in Econometric Society Monographs, Cambridge University Press, Cambridge.

Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation, 81 1565–1578.

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81 27–40.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2 13–25.

Rosenthal, J. S. (2003). Asymptotic variance and convergence rates of nearly-periodic MCMC algorithms. Journal of the American Statistical Association, 98 169–177.

Roy, V. and Hobert, J. P. (2010). On Monte Carlo methods for Bayesian multivariate regression models with heavy-tailed errors. Journal of Multivariate Analysis, 101 1190–1202.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics, 10 1–50.

West, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, Series B, 46 431–439.

Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54 437–447.

Yuan, Y. and Yin, G. (2010). Bayesian quantile regression for longitudinal studies with nonignorable missing data. Biometrics, 66 105–114.