
MCMC Methods in Wavelet Shrinkage: Non-Equally Spaced Regression, Density and Spectral Density Estimation

Peter Müller and Brani Vidakovic

Abstract

We consider posterior inference in wavelet based models for non-parametric regression with unequally spaced data, density estimation and spectral density estimation. The common theme in all three applications is the lack of posterior independence for the wavelet coefficients d_jk. In contrast, most commonly considered applications of wavelet decompositions in statistics are based on a setup which implies a posteriori independent coefficients, essentially reducing the inference problem to a series of univariate problems. This is generally true for regression with equally spaced data, image reconstruction, density estimation based on smoothing the empirical distribution, time series applications and deconvolution problems. We propose a hierarchical mixture model as the prior probability model on the wavelet coefficients. The model includes a level-dependent positive prior probability mass at zero, i.e., for vanishing coefficients. This implements wavelet coefficient thresholding as a formal Bayes rule. For non-zero coefficients we introduce shrinkage by assuming normal priors. Allowing a different prior variance at each level of detail, we obtain level-dependent shrinkage for non-zero coefficients. We implement inference in all three proposed models by a Markov chain Monte Carlo scheme which requires only minor modifications across the different applications. Allowing zero coefficients requires simulation over a variable dimension parameter space (Green 1995). We use a pseudo-prior mechanism (Carlin and Chib 1995) to achieve this.

Keywords: Density estimation; Dependent posterior; Mixture prior; Non-equally spaced regression; Spectral density estimation.


1 Introduction

Many statistical inference problems can be thought of as estimating a random function f(x), conditional on data generated from a sampling model which in some form involves f. In particular, this view is natural for non-linear regression, density estimation and spectral density estimation, where f has the interpretation of the unknown mean function, probability density function, and spectral density, respectively. Bayesian inference in these problems requires a prior probability model for the unknown function f. Sometimes restriction to a low dimensional parametric model for f is not desirable. Common reasons are, for example, that there is no obvious parametric form for a non-linear regression function, or that little is known about the shape of the unknown density in a density estimation problem. This calls for non-parametric methods, i.e., parameterizations of the unknown function f with infinitely many parameters, allowing a priori a wide range of possible functions. Wavelet bases provide one way of formalizing this. A wavelet decomposition allows representation of any square integrable function f(x) as

f(x) = \sum_{k \in Z} c_{J_0 k} \phi_{J_0 k}(x) + \sum_{j \ge J_0} \sum_{k \in Z} d_{jk} \psi_{jk}(x),   (1)

i.e., a parameterization of f in terms of the wavelet coefficients \theta = (c_{J_0 k}, d_{jk}; j \ge J_0, k \in Z). Here \psi_{jk}(x) = 2^{j/2} \psi(2^j x - k) and \phi_{jk}(x) = 2^{j/2} \phi(2^j x - k) are the wavelets and scaling functions at level of detail j and shift k. When we want to emphasize the dependence of f(x) on the wavelet coefficients we will write f_\theta(x). Without loss of generality we will in the following discussion assume J_0 = 0. See Vidakovic and Müller (1999) and Marron (1999) for a discussion of basic facts related to wavelet representations.

Perhaps the most common application of (1) in statistical modeling is non-linear regression, where f(x) represents the unknown mean response E(y|x) for an observation with covariate x. Chipman, Kolaczyk and McCulloch (1997), Clyde, Parmigiani and Vidakovic (1998), and Vidakovic (1998) discuss Bayesian inference in such models assuming equally spaced data, i.e., covariate values x_i on a regular grid. For equally spaced data the discrete wavelet transformation is orthogonal. Together with the assumptions of independent measurement errors and a priori independent wavelet coefficients, this leads to posterior independence of the d_jk. Thus the problem essentially reduces to a series of univariate problems, one for each wavelet coefficient. See the chapter by Yau and Kohn (1999) in this volume for a review.

Generalizations of wavelet techniques to non-equally spaced (NES) designs impose additional conceptual and computational burdens. A reasonable approximation is to bin observations into equally spaced bins and proceed as in the equally spaced case. If only a few observations are missing to complete an equally spaced grid, treating these few as missing

data leads to efficient implementations (Antoniadis, Gregoire and McKeague 1994; Cai and Brown 1997; Hall and Turlach 1997; Sardy et al. 1997). Alternatively, we will propose in this chapter an approach which does not rely on posterior independence.

Density estimation is concerned with the problem of estimating an unknown probability density f(x) from a sample x_i ~ f(x), i = 1, ..., n. Donoho et al. (1996) propose to consider a wavelet decomposition of f(x) = f_\theta(x) and to estimate the coefficients d_jk by applying thresholding and shrinkage to the empirical coefficients \hat d_{jk} = (1/n) \sum_i \psi_{jk}(x_i) and \hat c_{jk} = (1/n) \sum_i \phi_{jk}(x_i). This is justified by the fact that, as coefficients of an orthonormal basis, the wavelet coefficients d_jk are the inner products of f(x) and \psi_{jk}(x). The empirical coefficients \hat d_{jk} and \hat c_{jk} are simply method-of-moments estimators. Similarly, Delyon and Juditsky (1995), Hall and Patil (1995), Pinheiro and Vidakovic (1997) and Vannucci (1995) considered wavelet based density estimation from a classical and data analytic perspective. Instead, in this chapter we propose density estimation as formal posterior inference on the wavelet coefficients, using the exact likelihood x_i ~ f_\theta(x). Using the correct likelihood instead of smoothing empirical coefficients comes at the cost of losing posterior independence for the d_jk.

Wavelet based methods have been applied to the problem of spectral density estimation by Lumeau et al. (1992), Gao (1997) and Moulin (1994). The periodogram I(\omega) provides an unbiased estimate of the spectral density f(\omega) for a weakly stationary time series. But because its sampling variance does not decrease with sample size, it is important to introduce some notion of smoothness for the unknown spectral density. Popular methods for spectral density estimation achieve this by data analytic smoothing of the raw periodogram. In this chapter we propose an approach based on a wavelet representation (1) of the spectral density f(\omega) and posterior inference on the unknown coefficients d_jk. Again, the problem does not fit into the usual conjugate, a posteriori independent framework.

For all three problems we introduce a hierarchical mixture prior model p(\theta) on the wavelet coefficients d_jk. The model allows for posterior thresholding, i.e., vanishing coefficients d_jk = 0, and posterior shrinkage of non-zero coefficients. The prior model includes separate variance parameters at different levels of detail j, implying differential shrinkage across j. We discuss a Markov chain Monte Carlo posterior simulation scheme which implements inference in all three applications.
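As an illustration of the empirical coefficients discussed above, the following sketch computes \hat d_{jk} and \hat c_{jk} under the Haar basis; the Beta sample merely stands in for data from an unknown density on [0, 1] and is not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=500)   # stand-in sample from an unknown density on [0, 1]

def haar_psi(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1), -1.0, 0.0))

def haar_phi(t):
    """Haar scaling function: indicator of [0, 1)."""
    return np.where((t >= 0) & (t < 1), 1.0, 0.0)

def d_hat(x, j, k):
    """Method-of-moments estimate d_hat_jk = (1/n) sum_i psi_jk(x_i)."""
    return float(np.mean(2 ** (j / 2) * haar_psi(2 ** j * x - k)))

def c_hat(x, j, k):
    """Method-of-moments estimate c_hat_jk = (1/n) sum_i phi_jk(x_i)."""
    return float(np.mean(2 ** (j / 2) * haar_phi(2 ** j * x - k)))

coefs = {(j, k): d_hat(x, j, k) for j in range(4) for k in range(2 ** j)}
```

Because the Haar functions are piecewise constant, each \hat d_{jk} simply compares the fraction of observations in the left and right halves of the dyadic interval indexed by (j, k).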

2 The Prior Model

We put a prior probability model on the random function f(x) in (1) by defining a probability model for the coefficients d_jk and c_0k. To allow wavelet coefficient thresholding, the model needs to include a positive point-mass at zero. Thus we use as a marginal prior distribution for each d_jk a mixture model with a point-mass at zero and a continuous distribution for non-zero values. A convenient notation for such a mixture is to introduce indicators s_{jk} \in \{0, 1\} and to replace the coefficients d_jk in (1) by the product s_{jk} \cdot d_{jk}; that is, d_jk now denotes the value of the coefficient when it is non-zero:

f(x) = \sum_{k \in Z} c_{0k} \phi_{0k}(x) + \sum_{j \ge 0} \sum_{k \in Z} s_{jk} d_{jk} \psi_{jk}(x).   (2)
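To make the role of the indicators concrete, here is a minimal sketch of evaluating the thresholded representation (2) with the Haar basis and a single scaling term; all coefficient values are arbitrary illustrations:

```python
import numpy as np

def haar_phi(x):
    """Haar scaling function: indicator of [0, 1)."""
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def haar_psi(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1), -1.0, 0.0))

def f_theta(x, c00, s, d):
    """Evaluate (2) with one scaling term (k = 0):
    f(x) = c00 * phi(x) + sum_{j,k} s_jk * d_jk * psi_jk(x),
    where psi_jk(x) = 2^{j/2} psi(2^j x - k)."""
    out = c00 * haar_phi(x)
    for (j, k), djk in d.items():
        if s[j, k]:   # terms with s_jk = 0 are thresholded out
            out = out + djk * 2 ** (j / 2) * haar_psi(2 ** j * x - k)
    return out

x = np.linspace(0.0, 1.0, 4, endpoint=False)     # [0, 0.25, 0.5, 0.75]
s = {(0, 0): 1, (1, 0): 0, (1, 1): 1}
d = {(0, 0): 0.5, (1, 0): 9.9, (1, 1): -0.25}    # (1, 0) is thresholded out
f = f_theta(x, 1.0, s, d)
```

Setting s_{jk} = 0 removes the corresponding term regardless of the stored value of d_jk, which is exactly the thresholding behavior the indicators encode.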

An important rationale for using wavelet bases in place of alternative function bases is that wavelet decompositions allow parsimonious representations, i.e., only a few non-zero coefficients suffice to closely approximate many functions. Thus our model p(\theta) puts progressively smaller prior probability \pi_j on s_{jk} = 1 at higher levels of detail:

\pi_j = Pr(s_{jk} = 1) = \alpha^j.   (3)

Given that d_jk does not vanish, we assume a normal prior with level-dependent variance

p(d_{jk} | s_{jk} = 1) = N(0, \sigma^2 r_j), with r_j = 2^{-j}.   (4)

The scale factor r_j compensates for the factor 2^{j/2} in the definition of \psi_{jk}(x). In many discussions of wavelet shrinkage and thresholding this extra scale factor r_j does not appear (for example, Vidakovic 1998). This is because, instead of explicitly modeling \pi_j, decreasing prior probabilities for non-zero (or large) wavelet coefficients at higher levels of detail are implemented implicitly, in the following sense. Shrinkage rules can be thought of as posterior mean functions. For example, in a regression problem a shrinkage rule gives the posterior mean of d_jk as a function of the corresponding coefficient \hat d_{jk} in the discrete wavelet transform of the data (the empirical wavelet coefficient). The typical non-linear shrinkage rule (compare Vidakovic 1998, Figure 4) shrinks small coefficients significantly more strongly than large values; because of the scaling factor 2^{j/2} in the definition of \psi_{jk}, the (true) fine detail coefficients tend to be small, i.e., they are a posteriori shrunk more strongly than coarse level coefficients. Rescaling with a factor like r_j would shrink all coefficients equally and defeat this, which explains why no factor like r_j appears in the usual thresholding rules. In contrast, we explicitly specify geometrically decreasing prior probabilities \pi_j for non-zero coefficients. Together, the scaling factors r_j and the probabilities \pi_j = \alpha^j specify prior information on the rate of decay in the magnitudes of the wavelet coefficients and thus, indirectly, on the smoothness of f.
More general choices for \pi_j and r_j are possible. Abramovich, Sapatinas and Silverman (1998) adopt \pi_j = min[1, C_2 (1/2)^{\beta j}] and \sigma_j^2 = C_1 (1/2)^{\alpha j}, with C_1 and C_2 determined in an empirical Bayes fashion. Chipman, Kolaczyk and McCulloch (1997) take \pi_j \propto f_j (1/2)^j, where f_j is the fraction of empirical wavelet coefficients at level j greater than a certain cutoff.

We complete the model with a prior on c_0k and hyperpriors for \alpha and \sigma:

c_{0k} \sim N(0, \sigma^2 r_0), \quad \alpha \sim Beta(a, b), \quad 1/\sigma^2 \sim Ga(a_\sigma, b_\sigma).   (5)

Equations (3)-(5) together define the prior probability model

p(\alpha, \sigma, c_{0k}, s_{jk}, d_{jk}) = p(\alpha, \sigma) \cdot \prod_k p(c_{0k} | \sigma) \cdot \prod_{j,k} p(s_{jk} | \alpha) \cdot \prod_{s_{jk}=1} p(d_{jk} | s_{jk} = 1, \sigma),

where the last product includes only the coefficients with s_{jk} = 1.
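A draw from the prior (3)-(5) can be simulated directly; the following sketch is illustrative, with hyperparameter values chosen arbitrarily rather than taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_prior(a=1.0, b=1.0, a_sig=2.0, b_sig=2.0, J=5):
    """One draw from the prior (3)-(5); hyperparameter values are illustrative."""
    alpha = rng.beta(a, b)                          # alpha ~ Beta(a, b)
    sigma2 = 1.0 / rng.gamma(a_sig, 1.0 / b_sig)    # 1/sigma^2 ~ Ga(a_sigma, b_sigma)
    c00 = rng.normal(0.0, np.sqrt(sigma2))          # c_00 ~ N(0, sigma^2 r_0), r_0 = 1
    s, d = {}, {}
    for j in range(J):
        pi_j = alpha ** j                           # equation (3)
        r_j = 2.0 ** (-j)                           # scale factor in (4)
        for k in range(2 ** j):
            s[j, k] = int(rng.random() < pi_j)
            # d_jk | s_jk = 1 ~ N(0, sigma^2 r_j); only simulated when s_jk = 1
            d[j, k] = rng.normal(0.0, np.sqrt(sigma2 * r_j)) if s[j, k] else 0.0
    return alpha, sigma2, c00, s, d

alpha, sigma2, c00, s, d = draw_prior()
```

Repeating such draws and plotting the corresponding functions f_\theta(x) is a quick way to check whether the implied prior on f matches one's beliefs about smoothness and sparsity.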

3 Estimation

The applications to regression, density and spectral density estimation require entirely different likelihood specifications. Still, the implementation of posterior simulation is very similar in all three cases. In this section we discuss issues which are common to all three applications.

3.1 Pseudo Prior

To implement posterior inference in the proposed models we use Markov chain Monte Carlo (MCMC) simulation. See, for example, Tierney (1994) or Smith and Roberts (1993) for a discussion of MCMC posterior simulation. The proposed prior probability model for the wavelet coefficients includes a variable dimension parameter vector: depending on the indicators s_jk, a variable number of coefficients d_jk are included. Green (1995) and Carlin and Chib (1995) proposed MCMC variations to address such variable dimension parameter problems, known as reversible jump MCMC and pseudo prior MCMC, respectively. Both introduce a mechanism to propose a value for additional parameters if, in the course of the simulation, an augmentation of the parameter vector is proposed. In the following discussion we will follow Carlin and Chib (1995), but note that there is an obvious corresponding reversible jump algorithm. In fact, Dellaportas, Forster and Ntzoufras (1997) show how a pseudo prior algorithm can always be converted to a reversible jump algorithm, and vice versa.

Carlin and Chib (1995) propose to artificially augment the probability model with priors ("pseudo priors") on parameters which are not currently included in the model ("latent parameters"). In the context of model (3)-(5) this requires adding a pseudo prior h(d_{jk}) = p(d_{jk} | s_{jk} = 0) for those coefficients d_jk which are currently not included in the likelihood. We will discuss below a specific choice for h. The pseudo prior augments the probability model to

p(\alpha, \sigma) \prod_k p(c_{0k} | \sigma) \prod_{j,k} p(s_{jk} | \alpha) \prod_{s_{jk}=1} p(d_{jk} | s_{jk} = 1, \sigma) \prod_{s_{jk}=0} \underbrace{p(d_{jk} | s_{jk} = 0)}_{h(d_{jk})}.

In the artificially augmented probability model all parameters are always included, and the problem of the varying dimension parameter vector is removed. In choosing the pseudo prior it is desirable to specify h such that we generate values for the latent parameters which favor frequent transitions between the different subspaces, i.e., frequent changes of s_jk. This is achieved by choosing for h(d_{jk}) an approximation of p(d_{jk} | s_{jk} = 1, data). In the current implementation we choose

p(d_{jk} | s_{jk} = 0) = N(\hat d_{jk}, \hat\sigma_{jk}),   (6)

with \hat d_{jk} a rough preliminary estimate of d_jk, for example the corresponding coefficient in a wavelet decomposition of an initial estimate \hat f(x) of the unknown function, and \hat\sigma_{jk} some initial guess of the marginal posterior variance of d_jk, for example r_j \hat\sigma^2, using the prior mean 1/\sigma^2 = a_\sigma / b_\sigma. The choice of \hat f(x) depends on the application; see Section 3.3 for details. In all three applications, after an initial burn-in period of T_0 iterations, we replace \hat d_{jk} and \hat\sigma_{jk} by the ergodic mean and variance of the imputed values of d_jk over the burn-in period.

A good choice of the pseudo prior h is important for an efficient implementation of the MCMC simulation. Still, note that any choice, subject only to some technical constraints, results in an MCMC scheme with exactly the desired posterior distribution as its asymptotic distribution. The choice of the pseudo prior alters the simulated Markov chain, but not the asymptotic distribution.
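The burn-in adaptation of the pseudo prior described above can be sketched as follows; the class and its names are illustrative, not part of the chapter's implementation:

```python
import numpy as np

class PseudoPrior:
    """Tracks the running mean and variance of imputed d_jk values during
    burn-in, used as the moments of the pseudo prior h(d_jk) = N(d_hat, v_hat)
    as in (6). Starts from initial guesses (e.g. from a wavelet decomposition
    of a preliminary estimate f_hat)."""

    def __init__(self, d0, v0):
        self.mean, self.var = d0, v0          # initial guesses
        self.n, self._sum, self._sumsq = 0, 0.0, 0.0

    def record(self, d):
        """Call once per burn-in iteration with the current imputed d_jk."""
        self.n += 1
        self._sum += d
        self._sumsq += d * d
        if self.n >= 2:                       # switch to ergodic moments
            self.mean = self._sum / self.n
            self.var = self._sumsq / self.n - self.mean ** 2

    def draw(self, rng):
        """Sample a latent coefficient d_jk ~ h(d_jk)."""
        return rng.normal(self.mean, np.sqrt(max(self.var, 1e-12)))

    def logpdf(self, d):
        """log h(d_jk), needed for the update of the indicator s_jk."""
        v = max(self.var, 1e-12)
        return -0.5 * (np.log(2 * np.pi * v) + (d - self.mean) ** 2 / v)
```

One such object per coefficient d_jk suffices; after burn-in the moments are simply frozen.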

3.2 An MCMC Posterior Simulation Scheme

We describe an MCMC simulation scheme to implement posterior inference in model (3)-(5). Starting with some initial values for s_jk, d_jk, c_0k, \alpha and \sigma, we proceed by updating each of these parameters, one at a time. See the next section for comments on implementation details and initial values. In the following discussion we write \theta = (s_{jk}, d_{jk}, c_{0k}, \alpha, \sigma) for the parameter vector, Y for the observed data, and p(Y | \theta) generically for the likelihood. The specific form of p(Y | \theta) depends on the application and will be noted later. We denote by \theta_{(-x)} the parameter vector without parameter x.

1a. To update d_jk for latent coefficients, i.e., coefficients with s_jk = 0, generate d_{jk} \sim h(d_{jk}). Note that by definition of the pseudo prior, h(d_{jk}) = p[d_{jk} | Y, \theta_{(-d_{jk})}] is the complete conditional posterior distribution of d_jk.


1b. To update d_jk when s_jk = 1, we use a Metropolis-Hastings step. Generate a proposal \tilde d_{jk} \sim g_d(\tilde d_{jk} | d_{jk}). Use, for example, g_d(\tilde d_{jk} | d_{jk}) = N(d_{jk}, 0.25 \hat\sigma_{jk}), where \hat\sigma_{jk} is some rough estimate of the posterior standard deviation of d_jk. Compute

a(d_{jk}, \tilde d_{jk}) = min{1, p(Y | \tilde\theta) p(\tilde d_{jk}) / [p(Y | \theta) p(d_{jk})]},

where \tilde\theta is the parameter vector \theta with d_jk replaced by \tilde d_{jk}, and p(d_{jk}) is the p.d.f. of the normal prior distribution given in (4). With probability a(d_{jk}, \tilde d_{jk}) replace d_jk by \tilde d_{jk}; otherwise keep d_jk unchanged.

2. To update the indicators s_jk, we replace each s_jk by a draw from the complete conditional posterior distribution p[s_{jk} | Y, \theta_{(-s_{jk})}]. Denote by \theta^0 and \theta^1 the parameter vector with s_jk replaced by 0 and 1, respectively. Compute p_0 = p(Y | \theta^0) \cdot (1 - \pi_j) h(d_{jk}) and p_1 = p(Y | \theta^1) \cdot \pi_j p(d_{jk} | s_{jk} = 1). With probability p_1 / (p_0 + p_1) set s_jk = 1; otherwise set s_jk = 0.

3. To update c_00, generate a proposal \tilde c_{00} \sim g_c(\tilde c_{00} | c_{00}). Use, for example, g_c(\tilde c_{00} | c_{00}) = N(c_{00}, 0.25 \hat\sigma_{00}), where \hat\sigma_{00} is some rough estimate of the posterior standard deviation of c_00. Analogously to step 1b, compute an acceptance probability a and replace c_00 with probability a. If included in the model, the c_0k, k \ne 0, are updated similarly. In our implementation we only have c_0k for k = 0, because of the constraint to [0, 1] discussed in Section 3.3.

4. Generate \tilde\alpha \sim g_\alpha(\tilde\alpha | \alpha) = N(\alpha, 0.12) and compute
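Steps 1a, 1b and 2 can be sketched for a single coefficient as follows; loglik is a toy placeholder for log p(Y | \theta), and the tuning constants and pseudo-prior moments are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def loglik(d, s):
    """Toy placeholder for log p(Y | theta); the real likelihood depends on
    the application (regression, density or spectral density estimation)."""
    return -0.5 * (s * d - 0.3) ** 2 / 0.1

def update_coefficient(d, s, pi_j, prior_sd, h_mean, h_sd):
    if s == 0:
        # Step 1a: a latent coefficient is redrawn from the pseudo prior h.
        d = rng.normal(h_mean, h_sd)
    else:
        # Step 1b: Metropolis-Hastings with a normal random-walk proposal.
        d_prop = rng.normal(d, 0.5 * prior_sd)   # tuning constant is illustrative
        log_a = (loglik(d_prop, 1) - loglik(d, 1)
                 - 0.5 * (d_prop ** 2 - d ** 2) / prior_sd ** 2)
        if np.log(rng.random()) < log_a:
            d = d_prop
    # Step 2: Gibbs draw of the indicator s_jk from its complete conditional.
    log_h = -0.5 * np.log(2 * np.pi * h_sd ** 2) - 0.5 * (d - h_mean) ** 2 / h_sd ** 2
    log_p = -0.5 * np.log(2 * np.pi * prior_sd ** 2) - 0.5 * d ** 2 / prior_sd ** 2
    p0 = np.exp(loglik(d, 0) + np.log(1 - pi_j) + log_h)   # weight of s_jk = 0
    p1 = np.exp(loglik(d, 1) + np.log(pi_j) + log_p)       # weight of s_jk = 1
    s = int(rng.random() < p1 / (p0 + p1))
    return d, s

d, s = 0.0, 0
for _ in range(200):
    d, s = update_coefficient(d, s, pi_j=0.5, prior_sd=1.0, h_mean=0.3, h_sd=0.3)
```

In a full implementation this update runs over all (j, k) within each sweep, followed by the updates of c_00, \alpha and \sigma in steps 3 and 4.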