Measures of Bayesian learning and identifiability in hierarchical models

A Note on Bayesian Learning and Identifiability in Hierarchical Models 1

Yang Xie and Bradley P. Carlin

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, U.S.A.

Corresponding author: Bradley P. Carlin Telephone: (612) 624-6646 Fax: (612) 626-0660 Email: [email protected] Address: Division of Biostatistics, MMC 303, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, U.S.A.

July 20, 2004

1 Yang Xie is Graduate Assistant and Bradley P. Carlin is Mayo Professor in Public Health, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, 55455. The work of the second author was supported in part by NIH grant 5-R01-ES07750-07, while that of both authors was supported in part by NIH grant 1-R01-CA95955-01. The authors are grateful to Mr. Brian Reich and Profs. Peter Bickel, Alan Gelfand, and James Hodges for helpful discussions and other assistance essential to this project's completion.

Abstract

Identifiability has long been an important concept in classical statistical estimation. Historically, Bayesians have been less interested in the concept since, strictly speaking, any parameter having a proper prior distribution is estimable. However, the larger statistical community's recent move toward more Bayesian thinking is largely fueled by an interest in Markov chain Monte Carlo-based analyses using vague or even improper priors. As such, Bayesians have been forced to think more carefully about what has been learned about the parameters of interest (given the data so far), or what could possibly be learned (given an infinite amount of data). In this paper, we propose measures of Bayesian learning based on differences in precision and Kullback-Leibler divergence. After investigating them in the context of some familiar Gaussian linear hierarchical models, we consider their use in a more challenging setting involving two sets of random effects (traditional and spatially arranged), only the sum of which is identified by the data. We illustrate this latter model with an example from periodontal data analysis, where the spatial aspect arises from the proximity of various measurements taken in the mouth. Our results suggest our measures behave sensibly and may be useful in even more complicated (e.g., non-Gaussian) model settings.

Key words: Conditionally autoregressive (CAR) model; Markov chain Monte Carlo (MCMC); nonidentifiability; noninformativity; spatial statistics.

1 Introduction

In the classical framework, statistical inference depends only on the likelihood function, while in the Bayesian framework, inferences are drawn based on both the likelihood and the prior distribution of the unknown model parameters. Bayesians have long compared the posterior distribution of some parameter of interest to its prior distribution, and referred to this process as Bayesian updating or Bayesian learning. This raises the corresponding question of how much we could ever

learn about a model parameter (say, given an infinite amount of data). Bayesian learning and identifiability are two different concepts, but they are closely related. Before going into details, we review the notion of identifiability. Identification is a property of the likelihood function, and thus is the same for both frequentist and Bayesian approaches (c.f. Kadane, 1974). Intuitively speaking, if the available data shed no light on a parameter (say, θ), then we say θ is unidentifiable. More formally, if two different values of θ give rise to the same likelihood for the observed data, then θ is unidentifiable. In order to resolve the unidentifiability problem, additional non-data-based information is required. Thus for a frequentist, the only solution is to add constraints to the parameter space to make θ identifiable, while a Bayesian has the choice between adding such constraints, and placing proper prior distributions on the unidentified parameters. Indeed, Lindley (1971) famously observed that unidentifiability causes no real difficulties for Bayesian analysis since it is always possible to resolve the issue via proper priors. But as Poirier (1998) pointed out, "there is no Bayesian free lunch," since even with this solution it can still be the case that the prior and posterior distributions are identical, in which case we did not learn anything from the data; all the information in the posterior is actually coming from the prior. Dawid (1979) gave a formal definition of Bayesian (i.e., conditional) identifiability that is closely related to Poirier's (1998) notion of conditional uninformativity of the data. Given data y, consider the likelihood L(θ, φ; y) and the prior p(θ, φ). If p(θ | φ, y) = p(θ | φ), then y is conditionally uninformative for θ given φ. But this does not mean that there is no Bayesian learning about θ; this occurs only if p(θ | y) = p(θ), i.e., the data are marginally uninformative for θ. Poirier (1998)

offers a detailed investigation of conditional and marginal uninformativity, and also proves that if θ is unidentifiable in the traditional sense (i.e., if L(θ⁽¹⁾; y) = L(θ⁽²⁾; y) for θ⁽¹⁾ ≠ θ⁽²⁾), then the data are always conditionally uninformative for θ given φ. Many authors (e.g., Poirier, 1998; Gelfand and Sahu, 1999; Eberly and Carlin, 2000) have observed that the unidentifiability of a parameter does not preclude Bayesian learning about it. But there has been relatively little consideration given to how much we could possibly hope to learn about an unidentifiable parameter from the data, how to quantify the information we obtain, or the important factors affecting this learning process. In this paper, we will address these questions in the context of a variety of Gaussian hierarchical linear models. We are particularly interested in multiple random effects models for areal (regional summary) data, due to their well-known identifiability difficulties. We will use a combination of analytic and numeric methods to illustrate our approach, adding simulations and real data illustrations as necessary.

The remainder of our paper is organized as follows. Section 2 describes a method for using precisions (reciprocal variances) and Kullback-Leibler divergences to quantify the learning process for both the frequentist and Bayesian outlooks on the problem. It also addresses the issues of improper priors and computation for more complex models. Section 3 considers application of these principles to a class of Gaussian linear hierarchical models, many of which allow closed form solutions and thus straightforward understanding of the concepts involved. Section 4 suggests and evaluates methods for density estimation in cases where the closed forms required by our Bayesian learning measures are not available. We then turn to more advanced spatial models with multiple random effects, introducing our methods in this context in Section 5, and illustrating with a periodontal dataset (Reich, Hodges and Carlin, 2004) in Section 6. Finally, Section 7 discusses our findings and suggests directions for future research.


2 Measuring Bayesian learning

Suppose as in the previous section that our statistical model features two parameter vectors: an identified parameter, φ, and an unidentified parameter, θ. Let y = (y_1, …, y_n)′, and assume the "usual" regularity conditions for the convergence of the posterior to a point mass at φ_0, the "true" value of φ, as more and more data are accumulated. In symbols, we have that p(φ | y) → I(φ = φ_0) as n → ∞, where I denotes the indicator function. This in turn implies

p(θ | y) = ∫ p(θ | φ, y) p(φ | y) dφ = ∫ p(θ | φ) p(φ | y) dφ → p(θ | φ_0) as n → ∞.

Thus p(θ | φ) evaluated at the true value φ_0 provides an upper bound on how much one could possibly learn about the unidentified θ (since with an infinite amount of data, we would know φ exactly). This motivates characterizing the potential for Bayesian learning about θ in terms of the change from the prior p(θ) to the conditional prior p(θ | φ). If we go on to observe some data y, the posterior p(θ | y) offers a distribution that is typically intermediate to these two in terms of learning, and now the change from p(θ | y) to p(θ | φ) becomes relevant. The remainder of this section suggests a few methods for measuring these changes.

2.1 Precision

For normal distributions, the precision (reciprocal of the variance) offers a sensible univariate summary of information. For example, we might define the potential learning difference, Δ, as

Δ = Var⁻¹(θ | φ) − Var⁻¹(θ).   (1)

Once y is observed, we might measure how much more we can learn about θ given what we already have learned by defining the remaining learning difference, Δ_y, as

Δ_y = Var⁻¹(θ | φ) − Var⁻¹(θ | y),   (2)

and the remaining learning ratio, R_y, as

R_y = Δ_y / Δ.   (3)

R_y is then the fraction of the most we could learn about θ that remains unlearned given the true value of φ and the current y.

Example: Now we consider a simple hierarchical model to illustrate the basic ideas. Suppose that Y | θ, β ~ N(θ + β, σ²), θ ~ N(μ_θ, τ²), and β ~ N(μ_β, ω²). Define η = θ + β, so in this model η is identifiable, but both θ and β are unidentifiable. Assuming prior independence of θ and β, the prior distribution for η emerges as η ~ N(μ_θ + μ_β, τ² + ω²). Routine calculation then yields the conditional and marginal posterior distributions for θ as

θ | η, y = θ | η ~ N( [ω² μ_θ + τ²(η − μ_β)] / (τ² + ω²), τ²ω² / (τ² + ω²) )

and

θ | y ~ N( [(ω² + σ²)μ_θ + τ²(y − μ_β)] / (τ² + ω² + σ²), τ²(ω² + σ²) / (τ² + ω² + σ²) ).

From (1), the potential learning difference emerges as Δ = 1/ω², the prior precision of β. Thus the most we can learn about θ depends on what we already know about β. On the other hand, from (2) the remaining learning difference is Δ_y = σ² / [ω²(ω² + σ²)]. Thus once we've seen the data y, the most we can learn about θ depends on how much we know about β and how informative the data are. Finally, equation (3) yields the remaining learning ratio R_y = σ² / (ω² + σ²). This

expression is bounded between 0 and 1; it is close to 0 (i.e., there is little left to learn about θ) if ω² is large relative to σ² (i.e., the data are much more informative than the prior on β). The remaining learning ratio is close to 1 (i.e., we have much left to learn about θ) if ω² is small relative to σ² (i.e., the data are weak relative to our initial understanding of β). Of course, in more general settings, these learning potentials will depend on φ and y, the former of which will never be known in practice. Thus we may need to estimate φ (say, by its posterior mean) in order to compute these quantities for any given dataset. However, before proceeding to data illustration, we need to determine if our learning potentials perform well in a theoretical sense. That is, we wish to show that they really do seem to capture learning potential in settings where the measurement of learning is uncontroversial. We can do this by simulating the performance of our measures for a given hierarchical model. Here we could follow standard practice (see e.g. Carlin and Louis, 2000, Ch. 4) and judge performance from either a frequentist or an "empirical Bayes" (EB; also called "preposterior") point of view. In the former case we assume the true φ value to be fixed and known, while in the latter we assume φ (like the data y) is random and drawn from its prior distribution. To clarify our simulation approach, suppose we are in the design stage of our experiment, and we wish to anticipate the value of a data sample of size n, Y = (Y_1, …, Y_n)′, in learning about θ.

Adopting a frequentist outlook, we would first assume that φ is fixed at φ_0, the true value. We would then estimate Bayesian learning potentials by Monte Carlo as follows:

1. Fix φ = φ_0 and draw Y_b iid from p(y | φ_0), b = 1, …, B;

2. For each b, compute Δ_{φ_0, y_b};

3. Compare Δ_{φ_0} to Δ̄_{φ_0, y} = (1/B) Σ_{b=1}^B Δ_{φ_0, y_b}.

We would likely then repeat these three steps for several different values of φ_0, in order to understand

learning potential for a variety of "true" values of the underlying identified parameter. Adopting an empirical Bayesian point of view, here we instead wish to average over the variability in both φ and Y. Thus we may estimate learning potentials by Monte Carlo as follows:

1. Sample φ_b values from its hyperprior p(φ), or else simply set φ_b = φ_0 for all b, b = 1, …, B;

2. Draw θ_b iid from p(θ | φ_b), followed by Y_b iid from p(y | θ_b, φ_b), b = 1, …, B;

3. For each b, compute Δ_{φ_b} and Δ_{φ_b, y_b};

4. Compare Δ̄ = (1/B) Σ_{b=1}^B Δ_{φ_b} to Δ̄_y = (1/B) Σ_{b=1}^B Δ_{φ_b, y_b}.

Of course, for reasonably complex models, each individual Δ_{φ_b} and Δ_{φ_b, y_b} may require Markov chain Monte Carlo (MCMC) methods to obtain; we return to this subject in the next subsection.
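As a concrete illustration, the frequentist recipe above can be sketched for the simple normal example of Subsection 2.1, where the learning measures have closed forms. The hyperparameter values below are hypothetical, and in this conjugate model the remaining learning difference happens not to depend on the observed y, so the loop mainly illustrates the bookkeeping:

```python
import random, math

# Frequentist Monte Carlo sketch of steps 1-3 for the normal example of
# Subsection 2.1 (hypothetical hyperparameter values).
mu_theta, mu_beta = 0.0, 0.0       # assumed prior means of theta and beta
tau2, omega2, sigma2 = 4.0, 2.0, 1.0

def remaining_learning_diff(y):
    """Delta_y = Var^{-1}(theta|eta) - Var^{-1}(theta|y); in this Gaussian
    model the closed form sigma^2/[omega^2 (omega^2+sigma^2)] does not
    actually depend on the observed y."""
    var_theta_given_eta = tau2 * omega2 / (tau2 + omega2)
    var_theta_given_y = tau2 * (omega2 + sigma2) / (tau2 + omega2 + sigma2)
    return 1.0 / var_theta_given_eta - 1.0 / var_theta_given_y

random.seed(1)
eta0 = mu_theta + mu_beta          # step 1: fix an illustrative "true" eta
B = 200
lam = []
for _ in range(B):                 # draw Y_b ~ p(y | eta0)
    y_b = random.gauss(eta0, math.sqrt(sigma2))
    lam.append(remaining_learning_diff(y_b))   # step 2
lam_bar = sum(lam) / B             # step 3: average remaining differences
delta = 1.0 / omega2               # potential learning difference Delta
R_y = lam_bar / delta              # estimated remaining learning ratio
print(round(R_y, 3), round(sigma2 / (omega2 + sigma2), 3))
```

With these settings the estimated ratio reproduces the closed form R_y = σ²/(ω² + σ²), as it must in this fully conjugate case.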

2.2 Kullback-Leibler divergence

Using normal distributions with known variances, precision is always additive, so that (1) and (2) will be positive. However, one can devise counterexamples where this will not be the case. For example, suppose that Y | θ ~ N(θ, 1) and θ | φ ~ N(0, 100 − |φ|). Then for any prior for φ on the interval (−100, 100), we have that Var(θ) = E(Var(θ | φ)) + Var(E(θ | φ)) = E(100 − |φ|) + 0 ≤ 100 − |E(φ)| = Var(θ | E(φ)) by Jensen's inequality. Thus for φ_0 = E(φ), we have that Var(θ) ≤ Var(θ | φ_0), and thus Δ = Var⁻¹(θ | φ_0) − Var⁻¹(θ) ≤ 0. While this is obviously a rather pathological case, for many nonnormal distributions, posterior precision will not be the most sensible summary of information in a distribution. As such, we might instead measure the difference between p(θ | φ) and p(θ) (or p(θ | y)) using the Kullback-Leibler (KL) divergence. This quantity is widely used to measure the difference between

two probability distributions f and g, defined as

KL(f, g) = ∫_{−∞}^{∞} f(x) log [f(x) / g(x)] dx.

Then following the same logic as Subsection 2.1, we might define the potential (prior) learning divergence, D, as

D = KL(p(θ | φ), p(θ)) = ∫_{−∞}^{∞} p(θ | φ) log [p(θ | φ) / p(θ)] dθ,   (4)

and the remaining (posterior) learning divergence, D_y, as

D_y = KL(p(θ | φ), p(θ | y)) = ∫_{−∞}^{∞} p(θ | φ) log [p(θ | φ) / p(θ | y)] dθ.   (5)

Since these quantities also depend on both φ and y, the Monte Carlo algorithms described in the previous subsection may again be used to estimate their frequentist or EB expectations for a given model. However, given the data y, the integrals present in equations (4)-(5) may be difficult to evaluate. Worse, they require knowledge of the densities p(θ | y) and p(θ | φ), which for all but the simplest models will be unavailable in closed form. In this case, we can adopt an MCMC approach:

1. Run an ordinary MCMC sampler (see e.g. Carlin and Louis, 2000, Sec. 5.4) for θ and φ given the data y, obtaining marginal posterior samples {θ^(g), g = 1, …, G} from p(θ | y);

2. Estimate Var(θ | y) and the density function p(θ | y) from this sample;

3. Rerun the sampler in step 1, but this time deleting the data terms from the model and fixing φ = φ_0, so that what is produced at convergence is now a sample from p(θ | φ_0);

4. Estimate Var(θ | φ_0) and the density function p(θ | φ_0) from this second sample;

5. Compute Δ_{φ_0} and Δ_{φ_0, y}, or D_{φ_0} and D_{φ_0, y}, as appropriate.

If these calculations are nested within a simulation, we simply replace y by y_b (and φ_0 perhaps by φ_b, in the EB case) in the steps above. We defer specifics regarding density estimation and integral evaluation in (4)-(5) to the context of the models outlined in subsequent sections.

2.3 Improper priors

Recall that Poirier (1998) gave a formal connection between marginal and conditional uninformativity. In our framework, we can interpret this connection as follows: if φ (identifiable) and θ (unidentifiable) are a priori independent, then there is no Bayesian learning for θ. Thus if there is Bayesian learning for θ, then φ and θ must be a priori dependent. Hence all of the information regarding θ learned from the data y arrives through the connection between θ and φ. This reaffirms our previous claim regarding p(θ | φ) being the upper bound on how much one could learn about θ. All of this implies we must be particularly wary of improper priors, since impropriety may break the connection between θ and φ, and thereby disrupt the Bayesian learning process. Poirier (1998) gives an example of the complications arising under improper priors for the basic model mentioned above. We will explore the impact of improper priors on hierarchical models by taking limits of our learning measures as certain precision parameters tend to 0.

3 Illustration and simulation in the normal case

Here we reconsider the normal hierarchical model introduced in Subsection 2.1. Replacing θ with η ≡ θ + β in the parametrization, this model becomes Y | η ~ N(η, σ²), η | β ~ N(β + μ_θ, τ²), and β ~ N(μ_β, ω²). In this section, we begin by describing an algorithm for computing our Bayesian learning measures in this setting, as well as a simulation study for evaluating their performance. We then see how the values of the various hyperparameters affect the amount we can learn.


3.1 Simulation and computation

In order to study Bayesian learning in this model, we use a small simulation study. First, we simulate the data following the EB approach described in Subsection 2.1 above. That is, we first sample β_b values from its N(μ_β, ω²) hyperprior, followed by iid values η_b from p(η | β_b) = N(β_b + μ_θ, τ²), both for b = 1, …, B (we set B = 500). Finally, we draw Y_b iid from p(y | η_b) = N(η_b, σ²).

Given each simulated replicate b, we must use MCMC to get numerical results for this model, as described in Subsection 2.2 above. Specifically, for each b we first run a β-only sampler, using the prior information for β and the η_b as our "data," leading to conditional prior samples β_b^(g) from β | η_b, g = 1, …, G (we took G = 1000). We then run an ordinary β-η Gibbs sampler given the data y_b, obtaining posterior samples β_b^(g) and η_b^(g) from β | y_b and η | y_b, respectively, again for g = 1, …, G = 1000.

Given these samples, combined with the closed forms for the distributions of β | η and β | y, we can calculate the required precision and KL divergence learning measures either analytically or numerically. For example, for each b, there are G conditional prior and G posterior β samples. The sample variances of these draws can be used to estimate the precisions of β | η_b and β | y_b, and thus we can estimate Δ_{η_b} and Δ_{η_b, y_b}, and finally Δ̄ = (1/B) Σ_{b=1}^B Δ_{η_b} and Δ̄_y = (1/B) Σ_{b=1}^B Δ_{η_b, y_b}.

Table 1 compares the estimated Δ̄ and Δ̄_y with their corresponding true analytic results. The table indicates that the estimated values and analytic results match very well. As a result, using the sample variance to estimate precision learning seems a very good and simple approach that should prove useful when we lack a closed form for the precision.


σ²     τ²     Δ̄_y (RLD)     D̄_y (posterior KL divergence)     Δ̄ (PLD)
              MC    true     analytic  CMDE   crude             MC    true
1      1      0.49  0.50     0.36      0.36   0.36              1.00  1.00
1      10     0.01  0.01     0.05      0.05   0.05              0.10  0.10
1      100    0.00  0.00     0.00      0.00   0.01              0.01  0.01
1      1000   0.00  0.00     0.00      0.00   0.00              0.00  0.00
10     1      0.90  0.91     1.16      1.24   1.18              1.00  1.00
10     10     0.05  0.05     0.31      0.31   0.32              0.10  0.10
10     100    0.00  0.00     0.03      0.03   0.03              0.01  0.01
10     1000   0.00  0.00     0.00      0.00   0.00              0.00  0.00
100    1      0.98  0.99     1.92      2.39   2.07              1.00  1.00
100    10     0.09  0.09     0.85      0.86   0.86              0.10  0.10
100    100    0.01  0.01     0.15      0.15   0.15              0.01  0.01
100    1000   0.00  0.00     0.01      0.01   0.01              0.00  0.00
1000   1      0.99  1.00     2.17      3.16   2.46              1.00  1.00
1000   10     0.10  0.10     1.08      1.09   1.09              0.10  0.10
1000   100    0.01  0.01     0.28      0.28   0.28              0.01  0.01
1000   1000   0.00  0.00     0.02      0.02   0.03              0.00  0.00

Table 1: Accuracy of Monte Carlo (MC) methods in estimating precision learning and posterior KL divergence learning for β when ω² = 100.

To get the prior KL learning divergence D, we again use Monte Carlo integration, obtaining



D = KL(p(β | η), p(β)) = ∫_{−∞}^{∞} p(β | η) log [p(β | η) / p(β)] dβ
  ≈ (1/B) Σ_{b=1}^B (1/G) Σ_{g=1}^G [log p(β_b^(g) | η_b) − log p(β_b^(g))].   (6)
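The inner Monte Carlo average in (6) can be checked against the closed-form normal KL divergence for a single replicate. The sketch below (with hypothetical hyperparameter values) draws from p(β | η_b) and averages the log-density differences:

```python
import random, math

# Monte Carlo estimate of the prior learning divergence for one replicate b,
# as in the inner sum of equation (6), checked against the closed-form
# normal-normal KL divergence. All numbers are hypothetical.
random.seed(7)
mu_theta, mu_beta = 0.0, 0.0
tau2, omega2 = 10.0, 100.0
eta_b = 5.0

m_c = (tau2 * mu_beta + omega2 * (eta_b - mu_theta)) / (tau2 + omega2)  # E(beta|eta_b)
v_c = tau2 * omega2 / (tau2 + omega2)                                   # Var(beta|eta_b)

def log_norm(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

G = 20000
draws = [random.gauss(m_c, math.sqrt(v_c)) for _ in range(G)]
D_mc = sum(log_norm(b, m_c, v_c) - log_norm(b, mu_beta, omega2) for b in draws) / G

# closed form: KL(N(m_c, v_c) || N(mu_beta, omega2))
D_exact = 0.5 * (math.log(omega2 / v_c) + (v_c + (m_c - mu_beta) ** 2) / omega2 - 1.0)
print(round(D_mc, 3), round(D_exact, 3))
```

With G this large the two values agree to roughly two decimal places, which is the kind of accuracy the "analytic" column of Table 1 relies on.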

Similarly, the posterior learning divergence D_y is estimable as

D_y ≈ (1/B) Σ_{b=1}^B (1/G) Σ_{g=1}^G [log p(β_b^(g) | η_b) − log p(β_b^(g) | y_b)].   (7)

Since we actually know the density functions p(β), p(β | η), and p(β | y) explicitly in this case, we may simply evaluate the appropriate density at each β_b^(g) and use expressions (6) and (7) to obtain the necessary divergences. (Below, we refer to this as the "analytic" approach, since it uses a closed form for all density functions.) Were the conditional prior or posterior densities unavailable, we would estimate them from the corresponding conditional prior and posterior β samples; see Section 4 below.

[Figure 1: left panel, precision learning measures Δ (PLD and RLD) versus 1/σ²; right panel, prior and posterior KL divergences D versus 1/σ².]
Figure 1: The effects of σ² on precision (left) and KL divergence (right) learning for ω² = 1, τ² = 10.

3.2 Impacts of the parameters on learning

We now consider plots of the prior and posterior precisions and KL divergences in this example for various values of ω², τ², and σ², which the reader will recall are the model variances of β, η, and y, respectively. First, Figure 1 illustrates the effect of 1/σ² on the precisions and learning divergences when ω² = 1 and τ² = 10. We see that PLD is always greater than RLD, so PLD offers an upper bound on the precision in Bayesian learning. In particular, when the precision of the data equals 0 (i.e., there is no data available), RLD equals PLD, which means no actual learning. Second, as the precision of the data increases, there is no impact on PLD, but RLD decreases rapidly, so that the actual learning for precision increases towards its upper bound. Turning to KL divergences, we now see no simple relationship between prior and posterior KL divergence; neither is a bound for the other. However, when the precision of the data is very small, the posterior divergence will be greater than the prior divergence, while when the precision of the data is very high, the posterior divergence will be less than the prior divergence. More specifically, when the variance of the data is close to 0 (i.e., "perfect" data), the learning reaches its upper bound for the whole distribution, consistent with asymptotic theory. Thus we again see that good data help us learn about the unidentifiable parameter, but bad data may mislead us in this regard.

[Figure 2: precision learning measures Δ (PLD and RLD) versus 1/τ² (left panel) and versus 1/ω² (right panel).]
Figure 2: The effects of 1/τ² (left) and 1/ω² (right) on precision learning for σ² = 10.

Figure 2 illustrates the effect of 1/τ² and 1/ω² on precision learning when σ² is fixed at 10. As the precision of the identifiable parameter η (1/τ²) increases, both PLD and RLD increase; conversely, when we put an improper prior on η (1/τ² = 0), both the potential and remaining learning differences are 0. By contrast, the precision of the unidentifiable parameter β (1/ω²) has no impact on precision learning.

[Figure 3: prior KL divergence D (left panel) and posterior KL divergence D (right panel) versus 1/τ², for ω² ∈ {1, 10, 100, 1000}.]
Figure 3: The effects of 1/τ² on prior (left) and posterior (right) KL divergence learning for σ² = 10.

Figure 3 plots prior and posterior divergences versus the precision of the identified parameter η (1/τ²) for σ² = 10 and various values of ω². In general, the prior divergence is greater than the posterior divergence, though sometimes not by much. As the precision of the identifiable parameter η increases, both prior and posterior divergence increase; in particular, when we put an improper prior on η (1/τ² = 0), the divergences are very close to 0. Since our "gold standard" p(β | η) is already very imprecise in this case, anything we learned from y apparently puts us near the upper bound for learning. Finally, the precision of the unidentifiable parameter β (1/ω²) does affect the rate of increase in the divergences, with quicker increases associated with higher prior precisions. The rate of increase in the posterior divergence is also lowest when ω² = 1000.

Finally, Figure 4 plots prior and posterior divergences versus the precision of the unidentified parameter β (1/ω²) for σ² = 10 and various values of τ². In general, the prior divergence is greater than the posterior divergence, but there is no simple relationship between divergence and 1/ω², since it also depends on the precision of the identifiable parameter. The prior and posterior divergences typically increase with 1/ω², but when the precision of η is very high, the situation is reversed.

4 Methods for density estimation

Up to this point, we have used the available closed form for the posterior density p(β | y) when calculating the posterior divergence (7). In more general cases, however, we will not have such

[Figure 4: prior KL divergence D (left panel) and posterior KL divergence D (right panel) versus 1/ω², for τ² ∈ {1, 10, 100, 1000}.]
Figure 4: The effects of 1/ω² on prior (left) and posterior (right) KL divergence learning for σ² = 10.

a form, and so will need to estimate this density. In this subsection we consider three different methods for density function estimation, comparing them to each other and to results obtained using the analytic form to see which might be best for use in more advanced model settings. Recall that for each simulated data replicate b, we have G posterior samples from β | y (or η | y). The first and most obvious solution would be to use a kernel density estimate,

p̂_kernel(β | y) = (1 / (G h_G)) Σ_{g=1}^G K( (β − β^(g)) / h_G ),

where K(·) is a kernel density (typically a normal; see e.g. Silverman, 1986), and h_G is a suitably chosen window width. The second method we consider is the conditional marginal density estimator (CMDE),

p̂_CMDE(β | y) = (1/G) Σ_{g=1}^G p(β | η^(g), y).

This estimate was originally suggested for use in Gibbs sampling contexts by Gelfand and Smith (1990), who referred to it as the "Rao-Blackwellized density estimate," apparently since its "conditioning, then unconditioning" form reminded them of the celebrated Rao-Blackwell Theorem. The

[Figure 5: four panels, one per simulated y value, each overlaying the Analytic, CMDE, Kernel, and Crude density estimates.]
Figure 5: Density estimation comparison for different y values for σ² = τ² = 10 and ω² = 100.

third and final method we will use is a "crude" method that simply takes as our density estimate a normal distribution having mean and variance equal to the sample mean and variance of the β^(g) posterior samples. Since we know that p(β | y) actually is a normal distribution in the examples we have considered thus far, this crude method may actually perform acceptably here.

Figure 5 compares density estimates from the kernel, CMDE, and crude methods with the analytic density function for four values of y (1.6, 3.3, 3.0 and 24.5, values randomly generated assuming σ² = τ² = 10 and ω² = 100). From these plots we can see that both the CMDE and crude methods work rather well, while the kernel method is not as accurate. This is not surprising, since it is essentially just a "smoothed histogram" that does not take advantage of the normality of the marginal or conditional densities, as the other methods do. Continuing on with the CMDE and crude methods then, Table 1 shows the estimated posterior KL divergences D_y for a variety of values of σ² and τ² when ω² = 100. That is, the "analytic," "CMDE," and "crude" results in the table use p(β_b^(g) | y), p̂_CMDE(β_b^(g) | y), and p̂_crude(β_b^(g) | y), respectively, in equation (7). Agreement is good in general, except for some of the larger divergences, where the crude method seems to outperform the CMDE. Again, this seems likely due to the actual normality of p(β | y) in this case; for nonnormal true posterior densities, the CMDE should be superior. As a final remark, we note that p(β | η) (and perhaps even p(β)) may also be unavailable in many advanced model settings, and therefore also require some sort of density estimate in order to evaluate (6) and (7).
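The kernel and crude estimators can be compared directly on draws from a known normal, mimicking the Figure 5 comparison (all numbers hypothetical; the CMDE is omitted here since it requires conditional draws):

```python
import random, math

# Sketch of the "crude" and kernel density estimates of this section,
# applied to G draws from a known normal that stands in for p(beta|y).
# The crude method fits a normal to the sample mean and variance; the
# kernel method smooths the draws directly.
random.seed(3)
true_m, true_v = 2.0, 4.0
G = 2000
draws = [random.gauss(true_m, math.sqrt(true_v)) for _ in range(G)]

mean = sum(draws) / G
var = sum((x - mean) ** 2 for x in draws) / (G - 1)

def norm_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def crude(x):                     # normal with matched moments
    return norm_pdf(x, mean, var)

h = 1.06 * math.sqrt(var) * G ** (-0.2)   # rule-of-thumb window width
def kernel(x):                    # normal-kernel density estimate
    return sum(norm_pdf((x - d) / h, 0.0, 1.0) for d in draws) / (G * h)

grid = [true_m + k * 0.5 for k in range(-8, 9)]
err_crude = max(abs(crude(x) - norm_pdf(x, true_m, true_v)) for x in grid)
err_kernel = max(abs(kernel(x) - norm_pdf(x, true_m, true_v)) for x in grid)
print(round(err_crude, 4), round(err_kernel, 4))
```

When the target density really is normal, the moment-matched crude fit typically tracks it at least as closely as the kernel estimate, echoing the Figure 5 comparison.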

5 Spatial hierarchical models

In this section, we turn our attention to random effects models for areal (regional summary) data. These models are of particular interest in geological, environmental, and other applied fields, and typically feature one or more random effects for each area. Specifically, for Gaussian responses Y_i we might work with the model

Y_i | θ_i, φ_i, σ² ~ N(θ_i + φ_i, σ²) independently, i = 1, …, n,
θ_i | μ, σ_h² ~ N(μ, σ_h²) iid, i = 1, …, n,   (8)
and φ | σ_c² ~ CAR(ρ, σ_c²).

Here, the θ_i random effects capture region-wide heterogeneity (using scale parameter σ_h²), while the φ_i capture local clustering (using scale parameter σ_c²) according to a conditionally autoregressive (CAR) distribution for φ = (φ_1, …, φ_n)′ (Besag, 1974; Banerjee et al., 2004, Sec. 3.3) having joint density proportional to exp{ −(1 / (2σ_c²)) φ′(D_w − ρW)φ }. Here, W is the proximity (or adjacency) matrix, which is usually defined as having w_ij = 1 if and only if areal units i and j are neighbors (i.e., are spatially adjacent), and 0 otherwise (w_ii is customarily set equal to 0). D_w is diagonal with (D_w)_ii = w_i+ = Σ_j w_ij, the number of neighbors of unit i. Finally, ρ is a spatial association parameter chosen to make the precision matrix Σ_φ⁻¹ ∝ D_w − ρW nonsingular.

Under these assumptions, one can show the full conditional distribution of φ_i given {φ_{j≠i}} is N( ρ Σ_j w_ij φ_j / w_i+, σ_c² / w_i+ ). Under our 0-1 adjacency structure, the mean of this distribution is ρ times φ̄_i, the average of the neighboring {φ_{j≠i}}. As a result, in practice ρ is often set equal to 1, resulting in an intrinsically autoregressive (IAR) specification. While intuitively appealing, this renders the distribution improper, since (D_w − W)1 = 0, and thus Σ_φ is singular. Since the CAR is being used to model spatial random effects (not observed data), this is often not seen by Bayesians as a limitation, since the posterior of φ will typically still be proper. However, in order to identify the grand intercept term μ above, the constraint Σ_i φ_i = 0 is often added when ρ = 1. When computing via MCMC, this constraint is typically imposed numerically by recentering the φ_i^(g) MCMC draws at the end of every complete iteration of the algorithm.

In this paper we use the proper CAR model, replacing the adjacency matrix W by the scaled adjacency matrix W̃ ≡ Diag(1/w_i+)W. This ensures that Σ_φ⁻¹ can be written as M⁻¹(I − αW̃), where M is diagonal and Σ_φ can be shown to exist provided the new propriety parameter α lies in (−1, 1) (Banerjee et al., 2004, p.81). More specifically, we used the car.proper command in WinBUGS and set α = 0.9. But while the priors for θ and φ are both proper, a significant nonidentifiability problem still arises since only the sum η_i = θ_i + φ_i is estimable from each unit's (single) data point Y_i. We thus apply the approach of this paper using θ as our unidentified variable, and η as our identified variable. Our goal is to measure the potential and remaining learning for θ. As a first step, we replace φ with η in the parametrization, recasting model (8) as Y_i | η_i ~ N(η_i, σ²) independently, η | θ ~ θ + CAR(α, σ_c²), and θ_i ~ N(μ, σ_h²) iid, i = 1, …, n. In order to calculate the desired learning differences and KL divergences, we need to know the distributions of θ_i | η and θ_i | y. Although we may obtain closed form expressions for η_i | y and θ_i | y using standard approaches (Lindley and Smith, 1972; Carlin and Louis, 2000, p.34), they will involve difficult matrix inversion; a closed form for θ_i | y also requires integration. Fortunately, however, since it is clear that both θ_i | y and θ_i | η are normally distributed, we can use the "crude method" described above to estimate the required densities given only point estimates of their means and variances.
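The singularity at ρ = 1 and the propriety of the scaled specification can be illustrated on a toy adjacency structure (a hypothetical 4-region line graph; a pure-Python determinant suffices at this size):

```python
# Small illustration of the CAR precision structure on a hypothetical
# 4-region line graph: Dw - rho*W is singular at rho = 1 (the IAR case),
# while the scaled factor I - alpha*W_tilde of the proper CAR is
# nonsingular for |alpha| < 1 (alpha = 0.9 below, as in the text).

def det(M):                      # Laplace expansion; fine for tiny matrices
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += ((-1) ** j) * M[0][j] * det(minor)
    return total

n = 4
W = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]  # line graph
w_plus = [sum(W[i]) for i in range(n)]                 # neighbor counts w_i+
Dw = [[w_plus[i] if i == j else 0 for j in range(n)] for i in range(n)]

def car_precision(rho):          # Dw - rho*W (up to the 1/sigma_c^2 factor)
    return [[Dw[i][j] - rho * W[i][j] for j in range(n)] for i in range(n)]

def proper_car_factor(alpha):    # I - alpha * Diag(1/w_i+) W
    return [[(1.0 if i == j else 0.0) - alpha * W[i][j] / w_plus[i]
             for j in range(n)] for i in range(n)]

print(round(det(car_precision(1.0)), 10),          # singular at rho = 1
      round(det(proper_car_factor(0.9)), 6) != 0)  # proper CAR: invertible
```

The zero determinant at ρ = 1 is exactly the (D_w − W)1 = 0 impropriety noted above, while the scaled version remains invertible.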

Using standard Gibbs sampling, we can obtain posterior samples θ_i^{(g)} | y and η_i^{(g)} | y, where g = 1, ..., G = 1000 and again i = 1, ..., n. Next we need conditional prior samples θ_i^{(g)} | η. Unlike in the case of a simulation study (where we can simply set η equal to the true value η_0), in the case of real data we must replace η_0 by an estimate; in what follows we use the posterior mean η̂ ≡ E(η | y). Thus to get the necessary samples θ_i^{(g)} | η̂, we use η̂ = (η̂_1, ..., η̂_n)′ as the input data, deleting the original data y. Armed with samples {θ_i^{(g)} | y} and {θ_i^{(g)} | η̂}, we can use their

sample means and variances to estimate densities using the crude method. This in turn leads us to estimates of the various variance- and KL-based learning measures described earlier for all the unidentified parameters θ_i, as well as average measures of learning across all areas i.

Another parameter of particular interest to spatial epidemiologists (see e.g. Best et al., 1999) arising from this model is ψ ≡ sd(φ)/[sd(φ) + sd(θ)], where sd denotes the usual sample standard deviation. This parametric function of the random effects is also unidentified, but for a given data set and hyperprior parameter choices it provides information about the proportion of excess variability in the data due to spatial clustering (while 1 − ψ represents the proportion due to unstructured heterogeneity). Since we may easily sample from θ, φ | y and θ, φ | η, we can in turn obtain both posterior and conditional prior samples of ψ, and hence easily estimate Var(ψ), Var(ψ | η), and Var(ψ | y), and finally the potential and remaining learning differences and the remaining learning ratio. However, since ψ's true distribution is clearly not normal, its density is not likely to be well estimated by crude methods, clouding computation of the KL-based measures in this case.
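Given MCMC draws of (θ, φ) under both the posterior and the conditional prior, ψ and the variance-based measures can be computed directly from the samples. The sketch below is ours: it assumes the definitions potential = Var(prior) − Var(conditional prior), remaining = Var(posterior) − Var(conditional prior), and RLR = remaining/potential, which is our reading of the measures defined earlier in the paper; function names are illustrative.

```python
import numpy as np

def psi_draws(theta, phi):
    """psi = sd(phi) / (sd(phi) + sd(theta)), one value per MCMC draw.
    theta, phi: arrays of shape (G, n) -- G draws of n random effects."""
    s_t = theta.std(axis=1, ddof=1)
    s_p = phi.std(axis=1, ddof=1)
    return s_p / (s_p + s_t)

def learning_measures(var_prior, var_cond_prior, var_post):
    """Variance-based learning measures under our assumed definitions:
    potential = Var(prior) - Var(conditional prior),
    remaining = Var(posterior) - Var(conditional prior),
    RLR = remaining / potential (1 = nothing learned yet, 0 = all learned)."""
    potential = var_prior - var_cond_prior
    remaining = var_post - var_cond_prior
    return potential, remaining, remaining / potential
```

Applying psi_draws to the posterior draws {θ^{(g)}, φ^{(g)} | y} and to the conditional prior draws {θ^{(g)}, φ^{(g)} | η̂}, and then taking sample variances of the resulting ψ draws, supplies the inputs to learning_measures. Note that this convention behaves as described in the text: RLR equals 1 when the posterior variance matches the prior variance (no learning yet) and approaches 0 as the posterior variance shrinks to the conditional prior variance.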


6 Data example

We now apply our methods to a challenging periodontal data set originally analyzed by Reich et al. (2004). The data feature complex spatial correlations, and as such are appropriate for the Gaussian spatial model described in Section 5. Our goal is to determine whether our measures can indicate the potential learning about various parameters of interest that remains given the model, prior, and data collected so far, and thus whether the collection of additional data is warranted.

The Y_i values pictured as individual symbols in Figure 6 are measurements of attachment loss (AL), which is the extent of a tooth's root (in mm) that is no longer attached to the surrounding bone by periodontal ligament. Attachment loss is usually measured at six sites on each tooth, three on the buccal (cheek) side and three on the lingual (tongue) side. Thus a person with the full complement of 28 teeth has 168 measurements, with sites 1-42 and 43-84 being the two long strips on the buccal and lingual sides of the maxillary (upper) jaw, and sites 85-126 and 127-168 being the two long strips on the buccal and lingual sides of the mandibular (lower) jaw. Figure 6 also includes the posterior means of the η_i assuming σ² = 1 and τ_c² = τ_h² = 0.1; the "holes" in the first two lines of the graph arise because this subject is missing a tooth on the left side of his upper jaw.

AL measurements are typically correlated at nearby sites within the mouth, suggesting that the "iid plus spatial CAR" approach of the previous section may be a natural probability model for the true underlying AL values η_i that in turn give rise to the Y_i observations. In choosing the proximity matrix W we follow Reich et al. (2004): for a given jaw (upper or lower), we adopt a 0-1 proximity matrix W that counts sites as neighboring if they are adjacent and on the same (buccal or lingual) side of the jaw, or if they are on opposite sides of the jaw but are adjacent to a common "flossing gap" between non-missing teeth. Figure 7 provides a diagram of the situation for the first three teeth in the upper jaw.
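The neighbor rules above can be encoded as a small routine. The following is a simplified reconstruction using our own indexing convention (three sites per tooth side, no missing teeth, one jaw at a time); the exact site numbering and pairing at flossing gaps in Reich et al. (2004) may differ.

```python
import numpy as np

def periodontal_W(n_teeth):
    """0-1 proximity matrix for one jaw under a simplified reading of the
    neighbor rules: buccal sites 0..3T-1 and lingual sites 3T..6T-1, three
    per tooth. Sites are neighbors if (a) consecutive on the same side, or
    (b) on opposite sides flanking the same flossing gap between teeth."""
    n = 6 * n_teeth
    W = np.zeros((n, n), dtype=int)
    buc = np.arange(3 * n_teeth)           # positions along the buccal strip
    lin = buc + 3 * n_teeth                # matching positions, lingual strip
    for s in range(3 * n_teeth - 1):       # (a) same-side adjacency
        for strip in (buc, lin):
            W[strip[s], strip[s + 1]] = W[strip[s + 1], strip[s]] = 1
    for t in range(n_teeth - 1):           # (b) cross-side pairs at each gap
        for s in (3 * t + 2, 3 * t + 3):   # last site of tooth t, first of t+1
            W[buc[s], lin[s]] = W[lin[s], buc[s]] = 1
    return W
```

For n_teeth = 3 this yields a symmetric, zero-diagonal 18 × 18 matrix with 20 neighbor pairs, roughly matching the layout sketched in Figure 7.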

Figure 6: Raw attachment loss data Y_i (plotting characters) and posterior means of η_i (solid lines). [Four panels -- maxillary/lingual, mandibular/lingual, maxillary/buccal, mandibular/buccal -- plot attachment loss (mm) against tooth number.]

Figure 7: Illustration of neighbor pairs in the periodontal grid for the first three teeth in the upper jaw. Bold-face squares represent teeth, small circles represent sites where attachment loss is measured, numerals index measurement sites, and neighbor pairs (i.e., pairs having w_ij = 1) are connected by lines.

Using the model and computational methods described in Section 5, we obtained estimates of our Bayesian learning measures for θ_i, i = 1, ..., n = 168, for this dataset. Specifically, Figure 8 shows the trends in the remaining learning ratio, prior KL divergence, and posterior KL divergence (all averaged over all 168 sites) as one of the variance parameters (σ², τ_c², or τ_h²) decreases while the other two remain fixed.

Figure 8: Remaining learning ratio (RLR), prior KL divergence, and posterior KL divergence plotted versus precision parameters. Solid lines: effect of changing 1/σ² when τ_c² = τ_h² = 1; dashed lines: effect of changing 1/τ_c² when σ² = τ_h² = 1; dotted lines: effect of changing 1/τ_h² when σ² = τ_c² = 1.

For example, the solid lines in Figure 8 show that as 1/σ² increases, both the RLR and the posterior divergence decrease, which seems sensible: as the precision of the observed data increases, so does the actual learning from these data. In particular, when the precision of the data is close to 0, the remaining learning ratio is 1, meaning no actual learning to date. But even when the precision is relatively high (equal to 1), RLR is still around 0.7, implying that much of what could be learned about θ remains unlearned. These results are consistent with our previous simulation study, although the RLR appears to decrease a bit more slowly here, possibly the result of "dilution" of information inherent in our replacement of the true

η value in the simulation by its posterior mean here.

The dashed lines in Figure 8 indicate that as the precision of the φ_i (1/τ_c²) increases, all three learning measures increase. Apparently increasing the CAR prior precision also increases the proportion of what we could ever learn about the θ_i given the data so far. In particular, a nearly flat prior for φ produces RLR, prior KL divergence, and posterior KL divergence all close to 0, implying that the data have already told us all they can about θ_i if we insist upon starting with no information about φ. Finally, the dotted lines in Figure 8 indicate that as the precision of the

θ_i (1/τ_h²) increases, RLR remains flat while the prior and posterior KL divergences decrease. Even under the nearly flat prior for θ, RLR is not that close to 1, suggesting that the data are slightly informative about the θ_i even when these parameters are assigned a flat prior.

σ²     τ_c²   τ_h²   RLR    Prior KL   Posterior KL
0.001  0.001  0.001  0.73   163.427    0.105
0.001  0.001  1      0.65   4.453      0.275
0.001  1      0.001  0.53   0.011      0.002
0.001  1      1      0.04   0.720      0.011
1      0.001  0.001  1.00   0.258      0.253
1      0.001  1      1.00   3.405      2.331
1      1      0.001  0.86   0.001      0.002
1      1      1      0.73   0.420      0.105
0.57   2.61   0.45   0.46   0.140      0.010

Table 2: The effects of parameters σ², τ_c², and τ_h² on precision learning and KL divergence for θ.

In order to see a broader range of Bayesian learning measure values, we repeated our calculations with some smaller values of the variance parameters (i.e., very informative priors or data). The results, shown in Table 2, reveal that when the data are very precise (σ² = 0.001), RLR is in fact much smaller, as expected. Perhaps less intuitively (though consistent with Figure 8), we also appear to learn more about θ when the priors for θ and φ are more diffuse (i.e., larger τ_c² and τ_h²). The last line in the table sets σ², τ_c², and τ_h² equal to their posterior means for this dataset under relatively vague IG(0.1, 0.1) hyperpriors, where IG denotes the inverse (reciprocal) gamma distribution. These values produce a moderate RLR (0.46), suggesting that for the most realistic values of the variance parameters, there is much left to learn about the θ_i given our data.

Finally, Table 3 summarizes RLR for the "spatial proportion of excess heterogeneity" parameter

ψ. Numerical instability prevents us from reporting results in the cases where τ_c² = 1 and τ_h² = 10; however, we do add the "most likely" arrangement (posterior means under the IG priors) as the last line, as in Table 2.

σ²    τ_c²  τ_h²  RLR
1     1     1     0.90
1     10    1     0.57
1     10    10    0.22
10    1     1     1.00
10    10    1     0.86
10    10    10    0.72
0.57  2.61  0.45  0.62

Table 3: The effects of parameters σ², τ_c², and τ_h² on the remaining learning ratio (RLR) for ψ.

The results indicate that when the precision of the data is higher, RLR is lower; in particular, when σ² = 1 but τ_c² = τ_h² = 10, this dataset has already told us 78% of the most we could possibly learn about ψ (only 22% remains unlearned). Also, with σ² fixed, τ_c² large relative to τ_h² produces slightly larger values of RLR. This is also intuitively sensible, since larger τ_c² means more uncertainty about the spatial component of the excess heterogeneity that dominates ψ, and hence a larger proportion of Bayesian learning remains to be done. Overall, the most reliable single summary would seem to come from the table's last line, which indicates that 62% of the learning about ψ remains to be done. Incidentally, this value is supported by the calibration studies cited by Reich et al. (2004), which in turn suggest 0.75 < ψ < 1.

7 Discussion

In this paper, we investigated the upper limit of Bayesian learning for unidentifiable parameters in the context of several hierarchical models. We defined the potential and remaining learning differences to measure precision-based learning. Analogous concepts using prior and posterior Kullback-Leibler (KL) divergence differences from their conditional prior values were also suggested and investigated. Our measures seemed to perform well, both in simulations with ordinary Gaussian models and in a rather advanced, deliberately overparametrized spatial model and dataset, where Bayesian learning is of interest but notoriously difficult to measure.

While our approach to learning is decidedly Bayesian (prior to posterior), both frequentist

and empirical Bayesian (preposterior) outlooks can be adopted when simulating the performance of our measures, since both conditional and unconditional performance may be of interest.

Since our methods often rely on MCMC to estimate weakly identified parameters, the usual concerns about MCMC convergence are particularly important here. In particular, when the unidentified parameter was θ_i in Section 6 (Table 2), we had more success with a hierarchically centered parametrization (Gelfand, Sahu, and Carlin, 1995), whereas when it was ψ (Table 3), the original, uncentered parametrization fared better. Also, since our KL divergence measures may involve densities for which no closed forms exist, we required a Monte Carlo-based scheme for their estimation. Our results suggest a crude moment-matching method can perform well; the posterior for ψ in Section 6, despite being bounded by 0 and 1, exhibits a surprising degree of normality. The CMDE ("Rao-Blackwellized") method offers a potential improvement when this is not the case, but this approach still assumes the availability of certain full conditional distributions. In settings where even these are unavailable, the importance weighted marginal density estimation (IWMDE) approach of Chen and Shao (1999) may prove useful.

Future work in this area looks toward extending our results to models with nonnormal likelihoods, such as Poisson and binomial regression. Such models are especially common in spatial analyses, where the data arise as counts (say, of a particular disease) over each geographic region i, and multiple sets of weakly identified random effects of the sort considered in Section 6 are the rule. A typical model of this sort (see e.g. Banerjee et al., 2004, Sec. 5.4) might take the form

Y_i | θ_i, φ_i ~ind Pois(E_i e^{θ_i + φ_i}),  i = 1, ..., n,

with the remainder of the specification the same as that of Section 5. Here, the E_i are the expected counts in each region, often computed via internal standardization as E_i = n_i r̄, where n_i is the total number of individuals at risk in region i, and r̄ = Σ_i Y_i / Σ_i n_i is the overall disease rate. The spatial binomial model Y_i | θ_i, φ_i ~ind Bin(n_i, p_i) with logit(p_i) = θ_i + φ_i arises similarly, and raises similar questions about how much one can learn about the unidentified parameters θ and φ.

Such methods should help spatial epidemiologists and others fitting complex high-dimensional models to determine whether worthwhile Bayesian learning is possible in their settings, and whether additional data collection would be worthwhile. Our approach may also be helpful in guiding prior and hyperparameter value selection in such models.
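The internal standardization step is simple to carry out in practice; the following sketch (illustrative, not from the paper) computes the expected counts E_i = n_i r̄:

```python
import numpy as np

def expected_counts(y, n_at_risk):
    """Internal standardization: E_i = n_i * r, where r = sum(Y_i) / sum(n_i)
    is the overall disease rate across all regions."""
    r = y.sum() / n_at_risk.sum()
    return n_at_risk * r

y = np.array([3, 5, 2, 10])           # observed counts by region
n = np.array([100, 200, 50, 400])     # population at risk by region
print(expected_counts(y, n))          # overall rate r = 20/750
```

A useful check on this construction is that the expected counts sum to the observed total, Σ_i E_i = Σ_i Y_i, so the model attributes all departures from the overall rate to the random effects.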

References

[1] Banerjee, S., Carlin, B.P. and Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman and Hall/CRC Press.

[2] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc., Ser. B, 36, 192-236.

[3] Best, N.G., Waller, L.A., Thomas, A., Conlon, E.M. and Arnold, R.A. (1999). Bayesian models for spatially correlated diseases and exposure data (with discussion). In Bayesian Statistics 6, eds. J.M. Bernardo et al. Oxford: Oxford University Press, pp. 131-156.

[4] Carlin, B.P. and Louis, T.A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. Boca Raton, FL: Chapman and Hall/CRC Press.

[5] Chen, M.-H. and Shao, Q.-M. (1999). Monte Carlo estimation of Bayesian credible and HPD intervals. J. Comp. Graph. Statist., 8, 69-92.

[6] Dawid, A.P. (1979). Conditional independence in statistical theory. J. Roy. Statist. Soc., Ser. B, 41, 1-31.

[7] Eberly, L.E. and Carlin, B.P. (2000). Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Statistics in Medicine, 19, 2279-2294.

[8] Gelfand, A.E. and Sahu, S.K. (1999). Identifiability, improper priors, and Gibbs sampling for generalized linear models. J. Amer. Statist. Assoc., 94, 247-253.

[9] Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1995). Efficient parametrizations for normal linear mixed models. Biometrika, 82, 479-488.

[10] Gelfand, A.E. and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85, 398-409.

[11] Kadane, J.B. (1974). The role of identification in Bayesian theory. In Studies in Bayesian Econometrics and Statistics, pp. 175-191.

[12] Lindley, D.V. (1971). Bayesian Statistics: A Review. Philadelphia, PA: SIAM.

[13] Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion). J. Roy. Statist. Soc., Ser. B, 34, 1-41.

[14] Poirier, D.J. (1998). Revising beliefs in nonidentified models. Econometric Theory, 14, 483-509.

[15] Reich, B.J., Hodges, J.S., and Carlin, B.P. (2004). Spatial analyses of periodontal data using conditionally autoregressive priors having two types of neighbor relations. Research Report 2004-004, Division of Biostatistics, University of Minnesota.

[16] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.