Computational Statistics & Data Analysis 51 (2007) 6367 – 6379 www.elsevier.com/locate/csda
The Bayes factor for inequality and about equality constrained models

Irene Klugkist*, Herbert Hoijtink
Department of Methodology and Statistics, University of Utrecht, P.O. Box 80140, 3508 TC Utrecht, The Netherlands

Received 30 January 2006; received in revised form 29 January 2007; accepted 29 January 2007. Available online 16 February 2007.
Abstract

The Bayes factor is a useful tool for evaluating sets of inequality and about equality constrained models. In the approach described, the Bayes factor for a constrained model versus the encompassing model reduces to the ratio of two proportions, namely the proportion of, respectively, the encompassing prior and posterior in agreement with the constraints. This enables easy and straightforward estimation of the Bayes factor and its Monte Carlo error. In this set-up, the issue of sensitivity to model specific prior distributions reduces to sensitivity to one prior distribution, that is, the prior for the encompassing model. It is shown that for specific classes of inequality constrained models, the Bayes factor for the constrained versus the unconstrained model is virtually independent of the encompassing prior, that is, model selection is virtually objective.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Bayesian model selection; Encompassing prior; Objective Bayesian inference; Ordered parameters
1. Introduction

In this paper Bayesian model selection for inequality constrained models is discussed. Consider for example a dose–response study, investigating the effect of four doses of a particular medicine. Assume a normally distributed outcome variable, independence between the participants and a common but unknown variance, i.e. the context of a standard analysis of variance. Specific expected outcomes can then be modeled using inequality constraints among the model parameters. Several hypotheses or constrained models can be formulated and compared, for instance:

M2: μ1 < μ2 < μ3 < μ4; σ²,
M3: μ1 < (μ2, μ3, μ4); σ²,
M4: (μ1 < μ2), (μ3 < μ4); σ²,
M5: μ1 ≈ μ2 ≈ μ3 ≈ μ4; σ².   (1)

Each model is nested in the unconstrained, encompassing model M1: μ1, μ2, μ3, μ4; σ². Note that the constrained models differ in the number and type of constraints. Note also that models M2, M3 and M4 just impose orderings on the means, while M5 restricts the parameters to be about equal. The latter can be formalized as |μi − μi′| < δ, for i ≠ i′ (i, i′ = 1, ..., 4) and a small value of δ.

*Corresponding author. Tel.: +31 30 253 5473; fax: +31 30 253 5797.
E-mail address: [email protected] (I. Klugkist).
doi:10.1016/j.csda.2007.01.024
Estimation and testing of inequality constrained models has received a lot of attention in the frequentist framework (see Barlow et al., 1972; Robertson et al., 1988; Silvapulle and Sen, 2004; Dykstra et al., 2002). In the Bayesian framework inequality constrained models have not yet received much attention. However, Gelfand et al. (1992) showed that inequality constraints can straightforwardly be built into the prior distribution, and recently Moreno (2005) presented a Bayesian method for one-sided testing, i.e. a simple case of inequality constrained models. Furthermore, Klugkist et al. (2005) show how Bayesian model selection can be used to select the best of a set of hypotheses with inequality constraints among the model parameters. The models studied here can be based on about equality constraints, constraints where parameters are constrained to be larger or smaller than a constant, and inequality constraints among parameters. Furthermore, in this paper the sensitivity to prior distributions is thoroughly investigated. Finally, Monte Carlo errors (MCEs) for our estimate of the Bayes factor are derived.

Note that common maximum likelihood-based model selection criteria like the Akaike information criterion (AIC; Akaike, 1987), the corrected AIC (CAIC; Bozdogan, 1987), or the Bayesian information criterion (BIC; Schwarz, 1978) cannot be used in the context of inequality constrained hypotheses, because evaluation of these criteria requires a penalty in terms of the number of parameters of each model. The number of parameters in both M1: μ1, μ2, μ3, μ4 and M2: μ1 < μ2 < μ3 < μ4 is four, but the size or complexity of the models is clearly not the same. Note that the Bayesian model selection criterion, the Bayes factor, also comprises a penalty for model complexity, but it does not require explicit counting of parameters: in the Bayes factor the penalty is implicit and automatic.
A nice feature of (in)equality constrained hypotheses is that they are all nested in the unconstrained, encompassing model. Throughout the paper, the encompassing model will be denoted by M1, and the constrained models by M2 through MT. As a consequence of the nesting, just one prior distribution has to be specified: the prior for the model parameters under the unconstrained model. This prior distribution will be called the encompassing prior. Denoting the model parameters for M1 by θ1, the encompassing prior is denoted by π(θ1|M1). The prior distribution of any nested model Mt (t = 2, ..., T) follows directly from this encompassing prior, using

π(θt|Mt) = π(θt|M1) IMt(θt) / ∫ π(θ1|M1) IMt(θ1) dθ1,   (2)
where the indicator function IMt has the value one if the argument is true, that is, if the parameter values are in accordance with the constraints imposed by model Mt, and zero otherwise. Note that for a model with equality constraints (2) is not defined, but for a model with about equality constraints it is. This will be elaborated in Section 4.

Eq. (2) is related to the conditioning method that can be used to obtain so-called "compatible priors" (see Dawid and Lauritzen, 2001). Our motivation for the use of compatible priors is expressed best by a quote from Leucari and Consonni (2003) and Roverato and Consonni (2004): "If nothing was elicited to indicate that the two priors should be different, then it is sensible to specify [π(θt|Mt)] to be, …, as close as possible to [π(θ1|M1)]. In this way the resulting Bayes factor should be least influenced by dissimilarities between the two priors due to differences in the construction processes, and could thus more faithfully represent the strength of the support that the data lend to each model". Dawid and Lauritzen (2001) show that the conditioning approach is sensitive to the parametrization chosen (the Borel–Kolmogorov paradox). They suggest Jeffreys conditioning to deal with this paradox. This works well if a nested model is obtained by fixing parameters at zero. As can be seen from (2), in this paper nested models are obtained via a restriction of the parameter space by means of inequality constraints. As will be shown in Section 3.2.1, the Borel–Kolmogorov paradox does not apply to the most important classes of models discussed in this paper. Like Bartlett's/Lindley's paradox, it does apply to the models discussed in Sections 3.2.2 and 4. However, because none of the parameters are actually fixed at zero to obtain a nested model, Jeffreys conditioning is not a solution and we will simply use the natural parametrization of the model at hand.
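To make (2) concrete, the constrained prior can be sampled by drawing from the encompassing prior and retaining only the draws that satisfy IMt. The sketch below (assuming, purely for illustration, a N(0, 10²) encompassing prior for four means and an ordering constraint like the one in M2) also shows that the acceptance rate estimates 1/ct, the proportion of the encompassing prior in agreement with the constraints:

```python
import numpy as np

rng = np.random.default_rng(0)

def constrained_prior_sample(n_draws, k, constraint):
    """Sample the constrained prior of Eq. (2) by rejection: draw from the
    encompassing prior and keep only draws satisfying the constraint.
    The N(0, 10^2) encompassing prior is an arbitrary illustrative choice."""
    theta = rng.normal(0.0, 10.0, size=(n_draws, k))
    keep = np.array([constraint(t) for t in theta])
    return theta[keep]

# Ordering constraint theta_1 < theta_2 < theta_3 < theta_4 (cf. M2):
ordered = lambda t: bool(np.all(np.diff(t) > 0))
draws = constrained_prior_sample(100_000, 4, ordered)

# The acceptance rate estimates 1/c_t; for a full ordering of four
# i.i.d. parameters it should be close to 1/4! = .0417.
print(len(draws) / 100_000)
```

All retained draws are, by construction, draws from π(θt|Mt); for about equality constraints the same scheme applies with the indicator |θ1 − θ2| < δ.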
The evaluation of models after observing data y will be based on the Bayes factor for the encompassing model (M1) and each of the constrained models (Mt). The Bayes factor is the ratio of the two corresponding marginal likelihoods:

BFt1 = p(y|Mt) / p(y|M1) = ∫ f(y|θt, Mt) π(θt|Mt) dθt / ∫ f(y|θ1, M1) π(θ1|M1) dθ1,   (3)
for t = 2, ..., T. Note that from the Bayes factors for each of the constrained models (M2, ..., MT) with the encompassing model (M1), the Bayes factor for two of the nested (constrained) models is obtained by

BFtt′ = BFt1 / BFt′1.   (4)
The integrals in (3) are often hard or impossible to compute analytically. Therefore, Bayes factors or marginal likelihoods are often estimated using Monte Carlo methods (e.g. Newton and Raftery, 1994; Gelfand and Dey, 1994; Kass and Raftery, 1995; Chib, 1995; Green, 1995; Carlin and Chib, 1995; Chen and Shao, 1997). General concerns with these methods are computational complexity, convergence, and efficiency (see, for a recent comparative review, Han and Carlin, 2001). However, for the situation at hand (that is, comparison of the encompassing model and a constrained model), a straightforward estimate of the Bayes factor that does not require the computation of marginal likelihoods will be presented and discussed in this paper. In Section 2, it will be shown that the Bayes factor for a nested model and the encompassing model reduces to the ratio of the proportions of, respectively, the encompassing posterior and the encompassing prior in agreement with the constraints of the nested model. In a specific, simple one-parameter problem, the comparison of M1: π ≥ .6 and M2: π < .6, where π denotes a proportion, Carlin and Louis (2000, pp. 45–46) also use this estimate of the Bayes factor, but they give no derivation or frequency properties as we do in Sections 2 and 6 of this paper.

An important issue in any Bayesian analysis is the specification of, and sensitivity to, prior distributions. In (3), it can be seen that the Bayes factor depends not only on the density of the data, f(y|θt, Mt), but also on the choice of the prior distributions for the model parameters of each model, π(θt|Mt). Defining these model specific prior distributions is not straightforward. This has led to a search for 'noninformative' or 'objective' priors.
Using such priors (with slightly different interpretations also denoted as reference, default or conventional priors) works well for estimation problems (as long as the data dominate the prior, the precise form of the latter is unimportant) but is usually problematic in model selection procedures. Improper noninformative priors almost always lead to indeterminate Bayes factors (Berger and Pericchi, 1996). Proper 'noninformative' priors (also called vague or diffuse priors) do not solve this problem: Berger and Pericchi (2001) conclude that using vague priors often results in arbitrary values for Bayes factors. For nested models, however, diffuse or even improper priors can be used. In this paper it will be shown that, using the encompassing approach for equality and inequality constrained models, diffuse priors indeed can be used and, for specific classes of models, provide virtually objective model selection.

In the remainder of the paper, the estimation approach for the Bayes factor of a nested model with the unconstrained model is presented. Furthermore, it will be outlined for which types of models and under which conditions the model selection is virtually objective, that is, not sensitive to the specification of the encompassing prior distribution. For other types of constrained models, objective priors cannot be found. The approach presented in this paper illustrates how the model selection for these models depends on the scale factor of the encompassing prior for the constrained model parameters. In two illustrations the estimation of the Bayes factor and its MCE are presented, followed by an examination of the sensitivity to the encompassing prior.

2. The Bayes factor for a nested and the encompassing model

In this section it will be shown that using encompassing priors renders a nice interpretation of the Bayes factor for a constrained model (Mt) with the encompassing model (M1). The marginal likelihood p(y|Mt), for t = 1, ..., T, is written as

p(y|Mt) = f(y|θt, Mt) π(θt|Mt) / π(θt|y, Mt),   (5)
for all θt ∈ Θt, where Θt denotes the parameter space of model Mt. In (5), the numerator is the product of the density and the prior, with all integrating constants included, and the denominator is the posterior density (Chib, 1995). Consequently, the Bayes factor for Mt with M1 is

BFt1 = [f(y|θt, Mt) π(θt|Mt) / π(θt|y, Mt)] / [f(y|θ1, M1) π(θ1|M1) / π(θ1|y, M1)].   (6)
If θ* ∈ Θt then θ* ∈ Θ1 and f(y|θ*, Mt) = f(y|θ*, M1). Therefore (6) reduces to

BFt1 = [π(θ*|Mt) / π(θ*|y, Mt)] / [π(θ*|M1) / π(θ*|y, M1)],   (7)
where θ* ∈ Θt. Note that θ* can include parameters that are involved in constraints, θ*c, and parameters that are unconstrained in all models, θ*u. A priori, θ*c and θ*u are assumed to be independent and consequently (7) is equal to

BFt1 = [π(θ*c|Mt) π(θ*u|Mt) / π(θ*c, θ*u|y, Mt)] / [π(θ*c|M1) π(θ*u|M1) / π(θ*c, θ*u|y, M1)],   (8)
where π(θ*u|M1) and π(θ*u|Mt) are equal and therefore cancel. Applying (2), with Θt ⊂ Θ1 for all t, gives

π(θ*c|Mt) = π(θ*c|M1) IMt(θ*c) / ∫ π(θc|M1) IMt(θc) dθc = ct · π(θ*c|M1),   (9)

and likewise

π(θ*c, θ*u|y, Mt) = π(θ*c, θ*u|y, M1) IMt(θ*c) / ∫∫ π(θc, θu|y, M1) IMt(θc) dθc dθu = dt · π(θ*c, θ*u|y, M1),   (10)
where ct and dt are constants. Substituting (9) and (10) in (8) leads to

BFt1 = [ct · π(θ*c|M1) / (dt · π(θ*c, θ*u|y, M1))] / [π(θ*c|M1) / π(θ*c, θ*u|y, M1)] = ct / dt.   (11)
The proportion of the encompassing prior that is in agreement with the constraints of model Mt provides the value 1/ct; the proportion of the encompassing posterior in agreement with the constraints of Mt is 1/dt. The ratio (1/dt)/(1/ct) = ct/dt nicely reflects that the Bayes factor for two models (here Mt and M1) actually measures the change in the odds in favor of Mt when going from the prior to the posterior (Lavine and Schervish, 1999). Besides providing an easy and straightforward estimate of the Bayes factor for (in)equality constrained models, the quantities 1/ct and 1/dt will be used to examine the sensitivity of BFt1 to the encompassing prior. This will be done in the next sections for inequality and about equality constraints separately.

3. Inequality constraints

In this section, two general and intuitive guidelines for specification of the encompassing prior are presented. The estimation approach for the Bayes factor for nested models, as presented in the previous section, does not require that the prior is specified according to these guidelines. However, the combination of the encompassing prior set-up, the consequent ct/dt estimator, and the prior specification introduced below has nice consequences in terms of prior insensitivity for certain types of inequality constrained models.

3.1. Specification of the encompassing prior

Note that, for notational convenience, in the sequel the encompassing prior π(θ1|M1) = π(θc|M1)π(θu|M1) = π(θc)π(θu). The two general guidelines proposed for the specification of the encompassing prior are:

1. The prior for the encompassing model is specified such that it does not favor any of the (ordered) models a priori. This is accomplished by specifying similar prior distributions for those elements that are subjected to constraints in one or more of the nested models, that is, the elements of θc.
• When the parameters contained in θc are a priori considered to be independent, this means that for all elements the same prior distribution is specified, i.e. π(θc) = π(θ1) ··· π(θK), with π(θk) = π(θ) for θk ∈ θc and k = 1, ..., K. If θc = {θ1, ..., θB}, i.e. θc falls apart into B groups b = 1, ..., B, and the constraints apply only to parameters from the same group, then π(θk) = πb(θ) for θk ∈ θb.
• When the parameters contained in θc are a priori considered to be correlated, this means that also the correlations between all constrained parameters are a priori considered to be equal. For example, when the encompassing prior is a multivariate normal distribution, it is specified using equal prior means, equal prior variances and equal prior covariances. However, the influence of any prior correlation between parameters is dominated by the large prior variances (see guideline 2 hereafter), and is therefore not very relevant.

2. Furthermore, a vague prior is specified for the encompassing model. Using vague or diffuse priors (that is, priors that are dominated by the observed data) leads to an encompassing posterior that is virtually independent of the prior. As was noted in the introduction, vague priors do not work well for testing or model selection problems in general (Berger and Pericchi, 2001), but they can be used, and lead to sensible results, when comparing nested models.

3.2. Prior sensitivity

The prior sensitivity for inequality constrained models will be discussed by examining the effect on 1/ct and 1/dt, the proportions of the encompassing prior and posterior in agreement with the constrained model, respectively. The sensitivity with respect to π(θc) and π(θu) will be discussed separately. In the previous section, it was already seen that π(θu) does not influence 1/ct. It is also clear that π(θu) does influence the posterior distribution, and thus 1/dt. However, it is well known (see, for example, Gelman et al., 1995, p. 101) that this influence is small when vague priors are used, i.e. if the data dominate the prior distribution. Stated otherwise, the proportion of the encompassing posterior in agreement with the constraints of model Mt is almost independent of the encompassing prior if the data dominate the prior.
Similarly, if π(θc) is vague, its influence on the posterior distribution is negligible; that is, its influence on 1/dt is negligible if the data dominate the prior. This leaves only the possible influence of π(θc) on 1/ct. Several types of inequality constraints are discussed.

3.2.1. Models for which 1/ct is not sensitive to the prior

Let θc = {θ(1), θ(2)}. For models using a constraint of the form

Σ_{p=1}^{P} mp θp(1) > Σ_{q=1}^{Q} nq θq(2),   (12)

with the restriction Σ_{p=1}^{P} mp = Σ_{q=1}^{Q} nq, the value for 1/ct is independent of the encompassing prior π(θc). Firstly, consider an encompassing prior assuming independence between the prior elements, that is, π(θc) = π(θ1(1)) ··· π(θP(1)) π(θ1(2)) ··· π(θQ(2)). According to the first guideline of Section 3, θp(1), θq(2) ~ π(θ), i.i.d., for p = 1, ..., P and q = 1, ..., Q. As a consequence P(θp(1) > θq(2)) = .50 ∀p, q. Therefore, also P(Σ_{p} mp θp(1) > Σ_{q} nq θq(2)) with Σ_{p} mp = Σ_{q} nq is .50, independent of the choice of π(θ). Note that (12) also applies if θ is replaced by g(θ). Stated otherwise, the result is independent of the parametrization, and the Borel–Kolmogorov paradox does not apply. The same holds for constraint (14) below.

In a similar way, P(θp(1) > θq(2)) for

π(θc) = N( (μ, μ)′, [σ² τ; τ σ²] ),   (13)

with μ, σ² and τ denoting, respectively, the prior mean, variance and covariance of the encompassing prior, is also .50 ∀p, q, because π(θp, θq) = π(θq, θp).

These results can be elaborated in two ways. Firstly, consider models with more constraints of the form (12). For example, let θc = {θ1, θ2, θ3, θ4}. The value of 1/c2 for model M2: θ1 > θ2 > θ3 > θ4 equals 1/4! = .04167, independent of the actual choice of π(θc). Secondly, the parameters can be organized in two groups, with as the model of interest M3: θ1 > θ2 ∧ θ3 > θ4. The value of 1/c3 equals .50 × .50 = .25, independent of the prior distribution for each group of parameters. Note that this holds for encompassing priors assuming prior independence as well as for priors with homogeneous prior covariances.
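These invariance results are easy to check by simulation. The sketch below (the two prior scales are arbitrary illustrative choices) estimates 1/c2 for the full ordering θ1 > θ2 > θ3 > θ4 under two very different i.i.d. encompassing priors; both estimates should be close to 1/4! = .04167:

```python
import numpy as np

rng = np.random.default_rng(1)

def prop_ordered(draws):
    """Proportion of prior draws with theta_1 > theta_2 > theta_3 > theta_4,
    i.e. a Monte Carlo estimate of 1/c_2."""
    return np.mean(np.all(np.diff(draws, axis=1) < 0, axis=1))

n = 200_000
# Two deliberately different exchangeable priors (scales are arbitrary):
normal_prior = rng.normal(0.0, 100.0, size=(n, 4))
uniform_prior = rng.uniform(-5.0, 5.0, size=(n, 4))

# Both proportions should be close to 1/4! = .04167, whatever the prior,
# as long as the four parameters are i.i.d. a priori.
print(prop_ordered(normal_prior), prop_ordered(uniform_prior))
```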
Similar considerations hold for models using one or more constraints of the form

Π_{p=1}^{P} θp(1) > Π_{q=1}^{Q} θq(2),   (14)
with the restriction P = Q. Constraints of the form (14) are for instance seen in models for contingency tables, where the θ's represent cell probabilities and constraints are imposed on odds ratios. An illustration will be provided in Section 7.

3.2.2. Models for which 1/ct is sensitive to the prior

Another inequality constrained hypothesis is the one-sided test, that is (in its simplest form), M2: θ > 0. Under the general guidelines presented in this section, the value for 1/c2 is not independent of the exact specification of the encompassing prior. However, adding one specification guideline that restricts the encompassing prior to be symmetric around the test value (zero in this example) ensures a value of .5 for 1/c2. Note that the types of constraints considered in this paper are mainly in terms of parameters being smaller or larger than other parameters. One-sided testing and the effect of prior specifications is discussed extensively in Casella and Berger (1987).

A final example of an inequality constrained model is M3: θ1 > θ2 + δ, where δ can be chosen such that θ1 is relevantly larger than θ2. Again, 1/c3 is not independent of the encompassing prior. Let π(θ) ~ U[−z, z]; then 1/c3 → .5 for z → ∞. This implies that for z → ∞, 1/c3 is evaluated as if the model is θ1 > θ2 (i.e. the quantity δ does not affect 1/c3). Stated otherwise, the effect of δ depends on the vagueness of the prior.

4. About equality constraints

In this paper a 'no effect' model is represented by about equality constraints between the model parameters. The motivation is twofold. Firstly, the model 'θ1 and θ2 are approximately equal' is in our opinion more realistic than 'θ1 and θ2 are exactly equal'. To quote Berger and Delampady (1987) on this concept: 'It is rare, and perhaps impossible, to have a null hypothesis that can be exactly modeled as θ = θ0' and '... surely there is some effect, although perhaps a very miniscule effect'.
Secondly, from (11) it is immediately seen that, using exact equality constraints in the encompassing set-up, the Bayes factor is not defined, i.e. 1/ct = 1/dt = 0. Using about equality constraints, e.g.

M4: θ1 ≈ θ2 ⇔ |θ1 − θ2| < δ,   (15)

with a small value for δ, the Bayes factor is defined.

4.1. Prior sensitivity

It is easily seen that model selection based on encompassing priors is not objective for M4. As before, 1/d4 is virtually independent of the encompassing prior. However, this does not hold for 1/c4. In (15), let π(θ) ~ U[−z, z]; then 1/c4 → 0 for z → ∞. Stated otherwise, the larger z, the smaller the Bayes factor in favor of the unconstrained model. This is a phenomenon similar to Bartlett's or Lindley's paradox (e.g. Lindley, 1957; Bernardo and Smith, 1994). In the context of precise hypotheses (e.g. θ = θ0 versus θ ≠ θ0) the same phenomenon occurs (see for instance Berger and Sellke, 1987). In this context, suggestions for 'automatic' or default priors can be found in the literature (for instance, Jeffreys, 1961; Zellner and Siow, 1980; Smith and Spiegelhalter, 1980). Although the exact functional form of the prior is rather irrelevant, the scale factor (i.e. the amount of 'vagueness' of the prior) has a large effect on the answer. Berger and Delampady (1987) conclude that there is no prior distribution for θ that can claim to be objective. Also for about equality constrained models, the Bayes factor for the nested with the encompassing model increases in favor of the nested model when the prior gets more diffuse. Therefore, a subjective choice with respect to the degree of diffuseness of the encompassing prior must be made. The resulting Bayes factor depends on the encompassing prior distribution.
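This sensitivity is easy to demonstrate numerically. The sketch below (δ = 1 and the grid of z values are arbitrary illustrative choices) estimates 1/c4 = P(|θ1 − θ2| < δ) under increasingly diffuse U[−z, z] encompassing priors; the proportion behaves like δ/z for large z, so it vanishes as the prior gets more diffuse:

```python
import numpy as np

rng = np.random.default_rng(2)

def prop_about_equal(z, delta=1.0, n=500_000):
    """Estimate 1/c_4 = P(|theta_1 - theta_2| < delta) under an
    encompassing prior with theta_1, theta_2 i.i.d. U[-z, z]."""
    theta = rng.uniform(-z, z, size=(n, 2))
    return np.mean(np.abs(theta[:, 0] - theta[:, 1]) < delta)

# As the prior gets more diffuse (larger z), 1/c_4 shrinks towards zero,
# so the Bayes factor increasingly favors the about equality model:
for z in (5, 50, 500):
    print(z, prop_about_equal(z))
```

For θ1, θ2 i.i.d. U[−z, z], the exact value is δ/z − δ²/(4z²), which makes the Lindley-type behavior 1/c4 → 0 for z → ∞ explicit.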
4.2. A suggestion for specification of the encompassing prior

We propose to use a prior for the encompassing model that is restricted to the region of interest (not more diffuse than is reasonable for the data at hand) and relatively flat (noninformative) within this region. Stated differently, we aim for a balance between diffuse (noninformative) and not too diffuse (to avoid too much effect of Lindley's paradox). When the parameter space is bounded, a restricted prior range follows automatically. For example, if θ is a probability, the parameter space is 'naturally' bounded by zero and one. Specification of the encompassing prior can also be based on the response format of the data. For instance, if data are measured on a scale from zero to ten, the prior distribution for the mean of these data is chosen such that it is vague over the whole range zero through ten, and zero otherwise. In other situations, the range of the observed data can be used to specify the encompassing prior. An example is given in Section 6.1.

5. Evaluation of the encompassing prior approach

A small simulation study is performed to evaluate the encompassing prior approach. The set-up of the simulation, as well as the parameter values used, largely corresponds to the illustration presented in the next section. Data are simulated from four sub-populations with a common variance (standard deviation set at the value 15). In the first simulation the population means are 90 in each sub-population. In the subsequent simulations the means are decreasing (with overall average 90), with increasing differences between the smallest and largest mean (denoted by Δ). The values used for Δ are 0, 5, 10, 15, 20 and 30, respectively. Each simulation is based on 500 data sets from the specified populations. The first six simulations are based on data sets of size 40 (four subgroups of 10 observations per group). Subsequently, data sets of 25 per subgroup (total N = 100) are simulated from the same six population specifications.
The set of models evaluated in each simulation consists of the unconstrained model (M1), two different inequality constrained models (M2: μ1 < μ2 < μ3 < μ4 and M3: μ1 > μ2 > μ3 > μ4) and the about equality model (M4). For the last model, the user-specified relevant difference d = 5, in accordance with the d that is chosen in the illustration (see Section 6). In the first two simulations (Δ = 0 and 5, respectively) the 'true model' is therefore the about equality model (M4). In the remaining simulations the data come from a population with ordered means, where the ordering is in accordance with M3.

In Table 1, each line presents the results of one of the 12 simulations. On the left-hand side, the population values used, as well as the corresponding Δ value (between brackets), are provided. On the right-hand side of the table the average posterior model probabilities (over the 500 replications) for each of the four models are presented. In the last column a percentage is provided, denoting the number of replications for which the 'true model' (M4 for the first two populations, and M3 for the other four) was the model with the highest posterior probability.

Table 1
Results of the simulation study
           μ1      μ2      μ3      μ4     (Δ)     M1     M2     M3     M4      %

Nj = 10
          90.0    90.0    90.0    90.0    (0)     .07    .04    .05    .84     98
          92.5    90.8    89.2    87.5    (5)     .08    .02    .12    .78     91
          95.0    91.7    88.3    85.0    (10)    .08    .01    .28    .63     23
          97.5    92.5    87.5    82.5    (15)    .10    .00    .47    .43     51
         100.0    93.0    87.0    80.0    (20)    .10    .00    .67    .23     79
         105.0    95.0    85.0    75.0    (30)    .08    .00    .90    .03     98

Nj = 25
          90.0    90.0    90.0    90.0    (0)     .06    .05    .04    .84     99
          92.5    90.8    89.2    87.5    (5)     .07    .01    .20    .72     88
          95.0    91.7    88.3    85.0    (10)    .09    .00    .49    .42     54
          97.5    92.5    87.5    82.5    (15)    .09    .00    .77    .14     89
         100.0    93.0    87.0    80.0    (20)    .07    .00    .91    .02    100
         105.0    95.0    85.0    75.0    (30)    .05    .00    .95    .00    100
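One replication of this simulation design can be sketched as follows (the function and variable names are ours; the design values, equally spaced decreasing means with overall average 90, range Δ and standard deviation 15, come from Table 1):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_dataset(delta, n_per_group, sd=15.0, grand_mean=90.0):
    """One replication of the simulation design: four groups with equally
    spaced, decreasing means (largest minus smallest = delta, average 90)
    and a common standard deviation of 15."""
    means = grand_mean + np.linspace(delta / 2, -delta / 2, 4)
    groups = np.repeat(np.arange(4), n_per_group)
    y = rng.normal(loc=means[groups], scale=sd)
    return y, groups

# For delta = 10 the population means are 95.0, 91.7, 88.3, 85.0,
# matching the third line of Table 1:
y, g = simulate_dataset(delta=10, n_per_group=10)
print(y.shape)
```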
The results shown are satisfactory. For the two populations with 'no differences' and 'no relevant differences', respectively, the posterior probabilities of M4 are high (ranging from .72 to .84), as is the percentage denoting how often M4 was the model with the highest probability (ranging from 88% to 99%). For the other four populations, where the true model is M3, it can be seen that for larger differences between the population means (larger Δ) the average posterior probabilities of M3 are higher. For instance, in the top panel of the table, where the sample size per group equals 10, these probabilities are, respectively, .28, .47, .67 and .90. Note that the value .28 may not be what we hope for, but it is also not surprising given the fact that we try to detect a very small effect (a relevant difference of 5 for data coming from a population with errors with a standard deviation of 15) with a small sample. The corresponding simulations with a sample size of 25 per group have probabilities of .49, .77, .91 and .95, respectively. The same conclusions can be drawn from the percentages of samples that assign the highest probability to the correct model. For small samples and (very) small effect sizes M3 is not often recognized as the correct model (23% for Δ = 10, Nj = 10). However, the percentages for slightly larger effects (51% for Δ = 15, Nj = 10) or for larger samples (54% for Δ = 10, Nj = 25) are already much improved, and all the other results with larger effects and/or larger sample sizes are more than satisfactory.

6. Example: analysis of variance

6.1. The data

The data come from an experiment to study the gain in weight of 40 rats fed on different diets, distinguished by both source and amount of protein (Snedecor and Cochran, 1967). The rats are randomly assigned to one of four groups receiving a high protein beef diet (group 1), a low protein beef diet (group 2), a high protein cereal diet (group 3), or a low protein cereal diet (group 4).
Gain in weight, yi (measured in grams), is assumed to be normally distributed:

yi = Σ_{k=1}^{4} μk dki + εi, with εi ~ N(0, σ²) i.i.d.,   (16)
Group membership is denoted by dki, where dki = 1 if the ith rat is a member of group k, and zero otherwise. The data are summarized in Table 2. Three nested models are compared with the encompassing model (M1). In the first theory (M2), a large effect of amount of protein is expected (more weight gain for higher amounts), followed by a smaller effect of the source (more gain for beef diets). The second theory (M3) describes the same directions of effects but expects a larger effect of source than of amount of protein. Finally, a 'no-effect' (or 'about equal means') model is defined using the restriction that the difference between the smallest and highest mean gain in weight is less than 5 g (M4):

M1: μ1, μ2, μ3, μ4; σ²,
M2: μ1 > μ3 > μ2 > μ4; σ²,
M3: μ1 > μ2 > μ3 > μ4; σ²,
M4: μ1 ≈ μ2 ≈ μ3 ≈ μ4; σ².   (17)
Table 2
Weight gain of rats

                          Sample size    Mean     Standard deviation
Group 1 (beef, high)      10             100.0    15.1
Group 2 (beef, low)       10              79.2    13.9
Group 3 (cereal, high)    10              85.9    15.0
Group 4 (cereal, low)     10              83.9    15.7
The prior for the unconstrained model that will be used in Section 6.2 is the product of independent conjugate prior distributions for each of the parameters, specified using the guidelines of Section 3, i.e.

π(μ, σ²|M1) = Inv-χ²(σ²|1; 236.4) × Π_{k=1}^{4} N(μk|89.6; 123.8).   (18)
In (18), Inv-χ²(σ²|1; 236.4) denotes a scaled inverse chi-square distribution with one degree of freedom and a data-based scale factor, and N(μk|·) denotes a normal distribution with a certain mean and variance. The specification of the hyperparameters of this normal distribution is based on the 99% credibility intervals of each element μk. The smallest lower bound and the largest upper bound of the four intervals provide one interval (l, u). The prior distribution for each μk is chosen such that the mean of the prior minus two standard deviations equals l, and the mean plus two standard deviations equals u. Note that any other method to specify a reasonable and diffuse encompassing prior can also be used, e.g. Jeffreys (1961) priors or Zellner–Siow (1980) priors. For specific models (in this example M2 and M3) any encompassing prior will give the same results as long as the data dominate the prior; that is, the results are independent of the specification of the encompassing prior. For other models, the specification of the encompassing prior does influence the results. The Bayes factors comparing each nested model with M1 using the encompassing prior in (18) are presented in Section 6.2. Sensitivity of the Bayes factors to different encompassing priors is examined and discussed in Section 6.3.

6.2. Estimation and evaluation of Bayes factors

As can be seen in (11), the Bayes factor for a constrained model with the encompassing model reduces to the ratio ct/dt. The proportion of the encompassing prior that is in agreement with the constraints of model Mt provides the value 1/ct; the proportion of the encompassing posterior in agreement with the constraints of Mt is 1/dt. As was shown in Section 3.2.1, for many models 1/ct can be computed exactly. In other cases the encompassing prior is sampled and 1/ct is estimated by the proportion of the sample that is in agreement with the constraints of model Mt.
To obtain 1/d_t, the encompassing posterior is sampled using a Gibbs sampler (see for instance Smith and Roberts, 1993) and the proportion of that sample in agreement with M_t provides an estimate of 1/d_t. Note that, with one posterior sample and, if 1/c_t cannot be obtained exactly, one prior sample, the values of BF̂_t1 = ĉ_t/d̂_t for each nested model (t = 2, ..., T) with the encompassing model can be computed.

For models where 1/c_t is computed exactly, the MCE of the estimated Bayes factor is computed using

MCE²(BF̂_t1) = VAR((1/d̂_t)/(1/c_t)) = c_t² VAR(1/d̂_t) ≈ c_t² (Q_t/Q)(1 − Q_t/Q)/Q,   (19)

where Q is the total number of parameter vectors sampled from the posterior, and Q_t is the number of parameter vectors in agreement with the constraints of model M_t. When 1/c_t is estimated using a sample from the prior, and 1/ĉ_t and 1/d̂_t are assumed to be independent, the delta method gives an approximation of the variance of the ratio of 1/d̂_t and 1/ĉ_t (Johnson et al., 1993, p. 55):

VAR((1/d̂_t)/(1/ĉ_t)) ≈ [E(1/d̂_t)/E(1/ĉ_t)]² { VAR(1/d̂_t)/[E(1/d̂_t)]² + VAR(1/ĉ_t)/[E(1/ĉ_t)]² }
                      ≈ [(Q_t/Q)/(R_t/R)]² { (Q_t/Q)(1 − Q_t/Q)/Q / [Q_t/Q]² + (R_t/R)(1 − R_t/R)/R / [R_t/R]² },   (20)

where R is the total number of parameter vectors sampled from the prior, and R_t is the number of parameter vectors in agreement with the constraints of model M_t.

The Bayes factors and MCEs for the set of models specified in (17) are provided in Table 3. For both BF21 and BF31 the value of 1/c_t is easily calculated, i.e. 1/c_2 = 1/c_3 = 1/4! = .04167. Although the value of 1/c_4 can also be obtained analytically, here it is estimated from a sample from the prior. The number of iterations is large (R = 1,000,000), because the proportion of the encompassing prior in agreement with about equality constraints is often very small. Note, however, that obtaining the value of 1/c_4 is still very fast, since no data and no complex calculations are involved.
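Equations (19) and (20) involve nothing more than binomial proportion variances, so the MCEs can be computed in a few lines. A sketch (illustrative, not the authors' code; the counts below are chosen to roughly reproduce the BF21 and BF41 entries of Table 3, with c_2 = 24, Q = 10,000 and R = 1,000,000):

```python
import math

def mce_exact_c(c_t: float, Q: int, Q_t: int) -> float:
    """MCE when 1/c_t is known exactly: square root of Eq. (19)."""
    p = Q_t / Q                        # estimate of 1/d_t
    var = c_t**2 * p * (1 - p) / Q     # c_t^2 * VAR(1/d_t)
    return math.sqrt(var)

def mce_estimated_c(Q: int, Q_t: int, R: int, R_t: int) -> float:
    """MCE when 1/c_t is estimated from R prior draws: square root of Eq. (20)."""
    p, q = Q_t / Q, R_t / R            # estimates of 1/d_t and 1/c_t
    var_p = p * (1 - p) / Q
    var_q = q * (1 - q) / R
    var = (p / q) ** 2 * (var_p / p**2 + var_q / q**2)   # delta method
    return math.sqrt(var)

# With 1/d_2 ~ .154 (so BF21 = 24 * .154 ~ 3.7) and, for BF41,
# 1/d_4 ~ .0033 and 1/c_4 ~ .01, the MCEs are about .09 and .06 as in Table 3.
print(round(mce_exact_c(24, 10_000, 1_542), 2),
      round(mce_estimated_c(10_000, 33, 1_000_000, 10_000), 2))
```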
6376
I. Klugkist, H. Hoijtink / Computational Statistics & Data Analysis 51 (2007) 6367 – 6379
Table 3
Evaluation of the encompassing prior approach

BF21 (MCE)    BF31 (MCE)    BF41 (MCE)
3.70 (.09)    1.24 (.05)    .33 (.06)
Table 4
Prior sensitivity of Bayes factors

π(μ_k)        π(σ²)            BF21 (MCE)   BF31 (MCE)   BF41 (MCE)
N(90, 125)    Inv-χ²(1, 250)   3.73 (.09)   1.30 (.05)   0.33 (.05)
N(0, 500)     Inv-χ²(1, 250)   3.70 (.09)   1.27 (.05)   1.54 (.33)
N(50, 2000)   Inv-χ²(1, 250)   3.65 (.09)   1.17 (.05)   8.18 (2.14)
U(60, 120)    U(0, 900)        3.68 (.09)   1.21 (.05)   0.73 (.18)
U(0, 300)     U(0, 900)        3.69 (.09)   1.20 (.05)   79.81 (27.3)
The estimates of 1/d_t for all models are based on one sample of Q = 10,000 iterations (after an initial burn-in period of 5000 iterations). As can be seen, the MCEs of the Bayes factors are small. It can be concluded that the numbers of iterations are large enough to give stable estimates.

6.3. Sensitivity to the prior of the unconstrained model

The results in the previous section are based on the data-based encompassing prior provided in (18). Note once more that for specific inequality constrained models the values of the Bayes factor for the constrained with the unconstrained, encompassing model are hardly influenced by the specification of the encompassing prior, as long as the prior is dominated by the data. This is illustrated in Table 4, where the first prior used is about the same as the prior underlying Table 3, and subsequently other encompassing priors are specified and the resulting Bayes factors are compared. The first three rows give the results for independent conjugate prior distributions for all model parameters, with different values for the hyperparameters. The fourth and fifth rows provide the results for independent bounded uniform prior distributions for all model parameters, with differing bounds. The numbers of iterations from prior (for BF41) and posterior are, respectively, R = 1,000,000 and Q = 10,000.

It can be seen that, as expected, the values of BF21 and BF31 are virtually independent of the specification of the encompassing prior. However, the value of BF41 is strongly affected by the prior. This is reflected in different values of 1/c_4 for the different encompassing priors. For instance, for the prior in the first row of Table 4, 1/c_4 = .01, whereas for the last prior 1/c_4 = .00002. The fit of model M_4 can only be interpreted conditional on the specification of the encompassing prior. The MCEs are mostly small and, for the models where only 1/d_t is estimated, similar for different encompassing priors.
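The mechanism behind this prior sensitivity is easy to reproduce: for an about equality constraint, the proportion of the encompassing prior inside the constrained region (1/c) shrinks as the prior becomes more diffuse, so the Bayes factor keeps growing. A sketch with a hypothetical about equality region |μ_1 − μ_2| < 1 (the region and the prior standard deviations are illustrative, not those of model M_4):

```python
import numpy as np

rng = np.random.default_rng(3)
R = 1_000_000

# Prior proportion of the about equality region |mu_1 - mu_2| < 1
# under increasingly diffuse priors: 1/c shrinks as the prior spreads out.
props = []
for sd in (10.0, 100.0, 1000.0):
    mu = rng.normal(0.0, sd, size=(R, 2))
    props.append(np.mean(np.abs(mu[:, 0] - mu[:, 1]) < 1.0))

print(props)   # each proportion roughly ten times smaller than the previous one
```

Since the posterior proportion 1/d is dominated by the data and stabilizes, the ratio (1/d)/(1/c) grows without bound as the prior variance increases, which is the Bartlett/Lindley behavior discussed in Section 8.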
However, for BF41, where both 1/d_4 and 1/c_4 are estimated, the MCEs become larger for more diffuse encompassing priors. The MCEs can be reduced by taking a larger sample from the encompassing prior (apparently, and not surprisingly, R = 1,000,000 is not a lot when priors as diffuse as N(50, 2000) and U(0, 300) are explored).

7. Example: contingency tables with ordered odds ratios

7.1. The data

The data (obtained from Agresti, 1984) presented in Table 5 are an example of a 2 × 2 × 2 contingency table containing counts of death penalties (p) for each of the four combinations of defendant's race (d) and victim's race (v). The 326 subjects cross-classified according to these variables were defendants in homicide indictments in 20 Florida counties during 1976–1977.
Table 5
Death penalty verdict by defendant's race and victim's race

Defendant's race   Victim's race   Death penalty: Yes   No
White              White           19                   132
White              Black           0                    9
Black              White           11                   52
Black              Black           6                    97
Analysis of these data (x) is based on a multinomial likelihood

L(x | θ) ∝ ∏_{d=0,1} ∏_{v=0,1} ∏_{p=0,1} θ_dvp^{x_dvp},   (21)
where 0 denotes 'White' or 'Yes' and 1 denotes 'Black' or 'No'. Four models are compared. The first model is the encompassing model (M_1). The second model represents the expectation that, both for white and black victims, the death penalty is imposed more often when the defendant is black than when the defendant is white. For white victims this expectation is represented by the restriction that the odds ratio θ_000 θ_101 / (θ_100 θ_001) should be smaller than one. Likewise, for black victims the constraint is θ_010 θ_111 / (θ_110 θ_011) < 1. In M_2, these constraints are written in the form (14). The third model represents the expectation that, both for white and black defendants, the death penalty is imposed more often when the victim is white than when the victim is black (M_3). The final model is a combination of both expectations (M_4):

M_1: θ_000, θ_001, θ_010, θ_011, θ_100, θ_101, θ_110, θ_111,
M_2: θ_000 θ_101 < θ_100 θ_001 ∧ θ_010 θ_111 < θ_110 θ_011,
M_3: θ_000 θ_011 > θ_010 θ_001 ∧ θ_100 θ_111 > θ_110 θ_101,
M_4: θ_000 θ_101 < θ_100 θ_001 ∧ θ_010 θ_111 < θ_110 θ_011 ∧ θ_000 θ_011 > θ_010 θ_001 ∧ θ_100 θ_111 > θ_110 θ_101.   (22)
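The odds-ratio constraints are invariant under the normalization of θ, and the two constraints of M_2 involve disjoint sets of cells, so under an exchangeable encompassing prior each holds with probability 1/2 and 1/c_2 = .25. With a conjugate Dirichlet encompassing prior (used in Section 7.2) the posterior is again Dirichlet, so both proportions can be estimated by direct sampling. A minimal sketch for M_2 with a uniform Dirichlet prior (illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(7)
R = Q = 100_000

# Cell counts from Table 5, in the order 000, 001, 010, 011, 100, 101, 110, 111
# (indices d, v, p with 0 = 'White'/'Yes' and 1 = 'Black'/'No').
counts = np.array([19, 132, 0, 9, 11, 52, 6, 97])

def in_M2(theta):
    """Constraints of M2 in (22): both odds ratios smaller than one."""
    t = theta.T   # t[i] holds the draws for cell i
    return ((t[0] * t[5] < t[4] * t[1]) &   # theta_000 theta_101 < theta_100 theta_001
            (t[2] * t[7] < t[6] * t[3]))    # theta_010 theta_111 < theta_110 theta_011

x0 = 1.0                                    # uniform encompassing prior
prior = rng.dirichlet(np.full(8, x0), size=R)
post = rng.dirichlet(counts + x0, size=Q)   # conjugacy: the posterior is Dirichlet

inv_c2 = in_M2(prior).mean()                # close to 1/c_2 = .25
inv_d2 = in_M2(post).mean()
print(inv_c2, inv_d2, inv_d2 / inv_c2)      # last value: BF21, cf. Table 6
```

The same two proportions, with the constraint check swapped for those of M_3 or M_4, give the other Bayes factors in Table 6.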
Agresti and Coull (2002) present an overview of methods for the analysis of contingency tables under inequality constraints. Apparently, there are no classical alternatives for the Bayesian analysis executed in the sequel. Since all nested models in the set above are of the form (14), the Bayesian model selection is not sensitive to the encompassing prior if the prior is dominated by the data. This will be illustrated in the next subsection.

7.2. Sensitivity to the prior of the unconstrained model

The encompassing prior used is the conjugate Dirichlet distribution

g(θ) ∝ ∏_{d=0,1} ∏_{v=0,1} ∏_{p=0,1} θ_dvp^{x_0 − 1},

where x_0 denotes the parameter of the prior distribution. A uniform density is obtained by x_0 = 1; the distribution then assigns equal density to any vector θ satisfying Σ_{d,v,p} θ_dvp = 1 (Gelman et al., 1995, p. 76). However, the amount of information in the data (note that one cell has zero observations) does not dominate the information in the prior. Consequently, we expect the model selection to be affected to some extent by the encompassing prior. The results are compared with those for three other encompassing priors, obtained by specifying the values .5, .1 and .01 for x_0.

It is easily seen that 1/c_2 = 1/c_3 = .25. The value of 1/c_4 is estimated based on R = 100,000 iterations from the encompassing prior. The values of 1/d_t, for t = 2, 3, 4, are obtained from Q = 100,000 iterations from the encompassing posterior. In Table 6 the values of the Bayes factors and corresponding MCEs are provided for each of the four encompassing priors; the last two columns provide the values of 1/c_4 and 1/d_4. The resulting Bayes factors differ for the four encompassing priors, although the order in which the four models are supported is similar (i.e. M_4, M_3, M_2, M_1). For the encompassing prior with x_0 = 1, the Bayes factors are closer to one than for priors with smaller values of x_0. In this example, where some cells have small observed frequencies, and
Table 6
Prior sensitivity of Bayes factors

x_0    BF21 (MCE)    BF31 (MCE)    BF41 (MCE)   1/c_4   1/d_4
1      1.62 (.02)    2.94 (.006)   4.81 (.05)   .083    .401
.5     2.34 (.006)   3.49 (.004)   7.05 (.08)   .082    .581
.1     3.05 (.005)   3.88 (.002)   9.30 (.10)   .082    .761
.01    3.22 (.005)   3.96 (.001)   9.68 (.10)   .083    .805
more specifically where one cell has zero observations, the data do not dominate the prior, and therefore the value of 1/d_t varies for the different priors. This is illustrated in the last two columns of Table 6, where the estimates of 1/c_4 and 1/d_4 are provided. As can be seen, 1/c_4 is about equal for the four priors, but the estimates of 1/d_4 increase with a decreasing amount of information in the encompassing prior.

8. Discussion

In this paper Bayesian model selection for inequality constrained models is discussed. The approach described is based on the specification of just one prior, that is, the prior for the unconstrained, encompassing model. The priors for the constrained (nested) models follow from this so-called encompassing prior by restriction of the parameter space. In this set-up, the Bayes factor of any nested model with the encompassing model has a nice interpretation: it is the ratio of the proportions of the encompassing posterior and prior, respectively, that are in agreement with the constraints of the nested model (respectively, 1/d_t and 1/c_t). These proportions can be estimated using samples from the encompassing prior and posterior, so the computation of Bayes factors does not require the estimation of marginal likelihoods, which can be burdensome. Furthermore, with just one prior sample and one posterior sample the Bayes factors for all pairs of models are obtained. It is also shown that for some classes of models the value of 1/c_t can be derived exactly, making the estimation even more efficient. Since the estimation of Bayes factors reduces to the estimation of one or two proportions, Monte Carlo Errors are easily obtained.

Applying some general and intuitive properties for the specification of encompassing priors allows virtually objective model selection for specific classes of models, that is, the resulting Bayes factors are (almost) independent of the actual specification of the encompassing prior.
For other classes of models, the model selection is strongly affected by the encompassing prior. An example is a model with about equality constraints between model parameters. The evidence in favor of the constrained model increases with the vagueness of the encompassing prior, a phenomenon known as Bartlett's or Lindley's paradox. Our approach does not solve this well-known problem, but it does give some insight into the underlying process. For this kind of model, a subjective choice for the specification of the encompassing prior must be made; the resulting Bayes factor can only be interpreted conditional on this choice.

References

Agresti, A., 1984. Analysis of Ordinal Categorical Data. Wiley, New York.
Agresti, A., Coull, B.A., 2002. The analysis of contingency tables under inequality constraints. J. Statist. Plann. Inference 107, 45–73.
Akaike, H., 1987. Factor analysis and AIC. Psychometrika 52, 317–332.
Barlow, R.E., Bartholomew, D.J., Bremner, J.M., Brunk, H.D., 1972. Statistical Inference Under Order Restrictions. Wiley, New York.
Berger, J.O., Delampady, M., 1987. Testing precise hypotheses. Statist. Sci. 2, 317–352.
Berger, J., Pericchi, L., 1996. The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91, 109–122.
Berger, J., Pericchi, L., 2001. Objective Bayesian methods for model selection: introduction and comparison. In: Lahiri, P. (Ed.), Model Selection. IMS Lecture Notes—Monograph Series, vol. 38. Institute of Mathematical Statistics, Beachwood, Ohio, pp. 135–207.
Berger, J., Sellke, T., 1987. Testing of a point null hypothesis: the irreconcilability of significance levels and evidence (with discussion). J. Amer. Statist. Assoc. 82, 112–139.
Bernardo, J.M., Smith, A.F.M., 1994. Bayesian Theory. Wiley, Chichester.
Bozdogan, H., 1987. Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370.
Carlin, B.P., Chib, S., 1995. Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 57, 473–484.
Carlin, B.P., Louis, T.A., 2000. Bayes and Empirical Bayes Methods for Data Analysis, second ed. Chapman & Hall, London.
Casella, G., Berger, R.L., 1987. Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). J. Amer. Statist. Assoc. 82, 106–111.
Chen, M.H., Shao, Q.M., 1997. On Monte Carlo methods for estimating ratios of normalizing constants. Ann. Statist. 25, 1563–1594.
Chib, S., 1995. Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90, 1313–1321.
Dawid, A.P., Lauritzen, S.L., 2001. Compatible prior distributions. In: George, E.I. (Ed.), Bayesian Methods with Applications to Science, Policy and Official Statistics. Selected Papers from ISBA 2000: The Sixth World Meeting of the International Society for Bayesian Analysis. Eurostat, Luxembourg, pp. 109–118.
Dykstra, R.L., Robertson, T., Silvapulle, M.J. (Eds.), 2002. Statistical inference under inequality constraints [Special issue]. J. Statist. Plann. Inference 107, 1–2.
Gelfand, A.E., Dey, D., 1994. Bayesian model choice: asymptotics and exact calculations. J. Roy. Statist. Soc. Ser. B 56, 501–514.
Gelfand, A.E., Smith, A.F.M., Lee, T., 1992. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Amer. Statist. Assoc. 87, 523–532.
Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 1995. Bayesian Data Analysis. Chapman & Hall, London.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
Han, C., Carlin, B.P., 2001. Markov chain Monte Carlo methods for computing Bayes factors: a comparative review. J. Amer. Statist. Assoc. 96, 1122–1132.
Jeffreys, H., 1961. Theory of Probability, third ed. Oxford University Press, London.
Johnson, N.L., Kotz, S., Kemp, A.W., 1993. Univariate Discrete Distributions. Wiley, New York.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.
Klugkist, I., Kato, B., Hoijtink, H., 2005. Bayesian model selection using encompassing priors. Statist. Neerlandica 59, 57–69.
Lavine, M., Schervish, M.J., 1999. Bayes factors: what they are and what they are not. Amer. Statist. 53, 119–122.
Leucari, V., Consonni, G., 2003. Compatible priors for causal Bayesian networks. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M. (Eds.), Bayesian Statistics, vol. 7. Clarendon Press, Oxford, pp. 597–606.
Lindley, D.V., 1957. A statistical paradox. Biometrika 44, 187–192.
Moreno, E., 2005. Objective Bayesian methods for one-sided testing. Test 14, 181–198.
Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. Ser. B 56, 3–48.
Robertson, T., Wright, F.T., Dykstra, R.L., 1988. Order Restricted Statistical Inference. Wiley, New York.
Roverato, A., Consonni, G., 2004. Compatible prior distributions for DAG models. J. Roy. Statist. Soc. Ser. B 66, 47–62.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Silvapulle, M.J., Sen, P.K., 2004. Constrained Statistical Inference. Wiley, New York.
Smith, A.F.M., Roberts, G.O., 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. Ser. B 55, 3–23.
Smith, A.F.M., Spiegelhalter, D.J., 1980. Bayes factors and choice criteria for linear models. J. Roy. Statist. Soc. Ser. B 42, 213–220.
Snedecor, G.W., Cochran, W.G., 1967. Statistical Methods, sixth ed. Iowa State University Press, Ames, Iowa.
Zellner, A., Siow, A., 1980. Posterior odds ratios for selected regression hypotheses. In: Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M. (Eds.), Bayesian Statistics. University Press, Valencia, pp. 585–603.