Uncertainty in Prior Elicitations: a Nonparametric Approach

by Jeremy Oakley and Anthony O'Hagan

Department of Probability and Statistics, University of Sheffield, Sheffield, S3 7RH

[email protected]

Summary

A key task in the elicitation of expert knowledge is to construct a specific elicited distribution from the finite, and usually small, number of statements that have been elicited from the expert. These statements typically specify some quantiles of the distribution, perhaps the mode, and sometimes the mean or other moments. Such statements are not enough to identify the expert's probability distribution uniquely, and the usual approach is to fit some member of a convenient parametric family. There are two clear deficiencies in this solution. First, the expert's beliefs are forced to fit the parametric family. Second, no account is then taken of the many other possible distributions that might have fitted the elicited statements equally well. We present an approach which tackles both of these deficiencies. Our model is nonparametric, allowing the expert's distribution to take any continuous form. It also quantifies the uncertainty in the resulting elicited distribution. Formally, the expert's density function is treated as an unknown function, about which we make inference. The result is a posterior distribution for the expert's density function. The posterior mean serves as a 'best fit' elicited distribution, while


the variance around this fit expresses the uncertainty in the elicitation. Two illustrations of our method are given using test examples.

KEY WORDS: Expert elicitation, Gaussian process, non-parametric density estimation.

1 Introduction

We consider the elicitation of a single person's beliefs about some unknown continuous variable θ. We refer to this person as 'the expert', and the expert's 'beliefs' is a term intended to encompass all the expert's knowledge, experience and expertise concerning θ. The objective of the elicitation is to identify the underlying continuous density function f(θ) that represents the expert's beliefs.

The field of elicitation is diverse, and addresses a number of related but distinct problems. One common problem is the reconciliation of the beliefs of several experts, whereas we consider a single expert. In Lindley et al. (1979), the aim is to reconcile incoherent expert probability judgments, and prior beliefs about the expert's probabilities for events {E_1, ..., E_k} are stated in terms of the analyst's own probabilities for those events, and the probabilities the analyst believes the expert will state if a particular event E_i is true. The approach of French (1980), Lindley (1982) and Lindley and Singpurwalla (1986) is similar, in that the analyst receives the expert's statements as data, to be used to update the analyst's own beliefs about θ. For convenience of exposition, our approach also features an analyst who receives the expert's elicited statements, so it is important to emphasise that our objective is quite different from that of these other authors. Our analyst's role is to assist the expert to specify her prior distribution for θ. Accordingly, the analyst is making inference about the expert's density function f(θ), not about θ. Another way to view the distinction is that in the approach of French, Lindley and Singpurwalla it is the analyst who is the decision-maker, and therefore whose distribution for θ is ultimately required, whereas in our method the decision-maker is implicitly the expert, because it is the expert's distribution for θ that is required.

It is assumed that the expert cannot state her density function explicitly; only that she can state certain summaries of her distribution, such as the mean or various percentiles. (For convenience and to avoid convoluted language, we let the expert be female and the analyst be male.) A finite set of elicited summaries does not of course identify f(θ) uniquely. A common solution is to choose a density function that fits those summaries as closely as possible from some convenient parametric family of distributions. Some examples of this are Kadane et al. (1980), O'Hagan (1998), Garthwaite and Al-Awahdi (2002) and Oakley (2002). Two deficiencies in this approach are that it forces the expert's beliefs to fit the parametric family, and that it fails to acknowledge the fact that many other densities might have fitted those same summaries equally well. Practical responses to these criticisms include using a process of feedback to verify that the fitted density is a satisfactory approximation to the decision-maker's beliefs, and checking the sensitivity of subsequent inferences or decisions to varying the fitted f(θ).

We present here an approach that addresses both deficiencies together in a single coherent framework. Our fitted estimate of the expert's f(θ) is nonparametric, thereby avoiding forcing it into a parametric family, and we explicitly quantify the uncertainty around this estimate that results from the expert having stated only a finite set of summaries. We treat the problem of fitting a density function to given summaries as an exercise in Bayesian inference: we have an unknown quantity of interest (the expert's density function f(θ)); the analyst formulates his prior beliefs about this quantity, then receives data in the form of summaries of the expert's density function. The analyst then updates to obtain his posterior distribution for f(θ). The analyst's posterior mean can then be offered as a 'best estimate' for f(θ), while his posterior distribution quantifies the remaining uncertainty around this estimate. The approach promises other practical benefits that are outlined in Section 5.

We emphasise that the separation of the roles of the analyst and the expert is primarily for convenience of exposition. In principle, they could be the same person. However, the expert rarely has expertise in probability, and so the role of a separate analyst is often a practical necessity. In this framework, one should clearly be wary about the analyst adding, through his prior distribution for f(θ), his own opinions and beliefs to those of the expert. We nevertheless argue that it is possible, and can be of real practical value, to express prior opinions about someone else's density function f(θ) without necessarily having knowledge about θ itself. Consider the following simple example: suppose it is given that (possibly as the result of an insubstantial elicitation procedure) the only statement made by an expert is that she is sure an unknown quantity lies in some interval [a, b]. In this instance we would not believe that all density functions with support confined to this interval are equally likely. For example, we would not expect the expert to have negligible probability in the interval [a, (a + b)/2], otherwise it is likely that she would have stated that she was sure that the unknown quantity lies in the interval [(a + b)/2, b]. We might also expect her density function to be smooth, and perhaps more likely to be unimodal than bimodal, as the expert might otherwise have stated two intervals rather than one. These opinions can be expressed without explicitly stating our beliefs about θ, or even without knowledge of what θ represents. Our formulation of the analyst's prior for f(θ) will explicitly incorporate such beliefs. We discuss further the non-neutrality of our analyst in Section 5.

There is much more to the process of eliciting expert knowledge than is addressed in this article. Questions should be carefully formulated in terms that the expert understands, and so as to avoid known ways in which experts tend to misjudge probabilities. For instance, the parameter θ should ideally be a quantity that is observable by the expert, so that the expert can make probability judgements about θ directly. Kadane and Wolfson (1998) provide a good review of these issues, and in particular have argued that experts should only be asked for judgements about observable quantities.

In the next section we describe the analyst's prior beliefs about the unknown density function. Posterior inference for this density is described in Section 3. Two test examples are given in Section 4.

2 A prior distribution for the unknown density function

We now need to formulate the analyst's prior distribution for the expert's density function f(θ). In subsection 2.1 we first formulate the analyst's prior in general terms as a Gaussian process, and in subsection 2.2 we consider how to specify prior distributions for the hyperparameters of the model. Then in subsection 2.3 we discuss the appropriateness of the Gaussian process and some alternative specifications.

2.1 The scaled Gaussian process

A very useful and flexible prior model for an unknown function is the Gaussian process, and we will assume that the analyst's prior beliefs about f(θ) can be represented by a Gaussian process. In particular, the analyst's prior distribution for any finite set of points on this function is multivariate normal. Gaussian process priors for functions have been proposed in various different settings, including regression (O'Hagan, 1978, and Neal, 1999), classification (Neal, 1999) and numerical analysis (O'Hagan, 1992). The Gaussian process is specified by giving its mean function and variance-covariance function. We will model these hierarchically in terms of a vector α of hyperparameters. First let the analyst's prior expectation of f(θ) be some member g(θ | u) of a suitable parametric family with parameters u. Thus

$$E\{f(\theta) \mid \alpha\} = g(\theta \mid u). \qquad (1)$$

Now it would not be realistic to suppose that the variance of f(θ) would be the same for all θ. In general, where the analyst expects f(θ) to be smaller, his prior variance should be smaller in absolute terms. We reflect this in our model by supposing that the variance-covariance function has the scaled stationary form

$$\mathrm{Cov}\{f(\theta), f(\phi) \mid \alpha\} = g(\theta \mid u)\, g(\phi \mid u)\, \sigma^2 c(\theta, \phi), \qquad (2)$$

where c(θ, φ) is a correlation function that takes the value 1 at θ = φ and is a decreasing function of |θ − φ|. In general, the function c(., .) must ensure that the prior variance-covariance matrix of any set of observations of f(.) (or functionals of f(.)) is positive semi-definite. Here we choose the function

$$c(\theta, \phi) = \exp\left\{-\frac{1}{2b}(\theta - \phi)^2\right\}. \qquad (3)$$

This will be seen to be a mathematically convenient choice, and implies that f(.) is infinitely differentiable with probability 1. This formulation was given in Kennedy and O'Hagan (1996), who were interested in quadrature for computationally expensive density functions. Our model represents a belief that the expert's density function f(θ) will, to some extent, approximate to a member of the parametric family g(θ | u). However, the model is nonparametric and allows the true f(θ) to have any form at all. The hyperparameter σ² specifies how close the true density function will be to its prior mean, and so governs how well it approximates to the parametric family. The hyperparameter b controls the smoothness of the true density. If b is large, then two points f(θ) and f(φ) will be highly correlated even if θ and φ are far apart.
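To make the formulation concrete, the following is a minimal sketch of the prior moments (1)-(3) evaluated on a grid, taking g(. | u) to be a normal density; the function name and the illustrative hyperparameter values are our own choices, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def prior_moments(theta, m=0.0, v=1.0, sigma2=0.1, b=1.0):
    """Prior mean (1) and scaled covariance (2)-(3) of f at the points `theta`,
    with g(. | u) taken to be the N(m, v) density."""
    theta = np.asarray(theta, dtype=float)
    g = norm.pdf(theta, loc=m, scale=np.sqrt(v))   # E{f(theta) | alpha} = g(theta | u)
    d2 = (theta[:, None] - theta[None, :])**2
    c = np.exp(-0.5 * d2 / b)                      # correlation function (3)
    cov = sigma2 * np.outer(g, g) * c              # scaled covariance (2)
    return g, cov

# The prior standard deviation of f(theta) shrinks where g itself is small:
mean, cov = prior_moments(np.linspace(-3, 3, 7))
print(np.round(mean, 3), np.round(np.sqrt(np.diag(cov)), 3))
```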

2.2 Hierarchical prior distribution

The hyperparameters of this model are α = (u, σ², b). In general, we would wish to use noninformative prior distributions for the components of α. Although the analyst may in practice have some idea himself about what values to expect for θ, and so may have prior knowledge about u, our objective is to elicit the expert's beliefs about θ, i.e. f(θ), and we would not wish this to be coloured by the analyst's beliefs. We would also wish to express noninformative prior beliefs about σ², and so allow the analyst to learn just how well the expert's density function approximates to the underlying parametric family.

However, there are good reasons for wishing to specify some informative beliefs about the smoothness parameter b. The first reason is technical: an improper prior for b can lead to an improper posterior regardless of the data observed. Consider, for example, a prior of the form p(b) ∝ b⁻¹. In the likelihood function, any two sufficiently small values of b, say b_1 and b_2, cannot be distinguished by the data. This is because they will both imply virtual independence between f(θ) and f(φ), unless θ and φ are very close. The difference in behaviour of f(θ) implied by b_1 and b_2 will occur at distances |θ − φ| not observed in the data. The likelihood will tend to a non-zero constant as b tends to zero, which when combined with the prior p(b) ∝ b⁻¹ gives an improper posterior. Furthermore, the analyst will not expect the true density to be highly multimodal. If u contains a variance parameter v, then it is reasonable to suppose that b is of the same order of magnitude as v, since this then expresses a belief in moderate smoothness of the true density over the range for which f(θ) is non-negligible. This implies prior beliefs about the ratio b/v.

In the following sections we will consider the particular case where u = (m, v) and g(. | u) is the normal density with mean m and variance v. We re-parameterise the covariance function as follows:

$$c(\theta, \phi) = \exp\left\{-\frac{1}{2vb^*}(\theta - \phi)^2\right\}, \qquad (4)$$

where b∗ = b/v. We now let the prior distribution of α = (m, v, σ², b∗) have the form

$$p(m, v, \sigma^2, b^*) \propto v^{-1} \sigma^{-2} p(b^*), \qquad (5)$$

where p(b∗) will be an informative prior for b∗ expressing the above beliefs. To give an indication of what range of values of b∗ is plausible in practice, we fit a Gaussian process to points on a N(0, v) density function, and look at the likelihood function of log(b/v). The number of observations is increased until the prior variance-covariance matrix of the outputs in the data becomes ill-conditioned. In figure 1, we show the likelihood based on twelve points from a N(0, 1) density. From the plot a prior log b∗ ∼ N(0.65, 0.125²) would seem appropriate, though we prefer to allow for more prior uncertainty, and so we set log b∗ ∼ N(0.65, 0.25²).
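The exercise behind figure 1 can be sketched as follows. The paper does not spell out the exact construction, so this is our rough reconstruction: we observe twelve points on a N(0, 1) density, take a deliberately imperfect prior mean so that the residual is non-zero, and profile out σ² before examining the likelihood as a function of log(b/v).

```python
import numpy as np
from scipy.stats import norm

# "Data": twelve points on a N(0,1) density; the prior mean g is made
# deliberately imperfect so that the profile likelihood is non-degenerate
# (an assumption of this sketch, not stated in the paper).
theta = np.linspace(-2.75, 2.75, 12)
y = norm.pdf(theta)
g = norm.pdf(theta, loc=0.1, scale=1.05)

def profile_loglik(log_bstar, v=1.0):
    """Log-likelihood of log(b/v) with sigma^2 profiled out."""
    b = np.exp(log_bstar) * v
    C = np.exp(-0.5 * (theta[:, None] - theta[None, :])**2 / b)
    K = np.outer(g, g) * C + 1e-10 * np.eye(len(theta))  # jitter: K can be ill-conditioned
    r = y - g
    n = len(y)
    _, logdet = np.linalg.slogdet(K)
    s2_hat = r @ np.linalg.solve(K, r) / n
    return -0.5 * logdet - 0.5 * n * np.log(s2_hat)

for lb in [0.3, 0.5, 0.65, 0.8, 1.0]:
    print(lb, round(profile_loglik(lb), 2))
```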

2.3 Alternative models

One obvious problem with the Gaussian process is that it allows f(θ) to be negative, whereas of course it must be positive for all θ. We will show that, by eliciting enough information from the expert, although there will always be some probability that f(θ) is negative at any θ, this probability will be trivial at values of θ of interest. However, an appreciable probability of the function being negative in the tails of the expert's distribution will be unavoidable. Consequently, using this model we cannot accurately report the analyst's uncertainty about the tails of the expert's density function. When using f(θ) to derive any subsequent inferences or decisions, we should therefore investigate the effect of non-positivity in the tails.

Figure 1: Likelihood function for log(b/v) based on a substantial quantity of data from a N(0, v) density function.

It is also known that $\int_{-\infty}^{\infty} f(\theta)\,d\theta = 1$, and although equation (1) ensures that this holds for the prior expectation of f(θ), it will not hold for realisations of the Gaussian process. We apply this constraint as part of the data in Section 3. Effectively it is applied also to the prior distribution, and this implicitly adds a further component of correlation to (3).

We have described a model for the unknown density function in which f(.) is allowed to deviate nonparametrically from a parametric density function g(. | u). This is a common theme in Bayesian nonparametric inference; see for example Leonard (1978), who modelled the log of the unknown density function as a Gaussian process, and more recently Hjort (1996) and Walker et al. (1999). In general these alternative modelling approaches guarantee a positive density function, but they were developed for the problem of density estimation, where inference is required for an unknown distribution based on a sample of observations from that distribution. The elicitation context is quite different, and our data comprise elicited summaries of the distribution, such as quantiles. These alternative models would become completely intractable for such data. For instance, if ln f(θ) has a Gaussian process prior then we cannot even write down the likelihood for data which comprise integrals of f(θ).

West (1988) presents a prior distribution for an expert's quantile function Q(U) for U ∈ [0, 1]. His formulation leads to a Dirichlet model for the expert's distribution. As in Lindley et al. (1979), this is in the context where the analyst is the decision-maker, and so it is the analyst's beliefs about θ that are of interest, whereas we are interested in the expert's beliefs. Note that a possibility here would have been to model the quantile function as a Gaussian process; this would solve one problem, in that Q(U) is allowed to be negative, but create a new one, in that Q(U) must be nondecreasing.

3 Prior to posterior updating

Our model for the analyst’s prior distribution is designed to handle the kinds of information commonly elicited from an expert. This includes quantiles of the distribution and simple moments. Since f (θ) has a Gaussian process distribution conditional on the hyperparameters α, the distribution of any linear functional of f (θ) (which includes moments and other expectations such as the probabilities that define given quantiles) is normal. Formally, let the data comprise a vector D of elicited summaries of this form. Remember that D will also include the information that

R∞ −∞

f (θ)dθ = 1. Then the

joint distribution of D and any finite set of points on the function f (θ) is multivariate normal. Denote the mean vector and variance-covariance matrix of D by H and σ 2 A respectively, and denote the covariance between D and f (θ) by σ 2 t(θ). Note that H will be a function of m and v, while A and t(θ) will be functions of m, v and b∗ . The forms of these quantities for some common summaries are given explicitly in the Appendix. It then follows immediately from properties of the multivariate normal distribution that f (θ)|D, m, v, b, σ 2 also has a normal distribution with E{f (θ) | α} = g(θ) + t(θ)T A−1 {D − H}, Cov{f (θ), f (φ) | α} = σ 2 {g(θ | u)g(φ | u)c(θ, φ) − t(θ)T A−1 t(φ)},

(6) (7)

In fact, conditional on α, the analyst's posterior distribution of f(θ) is again a Gaussian process, with these as its mean and variance-covariance functions.
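Given the building blocks H, A and t(.) (whose forms for particular summaries are in the Appendix), the conditional posterior moments (6)-(7) are plain linear algebra. A hedged sketch, with those inputs assumed precomputed and the argument names our own:

```python
import numpy as np

def posterior_moments(g_vec, T, A, D, H, sigma2, C):
    """Conditional posterior mean (6) and covariance (7) of f on a grid.

    g_vec : prior mean g(theta | u) at the grid points
    T     : matrix whose i-th row is t(theta_i), the covariance (scaled by
            1/sigma^2) between f(theta_i) and the summaries D
    A, H  : prior variance (scaled by 1/sigma^2) and mean of D
    C     : prior correlation matrix c(theta_i, theta_j) on the grid
    """
    resid = np.linalg.solve(A, D - H)
    mean = g_vec + T @ resid                                   # equation (6)
    cov = sigma2 * (np.outer(g_vec, g_vec) * C
                    - T @ np.linalg.solve(A, T.T))             # equation (7)
    return mean, cov
```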

3.1 Removing the conditioning on α

The posterior distribution of α = (m, v, σ², b∗) is easily found from the multivariate normal likelihood for D given α:

$$p(m, v, \sigma^2, b^* \mid D) \propto v^{-1} p(b^*)\, |A|^{-1/2} \sigma^{-(n+2)} \exp\left\{-\frac{1}{2\sigma^2}(D - H)^T A^{-1}(D - H)\right\}, \qquad (8)$$

where n denotes the number of elements in D. The posterior marginal distribution of f(θ) is then the result of integrating its Gaussian process conditional distribution with respect to this marginal posterior of the hyperparameters. The conditioning on σ² can be removed analytically to obtain

$$p(m, v, b^* \mid D) \propto v^{-1} p(b^*)\, (\hat{\sigma}^2)^{-n/2} |A|^{-1/2}, \qquad (9)$$

where

$$\hat{\sigma}^2 = \frac{1}{n - 2}(D - H)^T A^{-1}(D - H). \qquad (10)$$

We can then use MCMC to obtain a sample of values of (m, v, b∗) from their joint posterior. Finally, given m, v and b∗, we can sample a density function f(.) at a finite number of values of θ from the posterior p{f(.) | D, m, v, b∗}. (The joint posterior distribution of {f(θ_1), ..., f(θ_k)} conditional on D, m, v and b∗ is multivariate t, for any set {θ_1, ..., θ_k}.) Repeating this will give us a sample of functions from the posterior p{f(.) | D}, and estimates and pointwise bounds for f(θ) can then be reported.
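A sketch of the final sampling step follows. We read (9)-(10) as implying a multivariate t with n − 2 degrees of freedom for the function values; that degrees-of-freedom choice is our inference from the form of σ̂², not something stated in the text.

```python
import numpy as np

def sample_density(mean, scale_cov, dof, rng):
    """One draw of {f(theta_1), ..., f(theta_k)} from its multivariate t
    conditional posterior, via the usual normal / chi-squared mixture.
    `mean` and `scale_cov` are the conditional moments (6)-(7) with sigma^2
    replaced by sigma_hat^2 from (10); dof = n - 2 is our reading of (9)-(10).
    """
    k = len(mean)
    L = np.linalg.cholesky(scale_cov + 1e-12 * np.eye(k))  # jitter for stability
    z = rng.standard_normal(k)
    w = rng.chisquare(dof)
    return mean + (L @ z) * np.sqrt(dof / w)
```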

4 Examples

We illustrate the method with two synthetic examples.

4.1 Example 1: bimodal variable

Suppose that the expert has the following density function for θ:

$$f(\theta) = \frac{0.4}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(\theta + 2)^2\right\} + \frac{0.6}{\sqrt{4\pi}} \exp\left\{-\frac{1}{4}(\theta - 1)^2\right\}. \qquad (11)$$

It is further assumed that the expert can state P(θ < x) for any x. The expert is asked to give probabilities for the following x: {−3, −2, −1, 0, 1, 2, 3}. We do not ask for the mean in this example, though we assume that the expert has given us P(θ < ∞) = 1. Weak priors are assumed for m, v and σ² as in (5). The following proposal distributions are chosen for the Metropolis-Hastings sampler:

$$m_t \mid m_{t-1} \sim N(m_{t-1}, 0.01), \qquad (12)$$

$$\ln v_t \mid m_{t-1}, m_t, v_{t-1} \sim N\left(\ln v_{t-1},\ 0.1 \times (1 + |m_t - m_{t-1}|/0.2)\right), \qquad (13)$$

$$\ln b^*_t \mid b^*_{t-1} \sim N(\ln b^*_{t-1}, 0.01), \qquad (14)$$

so that large jumps in m_t are more likely to be accompanied by large jumps in v_t. We find the posterior mode of (m, v, b∗), and then choose starting values a reasonable distance away from the mode. The chain is run for 20,000 iterations, and the first 1000 iterations are discarded. For each sampled value of (m, v, b∗), we generate one random density function. We then plot the pointwise mean, 2.5th and 97.5th percentiles from the distribution of the density function. The MCMC output is shown in figure 3; a sketch of such a sampler is given after figure 2.

Figure 2: The mean and pointwise 95% intervals for the expert's density function (solid lines), and the true density function (dotted line).
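A minimal sketch of the random-walk Metropolis-Hastings sampler with proposals (12)-(14). The function log_post, which evaluates the log of (9) in the (m, ln v, ln b∗) parameterisation (i.e. including the Jacobian terms ln v and ln b∗), is assumed to be available; its construction follows the Appendix.

```python
import numpy as np

def mh_sampler(log_post, m0, v0, bstar0, n_iter=20000, burn=1000, rng=None):
    """Random-walk Metropolis-Hastings for (m, ln v, ln b*) using the proposal
    distributions (12)-(14).  All three proposals are symmetric (the scale in
    (13) depends only on |m_t - m_{t-1}|), so no Hastings correction is needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, lv, lb = m0, np.log(v0), np.log(bstar0)
    lp = log_post(m, lv, lb)
    draws = []
    for _ in range(n_iter):
        m_new = rng.normal(m, np.sqrt(0.01))                                # (12)
        lv_new = rng.normal(lv, np.sqrt(0.1 * (1 + abs(m_new - m) / 0.2)))  # (13)
        lb_new = rng.normal(lb, np.sqrt(0.01))                              # (14)
        lp_new = log_post(m_new, lv_new, lb_new)
        if np.log(rng.uniform()) < lp_new - lp:   # accept/reject
            m, lv, lb, lp = m_new, lv_new, lb_new, lp_new
        draws.append((m, np.exp(lv), np.exp(lb)))
    return np.array(draws[burn:])
```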

We can see that in the tails there is significant probability that f(θ) is negative. If we were to provide a sample of prior densities to be used in some analysis, we would not want to provide negative density functions. One option is to reject any generated density functions in the sample that are negative for any value of θ in some finite range; a sketch of this step is given below. We choose the interval for θ by finding where the pointwise median approaches zero, so in this case we would look at f(θ) for θ ∈ (−6, 6). After rejecting these functions, we re-compute the pointwise mean, 2.5th and 97.5th percentiles. These are plotted in figure 4.

Figure 3: MCMC output and autocorrelation plots, example 1.

The rejection procedure has had little effect, with the obvious exception of the tails. Consequently, we cannot claim to accurately measure our uncertainty about f(θ) in the tails.
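The rejection step can be sketched as follows, assuming the sampled functions are stored as rows of an array evaluated on a grid spanning (−6, 6):

```python
import numpy as np

def reject_negative(samples):
    """Keep only sampled density functions (rows, evaluated on a grid covering
    the interval of interest) that are non-negative everywhere on the grid."""
    return samples[(samples >= 0).all(axis=1)]

def pointwise_summary(samples):
    """Pointwise mean and 95% interval, re-computed after rejection."""
    return (samples.mean(axis=0),
            np.percentile(samples, 2.5, axis=0),
            np.percentile(samples, 97.5, axis=0))
```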

4.2 Feedback

The analyst can show the expert his posterior distribution of her density function, although, with the exception of features such as the number of modes and skewness, she may not be able to assess whether or not the analyst has a good representation of her beliefs. However, if the analyst presents his posterior distribution of her distribution function, she can then see whether values of P(θ < x) for new values of x suggested by his posterior match her own beliefs. We plot the mean and pointwise 95% interval for the distribution function in figure 5.

Figure 4: The mean and pointwise 95% intervals for the expert's density function (solid lines) using positive sampled density functions only, and the true density function (dotted line).

4.3 Example 2: strictly positive variables

The choice of a normal density for the prior mean function will not be appropriate when θ is known to be positive. We could replace the normal density with some other density that only gives support for θ > 0, but we would lose the computational convenience of the normal density. Hence we instead transform θ to the log scale, and then model the density function of log θ as a Gaussian process, with mean given by a normal density function. The same prior as before is used for b∗. It is then straightforward to transform back to the original scale when reporting the posterior distribution of f(θ); a sketch of this back-transformation is given after figure 5.

We illustrate this by fitting the model to percentiles from a Gamma(1,1) density function. We are given P(θ ≤ x) for x = 1, 2, 3, 4, 5. We use the priors for m, v and b∗ as in the previous example. We again obtain a sample of 20,000 density functions using MCMC, with the same proposal distributions, and again reject any generated density functions that are negative for values of θ of interest. In figures 6 and 7 we show the posterior distributions of the density and distribution functions. The MCMC output is shown in figure 8.

Figure 5: The mean and pointwise 95% intervals for the expert's distribution function (solid lines), and the true distribution function (dotted line).
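A sketch of the back-transformation to the original scale: if ψ = log θ has density f_ψ, then f(θ) = f_ψ(log θ)/θ by the change-of-variables formula. The grid names below are our own.

```python
import numpy as np

def to_original_scale(psi_grid, f_psi, theta_grid):
    """Convert a sampled density for psi = log(theta), tabulated on the
    (increasing) grid `psi_grid`, to the theta scale:
    f_theta(theta) = f_psi(log theta) / theta."""
    return np.interp(np.log(theta_grid), psi_grid, f_psi) / theta_grid
```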

5 Conclusions

We have presented a means of eliciting an individual's prior density function f(.) that both avoids having to assume a parametric form for f(.), and allows us to measure our uncertainty about f(.) given a limited number of judgements from the individual.

Figure 6: The mean and pointwise 95% intervals for the expert's density function (solid lines), and the true density function (dotted line).

In the two examples considered, we could provide posterior estimates of f(.) that were very close to the true function. In addition, the pointwise intervals for f(θ) showed us that we had a sufficient number of judgements to represent the expert's beliefs about θ accurately. A number of potential practical advantages of this approach offer directions for its further development. In practice, elicitation is a dialogue between the expert and the analyst. For instance, after eliciting a set of summaries the analyst can consider his posterior uncertainty about the expert's density function before offering a final inference. He may, for example, choose to elicit more summaries if his posterior uncertainty is too large. Our model may then help to identify which additional summaries would be most informative.

Figure 7: The mean and pointwise 95% intervals for the expert's distribution function (solid lines), and the true distribution function (dotted line).

The dialogue can also be used to check the validity of the analyst's prior model for f(θ). The analyst believes that f(θ) is likely to be smooth and reasonably close to an unknown member of some parametric family. In particular, he may model it as approximating to a normal distribution in shape, so that it is likely to be fairly symmetric and unimodal. The second example instead uses a lognormal distribution to account for a situation where θ must be positive and a skewed distribution is expected. Notice first that this is already far weaker than the assumption, implied by the usual fitting approach to elicitation, that f(θ) is exactly a member of the assumed parametric family. Furthermore, given enough summaries from the expert, this prior information will have negligible impact.

Figure 8: MCMC output and autocorrelation plots, example 2.

For instance, if the expert's true density is highly skewed or multimodal then, even if the analyst's prior is based on a normal g(θ | u), this skewness or multimodality will be reflected in the analyst's posterior, given sufficient data. This is a significant benefit of our nonparametric formulation. At an earlier stage of the elicitation, when the analyst's prior is still influential, feedback should allow the expert an opportunity to disagree with its implications. This will help to ensure that adequate summaries are elicited to reach an accurate posterior distribution for f(θ). At the end of the process, there will inevitably be some remaining posterior uncertainty about f(θ), but the fact that this is expressed formally in the analyst's posterior distribution makes it easier to check the sensitivity of any inferences or decisions that may subsequently be based on the elicited f(θ). We could consider using a sample of density functions drawn from the analyst's posterior, rather than a single estimate, when making those inferences or decisions, to see whether or not this remaining uncertainty is important.

Appendix A: Likelihood for the expert's summaries

In Section 3 it is stated that the data D comprising the expert’s summaries are normally distributed. We give here details of the means, variances and covariances for summaries of particular forms when g(θ | u) is the normal density with mean m and variance v. We also discuss some numerical difficulties that can arise when evaluating the posterior density of m, v and b∗ .

A.1 Expectation of θ

We define $\mu = \int_{-\infty}^{\infty} \theta f(\theta)\, d\theta$. Then under the Gaussian process prior, µ is normally distributed with

$$E_f(\mu \mid m, v, b^*, \sigma^2) = m, \qquad (15)$$

$$\mathrm{Cov}_f\{f(\theta), \mu \mid m, v, b^*, \sigma^2\} = \sigma^2 g(\theta) \int_{-\infty}^{\infty} \phi\, c(\theta, \phi)\, g(\phi)\, d\phi = \sigma^2 g(\theta) \exp\left\{-\frac{(\theta - m)^2}{2v(b^* + 1)}\right\} \frac{\sqrt{b^*}\,(\theta + m b^*)}{(1 + b^*)^{3/2}}, \qquad (16)$$

$$\mathrm{Var}_f(\mu \mid m, v, b^*, \sigma^2) = \sigma^2 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \theta\, \phi\, c(\theta, \phi)\, g(\theta)\, g(\phi)\, d\theta\, d\phi = \sigma^2\, \frac{\sqrt{b^*}\,(2m^2 + b^* m^2 + v)}{(2 + b^*)^{3/2}}. \qquad (17)$$
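As a sanity check on the closed form (17), one can compare it with direct numerical integration; the parameter values below are arbitrary choices of ours.

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

m, v, bstar = 0.5, 2.0, 1.5   # arbitrary test values

def integrand(phi, theta):
    c = np.exp(-0.5 * (theta - phi)**2 / (v * bstar))
    return theta * phi * c * norm.pdf(theta, m, np.sqrt(v)) * norm.pdf(phi, m, np.sqrt(v))

# The integrand is negligible outside +/- 10 standard deviations of m.
lo, hi = m - 10 * np.sqrt(v), m + 10 * np.sqrt(v)
numerical, _ = dblquad(integrand, lo, hi, lo, hi)
closed_form = np.sqrt(bstar) * (2 * m**2 + bstar * m**2 + v) / (2 + bstar)**1.5
print(numerical, closed_form)   # the two should agree to quadrature accuracy
```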

A.2 Percentiles of θ

A value x is chosen, and the expert gives her probability P(θ ≤ x). We write $P_x = \int_{-\infty}^{x} f(\theta)\, d\theta$. This also has a normal distribution, with

$$E(P_x \mid m, v, b^*) = \Phi\left(\frac{x - m}{\sqrt{v}}\right), \qquad (18)$$

$$\mathrm{Cov}(P_x, f(\theta) \mid m, v, b^*, \sigma^2) = \sigma^2 g(\theta) \int_{-\infty}^{x} c(\theta, \phi)\, g(\phi)\, d\phi = \sigma^2 g(\theta) \left(\frac{b^*}{1 + b^*}\right)^{1/2} \exp\left\{-\frac{(\theta - m)^2}{2v(b^* + 1)}\right\} \Phi\left\{\left(x - \frac{\theta + m b^*}{b^* + 1}\right) \sqrt{\frac{1 + b^*}{v b^*}}\right\}, \qquad (19)$$

where Φ(.) denotes the c.d.f. of the standard normal distribution. Note that the use of the normal density g(θ) here has ensured that the prior expectations of µ and P_x conditional on the various hyperparameters are finite. If we simply modelled f(.) as a Gaussian process with a constant mean, these prior expectations would be infinite. The covariance between any two percentiles is given by

$$\mathrm{Cov}(P_x, P_y \mid m, v, b^*, \sigma^2) = \sigma^2 \int_{-\infty}^{x}\int_{-\infty}^{y} g(\theta)\, g(\phi)\, c(\theta, \phi)\, d\theta\, d\phi = \frac{\sigma^2}{2\pi v} \int_{-\infty}^{x}\int_{-\infty}^{y} \exp\left\{-\frac{1}{2vb^*}(\theta - \phi)^2 - \frac{1}{2v}(\theta - m)^2 - \frac{1}{2v}(\phi - m)^2\right\} d\theta\, d\phi = \frac{|S|^{1/2}\sigma^2}{2\pi v} \int_{-\infty}^{x}\int_{-\infty}^{y} |S|^{-1/2} \exp\left\{-\frac{1}{2}(\theta - m, \phi - m)\, S^{-1}\, (\theta - m, \phi - m)^T\right\} d\theta\, d\phi, \qquad (20)$$

where

$$S^{-1} = \begin{pmatrix} \dfrac{1+b^*}{vb^*} & -\dfrac{1}{vb^*} \\ -\dfrac{1}{vb^*} & \dfrac{1+b^*}{vb^*} \end{pmatrix}. \qquad (21)$$

Making the transformation

$$z_1 = \sqrt{\frac{b^* + 2}{v(b^* + 1)}}\,(\theta - m), \qquad (22)$$

$$z_2 = \sqrt{\frac{b^* + 1}{b^* v}}\,(\phi - m) - \sqrt{\frac{1}{b^* v (b^* + 1)}}\,(\theta - m), \qquad (23)$$

we have

$$\mathrm{Cov}(P_x, P_y \mid m, v, b^*, \sigma^2) = \frac{\sigma^2}{2\pi}\sqrt{\frac{b}{b + 2v}} \int_{-\infty}^{l_1}\int_{-\infty}^{l_2} \exp\left\{-\frac{1}{2}(z_1^2 + z_2^2)\right\} dz_2\, dz_1 \qquad (24)$$

$$= \frac{\sigma^2}{\sqrt{2\pi}}\sqrt{\frac{b}{b + 2v}} \int_{-\infty}^{l_1} \exp\left\{-\frac{1}{2}z_1^2\right\} \Phi(l_2)\, dz_1, \qquad (25)$$

where

$$l_1 = \sqrt{\frac{b^* + 2}{v(b^* + 1)}}\,(x - m), \qquad (26)$$

$$l_2 = \sqrt{\frac{b^* + 1}{b^* v}}\,(y - m) - \frac{z_1}{\sqrt{b^*(b^* + 2)}}, \qquad (27)$$

(noting that l_2 is a function of z_1). We also need the covariance between percentiles and the mean. This is given by

$$\mathrm{Cov}(P_y, \mu \mid m, v, b^*, \sigma^2) = \frac{\sigma^2}{2\pi}\sqrt{\frac{b^*}{b^* + 2}} \int_{-\infty}^{\infty}\int_{-\infty}^{l_2} \left\{\sqrt{\frac{v(b^* + 1)}{b^* + 2}}\, z_1 + m\right\} \exp\left\{-\frac{1}{2}(z_1^2 + z_2^2)\right\} dz_2\, dz_1. \qquad (28)$$
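Expression (25) reduces the percentile covariance to a one-dimensional integral, which is cheap to evaluate numerically; a sketch, with arbitrary illustrative arguments of our own:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def cov_percentiles(x, y, m, v, bstar, sigma2=1.0):
    """Cov(P_x, P_y | m, v, b*, sigma^2) via the single integral (25)."""
    b = bstar * v
    l1 = np.sqrt((bstar + 2) / (v * (bstar + 1))) * (x - m)

    def integrand(z1):
        l2 = (np.sqrt((bstar + 1) / (bstar * v)) * (y - m)
              - z1 / np.sqrt(bstar * (bstar + 2)))     # l2 as in (27)
        return np.exp(-0.5 * z1**2) * norm.cdf(l2)

    val, _ = quad(integrand, -np.inf, l1)
    return sigma2 / np.sqrt(2 * np.pi) * np.sqrt(b / (b + 2 * v)) * val

print(cov_percentiles(0.0, 1.0, m=0.0, v=1.0, bstar=2.0))  # illustrative values
```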

A.3 Numerical issues

In evaluating the posterior density (up to proportionality), numerical problems can arise from the need to compute the determinant of, and invert, the matrix A, which can be ill-conditioned for certain values of m, v and b. Firstly, we can reduce the condition number of A by converting the data from the form P(θ ≤ x_i) to P(x_i < θ ≤ x_j); a sketch of this conversion follows. The variance matrix that results is the original matrix with the first n − 1 columns subtracted from the last n − 1 columns.
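A sketch of this differencing transform; applying the same linear map to D, to H, and (on both sides) to A preserves the model while improving the conditioning. The function name is ours.

```python
import numpy as np

def difference_data(D, H, A):
    """Map cumulative probabilities P(theta <= x_i) to interval probabilities
    P(x_{i-1} < theta <= x_i) by differencing; the same linear map B is applied
    to the mean H and, on both sides, to the variance matrix A."""
    n = len(D)
    B = np.eye(n)
    B[1:, :-1] -= np.eye(n - 1)     # row i becomes e_i - e_{i-1}
    return B @ D, B @ H, B @ A @ B.T

# Example: cumulative to interval probabilities.
D = np.array([0.1, 0.4, 0.8, 1.0])
print(difference_data(D, D, np.eye(4))[0])   # -> [0.1, 0.3, 0.4, 0.2]
```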

A.3.1 Large values of b∗v

As the product b∗v → ∞, c(θ, φ) → 1 for all θ ≠ φ. Then the covariance between any two percentiles P_x and P_y will tend to the product Φ{(x − m)/√v}Φ{(y − m)/√v}, the variance of µ will tend to m², and the covariance between µ and P_x will tend to mΦ{(x − m)/√v}. Thus the (i, j)-th element of A tends to the product of the i-th and j-th elements of the same vector, and this will result in a matrix A with determinant zero. Small values of b∗v can also cause problems. As b∗v → 0, all elements of A tend to zero. This can be seen by noting that for small b∗v, the integrands in expressions (17), (25) and (28) all tend to zero.

A.3.2 Large values of |m| and small values of v

In both cases, this can result in g(θ | u) tending to zero for all values of θ sufficiently far from m. For certain percentiles P_x, this will result in the corresponding row and column in A tending to a vector of zeros, again giving a matrix A with determinant zero.

References

French, S. (1980). Updating of belief in the light of someone else's opinion. Journal of the Royal Statistical Society, Series A, 143: 43–48.

Garthwaite, P. H. and Al-Awahdi, S. A. (2002). Non-conjugate prior distribution assessment for multivariate normal sampling. Journal of the Royal Statistical Society, Series B, 50: 462–474.

Hjort, N. L. (1996). Bayesian approaches to non- and semiparametric density estimation. In Bayesian Statistics 5 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), pp. 223–254. Oxford University Press.

Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S. and Peters, S. C. (1980). Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 75: 845–854.

Kadane, J. B. and Wolfson, L. J. (1998). Experiences in elicitation (with discussion). The Statistician, 47: 3–19.

Kennedy, M. C. and O'Hagan, A. (1996). Iterative rescaling for Bayesian quadrature. In Bayesian Statistics 5 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), pp. 639–645. Oxford University Press.

Leonard, T. (1978). Density estimation, stochastic processes and prior information (with discussion). Journal of the Royal Statistical Society, Series B, 40: 113–146.

Lindley, D. V. (1982). The improvement of probability judgments. Journal of the Royal Statistical Society, Series A, 148: 117–126.

Lindley, D. V. and Singpurwalla, N. D. (1986). Reliability (and fault tree) analysis using expert opinions. Journal of the American Statistical Association, 81: 87–90.

Lindley, D. V., Tversky, A. and Brown, R. V. (1979). On the reconciliation of probability assessments (with discussion). Journal of the Royal Statistical Society, Series A, 142: 146–180.

Neal, R. (1999). Regression and classification using Gaussian process priors. In Bayesian Statistics 6 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), pp. 69–95. Oxford University Press.

Oakley, J. E. (2002). Eliciting Gaussian process priors for complex computer codes. The Statistician, 51: 81–97.

O'Hagan, A. (1978). Curve fitting and optimal design for prediction (with discussion). Journal of the Royal Statistical Society, Series B, 40: 1–42.

O'Hagan, A. (1992). Some Bayesian numerical analysis. In Bayesian Statistics 4 (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), pp. 345–363. Oxford University Press.

O'Hagan, A. (1998). Eliciting expert beliefs in substantial practical applications (with discussion). The Statistician, 47: 21–35.

Walker, S. G., Damien, P., Laud, P. W. and Smith, A. F. M. (1999). Bayesian nonparametric inference for random distributions and related functions (with discussion). Journal of the Royal Statistical Society, Series B, 61: 485–527.

West, M. (1988). Modelling expert opinion (with discussion). In Bayesian Statistics 3 (eds J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith), pp. 493–508. Oxford University Press.