A Study of Expert Overconfidence

Shi-Woei Lin and Vicki M. Bier

Shi-Woei Lin is with the Department of Business Administration, Yuan Ze University, Chung-Li, 320, Taiwan. Phone: 886-3-463-8800 Ext. 2625; email: [email protected]

Vicki M. Bier is with the Department of Industrial Engineering, University of Wisconsin – Madison, Madison, WI 53706, USA. Phone: (608) 262-2064; email: [email protected]

Abstract: Overconfidence is one of the most common (and potentially severe) problems in expert judgment. To assess the extent of expert overconfidence, we analyzed a large data set on expert opinion compiled by Cooke and colleagues at the Technical University of Delft and elsewhere. This data set contains roughly five thousand 90% confidence intervals of uncertain quantities for which the true values are now known. Our analysis assesses the overall extent of overconfidence in the data set. Significant differences in the extent of overconfidence were found among studies, among experts, and among questions within a study. Moreover, replications (multiple realizations for the same question) allowed a preliminary assessment of whether the question effect is due largely to question difficulty, or merely to random noise in the realizations of the uncertain quantities. The results of this analysis suggest that much of the apparent question effect may be due to noise rather than systematic differences in the difficulty of achieving good calibration for different questions. The results support the differential weighting of experts, since there are significant differences in expert calibration within studies.

Index Terms: expert opinion, calibration, overconfidence, expert aggregation
I. Introduction
Expert opinion consists of the judgments of one or more experts regarding some particular subject. It is widely used in military intelligence, probabilistic risk analysis, and other fields in which empirical data are typically sparse or difficult to obtain. Expert
opinion can also be used to provide estimates regarding new, rare, complex, or poorly understood problems or phenomena. Cooke (1991a) has reviewed many of the developments in and uses of expert judgment over the years. In subjective Bayesian inference, elicitation of expert opinion to represent a prior belief about an unknown quantity of interest is a critical step. In risk analysis, people likewise often rely on subjective probability assessments provided by experts, since component failure rate data are generally sparse. According to Walley (1991), issues related to the elicitation and use of interval-valued probabilities have also received an increasing amount of attention.

The term “calibration” is used by cognitive psychologists to discuss how closely human cognition regarding probability judgments corresponds to the ideal (Lichtenstein and Fischhoff, 1977). In other words, it is a measure of agreement between an expert’s assessed probabilities and the corresponding observed relative frequencies. For 90% probability intervals, we would expect, in the long run, that about 90% of the probability intervals provided by experts should include the true values of the corresponding quantities. However, in a study by Russo and Schoemaker (1992), when business managers were asked to provide 90% confidence intervals for unknown quantities of interest, only about 40% to 60% of the intervals contained the true value.

Other studies concerning experts’ ability to make probabilistic assessments have also often reported evidence of poor calibration; see, for example, Cooke (1991a) and Shlyakhter et al. (1994). In particular, the confidence intervals estimated by experts are typically too narrow. This phenomenon is known as expert overconfidence. Similar results are discussed by Lichtenstein and Fischhoff (1977), who find that lay people were on average 65 to 70 percent confident that their answers were right, when they were actually correct only about 50% of the time. Moreover, Lichtenstein and Fischhoff (1977) note that subjects with greater expertise were not in general less overconfident than
others, and that “overconfidence is most extreme with tasks of great difficulty.” Lichtenstein et al. (1982), Morgan and Henrion (1990), and Bier (2004) summarize the results from numerous calibration studies, and conclude that both lay people and experts are systematically overconfident about the accuracy of their judgments. For example, the “surprise index” (the fraction of true values falling outside stated 98% probability intervals) is typically in the range of 20% to 45%, when ideal calibration would yield a surprise index of only 2%.

Expert overconfidence has also been studied for confidence intervals in published scientific articles and technical reports. For example, in a study of “historical measurements and recommended values for the fundamental physical constants” (speed of light, gravitational constant, particle masses, etc.), Henrion and Fischhoff (1986) find values of the surprise index ranging from 0% to 57%, with most values falling between 5% and 15%, but several studies yielding surprise indices well over 25%. Interestingly, the surprise index of 57% (which is high even relative to values observed for lay people) is based on “‘recommended values’ for physical constants [proposed] after careful consideration of all the most precise measurements” from 1952 to 1973. Since such reviews are expected to take into account factors such as “inconsistencies among studies” and possible systematic error in measuring devices and experimental protocols, the finding of a large surprise index in these estimates is particularly disturbing.

Recent research has cast doubt on whether overconfidence represents a stable psychological phenomenon, or is primarily an artifact of the study designs used to measure overconfidence; see for example Klayman et al. (1999) and Juslin et al. (2000). Two major lines of argument have been put forward. First, some have suggested that overconfidence is a function of question selection, and that calibration researchers have typically chosen non-representative questions (Gigerenzer et al., 1991; Juslin et al., 1999, 2000; Klayman
et al., 1999). Second, researchers have proposed models that attempt to explain observed overconfidence as an artifact resulting from unbiased errors in estimation or “unsystematic imperfections in judgment” (Klayman et al., 1999), rather than a psychological bias (Budescu et al., 1997a,b; Erev et al., 1994; Juslin et al., 1999, 2000; Klayman et al., 1999; Soll and Klayman, 2004).

However, these arguments do not reduce the importance of overconfidence as a factor to consider in practical applications of expert judgment in decision analyses and risk assessments (as discussed, for example, by Cooke (1991a) and Vick (2002)), for several reasons. First, experts are likely to be consulted precisely for the most difficult and speculative questions, for which other data sources are not available. Thus, while studying performance on questions that are randomly sampled from a particular domain may help in understanding overconfidence as a general psychological phenomenon, such data sets will not necessarily be representative of the types of questions for which we wish to use expert judgment in applied studies. In any case, random sampling of questions does not eliminate or even substantially reduce overconfidence in interval estimates (Soll and Klayman, 2004). Second, the idea that overconfidence may be due in part to unbiased errors in estimation explains a larger fraction of the observed overconfidence in binary choice questions (e.g., “What is the probability that country A has a larger population than country B?”) than in interval estimation tasks (e.g., “Please provide a 90% confidence interval for the population of country A”) (Juslin et al., 1999; Klayman et al., 1999; Soll and Klayman, 2004). (In fact, psychologists also have a better understanding of overconfidence in binary questions than in the types of interval estimates that are often used in decision analyses and risk assessments.) Finally, regardless of the reasons for the observed overconfidence, the fact remains that stated confidence intervals do tend to be overly narrow (Soll and Klayman, 2004).
In fact, overconfidence has been described as one of the most common (and potentially severe) problems in expert judgment. Plous (1993) even states that “no problem in judgment and decision making is more prevalent and more potentially catastrophic than overconfidence.” Moreover, modest levels of calibration feedback and exhortations to “Spread out those distributions!” unfortunately seem to have little effect (Plous, 1993). For example, a study by Alpert and Raiffa (1982) found that after one round of feedback, “The percentage of times the true values fell outside the extreme value (i.e., the .01 and .99 ranges) fell from a shocking 41% to a depressing 23%.”

In this article, we analyze a large data set on expert opinion (compiled by Cooke and colleagues at the Technical University of Delft and elsewhere), which contains roughly five thousand 90% confidence intervals of uncertain quantities (“seed variables”) for which the true values are now known. This data set is uniquely relevant to the study of expert overconfidence in real-world decision analysis and risk assessments, since most of the expert opinion elicitations were performed in the course of applied research in the experts’ domains of expertise, not primarily in laboratory studies of overconfidence. In all cases, the respondents were experienced professionals giving judgments on important real-world problems within their own domains of expertise, not students or lay people answering almanac or general knowledge questions. (Experts were selected either via a formal round-robin protocol, or based on recommendations from the sponsor for each particular project.) Moreover, in all cases, the experts knew that the weights assigned to their judgments on the “study questions” (for which empirical information was lacking) would depend on how well-calibrated their answers were on the seed questions, so that poor calibration could result in their judgments receiving low or even zero weight. This can be expected to motivate experts to achieve good
calibration.

Our analysis assesses the overall extent of overconfidence in the data set, as well as differences among studies, among experts, and among questions within a study. The intent of this study is not to add to the voluminous body of knowledge on overconfidence as a psychological phenomenon (e.g., to shed light on the underlying causes of overconfidence, or to help understand the factors leading some individuals to be better calibrated than others). These issues may be important for both theoretical and practical reasons, but are likely to be better studied in the laboratory than using field data. Rather, the intent is to help address several important practical issues regarding the use of expert judgment.

First, there has been some difference of opinion over whether it is appropriate or desirable to broaden the confidence intervals or probability distributions provided by experts, to help reduce overconfidence (e.g., Apostolakis et al., 1980; Apostolakis, 1982 and 1985; Clemen and Lichtendahl, 2002; Martz, 1984; Shlyakhter et al., 1994; Shlyakhter, 1994). Data on the variability among experts in their extent of overconfidence would help to determine whether such broadening is likely to be necessary, or if well-calibrated judgments can usually be obtained simply by differential weighting of experts (e.g., ignoring or putting less weight on those experts who are most poorly calibrated). Moreover, understanding the relative magnitude of the influences contributing to overconfidence in realistic applications can help to determine how well Cooke’s approach (in which the weights assigned to different experts on the study questions of interest depend on their performance on the seed questions) is likely to perform in practice. In particular, his method can be expected to perform well when there are large differences in the extent of calibration from one expert to another, but relatively small differences in the extent of calibration from one question to another. By contrast, if the differences among experts are small, differential weighting of experts will not provide much benefit. Similarly, if the vari-
ability among questions is too large, performance on a limited number of seed questions may not provide a good characterization of an expert’s overall level of calibration. Large variations in calibration from one question to another would also raise concerns about whether performance on seed questions will be a good predictor of performance on study questions (which of course are likely to differ in systematic ways from the seed questions, for which empirical information is available).

Section II below describes the data used in this article, and provides preliminary graphical analyses of the data set. In Section III, we present the generalized linear mixed model (GLMM) used for statistical analysis of the data set. The actual analysis of the expert calibration data and details on model fitting appear in Section IV, along with some sensitivity analysis. We summarize our findings and conclusions in Section V.

II. The Data

Our data set includes 17 clusters of expert judgment studies performed by Cooke and colleagues at the Technical University of Delft and elsewhere. These include 27 different expert panels addressing a variety of topics. A listing of these studies is given in Table 1. For each elicitation question, experts were asked to provide at least three quantile points (5%, 50%, 95%) representing their uncertainty, and in some cases also the 25% and 75% quantiles, but not actual distribution shapes. Although the instructions varied from one study to another, and in some studies involved a graphical rather than numerical response mode, typical instructions were simply to “Give the 5, 25, 50, 75, and 95% quantiles” for the uncertain quantities of interest. In each study, the seed questions (or seed variables), for which the true values are now known, were selected to resemble as closely as possible the questions of interest that the study was intended to address. The data set includes, in total, 203 experts and 519 seed questions, for a total of 4,562 90% confidence intervals.
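For concreteness, each elicited interval in the data set can be thought of as one record pairing an expert's quantile estimates for a seed variable with one observed realization. The following minimal sketch shows one way such a record might be represented; the field names and example values are our own illustration, not taken from the actual database:

```python
from dataclasses import dataclass

@dataclass
class Elicitation:
    """One expert's assessment of one seed variable, paired with one realization."""
    study: str         # expert panel, e.g. "Dike ring"
    expert: str        # anonymized expert identifier within the study
    question: str      # seed question identifier, e.g. "ZD"
    realization: int   # index of the observed realization (1 if only one exists)
    q05: float         # elicited 5% quantile
    q50: float         # elicited median
    q95: float         # elicited 95% quantile
    true_value: float  # realized value of the seed variable

# Example record (values are illustrative only).
record = Elicitation("Dike ring", "expert_3", "ZD", 2, 1.2, 2.0, 3.1, 2.6)
```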
Table 1. Expert Panels

Expert panel | Description | # experts | # seeds | Reference
Acrylonitrile | Dose response relationships for acrylonitrile | 7 | 10 | Goossens et al. (1992, 1998a)
Ammonia | Dose response relationships for ammonia | 6 | 10 | Goossens et al. (1992, 1998a)
Sulphur trioxide | Dose response relationships for sulphur trioxide | 4 | 7 | Goossens et al. (1992, 1998a)
Option trading | Option trading | 9 | 38 | Van Overbeek (1999)
Real estate | Prime rent for real estate | 4 | 31 | Qing (2002)
Building temperature | Thermal comfort in buildings | 6 | 48 | Wit (2001)
Dike ring | Dike ring safety | 17 | 47 | Frijters et al. (1999)
Movable barriers | Reliability of movable water barriers | 8 | 14 | Van Elst (1997)
River dredging | River channel dredging | 6 | 8 | Willems (1998)
Crane risk (flanges) | Crane risk (irregularities of flange connections) | 10 | 8 | Cooke (1991a)
Crane risk | Crane risk | 8 | 12 | Cooke (1991a)
Rocket propulsion | Propulsion system of rockets | 4 | 13 |
Space debris | Space debris | 7 | 26 | Meima (1990)
Composite materials | Safety analysis of composite materials | 6 | 12 | Offerman (1990)
Atmospheric dispersion | Atmospheric dispersion of radioactive materials | 8 | 23 | Goossens and Harper (1998), Harper et al. (1995)
Atmospheric dispersion (TNO) | Atmospheric dispersion of radioactive materials (pilot study – Netherlands Organization for Applied Scientific Research) | 7 | 36 | Cooke (1994)
Atmospheric dispersion (Delft) | Atmospheric dispersion of radioactive materials (pilot study – Delft panel) | 11 | 36 | Cooke (1991b)
Radiation dosimetry | Internal dosimetry related to radioactive materials | 8 | 55 | Goossens and Harper (1998), Goossens et al. (1998b)
Early health effects | Early health effects related to radioactive materials | 9 | 15 | Goossens and Harper (1998), Haskin et al. (1997)
Wet deposition | Wet deposition of radioactive materials | 7 | 19 | Goossens and Harper (1998), Harper et al. (1995)
Dry deposition | Dry deposition of radioactive materials | 8 | 14 | Goossens and Harper (1998), Harper et al. (1995)
Radioactive deposition (Delft) | Deposition of radioactive materials (pilot study) | 4 | 24 | Cooke (1991b)
Radiation in food | Food chain risk (animal transfer and behavior regarding radioactive materials) | 7 | 8 | Goossens and Harper (1998), Goossens et al. (1997)
Soil transfer | Food chain risk (plant/soil transfer and processes regarding radioactive materials) | 4 | 31 | Goossens and Harper (1998), Goossens et al. (1997)
Groundwater | Groundwater transport | 7 | 10 | Claessens (1990)
Gas pipelines | Failure frequency of underground gas pipelines | 15 | 28 | Cooke and Jager (1998)
Montserrat | Montserrat volcano risk | 11 | 8 | Aspinall (1996)
In our study, we consider only the 5% and 95% quantile estimates provided by the experts (since most studies did not include 25% and 75% quantile estimates). Since the true values for the seed questions are now known, a score of 1 is assigned to those 90% probability intervals that contain the corresponding true values, and a score of 0 to all other intervals. If an expert is well-calibrated, his or her average calibration score should be close to 0.9 (provided that there are enough seed variables to give a reasonably stable estimate of calibration). Our analysis assesses the overall extent of overconfidence in the data set, as well as differences among studies, among experts, and among questions within a study.

A. Preliminary Graphical Analysis

A plot of the average calibration scores for all studies is shown in Figure 1. This figure shows that average calibration scores vary a great deal between studies, with calibrations ranging from 0.3 to more than 0.8 (close to perfect calibration).
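The scoring rule described above is straightforward to implement. The following sketch (our own illustration, with made-up records) assigns each 90% interval a score of 1 or 0 and averages the scores by expert:

```python
from collections import defaultdict

def interval_score(q05, q95, true_value):
    """Score 1 if the realization falls inside the expert's 90% interval, else 0."""
    return 1 if q05 <= true_value <= q95 else 0

def mean_calibration(records):
    """Average calibration score per expert.

    `records` is an iterable of (expert, q05, q95, true_value) tuples;
    a well-calibrated expert should score close to 0.9 on average.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for expert, q05, q95, true_value in records:
        hits[expert] += interval_score(q05, q95, true_value)
        counts[expert] += 1
    return {expert: hits[expert] / counts[expert] for expert in counts}

# Illustrative (made-up) records: (expert, 5% quantile, 95% quantile, true value).
records = [("A", 0.8, 2.5, 1.7), ("A", 10.0, 30.0, 45.0), ("B", 0.5, 5.0, 1.2)]
print(mean_calibration(records))  # {'A': 0.5, 'B': 1.0}
```

The same grouping logic, applied by study or by question rather than by expert, produces the averages plotted in the figures that follow.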
Fig. 1. Mean Calibration Scores for All Studies (calibration score on the horizontal axis, 0.0 to 1.0; studies on the vertical axis)
Figure 2 shows the average calibration scores for all experts, grouped by study. This figure shows that there is wide variation among the average calibration scores of individual experts. Similarly, the plot of the average calibration scores for all questions (grouped by study) also shows considerable variation among the average scores for individual questions within many of the studies; see Figure 3. This suggests that calibration varies not only between fields (e.g., between fields that rely heavily on experimentation, and those that rely more on theory or computer modeling), but also among questions within a field. Thus, it appears easier to achieve good calibration on some questions than others, and in fact some questions even produced under-confidence (e.g., calibration scores of 1.0). However, these observations are not terribly reliable, due to the relatively small numbers of experts used in most studies (typically fewer than 10 experts per study), and therefore need to be tested in a more statistically rigorous manner.

Finally, the dike ring reliability study included several realizations for each question answered by the experts. The questions for the dike ring study are listed in Table 2. For example, for question ZD, each expert would have given a single confidence interval for the water level at Dordrecht, but this confidence interval is compared against six different data points or “realizations” of the water level at different points in time. To explore the differences among realizations, we plot the average calibration scores for different realizations of the same question in Figure 4. (Two other studies, on real estate and building temperature, also included multiple realizations, but in these studies there were only two realizations per question, so the results are not displayed here.) Some questions in the dike ring study (in particular, questions Ts and Hs) do seem to have been much easier than the others, in the sense that the experts got good calibration scores on all realizations for those questions. However, for the other questions in the dike ring study, there was a wide range of variability in
the calibration scores achieved for different realizations, suggesting that realizations yielding unusually high or low calibration scores may reflect noise or outliers in the data rather than any systematic differences in the difficulty of those questions.

Fig. 2. Mean Calibration Scores for Experts (grouped by study; calibration score on the horizontal axis, 0.0 to 1.0)
Fig. 3. Mean Calibration Scores for Questions (grouped by study; calibration score on the horizontal axis, 0.0 to 1.0)
Fig. 4. Mean Calibration Scores for Questions in Dike Ring Safety Study (calibration score on the horizontal axis, 0.0 to 1.0; questions ZR, ZG, ZD, Hs, Ts, mo1, mo2, and mo3 on the vertical axis)
Table 2. Questions for Dike Ring Study

ZR: The question concerns the local water level at Raknoord. The seven data points represent seven realizations based on seven randomly chosen points in time.
ZG: The question concerns the local water level at Goidschalxoord. The eight data points represent eight realizations based on eight randomly chosen points in time.
ZD: The question concerns the local water level at Dordrecht. The six data points represent six realizations based on six randomly chosen points in time.
Hs: For a randomly chosen occurrence, what is the ratio of the actual wave height to the wave height calculated using the Bretschneider model?
Ts: For a randomly chosen occurrence, what is the ratio of the actual wave period to the wave period calculated using the Bretschneider model?
Mo1: What is the actual flow rate on occasions when the calculated flow rate is 1 liter per second per meter?
Mo2: What is the actual flow rate on occasions when the calculated flow rate is 10 liters per second per meter?
Mo3: What is the actual flow rate on occasions when the calculated flow rate is 100 liters per second per meter?

Note: The questions were translated from the Dutch by Robert Clemen and Jules van Binsbergen of Duke University.
III. Generalized Linear Mixed Model

As noted above, the data exhibit a hierarchical structure, with some “treatments” nested within others. In particular, the expert and question effects are nested within each study, and the realization effect is nested within a given question. It is reasonable to assume that this structure would give rise to correlation among observations (for example, among the answers to several different questions given by a single expert). This data structure calls for a multi-level (or hierarchical) model capable of handling such nested effects. Moreover, because we are using a binary response variable to represent whether each observed realization fell within the corresponding confidence interval given by a particular expert, our data are clearly not normally distributed, and a logistic regression model is a natural approach for performing the data analysis. However, the nested, hierarchical structure of the data suggests that different effects should be associated with different levels of the hierarchy. Generalized linear mixed models (see, for example, McCulloch and Searle, 2000) are therefore needed to analyze this data set. A brief description of this type of model as applied to our data set is given below.

Consider the answer provided by the jth expert in the ith study to the kth question in that study, and the lth realization of that question. We will use the notation b(a) to describe the nesting of effect b within effect a. Using this notation, the value y_{l(kj(i))} achieved by the jth expert on the lth realization of the kth question in study i is either 1 or 0. Thus, the distribution of y_{l(kj(i))} is Bernoulli(π_{l(kj(i))}), where π_{l(kj(i))} is the (unknown) probability of the realization falling within the interval formed by the 5% and 95% quantile estimates provided by the expert. Using the logit function g(π) = log[π/(1 − π)], a generalized linear mixed model for these data can be given by

$$\eta_{l(kj(i))} = g(\pi_{l(kj(i))}) = \log\frac{\pi_{l(kj(i))}}{1 - \pi_{l(kj(i))}} = (\text{study effect})_i + (\text{expert effect})_{j(i)} + (\text{question effect})_{k(i)} + (\text{realization effect})_{l(kj(i))} \qquad (1)$$
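To make the nesting in model (1) concrete, the following sketch (our own illustration, with arbitrary parameter values) simulates hit/miss outcomes for a single study, with realization effects shared across experts for a given question, as described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_outcomes(n_experts=8, n_questions=10, n_realizations=3,
                      study_effect=0.5, sd_expert=1.2, sd_question=0.8, sd_real=1.0):
    """Simulate hits/misses from the nested logit model (1) for a single study."""
    expert_eff = rng.normal(0.0, sd_expert, n_experts)                  # one per expert
    question_eff = rng.normal(0.0, sd_question, n_questions)            # one per question
    real_eff = rng.normal(0.0, sd_real, (n_questions, n_realizations))  # nested in question
    # eta[j, k, l] = study + expert_j + question_k + realization_{l(k)}
    eta = (study_effect
           + expert_eff[:, None, None]
           + question_eff[None, :, None]
           + real_eff[None, :, :])
    pi = 1.0 / (1.0 + np.exp(-eta))   # inverse logit
    return rng.binomial(1, pi)        # Bernoulli hit/miss indicators

y = simulate_outcomes()
print("overall hit rate:", y.mean())            # analogous to an average calibration score
print("per-expert hit rates:", y.mean(axis=(1, 2)))
```

Averaging the simulated outcomes by expert or by question mimics the calibration scores plotted in Figures 2 and 3.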
Model (1) implicitly expresses the probability π_{l(kj(i))} as a function of the study, expert, question, and realization effects. We model the study effects as fixed effects, thus estimating a specific effect size for each study. Even though the experts and questions were not randomly sampled from a larger population (as suggested, for example, by Yandell (1997)), we assume that the expert, question, and realization effects are random, with distributions N(0, Iσ²_expert), N(0, Iσ²_question), and N(0, Iσ²_realization), respectively. We are interested in estimating the variance components σ²_expert, σ²_question, and σ²_realization, which can be viewed as measures of the sizes of the expert, question, and realization effects.

This multi-level random-effect model provides appropriate estimates of standard errors, and thus appropriate confidence intervals and significance levels. By contrast, analyzing the data as a completely randomized design (i.e., as a one-level model) would be too liberal in making inferences about the study and question effects. For example, a one-level model would not be able to correct for cases in which apparently good or poor calibration in a particular study could actually be explained merely by that study happening by chance to involve unusually well or poorly calibrated experts.

In principle, we could model the question effect as either a fixed or a random effect. However, when modeling question effects as fixed effects, we would likely encounter convergence problems during model fitting, because many questions have average calibration scores of 1 or 0. We avoid this problem by modeling question effects as random effects. Furthermore, since our analysis involves estimating question effects for a total of 519 questions, with only
small to moderate numbers of experts answering each question, modeling question effects as fixed effects would likely lead to over-fitting of the data set. By contrast, the random-effect estimates incorporate “shrinkage” of the individual question-effect estimates toward the corresponding study mean, with the extent of shrinkage decreasing in the sample size for each question (i.e., the number of experts in that study, times the number of realizations of that question). This application of random-effect models is known as small-area estimation; see, for example, Ghosh and Rao (1994). Random-effect models here serve as a mechanism for improving on the observed sample proportions for each question (which might be unreliable due to small sample sizes) in estimating the true magnitude of question effects.

Maximum likelihood estimation for GLMMs usually involves high-dimensional integrals, which can be analytically intractable. When the dimensions of these integrals are not too large, numerical integration can be used to closely approximate the likelihood. This is not the case with our data, however, since the realization random effects are nested within the question random effects. Therefore, we sacrifice exactness for feasibility of implementation, and use the SAS GLIMMIX macro, which is based on a “pseudo-likelihood” approach proposed by Wolfinger and O’Connell (1993), to fit our model.

IV. Results

A. Model Fitting

We first present results using a multi-level logit model, but ignoring the nested structure of realizations within questions. Next, we consider the full model, including the realization random effects, and highlight the differences due to the inclusion of realization random effects. These results are shown in Table 3. As can be seen from that table, using the SAS GLIMMIX macro, we obtain variance component estimates of 1.281 for σ²_expert and 1.360 for σ²_question in the absence of realization effects; both variance components are significantly different from
zero. This would seem to suggest that the magnitudes of the expert and question random effects are about the same. Note that since the logit function g(π) = log[π/(1 − π)] is used in our analysis, we are essentially estimating the variance of the expert and question effects on a “logit scale.” Therefore, the variance components in our case are more difficult to interpret than in the ordinary linear mixed-effect model. In particular, the random effects in our model act linearly on g(π) = η, and thus nonlinearly on the estimate of the calibration score π, which can be obtained as π = e^η/(1 + e^η).

Table 3. Parameters (and standard errors) of the models

 | Reduced Model | Full Model
Fixed effects overall test | p = 0.0009 | p = 0.0021
Random effects:
  σ²_expert | 1.281 (0.178) | 1.475 (0.201)
  σ²_question | 1.360 (0.146) | 0.633 (0.176)
  σ²_realization | —— | 1.131 (0.183)
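Because these variance components live on the logit scale, it can help to translate them into hit probabilities. The sketch below (a back-of-the-envelope illustration of our own, taking 0.58 as the approximate overall calibration score reported later in this section) shows how a one-standard-deviation expert, question, or realization effect moves the probability of an interval containing the true value:

```python
import numpy as np

def invlogit(eta):
    """Map a logit-scale value back to a probability: pi = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

baseline = 0.58                          # approximate overall calibration score
eta0 = np.log(baseline / (1 - baseline)) # baseline on the logit scale

# Full-model variance components from Table 3 (logit scale).
variances = {"expert": 1.475, "question": 0.633, "realization": 1.131}

for name, var in variances.items():
    sd = np.sqrt(var)
    lo, hi = invlogit(eta0 - sd), invlogit(eta0 + sd)
    print(f"{name}: +/- one SD moves the hit probability from "
          f"{baseline:.2f} to roughly {lo:.2f} or {hi:.2f}")
```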
The relatively large question effect in this simplified model suggests that there may be large differences in the difficulty of achieving good calibration for different questions, even within a single domain or study. (In other words, differences in calibration arise not merely due to differences in the types of thinking or training involved in different domains, but also due to differences among questions within a domain.) However, when we consider the full model, including realization effects, we obtain variance component estimates of 1.475 for σ²_expert, 0.633 for σ²_question, and 1.131 for σ²_realization. The question random effect thus becomes much smaller after correcting for the realization random effect, although all three effects
(expert, question, and realization) are still significantly different from zero. This suggests that most of the question effect may be due to chance differences between realizations, not inherent differences in question difficulty. (For example, when there is only a single realization for a given question, that question may appear to elicit good or poor calibration depending on whether the realization for that question happens to be an outlier, without necessarily implying much about the difficulty of the question.) This is potentially helpful in practice, since with large realization effects, reliable estimates of calibration can likely be obtained just by averaging across enough questions to minimize the effects of noise in the realizations, whereas if question difficulty is critically important, greater guidance may be needed to facilitate identification of suitable seed questions.

The overall test of the study fixed effects is statistically significant. In other words, some studies do have significantly higher or lower calibration scores than others, even after correcting for the effects of variability among experts, questions, and realizations. According to individual hypothesis tests for each of the 27 studies, four of them (radioactive deposition (Delft), dry deposition, building temperature, and option trading) have significantly better than average calibration at the 5% significance level, achieving calibration scores of well above 70% (compared to an average of about 58%). By contrast, the studies on space debris and movable barriers are significantly worse than average, achieving calibration scores of well under 40%. Four other study effects are significant at the 10% level (the study on wet deposition marginally better than average, and the crane risk, soil transfer, and gas pipelines studies marginally worse).

Although the SAS GLIMMIX macro has been perhaps the most popular software for fitting generalized linear mixed models in recent years, it has been reported that the software’s penalized quasi-likelihood (PQL) estimators can sometimes behave poorly. Therefore, we
re-fit our model using a Bayesian approach with a non-informative prior; the PQL estimates of the expert, question, and realization effects were 10% to 20% smaller than those obtained by Gibbs sampling, and the PQL standard error estimates were also about 20% smaller. However, the qualitative conclusions regarding effect sizes obtained from the PQL estimates were consistent with those obtained by Gibbs sampling.

B. Test of Expert Differences

We now explore whether some experts appear to be significantly better or worse than others. We take two different approaches to this question – one using a GLMM but treating the expert effects as fixed effects, and one using a simple binomial test. The two approaches give generally similar results, so only the first is presented here.

Since the nine experts with average calibration scores of either 1 or 0 would cause convergence problems when we model the expert effects as fixed effects, we removed the observations associated with those experts and re-fit the model given in equation (1) above. (Note that it was not necessary to remove those observations when using the binomial test.) All assumptions remained unchanged, except that we modeled both the study and expert effects as fixed effects. The variance component estimates are similar to those obtained before (see Table 5 below). In particular, both the question and the realization random effects are still statistically significant, with the realization random effect being much larger than the question effect. As expected, the overall test of the expert fixed effects is highly significant, with a p-value less than 0.0001. We also test the hypothesis that the calibration score for expert j in study i is equal to the overall calibration score for study i. The results indicate that 54 (or 28%) of the 194 experts with average calibration scores other than 0 or 1 have calibration scores significantly higher or lower than their corresponding study means.
Table 5. Parameters (and standard errors) of the models

 | Fixed Expert Effect | Original Model
σ²_expert | —— | 1.475 (0.201)
σ²_question | 0.708 (0.200) | 0.633 (0.176)
σ²_realization | 1.228 (0.206) | 1.131 (0.183)
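The simpler binomial-test approach mentioned above can be sketched as follows. This is an illustration only (it is not the GLMM fixed-effect test actually reported), it assumes SciPy is available, and the expert counts shown are made up:

```python
from scipy.stats import binomtest

def flag_experts(expert_hits, study_rate, n_tests, alpha=0.05):
    """Two-sided binomial test of each expert's hit count against the study-wide rate.

    `expert_hits` maps expert id -> (hits, n_intervals); `study_rate` is the overall
    calibration score of that study. Returns the experts significant at `alpha`,
    both with and without a Bonferroni correction for `n_tests` comparisons.
    """
    raw, bonferroni = [], []
    for expert, (hits, n) in expert_hits.items():
        p = binomtest(hits, n, study_rate).pvalue
        if p < alpha:
            raw.append(expert)
        if p < alpha / n_tests:
            bonferroni.append(expert)
    return raw, bonferroni

# Illustrative counts for one study with an overall calibration score of 0.58.
hits = {"expert_1": (5, 20), "expert_2": (13, 20), "expert_3": (19, 20)}
print(flag_experts(hits, study_rate=0.58, n_tests=194))
```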
Because the tests of individual experts described above involve, in total, about 200 comparisons, it is clearly possible to detect some expert differences merely due to chance. In fact, at a 5% significance level (5% rate of false positives), the expected number of false positives would be about 10 in our case, suggesting that roughly 40 of the 54 observed expert effects are likely to be real. Even using the highly conservative Bonferroni method (Milliken and Johnson, 1993) to correct for the large number of comparisons, 13 out of the 194 total expert tests (6.7%) are still found to be statistically significant. Thus, the finding that some experts are significantly better or worse calibrated than others does not seem to be due merely to the large number of comparisons performed.

C. Test of Question Differences

We are also interested in knowing whether it is significantly harder or easier to achieve good calibration on some questions than others within a given domain. However, we cannot readily test this using the same fixed-effect approach we used to study differences among experts, since about one-fifth of all questions have average calibration scores of 1 or 0. Therefore, we limit our study of question differences to the dike ring data set, in which all questions have multiple realizations, and none have calibration scores of 1 or 0. When we fit a GLMM with fixed question effects and random expert and realization effects to
the data from the dike ring safety study, we find that questions ZG and ZD have average calibration scores significantly worse than the mean calibration of the dike ring safety study as a whole, and questions Hs and Ts are significantly better.

D. Sensitivity Analysis

Unfortunately, only three studies in the data set contained multiple realizations for each question. Therefore, we now conduct a sensitivity analysis to explore whether our estimate of the realization effect is overly influenced by a few data points. As can be seen graphically from Figure 4, questions ZR, ZG, and ZD seem to have much more variability among their realizations than the other questions in the dike ring safety study. Therefore, we perform a sensitivity analysis in which these three questions are deleted from the data set. The resulting estimates of variance components are given in Table 7. If we delete these three questions, the conclusions change dramatically. Now, the estimated question effect is much larger than the realization effect, suggesting that the differences in calibration among questions may be largely due to inherent differences in question difficulty.

Table 7. Estimates of Random Effects (Sensitivity Analysis)

 | Original | Delete ZR, ZG, ZD
σ²_expert | 1.475 | 1.678
σ²_question | 0.633 | 1.238
σ²_realization | 1.131 | 0.297
Ideally, it would be nice to explore whether this finding holds up when we have larger numbers of questions with multiple realizations, drawn from a larger number of studies. Otherwise, the apparent dominance of the realization effect over the question effect (as seen in Table 3) may be just a fluke, due to the small number of questions with multiple realizations available to use in measuring the realization effect. This is especially a concern
because the questions for which multiple realizations are available cannot really be regarded as a random sample of all questions.

V. Implications for the Aggregation of Expert Opinions

According to Clemen and Winkler (1999), Mosleh et al. (1988), and others, methods for the elicitation and aggregation of expert opinion can be categorized into mathematical and behavioral approaches. In the mathematical approaches, the experts’ assessments are combined into a single probability distribution using a mathematical method. By contrast, behavioral approaches typically rely on group discussion to obtain a “consensus” distribution. Since it is generally agreed that mathematical approaches for aggregating expert opinions tend to yield better results (e.g., Mosleh et al., 1988; Winkler, 1981; Chhibber and Apostolakis, 1993), we focus our discussion here only on mathematical approaches. In particular, we begin by discussing the implications of our findings for the performance-based weighted-average model of Cooke (1991a), which was used to generate our data set. The basic idea of Cooke’s model is to combine expert opinions via a weighted average, using weights based on the experts’ calibration scores on a set of seed questions (for which the true values are knowable with reasonable amounts of time and effort).

In our analysis, significant differences were found among studies, among experts, and among questions. Moreover, replications (with multiple realizations of the same question) allowed us to make a preliminary assessment of whether the question effect is due largely to question difficulty, or merely to random noise in the realizations of the uncertain quantities. The results of this analysis suggest that much of the apparent question effect may be due to noise (i.e., variability in realizations), rather than to systematic differences in the difficulty of achieving good calibration for different questions.
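To make the weighting idea concrete, the sketch below combines experts' point estimates for a study question using weights proportional to their seed-question calibration scores. This is a deliberate simplification of our own: Cooke's model forms a weighted combination of the experts' full distributions, and the weights reflect informativeness as well as calibration, as discussed below.

```python
import numpy as np

def performance_weighted_estimate(calibration_scores, expert_estimates):
    """Combine experts' estimates with weights based on seed-question calibration.

    `calibration_scores` and `expert_estimates` are parallel lists: the fraction of
    seed intervals each expert got right, and that expert's estimate (e.g., a median)
    for a study question. Weights here are simply proportional to calibration; this
    is an illustration of the weighting idea only, not Cooke's classical model.
    """
    scores = np.asarray(calibration_scores, dtype=float)
    weights = scores / scores.sum()
    return float(np.dot(weights, expert_estimates))

# Three experts with seed calibration 0.9, 0.6, 0.3 and their medians for a study question.
print(performance_weighted_estimate([0.9, 0.6, 0.3], [120.0, 200.0, 400.0]))
```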
The significant expert effect is not surprising, given past research on expert judgment (e.g., Soll and Klayman, 2004), and bodes well for Cooke’s method, since the method places more weight on better calibrated experts. In general, there are reasons to expect a negative correlation between calibration and informativeness (Cooke, 1991a; see also Yaniv and Foster, 1995); however, Cooke (personal communication) has indicated that the best experts are both highly informative and well calibrated. (In fact, this is true almost by definition for Cooke’s method, since the weight given to an expert is affected to some extent by both calibration and informativeness.) Thus, some experts have the ability to achieve good calibration without sacrificing informativeness; it might be useful in future to identify a pool of such people and study how they accomplish this (e.g., through think-aloud protocols). Interestingly, the experts that perform best on the judgment task studied here are not always those with the most experience in their fields (Cooke, 1991a).

The significant study effect is also not surprising, given that other researchers (e.g., Soll and Klayman, 2004) have found that calibration tends to vary by domain. (Cooke has suggested that domains with heavy reliance on real-world experiments or large amounts of empirical data seem to yield better expert calibration than fields that rely more heavily on mathematical modeling.) This would appear to bode ill for Cooke’s method, at least for those studies in which all experts perform poorly. In fact, however, Cooke has indicated (personal communication) that his performance-based weighted-average model actually achieves reasonable calibration in almost all studies, even some in which all individual experts are poorly calibrated. This is also consistent with the findings of Wallsten et al. (1997), Ariely et al. (2000), and Johnson et al. (2001). Nonetheless, it is of course still possible that differential weighting of experts may not always be sufficient to achieve good calibration. This suggests that methods involving broadening of confidence intervals (e.g., Apostolakis et al.,
1980; Apostolakis, 1982 and 1985; Clemen and Lichtendahl, 2002; Martz, 1984; Shlyakhter et al., 1994; Shlyakhter, 1994) may occasionally also be useful.

The significant question effect would at first glance appear to bode ill for extrapolation from seed questions to study questions. After all, if expert performance differs significantly even among the seed questions, there may be doubts about whether calibration on seed questions will accurately predict calibration on the study questions for which expert estimates are desired. Fortunately, after correcting for realization effects, the question effects appear to be relatively small. However, this conclusion is not robust, and more data would be highly desirable.

Finally, the significant realization effect bodes ill for Cooke’s method if only a small number of seed questions is used. However, realization effects can be expected to cancel themselves out as the number of seed questions increases, yielding more reliable estimates of expert calibration if sufficient numbers of independent (or weakly correlated) seed questions are used.

The results of our study also provide at least some suggestive implications for the process of expert elicitation. In particular, Cooke has indicated (personal communication) that in several of the studies we found to be significantly worse than average in terms of calibration, the elicitation process was managed and implemented by students. This suggests that expert elicitation is still to some extent an art form, in which better performance can be expected if the interactions with experts are managed by experienced elicitors rather than novices.

In future research, it would be desirable to explore in more detail whether particular features of different studies, experts, or questions help to explain observed good or poor calibration. Such observations would likely be only suggestive at this point, especially if based primarily on “opportunity samples” from field data. However, they might nonetheless gener-
ate interesting hypotheses that could be investigated more rigorously in future confirmatory research by cognitive psychologists and others. It would also be nice to further explore the magnitudes of the question and realization effects when larger numbers of questions with multiple realizations are available. In fact, it might be worth making deliberate attempts to include questions with multiple realizations in future applications of Cooke’s method, so that those applications could help advance knowledge not only in their fields of application, but also on the methodology of expert judgment.

Acknowledgements

The material is based upon work supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number DAAD19-01-1-0502, and by the U.S. National Science Foundation under grant number DMI-0228204. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the sponsors. We extend our sincere appreciation to Roger Cooke and Willy Aspinall for providing the data used in this study. We also wish to thank Cooke, Robert Clemen, and Kasey Lichtendahl for their contributions and comments in many stimulating discussions, and Clemen and Jules van Binsbergen for translations from the Dutch.

References

Alpert, M., H. Raiffa. 1982. A progress report on the training of probability assessors. In Judgment under Uncertainty: Heuristics and Biases, edited by D. Kahneman, P. Slovic, A. Tversky. Cambridge University Press, 1287 – 1294.
Apostolakis, G. 1982. Data analysis in risk assessments. Nuclear Engineering and Design 71 375 – 381.
—. 1985. The broadening of failure rate distributions in risk analysis: How good are the experts? Risk Analysis 5 89 – 91.
Apostolakis, G., S. Kaplan, B. Garrick, R. Duphily. 1980. Data specialization for plant specific risk studies. Nuclear Engineering and Design 56 321 – 329.
Ariely, D., W. T. Au, R. H. Bender, D. V. Budescu, C. B. Dietz, H. Gu, T. S. Wallsten, G. Zauberman. 2000. The effects of averaging subjective probability estimates between and within judges. Journal of Experimental Psychology: Applied 6(2) 130 – 147.
Aspinall, W. 1996. Expert judgment case studies. Cambridge Program for Industry, Risk Management and Dependence Modeling, Cambridge University.
Bier, V. M. 2004. Implications of the research on expert overconfidence and dependence. Reliability Engineering and System Safety 85 321 – 329.
Budescu, D. V., I. Erev, T. S. Wallsten. 1997a. On the importance of random error in the study of probability judgment. Part I: New theoretical developments. Journal of Behavioral Decision Making 10 157 – 171.
Budescu, D. V., T. S. Wallsten, W. T. Au. 1997b. On the importance of random error in the study of probability judgment. Part II: Applying the stochastic judgment model to detect systematic trends. Journal of Behavioral Decision Making 10 173 – 188.
Chhibber, S., G. Apostolakis. 1993. Some approximations useful to the use of dependent information sources. Reliability Engineering and System Safety 42 67 – 86.
Claessens, M. 1990. An application of expert opinion in ground water transport (in Dutch). DSM Report R 90 8840, Technical University of Delft, the Netherlands.
Clemen, R. T., K. C. Lichtendahl. 2002. Debiasing expert overconfidence: A Bayesian calibration model. Working paper, Duke University.
Clemen, R. T., R. L. Winkler. 1999. Combining probability distributions from experts in risk analysis. Risk Analysis 19(2) 187 – 203.
Cooke, R., E. Jager. 1998. A probabilistic model for the failure frequency of underground gas pipelines. Risk Analysis 18(4) 511 – 527.
Cooke, R. M. 1991a. Experts in Uncertainty: Opinion and Subjective Probability in Science. Oxford University Press, New York.
—. 1991b. Expert judgment study on atmospheric dispersion and deposition report. Technical Report No. 01-81, Faculty of Technical Mathematics and Informatics, Technical University of Delft, the Netherlands.
—. 1994. Uncertainty in dispersion and deposition in accident consequence modeling assessed with performance-based expert judgment. Reliability Engineering and System Safety 45 35 – 46.
Erev, I., T. S. Wallsten, D. V. Budescu. 1994. Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review 101(3) 519 – 527.
Frijters, M., R. Cooke, K. Slijkuis, J. van Noortwijk. 1999. Expert Judgment Uncertainty Analysis for Inundation Probability. Ministry of Water Management, Bouwdienst, Rijkswaterstaat, Utrecht.
Ghosh, M., J. N. K. Rao. 1994. Small area estimation: An appraisal. Statistical Science 9 55 – 76.
Gigerenzer, G., U. Hoffrage, H. Kleinbolting. 1991. Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review 98(4) 506 – 528.
Goossens, L., R. Cooke, F. Woudenberg, P. van der Torn. 1992. Probit functions and expert judgment: Report prepared for the Ministry of Housing, Physical Planning and Environment. Technical Report, Technical University of Delft, the Netherlands.
Goossens, L. H. J., J. Boardman, F. T. Harper, B. C. P. Kraan, M. L. Young, R. M. Cooke, S. C. Hora. 1998b. Probabilistic accident consequence uncertainty analysis: Internal dosimetry uncertainty assessment, NUREG/CR-6571, EUR 16773. U.S. Nuclear Regulatory Commission and Commission of European Communities, Washington, D.C., and Brussels.
Goossens, L. H. J., J. Boardman, F. T. Harper, B. C. P. Kraan, M. L. Young, R. M. Cooke, S. C. Hora, J. A. Jones. 1997. Probabilistic accident consequence uncertainty analysis: Uncertainty assessment for deposited material and external doses, NUREG/CR-6526, EUR 16772. U.S. Nuclear Regulatory Commission and Commission of European Communities, Washington, D.C., and Brussels.
Goossens, L. H. J., R. M. Cooke, F. Woudenberg, P. van der Torn. 1998a. Expert judgment and lethal toxicity of inhaled chemicals. Journal of Risk Research 1(2) 117 – 133.
Goossens, L. H. J., F. T. Harper. 1998. Joint EC/USNRC expert judgment driven radiological protection uncertainty analysis. Journal of Radiological Protection 18(4) 249 – 264.
Harper, F. T., L. H. J. Goossens, R. M. Cooke, S. C. Hora, M. L. Young, J. Pasler-Sauer, L. A. Miller, B. C. P. Kraan, C. Lui, M. D. McKay, J. C. Helton, J. A. Jones. 1995. Joint USNRC/CEC consequence uncertainty study: Summary of objectives, approach, application, and results for the dispersion and deposition uncertainty assessment, NUREG/CR-6244, EUR 15855. U.S. Nuclear Regulatory Commission and Commission of European Communities, Washington, D.C., and Brussels.
Haskin, F. E., L. H. J. Goossens, F. T. Harper, J. Grupa, B. C. P. Kraan, R. M. Cooke, S. C. Hora. 1997. Probabilistic accident consequence uncertainty analysis: Early health uncertainty assessment, NUREG/CR-6545, EUR 16775. U.S. Nuclear Regulatory Commission and Commission of European Communities, Washington, D.C., and Brussels.
Henrion, M., B. Fischhoff. 1986. Assessing uncertainty in physical constants. American Journal of Physics 54(9) 791 – 797.
Johnson, T. R., D. V. Budescu, T. S. Wallsten. 2001. Averaging probability judgments: Monte Carlo analyses of asymptotic diagnostic value. Journal of Behavioral Decision Making 14 123 – 140.
Juslin, P., P. Wennerholm, H. Olsson. 1999. Format dependence in subjective probability calibration. Journal of Experimental Psychology: Learning, Memory, and Cognition 25(4) 1038 – 1052.
Juslin, P., A. Winman, H. Olsson. 2000. Naïve empiricism and dogmatism in confidence research: A critical examination of the hard-easy effect. Psychological Review 107(2) 384 – 396.
Klayman, J., J. B. Soll, C. Gonzalez-Vallejo, S. Barlas. 1999. Overconfidence: It depends on how, what, and whom you ask. Organizational Behavior and Human Decision Processes 79(3) 216 – 247.
Lichtenstein, S., B. Fischhoff. 1977. Do those who know more also know more about how much they know? The calibration of probability judgments. Organizational Behavior and Human Performance 20 159 – 183.
Lichtenstein, S., B. Fischhoff, L. D. Phillips. 1982. Calibration of probabilities: The state of the art to 1980. In Judgment under Uncertainty: Heuristics and Biases, D. Kahneman, P. Slovic, A. Tversky (eds.). Cambridge University Press, Cambridge, England.
Martz, H. F. 1984. On broadening failure rate distributions in PRA uncertainty analyses. Risk Analysis 4(1) 15 – 23.
McCulloch, C. E., S. R. Searle. 2000. Generalized, Linear, and Mixed Models. Wiley, New York.
Meima, B. 1990. Expert Opinion and Space Debris. Technological Designer’s Thesis, Faculty of Technical Mathematics and Informatics, Technical University of Delft, the Netherlands.
Milliken, G. A., D. E. Johnson. 1993. Analysis of Messy Data, Volume I: Designed Experiments. Chapman and Hall, London.
Morgan, M. G., M. Henrion. 1990. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge, England.
Mosleh, A., V. M. Bier, G. Apostolakis. 1988. A critique of current practice for the use of expert opinions in probabilistic risk assessment. Reliability Engineering and System Safety 20 63 – 85.
Offerman, J. 1990. Safety analysis of the carbon fibre reinforced composite material of the Hermes cold structure. Technical Report, Technical University of Delft, the Netherlands.
Plous, S. 1993. The Psychology of Judgment and Decision Making. McGraw-Hill, New York.
Qing, X. 2002. Risk Analysis for Real Estate Investment. Ph.D. thesis, Department of Architecture, Technical University of Delft, the Netherlands.
Russo, J. E., P. J. H. Schoemaker. 1992. Managing overconfidence. Sloan Management Review 33 7 – 17.
Shlyakhter, A. I. 1994. Improved framework for uncertainty analysis: Accounting for unsuspected errors. Risk Analysis 14 441 – 447.
Shlyakhter, A. I., D. M. Kammen, C. L. Broido, R. Wilson. 1994. Quantifying the credibility of energy projections from trends in past data – the United States energy sector. Energy Policy 22(2) 119 – 130.
Soll, J. B., J. Klayman. 2004. Overconfidence in interval estimates. Journal of Experimental Psychology: Learning, Memory, and Cognition 30(2) 299 – 314.
Van Elst, N. P. 1997. Betrouwbaarheid beweegbare waterkeringen [reliability of movable water barriers] (in Dutch). Report Series 35, Technical University of Delft, the Netherlands.
Van Overbeek, F. 1999. Financial Experts in Uncertainty. Master’s thesis, Department of Mathematics, Technical University of Delft, the Netherlands.
Vick, S. G. 2002. Degrees of Belief: Subjective Probability and Engineering Judgment. ASCE Press, Reston, VA.
Walley, P. 1991. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.
Wallsten, T. S., D. V. Budescu, I. Erev, A. Diederich. 1997. Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making 10 243 – 268.
Willems, A. 1998. Het gebruik van kwantitatieve technieken in risicoanalyses van grootschalige infrastructuur projecten [the use of quantitative techniques in risk analysis of large infrastructural projects] (in Dutch). Technical Report, Technical University of Delft, the Netherlands.
Winkler, R. L. 1981. Combining probability distributions from dependent information sources. Management Science 15 361 – 375.
Wit, M. S. D. 2001. Uncertainty in Predictions of Thermal Comfort in Buildings. Ph.D. thesis, Department of Civil Engineering, Technical University of Delft, the Netherlands.
Wolfinger, R., M. O’Connell. 1993. Generalized linear mixed models: A pseudo-likelihood approach. Journal of Statistical Computation and Simulation 48 233 – 243.
Yandell, B. 1997. Practical Data Analysis for Designed Experiments. Chapman & Hall, London, UK.
Yaniv, I., D. P. Foster. 1995. Graininess of judgment under uncertainty: An accuracy–informativeness trade-off. Journal of Experimental Psychology: General 124(4) 424 – 432.