Asymptotic Calibration

Dean Foster*        Rakesh V. Vohra†

First draft: May 1991. This version: May 9, 1996

* Dept. of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. Email: [email protected]
† Dept. of Management Science, College of Business, Ohio State University, Columbus, OH 43210. Email: [email protected]

Abstract

Can we forecast the probability of an arbitrary sequence of events happening so that the stated probability of an event happening is close to its empirical probability? In other words, on the subset of days where we forecast a 2/3 chance of a particular event occurring, about 2/3 of the time that event should occur. We can view this prediction problem as a game played against Nature, where at the beginning of the game Nature picks a data sequence and the forecaster picks a forecasting algorithm. If the forecaster is not allowed to randomize, then Nature wins: there will always be data for which the forecaster does poorly. This paper shows that if the forecaster can randomize, the forecaster wins in the sense that the forecasted probabilities and the empirical probabilities can be made arbitrarily close to each other.

Keywords: Brier score. Calibration. Competitive ratio. Individual sequences. Regret. Universal prediction of sequences. Worst case.


1 Introduction

Probability forecasting is the act of assigning probabilities to an uncertain event. It is an activity widely practiced in meteorological circles. For example, since 1965, the U.S. National Weather Service has been in the habit of making and announcing probability of precipitation (PoP) forecasts. Such a forecast, a number between 0 and 1, is interpreted to be the probability that precipitation (defined to be at least 0.01 inches) will occur in a specified time period and area. These PoP forecasts are now popularly accepted by the American public as meaningful and informative. We leave the reader to draw their own conclusions about what, if anything, this says about the great American public.

There are many criteria for judging the effectiveness of a probability forecast (Murphy and Epstein, 1967). In this paper we limit ourselves to consideration of the criterion called calibration (sometimes termed reliability). Dawid (1982) offers the following intuitive definition of calibration:

"Suppose that, in a long (conceptually infinite) sequence of weather forecasts, we look at all those days for which the forecast probability of precipitation was, say, close to some given value ω and (assuming these form an infinite sequence) determine the long run proportion p of such days on which the forecast event (rain) in fact occurred. The plot of p against ω is termed the forecaster's empirical calibration curve. If the curve is the diagonal p = ω, the forecaster may be termed (empirically) well calibrated."

We will give a rigorous definition later. Calibration by itself is not a sufficient condition for a forecast to be deemed good. To see this, suppose there are two weather forecasters facing the following weather sequence: dry, wet, dry, wet, .... One always forecasts a probability of 1/2 of rain each day and the other alternates 0, 1, 0, 1, .... Both forecasters are well calibrated, but the forecasts of the first are clearly

less useful than the second. Now consider two uncalibrated forecasts, the first of which always forecasts a probability of 1/3 and the second of which alternates 1, 0, 1, 0, ... (always generating an incorrect forecast). Which of these two is better is a matter of debate: the first has a lower quadratic error but the second gets the 'pattern' of rain correct. In any case, these two seem dominated by the first two forecasts discussed previously. So, calibration does seem to be an appealing minimal property that any probability forecast should satisfy.

The notion of calibration only makes sense if one can construct forecasts that are calibrated. Regrettably, Oakes (1985) has proved (see Dawid (1985) for a different proof) that no deterministic forecasting sequence can be calibrated for all possible sequences. Specifically, Oakes shows that it is impossible to construct a joint distribution for an infinite sequence of events whose posterior mean is guaranteed to be calibrated for every possible sequence of outcomes. The implication that Dawid (1985) draws is that no statistical analysis, however complicated, can be guaranteed to be a success.

A way around this impossibility result is to relax the requirement that a forecast be calibrated against all possible sequences. Perhaps it is sufficient that the forecaster be calibrated for some restricted family of distributions. Dawid (1985) argues that this can result in forecasting schemes that are computationally burdensome to execute and in some cases not computable at all. Alternatively, one can reject the notion that calibration is a desirable or useful notion at all. Schervish (1985), for example, offers two arguments for this view. The first is that calibration is a long run criterion. In the short run (when we are alive) a forecaster may be doing quite well. The second is that while a malevolent nature may be able to make one forecaster look bad according to the calibration criterion, it is harder for it to make many forecasters look bad at the same time. To quote Schervish: "The more different forecasts that nature is trying to make look bad, the more flexibility all forecasters have to try to look good."

Our goal in this paper is to rescue the notion of calibration. We get around the impossibility result of Oakes by broadening the definition of calibration to include randomized forecasts. By carefully choosing our definition of calibration for randomized forecasts, we show how to construct a forecast which is in fact approximately calibrated. Finally, we generalize our results to the case when what is being forecast is a distribution, not just a point.

2 Notation and Definitions

For ease of exposition assume our forecasting method, F, is assigned the task of forecasting the probability of two states of nature, wet or dry. Denote by $X_t$ the outcome in period t: $X_t = 1$ if it is wet and $X_t = 0$ if it is dry. Denote by $X^T$ the sequence of wet and dry days up to and including period T. Since we can interpret $X^T$ to be the first T terms of an infinite sequence $X^\infty$ that has been revealed to us, we will, when there is no ambiguity, write X for $X^T$. In our context a forecasting method is simply a function that associates with each binary sequence (from the space of all binary sequences) a unique number in the interval [0, 1]. A randomized forecasting method would associate with each binary sequence a probability distribution over [0, 1] which governs the selection of a number in [0, 1]. The forecast that F makes in period t will be denoted $f_t = F(X^{t-1})$. Let $n_T(p; F, X)$ be the number of times F forecasts p up to time T. Let $\rho_T(p; F, X)$ be the fraction of those times that it actually rained. In other words,

$$n_T(p; F, X) \equiv \sum_{t=1}^{T} I_{f_t = p}, \qquad \rho_T(p; F, X) \equiv \frac{\sum_{t=1}^{T} I_{f_t = p} X_t}{n_T(p)},$$

where I is the indicator function. In the original definition of well calibrated it was assumed that F was restricted to selecting forecasts from a finite set

fixed a priori. Let A represent this finite set. One definition of calibration is the following: F is well calibrated with respect to X if and only if for each $p \in A$,

$$\lim_{t \to \infty} \rho_t(p; F, X) = p.$$

Another definition is based on the calibration component of the Brier score (Brier 1950; Murphy 1972, 1973; see Blattenberger and Lad 1985 for an exposition). To introduce this definition, let the calibration measure of F with respect to X after t periods be denoted $C_t(F, X)$, where

$$C_t(F, X) = \sum_{p \in A} \left( \rho_t(p; F, X) - p \right)^2 \frac{n_t(p; F, X)}{t}.$$

The requirement that F select from a fixed set A is not a severe restriction for practical purposes. Many weather forecasters forecast probabilities to only one decimal place. Note that we can easily drop the requirement that A be finite in the definition of well calibrated. This extended definition of well calibrated still falls victim to an Oakes-like impossibility result.
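To make the definitions concrete, here is a minimal Python sketch (our own illustration, not part of the original paper) that computes $n_t(p)$, $\rho_t(p)$, and the calibration score $C_t$ for point forecasts drawn from a finite set A:

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Compute C_t = sum_p (rho_t(p) - p)^2 * n_t(p)/t for point forecasts.

    forecasts: list of probabilities, each drawn from a finite set A
    outcomes:  list of 0/1 outcomes X_t, same length as forecasts
    """
    t = len(forecasts)
    n = defaultdict(int)      # n_t(p): number of times p was forecast
    wet = defaultdict(int)    # number of those periods that were wet
    for f, x in zip(forecasts, outcomes):
        n[f] += 1
        wet[f] += x
    score = 0.0
    for p, count in n.items():
        rho = wet[p] / count  # empirical frequency rho_t(p)
        score += (rho - p) ** 2 * count / t
    return score

# The forecaster that always says 1/2 on the alternating sequence
# dry, wet, dry, wet, ... is well calibrated: rho(1/2) = 1/2, so C_t = 0.
xs = [s % 2 for s in range(1000)]
print(calibration_score([0.5] * 1000, xs))  # 0.0
```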

3 Rules of the game

So that the assumptions underlying our analysis are clear, we frame the analysis in terms of a game between two players. One is the statistician (he) making the probability forecasts and the other is Nature (she) who chooses the outcomes. Nature picks the data X and the statistician picks the forecast function F. The payment from the statistician to Nature is then $C_t(F, X)$. Since we are viewing this as a zero-sum game, the statistician tries to make this payment as small as possible and Nature tries to maximize it.

Deterministic model: The theorem of Oakes is based on the assumption that the statistician is restricted to a deterministic rule for generating forecasts. In this model it is not clear who in fact wins, since given any strategy that the statistician picks there exists a strategy Nature can play for

which she will win the maximal amount possible, and vice versa. To create a clear-cut winner, Oakes further assumes that Nature knows what strategy the statistician will follow. In other words, Nature picks a conditional strategy. With this added assumption, Nature wins: she can make sure that the statistician is never close to calibrated. We will discuss her winning strategy later. If the payoff function is changed (for example, if the Brier score is used instead of calibration) then the statistician can win at this game (Foster 1991, Littlestone and Warmuth 1994, Vovk 1990, Feder, Merhav and Gutman 1992, Cover 1991).

Anticipatory model: Suppose the statistician is allowed to randomize, but Nature is allowed to know the outcome of his randomization. In other words, Nature can anticipate the next random draw of the statistician. It does not matter whether Nature knows just the next random draw or all future randomizations of the statistician: the statistician will lose the game. In this model Nature would win the calibration game, since it is essentially equivalent to the deterministic model once we have conditioned on the randomization. This may be the most natural model of a malevolent Nature, but since we can't prove any more positive results for this model than for the deterministic model, we won't pursue it further.

Private randomization model: An alternative model allows the statistician to randomize and not show Nature any results of this randomization. Nature knows the distribution of strategies that the statistician will use, but not the actual value of the randomization. Nature must then pick the data X without knowing the outcome of the randomization. Within this model Nature cannot learn how the statistician will predict from the history of previous plays. This is the easiest model within which to prove that the statistician can do well (see, for example, Foster and Vohra 1993).

Markov model: A more natural "Markov"-like model is to allow Nature to condition on the previous plays that the statistician has made. Thus as time goes on, Nature can learn more about what his behavior is, but the statistician can still randomize on the next move so that Nature does not

know exactly what he will do. Notice that there are two equivalent ways of viewing the Markov model. The first is to allow the statistician to randomize only before the first move. In other words, he picks a single random forecast. Alternatively, he is allowed to randomize on each successive round. Assuming that Nature can only observe the actions taken by the statistician and not the actual randomization, these two variations are equivalent. This will be the model we use in this paper. It has been used extensively since its original use as a forecasting model by Blackwell (1956) and Hannan (1957). See Freund and Schapire (1995) for a discussion of the difference between this model and the previous model. Here is a precise statement of the rules we will follow in this paper:

1. The statistician begins by choosing a (randomized) forecasting method (function) F and reveals only the distribution of this forecast to Nature.

2. In period $t \ge 1$, the statistician generates $f_t (= F(X^{t-1}))$; simultaneously, Nature selects the value of $X_t$. Nature knows the distribution of strategies that the statistician will use, but not the actual value of the randomization.

3. The penalty the statistician incurs after n rounds is $C_n(F, X)$.

Alternatively, we can state our goal without recourse to a game-theoretic model. Our goal is to find an F such that for all X and all t sufficiently large, $C_t(F, X) < \epsilon$ with probability exceeding $1 - \epsilon$. We will call such an F ε-calibrated.

Definition (ε-Calibration) A randomized forecast rule F is ε-calibrated if, for all X,

$$\lim_{t \to \infty} \mathrm{pr}(C_t(F, X) < \epsilon) > 1 - \epsilon.$$

Within the context of a game, identifying an F that is ε-calibrated is implied by showing that our security level, $\min_F \max_X C_n(F, X)$, is less than ε for all n sufficiently large. Likewise, if an F exists which is ε-calibrated then our security level is less than 2ε. Since both of these can be shown to go to zero, they are equivalent statements.

Suppose that we restrict ourselves to choosing a deterministic forecasting scheme F. Since F is known, Nature is in a position to determine $f_t$ before fixing $X_t$. Consider the following procedure for generating a sequence X (Dawid 1985):

$$X_t = \begin{cases} 1 & \text{if in period } t \text{ the forecaster predicts a probability } f_t \le 0.5 \\ 0 & \text{otherwise.} \end{cases}$$

A straightforward calculation establishes that $C_n(F, X) \ge 1/4$ for all deterministic forecasting methods F. The case of equality occurs when F is the forecast that generates $f_t = 1/2$ for all t. Hence, as long as we restrict ourselves to the set of deterministic forecasting methods, our security level will be $\ge 1/4$.

To show the importance of which model is used, we will now switch from the deterministic model to the Markov model. Consider the randomized forecasting strategy defined as follows:

$$f_t = \begin{cases} 2/3 & \text{with probability } 2/3 \text{ if } X_{t-1} = 1 \\ 1/3 & \text{with probability } 1/3 \text{ if } X_{t-1} = 1 \\ 2/3 & \text{with probability } 1/3 \text{ if } X_{t-1} = 0 \\ 1/3 & \text{with probability } 2/3 \text{ if } X_{t-1} = 0. \end{cases}$$

For this particular strategy one can establish after tedious calculations that $\max_X C_n(F, X) = 2/9 + o_p(1)$, where $o_p(1)$ tends to zero in probability as n tends to infinity. Hence $\min_F \max_X C_n(F, X) < 1/4$ with probability tending to 1. By randomizing in this way we are implicitly operating two forecasting strategies instead of one. Thus Nature finds it harder to miscalibrate F (recall Schervish's comments in the first part of the paper). The rest of this paper will show how to improve this $2/9 + o_p(1)$ down to $o_p(1)$. But all the important features are captured by this simple example, which breaks the 1/4 deterministic lower bound via randomization; a small simulation of the deterministic bound appears below.
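The deterministic half of this comparison is easy to simulate. The sketch below is our own illustration: the running-average rule stands in for an arbitrary deterministic forecaster, and `calibration_score` is the helper defined in Section 2. Playing Dawid's adversarial sequence against it yields a calibration score well above 1/4:

```python
def running_average_forecaster(history):
    """An arbitrary deterministic rule: forecast the empirical frequency
    of wet days so far, rounded to one decimal place."""
    if not history:
        return 0.5
    return round(10 * sum(history) / len(history)) / 10

forecasts, outcomes = [], []
for t in range(10000):
    f = running_average_forecaster(outcomes)
    x = 1 if f <= 0.5 else 0   # Dawid's adversarial choice of X_t
    forecasts.append(f)
    outcomes.append(x)

print(calibration_score(forecasts, outcomes))  # about 0.3, above the 1/4 bound
```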


4 Other proofs of the existence of calibration

Before we show you the actual algorithm that the statistician should use to "win" the calibration game, we will present a non-constructive existence proof. This lovely proof was emailed to us by Sergiu Hart shortly after we presented this paper in Israel. This existence proof constructs a finite game played between the statistician and Nature. Each will have a finite number of strategies and so the minimax theorem will hold. Fix the number of times to be forecast at n. Nature's strategy space is then the set of all $2^n$ binary strings. So that the statistician's strategy space is also finite, we will restrict him to forecasting from a set of k + 1 values, namely 0, 1/k, 2/k, ..., 1. At each time he has to pick one of these forecasts, so for each of the $2^n - 1$ partial sequences he could have observed he must specify a forecast. Thus his strategy space consists of $(k+1)^{2^n - 1}$ pure strategies.

Now suppose that Nature has to pick her strategy first. To achieve her reserve value she will randomize among her choice of pure strategies. We can now assume that the statistician knows the randomization policy that Nature will follow. To use the minimax theorem we now need to specify a strategy for the statistician to follow which will keep his loss less than ε. If we can do this for all possible strategies Nature can follow, then there must exist a strategy that the statistician can follow which will guarantee him a loss less than ε.

How should the statistician behave if he knows what random policy Nature will follow? At each point in time he can compute the conditional probability of the next item in the sequence. He can then round this probability to the nearest i/k value, which he then forecasts. Assuming that k is much less than $n^{1/3}$, his calibration score will be less than 1/k. Thus there exists a strategy which he can follow which will guarantee that he will do better

than this loss regardless of what strategy Nature actually follows. Our goal in the next section is to exhibit such a calibrated forecast.

A constructive version of Hart's proof has been done by Drew Fudenberg and David Levine. They view each step to be forecast as a game and then use the minimax theorem to compute the value of that step. Their proof is targeted at the game theory audience, which might find it easier to understand than the proof we present here. Sergiu Hart and Andreu Mas-Colell have recently shown that Blackwell's approachability theorem can also be used to prove calibration. In particular they show that a no-regret forecast can be found (basically Theorem 4 below). Thus they provide an alternative proof that our algorithm will in fact be calibrated.
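The statistician's best response in Hart's argument is just a rounding step. A one-line sketch (our own illustration; `cond_prob` stands for the conditional probability of the next outcome under Nature's known randomization, which the statistician is assumed able to compute):

```python
def best_response_forecast(cond_prob, k):
    """Round the known conditional probability to the nearest grid point i/k."""
    return round(cond_prob * k) / k
```

The rounding error is at most 1/(2k) per period, consistent with the 1/k calibration bound quoted above.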

5 Constructing an ε-Calibrated Forecast

This section will define the algorithm that we will use to generate an ε-calibrated forecast. The next section will prove that this algorithm actually generates an ε-calibrated forecast. Let A be the set $\{0, \frac{1}{k}, \frac{2}{k}, \ldots, \frac{k-1}{k}, 1\}$. We will find a distribution $\mu_t$ over the set A such that the random forecast F which forecasts $f_t = \frac{i}{k}$ with probability $\mu_t^i$ will be ε-calibrated. First define the expected Brier score for a randomized forecast as

$$EBS_t(F, X) \equiv \sum_{s=1}^{t} \sum_{i=0}^{k} \mu_s^i \left( X_s - \frac{i}{k} \right)^2.$$

This score is averaged over the randomization of the forecast (i.e., $\mu_t$) but not over the data X. Define a new random forecast $F^{i \to j}$ to be exactly the same as F except whenever F makes a forecast of $\frac{i}{k}$, $F^{i \to j}$ makes a forecast of $\frac{j}{k}$. It might happen that $F^{i \to j}$ has a lower Brier score than F and hence is a better forecast than F. When this happens, the difference in their Brier scores is called the regret of changing $\frac{i}{k}$ to $\frac{j}{k}$.

Definition (Regret) Define the regret of changing $\frac{i}{k}$ to $\frac{j}{k}$ to be

$$R_t^{i \to j} \equiv \left( EBS_t(F, X) - EBS_t(F^{i \to j}, X) \right)_+,$$

where $(x)_+$ is the positive part of x. In other words, if we define $S_t^{ij}$ to be the signed difference in the Brier scores,

$$S_t^{ij} = \sum_{s=1}^{t} \mu_s^i \left( X_s - \frac{i}{k} \right)^2 - \sum_{s=1}^{t} \mu_s^i \left( X_s - \frac{j}{k} \right)^2, \qquad (1)$$

then the regret from changing $\frac{i}{k}$ to $\frac{j}{k}$ is

$$R_t^{i \to j} = \begin{cases} S_t^{ij} & \text{if } S_t^{ij} > 0 \\ 0 & \text{otherwise.} \end{cases}$$
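In code, the matrix of signed differences admits a one-line update per period. A sketch under the same assumptions as the other snippets in this paper (our own illustration; numpy, grid A = {i/k}):

```python
import numpy as np

def update_S(S, mu, x, grid):
    """One-period update of the signed Brier differences of equation (1):
    S[i, j] accumulates mu_s^i * [(X_s - i/k)^2 - (X_s - j/k)^2]."""
    loss = (x - grid) ** 2
    return S + np.outer(mu * loss, np.ones_like(grid)) - np.outer(mu, loss)

def regrets(S):
    """Regrets R^{i->j} = (S^{ij})_+ ."""
    return np.maximum(S, 0.0)
```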

In Theorem 3 we show that the calibration score is closely related to the maximal regret; in particular, up to an additive term of order $k^{-2}$,

$$C_t(F, X) \approx \sum_i \max_j \frac{R_t^{i \to j}}{t}.$$

We pick our distribution $\mu_{t+1}$ so that it satisfies the following conservation condition for all i:

$$\sum_{j \ne i} \mu_{t+1}^j R_t^{j \to i} = \mu_{t+1}^i \sum_{j \ne i} R_t^{i \to j}. \qquad (2)$$

Theorem 1 A randomized forecast using the $\mu_t$ defined by the above equation is an ε-calibrated forecast. In particular, $\epsilon = O(k^{-1/2})$.

We will prove this theorem in the next section. The rest of this section will be devoted to showing that the algorithm is well defined, in other words, to solving equation (2). Let A be a matrix defined as follows:

$$a_{ij} = R^{j \to i} \quad \text{for all } i \ne j, \qquad a_{ii} = -\sum_{j \ne i} R^{i \to j}.$$

Notice that the column sums of A are all zero. Let A' be the matrix obtained from A as follows: $a'_{ij} = a_{ij}/B$ where $B = \max_{i,j} |a_{ij}|$. Notice that $|a'_{ij}| \le 1$ and $\sum_i a'_{ij} = 0$. Let $P = A' + I$. Then P will be a non-negative column-stochastic matrix. Hence there is a non-negative probability vector x such that $Px = x$ (since we don't require that x be unique, we don't need any restrictions on the matrix P). Since $P = A' + I$ we deduce that

$$A'x + Ix = x \;\Rightarrow\; A'x = 0 \;\Rightarrow\; Ax = 0.$$

The vector x is the required distribution. Further, it can easily be found by Gaussian elimination.
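The following Python sketch (our own illustration, not the authors' code) assembles the whole forecaster: it maintains the signed differences $S_t^{ij}$ of equation (1), forms the regret matrix, and computes the next distribution from the fixed point $Px = x$. We use power iteration in place of Gaussian elimination for simplicity; numpy is assumed.

```python
import numpy as np

def calibrated_forecaster(outcomes, k, seed=0):
    """Sketch of the epsilon-calibrated forecaster of Section 5.

    outcomes: sequence of 0/1 values X_1, X_2, ...
    k:        grid resolution; forecasts lie on A = {0, 1/k, ..., 1}
    Yields the randomized forecast f_t made in each period.
    """
    rng = np.random.default_rng(seed)
    grid = np.arange(k + 1) / k
    S = np.zeros((k + 1, k + 1))        # signed Brier differences S_t^{ij}
    mu = np.full(k + 1, 1.0 / (k + 1))  # start from the uniform distribution
    for x in outcomes:
        yield rng.choice(grid, p=mu)    # forecast i/k with probability mu_i
        # S^{ij} += mu_i * [(x - i/k)^2 - (x - j/k)^2]
        loss = (x - grid) ** 2
        S += np.outer(mu * loss, np.ones_like(grid)) - np.outer(mu, loss)
        R = np.maximum(S, 0.0)          # regrets R_t^{i->j} = (S_t^{ij})_+
        np.fill_diagonal(R, 0.0)
        # a_{ij} = R^{j->i} for i != j, a_{ii} = -sum_{j != i} R^{i->j}
        A = R.T - np.diag(R.sum(axis=1))
        B = np.abs(A).max()
        if B == 0:                      # no regret yet: keep the current mu
            continue
        P = A / B + np.eye(k + 1)       # non-negative, column sums equal 1
        for _ in range(200):            # power iteration toward Px = x
            mu = P @ mu
        mu /= mu.sum()                  # guard against numerical drift
```

Pairing this generator with `calibration_score` from Section 2 on any fixed sequence should show the score shrinking as t grows, at a rate governed by the choice of k.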

6 Proof that the algorithm works

As F is essentially fixed, we can for convenience suppress the dependence on F in our notation. We will write $n_t(p)$, $\rho_t(p)$ and $C_t$ for $n_t(p; F, X)$, $\rho_t(p; F, X)$ and $C_t(F, X)$ respectively. The proof divides into two steps. In the first step we show that $C_t$ can be closely approximated by something akin to its average value. To this end define modified versions of n, ρ and C (which may be interpreted as averages) as follows.

Base definitions:

$$n_t\!\left(\tfrac{i}{k}\right) \equiv \sum_{s=1}^{t} I_{f_s = \frac{i}{k}}, \qquad \rho_t\!\left(\tfrac{i}{k}\right) \equiv \frac{\sum_{s=1}^{t} I_{f_s = \frac{i}{k}} X_s}{n_t(\frac{i}{k})}, \qquad C_t = \sum_{j=0}^{k} \frac{n_t(\frac{j}{k})}{t} \left( \rho_t\!\left(\tfrac{j}{k}\right) - \tfrac{j}{k} \right)^2.$$

Modified definitions:

$$\tilde n_t\!\left(\tfrac{i}{k}\right) \equiv \sum_{s=1}^{t} \mu_s^i, \qquad \tilde\rho_t\!\left(\tfrac{i}{k}\right) \equiv \frac{\sum_{s=1}^{t} \mu_s^i X_s}{\tilde n_t(\frac{i}{k})}, \qquad \tilde C_t \equiv \sum_{j=0}^{k} \frac{\tilde n_t(\frac{j}{k})}{t} \left( \tilde\rho_t\!\left(\tfrac{j}{k}\right) - \tfrac{j}{k} \right)^2.$$

Note that $n_t(\frac{i}{k}) - \tilde n_t(\frac{i}{k})$ and $\rho_t(\frac{i}{k}) n_t(\frac{i}{k}) - \tilde\rho_t(\frac{i}{k}) \tilde n_t(\frac{i}{k})$ are both martingales. This allows us to approximate $C_t$ by $\tilde C_t$:

Theorem 2 $C_t - \tilde C_t \to 0$ in probability as $t \to \infty$.

Proof: The function

$$C(a_0, a_1, \ldots, a_k; b_0, b_1, \ldots, b_k) = \sum_{j=0}^{k} a_j \left( \frac{b_j}{a_j} - \frac{j}{k} \right)^2$$

is a continuous function over the compact set $0 \le b_i \le a_i \le 1$ (and hence uniformly continuous). We can rewrite $C_t$ and $\tilde C_t$ as

$$C_t = C\!\left( \frac{n_t(0)}{t}, \frac{n_t(\frac{1}{k})}{t}, \ldots, \frac{n_t(1)}{t}; \frac{\rho_t(0) n_t(0)}{t}, \frac{\rho_t(\frac{1}{k}) n_t(\frac{1}{k})}{t}, \ldots, \frac{\rho_t(1) n_t(1)}{t} \right),$$

$$\tilde C_t = C\!\left( \frac{\tilde n_t(0)}{t}, \frac{\tilde n_t(\frac{1}{k})}{t}, \ldots, \frac{\tilde n_t(1)}{t}; \frac{\tilde\rho_t(0) \tilde n_t(0)}{t}, \frac{\tilde\rho_t(\frac{1}{k}) \tilde n_t(\frac{1}{k})}{t}, \ldots, \frac{\tilde\rho_t(1) \tilde n_t(1)}{t} \right).$$

It is sufficient to show that the differences in the arguments converge to zero to establish that $C_t - \tilde C_t$ converges to zero. Since $n_t(\frac{i}{k}) - \tilde n_t(\frac{i}{k})$ and $\rho_t(\frac{i}{k}) n_t(\frac{i}{k}) - \tilde\rho_t(\frac{i}{k}) \tilde n_t(\frac{i}{k})$ are both counting processes, their jumps are bounded by 1, and hence the variances of the jumps are trivially bounded by 1. In other words,

$$\mathrm{var}\!\left( \left( n_t(\tfrac{i}{k}) - \tilde n_t(\tfrac{i}{k}) \right) - \left( n_{t-1}(\tfrac{i}{k}) - \tilde n_{t-1}(\tfrac{i}{k}) \right) \right) \le 1,$$

$$\mathrm{var}\!\left( \left( \rho_t(\tfrac{i}{k}) n_t(\tfrac{i}{k}) - \tilde\rho_t(\tfrac{i}{k}) \tilde n_t(\tfrac{i}{k}) \right) - \left( \rho_{t-1}(\tfrac{i}{k}) n_{t-1}(\tfrac{i}{k}) - \tilde\rho_{t-1}(\tfrac{i}{k}) \tilde n_{t-1}(\tfrac{i}{k}) \right) \right) \le 1,$$

which leads to

$$\mathrm{var}\!\left( \frac{n_t(\tfrac{i}{k}) - \tilde n_t(\tfrac{i}{k})}{t} \right) \le 1/t, \qquad \mathrm{var}\!\left( \frac{\rho_t(\tfrac{i}{k}) n_t(\tfrac{i}{k}) - \tilde\rho_t(\tfrac{i}{k}) \tilde n_t(\tfrac{i}{k})}{t} \right) \le 1/t.$$

Since $L^2$ convergence implies convergence in probability, we see that $C_t - \tilde C_t \to 0$ in probability. If almost sure convergence is desired, it will follow from a similar argument using 4th moments. □

In the second step we show that this 'average' calibration score goes to zero with t. This is done by using the regret to bound the average calibration score and then proving that the regret is asymptotically small. Define

$$\tilde R_t \equiv \sum_{i,j} g(R_t^{i \to j})$$

where

$$g(x) \equiv \begin{cases} x^2 & x \ge 0 \\ 0 & x \le 0. \end{cases} \qquad (3)$$

Note that $g(R_t^{i \to j}) = g(S_t^{ij})$, where $S_t^{ij}$ is defined by equation (1).

Theorem 3 The calibration score is related to the regret by:

$$\sum_i \max_j R_t^{i \to j} \;\le\; t \tilde C_t \;\le\; \sum_i \max_j R_t^{i \to j} + \frac{t}{4k^2} \;\le\; \epsilon \tilde R_t + \frac{k+1}{2\epsilon} + \frac{t}{4k^2}.$$

Proof: First note that

$$S_t^{ij} = \sum_{s=1}^{t} \frac{2(j-i)}{k}\, \mu_s^i \left( X_s - \frac{i+j}{2k} \right) = \frac{2(j-i)}{k}\, \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{i+j}{2k} \right) = \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{i}{k} \right)^2 - \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{j}{k} \right)^2.$$

(The first equality comes from expanding the difference of squared losses in equation (1): $(X_s - \frac{i}{k})^2 - (X_s - \frac{j}{k})^2 = \frac{2(j-i)}{k}(X_s - \frac{i+j}{2k})$.)

So, for every j,

$$\tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{i}{k} \right)^2 = S_t^{ij} + \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{j}{k} \right)^2 \;\ge\; S_t^{ij},$$

and hence the left-hand side is at least $\max_j S_t^{ij} = \max_j R_t^{i \to j}$. Summing both sides over i shows the first inequality. For the second inequality we need only note that the maximum over j of $S_t^{ij}$ occurs at the point where $\tilde n_t(\frac{i}{k})(\tilde\rho_t(\frac{i}{k}) - \frac{j}{k})^2$ is smallest. So

$$\tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{i}{k} \right)^2 = \max_j S_t^{ij} + \min_j \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \tilde\rho_t\!\left(\tfrac{i}{k}\right) - \frac{j}{k} \right)^2 \;\le\; \max_j S_t^{ij} + \tilde n_t\!\left(\tfrac{i}{k}\right) \left( \frac{1}{2k} \right)^2.$$

The minimum is at most $\tilde n_t(\frac{i}{k})(2k)^{-2}$ because some grid point $\frac{j}{k}$ lies within $\frac{1}{2k}$ of $\tilde\rho_t(\frac{i}{k})$. Summing over i (and using $\sum_i \tilde n_t(\frac{i}{k}) = t$), the second inequality follows. Finally, since $1/(2\epsilon) + g(x)\epsilon \ge x$, we see that

$$\max_j R_t^{i \to j} \le \frac{1}{2\epsilon} + \epsilon \max_j g(R_t^{i \to j}) \le \frac{1}{2\epsilon} + \epsilon \sum_j g(R_t^{i \to j});$$

summing over i shows the last inequality. □

Theorem 4 For the $\mu_t$ defined in equation (2), $\tilde R_t \le tk$.

Proof: Note that $g(x + \delta) - g(x) \le 2\delta (x)_+ + \delta^2$. (For $x \ge 0$ this follows from $g(x+\delta) \le (x+\delta)^2$ and $g(x) = x^2$; for $x < 0$ we have $g(x) = 0$ and $g(x+\delta) \le \delta^2$.) Define $L_t^j \equiv (X_t - \frac{j}{k})^2$, so that $S_t^{ij} = S_{t-1}^{ij} + \mu_t^i (L_t^i - L_t^j)$. From

$$g(S_t^{ij}) - g(S_{t-1}^{ij}) \le 2 \mu_t^i (L_t^i - L_t^j) R_{t-1}^{i \to j} + \left( \mu_t^i \right)^2 \left( L_t^i - L_t^j \right)^2$$

we get

$$\tilde R_t - \tilde R_{t-1} = \sum_{i,j} \left\{ g(S_t^{ij}) - g(S_{t-1}^{ij}) \right\} \le 2 \sum_{i,j} \left\{ \mu_t^i L_t^i - \mu_t^i L_t^j \right\} R_{t-1}^{i \to j} + \sum_{i,j} \left( \mu_t^i \right)^2 \left( L_t^i - L_t^j \right)^2.$$

From equation (2) we see that

$$\sum_{i,j:\, i \ne j} \left\{ \mu_t^i L_t^i - \mu_t^i L_t^j \right\} R_{t-1}^{i \to j} = \sum_i L_t^i \left\{ \sum_{j \ne i} \mu_t^i R_{t-1}^{i \to j} - \mu_t^j R_{t-1}^{j \to i} \right\} = 0.$$

So,

$$\tilde R_t - \tilde R_{t-1} \le \sum_{i,j} \left( \mu_t^i \right)^2 \left( L_t^i - L_t^j \right)^2 \le \sum_{i,j:\, i \ne j} \left( \mu_t^i \right)^2 = k \sum_i \left( \mu_t^i \right)^2 \le k.$$

Thus $\tilde R_t - \tilde R_0 \le tk$. But we know that $\tilde R_0 = 0$, so $\tilde R_t \le tk$. □

Combining Theorems 2-4 yields the following obvious but technical corollary.

Corollary For all $\epsilon > 0$, if $k > \epsilon^{-1/2}$, $\epsilon_1 < \epsilon/(4k)$, and $t_0 > 2k/(\epsilon \epsilon_1)$, then for all $t \ge t_0$ we have that $\tilde C_t \le \epsilon$. Further, there exists a $t_1 > t_0$ such that for all $t \ge t_1$, $\mathrm{pr}(C_t < \epsilon) \ge 1 - \epsilon$.
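For concreteness (our own worked numbers, not in the original): taking $\epsilon = 1/10$, the corollary asks for $k > \epsilon^{-1/2} \approx 3.17$, so $k = 4$ suffices; then $\epsilon_1 < \epsilon/(4k) = 1/160$, say $\epsilon_1 = 1/200$; and $t_0 > 2k/(\epsilon \epsilon_1) = 16000$. So after 16000 rounds the averaged score $\tilde C_t$ is below 1/10.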

With care, ε can be chosen to be $O(t^{-1/3})$. Theorem 1 now follows directly from this corollary.

Theorem 1 can be strengthened to almost sure convergence. We will sketch the argument here. First run a $2^{-i}$-calibrated algorithm for a "long time" and then switch to a $2^{-(i+1)}$-calibrated algorithm. Repeat indefinitely. The hard part is defining what a "long time" means. It must be sufficiently long that each stage has a probability of at most $2^{-i}$ of ever being above $2^{-(i-1)}$. It also must be sufficiently long that we can amortize the burn-in period of the $2^{-(i+1)}$-calibrated algorithm. Combining this with the almost sure version of Theorem 2 yields a non-constructive proof of the existence of an algorithm such that $C_t \to 0$ almost surely.

7 Forecasting With Distributions

Suppose that instead of making a point forecast $f_t$ of the probability that $X_t = 1$, we forecast a distribution $\mu_t(\cdot)$. Clearly, the definition of calibration must be generalized if it is to be applied to a distributional forecast. A reasonable definition of $\rho_t(\cdot)$ is

$$\rho_t(p; X) = \frac{\sum_{s \le t} d\mu_s(p) X_s}{\sum_{s \le t} d\mu_s(p)}.$$

Hence, if μ is a distributional forecast, its calibration with respect to X after T periods is

$$C_T(\mu, X) = \frac{1}{T} \sum_{t=1}^{T} \int_0^1 \left( \rho_t(p) - p \right)^2 \, d\mu_t(p).$$

Notice that if the distributional forecast is a degenerate one, i.e., a point forecast, the definition of calibration reduces to the one given earlier in the paper. Given what we know about deterministic point forecasts, we can assert that some distributional forecasts are not calibrated. Any randomized point forecast can be viewed as a distributional forecast. If this is done, the calibration score is exactly the $\tilde C_t$ used in Theorem 2. Simply treat the randomization at each period as the distribution being forecast. In this case the calibration of the distributional forecast is a number, in contrast to the calibration of the associated randomized point forecast, which is a

random variable. This observation yields the following corollary to Theorems 1, 2, and 3:

Corollary There is a distributional forecast $\mu(\cdot)$ such that for all X and $\epsilon > 0$, $C_T(\mu, X) \le \epsilon$ for all T sufficiently large.

If we think of $\mu_t(\cdot)$ as a posterior distribution for $\mathrm{pr}(X_t = 1)$, then we can combine Oakes' result with the corollary to conclude that a posterior mean might not be calibrated for all X, but there are some posterior distributions that are always calibrated. Thus, in terms of calibration, the posterior distribution is a better statistic than the posterior mean.
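For forecasts supported on the grid $\{0, \frac{1}{k}, \ldots, 1\}$, the distributional calibration score above becomes a finite sum. A minimal sketch (our own illustration, using the $1/T$ normalization adopted above):

```python
import numpy as np

def distributional_calibration(mus, outcomes, k):
    """C_T for distributional forecasts supported on {0, 1/k, ..., 1}.

    mus:      array of shape (T, k+1); row t is the forecast distribution mu_t
    outcomes: length-T array of 0/1 outcomes
    """
    mus = np.asarray(mus, dtype=float)
    x = np.asarray(outcomes, dtype=float)
    grid = np.arange(k + 1) / k
    T = len(x)
    score = 0.0
    for t in range(T):
        w = mus[: t + 1].sum(axis=0)          # cumulative weight on each p
        hits = (mus[: t + 1] * x[: t + 1, None]).sum(axis=0)
        rho = np.divide(hits, w, out=grid.copy(), where=w > 0)  # rho_t(p)
        score += mus[t] @ (rho - grid) ** 2   # integral against mu_t
    return score / T
```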

8 Conclusion

Our goal in this paper has been to rescue the notion of calibration. We have done this by generalizing the original definition of calibration offered by Dawid to allow for randomized forecasts. Further, we have shown that this weakened definition is not vacuous, by exhibiting a forecasting scheme that satisfies it.

9 Acknowledgments

We are grateful for some useful comments by A. P. Dawid, Arnold Zellner and Daniel Nelson.

10 References

BLACKWELL, D. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6, 1-8.

BLATTENBERGER, G. and LAD, F. (1985). Separating the Brier Score into Calibration and Refinement Components: A Graphical Exposition. The American Statistician, 39, 26-32.
BRIER, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78, 1-3.
COVER, T. (1991). Universal Portfolios. Mathematical Finance, 1, 1-29.
DAWID, A. P. (1982). The Well Calibrated Bayesian. Journal of the American Statistical Association, 77, 605-613.
DAWID, A. P. (1985). The Impossibility of Inductive Inference. Journal of the American Statistical Association, 80, 340-341.
FEDER, M., MERHAV, N. and GUTMAN, M. (1992). Universal Prediction of Individual Sequences. IEEE Transactions on Information Theory, 38, 1258-1270.
FOSTER, D. P. (1991). A Worst Case Analysis of Prediction. Annals of Statistics, 21, 625-644.
FOSTER, D. P. and VOHRA, R. (1993). A Randomization Rule for Selecting Forecasts. Operations Research, 41, 704-709.
FREUND, Y. and SCHAPIRE, R. (1995). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Proceedings of the Second European Conference on Computational Learning Theory.
HANNAN, J. (1957). Approximation to Bayes Risk in Repeated Play. In M. Dresher, A. W. Tucker and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, 97-139. Princeton University Press.
LITTLESTONE, N. and WARMUTH, M. (1994). The Weighted Majority Algorithm. Information and Computation, 108, 212-261.
MURPHY, A. H. (1972). Scalar and Vector Partitions of the Probability Score. Part I: Two-State Situation. Journal of Applied Meteorology, 11, 273-278.
MURPHY, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12, 595-600.
MURPHY, A. H. and EPSTEIN, E. (1967). Verification of Probabilistic

Predictions: A Brief Review. Journal of Applied Meteorology, 6, 748-755.
OAKES, D. (1985). Self-Calibrating Priors Do Not Exist. Journal of the American Statistical Association, 80, 339.
SCHERVISH, M. (1985). Comment on paper by Oakes. Journal of the American Statistical Association, 80, 341-342.
VOVK, V. (1990). Aggregating Strategies. Proceedings of the 3rd Annual Conference on Computational Learning Theory, 371-383.
