Journal of Artificial Intelligence Research 10 (1999) 67-85

Submitted 6/98; published 2/99

A Counterexample to Theorems of Cox and Fine

Joseph Y. Halpern

[email protected]

Cornell University, Computer Science Department
Ithaca, NY 14853
http://www.cs.cornell.edu/home/halpern

Abstract

Cox's well-known theorem justifying the use of probability is shown not to hold in finite domains. The counterexample also suggests that Cox's assumptions are insufficient to prove the result even in infinite domains. The same counterexample is used to disprove a result of Fine on comparative conditional probability.

1. Introduction

One of the best-known and seemingly most compelling justifications of the use of probability is given by Cox (1946). Suppose we have a function Bel that associates a real number with each pair $(U, V)$ of subsets of a domain $W$ such that $U \neq \emptyset$. We write $\mathrm{Bel}(V|U)$ rather than $\mathrm{Bel}(U, V)$, since we think of $\mathrm{Bel}(V|U)$ as the credibility or likelihood of $V$ given $U$.1 Cox further assumes that $\mathrm{Bel}(\overline{V}|U)$ is a function of $\mathrm{Bel}(V|U)$ (where $\overline{V}$ denotes the complement of $V$ in $W$), that is, there is a function $S$ such that

A1. $\mathrm{Bel}(\overline{V}|U) = S(\mathrm{Bel}(V|U))$ if $U \neq \emptyset$,

and that $\mathrm{Bel}(V \cap V'|U)$ is a function of $\mathrm{Bel}(V'|V \cap U)$ and $\mathrm{Bel}(V|U)$, that is, there is a function $F$ such that

A2. $\mathrm{Bel}(V \cap V'|U) = F(\mathrm{Bel}(V'|V \cap U), \mathrm{Bel}(V|U))$ if $V \cap U \neq \emptyset$.

Notice that if Bel is a probability function, then we can take $S(x) = 1 - x$ and $F(x, y) = xy$. Cox makes much weaker assumptions: he assumes that $F$ is twice differentiable, with a continuous second derivative, and that $S$ is twice differentiable. Under these assumptions, he shows that Bel is isomorphic to a probability distribution in the sense that there is a continuous one-to-one onto function $g : \mathbb{R} \to \mathbb{R}$ such that $g \circ \mathrm{Bel}$ is a probability distribution on $W$, and

$$g(\mathrm{Bel}(V|U)) \times g(\mathrm{Bel}(U)) = g(\mathrm{Bel}(V \cap U)) \text{ if } U \neq \emptyset, \tag{1}$$

where $\mathrm{Bel}(U)$ is an abbreviation for $\mathrm{Bel}(U|W)$.

Not surprisingly, Cox's result has attracted a great deal of interest, particularly in the maximum entropy community and, more recently, in the AI community. For example:

1. Cox writes $V|U$ rather than $\mathrm{Bel}(V|U)$, and takes $U$ and $V$ to be propositions in some language rather than events, i.e., subsets of a given set. This difference is minor: there are well-known mappings from propositions to events, and vice versa. I use events here since they are more standard in the probability literature.

© 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.



• Cheeseman (1988) has called it the "strongest argument for use of standard (Bayesian) probability theory". Similar sentiments are expressed by Jaynes (1978, p. 24); indeed, Cox's Theorem is one of the cornerstones of Jaynes' recent book (1996).

• Horvitz, Heckerman, and Langlotz (1986) used it as a basis for comparison of probability and other nonprobabilistic approaches to reasoning about uncertainty.

• Heckerman (1988) used it as a basis for providing an axiomatization for belief update.

The main contribution of this paper is to show (by means of an explicit counterexample) that Cox's result does not hold in finite domains, even under strong assumptions on $S$ and $F$ (stronger than those made by Cox and those made in all papers proving variants of Cox's results). Since finite domains are arguably those of most interest in AI applications, this suggests that arguments for using probability based on Cox's result, and other justifications similar in spirit, must be taken with a grain of salt, and their proofs carefully reviewed. Moreover, the counterexample suggests that Cox's assumptions are insufficient to prove the result even in infinite domains.

It is known that some assumptions regarding $F$ and $S$ must be made to prove Cox's result. Dubois and Prade (1990) give an example of a function Bel, defined on a finite domain, that is not isomorphic to a probability distribution. For this choice of Bel, we can take $F(x, y) = \min(x, y)$ and $S(x) = 1 - x$. Since min is not twice differentiable, Cox's assumptions block the Dubois-Prade example.

Other authors have made different assumptions. Aczél (1966, Section 7 (Theorem 1)) does not make any assumptions about $F$, but he does make two other assumptions, each of which blocks the Dubois-Prade example. The first is that $\mathrm{Bel}(V|U)$ takes on every value in some range $[e, E]$, with $e < E$. In the Dubois-Prade example, the domain is finite, so this certainly cannot hold. The second is that if $V$ and $V'$ are disjoint, then there is a continuous function $G : \mathbb{R}^2 \to \mathbb{R}$, strictly increasing in each argument, such that

A3. $\mathrm{Bel}(V \cup V'|U) = G(\mathrm{Bel}(V|U), \mathrm{Bel}(V'|U))$.

With these assumptions, he gives a proof much in the spirit of that of Cox to show that Bel is essentially a probability distribution. Dubois and Prade point out that, in their example, there is no function $G$ satisfying A3 (even if we drop the requirement that $G$ be continuous and strictly increasing in each argument).2

Reichenbach (1949) earlier proved a result similar to Aczél's, under somewhat stronger assumptions. In particular, he assumed A3, with $G$ being $+$.

Other variants of Cox's result have also been considered in the literature. For example, Heckerman (1988) and Horvitz, Heckerman, and Langlotz (1986) assume that $F$ is continuous and strictly increasing in each argument and $S$ is continuous and strictly decreasing. Since min is not strictly increasing in each argument, it fails this restriction too.3 Aleliunas (1988) gives yet another collection of assumptions and claims that they suffice to guarantee that Bel is essentially a probability distribution.

2. In fact, Aczél allows there to be a different function $G_U$ for each set $U$ on the right-hand side of the conditional. However, the Dubois-Prade example does not even satisfy this weaker condition.

3. Actually, the restriction that $F$ be strictly increasing in each argument is a little too strong. If $e = \mathrm{Bel}(\emptyset)$, then it can be shown that $F(e, x) = F(x, e) = e$ for all $x$, so that $F$ is not strictly increasing if one of its arguments is $e$.
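As a sanity check on A1, A2, and A3, the following sketch verifies by brute force that an ordinary conditional probability satisfies all three with $S(x) = 1 - x$, $F(x, y) = xy$, and $G(x, y) = x + y$. The four-point domain and its weights are hypothetical and chosen only for illustration; they do not appear in the paper.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 4-point domain with arbitrary positive weights (an assumption,
# not taken from the paper).
WEIGHTS = {"a": 1, "b": 2, "c": 3, "d": 4}
DOMAIN = frozenset(WEIGHTS)


def subsets(points):
    """Return all subsets of `points` as frozensets."""
    items = sorted(points)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def pr(v, u):
    """Conditional probability Pr(V | U) for a nonempty set U."""
    return Fraction(sum(WEIGHTS[w] for w in v & u),
                    sum(WEIGHTS[w] for w in u))


def S(x):
    return 1 - x


def F(x, y):
    return x * y


def G(x, y):
    return x + y


def axioms_hold():
    """Check A1, A2, and A3 for Pr over every choice of U, V, V'."""
    all_sets = subsets(DOMAIN)
    for u in all_sets:
        if not u:
            continue  # conditioning on the empty set is undefined
        for v in all_sets:
            # A1: Pr(complement of V | U) = S(Pr(V | U))
            if pr(DOMAIN - v, u) != S(pr(v, u)):
                return False
            for v2 in all_sets:
                # A2: Pr(V ∩ V' | U) = F(Pr(V' | V ∩ U), Pr(V | U)) if V ∩ U is nonempty
                if v & u and pr(v & v2, u) != F(pr(v2, v & u), pr(v, u)):
                    return False
                # A3: Pr(V ∪ V' | U) = G(Pr(V | U), Pr(V' | U)) if V, V' are disjoint
                if not (v & v2) and pr(v | v2, u) != G(pr(v, u), pr(v2, u)):
                    return False
    return True
```

Exact rational arithmetic is used so that the equalities are tested without floating-point tolerance.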

The first to observe potential problems with Cox's result is Paris (1994). As he puts it, "Cox's proof is not, perhaps, as rigorous as some pedants might prefer and when an attempt is made to fill in all the details some of the attractiveness of the original is lost." Paris provides a rigorous proof of the result, assuming that the range of Bel is contained in $[0, 1]$ and using assumptions similar to those of Horvitz, Heckerman, and Langlotz. In particular, he assumes that $F$ is continuous and strictly increasing in $(0, 1]^2$ and that $S$ is decreasing. However, he makes use of one additional assumption that, as he himself says, is not very appealing:

A4. For all $0 \le \alpha, \beta, \gamma \le 1$ and $\epsilon > 0$, there are sets $U_1 \supseteq U_2 \supseteq U_3 \supseteq U_4$ such that $U_3 \neq \emptyset$, and each of $|\mathrm{Bel}(U_4|U_3) - \alpha|$, $|\mathrm{Bel}(U_3|U_2) - \beta|$, and $|\mathrm{Bel}(U_2|U_1) - \gamma|$ is less than $\epsilon$.

Notice that this assumption forces the range of Bel to be dense in $[0, 1]$. This means that, in particular, the domain $W$ on which Bel is defined cannot be finite.

Is this assumption really necessary? Paris suggests that Aczél needs something like it. (This issue is discussed in further detail below.) The counterexample of this paper gives further evidence. It shows that Cox's result fails in finite domains, even if we assume that the range of Bel is in $[0, 1]$, $S(x) = 1 - x$ (so that, in particular, $S$ is twice differentiable and monotonically decreasing), $G(x, y) = x + y$, and $F$ is infinitely differentiable and strictly increasing on $(0, 1]^2$. We can further assume that $F$ is commutative, $F(0, x) = F(x, 0) = 0$, and that $F(x, 1) = F(1, x) = x$. The example emphasizes the point that the applicability of Cox's result is far narrower than was previously believed. It remains an open question as to whether there is an appropriate strengthening of the assumptions that does give us Cox's result in finite settings. There is further discussion of this issue in Section 5.

In fact, the example shows even more. In the course of his proof, Cox claims to show that $F$ must be an associative function, that is, that $F(x, F(y, z)) = F(F(x, y), z)$. For the Bel of the counterexample, there can be no associative function $F$ satisfying A2. It is this observation that is the key to showing that there is no probability distribution isomorphic to Bel.

What is going on here? Actually, Cox's proof just shows that $F(x, F(y, z)) = F(F(x, y), z)$ only for those triples $(x, y, z)$ such that, for some sets $U_1$, $U_2$, $U_3$, and $U_4$, we have $x = \mathrm{Bel}(U_4|U_3 \cap U_2 \cap U_1)$, $y = \mathrm{Bel}(U_3|U_2 \cap U_1)$, and $z = \mathrm{Bel}(U_2|U_1)$. If the set of such triples $(x, y, z)$ is dense in $[0, 1]^3$, then we conclude by continuity that $F$ is associative. The content of A4 is precisely that the set of such triples is dense in $[0, 1]^3$. Of course, if $W$ is finite, we cannot have density. As my counterexample shows, we do not in general have associativity in finite domains. Moreover, this lack of associativity can result in the failure of Cox's theorem.
A similar problem seems to exist in Aczél's proof (as already observed by Paris (1994)). While Aczél's proof does not involve showing that $F$ is associative, it does involve showing that $G$ is associative. Again, it is not hard to show that $G$ is associative for appropriate triples, just as is the case for $F$. But it seems that Aczél also needs an assumption that guarantees that the appropriate set of triples is dense, and it is not clear that his assumptions do in fact guarantee this.4 As shown in Section 2, the problem also arises in Reichenbach's proof.

The counterexample to Cox's theorem, with slight modifications, can also be used to show that another well-known result in the literature is not completely correct. In his seminal book on probability and qualitative probability (1973), Fine considers a non-numeric notion of comparative (conditional) probability, which allows us to say "$U$ given $V$ is at least as probable as $U'$ given $V'$", denoted $U|V \succeq U'|V'$. Conditions on $\succeq$ are given that are claimed to force the existence of (among other things) a function Bel such that $U|V \succeq U'|V'$ iff $\mathrm{Bel}(U|V) \ge \mathrm{Bel}(U'|V')$ and an associative function $F$ satisfying A2. (This is Theorem 8 of Chapter II in (Fine, 1973).) However, the Bel defined in my counterexample to Cox's theorem can be used to give a counterexample to this result as well.

Interestingly, this is not the first time a similar error has been noted in the use of functional equations. Falmagne (1981) gives another example (in a case involving a utility model of choice behavior) and mentions that he knows "of at least two similar examples in the psychological literature".

The remainder of this paper is organized as follows. In the next section there is a more detailed discussion of the problem in Cox's proof. The counterexample to Cox's theorem is given in Section 3. The following section shows that it is also a counterexample to Fine's theorem. Section 5 concludes with some discussion, particularly of assumptions under which Cox's theorem might hold.

2. The Problem With Cox's Proof

To understand the problems with Cox's proof, I actually consider Reichenbach's proof, which is similar in spirit to Cox's proof (it is actually even closer to Aczél's proof), but uses some additional assumptions, which makes it easier to explain in detail. Aczél, Cox, and Reichenbach all make critical use of functional equations in their proofs, and they make the same (seemingly unjustified) leap at corresponding points in their proofs.

In the notation of this paper, Reichenbach (1949, pp. 65-67) assumes (1) that the range of $\mathrm{Bel}(\cdot|\cdot)$ is a subset of $[0, 1]$, (2) $\mathrm{Bel}(V|U) = 1$ if $U \subseteq V$, (3) that if $V$ and $V'$ are disjoint, then $\mathrm{Bel}(V \cup V'|U) = \mathrm{Bel}(V|U) + \mathrm{Bel}(V'|U)$ (thus, he assumes that A3 holds, with $G$ being $+$), and (4) that A2 holds with a function $F$ that is differentiable. (He remarks that the result holds even without assumption (4), although the proof is more complicated; Aczél in fact does not make an assumption like (4).)

Reichenbach's proof proceeds as follows: Replacing $V'$ in A2 by $V_1 \cup V_2$, where $V_1$ and $V_2$ are disjoint, we get that

$$\mathrm{Bel}(V \cap (V_1 \cup V_2)|U) = F(\mathrm{Bel}(V_1 \cup V_2|V \cap U), \mathrm{Bel}(V|U)). \tag{2}$$

Using the fact that $G$ is $+$, we immediately get

$$\mathrm{Bel}(V \cap (V_1 \cup V_2)|U) = \mathrm{Bel}(V \cap V_1|U) + \mathrm{Bel}(V \cap V_2|U) \tag{3}$$

4. I should stress that my counterexample is not a counterexample to Aczél's theorem, since he explicitly assumes that the range of Bel is infinite. However, it does point out potential problems with his proof, and certainly shows that his argument does not apply to finite domains. Aczél is in fact aware of the problems with his proof [private communication, 1996]. He later proved results in a similar spirit with the aid of a requirement of nonatomicity (Aczél & Daróczy, 1975, pp. 5-6), which is in fact a stronger requirement than A4, and thus also requires the domain to be infinite.

and

$$F(\mathrm{Bel}(V_1 \cup V_2|V \cap U), \mathrm{Bel}(V|U)) = F(\mathrm{Bel}(V_1|V \cap U) + \mathrm{Bel}(V_2|V \cap U), \mathrm{Bel}(V|U)). \tag{4}$$

Moreover, by A2, we also have, for $i = 1, 2$,

$$\mathrm{Bel}(V \cap V_i|U) = F(\mathrm{Bel}(V \cap V_i|V \cap U), \mathrm{Bel}(V|U)). \tag{5}$$

Putting together (2), (3), (4), and (5), we get that

$$F(\mathrm{Bel}(V \cap V_1|V \cap U), \mathrm{Bel}(V|U)) + F(\mathrm{Bel}(V \cap V_2|V \cap U), \mathrm{Bel}(V|U)) = F(\mathrm{Bel}(V \cap V_1|V \cap U) + \mathrm{Bel}(V \cap V_2|V \cap U), \mathrm{Bel}(V|U)). \tag{6}$$

Taking $x = \mathrm{Bel}(V \cap V_1|V \cap U)$, $y = \mathrm{Bel}(V \cap V_2|V \cap U)$, and $z = \mathrm{Bel}(V|U)$ in (6), we get the functional equation

$$F(x, z) + F(y, z) = F(x + y, z). \tag{7}$$

Suppose that we assume (as Reichenbach implicitly does) that this functional equation holds for all $(x, y, z) \in P = \{(x, y, z) \in [0, 1]^3 : x + y \le 1\}$. The rest of the proof now follows easily. First, taking $x = 0$ in (7), it follows that

$$F(0, z) + F(y, z) = F(y, z),$$

from which we get that

$$F(0, z) = 0.$$

Next, fix $z$ and let $g_z(x) = F(x, z)$. Since $F$ is, by assumption, differentiable, from (7) we have that

$$g_z'(x) = \lim_{y \to 0} (F(x + y, z) - F(x, z))/y = \lim_{y \to 0} F(y, z)/y.$$

It thus follows that $g_z'(x)$ is a constant, independent of $x$. Since the constant may depend on $z$, there is some function $h$ such that $g_z'(x) = h(z)$. Using the fact that $F(0, z) = 0$, elementary calculus tells us that

$$g_z(x) = F(x, z) = h(z)x.$$

Using the assumption that for all $U, V$, we have $\mathrm{Bel}(V|U) = 1$ if $U \subseteq V$, we get that

$$\mathrm{Bel}(V|U) = \mathrm{Bel}(V \cap V|U) = F(\mathrm{Bel}(V|V \cap U), \mathrm{Bel}(V|U)) = F(1, \mathrm{Bel}(V|U)).$$

Thus, we have that

$$F(1, z) = h(z) = z.$$

We conclude that $F(x, z) = xz$.

Note, however, that this conclusion depends in a crucial way on the assumption that the functional equation (7) holds for all $(x, y, z) \in P$.5 In fact, all that we can conclude from (6) is that it holds for all $(x, y, z)$ such that there exist $U$, $V$, $V_1$, and $V_2$, with $V_1$ and $V_2$ disjoint, such that $x = \mathrm{Bel}(V \cap V_1|V \cap U)$, $y = \mathrm{Bel}(V \cap V_2|V \cap U)$, and $z = \mathrm{Bel}(V|U)$.

5. Actually, using the continuity of $F$, it suffices that the functional equation holds for a set of triples which is dense in $P$.

Let us say that a triple that satisfies this condition is R-constrained (since it must satisfy certain constraints imposed by the $F$ and $G$ functions; the R here is for Reichenbach, to distinguish this notion from a similar one defined in the next section). As I mentioned earlier, Aczél also assumes that $\mathrm{Bel}(V|U)$ takes on all values in $[e, E]$, where $e = \mathrm{Bel}(\emptyset|U)$ and $E = \mathrm{Bel}(U|U)$. (In Reichenbach's formulation, $e = 0$ and $E = 1$.) There are two ways to interpret this assumption. The weak interpretation is that for each $x \in [0, 1]$, there exist $U, V$ such that $\mathrm{Bel}(V|U) = x$. The strong interpretation is that for each $U$ and $x$, there exists $V$ such that $\mathrm{Bel}(V|U) = x$. It is not clear which interpretation is intended by Aczél. Neither one obviously suffices to prove that every triple in $P$ is R-constrained, although it does seem plausible that it might follow from the second assumption. In any case, neither Aczél nor Reichenbach sees a need to check that Equation (7) holds throughout $P$. (Nor does Cox for his analogous functional equation, nor do the authors of more recent and polished presentations of Cox's result, such as Jaynes (1996) and Tribus (1969).) However, it turns out to be quite necessary to do this. Moreover, it is clear that if $W$ is finite, there are only finitely many triples in $P$ that are R-constrained, and it is not the case that all of $P$ is. As we shall see in the next section, this observation has serious consequences as far as all these proofs are concerned.
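The functional-equation step in Reichenbach's argument can be illustrated with a small check. The sketch below only confirms that the solution $F(x, z) = xz$ satisfies (7) on a rational grid inside $P$ and that the difference quotient used in the derivation is the constant $h(z) = z$. It does not address the gap discussed above: agreement with (7) on finitely many triples says nothing about uniqueness. The function names and the grid size are illustrative assumptions.

```python
from fractions import Fraction


def F(x, z):
    """The solution F(x, z) = x * z that Reichenbach's argument arrives at."""
    return x * z


def satisfies_equation_7(n=10):
    """Check F(x, z) + F(y, z) == F(x + y, z) on a grid of triples in P.

    P is the set of (x, y, z) in [0, 1]^3 with x + y <= 1; the grid uses
    multiples of 1/n (an arbitrary choice made for this sketch).
    """
    grid = [Fraction(k, n) for k in range(n + 1)]
    return all(
        F(x, z) + F(y, z) == F(x + y, z)
        for x in grid for y in grid for z in grid
        if x + y <= 1
    )


def difference_quotient(x, y, z):
    """(F(x + y, z) - F(x, z)) / y, whose limit as y -> 0 is g_z'(x)."""
    return (F(x + y, z) - F(x, z)) / y
```

For $F(x, z) = xz$ the difference quotient does not depend on $x$ or $y$ at all, which matches the claim that $g_z'(x)$ is a constant $h(z)$.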

3. The Counterexample to Cox's Theorem

The goal of this section is to prove

Theorem 3.1: There is a function $\mathrm{Bel}_0$, a finite domain $W$, and functions $S$, $F$, and $G$ satisfying A1, A2, and A3 respectively such that

• $\mathrm{Bel}_0(V|U) \in [0, 1]$ for $U \neq \emptyset$,

• $S(x) = 1 - x$ (so that $S$ is strictly decreasing and infinitely differentiable),

• $G(x, y) = x + y$ (so that $G$ is strictly increasing in each argument and is infinitely differentiable),

• $F$ is infinitely differentiable, nondecreasing in each argument in $[0, 1]^2$, and strictly increasing in each argument in $(0, 1]^2$. Moreover, $F$ is commutative, $F(x, 0) = F(0, x) = 0$, and $F(x, 1) = F(1, x) = x$.

However, there is no one-to-one onto function $g : [0, 1] \to [0, 1]$ satisfying (1).

Note that the hypotheses on $\mathrm{Bel}_0$, $S$, $G$, and $F$ are at least as strong as those made in all the other variants of Cox's result, while the assumptions on $g$ are weaker than those made in the variants. For example, there is no requirement that $g$ be continuous or increasing nor that $g \circ \mathrm{Bel}_0$ is a probability distribution (although Paris and Aczél both prove that, under their assumptions, $g$ can be taken to satisfy all these requirements). This serves to make the counterexample quite strong.

The proof of Theorem 3.1 is constructive. Consider a domain $W$ with 12 points: $w_1, \ldots, w_{12}$. We associate with each point $w \in W$ a weight $f(w)$, as follows.

$f(w_1) = 3$, $f(w_4) = 5 \times 10^4$, $f(w_7) = 3 \times 10^8$, $f(w_{10}) = 3 \times 10^{18}$
$f(w_2) = 2$, $f(w_5) = 6 \times 10^4$, $f(w_8) = 8 \times 10^8$, $f(w_{11}) = 2 \times 10^{18}$
$f(w_3) = 6$, $f(w_6) = 8 \times 10^4$, $f(w_9) = 8 \times 10^8$, $f(w_{12}) = 14 \times 10^{18}$

For a subset $U$ of $W$, we define $f(U) = \sum_{w \in U} f(w)$. Thus, we can define a probability distribution Pr on $W$ by taking $\Pr(U) = f(U)/f(W)$. Let $f'$ be identical to $f$, except that $f'(w_{10}) = (3 - \delta) \times 10^{18}$ and $f'(w_{11}) = (2 + \delta) \times 10^{18}$, where $\delta$ is defined below. Again, we extend $f'$ to subsets of $W$ by defining $f'(U) = \sum_{w \in U} f'(w)$. Let $W' = \{w_{10}, w_{11}, w_{12}\}$. If $U \neq \emptyset$, define

$$\mathrm{Bel}_0(V|U) = \begin{cases} f'(V \cap U)/f'(U) & \text{if } W' \subseteq U, \\ f(V \cap U)/f(U) & \text{otherwise.} \end{cases}$$

$\mathrm{Bel}_0$ is clearly very close to Pr. If $U \neq \emptyset$, then it is easy to see that $|\mathrm{Bel}_0(V|U) - \Pr(V|U)| = |f'(V \cap U) - f(V \cap U)|/f(U) \le \delta$. We choose $\delta > 0$ so that

$$\text{if } \Pr(V|U) > \Pr(V'|U'), \text{ then } \mathrm{Bel}_0(V|U) > \mathrm{Bel}_0(V'|U'). \tag{8}$$

Since the range of Pr is finite, all sufficiently small $\delta$ satisfy (8).

The exact choice of weights above is not particularly important. One thing that is important though is the following collection of equalities:

$$\begin{aligned}
\Pr(w_1|\{w_1, w_2\}) &= \Pr(w_{10}|\{w_{10}, w_{11}\}) = 3/5 \\
\Pr(\{w_1, w_2\}|\{w_1, w_2, w_3\}) &= \Pr(w_4|\{w_4, w_5\}) = 5/11 \\
\Pr(\{w_4, w_5\}|\{w_4, w_5, w_6\}) &= \Pr(\{w_7, w_8\}|\{w_7, w_8, w_9\}) = 11/19 \\
\Pr(w_4|\{w_4, w_5, w_6\}) &= \Pr(\{w_{10}, w_{11}\}|\{w_{10}, w_{11}, w_{12}\}) = 5/19 \\
\Pr(w_1|\{w_1, w_2, w_3\}) &= \Pr(w_7|\{w_7, w_8\}) = 3/11
\end{aligned} \tag{9}$$

It is easy to check that exactly the same equalities hold if we replace Pr by $\mathrm{Bel}_0$.

We show that $\mathrm{Bel}_0$ satisfies the requirements of Theorem 3.1 by a sequence of lemmas. The first lemma is the key to showing that $\mathrm{Bel}_0$ cannot be isomorphic to a probability function. It uses the fact (proved in Lemma 3.3) that if $\mathrm{Bel}_0$ were isomorphic to a probability function, then there would have to be a function $F$ satisfying A2 that is associative. Although, as is shown in Lemma 3.7, the function $F$ satisfying A2 can be taken to be infinitely differentiable and increasing in each argument, the equalities in (9) suffice to guarantee that it cannot be taken to be associative, that is, we do not in general have

$$F(x, F(y, z)) = F(F(x, y), z).$$

Indeed, there is no associative function $F$ satisfying A2, even if we drop the requirements that $F$ be differentiable or increasing.
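The equalities in (9) are simple to confirm mechanically. The following sketch recomputes them from the weights $f(w_i)$ with exact rational arithmetic; the names `F_WEIGHTS`, `pr`, and `check_equalities_9` are hypothetical helpers introduced for this illustration and are not part of the paper.

```python
from fractions import Fraction

# Weights f(w_i) from the construction, indexed by point number 1..12.
F_WEIGHTS = {
    1: 3, 2: 2, 3: 6,
    4: 5 * 10**4, 5: 6 * 10**4, 6: 8 * 10**4,
    7: 3 * 10**8, 8: 8 * 10**8, 9: 8 * 10**8,
    10: 3 * 10**18, 11: 2 * 10**18, 12: 14 * 10**18,
}


def pr(v, u):
    """Pr(V | U) = f(V ∩ U) / f(U) for sets of point indices, U nonempty."""
    v, u = set(v), set(u)
    return Fraction(sum(F_WEIGHTS[w] for w in v & u),
                    sum(F_WEIGHTS[w] for w in u))


# Each row: (left-hand probability, right-hand probability, stated value).
EQUALITIES_9 = [
    (pr({1}, {1, 2}), pr({10}, {10, 11}), Fraction(3, 5)),
    (pr({1, 2}, {1, 2, 3}), pr({4}, {4, 5}), Fraction(5, 11)),
    (pr({4, 5}, {4, 5, 6}), pr({7, 8}, {7, 8, 9}), Fraction(11, 19)),
    (pr({4}, {4, 5, 6}), pr({10, 11}, {10, 11, 12}), Fraction(5, 19)),
    (pr({1}, {1, 2, 3}), pr({7}, {7, 8}), Fraction(3, 11)),
]


def check_equalities_9():
    """Return True if every row of (9) holds exactly."""
    return all(left == right == value for left, right, value in EQUALITIES_9)
```

The widely separated magnitudes of the four weight groups play no role in these five equalities; only the ratios 3 : 2 : 6, 5 : 6 : 8, 3 : 8 : 8, and 3 : 2 : 14 within each group matter here.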

Lemma 3.2: For $\mathrm{Bel}_0$ as defined above, there is no associative function $F$ satisfying A2.

Proof: Suppose there were such a function $F$. From (9), we must have that

$$F(5/11, 11/19) = F(\mathrm{Bel}_0(w_4|\{w_4, w_5\}), \mathrm{Bel}_0(\{w_4, w_5\}|\{w_4, w_5, w_6\})) = \mathrm{Bel}_0(w_4|\{w_4, w_5, w_6\}) = 5/19$$

and that

$$F(3/5, 5/11) = F(\mathrm{Bel}_0(w_1|\{w_1, w_2\}), \mathrm{Bel}_0(\{w_1, w_2\}|\{w_1, w_2, w_3\})) = \mathrm{Bel}_0(w_1|\{w_1, w_2, w_3\}) = 3/11.$$

It follows that

$$F(3/5, F(5/11, 11/19)) = F(3/5, 5/19)$$

and that

$$F(F(3/5, 5/11), 11/19) = F(3/11, 11/19).$$

Thus, if $F$ were associative, we would have

$$F(3/5, 5/19) = F(3/11, 11/19).$$

On the other hand, from (9) again, we see that

$$F(3/5, 5/19) = F(\mathrm{Bel}_0(w_{10}|\{w_{10}, w_{11}\}), \mathrm{Bel}_0(\{w_{10}, w_{11}\}|\{w_{10}, w_{11}, w_{12}\})) = \mathrm{Bel}_0(w_{10}|\{w_{10}, w_{11}, w_{12}\}) = (3 - \delta)/19,$$

while

$$F(3/11, 11/19) = F(\mathrm{Bel}_0(w_7|\{w_7, w_8\}), \mathrm{Bel}_0(\{w_7, w_8\}|\{w_7, w_8, w_9\})) = \mathrm{Bel}_0(w_7|\{w_7, w_8, w_9\}) = 3/19.$$

It follows that $F$ cannot be associative. □

To understand how Lemma 3.2 relates to our discussion in Section 2 of the problems with Reichenbach's proof, we say $(x, y, z)$ is a constrained triple if there exist sets $U_1 \supseteq U_2 \supseteq U_3 \supseteq U_4$ with $U_3 \neq \emptyset$ such that $x = \mathrm{Bel}_0(U_4|U_3)$, $y = \mathrm{Bel}_0(U_3|U_2)$, and $z = \mathrm{Bel}_0(U_2|U_1)$. It is easy to see that A2 forces $F$ to be associative on constrained triples, since if $w = \mathrm{Bel}_0(U_3|U_1)$ and $w' = \mathrm{Bel}_0(U_4|U_2)$, by A2, we have $F(x, F(y, z)) = F(x, w) = \mathrm{Bel}_0(U_4|U_1)$ and $F(F(x, y), z) = F(w', z) = \mathrm{Bel}_0(U_4|U_1)$. A4 says that the set of constrained triples is dense in $[0, 1]^3$.

We similarly define $(x, y)$ to be a constrained pair if there exist sets $U_1 \supseteq U_2 \supseteq U_3$ with $U_2 \neq \emptyset$ such that $x = \mathrm{Bel}_0(U_3|U_2)$ and $y = \mathrm{Bel}_0(U_2|U_1)$. We say that $(U_1, U_2, U_3)$ corresponds to the constrained pair $(x, y)$. (Note that there may be more than one triple of sets corresponding to a constrained pair.) If $(U_1, U_2, U_3)$ corresponds to the constrained pair $(x, y)$ and $F$ satisfies A2, then we must have $F(x, y) = \mathrm{Bel}_0(U_3|U_1)$. Note that both $(3/5, 5/11)$ and $(5/11, 11/19)$ are constrained pairs, although the triple $(3/5, 5/11, 11/19)$ is not constrained. It is this fact that we use in Lemma 3.2.

The next lemma shows that $\mathrm{Bel}_0$ cannot be isomorphic to a probability function.
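Before moving on, the computation in the proof of Lemma 3.2 can be replayed numerically. The sketch below builds $\mathrm{Bel}_0$ from the weights, records the values of $F$ that A2 forces on four constrained pairs, and then evaluates both sides of the associativity identity. The value `DELTA = 1/1000` is an assumption made only for this illustration; whether this particular $\delta$ also satisfies (8) is not checked here. All helper names are hypothetical.

```python
from fractions import Fraction

DELTA = Fraction(1, 1000)  # assumed perturbation; the paper only requires it to be small enough

# Weights f(w_i) from the construction, indexed by point number 1..12.
F_W = {
    1: 3, 2: 2, 3: 6,
    4: 5 * 10**4, 5: 6 * 10**4, 6: 8 * 10**4,
    7: 3 * 10**8, 8: 8 * 10**8, 9: 8 * 10**8,
    10: 3 * 10**18, 11: 2 * 10**18, 12: 14 * 10**18,
}
# f' agrees with f except on w_10 and w_11.
F_PRIME = dict(F_W)
F_PRIME[10] = (3 - DELTA) * 10**18
F_PRIME[11] = (2 + DELTA) * 10**18
W_PRIME = {10, 11, 12}


def bel0(v, u):
    """Bel_0(V | U): use f' when W' is a subset of U, and f otherwise."""
    v, u = set(v), set(u)
    weights = F_PRIME if W_PRIME <= u else F_W
    return (Fraction(sum(weights[w] for w in v & u))
            / Fraction(sum(weights[w] for w in u)))


def forced_value(u1, u2, u3):
    """For a chain U1 ⊇ U2 ⊇ U3, return ((x, y), F(x, y)) as forced by A2."""
    x = bel0(u3, u2)
    y = bel0(u2, u1)
    return (x, y), bel0(u3, u1)


# The four chains of sets used in the proof of Lemma 3.2.
CHAINS = [
    ({4, 5, 6}, {4, 5}, {4}),
    ({1, 2, 3}, {1, 2}, {1}),
    ({10, 11, 12}, {10, 11}, {10}),
    ({7, 8, 9}, {7, 8}, {7}),
]
FORCED = dict(forced_value(*chain) for chain in CHAINS)


def F(x, y):
    """Any F satisfying A2 must take these values on the four constrained pairs."""
    return FORCED[(x, y)]


LEFT = F(Fraction(3, 5), F(Fraction(5, 11), Fraction(11, 19)))   # F(x, F(y, z))
RIGHT = F(F(Fraction(3, 5), Fraction(5, 11)), Fraction(11, 19))  # F(F(x, y), z)
```

With these definitions `LEFT` equals $(3 - \delta)/19$ and `RIGHT` equals $3/19$, so the two sides differ for every $\delta \neq 0$, which is the content of the lemma.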

Lemma 3.3: For $\mathrm{Bel}_0$ as defined above, there is no one-to-one onto function $g : [0, 1] \to [0, 1]$ satisfying (1).

Proof: Suppose there were such a function $g$. First note that $g(\mathrm{Bel}_0(U)) \neq 0$ if $U \neq \emptyset$. For if $g(\mathrm{Bel}_0(U)) = 0$, then it follows from (1) that for all $V \subseteq U$, we have

$$g(\mathrm{Bel}_0(V)) = g(\mathrm{Bel}_0(V|U)) \times g(\mathrm{Bel}_0(U)) = g(\mathrm{Bel}_0(V|U)) \times 0 = 0.$$

Thus, $g(\mathrm{Bel}_0(V)) = g(\mathrm{Bel}_0(U))$ for all subsets $V$ of $U$. Since the definition of $\mathrm{Bel}_0$ guarantees that $\mathrm{Bel}_0(V) \neq \mathrm{Bel}_0(U)$ if $V$ is a strict subset of $U$, this contradicts the assumption that $g$ is one-to-one. Thus, $g(\mathrm{Bel}_0(U)) \neq 0$ if $U \neq \emptyset$. It now follows from (1) that if $U \neq \emptyset$, then

$$g(\mathrm{Bel}_0(V|U)) = g(\mathrm{Bel}_0(V \cap U))/g(\mathrm{Bel}_0(U)). \tag{10}$$

Now define $F(x, y) = g^{-1}(g(x) \times g(y))$. We show that $F$ defined in this way satisfies A2 and is associative. This will give us a contradiction to Lemma 3.2. To see that $F$ satisfies A2, notice that, by applying the observation above repeatedly, if $V \cap U \neq \emptyset$, we get

$$\begin{aligned}
F(\mathrm{Bel}_0(V'|V \cap U), \mathrm{Bel}_0(V|U))
&= g^{-1}(g(\mathrm{Bel}_0(V'|V \cap U)) \times g(\mathrm{Bel}_0(V|U))) \\
&= g^{-1}((g(\mathrm{Bel}_0(V' \cap V \cap U))/g(\mathrm{Bel}_0(V \cap U))) \times (g(\mathrm{Bel}_0(V \cap U))/g(\mathrm{Bel}_0(U)))) \\
&= g^{-1}(g(\mathrm{Bel}_0(V' \cap V \cap U))/g(\mathrm{Bel}_0(U))) \\
&= g^{-1}(g(\mathrm{Bel}_0(V' \cap V|U))) \\
&= \mathrm{Bel}_0(V' \cap V|U).
\end{aligned}$$

Thus, $F$ satisfies A2. To see that $F$ is associative, note that

$$\begin{aligned}
F(F(x, y), z) &= g^{-1}(g(g^{-1}(g(x) \times g(y))) \times g(z)) \\
&= g^{-1}(g(x) \times g(y) \times g(z)) \\
&= g^{-1}(g(x) \times g(g^{-1}(g(y) \times g(z)))) \\
&= F(x, F(y, z)).
\end{aligned}$$

This gives us the desired contradiction to Lemma 3.2. It follows that $\mathrm{Bel}_0$ cannot be isomorphic to a probability function. □

Despite the fact that $\mathrm{Bel}_0$ is not isomorphic to a probability function, functions $S$, $F$, and $G$ can be defined that satisfy A1, A2, and A3, respectively, and all the other requirements stated in Theorem 3.1. The argument for $S$ and $G$ is easy; all the work goes into proving that an appropriate $F$ exists.
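Returning briefly to the construction $F(x, y) = g^{-1}(g(x) \times g(y))$ used in the proof of Lemma 3.3, the sketch below illustrates why any such $F$ is associative, whatever bijection $g$ is chosen. The particular $g(x) = (e^x - 1)/(e - 1)$ is an arbitrary assumption made for this example (it is a continuous increasing bijection of $[0, 1]$ and is unrelated to the paper's $\mathrm{Bel}_0$); the check is numerical, up to floating-point tolerance.

```python
import math

E_MINUS_1 = math.expm1(1.0)  # e - 1, used to normalise g onto [0, 1]


def g(x):
    """An arbitrary increasing bijection of [0, 1] (assumed for illustration)."""
    return math.expm1(x) / E_MINUS_1


def g_inv(t):
    """Inverse of g on [0, 1]."""
    return math.log1p(t * E_MINUS_1)


def F(x, y):
    """F(x, y) = g^{-1}(g(x) * g(y)), as in the proof of Lemma 3.3."""
    return g_inv(g(x) * g(y))


def is_associative_on_grid(n=20, tol=1e-12):
    """Compare F(F(x, y), z) with F(x, F(y, z)) on an (n+1)^3 grid in [0, 1]^3."""
    grid = [k / n for k in range(n + 1)]
    return all(
        math.isclose(F(F(x, y), z), F(x, F(y, z)), abs_tol=tol)
        for x in grid for y in grid for z in grid
    )
```

Associativity here is inherited from multiplication: both groupings reduce to $g^{-1}(g(x) \times g(y) \times g(z))$. Lemma 3.2 rules out an associative $F$ for $\mathrm{Bel}_0$, so no such $g$ can exist in that case.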

Lemma 3.4: There exists an infinitely differentiable, strictly decreasing function $S : [0, 1] \to [0, 1]$ such that $\mathrm{Bel}_0(\overline{V}|U) = S(\mathrm{Bel}_0(V|U))$ for all sets $U, V \subseteq W$ with $U \neq \emptyset$. In fact, we can take $S(x) = 1 - x$.

Proof: This is immediate from the observation that $\mathrm{Bel}_0(\overline{V}|U) = 1 - \mathrm{Bel}_0(V|U)$ for $U, V \subseteq W$. □

Lemma 3.5: There exists an infinitely differentiable function $G : [0, 1]^2 \to [0, 1]$, increasing in each argument, such that if $U, V, V' \subseteq W$, $V \cap V' = \emptyset$, and $U \neq \emptyset$, then $\mathrm{Bel}_0(V \cup V'|U) = G(\mathrm{Bel}_0(V|U), \mathrm{Bel}_0(V'|U))$. In fact, we can take $G(x, y) = x + y$.

Proof: This is immediate from the definition of $\mathrm{Bel}_0$. □

Thus, all that remains is to show that an appropriate $F$ exists. The key step is provided by the following lemma, which essentially shows that there is a well-defined $F$ that is increasing.

Lemma 3.6: If $U_2 \cap U_1 \neq \emptyset$ and $V_2 \cap V_1 \neq \emptyset$, then

(a) if $\mathrm{Bel}_0(V_3|V_2 \cap V_1) \le \mathrm{Bel}_0(U_3|U_2 \cap U_1)$ and $\mathrm{Bel}_0(V_2|V_1) \le \mathrm{Bel}_0(U_2|U_1)$, then $\mathrm{Bel}_0(V_3 \cap V_2|V_1) \le \mathrm{Bel}_0(U_3 \cap U_2|U_1)$,

(b) if $\mathrm{Bel}_0(V_3|V_2 \cap V_1) < \mathrm{Bel}_0(U_3|U_2 \cap U_1)$, $\mathrm{Bel}_0(V_2|V_1) \le \mathrm{Bel}_0(U_2|U_1)$, $\mathrm{Bel}_0(U_3|U_2 \cap U_1) > 0$, and $\mathrm{Bel}_0(U_2|U_1) > 0$, then $\mathrm{Bel}_0(V_3 \cap V_2|V_1) < \mathrm{Bel}_0(U_3 \cap U_2|U_1)$,

(c) if $\mathrm{Bel}_0(V_3|V_2 \cap V_1) \le \mathrm{Bel}_0(U_3|U_2 \cap U_1)$, $\mathrm{Bel}_0(V_2|V_1) < \mathrm{Bel}_0(U_2|U_1)$, $\mathrm{Bel}_0(U_3|U_2 \cap U_1) > 0$, and $\mathrm{Bel}_0(U_2|U_1) > 0$, then $\mathrm{Bel}_0(V_3 \cap V_2|V_1) < \mathrm{Bel}_0(U_3 \cap U_2|U_1)$.

Proof: First observe that if $\mathrm{Bel}_0(V_3|V_2 \cap V_1) \le \mathrm{Bel}_0(U_3|U_2 \cap U_1)$ and $\mathrm{Bel}_0(V_2|V_1) \le \mathrm{Bel}_0(U_2|U_1)$, then from (8), it follows that $\Pr(V_3|V_2 \cap V_1) \le \Pr(U_3|U_2 \cap U_1)$ and $\Pr(V_2|V_1) \le \Pr(U_2|U_1)$. If we have either $\Pr(V_3|V_2 \cap V_1) < \Pr(U_3|U_2 \cap U_1)$ or $\Pr(V_2|V_1) < \Pr(U_2|U_1)$, then we have either $\Pr(V_3 \cap V_2|V_1) < \Pr(U_3 \cap U_2|U_1)$ or $\Pr(U_3|U_2 \cap U_1) = 0$ or $\Pr(U_2|U_1) = 0$. It follows that either $\mathrm{Bel}_0(V_3 \cap V_2|V_1) < \mathrm{Bel}_0(U_3 \cap U_2|U_1)$ (this uses (8) again) or that $\mathrm{Bel}_0(V_3 \cap V_2|V_1) = \mathrm{Bel}_0(U_3 \cap U_2|U_1) = 0$. In either case, the lemma holds. Thus, it remains to deal with the case that $\Pr(V_3|V_2 \cap V_1) = \Pr(U_3|U_2 \cap U_1)$ and $\Pr(V_2|V_1) = \Pr(U_2|U_1)$, and hence $\Pr(V_3 \cap V_2|V_1) = \Pr(U_3 \cap U_2|U_1)$. The details of this analysis are left to the appendix. □

6. Fine assumes that $P(V \cap V'|U) = F(P(V|U), P(V'|V \cap U))$. I have reordered the arguments here for consistency with Cox's theorem.

if $U$ and $U'$ are disjoint sets. Fine [private communication, 1995] suggested that it might be better to constrain QCC7 so that we do not condition on events that are equivalent to $\emptyset$ (where $U$ is equivalent to $\emptyset$ if $\emptyset \succeq U$ and $U \succeq \emptyset$). Since the only event equivalent to $\emptyset$ in the counterexample of the previous section is $\emptyset$ itself, this means that the counterexample can be used without change. This is what is done in the proof below. I show below how to modify the counterexample so that it satisfies Fine's original restrictions.

Theorem 4.1: There exists an ordering $\succeq$ satisfying QCC1, QCC2, QCC5, and QCC7, such that for every function $P$ agreeing with $\succeq$, there is no associative function $F$ of two variables such that $P(V \cap V'|U) = F(P(V'|V \cap U), P(V|U))$.

Proof: Let $W$ and $\mathrm{Bel}_0$ be as in the counterexample in the previous section. Define $\succeq$ so that $\mathrm{Bel}_0$ agrees with $\succeq$. Thus, $V|U \succeq V'|U'$ iff $\mathrm{Bel}_0(V|U) \ge \mathrm{Bel}_0(V'|U')$. Clearly $\succeq$ satisfies QCC1 and QCC2. As was mentioned earlier, since $W$ is finite, $\succeq$ vacuously satisfies QCC5. Lemma 3.6 shows that $\succeq$ satisfies parts (a) and (c) of QCC7. To show that $\succeq$ also satisfies part (b) of QCC7, we must prove that if $\mathrm{Bel}_0(V_3|V_2 \cap V_1) \le \mathrm{Bel}_0(U_2|U_1)$ and $\mathrm{Bel}_0(V_2|V_1) \le \mathrm{Bel}_0(U_3|U_2 \cap U_1)$, then $\mathrm{Bel}_0(V_3 \cap V_2|V_1) \le \mathrm{Bel}_0(U_3 \cap U_2|U_1)$. The proof of this is almost identical to that of Lemma 3.6; we simply exchange the roles of $\Pr(U_2|U_1)$ and $\Pr(U_3|U_2 \cap U_1)$ in that proof. I leave the details to the reader.

Lemma 3.2 shows that there is no associative function $F$ satisfying A2 for $\mathrm{Bel}_0$. All that was used in the proof was the fact that $\mathrm{Bel}_0$ satisfied the equalities of (9). But these equalities must hold for any function agreeing with $\succeq$. Thus, exactly the same proof shows that if $P$ is any function agreeing with $\succeq$, then there is no associative function $F$ satisfying $P(V \cap V'|U) = F(P(V'|V \cap U), P(V|U))$. □

I conclude this section by briefly sketching how the counterexample can be modified so that it satisfies Fine's original restriction. Redefine $W$ by adding one more element $w_0$. Redefine $f$ and $f'$ so that $f(w_0) = f'(w_0) = 10^{-5}$; in addition, redefine $f$ and $f'$ on $w_3$, $w_6$, $w_9$, and $w_{12}$, so as to decrease their weight by $10^{-5}$, the weight of $w_0$. Thus,

• $f(w_3) = f'(w_3) = 6 - 10^{-5}$,

• $f(w_6) = f'(w_6) = 8 \times 10^4 - 10^{-5}$,

• $f(w_9) = f'(w_9) = 8 \times 10^8 - 10^{-5}$, and

• $f(w_{12}) = f'(w_{12}) = 14 \times 10^{18} - 10^{-5}$.

Finally, redefine $W'$ to be $\{w_0, w_{10}, w_{11}, w_{12}\}$. The definition of $\mathrm{Bel}_0$ in terms of $f$, $f'$, and $W'$ remains the same. With these redefinitions, the proofs of the previous section go through essentially unchanged. In particular, the equalities in (9) now hold if we add $w_0$ to every set. Let $\mathcal{F}$ consist of all subsets of $W$ containing $w_0$. Notice that $\mathcal{F}$ is closed under intersection and does not contain the empty set. The lack of associativity in Lemma 3.2 can now be demonstrated by conditioning on sets in $\mathcal{F}$. As a consequence, we get a counterexample to Fine's theorem even when restricting to conditional objects that satisfy his restriction.

5. Discussion

Let me summarize the status of various results in the light of the counterexample of this paper:

• Cox's theorem as originally stated does not hold in finite domains. Moreover, even in infinite domains, the counterexample and the discussion in Section 2 suggest that more assumptions are required for its correctness. In particular, the claim in his proof that $F$ is associative does not follow.

• Although the counterexample given here is not a counterexample to Aczel's theorem, his assumptions do not seem strong enough to guarantee that the function $G$ is associative, as he claims it is.

• The variants of Cox's theorem stated by Heckerman (1988), Horvitz, Heckerman, and Langlotz (1986), and Aleliunas (1988) all succumb to the counterexample.

• The claim that the function $F$ must be associative in Fine's theorem is incorrect. Fine has an analogous result (Fine, 1973, Chapter II, Theorem 4) for unconditional comparative probability involving a function $G$ as in Aczel's theorem. This function too is claimed to be associative, and again, this does not seem to follow (although my counterexample does not apply to that theorem).

Of course, the interesting question now is what it would take to recover Cox's theorem. Paris's assumption A4 suffices, as does the stronger assumption of nonatomicity (see Footnote 4). As we have observed, A4 forces the domain of Bel to be infinite, as does the assumption that the range of Bel is all of $[0, 1]$. We can always extend a domain to an infinite (indeed, uncountable) domain by assuming that we have an infinite collection of independent fair coins, and that we can talk about outcomes of coin tosses as well as the original events in the domain. (This type of "extendibility" assumption is fairly standard; for example, it is made by Savage (1954) in quite a different context.) In such an extended domain, it seems reasonable to also assume that Bel varies uniformly between 0 (certain falsehood) and 1 (certain truth). If we also assume A4 (or something like it), we can then recover Cox's theorem. Notice, however, that this viewpoint disallows a notion of belief that takes on only finitely many gradations.

Another possibility is to observe that we are not interested in just one domain in isolation. Rather, what we are interested in is a notion of belief Bel that applies uniformly to all domains. Thus, even if $(U, V)$ and $(U', V')$ are pairs of subsets of different (perhaps even disjoint) domains, if $\mathrm{Bel}(V \mid U)$ and $\mathrm{Bel}(V' \mid U')$ are both $1/2$, then we would expect this to denote the same relative strength of belief. In this setting, an analogue of A4 seems more reasonable. That is, we can assume that for all $0 \le \alpha, \beta, \gamma \le 1$ and $\epsilon > 0$, there is some domain $W$ and subsets $U_1$, $U_2$, $U_3$, and $U_4$ of $W$ such that the conclusion of A4 holds. If we further assume that the functions $F$, $G$, and $S$ are also uniform across domains (that is, that A1, A2, and A3 hold for the same choice of $F$, $G$, and $S$ in every domain), then we can again recover Cox's theorem.7

7. This point was independently observed by Jeff Paris [private communication, 1996].


The idea of having a notion of uncertainty that applies uniformly in all domains seems implicit in some of the discussion in Jaynes' recent book on probability theory (1996). Jaynes focuses almost exclusively on finite domains.8 As he says, "In principle, every problem must start with such finite set probabilities; extensions to infinite sets is permitted only when this is the result of a well-defined and well-behaved limiting process from a finite set." To make sense of this limiting process, it seems that Jaynes must be assuming that the same notion of uncertainty applies in all domains. Moreover, one can make arguments appealing to continuity that when we consider such limiting processes, we can always find subsets $U_1$, $U_2$, $U_3$, and $U_4$ in some sufficiently rich (but finite) extension of the original domain such that A4 holds. While this seems like perhaps the most reasonable additional assumption required to get Cox's result, it does require us to consider many domains at once. Moreover, it does not allow a notion of belief that has only finitely many gradations, let alone a notion of belief that allows some events to be considered incomparable in likelihood.9

Suppose we really are interested in one particular finite domain, and we do not want to extend it or consider all other possible domains. What assumptions do we then need to get Cox's theorem? The counterexample given here could be circumvented by requiring that $F$ be associative on all tuples (rather than just on the constrained triples). However, if we really are interested in a single domain, the motivation for making requirements on the behavior of $F$ on belief values that do not arise is not so clear. Moreover, it is far from clear that assuming that $F$ is associative suffices to prove the theorem. For example, Cox's proof makes use of various functional equations involving $F$ and $S$, analogous to the equation (7) that appears in Section 2. These functional equations are easily seen to hold for certain tuples. However, as we saw in Section 2, the proof really requires that they hold for all tuples. Just assuming that $F$ is associative does not appear to suffice to guarantee that the functional equations involving $S$ hold for all tuples. Further assumptions appear necessary. Nir Friedman [private communication] has conjectured that the following condition, which says that essentially all beliefs are distinct, suffices:

• if $\emptyset \subset U \subset V$, $\emptyset \subset U' \subset V'$, and $(U, V) \ne (U', V')$, then $\mathrm{Bel}(U \mid V) \ne \mathrm{Bel}(U' \mid V')$.

Even if this condition suffices, note that it precludes, for example, a uniform probability distribution, and thus again seems unduly restrictive.

Another possibly interesting line of research is that of characterizing the functions that satisfy Cox's assumptions. As the example given here shows, the class of such functions includes functions that are not isomorphic to any probability function. I conjecture that in fact it includes only functions that are in some sense "close" to a function isomorphic to a probability distribution, although it is not clear exactly how "close" should be defined (nor how interesting this class really is in practice).

So what does all this say regarding the use of probability? Not much. Although I have tried to argue here that Cox's justification of probability is not quite as strong as previously believed, and the assumptions underlying the variants of it need clarification, I am not trying to suggest that probability should be abandoned. There are many other justifications for its use.

8. Actually, Jaynes assigns probability to propositions, not sets, but, as noted earlier, there is essentially no difference between the two.

9. Interestingly, Jaynes (1996, Appendix A) admits that having plausibility values be elements of a partially ordered lattice may be a reasonable alternative to traditional probability theory. Nir Friedman and I (1995, 1996, 1997) have recently developed such a theory and shown that it provides a useful basis for thinking about default reasoning and belief revision.

Acknowledgments

I'd like to thank Janos Aczel, Peter Cheeseman, Terry Fine, Ron Fagin, Nir Friedman, David Heckerman, Eric Horvitz, Christopher Meek, Jeff Paris, and the anonymous referees for useful comments on the paper. I'd also like to thank Judea Pearl for pointing out Reichenbach's work to me and Janos Aczel for pointing out Falmagne's paper. This work was largely carried out while I was at the IBM Almaden Research Center. IBM's support is gratefully acknowledged. The work was also supported in part by the NSF, under grants IRI-95-03109 and IRI-96-25901, and the Air Force Office of Scientific Research (AFSC), under grant F94620-96-1-0323. A preliminary version of this paper appears in Proc. National Conference on Artificial Intelligence (AAAI '96), pp. 1313-1319.

Appendix A. Proof of Lemma 3.6

Recall that all that remains in the proof of Lemma 3.6 is to deal with the case that $\Pr(U_3 \mid U_2 \cap U_1) = \Pr(V_3 \mid V_2 \cap V_1)$ and $\Pr(U_2 \mid U_1) = \Pr(V_2 \mid V_1)$, and hence $\Pr(U_3 \cap U_2 \mid U_1) = \Pr(V_3 \cap V_2 \mid V_1)$.

Before proceeding with the proof, it is useful to collect some general facts about Pr. A set $U$ is said to be standard if $U$ is a subset of one of $\{w_1, w_2, w_3\}$, $\{w_4, w_5, w_6\}$, $\{w_7, w_8, w_9\}$, or $\{w_{10}, w_{11}, w_{12}\}$. A real number $a$ is said to be relevant if there exists some standard $U$ and some arbitrary $V$ such that $a = \Pr(V \mid U)$. Notice that even if $U \ne \emptyset$ is nonstandard, taking $U'$ to be the standard subset of $U$ which has the greatest weight, we have $|\Pr(V \mid U) - \Pr(V \mid U')| < .002$. (This is the reason that the weights are multiplied by factors such as $10^4$, $10^8$, and $10^{18}$.) Thus, for any subsets $U$ and $V$ of $W$, we have that $\Pr(V \mid U)$ is close to a relevant number (where "close" means "within .002").

Call a triple $(U, V, V')$ of subsets of $W$ good if $\mathrm{Bel}_0(V \cap V' \mid U) = \mathrm{Bel}_0(V' \mid V \cap U) \times \mathrm{Bel}_0(V \mid U)$. Clearly if both $(U_1, U_2, U_3)$ and $(V_1, V_2, V_3)$ are good, then the lemma holds. Notice that if $(U, V, V')$ is not good, then $U \supseteq \{w_{10}, w_{11}, w_{12}\}$ and $f(V \cap \{w_{10}, w_{11}, w_{12}\}) \ne f'(V \cap \{w_{10}, w_{11}, w_{12}\})$, which means that $V \cap \{w_{10}, w_{11}, w_{12}\}$ must contain one of $w_{10}$ and $w_{11}$, but not both, and thus must be one of $\{w_{10}\}$, $\{w_{11}\}$, $\{w_{10}, w_{12}\}$, or $\{w_{11}, w_{12}\}$. Thus, we may as well assume that at least one of $(U_1, U_2, U_3)$ or $(V_1, V_2, V_3)$ is not good. In that case, I claim that one of the following must hold:

• $\mathrm{Bel}_0(V_3 \cap V_2 \mid V_1) = \mathrm{Bel}_0(V_3 \mid V_2 \cap V_1) = \mathrm{Bel}_0(U_3 \mid U_2 \cap U_1) = \mathrm{Bel}_0(U_3 \cap U_2 \mid U_1) = 0$;

• $U_3 \cap U_2 \cap U_1 = U_2 \cap U_1$ and $V_3 \cap V_2 \cap V_1 = V_2 \cap V_1$;

• $f(U_1) = f(V_1)$ and $f(U_1 \cap U_2) = f(V_1 \cap V_2)$.

In the first case, we have already seen that the lemma holds. In the second case, we have $\mathrm{Bel}_0(U_3 \cap U_2 \mid U_1) = \mathrm{Bel}_0(U_2 \mid U_1)$, $\mathrm{Bel}_0(V_3 \cap V_2 \mid V_1) = \mathrm{Bel}_0(V_2 \mid V_1)$, and $\mathrm{Bel}_0(U_3 \mid U_2 \cap U_1) = \mathrm{Bel}_0(V_3 \mid V_2 \cap V_1) = 1$, so the lemma is easily seen to hold. Finally, in the third case, notice that since $\Pr(U_2 \cap U_3 \mid U_1) = \Pr(V_2 \cap V_3 \mid V_1)$, we must also have that $f(U_1 \cap U_2 \cap U_3) = f(V_1 \cap V_2 \cap V_3)$. Moreover, it is easy to see that all these equalities must hold if $f$ is replaced by $f'$. Again, the lemma immediately follows.

To prove the claim, for definiteness, assume that $(U_1, U_2, U_3)$ is not good (an identical argument works if $(V_1, V_2, V_3)$ is not good). From the characterization above of triples that are not good, it follows that $f(U_1 \cap U_2) = a \times 10^{18} + b$ and $f(U_1) = 19 \times 10^{18} + c$, where $a \in \{2, 3, 16, 17\}$ (depending on $U_2 \cap \{w_{10}, w_{11}, w_{12}\}$), and both $b, c < 20 \times 10^8$. Clearly, the relevant number closest to $\Pr(U_2 \mid U_1)$ is $a/19$. Since $\Pr(U_2 \mid U_1) = \Pr(V_2 \mid V_1)$ by assumption, $\Pr(V_2 \mid V_1)$ is also close to $a/19$. Thus, we must have that $f(V_1 \cap V_2) = a \times 10^k + b'$ and $f(V_1) = 19 \times 10^k + c'$, where $k \in \{0, 4, 8, 18\}$. In fact, it is easy to see that $k$ is either 8 or 18, since there are no relevant numbers of the form $a/19$ (for $a \in \{2, 3, 16, 17\}$) that are close to $\Pr(V \mid U)$ if $U \subseteq \{w_1, w_2, w_3, w_4, w_5, w_6\}$. In addition, if $k = 18$, then $b', c' < 20 \times 10^8$, while if $k = 8$, then $b', c' < 20 \times 10^4$. By standard arithmetic manipulation (cross-multiplying the equality $\Pr(U_2 \mid U_1) = \Pr(V_2 \mid V_1)$ and cancelling the common term $19a \times 10^{18+k}$), we have that

$$10^{18}(ac' - 19b') + 10^k(19b - ac) + (bc' - b'c) = 0.$$

If $k = 8$, then it is easy to see that we must have

$$ac' - 19b' = 0, \quad 19b - ac = 0, \quad \text{and} \quad bc' - b'c = 0, \qquad (11)$$

while if $k = 18$, then we must have

$$19(b - b') + a(c' - c) = 0 \quad \text{and} \quad bc' - b'c = 0. \qquad (12)$$

Now comes a case analysis. First suppose that $k = 8$. Then we must have $b' = c' = 0$, since if $c' \ne 0$, then from (11) we have that $b'/c' = a/19$, and it is easy to see that there do not exist sets $X_1$ and $X_2$ such that $f(X_1) = b'$, $f(X_2) = c'$, and $b'/c' = a/19$, with $b', c' < 20 \times 10^4$. Thus, it follows that $\Pr(U_2 \mid U_1) = \Pr(V_2 \mid V_1) = a/19$. Moreover, we must have $V_1 = \{w_7, w_8, w_9\}$ and $V_2 \cap V_1$ either $\{w_7\}$ or $\{w_8, w_9\}$, depending on $a$. It follows that $\Pr(V_3 \mid V_2 \cap V_1)$ must be one of $\{0, 1/2, 1\}$. Since $\Pr(U_3 \mid U_2 \cap U_1) = \Pr(V_3 \mid V_2 \cap V_1)$, we must have that $\Pr(U_3 \mid U_2 \cap U_1) \in \{0, 1/2, 1\}$. Since $U_2 \cap U_1$ contains exactly one of $w_{10}$ and $w_{11}$, it is easy to see that $\Pr(U_3 \mid U_2 \cap U_1)$ cannot be $1/2$. If $\Pr(U_3 \mid U_2 \cap U_1) = \Pr(V_3 \mid V_2 \cap V_1) = 0$, then $U_3 \cap U_2 \cap U_1 = V_3 \cap V_2 \cap V_1 = \emptyset$, and we must have $\mathrm{Bel}_0(U_3 \cap U_2 \mid U_1) = \mathrm{Bel}_0(V_3 \cap V_2 \mid V_1) = 0$, so the claim follows. On the other hand, if $\Pr(U_3 \mid U_2 \cap U_1) = \Pr(V_3 \mid V_2 \cap V_1) = 1$, then $U_3 \cap U_2 \cap U_1 = U_2 \cap U_1$ and $V_3 \cap V_2 \cap V_1 = V_2 \cap V_1$, and the claim again follows.

Now suppose $k = 18$. If $b = b'$, then by (12), we must have that $c = c'$. It immediately follows that $f(U_1) = f(V_1)$ and $f(U_1 \cap U_2) = f(V_1 \cap V_2)$, so the claim holds. Thus, we can suppose $b \ne b'$. Suppose that $b \ne 0$ (an identical argument works if $b' \ne 0$). Then there exists some $\delta \ne 1$ such that $b' = \delta b$. Since $bc' - b'c = 0$, it follows that $c' = \delta c$. Substituting $\delta b$ for $b'$ and $\delta c$ for $c'$ in (12), we get that $\frac{b(1 - \delta)}{c(1 - \delta)} = a/19$, from which it follows that $b/c = a/19$. Moreover, we also get that either $b' = c' = 0$ or $b'/c' = a/19$. It is easy to check that $a$ must be either 3 or 16. If $b'/c' = a/19$, then we must have $b = b'$ and $c = c'$. As we have seen, this suffices to prove the claim. Thus, we can assume that $b' = c' = 0$. But this means that $V_1 = \{w_{10}, w_{11}, w_{12}\}$, and that $V_1 \cap V_2$ is either $\{w_{10}\}$ or $\{w_{11}, w_{12}\}$. It follows that the only possibilities for $\Pr(V_3 \mid V_2 \cap V_1)$ are 0, 1/8, 7/8, or 1. It is easy to see that $\Pr(U_3 \mid U_2 \cap U_1)$ cannot be 1/8 or 7/8, while the cases where it is either 0 or 1 are easily taken care of, as above. This completes the proof of the claim and of the lemma. □
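Several of the "easy to see" numerical facts used in this appendix can be spot-checked by brute force. The Python sketch below is an illustration under stated assumptions, not part of the proof. The weight table `WEIGHTS` is an assumption: it is my recollection of the Section 3 weights, which are not restated in this appendix, chosen to agree with the values the proof relies on (block totals $19 \times 10^8$ and $19 \times 10^{18}$, and $f(w_{12}) = 14 \times 10^{18}$). All function names are hypothetical. Sets are encoded as 12-bit masks, and all comparisons use exact integer or rational arithmetic.

```python
from fractions import Fraction

# ASSUMPTION: weights f(w_1), ..., f(w_12) of the Section 3 construction,
# recalled here rather than quoted from this appendix.
WEIGHTS = [3, 2, 6,
           5 * 10**4, 6 * 10**4, 8 * 10**4,
           3 * 10**8, 8 * 10**8, 8 * 10**8,
           3 * 10**18, 2 * 10**18, 14 * 10**18]
N = len(WEIGHTS)
# Bitmasks of the four blocks {w1,w2,w3}, ..., {w10,w11,w12}.
BLOCKS = [0b111 << shift for shift in (0, 3, 6, 9)]
# WT[m] is the total weight f of the set encoded by bitmask m.
WT = [sum(w for i, w in enumerate(WEIGHTS) if (m >> i) & 1)
      for m in range(1 << N)]


def mask(*indices):
    """Bitmask of the set {w_i : i in indices} (1-based indices)."""
    return sum(1 << (i - 1) for i in indices)


def submasks(m):
    """Yield every subset of m, including m itself and the empty set."""
    v = m
    while True:
        yield v
        if v == 0:
            return
        v = (v - 1) & m


def closeness_holds():
    """Check |Pr(V|U) - Pr(V|U')| < .002 for every nonempty U and every V,
    where U' is the heaviest standard subset of U."""
    for u in range(1, 1 << N):
        u_std = max((u & b for b in BLOCKS if u & b), key=WT.__getitem__)
        for v in submasks(u):  # only the part of V inside U matters
            gap = abs(WT[v] * WT[u_std] - WT[v & u_std] * WT[u])
            if 1000 * gap >= 2 * WT[u] * WT[u_std]:
                return False
    return True


def identity_gap(a, b, c, b2, c2, k):
    """Cross-multiplied form of Pr(U2|U1) = Pr(V2|V1) minus the three-term
    expression in the text (b2, c2 stand for b', c'); zero if they agree."""
    cross = ((a * 10**18 + b) * (19 * 10**k + c2)
             - (a * 10**k + b2) * (19 * 10**18 + c))
    three_terms = (10**18 * (a * c2 - 19 * b2)
                   + 10**k * (19 * b - a * c)
                   + (b * c2 - b2 * c))
    return cross - three_terms


def conditional_values(m):
    """All values Pr(V | U) for the set U encoded by m."""
    return {Fraction(WT[v], WT[m]) for v in submasks(m)}


if __name__ == "__main__":
    print(closeness_holds())
    print(identity_gap(3, 5, 7, 11, 13, 8))
    print(sorted(conditional_values(mask(8, 9))))
    print(sorted(conditional_values(mask(11, 12))))
```

Under the assumed weights, conditioning on $\{w_8, w_9\}$ yields only the values 0, 1/2, and 1, and conditioning on $\{w_{11}, w_{12}\}$ yields only 0, 1/8, 7/8, and 1, matching the two case analyses above. The check says nothing about weights other than the assumed ones.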