International Journal of General Systems, Vol. 35, No. 5, October 2006, 509–528
Uncertainty measures on probability intervals from the imprecise Dirichlet model

J. ABELLÁN*

Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain

(Received 6 February 2006; in final form 3 March 2006)

When we use a mathematical model to represent information, we can obtain a closed and convex set of probability distributions, also called a credal set. This type of representation involves two types of uncertainty, called conflict (or randomness) and non-specificity, respectively. The imprecise Dirichlet model (IDM) allows us to carry out inference about the probability distribution of a categorical variable, obtaining a special type of credal set (probability intervals). In this paper, we present tools for obtaining the uncertainty functions on the probability intervals obtained with the IDM, which enable these functions to be calculated in any application of this model.

Keywords: Imprecise probabilities; Credal sets; Uncertainty; Entropy; Conflict; Imprecise Dirichlet model

*Email: [email protected]
1. Introduction

Since the amount of information obtained by any action is measured by a reduction in uncertainty, the concept of uncertainty is intricately connected to the concept of information. The concept of 'information-based uncertainty' (Klir and Wierman 1998) is related to information deficiencies such as the information being incomplete, imprecise, fragmentary, not fully reliable, vague, contradictory or deficient, and this may result in different types of uncertainty. This paper is solely concerned with information conceived in terms of uncertainty reduction, unlike the term 'information' as it is used in the theory of computability or in logic.

In classic information theory, Shannon's entropy (1948) is the tool used to quantify uncertainty. This function has certain desirable properties and has been used as the starting point when looking for other functions to measure the amount of uncertainty in situations in which a probabilistic representation is not suitable. Many mathematical imprecise probability theories for representing information-based uncertainty are based on a generalization of probability theory: e.g. Dempster–Shafer theory (DST) (Dempster 1967, Shafer 1976), interval-valued probabilities (de Campos et al. 1994), order-2 capacities (Choquet 1953/1954), upper–lower probabilities (Suppes 1974, Fine 1983,
1988) or general convex sets of probability distributions (Good 1962, Levi 1980, Walley 1991, Berger 1994). Each of these represents a type of credal set, that is, a closed and convex set of probability distributions with a finite set of extreme points.

In the DST, Yager (1983) distinguishes between two types of uncertainty: one is associated with cases where the information focuses on sets with empty intersections, and the other with cases where the information focuses on sets whose cardinality is greater than one. These are called conflict and non-specificity, respectively. The study of uncertainty measures in the DST is the starting point for the study of these measures in more general theories. In any of these theories, it is justifiable that a measure capable of quantifying the uncertainty represented by a credal set must quantify both the conflict and the non-specificity parts. More recently, Abellán and Moral (2005b) and Klir and Smith (2001) justified the use of maximum entropy on credal sets as a good measure of total uncertainty. The problem lies in separating this function into others that really do measure the conflict and non-specificity parts when a credal set is used to represent the information. Abellán et al. (2006) managed to split maximum entropy into functions that are capable of coherently measuring the conflict and non-specificity of a credal set $\mathcal{P}$, and also provided algorithms to facilitate their calculation for order-2 capacities (Abellán and Moral 2005a, 2006), so that
$$S^*(\mathcal{P}) = S_*(\mathcal{P}) + (S^* - S_*)(\mathcal{P}),$$

where $S^*$ represents maximum entropy and $S_*$ represents minimum entropy on a credal set $\mathcal{P}$, with $S_*(\mathcal{P})$ coherently quantifying the conflict part of a credal set and $(S^* - S_*)(\mathcal{P})$ the non-specificity part.

A natural way of representing knowledge is with probability intervals (de Campos et al. 1994). In this paper, we shall work with a special type of probability intervals obtained using the imprecise Dirichlet model (IDM). The main use of the IDM is to make inferences about a categorical variable. Abellán and Moral (2003b, 2005b) recently used the IDM jointly with uncertainty measures in classification (an important problem in the field of machine learning). In this paper, we shall study IDM probability intervals and prove that, while they can be represented by belief functions, they are not the only type of credal set that belongs to both belief functions and probability intervals. In addition, we shall present an algorithm that obtains the maximum entropy for this type of interval; we shall prove a property that enables us to obtain the minimum entropy for this type of interval rapidly; and, using the fact that they represent a special type of belief function, we shall directly obtain the value of the Hartley measure on them.

In Section 2 of this paper, we shall introduce the most important imprecise probability theories and distinguish between probability intervals and belief functions. In Section 3, we shall present the IDM and its main properties and shall also examine the situation of IDM probability intervals in relation to other imprecise probability theories. In Section 4, we shall give an overview of uncertainty measures on credal sets. In Section 5, we shall outline procedures and algorithms for obtaining the values of the main uncertainty measures on IDM probability intervals, together with practical examples. Conclusions are presented in Section 6.
2. Theories of imprecise probabilities

2.1 Credal sets

All theories of imprecise probabilities that are based on classical set theory share some common characteristics (see Walley 1991, Klir 2006). One of them is that evidence within each theory is fully described by a lower probability function $P_*$ on a finite set X or, alternatively, by an upper probability function $P^*$ on X. These functions are always regular monotone measures (Wang and Klir 1992) that are superadditive and subadditive, respectively, and

$$\sum_{x \in X} P_*(\{x\}) \le 1, \qquad \sum_{x \in X} P^*(\{x\}) \ge 1. \quad (1)$$

In the various special theories of uncertainty, they possess additional special properties. When evidence is expressed (at the most general level) in terms of an arbitrary credal set $\mathcal{P}$ of probability distribution functions p on a finite set X (Kyburg 1987), the functions $P_*$ and $P^*$ associated with $\mathcal{P}$ are determined for each set $A \subseteq X$ by the formulae

$$P_*(A) = \inf_{p \in \mathcal{P}} \sum_{x \in A} p(x), \qquad P^*(A) = \sup_{p \in \mathcal{P}} \sum_{x \in A} p(x). \quad (2)$$

Since $\sum_{x \in A} p(x) = 1 - \sum_{x \in X - A} p(x)$ for each $p \in \mathcal{P}$ and each $A \subseteq X$, it follows that

$$P^*(A) = 1 - P_*(X - A). \quad (3)$$
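The following Python sketch illustrates equation (2); it assumes the credal set is given by a finite list of its extreme points (the infimum and supremum of a linear function over a convex polytope are attained at extreme points), and the function names and the three-point example are ours, not taken from the paper.

```python
# Sketch: lower and upper probabilities (equation (2)) for a credal set
# given by a finite list of its extreme points.
from itertools import combinations

def lower_upper(extreme_points, A):
    """P_*(A) and P^*(A) as the min/max of p(A) over the extreme points."""
    masses = [sum(p[i] for i in A) for p in extreme_points]
    return min(masses), max(masses)

if __name__ == "__main__":
    # Credal set on X = {0, 1, 2} with three extreme points (cf. Example 1 below).
    V = [(0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)]
    for r in range(1, 3):
        for A in combinations(range(3), r):
            low, up = lower_upper(V, A)
            print(A, low, up)
```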
Owing to this property, the functions $P_*$ and $P^*$ are called dual (or conjugate). One of them is sufficient for capturing given evidence; the other one is uniquely determined by equation (3). It is common to use the lower probability function to capture the evidence.

As is well known (Chateauneuf and Jaffray 1989, Grabisch 2000), any given lower probability function $P_*$ is uniquely represented by a set function m for which $m(\emptyset) = 0$ and

$$\sum_{A \in \wp(X)} m(A) = 1, \quad (4)$$

where $\wp(X)$ denotes the power set of X. Any set $A \subseteq X$ for which $m(A) \ne 0$ is often called a focal element, and the set of all focal elements with the values assigned to them by the function m is called a body of evidence. The function m is called the Möbius representation of $P_*$ when it is obtained for all $A \subseteq X$ via the Möbius transform

$$m(A) = \sum_{B \subseteq A} (-1)^{|A-B|} P_*(B). \quad (5)$$

The inverse transform is defined for all $A \subseteq X$ by the formula

$$P_*(A) = \sum_{B \subseteq A} m(B). \quad (6)$$

It follows directly from equation (5) that

$$P^*(A) = \sum_{B \cap A \ne \emptyset} m(B), \quad (7)$$

for all $A \subseteq X$. Assume now that evidence is expressed in terms of a given lower probability function $P_*$. Then, the set of probability distribution functions that are consistent with $P_*$,
$\mathcal{P}(P_*)$, which is always closed and convex, is defined as follows:

$$\mathcal{P}(P_*) = \left\{ p \;\middle|\; p(x) \in [0,1]\ \forall x \in X,\ \sum_{x \in X} p(x) = 1,\ P_*(A) \le \sum_{x \in A} p(x)\ \forall A \subseteq X \right\}. \quad (8)$$

The special theories of uncertainty are based on lower probability functions with additional properties. Choquet capacities of order 2 are lower and upper probabilities, $P_*$ and $P^*$, that satisfy

$$P_*(A \cup B) \ge P_*(A) + P_*(B) - P_*(A \cap B), \qquad P^*(A \cap B) \le P^*(A) + P^*(B) - P^*(A \cup B), \quad (9)$$

for all $A, B \subseteq X$. Less general uncertainty theories are then based on capacities of order k. For each $k > 2$, the lower and upper probabilities, $P_*$ and $P^*$, satisfy the inequalities

$$P_*\!\left(\bigcup_{j=1}^{k} A_j\right) \ge \sum_{\substack{K \subseteq N_k \\ K \ne \emptyset}} (-1)^{|K|+1} P_*\!\left(\bigcap_{j \in K} A_j\right), \qquad P^*\!\left(\bigcap_{j=1}^{k} A_j\right) \le \sum_{\substack{K \subseteq N_k \\ K \ne \emptyset}} (-1)^{|K|+1} P^*\!\left(\bigcup_{j \in K} A_j\right), \quad (10)$$

for every family of k subsets $A_1, \dots, A_k$ of X, where $N_k = \{1, 2, \dots, k\}$. When these inequalities hold for every $k \ge 2$, the underlying capacity is said to be of order $\infty$. This theory, which was extensively developed by Shafer (1976), is usually referred to as evidence theory or DST. In this theory, lower and upper probabilities are called belief and plausibility measures, denoted Bel and Pl, respectively. An important feature of the DST is that the Möbius representation of evidence, m (usually called a basic probability assignment function in this theory), is a non-negative function ($m(A) \in [0,1]$). Hence, we can obtain the Bel and Pl functions from m in the following way:

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \mathrm{Pl}(A) = \sum_{B \cap A \ne \emptyset} m(B). \quad (11)$$

The DST is thus closely connected with the theory of random sets (Molchanov 2004). When we work with nested families of focal elements, we obtain a theory of graded possibilities, which is a generalization of classical possibility theory (De Cooman 1997, Klir 2006).

2.3 Probability intervals

In this theory, lower and upper probabilities $P_*$ and $P^*$ are determined for all sets $A \subseteq X$ by intervals [l(x), u(x)] of probabilities on singletons ($x \in X$). Clearly, $l(x) = P_*(\{x\})$ and
$u(x) = P^*(\{x\})$, and inequalities (1) must be satisfied. Each given set of probability intervals $I = \{[l(x), u(x)] \mid x \in X\}$ is associated with a credal set, $\mathcal{P}(I)$, of probability distribution functions p, defined as follows:

$$\mathcal{P}(I) = \left\{ p \;\middle|\; p(x) \in [l(x), u(x)]\ \forall x \in X,\ \sum_{x \in X} p(x) = 1 \right\}. \quad (12)$$
Sets defined in this way are clearly special cases of sets defined by equation (8). Their special feature is that they always form an (n − 1)-dimensional polyhedron, where n = |X|. In general, the polyhedron may have c vertices (corners), with $n \le c \le n(n-1)$, and each probability distribution function contained in the set can be expressed as a linear combination of these vertices (Weichselberger and Pöhlmann 1990, de Campos et al. 1994).

A given set I of probability intervals may be such that some combinations of values taken from the intervals do not correspond to any probability distribution function. This indicates that the intervals are unnecessarily broad. To avoid this deficiency, the concept of reachability was introduced in the theory (de Campos et al. 1994). A given set I is called reachable (or feasible) if and only if for each $x \in X$ and every value $v(x) \in [l(x), u(x)]$ there exists a probability distribution function p for which $p(x) = v(x)$. The reachability of any given set I can be easily checked: the set is reachable if and only if it passes the following tests:

$$\sum_{x \in X} l(x) + u(y) - l(y) \le 1, \qquad \sum_{x \in X} u(x) + l(y) - u(y) \ge 1, \qquad \forall y \in X. \quad (13)$$

If I is not reachable, it can be converted to the set $I' = \{[l'(x), u'(x)] \mid x \in X\}$ of reachable intervals by the formulae

$$l'(x) = \max\left\{ l(x),\ 1 - \sum_{y \ne x} u(y) \right\}, \qquad u'(x) = \min\left\{ u(x),\ 1 - \sum_{y \ne x} l(y) \right\}, \quad (14)$$
for all $x \in X$. Given a reachable set I of probability intervals, the lower and upper probabilities are determined for each $A \subseteq X$ by the formulae

$$P_*(A) = \max\left\{ \sum_{x \in A} l(x),\ 1 - \sum_{x \notin A} u(x) \right\}, \qquad P^*(A) = \min\left\{ \sum_{x \in A} u(x),\ 1 - \sum_{x \notin A} l(x) \right\}. \quad (15)$$
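The following Python sketch collects formulas (13)-(15): the reachability test, the conversion to reachable intervals and the induced lower and upper probabilities. The function names are ours, and the numerical data are those of Example 3 below.

```python
# Sketch of reachability (13), conversion to reachable intervals (14)
# and lower/upper probabilities (15) for probability intervals [l(x), u(x)].
def is_reachable(l, u, tol=1e-9):
    n = len(l)
    ok_low = all(sum(l) - l[y] + u[y] <= 1 + tol for y in range(n))
    ok_up = all(sum(u) - u[y] + l[y] >= 1 - tol for y in range(n))
    return ok_low and ok_up

def make_reachable(l, u):
    l2 = [max(l[x], 1 - (sum(u) - u[x])) for x in range(len(l))]
    u2 = [min(u[x], 1 - (sum(l) - l[x])) for x in range(len(l))]
    return l2, u2

def lower_prob(l, u, A):
    A = set(A)
    return max(sum(l[x] for x in A),
               1 - sum(u[x] for x in range(len(l)) if x not in A))

def upper_prob(l, u, A):
    A = set(A)
    return min(sum(u[x] for x in A),
               1 - sum(l[x] for x in range(len(l)) if x not in A))

if __name__ == "__main__":
    l, u = [0.3, 0.2, 0.15], [0.65, 0.55, 0.3]     # intervals of Example 3 below
    print(is_reachable(l, u))                      # True
    print(lower_prob(l, u, {0, 1}), upper_prob(l, u, {0, 1}))   # 0.7, 0.85
```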
The theory based on reachable probability intervals and DST are not comparable in terms of their generalities. However, they both are subsumed under a theory based on Choquet capacities of order 2 as we can see in the following subsection.
2.4 Choquet capacities of order 2

Although Choquet capacities of order 2 do not capture all credal sets, they subsume all the other special uncertainty theories that are examined in this paper. They are thus quite general. Their significance is that they are computationally easier to handle than arbitrary credal sets. In particular, it is easier to compute $\mathcal{P}(P_*)$ defined by equation (8) when $P_*$ is a Choquet capacity of order 2.

Let $X = \{x_1, x_2, \dots, x_n\}$ and let $\sigma = (\sigma(x_1), \sigma(x_2), \dots, \sigma(x_n))$ denote a permutation by which the elements of X are reordered. Then, it is established (de Campos and Bolaños 1989) that for any given Choquet capacity of order 2, $\mathcal{P}(P_*)$ is determined by its extreme points, which are the probability distributions $p_\sigma$ computed as follows:

$$p_\sigma(\sigma(x_1)) = P_*(\{\sigma(x_1)\}),$$
$$p_\sigma(\sigma(x_2)) = P_*(\{\sigma(x_1), \sigma(x_2)\}) - P_*(\{\sigma(x_1)\}),$$
$$\dots$$
$$p_\sigma(\sigma(x_n)) = P_*(\{\sigma(x_1), \dots, \sigma(x_n)\}) - P_*(\{\sigma(x_1), \dots, \sigma(x_{n-1})\}). \quad (16)$$

Each permutation defines an extreme point of $\mathcal{P}(P_*)$, but different permutations can give rise to the same point. The set of distinct probability distributions $p_\sigma$ is often called an interaction representation of $P_*$ (Grabisch 2000).
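A minimal Python sketch of equation (16) is given below. It assumes the 2-monotone lower probability is supplied as a function on frozensets; the example lower probability (the lower envelope of the intervals of Example 1 below) and all names are ours.

```python
# Sketch of equation (16): extreme points of the credal set of a 2-monotone
# (order-2) lower probability, one candidate point per permutation of X.
from itertools import permutations

def extreme_points(low, X):
    """Return the distinct extreme points p_sigma of P(P_*)."""
    pts = set()
    for sigma in permutations(X):
        p, prev, chain = {}, 0.0, set()
        for x in sigma:
            chain.add(x)
            cur = low(frozenset(chain))
            p[x] = cur - prev          # mass added when x enters the chain
            prev = cur
        pts.add(tuple(round(p[x], 10) for x in X))
    return sorted(pts)

if __name__ == "__main__":
    X = (0, 1, 2)
    # Lower envelope of the intervals of Example 1: it depends only on |A|.
    def low(A):
        return {0: 0.0, 1: 0.0, 2: 0.5, 3: 1.0}[len(A)]
    print(extreme_points(low, X))      # the three vertices (0, 0.5, 0.5), ...
```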
Figure 1. Main uncertainty theories ordered by their generalities.
Belief functions and reachable probability intervals represent special types of capacities of order 2, as we can see in Figure 1. However, belief functions are not generalizations of reachable probability intervals, and the inverse does not hold either, as we can see in Examples 1 and 2, respectively.

Example 1. We consider the set $X = \{x_1, x_2, x_3\}$ and the following set of probability intervals on X:

$$L = \{[0, 0.5], [0, 0.5], [0, 0.5]\}.$$

This set of probability intervals L has an associated credal set, $\mathcal{P}_L$, with vertices

$$\{(0.5, 0.5, 0),\ (0.5, 0, 0.5),\ (0, 0.5, 0.5)\}.$$

There does not exist any basic probability assignment for this credal set. To prove this, we suppose the contrary. Using equation (16), it can be proved that the credal set associated with a basic probability assignment on X has the vertices shown in Table 1, where $m_i = m(\{x_i\})$, $m_{ij} = m(\{x_i, x_j\})$, $m_{123} = m(X)$, $i, j \in \{1, 2, 3\}$. Then, a basic probability assignment m with the same credal set, $\mathcal{P}_L$, must verify that

$$m_1 + m_{12} + m_{13} + m_{123} = 0.5, \quad m_2 + m_{12} + m_{23} + m_{123} = 0.5, \quad m_3 + m_{13} + m_{23} + m_{123} = 0.5,$$
$$m_1 = m_2 = m_3 = 0,$$
$$m_2 + m_{23} = 0, \quad m_3 + m_{23} = 0, \quad m_1 + m_{13} = 0, \quad m_3 + m_{13} = 0, \quad m_1 + m_{12} = 0, \quad m_2 + m_{12} = 0,$$

since any other option gives us a contradiction. Hence, we have $m_i = 0$, $m_{ij} = 0$ ($i, j \in \{1, 2, 3\}$) and $m_{123} = 0.5$, implying that m is not a basic probability assignment.

Example 2. We consider the following basic probability assignment m on the finite set $X = \{x_1, x_2, x_3, x_4\}$, defined by

$$m(\{x_1, x_2\}) = 0.5, \qquad m(\{x_3, x_4\}) = 0.5.$$
Table 1. Set of vertices associated with a basic probability assignment on a set of three elements.

σ         p1                          p2                          p3
(1,2,3)   m1 + m12 + m13 + m123       m2 + m23                    m3
(1,3,2)   m1 + m12 + m13 + m123       m2                          m3 + m23
(2,1,3)   m1 + m13                    m2 + m12 + m23 + m123       m3
(2,3,1)   m1                          m2 + m12 + m23 + m123       m3 + m13
(3,1,2)   m1 + m12                    m2                          m3 + m13 + m23 + m123
(3,2,1)   m1                          m2 + m12                    m3 + m13 + m23 + m123
Computing the upper and lower probability values for every $x_i$, we have the following set of probability intervals compatible with m:

$$L = \{[0, 0.5], [0, 0.5], [0, 0.5], [0, 0.5]\};$$

but this set contains the probability distribution $p' = (0.5, 0.5, 0, 0)$ on X, which does not belong to the credal set associated with m, since

$$0 = p'(\{x_3, x_4\}) < \mathrm{Bel}(\{x_3, x_4\}) = 0.5, \qquad 1 = p'(\{x_1, x_2\}) > \mathrm{Pl}(\{x_1, x_2\}) = 0.5.$$

However, it is easy to obtain a set of reachable probability intervals that represents the same credal set as a belief function, as we can see in the following example.

Example 3. We consider the set $X = \{x_1, x_2, x_3\}$ and the following set of reachable probability intervals on X:

$$L = \{[0.3, 0.65], [0.2, 0.55], [0.15, 0.3]\}.$$

This set of probability intervals L has an associated credal set, $\mathcal{P}_L$, with vertices

$$\{(0.65, 0.2, 0.15),\ (0.3, 0.55, 0.15),\ (0.5, 0.2, 0.3),\ (0.3, 0.4, 0.3)\}.$$

Using Table 1, it can be checked that this credal set is also represented by the belief function associated with the basic probability assignment (it has the same set of vertices)

$$m(\{x_1\}) = 0.3, \quad m(\{x_2\}) = 0.2, \quad m(\{x_3\}) = 0.15, \quad m(\{x_1, x_2\}) = 0.2, \quad m(\{x_1, x_2, x_3\}) = 0.15.$$
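A small Python sketch can be used to check Examples 1 and 3 numerically: it computes the Möbius transform (5) of the lower envelope induced by a set of reachable probability intervals; a negative Möbius value shows the credal set is not representable by any belief function, while a non-negative transform recovers the basic probability assignment. The function names are ours.

```python
# Sketch: Moebius transform (equation (5)) of the lower probability (15)
# induced by reachable probability intervals (Examples 1 and 3).
from itertools import combinations

def subsets(X):
    for r in range(len(X) + 1):
        yield from (frozenset(c) for c in combinations(X, r))

def lower_from_intervals(l, u, A):
    out = set(range(len(l))) - set(A)
    return max(sum(l[i] for i in A), 1 - sum(u[i] for i in out)) if A else 0.0

def moebius(l, u):
    X = range(len(l))
    return {A: sum((-1) ** (len(A) - len(B)) * lower_from_intervals(l, u, B)
                   for B in subsets(A))
            for A in subsets(X)}

if __name__ == "__main__":
    for l, u in [([0.0, 0.0, 0.0], [0.5, 0.5, 0.5]),       # Example 1
                 ([0.3, 0.2, 0.15], [0.65, 0.55, 0.3])]:   # Example 3
        m = moebius(l, u)
        print({tuple(sorted(A)): round(v, 4) for A, v in m.items() if abs(v) > 1e-12},
              "is a bpa:", all(v >= -1e-12 for v in m.values()))
```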
3. IDM probability intervals

The IDM was introduced by Walley (1996) to draw inferences about the probability distribution of a categorical variable. Let us assume that Z is a variable taking values on a finite set X and that we have a sample of size N of independent and identically distributed outcomes of Z. If we want to estimate the probabilities $\theta_x = p(x)$ with which Z takes its values, a common Bayesian procedure consists in assuming a prior Dirichlet distribution for the parameter vector $(\theta_x)_{x \in X}$, and then taking the posterior expectation of the parameters given the sample. The Dirichlet distribution depends on the parameters s, a positive real value, and t, a vector of positive real numbers $t = (t_x)_{x \in X}$ verifying $\sum_{x \in X} t_x = 1$. The density takes the form

$$f((\theta_x)_{x \in X}) = \frac{\Gamma(s)}{\prod_{x \in X} \Gamma(s \cdot t_x)} \prod_{x \in X} \theta_x^{s \cdot t_x - 1},$$

where $\Gamma$ is the gamma function. If r(x) is the number of occurrences of the value x in the sample, the expected posterior value of the parameter $\theta_x$ is $(r(x) + s \cdot t_x)/(N + s)$, which is also the Bayesian estimate of $\theta_x$ (under quadratic loss).
The IDM (Walley 1996) only depends on the parameter s and assumes all the possible values of t. This defines a non-closed convex set of prior distributions. It represents a much weaker assumption than a precise prior model, but it still allows useful inferences to be made. In our particular case, where the IDM is applied to a single variable, we obtain a credal set for this variable Z that can be represented by a system of probability intervals. For each parameter $\theta_x$, we obtain a probability interval given by the lower and upper posterior expected values of the parameter given the sample. These intervals can be easily computed and are given by $[r(x)/(N+s),\ (r(x)+s)/(N+s)]$. The associated credal set on X is given by all the probability distributions p' on X such that $p'(x) \in [r(x)/(N+s),\ (r(x)+s)/(N+s)]$ for all x. The intervals are coherent in the sense that if they are computed by taking the infimum and supremum in the credal set, then the same set of intervals is obtained again. The parameter s determines how quickly the lower and upper probabilities converge as more data become available; larger values of s produce more cautious inferences. Walley (1996) does not give a definitive recommendation, but he advocates values between s = 1 and s = 2.

We can define a generalization of a set of IDM probability intervals by allowing the frequencies $r(x_i)$ to be non-negative real numbers. For the sake of simplicity, we use the same name for this type of probability intervals. Formally:

Definition 1. Let $X = \{x_1, \dots, x_n\}$ be a finite set. A set of IDM probability intervals on X is a set

$$L = \left\{ [l_i, u_i] \;\middle|\; l_i = \frac{r(x_i)}{N+s},\ u_i = \frac{r(x_i)+s}{N+s},\ i = 1, 2, \dots, n,\ \sum_{i=1}^{n} r(x_i) = N \right\},$$

where the $r(x_i)$ are non-negative numbers, not all equal to zero, and s is a non-negative parameter.

3.1 Properties

Using the notation in Definition 1, we can state the following properties:

1. Sets of IDM probability intervals generalize probability distributions. For a probability distribution p on a finite set $X = \{x_1, \dots, x_n\}$, it is only necessary to consider s = 0 and $r(x_i) = p(\{x_i\})$ for all $i = 1, \dots, n$.

2. The credal set associated with a set L of IDM probability intervals, $\mathcal{P}_L$, has the following set of vertices $\{v_1, \dots, v_n\}$:

$$v_1 = \left( \frac{r(x_1)+s}{N+s}, \frac{r(x_2)}{N+s}, \dots, \frac{r(x_n)}{N+s} \right), \quad v_2 = \left( \frac{r(x_1)}{N+s}, \frac{r(x_2)+s}{N+s}, \dots, \frac{r(x_n)}{N+s} \right), \quad \dots, \quad v_n = \left( \frac{r(x_1)}{N+s}, \frac{r(x_2)}{N+s}, \dots, \frac{r(x_n)+s}{N+s} \right). \quad (17)$$
3. Denoting by $\mathcal{P}^s_L$ the credal set associated with a set L of IDM probability intervals for a value of the parameter s and a fixed array of values $r = (r(x_1), \dots, r(x_n))$, it can be verified that

$$s_1 \le s_2 \Rightarrow \mathcal{P}^{s_1}_L \subseteq \mathcal{P}^{s_2}_L.$$

4. Every set of IDM probability intervals is a set of reachable probability intervals.

In Section 2.4, we saw that belief functions are not generalizations of probability intervals and that the inverse does not hold either. However, the credal set associated with a set of IDM probability intervals L can also be expressed by a belief function.

Proposition 1. Let L be a set of IDM probability intervals as in Definition 1. The credal set associated with L is the credal set associated with the belief function whose basic probability assignment $m_L$ is

$$m_L(\{x_i\}) = \frac{r(x_i)}{N+s}, \quad i = 1, 2, \dots, n, \qquad m_L(X) = \frac{s}{N+s}, \qquad m_L(A) = 0 \ \ \forall A \subset X,\ 1 < |A| < n. \quad (18)$$

Proof. Using the fact that the lower probability associated with L verifies

$$P_*(\{x_i, \dots, x_j\}) = \frac{r(x_i) + \dots + r(x_j)}{N+s},$$

via the Möbius transform we can obtain the following values:

$$m_L(\{x_i\}) = \frac{r(x_i)}{N+s},$$
$$m_L(\{x_i, x_j\}) = \frac{r(x_i)+r(x_j)}{N+s} - \frac{r(x_i)}{N+s} - \frac{r(x_j)}{N+s} = 0,$$
$$m_L(\{x_i, x_j, x_k\}) = \frac{r(x_i)+r(x_j)+r(x_k)}{N+s} - \frac{r(x_i)+r(x_j)}{N+s} - \frac{r(x_i)+r(x_k)}{N+s} - \frac{r(x_j)+r(x_k)}{N+s} + \frac{r(x_i)}{N+s} + \frac{r(x_j)}{N+s} + \frac{r(x_k)}{N+s} = 0, \quad (19)$$

and so on, for all $i, j, k \in \{1, 2, \dots, n\}$. For a general set A such that $1 < |A| = w < n$, we have

$$m_L(A) = \sum_{B \subseteq A} (-1)^{|A-B|} P_*(B) = \sum_{B \subseteq A} (-1)^{|A-B|} \frac{\sum_{x_i \in B} r(x_i)}{N+s} = \frac{\sum_{x_i \in A} r(x_i)}{N+s} \left[ \binom{w-1}{0} - \binom{w-1}{1} + \binom{w-1}{2} - \dots + (-1)^{w-1}\binom{w-1}{w-1} \right]. \quad (20)$$
Taking into account that

$$0 = (1-1)^{w-1} = \binom{w-1}{0} - \binom{w-1}{1} + \binom{w-1}{2} - \dots + (-1)^{w-1}\binom{w-1}{w-1}, \quad (21)$$

we then have

$$m_L(A) = 0. \quad (22)$$

Now,

$$m_L(X) = 1 - \sum_{x_i \in X} \frac{r(x_i)}{N+s} = \frac{s}{N+s}. \quad (23)$$

Therefore, the $m_L$ obtained is a basic probability assignment on X. Now, let $\mathcal{P}_L$ be the credal set associated with L and let $\mathcal{P}_{m_L}$ be the credal set associated with $m_L$. Then $\mathcal{P}_L = \mathcal{P}_{m_L}$:

(i) Let $p \in \mathcal{P}_L$ be a probability distribution. Then

$$\mathrm{Bel}_{m_L}(A) = \frac{\sum_{x_i \in A} r(x_i)}{N+s} \le p(A) \le \frac{\sum_{x_i \in A} r(x_i) + s}{N+s} = \mathrm{Pl}_{m_L}(A),$$

for all $A \subseteq X$. Hence, $p \in \mathcal{P}_{m_L}$.

(ii) Let $p \in \mathcal{P}_{m_L}$ be a probability distribution. Then

$$\frac{r(x_i)}{N+s} = \mathrm{Bel}_{m_L}(\{x_i\}) \le p(\{x_i\}) \le \mathrm{Pl}_{m_L}(\{x_i\}) = \frac{r(x_i)+s}{N+s},$$

for all $x_i \in X$. Hence, $p \in \mathcal{P}_L$. □
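The following Python sketch collects Definition 1, the vertices (17) and the basic probability assignment of Proposition 1 (equation (18)); it checks that the belief and plausibility of $m_L$ on singletons reproduce the IDM intervals. The function names are ours, and the data are those of Example 4 below.

```python
# Sketch: IDM intervals (Definition 1), vertices (17) and the basic
# probability assignment m_L of Proposition 1 (equation (18)).
def idm_intervals(r, s):
    N = sum(r)
    return [(ri / (N + s), (ri + s) / (N + s)) for ri in r]

def idm_vertices(r, s):
    N = sum(r)
    return [tuple((rj + (s if j == i else 0)) / (N + s) for j in range(len(r)))
            for i in range(len(r))]

def idm_bpa(r, s):
    N = sum(r)
    m = {frozenset([i]): ri / (N + s) for i, ri in enumerate(r)}
    m[frozenset(range(len(r)))] = s / (N + s)        # all remaining mass on X
    return m

def bel_pl(m, A):
    A = frozenset(A)
    return (sum(v for B, v in m.items() if B <= A),  # belief
            sum(v for B, v in m.items() if B & A))   # plausibility

if __name__ == "__main__":
    r, s = [0, 1, 10, 7], 7                          # data of Example 4 below
    print(idm_intervals(r, s))                       # [(0, 7/25), (1/25, 8/25), ...]
    print(idm_vertices(r, s))
    m = idm_bpa(r, s)
    print([bel_pl(m, {i}) for i in range(len(r))])   # singleton Bel/Pl = the intervals
```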
Sets of IDM probability intervals are not the only credal sets that can be expressed jointly by reachable probability intervals and belief functions. As we can observe in Example 3, it is possible for a credal set to be represented by a set of reachable probability intervals and by a belief function although it cannot be represented by a set of IDM probability intervals: in Example 3 the value s/(N+s) would have to be 0.35 according to $l_1$ and $u_1$, but 0.15 according to $l_3$ and $u_3$. However, the description of the credal sets that belong to both reachable probability intervals and belief functions is still an open problem.

In Figure 1, we can see where the sets of IDM probability intervals are placed in relation to the other theories of imprecise probabilities, ordered by generality.

4. An overview of uncertainty measures

It is well established that uncertainty in classical possibility theory is quantified by the Hartley measure (Hartley 1928). For each nonempty and finite set $A \subseteq X$ of possible alternatives, the Hartley measure, H(A), is defined by the formula

$$H(A) = \log_2 |A|, \quad (24)$$
where |A| denotes the cardinality of A. Since H(A) = 1 when |A| = 2, H defined by equation (24) measures uncertainty in bits. The uniqueness of H was proven on axiomatic grounds by Rényi (1970). The type of uncertainty measured by H is usually called non-specificity.

In classical probability theory, a justifiable measure of uncertainty was derived by Shannon (1948). This measure, which is usually referred to as Shannon entropy and denoted by S, is defined for each given probability distribution function p on a finite set X by the formula

$$S(p) = -\sum_{x \in X} p(x) \log_2 p(x). \quad (25)$$

Since S(p) = 1 when |X| = 2 and p(x) = 0.5 for each $x \in X$, S defined by equation (25) also measures uncertainty in bits. However, the type of uncertainty measured by the Shannon entropy is different from the uncertainty type quantified by the Hartley measure; it is well captured by the term conflict. When the classical uncertainty theories are generalized, both types of uncertainty coexist. This requires the Hartley measure and Shannon entropy to be properly generalized in the various theories.

The Hartley measure was first generalized for graded possibilities by Higashi and Klir (1983) and later to the DST by Dubois and Prade (1985). Its generalized form, GH, is defined in terms of the Möbius representation m by the formula

$$GH(m) = \sum_{A \subseteq X} m(A) \log_2 |A|. \quad (26)$$

The uniqueness of this generalized Hartley measure GH was proved for graded possibilities by Klir and Mariano (1987) and for the DST by Ramer (1987). Efforts to generalize the Shannon entropy to the DST were less successful. Although several intuitively promising candidates for the generalized Shannon measure, GS, were published in the 1980s and early 1990s, each was found to violate the essential property of subadditivity. This would have been acceptable if subadditivity were satisfied for the sum GH + GS. Unfortunately, this was not the case for any of the proposed measures. A digest of these frustrating efforts is given in Klir and Wierman (1998) and also in Klir (2006).

In the early 1990s, the unsuccessful attempts to find a generalized Shannon entropy in the DST were replaced with attempts to find an aggregated measure of both types of uncertainty (Harmanec and Klir 1994). An aggregate measure that satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.) was eventually found around the mid-1990s by several authors (see Klir (2006) for more details). This aggregate uncertainty measure is a functional $S^*$ that for each belief function Bel in the DST is defined as follows:

$$S^*(\mathrm{Bel}) = \max_{\mathcal{P}_{\mathrm{Bel}}} \left\{ -\sum_{x \in X} p(x) \log_2 p(x) \right\}, \quad (27)$$

where the maximum is taken over the set $\mathcal{P}_{\mathrm{Bel}}$ of all probability distribution functions p that dominate the given function Bel (i.e. $\mathrm{Bel}(A) \le \sum_{x \in A} p(x)$ for all $A \subseteq X$). This functional can be readily generalized to any given convex set of probability distributions, as shown by Abellán and Moral (2003a). Useful algorithms for computing $S^*$ were developed for the DST by Harmanec et al. (1996), for reachable interval-valued probability distributions by Abellán
and Moral (2003a), and for the theory based on Choquet order-2 capacities (2-monotone measures) by Abellán and Moral (2006).

Although the functional $S^*$ is acceptable on mathematical grounds as an aggregate measure of uncertainty in any uncertainty theory where evidence can be represented in terms of arbitrary convex sets of probability distributions, it is highly insensitive to changes in evidence due to its aggregated nature (Klir and Smith 2001) and, moreover, it does not explicitly show measures of the two coexisting types of uncertainty, i.e. non-specificity and conflict. It is therefore desirable to disaggregate it. Clearly, $S^* = GH + GS$, where GH and GS denote, respectively, a generalized Hartley measure (measuring non-specificity) and a generalized Shannon entropy (measuring conflict). Since $S^*$ and GH (defined by equations (27) and (26), respectively) are well established (at least in the DST), it is suggestive to define GS indirectly as the difference $S^* - GH$, provided that it is non-negative. It was proven by Smith (2000) that $S^* - GH \ge 0$, and therefore it is meaningful to take $GS = S^* - GH$ as the generalized Shannon entropy. The disaggregated total uncertainty measure, TU, is then defined as the pair

$$TU = \langle GH, GS \rangle, \quad (28)$$

where GH is defined by equation (26), $S^*$ is defined by equation (27), and $GS = S^* - GH$. The function GH + GS is guaranteed to satisfy all the required mathematical properties (since $GH + GS = S^*$), and it does not matter whether either of the two TU components also satisfies them. This is important since subadditivity of GH is not guaranteed beyond the DST, as demonstrated in Abellán and Moral (2005c).

The idea of disaggregating $S^*$ into two components (measures of non-specificity and conflict) has opened new possibilities. One of these is based on the recognition that the following two functionals can be defined for each credal set $\mathcal{P}$:

$$S^*(\mathcal{P}) = \max_{p \in \mathcal{P}} \left\{ -\sum_{x \in X} p(x) \log_2 p(x) \right\}, \qquad S_*(\mathcal{P}) = \min_{p \in \mathcal{P}} \left\{ -\sum_{x \in X} p(x) \log_2 p(x) \right\}. \quad (29)$$

The significance of these functionals and their difference, $S^* - S_*$, for capturing the uncertainty associated with convex sets of probability distributions was first discussed by Kapur (1994) and Kapur et al. (1995). It was also suggested by Smith (2000) and Klir and Smith (2001). More recently, Abellán and Moral (2005a,b) further investigated properties of the difference $S^* - S_*$ and described an algorithm for calculating the value of $S_*$, which is applicable to any convex set of probability distributions whose lower probability function is a Choquet order-2 capacity. They suggested that it is reasonable to view this difference as an alternative measure of non-specificity. In other words, they suggested that a measure of non-specificity, N, be defined for each credal set $\mathcal{P}$ of probability distributions by the formula

$$N(\mathcal{P}) = S^*(\mathcal{P}) - S_*(\mathcal{P}). \quad (30)$$
They also showed that the functional N possesses the following properties:

1. $N(\mathcal{P}) \in [0, \log_2 |X|]$, where X denotes the set of all alternatives (elementary events) on which the probability distributions in $\mathcal{P}$ are defined; $N(\mathcal{P}) = 0$ when $\mathcal{P}$ consists of a single
probability distribution; $N(\mathcal{P}) = \log_2 |X|$ when $\mathcal{P}$ consists of all the probability distributions that can be defined on X (total ignorance expressed by vacuous probabilities).

2. N is monotone increasing with respect to the subsethood relationship between sets of probability distributions defined on the same set X: for all $^i\mathcal{P}$ and $^j\mathcal{P}$, if $^i\mathcal{P} \subseteq {}^j\mathcal{P}$ then $N(^i\mathcal{P}) \le N(^j\mathcal{P})$.

3. N is continuous.

4. N is additive.

These properties, which every non-specificity measure must possess, gave rise to the suggestion that this functional may be viewed as a measure of non-specificity. Unfortunately, contrary to the generalized Hartley functional, the functional N violates the essential requirement of subadditivity in virtually any uncertainty theory, including the DST. This means that N is not acceptable alone as a measure of non-specificity. However, when it is considered as one component of a disaggregated total uncertainty measure, the lack of subadditivity of the individual components is of no consequence; the only thing that matters is that the aggregated uncertainty $S^*$ satisfies all the essential requirements, including subadditivity. This suggests (see Abellán et al. 2006) that an alternative disaggregated total uncertainty, $^aTU$, be defined as the pair

$$^aTU = \langle S^*(\mathcal{P}) - S_*(\mathcal{P}),\ S_*(\mathcal{P}) \rangle. \quad (31)$$
It can be observed that the first component of $^aTU$ is the alternative non-specificity measure N, while the second component, $S_*$, is a generalized Shannon measure (a general measure of conflict). When the two components are aggregated, we obtain $S^*$, and this functional clearly satisfies all the essential mathematical requirements. Therefore, although neither of the $^aTU$ components is subadditive, this is of no importance, since the aggregated uncertainty $S^*$ is subadditive.

It is interesting to observe that the functional $S_*$ has often been considered as one of the candidates for the generalized Shannon entropy. This was dismissed since it is neither subadditive itself nor subadditive when aggregated with the generalized Hartley measure GH. However, it is perfectly justifiable when aggregated with the alternative measure of non-specificity N. In fact, some of the other candidates considered for the generalized Shannon entropy could now be considered on similar grounds, although the functional $S_*$ seems to be better justified than its competitors, not only for its properties but also for its behaviour and its applicability to all credal sets. Nevertheless, viewing the measure of non-specificity, in general, as the difference between the aggregate uncertainty $S^*$ and the generalized Shannon entropy GS opens a new area of research, the purpose of which is to compare the various candidates for GS with the functional $S_*$.

5. Computation for uncertainty measures

5.1 Upper entropy on the IDM

In Abellán and Moral (2005b) we presented an algorithm to compute the upper entropy for order-2 capacities as an extension of the algorithm presented by Meyerowitz et al. (1994) for belief functions. For specific sets of probability intervals, we presented a more efficient
algorithm in Abellán and Moral (2003a) that can be used to obtain the upper entropy for a set of IDM probability intervals. Owing to the special structure of this type of interval, we can simplify the algorithm of Abellán and Moral (2003a) for the IDM. In order to express the algorithm, we shall assume that the following functions and procedures have already been implemented:

Min(l) returns the index of the minimum value of the array l.
Sig(l) returns the index of the second smallest value of the array l.
Nmin(l) returns the number of indices that attain the minimum value of the array l.
Minof(a,b) returns the minimum of the two real numbers a and b.

Now, let l be the array with the lower values of the set of IDM probability intervals L on a finite set X = {x_1, ..., x_n} as in Definition 1, i.e. l_i = r(x_i)/(N+s). Let p̂ be the array with the probability distribution of maximum entropy. The algorithm is:

    GetMaxEntro(l, p̂, s)
      min ← Min(l); sig ← Sig(l); nmin ← Nmin(l);
      For i = 1 to n
        If l_i = l_min then
          l_i ← l_i + Minof(l_sig − l_min, s/nmin);
          p̂_i ← l_i;
      s ← s − nmin · Minof(l_sig − l_min, s/nmin);
      If s > 0 then GetMaxEntro(l, p̂, s).

The proof of this algorithm is the same as the one in Abellán and Moral (2003a). In IDM applications to classification methods, such as the ones in Abellán and Moral (2005b) or in Zaffalon (1999), the r(x_i) values are non-negative integers and s is equal to 1 or 2. In Walley (1996), the author presents arguments in favour of values s ∈ [1,2]. In addition, in Bernard (2005), new strong arguments for s = 2 are presented. For these values of s ∈ [1,2], we can obtain the upper entropy more quickly. First, we must determine A = {x_j : r(x_j) = min_i {r(x_i)}}. If z is the number of elements in A, then, denoting the distribution with maximum entropy by p̂, the algorithm GetMaxEntro can be simplified using the following procedure:

1. Case z > 1 or s = 1:

$$\hat{p}(x_i) = \begin{cases} \dfrac{r(x_i) + s/z}{N+s} & \text{if } x_i \in A, \\[2mm] \dfrac{r(x_i)}{N+s} & \text{otherwise.} \end{cases}$$
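A direct Python transcription of GetMaxEntro may help to check the procedure. This is a sketch under the assumption that ties are handled with a small numerical tolerance; the recursion is unrolled into a loop, and the function and variable names are ours.

```python
# Sketch: GetMaxEntro for IDM probability intervals.
import math

def get_max_entro(l, s, tol=1e-12):
    """Maximum-entropy distribution in the IDM credal set.
    l -- list of lower bounds r(x_i)/(N+s);  s -- remaining mass s/(N+s)."""
    p = list(l)
    while s > tol:
        lmin = min(p)
        bigger = [v for v in p if v > lmin + tol]        # values above the minimum
        nmin = sum(1 for v in p if v <= lmin + tol)      # how many attain the minimum
        step = s / nmin if not bigger else min(min(bigger) - lmin, s / nmin)
        p = [v + step if v <= lmin + tol else v for v in p]
        s -= nmin * step
    return p

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

if __name__ == "__main__":
    r, s, N = [0, 1, 10, 7], 7, 18                       # data of Example 4 below
    p = get_max_entro([ri / (N + s) for ri in r], s / (N + s))
    print([round(x, 4) for x in p], round(entropy(p), 4))
```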
5.2 Lower entropy on the IDM

We shall use the following result of Wasserman and Kadane (1996): if p and q are probability distributions on X whose components, reordered in a decreasing way, satisfy $\sum_{i=1}^{w} p_i \ge \sum_{i=1}^{w} q_i$ for every w, then $S(p) \le S(q)$.

Theorem 1. With the above notation, let L be a set of IDM probability intervals on X. The lower entropy of $\mathcal{P}_L$ is attained at the probability distribution $\underline{p}$ on X given by

$$\underline{p} = (u^*_1, l^*_2, l^*_3, \dots, l^*_n), \quad (32)$$

where the asterisk denotes the reordering of the intervals by decreasing frequency $r(x_i)$.
Proof. Denoting by $r^*$ the array of frequencies r reordered in a decreasing way, $\underline{p}$ is the array

$$\underline{p} = \left( \frac{r^*(x_1)+s}{N+s}, \frac{r^*(x_2)}{N+s}, \dots, \frac{r^*(x_n)}{N+s} \right).$$

For any probability distribution $q \in \mathcal{P}_L$, with $q^*$ being its corresponding reordered array, it can be verified that

$$\frac{r^*(x_i)}{N+s} \le q^*_i \le \frac{r^*(x_i)+s}{N+s} \quad \forall i, \qquad \frac{r^*(x_i)+r^*(x_j)}{N+s} \le q^*_i + q^*_j \le \frac{r^*(x_i)+r^*(x_j)+s}{N+s} \quad \forall i, j, \quad (33)$$
$$\dots$$
$$\frac{r^*(x_i)+r^*(x_j)+\cdots+r^*(x_k)}{N+s} \le q^*_i + q^*_j + \cdots + q^*_k.$$

Hence

$$\underline{p}_1 = \frac{r^*(x_1)+s}{N+s} \ge q^*_1, \qquad \underline{p}_1 + \underline{p}_2 = \frac{r^*(x_1)+r^*(x_2)+s}{N+s} \ge q^*_1 + q^*_2, \qquad \dots \quad (34)$$
$$\sum_{i=1}^{w} \underline{p}_i = \frac{\sum_{i=1}^{w} r^*(x_i) + s}{N+s} \ge \sum_{i=1}^{w} q^*_i, \qquad 3 \le w \le n.$$

Now, using the lemma of Wasserman and Kadane, $S(\underline{p}) \le S(q)$ for all $q \in \mathcal{P}_L$ (we should mention that it is immediate that $S(p) = S(p^*)$ for any probability distribution p). □
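The minimum-entropy distribution of Theorem 1 is easy to compute; the following Python sketch (with names of our own) builds it directly from the counts and reproduces the examples below.

```python
# Sketch of Theorem 1: the minimum-entropy distribution in the IDM credal
# set puts the whole extra mass s on (one of) the most frequent values.
import math

def get_min_entro(r, s):
    N = sum(r)
    j = max(range(len(r)), key=lambda i: r[i])       # index of a largest count
    return [(ri + (s if i == j else 0)) / (N + s) for i, ri in enumerate(r)]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

if __name__ == "__main__":
    p_low = get_min_entro([0, 1, 10, 7], 7)          # Example 4: (0, 1/25, 17/25, 7/25)
    print([round(x, 4) for x in p_low], round(entropy(p_low), 4))
```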
5.3 Hartley measure on the IDM

This result is easy to obtain considering Proposition 1 of Section 3.

Theorem 2. Let L be a set of IDM probability intervals on a finite variable $X = \{x_1, \dots, x_n\}$. Then

$$GH(\mathcal{P}_L) = \frac{s}{N+s}\log_2(n).$$

Proof. We only need to consider the belief function $m_L$ of Proposition 1. Then

$$GH(\mathcal{P}_{m_L}) = GH(\mathcal{P}_L) = \frac{s}{N+s}\log_2(n). \qquad \Box$$
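A one-line Python sketch of Theorem 2, checked against the two examples that follow (the function name is ours):

```python
# Sketch of Theorem 2: generalized Hartley measure of an IDM credal set.
import math

def idm_hartley(r, s, base=2):
    return s / (sum(r) + s) * math.log(len(r), base)

if __name__ == "__main__":
    print(idm_hartley([0, 1, 10, 7], 7))   # (7/25) * log2(4) = 0.56  (Example 4)
    print(idm_hartley([4, 0, 0], 1))       # (1/5)  * log2(3) ~ 0.317 (Example 5)
```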
Example 4. Using the above notation, we consider the finite set $X = \{x_1, x_2, x_3, x_4\}$ and the vector of values $r = (0, 1, 10, 7)$ on X. Then, for s = 7, the associated set of IDM probability intervals is

$$L = \left\{ \left[0, \tfrac{7}{25}\right], \left[\tfrac{1}{25}, \tfrac{8}{25}\right], \left[\tfrac{10}{25}, \tfrac{17}{25}\right], \left[\tfrac{7}{25}, \tfrac{14}{25}\right] \right\}.$$

Let p̂ be the array of the algorithm GetMaxEntro. Then p̂ takes the following values on each loop of the algorithm, in this order:

$$1{:}\ \hat{p} = \left(0, \tfrac{1}{25}, \tfrac{10}{25}, \tfrac{7}{25}\right), \qquad 2{:}\ \hat{p} = \left(\tfrac{1}{25}, \tfrac{1}{25}, \tfrac{10}{25}, \tfrac{7}{25}\right), \qquad 3{:}\ \hat{p} = \left(\tfrac{4}{25}, \tfrac{4}{25}, \tfrac{10}{25}, \tfrac{7}{25}\right),$$

where finally $\hat{p} = (\hat{p}(x_1), \hat{p}(x_2), \hat{p}(x_3), \hat{p}(x_4))$ in step 3 is the probability distribution with maximum entropy of $\mathcal{P}_L$. The minimum value of entropy of $\mathcal{P}_L$ is attained at the probability distribution $\underline{p}$ such that

$$\underline{p}(x_1) = l^*_4 = l_1 = 0, \quad \underline{p}(x_2) = l^*_3 = l_2 = \tfrac{1}{25}, \quad \underline{p}(x_3) = u^*_1 = u_3 = \tfrac{17}{25}, \quad \underline{p}(x_4) = l^*_2 = l_4 = \tfrac{7}{25}.$$
Now, using Theorem 2, the Hartley measure on $\mathcal{P}_L$ is

$$GH(\mathcal{P}_L) = \frac{s}{N+s}\log_2(n) = \frac{7}{25}\log_2(4).$$
Example 5. Now we use Example 7 of Abellán and Moral (2005b), where a classification problem is presented. Assume that we have a class variable with three possible values, $X = \{x_1, x_2, x_3\}$. For a given partition of a database we have the frequencies $r(x_1) = 4$, $r(x_2) = 0$, $r(x_3) = 0$. With s = 1, we have the following set of IDM probability intervals:

$$L = \left\{ \left[\tfrac{4}{5}, 1\right], \left[0, \tfrac{1}{5}\right], \left[0, \tfrac{1}{5}\right] \right\}.$$

Let p̂ be the array of the algorithm GetMaxEntro. Here we can use case 1 of the simplified procedure: with the same notation, we have $A = \{x_2, x_3\}$, z = 2 and

$$\hat{p} = \left( \frac{r(x_1)}{N+s}, \frac{r(x_2)+s/z}{N+s}, \frac{r(x_3)+s/z}{N+s} \right) = \left( \tfrac{4}{5}, \tfrac{1}{10}, \tfrac{1}{10} \right).$$

The minimum value of entropy of $\mathcal{P}_L$ is attained at the probability distribution $\underline{p}$ such that

$$\underline{p}(x_1) = u^*_1 = 1, \quad \underline{p}(x_2) = l^*_2 = 0, \quad \underline{p}(x_3) = l^*_3 = 0.$$

Now, using Theorem 2, the Hartley measure on $\mathcal{P}_L$ is

$$GH(\mathcal{P}_L) = \frac{s}{N+s}\log_2(n) = \frac{1}{5}\log_2(3).$$
6. Conclusions

In this paper, we have presented the following study on the credal sets obtained from the probability intervals that arise when the IDM is used:

1. We have proved that they represent a special type of reachable probability intervals that can also be represented by belief functions. We have also shown that credal sets of this type are not the only ones that belong to both reachable probability intervals and belief functions.

2. We have presented their principal properties as credal sets.

3. We have developed and proved results and algorithms to obtain important uncertainty measures on this type of credal set: maximum entropy, minimum entropy and the Hartley measure.
Acknowledgement

This work has been supported by the Spanish Ministry of Science and Technology under the Algra project (TIN 2004-06204-C03-02).
References

J. Abellán and S. Moral, "Maximum of entropy for credal sets", Int. J. Uncertainty, Fuzziness Knowledge-Based Systems, 11, pp. 587-597, 2003a.
J. Abellán and S. Moral, "Using the total uncertainty criterion for building classification trees", Int. J. Intell. Systems, 18, pp. 1215-1225, 2003b.
J. Abellán and S. Moral, "Maximum difference of entropies as a non-specificity measure for credal sets", Int. J. Gen. Systems, 34, pp. 201-214, 2005a.
J. Abellán and S. Moral, "Upper entropy of credal sets. Applications to credal classification", Int. J. Approx. Reasoning, 39, pp. 235-255, 2005b.
J. Abellán and S. Moral, "Corrigendum: a non-specificity measure for convex sets of probability distributions", Int. J. Uncertainty, Fuzziness Knowledge-Based Systems, 13, p. 467, 2005c.
J. Abellán and S. Moral, "An algorithm that computes the upper entropy for order-2 capacities", Int. J. Uncertainty, Fuzziness Knowledge-Based Systems, 14(2), pp. 141-154, 2006.
J. Abellán, G.J. Klir and S. Moral, "Disaggregated total uncertainty measure for credal sets", Int. J. Gen. Systems, 35(1), pp. 29-44, 2006.
J.O. Berger, "An overview of robust Bayesian analysis (with discussion)", Test, 5, pp. 5-124, 1994.
J.M. Bernard, "An introduction to the imprecise Dirichlet model for multinomial data", Int. J. Approx. Reasoning, 39, pp. 123-150, 2005.
L.M. de Campos and M.J. Bolaños, "Characterization of fuzzy measures through probabilities", Fuzzy Sets Systems, 31, pp. 23-36, 1989.
L.M. de Campos, J.F. Huete and S. Moral, "Probability intervals: a tool for uncertainty reasoning", Int. J. Uncertainty, Fuzziness Knowledge-Based Systems, 2, pp. 167-196, 1994.
A. Chateauneuf and J.Y. Jaffray, "Some characterizations of lower probabilities and other monotone capacities through the use of Möbius inversion", Math. Soc. Sci., 17, pp. 263-283, 1989.
G. Choquet, "Théorie des capacités", Ann. Inst. Fourier, 5, pp. 131-292, 1953/1954.
G. De Cooman, "Possibility theory I, II, III", Int. J. Gen. Systems, 25, pp. 291-371, 1997.
A.P. Dempster, "Upper and lower probabilities induced by a multivalued mapping", Ann. Math. Statist., 38, pp. 325-339, 1967.
D. Dubois and H. Prade, "A note on measures of specificity for fuzzy sets", Int. J. Gen. Systems, 10, pp. 279-283, 1985.
T.L. Fine, "Foundations of probability", in Basic Problems in Methodology and Linguistics, R.E. Butts and J. Hintikka, Eds., Dordrecht: Reidel, 1983, pp. 105-119.
T.L. Fine, "Lower probability models for uncertainty and nondeterministic processes", J. Stat. Plan. Infer., 20, pp. 389-411, 1989.
I.J. Good, "Subjective probability as the measure of a non-measurable set", in Logic, Methodology and Philosophy of Science, E. Nagel, P. Suppes and A. Tarski, Eds., California: Stanford University Press, 1962, pp. 319-329.
M. Grabisch, "The interaction and Möbius representations of fuzzy measures on finite spaces, k-additive measures: a survey", in Fuzzy Measures and Integrals: Theory and Applications, M. Grabisch et al., Eds., New York: Springer-Verlag, 2000.
D. Harmanec and G.J. Klir, "Measuring total uncertainty in Dempster-Shafer theory: a novel approach", Int. J. Gen. Systems, 22, pp. 405-419, 1994.
D. Harmanec, G. Resconi, G.J. Klir and Y. Pan, "On the computation of uncertainty measure in Dempster-Shafer theory", Int. J. Gen. Systems, 25, pp. 153-163, 1996.
R.V.L. Hartley, "Transmission of information", The Bell System Tech. J., 7, pp. 535-563, 1928.
M. Higashi and G.J. Klir, "Measures of uncertainty and information based on possibility distributions", Int. J. Gen. Systems, 9, pp. 43-58, 1983.
J.N. Kapur, Measures of Information and their Applications, Ch. 23, New York: John Wiley, 1994.
J.N. Kapur, G. Baciu and H.K. Kesavan, "The minmax information measure", Int. J. Systems Sci., 26, pp. 1-12, 1995.
G.J. Klir, Uncertainty and Information: Foundations of Generalized Information Theory, New York: John Wiley, 2006.
G.J. Klir and M. Mariano, "On the uniqueness of possibilistic measure of uncertainty and information", Fuzzy Sets Systems, 24, pp. 197-219, 1987.
G.J. Klir and R.M. Smith, "On measuring uncertainty and uncertainty-based information: recent developments", Ann. Math. Artif. Intell., 32, pp. 5-33, 2001.
G.J. Klir and M.J. Wierman, Uncertainty-Based Information: Elements of Generalized Information Theory, Heidelberg and New York: Physica-Verlag/Springer-Verlag, 1998 (2nd ed., 1999).
H.E. Kyburg, "Bayesian and non-Bayesian evidential updating", Artif. Intell., 31, pp. 271-293, 1987.
I. Levi, The Enterprise of Knowledge, London: MIT Press, 1980.
A. Meyerowitz, F. Richman and E.A. Walker, "Calculating maximum-entropy probability densities for belief functions", Int. J. Uncertainty, Fuzziness Knowledge-Based Systems, 2, pp. 377-389, 1994.
I. Molchanov, Theory of Random Sets, New York: Springer, 2004.
A. Ramer, "Uniqueness of information measure in the theory of evidence", Fuzzy Sets Systems, 35, pp. 183-196, 1987.
A. Rényi, Probability Theory, Amsterdam: North-Holland, 1970.
G. Shafer, A Mathematical Theory of Evidence, Princeton: Princeton University Press, 1976.
C.E. Shannon, "A mathematical theory of communication", The Bell System Tech. J., 27, pp. 379-423, 623-656, 1948.
R.M. Smith, "Generalized information theory: resolving some old questions and opening some new ones", PhD dissertation, Binghamton University-SUNY, Binghamton, 2000.
P. Suppes, "The measurement of belief (with discussion)", J. Roy. Statist. Soc. B, 36, pp. 160-191, 1974.
P. Walley, Statistical Reasoning with Imprecise Probabilities, New York: Chapman and Hall, 1991.
P. Walley, "Inferences from multinomial data: learning about a bag of marbles", J. Roy. Statist. Soc. B, 58, pp. 3-57, 1996.
Z. Wang and G.J. Klir, Fuzzy Measure Theory, New York: Plenum Press, 1992.
L. Wasserman and J.B. Kadane, Bayesian Analysis in Statistics and Econometrics, Ch. 47, New York: John Wiley, 1996, pp. 549-555.
K. Weichselberger and S. Pöhlmann, A Methodology for Uncertainty in Knowledge-Based Systems, New York: Springer-Verlag, 1990.
R.R. Yager, "Entropy and specificity in a mathematical theory of evidence", Int. J. Gen. Systems, 9, pp. 249-260, 1983.
M. Zaffalon, "A credal approach to naive classification", in Proceedings of the First International Symposium on Imprecise Probabilities and their Applications (ISIPTA '99), Ghent, 1999, pp. 405-414.
Joaquín Abellán received his PhD degree in January 2003 from the University of Granada. He is an assistant professor of computer science and artificial intelligence at the University of Granada. His current research interests are the representation of uncertainty through convex sets of probability distributions and its applications to classification.