Expert Systems with Applications 37 (2010) 3799–3809


A novel measure for evaluating classifiers

Jin-Mao Wei (a,*), Xiao-Jie Yuan (a), Qing-Hua Hu (b), Shu-Qin Wang (c)

a Department of Computer Science, Nankai University, Tianjin 300071, China
b Harbin Institute of Technology, Harbin, China
c College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

* Corresponding author. E-mail address: [email protected] (J.-M. Wei).

Keywords: Performance evaluation; Entropy; Accuracy; Classification

Abstract

Evaluating classifier performances is a crucial problem in pattern recognition and machine learning. In this paper, we propose a new measure, confusion entropy, for evaluating classifiers. For each class $cl_i$ of an $(N+1)$-class problem, the misclassification information involves both the information of how the samples with true class label $cl_i$ have been misclassified to the other $N$ classes and the information of how the samples of the other $N$ classes have been misclassified to class $cl_i$. The proposed measure exploits the class distribution information of such misclassifications of all classes. Both theoretical analysis and statistical experiments show that the proposed measure is more precise than accuracy and RCI. Experimental results on some benchmark data sets further confirm the theoretical analysis and statistical results and show that the new measure is feasible for evaluating classifier performances.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Evaluating classifier performances is a crucial step in classification. For a classification problem, one needs a metric to evaluate which of the classifiers induced by various methods is superior to the others. The measure of accuracy has been used for decades for this purpose. Although classification accuracy seems to be the most logical way of measuring classifier performances, it may turn out to be inefficient in some classification cases. It has been criticized in recent years because the costs of misclassification cannot be appropriately taken into consideration (Everson & Fieldsend, 2006; Hand & Till, 2001). To make up for this inefficiency, some methods and measures have been suggested. ROC analysis, i.e. the receiver operating characteristic, is one of the most recommended methods for evaluating classifier performances. Since its introduction, the ROC method has been widely studied and used, especially in medical diagnosis (Bradley, 1997; Hanley & McNeil, 1982; Metz, 1978; Swets, 1988). ROC analysis provides a convenient graphical display of the trade-off between true and false positive classification rates for two-class problems (Provost & Fawcett, 1997). The area under the ROC curve, i.e. AUC, has become an important performance measure (Bradley, 1997; Landgrebe & Duin, 2008); unlike ROC analysis, it is invariant to operating conditions. ROC analysis and AUC are powerful for comparing competing classification models over a range of operating conditions. Since they were originally introduced for two-class problems, much effort has been devoted in recent years to extending them to multi-class problems (Edwards, Metz, & Nishikawa, 2005; Ferri, Hernandez-Orallo, & Salido, 2003; Landgrebe & Duin, 2006, 2008; Srinivasan, 1999).

On the other hand, classification accuracy is also inefficient for unfolding subtle differences between classifiers, which is the main concern of this paper. For a classification problem, various classifiers induced by different methods under different conditions may sometimes obtain very similar or even the same classification accuracy. When such cases occur, the accuracy cannot reveal the classification details of how the samples of each class are separated from those of the others. It may be crucial to discriminate such cases in some real applications. For example, in medical diagnosis, it is usually of vital importance that samples of different types of diseases are separated from each other and from normal ones by classification rules. With such considerations, the authors of Sindhwani, Bhattacharge, and Rakshit (2001) proposed a measure called relative classifier information, RCI in short, for evaluating classifier performances. The measure of RCI can evaluate how well a classifier discriminates different classes from the results of class predictions (Sindhwani et al., 2001). In Statnikov, Aliferis, Tsamardinos, Hardin, and Levy (2005), the authors used the measure of RCI to evaluate classifier performances.

In this paper, we propose a new measure for performance evaluation. The proposed measure, confusion entropy, is designed directly for multi-class problems. It tries to make thorough use of the information in confusion matrices: it takes into account the class distribution information of all off-diagonal elements of confusion matrices, which convey the misclassification information. In the following sections, we


will first define confusion entropy. We will then compare the proposed measure with accuracy and RCI for classifier performance evaluation.

2. Confusion entropy for evaluating classifiers

Given a data set and a classifier induced from it, the corresponding confusion matrix summarizes the classification results of the samples in the data set. By checking a row $cl_i$, the elements $C_{i,k}$, $k = 1, 2, \ldots$, indicate how many samples with true class label $cl_i$ have been classified to each of the classes. By checking a column $cl_j$, the elements $C_{k,j}$, $k = 1, 2, \ldots$, indicate how many samples of each class have been classified to class $cl_j$. The diagonal elements indicate how many samples have been correctly classified; the off-diagonal elements indicate how many samples have been misclassified. Table 1 shows an example of a confusion matrix.

Table 1
A confusion matrix of a four-class problem.

        cl1   cl2   cl3   cl4
cl1      50    10    10    30
cl2      10    90     1     0
cl3      10     0    50    42
cl4       0    10    10    83

For an $(N+1)$-class problem, the misclassification probability of classifying the samples of class $cl_i$ to class $cl_j$ subject to class $cl_j$, denoted as $P^{j}_{i,j}$, is defined as:

$$P^{j}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N+1}\left(C_{j,k} + C_{k,j}\right)}, \qquad i \neq j,\; i, j = 1, \ldots, N+1 \tag{1}$$

The misclassification probability of classifying the samples of class $cl_i$ to class $cl_j$ subject to class $cl_i$, denoted as $P^{i}_{i,j}$, is defined as:

$$P^{i}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N+1}\left(C_{i,k} + C_{k,i}\right)}, \qquad i \neq j,\; i, j = 1, \ldots, N+1 \tag{2}$$

$$P^{i}_{i,i} = 0, \qquad i = 1, \ldots, N+1 \tag{3}$$

The confusion entropy of class $cl_j$, $CEN_j$ in short, is defined as:

$$CEN_j = -\sum_{k=1, k \neq j}^{N+1} \left( P^{j}_{j,k} \log_{2N} P^{j}_{j,k} + P^{j}_{k,j} \log_{2N} P^{j}_{k,j} \right) \tag{4}$$

In the equation, if $P^{j}_{j,k} = 0$ then $P^{j}_{j,k} \log_{2N} P^{j}_{j,k} = 0$, and if $P^{j}_{k,j} = 0$ then $P^{j}_{k,j} \log_{2N} P^{j}_{k,j} = 0$. The overall confusion entropy of a confusion matrix, CEN in short, is defined as:

$$CEN = \sum_{j} P_j \, CEN_j, \tag{5}$$

where

$$P_j = \frac{\sum_{k}\left(C_{j,k} + C_{k,j}\right)}{2\sum_{k,l} C_{k,l}}, \tag{6}$$

which can be called the confusion probability of class $cl_j$.
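For concreteness, Eqs. (1)-(6) can be implemented in a few lines. The following is a minimal Python sketch (the function name cen and the list-of-rows representation are our own choices, not part of the original paper):

```python
import math

def cen(C):
    """Overall confusion entropy (CEN) of a confusion matrix, per Eqs. (1)-(6).

    C is a list of rows: C[i][j] counts samples of true class i assigned
    to class j. For an (N+1)-class problem the logarithm base is 2N.
    """
    m = len(C)                       # m = N + 1 classes
    base = 2 * (m - 1)               # logarithm base 2N
    total = sum(sum(row) for row in C)

    def plog(p):                     # convention: 0 * log 0 = 0
        return p * math.log(p, base) if p > 0 else 0.0

    overall = 0.0
    for j in range(m):
        # Denominator shared by Eqs. (1) and (2) for class j:
        # every count in row j plus every count in column j.
        s_j = sum(C[j][k] + C[k][j] for k in range(m))
        cen_j = 0.0
        for k in range(m):
            if k == j or s_j == 0:
                continue
            cen_j -= plog(C[j][k] / s_j)   # row term    P^j_{j,k}, Eq. (2)
            cen_j -= plog(C[k][j] / s_j)   # column term P^j_{k,j}, Eq. (1)
        overall += (s_j / (2 * total)) * cen_j   # P_j * CEN_j, Eqs. (5)-(6)
    return overall
```

Applied to the extreme cases of Section 4, this sketch reproduces the CEN values reported there, which gives some confidence that it matches the definitions.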

To understand the proposed measure, we further analyze the above definitions. From Eqs. (1) and (2) one can see that the misclassification probability is defined subject to a certain class. For a class $cl_j$, the row and column $cl_j$ of the confusion matrix involve all the classification information about this class. Except for $C_{j,j}$, the row elements $C_{j,i}$ ($i = 1, 2, \ldots$) convey the misclassification information of how many samples of class $cl_j$ have been misclassified to class $cl_i$, and the column elements $C_{i,j}$ ($i = 1, 2, \ldots$) indicate how many samples of class $cl_i$ have been misclassified to class $cl_j$. Hence, an element $C_{i,j}$ will be used twice, for computing the misclassification probabilities subject to class $cl_i$ and to class $cl_j$: $P^{j}_{i,j}$ describes how the samples of each other class $cl_i$ have been misclassified to class $cl_j$, while $P^{i}_{i,j}$ describes how the samples of class $cl_i$ have been misclassified to each other class $cl_j$. Apparently, we have

$$\sum_{k=1, k \neq j}^{N+1} \left( P^{j}_{k,j} + P^{j}_{j,k} \right) = 1 - \frac{2 C_{j,j}}{\sum_{k=1}^{N+1}\left(C_{k,j} + C_{j,k}\right)} \tag{7}$$

The above equation indicates that the overall misclassification probability subject to class $cl_j$ takes the value of 1 only when $C_{j,j} = 0$. On the other hand, the overall misclassification probability takes the value of 0 if

$$2 C_{j,j} = \sum_{k=1}^{N+1}\left(C_{k,j} + C_{j,k}\right) \tag{8}$$

Eq. (8) can be further rewritten as

$$2 C_{j,j} = 2 C_{j,j} + \sum_{k=1, k \neq j}^{N+1}\left(C_{k,j} + C_{j,k}\right) \tag{9}$$

The equation implies that the overall misclassification probability subject to class $cl_j$ takes the value of 0 only when $C_{k,j}$ and $C_{j,k}$ take the value of 0 for all $k \neq j$. That is,

$$\sum_{k=1, k \neq j}^{N+1}\left(C_{k,j} + C_{j,k}\right) = 0 \tag{10}$$

Apparently, when this case occurs, all samples of class $cl_j$ have been correctly classified and no sample of any other class has been misclassified to class $cl_j$.

The two definitions and the above analysis are essential for understanding the computation of the confusion entropy $CEN_j$, which is computed from the misclassification probabilities subject to class $cl_j$. First of all, it should be noticed that in Eq. (4) the base of the logarithm is $2N$. For an $(N+1)$-class problem, there are $N+1$ rows and $N+1$ columns in the corresponding confusion matrix. For class $cl_i$, the elements in the row and column of $cl_i$, i.e. $C_{i,k}$ and $C_{k,i}$, except for $C_{i,i}$, indicate how the samples are misclassified. That is, the samples with true class label $cl_i$ may be misclassified only to the $N$ other classes, and only the samples of the $N$ other classes may be misclassified to class $cl_i$. Therefore there are $2N$ kinds of possible misclassifications for each class of the $(N+1)$-class problem. Take Table 1 as an example. For class $cl_2$, the elements in row $cl_2$ and column $cl_2$ except for $C_{2,2}$ convey the misclassification information. The elements $C_{2,1}$, $C_{2,3}$ and $C_{2,4}$ indicate how many samples with true class label $cl_2$ have been misclassified to classes $cl_1$, $cl_3$ and $cl_4$; the elements $C_{1,2}$, $C_{3,2}$ and $C_{4,2}$ indicate how many samples of classes $cl_1$, $cl_3$ and $cl_4$ have been misclassified to class $cl_2$. Hence the base of the logarithm is 6 (not 4 or 8) for computing the confusion entropy of class $cl_2$ and of all the other classes. Eq. (4) can be simply rewritten as

$$CEN_j = -\sum_{k=1, k \neq j}^{N+1} P^{j}_{j,k} \log_{2N} P^{j}_{j,k} \;-\; \sum_{k=1, k \neq j}^{N+1} P^{j}_{k,j} \log_{2N} P^{j}_{k,j} \tag{11}$$

The first item on the right side of the equation can be taken as the confusion entropy corresponding to row $cl_j$; the second item can be taken as the confusion entropy corresponding to column $cl_j$. The first item takes the value of 0 if no sample of class $cl_j$ has been misclassified, and the second item takes the value of 0 if no sample of the other classes has been misclassified to class $cl_j$. Apparently, $CEN_j$ takes the value of 0 if all the samples of class $cl_j$ have been correctly classified and no sample of the other classes has been misclassified to class $cl_j$. On the other hand, the first item will take


the largest value if all the samples of class $cl_j$ have been uniformly misclassified to all the other classes, and the second item will take the largest value if the samples that have been misclassified to class $cl_j$ come uniformly from all the other classes. When all the samples of class $cl_j$ are uniformly misclassified to all the other $N$ classes and the same number of samples misclassified to class $cl_j$ come uniformly from all the other $N$ classes, $CEN_j$ takes its largest value of 1. In other words, $CEN_j$ takes the value of 1 if $C_{j,j} = 0$ and $C_{k1,j} = C_{j,k2}$ for all $k1, k2 \neq j$.

The overall confusion entropy CEN can be taken as the weighted sum of the confusion entropies of all classes. For the $P_j$ we have

$$\sum_{j} P_j = 1 \tag{12}$$

This can be easily obtained from the definition of $P_j$. Eq. (6) can be simply rewritten as

$$P_j = \frac{\sum_{k} C_{j,k}}{2\sum_{k,l} C_{k,l}} + \frac{\sum_{k} C_{k,j}}{2\sum_{k,l} C_{k,l}} \tag{13}$$

On the right side of the equation, the first item involves the row information of class $cl_j$ and the second item involves the column information of class $cl_j$. If the number of samples with true class label $cl_j$ is small, the value of $P_j$ tends to be small; if the number of samples misclassified to class $cl_j$ is small, the value of $P_j$ also tends to be small. The classification information of class $cl_j$ involves these two kinds of information, and $P_j$ is defined to describe the proportion of the classification information of class $cl_j$ in the whole classification result. It can be easily accepted that the classification result of the majority class will mainly affect the overall classification result, while classes with smaller numbers of samples will make less contribution. Consequently, if $P_j$ is larger than $P_i$, the classification information of class $cl_j$ contributes more to the overall confusion entropy than that of class $cl_i$.

From the above analysis, we have two observations about confusion entropy. One is that the measure only computes the entropy of misclassified samples; the samples that are correctly classified make no direct contribution to the calculation of the measure. Consequently, the smaller the number of misclassified samples is, the smaller the value of confusion entropy tends to be. The best case of classification is that all samples are correctly classified, i.e. all off-diagonal elements take the value of 0. The value of confusion entropy is zero in this case, and from the definitions this is the unique case in which confusion entropy takes the value of zero. This also indicates that the proposed measure tends to take a small value when the value of accuracy is large; in other words, the two measures tend to keep 'consistent' with each other. The other observation is that the more balanced the class distribution of misclassified samples is, the larger the value of confusion entropy tends to be. The worst case of classification is that no sample has been correctly classified, i.e. all diagonal elements of the confusion matrix take the value of zero, while all other elements take the same value; the value of confusion entropy in this case is 1, which can be easily computed from the above definitions. By further investigation we may find that, when the sum of the diagonal elements remains the same, the off-diagonal elements may take various values, which results in different values of confusion entropy. This apparently indicates that the measure of confusion entropy is more discriminating than accuracy. Moreover, the proposed measure is capable of measuring how well the samples of different classes have been separated from each other: we argue that the samples of different classes are well separated from each other if the class distribution of misclassified samples is imbalanced. The above analysis suggests that the measure of confusion entropy potentially has its merits in classifications.
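The two boundary cases just described can be checked numerically with the cen sketch above (the matrices below are our own illustrations; in the worst case every misclassification probability equals $1/2N$, so each $CEN_j$ is exactly 1):

```python
# Best case: all samples on the diagonal -> CEN = 0.
best = [[25 if i == j else 0 for j in range(4)] for i in range(4)]
# Worst case: empty diagonal, uniform off-diagonal counts -> CEN = 1.
worst = [[0 if i == j else 5 for j in range(4)] for i in range(4)]
print(cen(best), cen(worst))   # 0.0 1.0 (up to floating-point rounding)
```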

Yet it is still insufficient to determine whether and to what degree the measure of confusion entropy is superior for evaluating classifiers. In the following sections, we compare the proposed measure with accuracy and RCI, and determine in terms of consistency and discriminancy which measure is superior for evaluating classifier performances.

3. Some measures defined on confusion matrix

The concept of the confusion matrix is well known and has been widely exploited in related studies, and some measures have been defined on the confusion matrix for different purposes. The accuracy of a classifier can be easily obtained from its corresponding confusion matrix. In their work (Freitas, de Carvalho, & Oliveira, 2007), the authors defined a simple measure on the confusion matrix, the global performance index, for evaluating the performances of ensemble classifiers. The global performance index of a classifier for an N-class problem was defined as:

$$RR = \frac{1}{N} \sum_{i=j=1}^{N} C_{i,j}, \tag{14}$$

where $C_{i,j}$ is an element of the $N \times N$ confusion matrix. By computing the distances between each classifier and all the others, a distance-based disagreement was obtained to express the disagreement of the classifiers in an ensemble. In van Son (1994, 1995), the author defined the entropy of a confusion matrix for speech research as:

$$H_{CM} = -\sum_{i=1}^{I} \sum_{j=1}^{J} p_{i,j} \log_2 p_{i,j}, \tag{15}$$

where $I$ is the number of stimulus classes, $J$ the number of response categories, and $p_{i,j}$ the probability of response $j$ to stimulus $i$. Interestingly, in their works (Kreuz, Haas, Morelli, Abarbanel, & Politi, 2007; Victor & Purpura, 1996), the authors discussed measuring response clustering in neuronal response studies in terms of transmitted information. The measure they used, introduced in Abramson (1963), was defined as:

$$H = \frac{1}{C} \sum_{i} \sum_{j} C_{i,j} \left[ \log C_{i,j} - \log \sum_{j} C_{i,j} - \log \sum_{i} C_{i,j} + \log C \right] \tag{16}$$

$H$ was called the transmitted information of the classifier. The above three measures were all defined on the confusion matrix. Without investigating the original intentions of their works, it seems that the measures except for the first one can be exploited for evaluating classifiers. The values of the second and third measures reach their largest and smallest values, respectively, when all elements of a confusion matrix are distributed uniformly, i.e. each element takes the value of $\frac{C}{N^2}$, where $C$ is the total number of samples and $N$ the number of classes. This is apparently different from the behavior of the measure of confusion entropy: for the worst case of classification, the measure of confusion entropy takes its largest value when all diagonal elements take 0 and all off-diagonal elements take the value of $\frac{C}{N(N-1)}$ for $N$-class problems ($\frac{C}{N(N+1)}$ for $(N+1)$-class problems). Furthermore, the former two measures take their worst values while the corresponding accuracy does not take the worst value of zero; from another angle, if the accuracy is zero, the two measures do not take their worst values. This is intuitively hard to accept, and it further implies that much inconsistency will be observed for the cases between the two extremes. On the other hand, the value of $H_{CM}$ reaches its smallest value when the samples of each class are classified to only one class, which is not necessarily the correct class. If all samples are correctly classified, $H_{CM}$ takes its best value. However, it will


also take the best value if the samples of each class are classified to a unique wrong class. The measure of $H$ reaches its largest value of $\log N$ when the samples of each class are classified to only one class while no sample of the other classes is classified to the same class. As one may notice, for the best cases of classification the two measures also differ from the measure of confusion entropy, which takes its smallest (best) value of 0 only when all samples have been correctly classified. In the following section, we mainly compare the proposed measure with accuracy and RCI, which are used for evaluating classifiers.
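The degenerate behavior described above is easy to reproduce. The sketch below (our own illustration; the helper names are ours, and cen is the Section 2 sketch) computes $H_{CM}$ and $H$ for a perfect two-class classifier and for one that sends every sample to the unique wrong class; both measures score the two matrices identically, whereas confusion entropy yields 0 and 1:

```python
import math

def _plog2(p):
    # Convention: 0 * log 0 = 0
    return p * math.log2(p) if p > 0 else 0.0

def hcm(C):
    """Confusion-matrix entropy of van Son (1994), Eq. (15)."""
    total = sum(map(sum, C))
    return -sum(_plog2(v / total) for row in C for v in row)

def transmitted_info(C):
    """Transmitted information of Abramson (1963), Eq. (16), in bits."""
    total = sum(map(sum, C))
    rows = [sum(r) for r in C]
    cols = [sum(r[j] for r in C) for j in range(len(C[0]))]
    h = sum(C[i][j] * (math.log2(C[i][j]) - math.log2(rows[i])
                       - math.log2(cols[j]) + math.log2(total))
            for i in range(len(C)) for j in range(len(C[0])) if C[i][j] > 0)
    return h / total

perfect = [[25, 0], [0, 25]]
swapped = [[0, 25], [25, 0]]   # every sample sent to the unique wrong class
print(hcm(perfect), hcm(swapped))                            # 1.0 1.0
print(transmitted_info(perfect), transmitted_info(swapped))  # 1.0 1.0
print(cen(perfect), cen(swapped))                            # 0.0 1.0
```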

4. Comparisons

In this section, we first compare the measure of confusion entropy with accuracy and with RCI on some extreme cases. Subsequently, we compare the measures statistically to reach some conclusions about the proposed measure.

4.1. Confusion entropy versus accuracy

As is well known, the value of accuracy can be computed directly from confusion matrices. When the diagonal elements of a matrix are known, the accuracy is uniquely determined no matter what values the off-diagonal elements take, whereas the value of confusion entropy changes in such cases. We use the extreme cases shown in Tables 2 and 3 to show the difference between the two measures. The accuracy is 0.7 for each of the four cases, while the confusion entropies for the four cases are 0.2299, 0.105, 0.2825 and 0.1774, respectively. The results show that the accuracy does not differentiate the four cases, while the confusion entropy can clearly tell them apart. It is easy to accept that case b ranks the highest, because all samples of the majority class are correctly classified and no sample of the other classes is misclassified to this class. It is worth noticing that case d ranks second: compared with the other two cases, in case d the samples of two classes are correctly classified. It may be a little more difficult to compare cases a and c. Case a is a trivial classification result, while case c involves more misclassification information than case a. From these cases, we find that the measure of confusion entropy prefers the cases, such as case a, with less misclassification information. Moreover, one may notice that it is worthwhile to measure the differences between classifiers especially when the accuracy cannot differentiate them.

Table 2
Two extreme cases a and b.

(a)                               (b)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1    70    0    0    0          cl1    70    0    0    0
cl2     5    0    0    0          cl2     0    0    5    0
cl3    10    0    0    0          cl3     0    0    0   10
cl4    15    0    0    0          cl4     0   15    0    0

Table 3
Two extreme cases c and d.

(c)                               (d)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1    50    5   15    0          cl1    50   20    0    0
cl2     5    0    0    0          cl2     0    5    0    0
cl3     0    0   10    0          cl3     0   10    0    0
cl4     0    5    0   10          cl4     0    0    0   15
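The four values above can be reproduced with the cen sketch from Section 2 (the dictionary layout below is our own; rows are true classes):

```python
cases = {
    "a": [[70, 0, 0, 0], [5, 0, 0, 0], [10, 0, 0, 0], [15, 0, 0, 0]],
    "b": [[70, 0, 0, 0], [0, 0, 5, 0], [0, 0, 0, 10], [0, 15, 0, 0]],
    "c": [[50, 5, 15, 0], [5, 0, 0, 0], [0, 0, 10, 0], [0, 5, 0, 10]],
    "d": [[50, 20, 0, 0], [0, 5, 0, 0], [0, 10, 0, 0], [0, 0, 0, 15]],
}
for name, m in cases.items():
    acc = sum(m[i][i] for i in range(4)) / sum(map(sum, m))
    print(name, acc, round(cen(m), 4))
# a 0.7 0.2299 | b 0.7 0.105 | c 0.7 0.2825 | d 0.7 0.1774
```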

4.2. Confusion entropy versus RCI

In Sindhwani et al. (2001), the authors proposed another measure, called relative classifier information (RCI in short), for evaluating classifier performances. It was introduced in view of the observation that classification accuracy is incapable of unfolding how the samples of different classes have been separated from each other. The measure was also defined on the confusion matrix and was designed directly for multi-class problems. For comparison, we first review the definition of the measure. Given a confusion matrix, each element $C_{i,j}$ of the matrix is the number of samples with true class label $cl_i$ that have been classified to class $cl_j$. The uncertainty of the problem was defined as (here $N$ denotes the total number of samples):

$$H_d = -\sum_{i} \frac{\sum_{l} C_{i,l}}{N} \log \frac{\sum_{l} C_{i,l}}{N} \tag{17}$$

The overall uncertainty after observation was defined as:

$$H_O = \sum_{j} \frac{\sum_{k} C_{k,j}}{N} H_{o_j}, \tag{18}$$

where

$$H_{o_j} = -\sum_{i} \frac{C_{i,j}}{\sum_{k} C_{k,j}} \log \frac{C_{i,j}}{\sum_{k} C_{k,j}}, \tag{19}$$

which is called the uncertainty after observing output $cl_j$. The amount of uncertainty removed by the classifier was therefore:

$$H_c = H_d - H_O \tag{20}$$

$H_c / H_d$ was called the relative classifier information. From the definition, we notice that $H_{o_j}$ computes the entropy of the class distribution of the samples observed in the output of class $cl_j$. If the observed samples come uniquely from one class, $H_{o_j}$ takes the value of 0. Consequently, the measure of RCI takes its largest value of 1 when the samples of each class are classified to one class and no sample of the other classes is classified to this class; on the other hand, it takes its smallest value of 0 when the samples of each class are classified uniformly to all classes. It is easy to compute the values of RCI for the four cases shown in Tables 2 and 3, and it can be seen that RCI is also able to tell the four cases apart. Nevertheless, compared with the computation of confusion entropy, we find that the measure of RCI only takes into account the column information of a confusion matrix and ignores the row information, which indicates how the samples of one class have been misclassified to other classes. The two extreme cases shown in Table 4 are used to compare the measure of RCI with confusion entropy. RCI takes the value of 1 for cases e and f, for case b in Table 2, and also for the case in which all samples are correctly classified. The values of confusion entropy for cases e and f are 0.2744 and 0.2416, respectively. It is certainly acceptable that case b is better than case f and case f is better than case e.

Table 4
Two extreme cases e and f.

(e)                               (f)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1     0    0    0   70          cl1     0    0    0   70
cl2     0    0    5    0          cl2     0    5    0    0
cl3     0   10    0    0          cl3    10    0    0    0
cl4    15    0    0    0          cl4     0    0   15    0
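Eqs. (17)-(20) are likewise straightforward to implement. A minimal sketch follows (the function name rci is ours, and cen from the Section 2 sketch is assumed to be in scope; the base of the logarithm cancels in the ratio $H_c/H_d$):

```python
import math

def rci(C):
    """Relative classifier information of Sindhwani et al. (2001), Eqs. (17)-(20)."""
    m = len(C)
    n = sum(map(sum, C))                       # N: total number of samples

    def h(ps):                                 # entropy with 0 * log 0 = 0
        return -sum(p * math.log2(p) for p in ps if p > 0)

    hd = h(sum(row) / n for row in C)          # Eq. (17)
    ho = 0.0
    for j in range(m):
        col = sum(C[i][j] for i in range(m))   # samples observed as output cl_j
        if col > 0:                            # Eqs. (18) and (19)
            ho += (col / n) * h(C[i][j] / col for i in range(m))
    return (hd - ho) / hd                      # Eq. (20)

case_e = [[0, 0, 0, 70], [0, 0, 5, 0], [0, 10, 0, 0], [15, 0, 0, 0]]
case_f = [[0, 0, 0, 70], [0, 5, 0, 0], [10, 0, 0, 0], [0, 0, 15, 0]]
print(rci(case_e), rci(case_f))                     # 1.0 1.0: RCI cannot separate e and f
print(round(cen(case_e), 4), round(cen(case_f), 4)) # 0.2744 0.2416
```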

4.3. Statistical comparisons on synthetic data

Although the above extreme examples show that confusion entropy is more competent than accuracy and RCI in some cases, we still need to examine whether the proposed measure is superior for evaluating classifiers in general. In this section, we compare the measures statistically in terms of the consistency and discriminancy criteria introduced in Huang and Ling (2005). The original definitions can be found in their paper; for understanding the meanings of the criteria, we review the definitions of the degree of consistency and the degree of discriminancy.

Degree of consistency (Huang & Ling, 2005): for two measures $f$ and $g$ on domain $\Psi$, let $R = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) > g(b)\}$ and $S = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) < g(b)\}$. The degree of consistency of $f$ and $g$ is $C = \frac{|R|}{|R| + |S|}$, with $0 \le C \le 1$.

Degree of discriminancy (Huang & Ling, 2005): for two measures $f$ and $g$ on domain $\Psi$, let $P = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) = g(b)\}$ and $Q = \{(a,b) \mid a, b \in \Psi, g(a) > g(b), f(a) = f(b)\}$. The degree of discriminancy of $f$ over $g$ is $D = \frac{|P|}{|Q|}$.

They argued (Huang & Ling, 2005) that if $C > 0.5$ and $D > 1$, the measure $f$ can be determined to be superior to $g$. The criterion of consistency is used to determine statistically to what degree the two measures $f$ and $g$ make the same or similar conclusions about the classification results within a domain; in other words, it determines to what degree the two measures simultaneously suggest that the classification result with respect to one confusion matrix is better or worse than that with respect to another. Measure $f$ is strictly consistent with measure $g$ if $f$ makes the same conclusion in all possible cases in which $g$ determines the result of one confusion matrix to be better or worse than that of another. The degree of consistency $C = 1$ when the two measures are strictly consistent with each other, and $C > 0.5$ means that the two measures are consistent in over 50% of the cases. For computing the degree of consistency of accuracy and confusion entropy with respect to a certain problem, we first enumerate all of its possible confusion matrices. For each pair of confusion matrices $a$ and $b$, we compute the accuracy and confusion entropy of both. If the accuracy of $a$ is greater (smaller) than that of $b$ and the confusion entropy of $a$ is smaller (greater) than that of $b$, i.e. the two measures both suggest that $a$ is better than $b$ (or that $b$ is better than $a$), the pair $(a,b)$ is put into set $R$; otherwise the pair is put into $S$. After investigating all possible pairs of confusion matrices of the problem, we can compute the degree of consistency of accuracy and confusion entropy.

The criterion of discriminancy is used to determine to what degree one measure is more discriminating than the other. The degree of discriminancy $D > 1$ if measure $f$ can discriminate more classification cases than measure $g$. In fact, the two criteria can be used to determine whether one measure is more precise than another: if measure $f$ is consistent with measure $g$ in the majority of cases while $f$ can discriminate more cases than $g$, then $f$ can be said to be more precise, and consequently better, than $g$. For statistically comparing two measures in terms of consistency and discriminancy for a problem, we need to exhaustively enumerate all possible confusion matrices, which is of course very time-consuming.
For an $N$-class problem with $C_1$ to $C_N$ samples in the respective classes, the total number of possible confusion matrices is:

$$\prod_{j=1}^{N} \frac{\prod_{i=1}^{N-1} (C_j + i)}{(N-1)!} \tag{21}$$
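Each factor in Eq. (21) is the number of ways $\binom{C_j + N - 1}{N - 1}$ of distributing the $C_j$ samples of class $cl_j$ over the $N$ predicted classes. The enumeration and pair counting can be sketched as follows (our own illustration of the protocol, not the authors' code; it is only practical for tiny problems, and it assumes both denominators are non-zero):

```python
from itertools import combinations, product

def compositions(n, k):
    """All ways to split n samples over k ordered predicted classes."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def degrees(class_sizes, f, g):
    """Degrees of consistency C and discriminancy D of f over g
    (Huang & Ling, 2005). Both f and g are oriented 'higher is better',
    so pass e.g. lambda M: -cen(M) for confusion entropy."""
    k = len(class_sizes)
    rows = [list(compositions(n, k)) for n in class_sizes]
    vals = [(f(list(m)), g(list(m))) for m in product(*rows)]
    r = s = p = q = 0
    for (fa, ga), (fb, gb) in combinations(vals, 2):
        df, dg = fa - fb, ga - gb
        if df * dg > 0: r += 1            # same ordering: consistent pair
        elif df * dg < 0: s += 1          # opposite ordering: inconsistent pair
        elif df != 0 and dg == 0: p += 1  # only f discriminates the pair
        elif dg != 0 and df == 0: q += 1  # only g discriminates the pair
    return r / (r + s), p / q

accuracy = lambda M: sum(M[i][i] for i in range(len(M))) / sum(map(sum, M))
# degrees((2, 2, 2), lambda M: -cen(M), accuracy) enumerates 6**3 = 216
# matrices, matching Eq. (21): each class contributes (3*4)/2! = 6.
```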

Furthermore, it is certainly impossible to compare the measures for every possible number of classes. In view of this observation, we merely report the exhaustive results on some 3-class problems and try to draw conclusions about problems with larger numbers of samples and classes from the observations on the reported problems. In the statistical experiments, the problems were generated as follows. The number of samples of each class $cl_i$ ($i = 1, 2, 3$) was generated randomly; for controlling the scale of each problem, the number of samples of each class ranged from 2 to 10. We generated 12 such 3-class problems. Each generated problem is denoted as $i$-$j$-$k$, where $i$ is the number of samples of class $cl_1$, $j$ the number of samples of class $cl_2$, and $k$ the number of samples of class $cl_3$; for example, the problem 2-4-3 shown in Table 5 has 2, 4 and 3 samples in the three classes. After generating the 12 problems, we exhaustively enumerated all possible confusion matrices of each problem; that is, for a given problem, the number of samples of one class classified to each of the classes changed from 0 to the maximal number. Take problem 2-4-3 as an example: the two samples of class $cl_1$ may both be classified to class $cl_1$, $cl_2$ or $cl_3$, or one sample to class $cl_1$ and the other to class $cl_2$ or $cl_3$, etc. Tables 5 and 6 show the statistical results of consistency and discriminancy of confusion entropy against the other two measures on the 12 synthetic data sets.

Table 5
Consistency comparison of different pairs of measures on the 12 problems.

Prob.    cen-acc   cen-rci
2-4-3    0.82      0.6057
2-5-3    0.7957    0.6108
2-6-3    0.7741    0.6144
3-4-5    0.8065    0.6221
3-5-5    0.8015    0.6256
3-5-7    0.783     0.6295
6-3-6    0.7892    0.63
6-4-5    0.8003    0.6305
4-8-4    0.776     0.6311
4-8-6    0.7849    0.6343
4-10-5   0.7619    0.6348
6-5-8    0.7916    0.6355

Table 6
Discriminancy comparison of different pairs of measures on the 12 problems.

Prob.    cen-acc    cen-rci
2-4-3    246.824    19.5
2-5-3    409.582    26.663
2-6-3    625.355    27.36
3-4-5    1202.49    33.123
3-5-5    1120.832   32.297
3-5-7    2475.681   42.837
6-3-6    2384.607   53.525
6-4-5    2258.042   36.664
4-8-4    2869.262   45.695
4-8-6    5069.248   41.207
4-10-5   6844.032   46.681
6-5-8    5812.044   39.408

From the results, we have the following observations:

(1) The measure of confusion entropy remains relatively consistent with accuracy and RCI: higher accuracy or RCI tends to correspond to lower confusion entropy.
(2) The measure of confusion entropy is more discriminating for evaluating classifiers than the other two measures.
(3) As the number of samples grows, the measure of confusion entropy tends to become even more discriminating than the other two measures.

The first two observations indicate that the measure of confusion entropy is statistically superior to the other two measures for the 3-class problems in terms of consistency and discriminancy. The last observation implies that as the number of classes grows from three to larger numbers, the measure of confusion entropy will become even more discriminating; in view of this, we do not report simulation results for problems with larger numbers of classes.

5. Experiments on some real data sets

In this section, we further compare the measure of confusion entropy with accuracy on some real data sets. In real applications, validation techniques should be exploited to evaluate different classification models. For a problem with both training and test data sets, a classifier induced from the training data can be tested on the independent test data; however, few problems supply appropriate independent test data sets for validating the constructed classifiers, so cross validation methods are usually employed when only training data sets are available. There are many research works on cross validation techniques (Bengio & Grandvalet, 2004; Cawley & Talbot, 2003; Kohavi, 1995); k-fold cross validation and leave-one-out cross validation are both pervasively used. In this paper, we employed k-fold cross validation. In the experiments, we ran different classification methods on the training data sets of different problems with different fold numbers. The aim of the experiments is to compare the measures from different aspects, not to compare different cross validation techniques. We also ran the classification methods on the training data sets and then tested the induced classifiers on the test data sets. We used the measures of accuracy and confusion entropy to evaluate the classifiers induced by five popular classification methods implemented in the Weka 3.5.7 package (Witten & Frank, 2005): 'J48' is the Java version of C4.5 (Quinlan, 1986, 1993), 'RF' is RandomForest, 'SC' is SimpleCart, 'NB' is NaiveBayes, and 'RBF' is RBFNetwork. We report here the experimental results of the five methods on 13 multi-class problems from the UCI machine learning repository, each of which has both training and test data sets. A detailed description of the data sets is given in Table 7, where 'Tr-N' is the number of training samples, 'Te-N' the number of test samples, 'Attributes' the number of condition attributes, and 'Classes' the number of classes. For the problem 'horse', some attributes are removed according to the original description of the data set, and attribute 'Outcome' is taken as its class attribute. The accuracies under different fold numbers and the accuracies of the classifiers induced from the training data sets and tested on the test data sets, denoted as Tr/Te, are listed in Tables 8–12.
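The paper's experiments use Weka; as a rough illustration of the same protocol, the sketch below uses scikit-learn stand-ins for some of the learners and evaluates each fold setting with both accuracy and the cen sketch from Section 2 (the data set, the chosen estimators and all names here are our own substitutions, not the paper's setup):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in for a UCI training set
methods = {"J48-like": DecisionTreeClassifier(random_state=0),
           "RF": RandomForestClassifier(random_state=0),
           "NB": GaussianNB()}
for folds in (2, 3, 5, 10):
    for name, clf in methods.items():
        pred = cross_val_predict(clf, X, y, cv=folds)
        M = confusion_matrix(y, pred).tolist()
        print(folds, name,
              round(accuracy_score(y, pred), 4), round(cen(M), 4))
```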

Table 7
The 13 multi-class problems.

Data sets   Tr-N    Te-N    Attributes   Classes
Anneal      798     100     38           6
Hayes       132     28      5            3
Horse       300     68      21           3
Image       210     2100    19           7
Pen         7494    3498    16           10
Soybean     307     376     35           19
Landsat     4435    2000    36           6
Shuttle     43500   14500   9            7
allbp       2800    972     29           3
allhyper    2800    152     29           5
allhypo     2800    972     29           5
allrep      2800    972     29           4
ann         3772    3428    21           3

Table 8
Accuracy of J48 on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.8822   0.8885   0.9148   0.9261   0.95
Hayes      0.6742   0.7121   0.7045   0.7045   0.9286
Horse      0.6589   0.6187   0.6421   0.6455   0.6567
Image      0.881    0.8429   0.881    0.8905   0.91
Pen        0.9513   0.9565   0.9584   0.9617   0.9205
Soybean    0.7752   0.8371   0.8502   0.8339   0.867
Landsat    0.8546   0.8539   0.8627   0.8616   0.8535
Shuttle    0.9994   0.9995   0.9995   0.9997   0.9995
allbp      0.9668   0.9671   0.9725   0.9739   0.9784
allhyper   0.9868   0.9864   0.9868   0.9861   1
allhypo    0.9936   0.9936   0.9939   0.9936   0.9949
allrep     0.9896   0.9871   0.9918   0.9926   0.9907
ann        0.9966   0.9979   0.9976   0.9971   0.9939

Table 9
Accuracy of RandomForest on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.9323   0.9424   0.9424   0.9524   0.98
Hayes      0.7348   0.7879   0.75     0.7424   0.9286
Horse      0.6355   0.6622   0.6622   0.6722   0.7015
Image      0.8905   0.8524   0.9143   0.8952   0.9457
Pen        0.9823   0.9843   0.9853   0.9873   0.9577
Soybean    0.7948   0.8534   0.8632   0.8697   0.9229
Landsat    0.8938   0.8992   0.9017   0.9033   0.9015
Shuttle    0.9996   0.9997   0.9997   0.9998   0.9988
allbp      0.9689   0.97     0.9721   0.9736   0.9753
allhyper   0.9857   0.9825   0.9832   0.9854   0.9934
allhypo    0.9846   0.9846   0.9893   0.99     0.9959
allrep     0.9789   0.9818   0.9839   0.9854   0.9856
ann        0.9913   0.9952   0.9963   0.9936   0.9921

Table 10
Accuracy of SimpleCart on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.8922   0.9085   0.9123   0.9198   0.94
Hayes      0.8485   0.8333   0.8258   0.8182   0.8571
Horse      0.6522   0.6254   0.689    0.679    0.7164
Image      0.8714   0.8333   0.881    0.8714   0.8667
Pen        0.9389   0.9537   0.9589   0.9598   0.9168
Soybean    0.8013   0.8469   0.8665   0.8665   0.891
Landsat    0.8514   0.8564   0.8629   0.8602   0.859
Shuttle    0.9993   0.9993   0.9995   0.9994   0.9998
allbp      0.9636   0.9725   0.9714   0.9725   0.9733
allhyper   0.9864   0.9864   0.9868   0.9864   1
allhypo    0.9921   0.9929   0.9939   0.995    0.9969
allrep     0.9896   0.9904   0.9921   0.9921   0.9897
ann        0.9966   0.9976   0.9976   0.9976   0.9936

Table 11
Accuracy of NaiveBayes on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.787    0.792    0.7907   0.7932   0.79
Hayes      0.8409   0.8409   0.8409   0.8182   0.8929
Horse      0.6689   0.679    0.679    0.679    0.6866
Image      0.781    0.7571   0.7762   0.7762   0.801
Pen        0.879    0.8804   0.8795   0.8798   0.8213
Soybean    0.8632   0.886    0.8958   0.899    0.883
Landsat    0.795    0.7966   0.7948   0.795    0.796
Shuttle    0.919    0.9126   0.9129   0.9163   0.9221
allbp      0.9439   0.9425   0.9407   0.9389   0.9403
allhyper   0.9643   0.9571   0.9586   0.9586   0.9474
allhypo    0.9489   0.95     0.9525   0.9507   0.965
allrep     0.9475   0.9443   0.9414   0.94     0.9424
ann        0.9563   0.9571   0.9541   0.9552   0.9498

Table 12
Accuracy of RBFNetwork on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.896    0.9073   0.9298   0.9223   0.89
Hayes      0.7121   0.7803   0.7879   0.7652   0.9643
Horse      0.6656   0.6656   0.6656   0.6823   0.7164
Image      0.281    0.2714   0.2714   0.281    0.291
Pen        0.9514   0.9568   0.9582   0.9588   0.9174
Soybean    0.8143   0.8143   0.8436   0.8371   0.8431
Landsat    0.8446   0.846    0.8485   0.8492   0.8465
Shuttle    0.9803   0.9808   0.9827   0.9804   0.9817
allbp      0.9507   0.9582   0.9571   0.9546   0.9619
allhyper   0.9718   0.9736   0.9764   0.9771   0.9737
allhypo    0.95     0.9446   0.9511   0.9496   0.9588
allrep     0.9704   0.9704   0.9696   0.9718   0.9619
ann        0.9571   0.9586   0.9618   0.9613   0.9711

In the tables, '2 fs', '3 fs', '5 fs' and '10 fs' indicate the accuracies under 2-, 3-, 5- and 10-fold cross validation, and 'Tr/Te' indicates the Tr/Te accuracies of the classifiers. The confusion entropies of the different methods with different fold numbers and the confusion entropies of Tr/Te are listed in Tables 13–17, with the columns labeled in the same way. The average Tr/Te accuracy of the classifiers induced by J48 is 0.9264, and the average accuracies of the classifiers induced by RandomForest, SimpleCart, NaiveBayes and RBFNetwork are 0.9445, 0.9231, 0.8721 and 0.8675, respectively. From these results, we can roughly partition the five methods into three groups: RandomForest ranks the highest, followed by J48 and SimpleCart, while NaiveBayes and RBFNetwork form the third group.

Table 13
Confusion entropy of J48 on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1557   0.1546   0.1258   0.1122   0.0743
Hayes      0.5173   0.46     0.4691   0.4727   0.1338
Horse      0.4227   0.5605   0.5322   0.5114   0.4432
Image      0.1592   0.1942   0.163    0.1577   0.1293
Pen        0.0864   0.0785   0.0757   0.07     0.1127
Soybean    0.1713   0.1307   0.1119   0.1121   0.0897
Landsat    0.2243   0.2162   0.2132   0.2138   0.2164
Shuttle    0.0016   0.0015   0.0014   0.0011   0.0009
allbp      0.0747   0.0785   0.0697   0.0679   0.0578
allhyper   0.0247   0.0263   0.0263   0.0276   0
allhypo    0.0139   0.0149   0.0138   0.0137   0.013
allrep     0.0198   0.025    0.0183   0.0153   0.0137
ann        0.0141   0.0095   0.0102   0.0122   0.0231

Table 14
Confusion entropy of RandomForest on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1001   0.0897   0.0905   0.0776   0.0352
Hayes      0.4398   0.3931   0.4193   0.4266   0.1341
Horse      0.504    0.4911   0.5073   0.4675   0.4736
Image      0.1404   0.1825   0.1197   0.133    0.0866
Pen        0.0353   0.0317   0.0299   0.0261   0.064
Soybean    0.1605   0.1136   0.1039   0.0971   0.0657
Landsat    0.1671   0.161    0.1561   0.1545   0.1528
Shuttle    0.0009   0.0008   0.0005   0.0004   0.0034
allbp      0.0702   0.07     0.0644   0.063    0.0594
allhyper   0.0264   0.0307   0.0293   0.0263   0.009
allhypo    0.0339   0.0339   0.0249   0.0239   0.011
allrep     0.0392   0.0371   0.0318   0.0309   0.0284
ann        0.0311   0.0196   0.0149   0.0232   0.0258

Table 15
Confusion entropy of SimpleCart on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1457   0.1318   0.1301   0.1222   0.0912
Hayes      0.2533   0.2592   0.2732   0.2807   0.2541
Horse      0.5597   0.5693   0.5083   0.4984   0.4433
Image      0.1723   0.1997   0.1531   0.1658   0.1435
Pen        0.1059   0.082    0.0762   0.074    0.1244
Soybean    0.1479   0.1124   0.0952   0.0989   0.0746
Landsat    0.2228   0.2178   0.2073   0.2117   0.1996
Shuttle    0.002    0.002    0.0013   0.0016   0.0005
allbp      0.083    0.0658   0.0725   0.0677   0.0663
allhyper   0.0266   0.0271   0.0265   0.027    0
allhypo    0.017    0.0156   0.0135   0.0107   0.0081
allrep     0.0208   0.0231   0.0186   0.0191   0.0157
ann        0.0134   0.0098   0.0102   0.0098   0.0218

Table 16
Confusion entropy of NaiveBayes on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.211    0.2033   0.2055   0.2025   0.1928
Hayes      0.3145   0.2575   0.2575   0.2807   0.169
Horse      0.5478   0.5283   0.5205   0.518    0.5031
Image      0.213    0.2226   0.2003   0.2048   0.1709
Pen        0.1442   0.1426   0.143    0.1427   0.1685
Soybean    0.0978   0.0842   0.0788   0.0758   0.0878
Landsat    0.2563   0.2551   0.2563   0.255    0.2575
Shuttle    0.0881   0.091    0.0963   0.0878   0.0831
allbp      0.1163   0.1186   0.122    0.124    0.1079
allhyper   0.0514   0.0595   0.0573   0.0567   0.0484
allhypo    0.0698   0.0671   0.0652   0.0656   0.0541
allrep     0.0857   0.0888   0.0919   0.0941   0.0891
ann        0.0922   0.0936   0.0958   0.0934   0.1032

Table 17
Confusion entropy of RBFNetwork on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1393   0.1285   0.1034   0.1078   0.1027
Hayes      0.465    0.3724   0.3386   0.3872   0.0849
Horse      0.5444   0.5563   0.5421   0.5334   0.4113
Image      0.4734   0.4921   0.539    0.5406   0.4004
Pen        0.0738   0.0678   0.0637   0.0638   0.0955
Soybean    0.1474   0.1435   0.1284   0.1334   0.1177
Landsat    0.2087   0.2089   0.2059   0.2057   0.2091
Shuttle    0.0355   0.0348   0.0324   0.0355   0.0328
allbp      0.104    0.0933   0.0962   0.0988   0.0819
allhyper   0.0479   0.0442   0.0412   0.0405   0.0298
allhypo    0.0717   0.0784   0.0704   0.0721   0.0618
allrep     0.0556   0.051    0.057    0.0544   0.0633
ann        0.0924   0.093    0.0863   0.0889   0.0746

The average Tr/Te confusion entropy of the classifiers induced by J48 is 0.1006, and the average confusion entropies of the classifiers induced by RandomForest, SimpleCart, NaiveBayes and RBFNetwork are 0.0884, 0.1110, 0.1566 and 0.1358, respectively. By this measure we can roughly partition the methods into four or five groups, with RandomForest again ranking the highest of all the methods. The Tr/Te results thus show that the measure of confusion entropy is more discriminating than accuracy for evaluating classifiers.

From the cross validation results, it is hard at first sight to determine which measure is better for evaluating classifiers. The majority of the overall results show that higher accuracy tends to correspond to lower confusion entropy, in accordance with the relation between accuracy and confusion entropy discussed in Section 2. Some of the results exactly showing this observation are pictured in Fig. 1.


Fig. 1. Higher (or lower) accuracy corresponds to lower (or higher) confusion entropy.

In the figure, the left-side vertical axis shows accuracy while the right-side vertical axis shows confusion entropy; the line decorated with squares exhibits the varying trend of accuracy under different fold numbers, and the line with triangles exhibits the varying trend of confusion entropy. The results also contain some cases exhibiting the difference in discriminating power between the measures of accuracy and confusion entropy; some of these cases are pictured in Figs. 2 and 3. Fig. 2 shows one case in which the confusion entropy remains unchanged whereas the accuracy changes when the fold number changes; such a case indicates that the measure of confusion entropy is incapable of differentiating the induced classifiers while the measure of accuracy can tell them apart. Fig. 3 shows some cases opposite to that of Fig. 2: the accuracy remains the same whereas the confusion entropy changes when the fold number changes, which means the measure of accuracy is incapable of differentiating these induced classifiers while the measure of confusion entropy can tell them apart. The results further contain some cases in which the two measures change in the same direction when the fold number changes; such cases are shown in Fig. 4.

Fig. 2. Accuracy changes while confusion entropy remains unchanged.

The cases shown in Fig. 4 indicate that the increase of accuracy does not necessarily result in the decrease of confusion entropy; in other words, the improvement of accuracy does not always result in the improvement of confusion entropy, or vice versa. Finally, Fig. 5 shows cases in which both the accuracy and the confusion entropy remain unchanged under different numbers of folds, implying that no improvement is made on either measure when the fold number changes. Such cases can be viewed as trivial instances of the cases shown in Fig. 1; in fact, they make no contribution to the comparison of the two measures, although they are appropriate for comparing the results of cross validation with different fold numbers to show whether one fold number is superior to the others.

Among all the 65 cases, there are 38, 1, 16, 10 and 3 cases corresponding to Figs. 1–5, respectively; some cases are not shown simply for space considerations. The cases pictured in Fig. 1 show that in over 50% of the cases the measures of accuracy and confusion entropy make the same or similar conclusions about the confusion matrices. The cases shown in Figs. 2 and 3 imply that the measure of confusion entropy is more discriminating than accuracy. The results of Fig. 4 show that in about 15% of the cases the improvement of accuracy does not result in the improvement of confusion entropy. In short, these results show that the measure of confusion entropy is more precise than accuracy for evaluating classifiers.

As aforementioned, it is crucial in some real applications to unfold subtle differences between classifications when the accuracies turn out to be the same or very similar. Among the reported data sets, allbp, allhyper, allhypo, allrep and ann are all data sets about thyroid disease. For clarity, the results on these data sets are shown in Figs. 6 and 7: Fig. 6 shows the 10-fold cross validation results of the five methods on the five training data sets, and Fig. 7 shows the Tr/Te results. As one may notice, except for NB, the four other methods make very similar classifications on each of the five data sets, and the measure of confusion entropy is clearly more capable of discriminating the induced classifiers than the measure of accuracy.

From Figs. 1–5 we may also notice that, for a given problem, the accuracies of the classification methods do not always increase or decrease monotonically when the fold number changes from 2 to 10; similarly, the confusion entropies do not always decrease or increase monotonically. For a given classification method, the accuracies on different data sets likewise do not always increase or decrease when the fold number changes.


Fig. 3. Accuracy remains unchanged while confusion entropy changes.

Fig. 4. Accuracy and confusion entropy change in the same direction.

Fig. 5. Accuracy and confusion entropy remain unchanged.


Fig. 6. The 10-fold cross validation results of five classification methods on the five data sets.

Fig. 7. The Tr/Te results of five classification methods on the five data sets.

A similar observation holds for the measure of confusion entropy. These observations imply that a different fold number may select a different classification model for a given problem. No conclusion can be drawn on how to determine the number of folds for different problems; this may need further experimental and theoretical effort. Research on cross validation techniques can be found in many works, such as Kohavi (1995), Cawley and Talbot (2003) and Bengio and Grandvalet (2004).

6. Conclusions

In this paper, we propose a new measure based upon the concept of entropy for evaluating classifier performances.

By exploiting the misclassification information of confusion matrices, the measure evaluates the confusion level of the class distribution of misclassified samples. Both theoretical analysis and statistical results show that the proposed measure is more discriminating than accuracy and RCI while remaining relatively consistent with the two measures; moreover, it is more capable of measuring how well the samples of different classes have been separated from each other. Hence the proposed measure is more precise than the two measures and can substitute for them in evaluating classifiers in classification applications. The results on some benchmark data sets from the UCI machine learning repository further confirm the theoretical analysis and statistical results and demonstrate that the proposed measure is capable of evaluating classifier performances in real applications.


Acknowledgments

This work was supported by the Science Foundation of Jilin Province Grant No. 20040529, the National 863 High Technology Research and Development Program of China Grant No. 2009AA01Z152, and the National Natural Science Foundation of China Grants No. 60703013 and 10978011.

References

Abramson, N. (1963). Information theory and coding. New York: McGraw-Hill.
Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5, 1089–1105.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.
Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36, 2585–2592.
Edwards, E., Metz, C., & Nishikawa, R. (2005). The hypervolume under the ROC hypersurface of 'near-guessing' and 'near-perfect' observers in N-class classification tasks. IEEE Transactions on Medical Imaging, 24, 293–299.
Everson, R. M., & Fieldsend, J. E. (2006). Multi-class ROC analysis from a multi-objective optimization perspective. Pattern Recognition Letters, 27, 918–927.
Ferri, C., Hernandez-Orallo, J., & Salido, M. (2003). Volume under the ROC surface for multi-class problems. In Proceedings of the 14th European conference on machine learning (pp. 108–120).
Freitas, C. O. A., de Carvalho, J. M., & Oliveira, J. J., Jr., et al. (2007). Confusion matrix disagreement for multiple classifiers. In CIARP 2007, LNCS 4756 (pp. 387–396).
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17, 299–310.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th international joint conference on artificial intelligence (pp. 1137–1143). Morgan Kaufmann.
Kreuz, T., Haas, J. S., Morelli, A., Abarbanel, H. D. I., & Politi, A. (2007). Measuring spike train synchrony. Journal of Neuroscience Methods, 165, 151–161.
Landgrebe, T., & Duin, R. (2006). A simplified extension of the area under the ROC to the multiclass domain. In Proceedings of the 17th annual symposium of the Pattern Recognition Association of South Africa.
Landgrebe, T. C. W., & Duin, R. P. W. (2008). Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 810–822.
Metz, C. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283–298.
Provost, F., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the third international conference on knowledge discovery and data mining (pp. 43–48). Menlo Park, CA: AAAI Press.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 3, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Sindhwani, V., Bhattacharge, P., & Rakshit, S. (2001). Information theoretic feature crediting in multiclass support vector machines. In First SIAM international conference on data mining (ICDM'01), Chicago, IL, April 5–7.
Srinivasan, A. (1999). Note on the location of optimal classifiers in N-dimensional ROC space. Technical Report PRG-TR-2-99, Computing Laboratory, Oxford University.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 25, 631–643.
Swets, J. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
van Son, R. J. J. H. (1994). A method to quantify the error distribution in confusion matrices. In Proceedings, Institute of Phonetic Sciences, University of Amsterdam (Vol. 18, pp. 41–63).
van Son, R. J. J. H. (1995). The relation between the error distribution and the error rate in identification experiments. In Proceedings, Institute of Phonetic Sciences, University of Amsterdam (Vol. 19, pp. 71–82).
Victor, J., & Purpura, K. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. Journal of Neurophysiology, 76, 1310–1326.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.