Expert Systems with Applications 37 (2010) 3799–3809


A novel measure for evaluating classifiers

Jin-Mao Wei (a,*), Xiao-Jie Yuan (a), Qing-Hua Hu (b), Shu-Qin Wang (c)

a Department of Computer Science, Nankai University, Tianjin 300071, China
b Harbin Institute of Technology, Harbin, China
c College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China

* Corresponding author. E-mail address: [email protected] (J.-M. Wei).

Keywords: Performance evaluation; Entropy; Accuracy; Classification

Abstract

Evaluating classifier performances is a crucial problem in pattern recognition and machine learning. In this paper, we propose a new measure, confusion entropy, for evaluating classifiers. For each class $cl_i$ of an $(N+1)$-class problem, the misclassification information involves both the information of how the samples with true class label $cl_i$ have been misclassified to the other $N$ classes and the information of how the samples of the other $N$ classes have been misclassified to class $cl_i$. The proposed measure exploits the class distribution information of such misclassifications of all classes. Both theoretical analysis and statistical experiments show that the proposed measure is more precise than accuracy and RCI. Experimental results on some benchmark data sets further confirm the theoretical analysis and statistical results and show that the new measure is feasible for evaluating classifier performances.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Evaluating classifier performances is a crucial step in classification. For a classification problem, one needs a metric to evaluate which of the classifiers induced by various methods is superior to the others. The measure of accuracy has been used for decades for this purpose. Although classification accuracy seems to be the most logical way of measuring classifier performances, it may turn out to be inefficient in some classification cases. It has been criticized in recent years because the costs of misclassification cannot be appropriately taken into consideration (Everson & Fieldsend, 2006; Hand & Till, 2001). To make up for this inefficiency, some methods and measures have been suggested. ROC analysis, i.e. the receiver operating characteristic, is one of the most recommended methods for evaluating classifier performances. Since its introduction, the ROC method has been widely studied and used, especially in medical diagnosis (Bradley, 1997; Hanley & McNeil, 1982; Metz, 1978; Swets, 1988). ROC analysis provides a convenient graphical display of the trade-off between true and false positive classification rates for two-class problems (Provost & Fawcett, 1997). The area under the ROC curve, i.e. AUC, has become an important performance measure (Bradley, 1997; Landgrebe & Duin, 2008); unlike ROC analysis, it is invariant to operating conditions. ROC analysis and AUC are powerful for comparing competing classification models over a range of operating conditions. Since they were originally introduced for two-class problems, much effort has been devoted in recent years to extending them to multi-class problems (Edwards, Metz, & Nishikawa, 2005; Ferri, Hernandez-Orallo, & Salido, 2003; Landgrebe & Duin, 2006, 2008; Srinivasan, 1999).

On the other hand, classification accuracy is also inefficient for unfolding subtle differences between classifiers, which is the main concern of this paper. For a classification problem, various classifiers induced by different methods under different conditions may sometimes obtain very similar or even the same classification accuracy. When such cases occur, the accuracy cannot reveal the classification details of how the samples of each class are separated from those of the others. It may be crucial to discriminate such cases in some real applications. For example, in medical diagnosis, it is usually of vital importance that samples of different types of diseases are separated from each other and from normal ones by classification rules. With such considerations, the authors of Sindhwani, Bhattacharge, and Rakshit (2001) proposed a measure called relative classifier information, RCI in short, for evaluating classifier performances. The measure of RCI can evaluate how well a classifier discriminates different classes from the results of class predictions (Sindhwani et al., 2001). In Statnikov, Aliferis, Tsamardinos, Hardin, and Levy (2005), the authors used the measure of RCI to evaluate classifier performances.

In this paper, we propose a new measure for performance evaluation. The proposed measure, confusion entropy, is designed directly for multi-class problems. It tries to make thorough use of the information in confusion matrices: it takes into account the class distribution information of all off-diagonal elements of confusion matrices, which convey the misclassification information. In the following sections, we


will first define confusion entropy. We will then compare the proposed measure with accuracy and RCI for classifier performance evaluation.

2. Confusion entropy for evaluating classifiers

Given a data set and a classifier induced from it, the corresponding confusion matrix summarizes the classification results of the samples in the data set. By checking a row $cl_i$, the elements $C_{i,k}$, $k = 1, 2, \ldots$, indicate how many samples with true class label $cl_i$ have been classified to each of the classes. By checking a column $cl_j$, the elements $C_{k,j}$, $k = 1, 2, \ldots$, indicate how many samples of each class have been classified to class $cl_j$. The diagonal elements indicate how many samples have been correctly classified; the off-diagonal elements indicate how many samples have been misclassified. Table 1 shows an example of a confusion matrix.

Table 1
A confusion matrix of a four-class problem.

        cl1   cl2   cl3   cl4
cl1      50    10    10    30
cl2      10    90     1     0
cl3      10     0    50    42
cl4       0    10    10    83

For an $(N+1)$-class problem, the misclassification probability of classifying the samples of class $cl_i$ to class $cl_j$ subject to class $cl_j$, denoted as $P^{j}_{i,j}$, is defined as:

$$P^{j}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N+1}\left(C_{j,k} + C_{k,j}\right)}, \qquad i \neq j,\; i, j = 1, \ldots, N+1 \tag{1}$$

The misclassification probability of classifying the samples of class $cl_i$ to class $cl_j$ subject to class $cl_i$, denoted as $P^{i}_{i,j}$, is defined as:

$$P^{i}_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N+1}\left(C_{i,k} + C_{k,i}\right)}, \qquad i \neq j,\; i, j = 1, \ldots, N+1 \tag{2}$$

$$P^{i}_{i,i} = 0, \qquad i = 1, \ldots, N+1 \tag{3}$$

The confusion entropy of class $cl_j$, $CEN_j$ in short, is defined as:

$$CEN_j = -\sum_{k=1, k \neq j}^{N+1} \left( P^{j}_{j,k} \log_{2N} P^{j}_{j,k} + P^{j}_{k,j} \log_{2N} P^{j}_{k,j} \right) \tag{4}$$

In the equation, if $P^{j}_{j,k} = 0$ then $P^{j}_{j,k} \log_{2N} P^{j}_{j,k} = 0$, and if $P^{j}_{k,j} = 0$ then $P^{j}_{k,j} \log_{2N} P^{j}_{k,j} = 0$. The overall confusion entropy of a confusion matrix, CEN in short, is defined as:

$$CEN = \sum_{j} P_j \, CEN_j, \tag{5}$$

where

$$P_j = \frac{\sum_{k}\left(C_{j,k} + C_{k,j}\right)}{2\sum_{k,l} C_{k,l}}, \tag{6}$$

which can be called the confusion probability of class $cl_j$.
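For concreteness, Eqs. (1)-(6) can be implemented in a few lines. The following is a minimal Python sketch (the function name cen and the list-of-rows representation are our own choices, not part of the original paper):

```python
import math

def cen(C):
    """Overall confusion entropy (CEN) of a confusion matrix, per Eqs. (1)-(6).

    C is a list of rows: C[i][j] counts samples of true class i assigned
    to class j. For an (N+1)-class problem the logarithm base is 2N.
    """
    m = len(C)                       # m = N + 1 classes
    base = 2 * (m - 1)               # logarithm base 2N
    total = sum(sum(row) for row in C)

    def plog(p):                     # convention: 0 * log 0 = 0
        return p * math.log(p, base) if p > 0 else 0.0

    overall = 0.0
    for j in range(m):
        # Denominator shared by Eqs. (1) and (2) for class j:
        # every count in row j plus every count in column j.
        s_j = sum(C[j][k] + C[k][j] for k in range(m))
        cen_j = 0.0
        for k in range(m):
            if k == j or s_j == 0:
                continue
            cen_j -= plog(C[j][k] / s_j)   # row term    P^j_{j,k}, Eq. (2)
            cen_j -= plog(C[k][j] / s_j)   # column term P^j_{k,j}, Eq. (1)
        overall += (s_j / (2 * total)) * cen_j   # P_j * CEN_j, Eqs. (5)-(6)
    return overall
```

Applied to the extreme cases of Section 4, this sketch reproduces the CEN values reported there, which gives some confidence that it matches the definitions.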

To understand the proposed measure, we further analyze the above definitions. From Eqs. (1) and (2) one can see that the misclassification probability is defined subject to a certain class. For a class $cl_j$, the row and column $cl_j$ of the confusion matrix involve all the classification information about this class. Except for $C_{j,j}$, the row elements $C_{j,i}$ ($i = 1, 2, \ldots$) convey the misclassification information of how many samples of class $cl_j$ have been misclassified to class $cl_i$, and the column elements $C_{i,j}$ ($i = 1, 2, \ldots$) indicate how many samples of class $cl_i$ have been misclassified to class $cl_j$. Hence, an element $C_{i,j}$ will be used twice, for computing the misclassification probabilities subject to class $cl_i$ and to class $cl_j$: $P^{j}_{i,j}$ describes how the samples of each other class $cl_i$ have been misclassified to class $cl_j$, while $P^{i}_{i,j}$ describes how the samples of class $cl_i$ have been misclassified to each other class $cl_j$. Apparently, we have

$$\sum_{k=1, k \neq j}^{N+1} \left( P^{j}_{k,j} + P^{j}_{j,k} \right) = 1 - \frac{2 C_{j,j}}{\sum_{k=1}^{N+1}\left(C_{k,j} + C_{j,k}\right)} \tag{7}$$

The above equation indicates that the overall misclassification probability subject to class $cl_j$ takes the value of 1 only when $C_{j,j} = 0$. On the other hand, the overall misclassification probability takes the value of 0 if

$$2 C_{j,j} = \sum_{k=1}^{N+1}\left(C_{k,j} + C_{j,k}\right) \tag{8}$$

Eq. (8) can be further rewritten as

$$2 C_{j,j} = 2 C_{j,j} + \sum_{k=1, k \neq j}^{N+1}\left(C_{k,j} + C_{j,k}\right) \tag{9}$$

The equation implies that the overall misclassification probability subject to class $cl_j$ takes the value of 0 only when $C_{k,j}$ and $C_{j,k}$ take the value of 0 for all $k \neq j$. That is,

$$\sum_{k=1, k \neq j}^{N+1}\left(C_{k,j} + C_{j,k}\right) = 0 \tag{10}$$

Apparently, when this case occurs, all samples of class $cl_j$ have been correctly classified and no sample of any other class has been misclassified to class $cl_j$.

The two definitions and the above analysis are essential for understanding the computation of the confusion entropy $CEN_j$, which is computed from the misclassification probabilities subject to class $cl_j$. First of all, it should be noticed that in Eq. (4) the base of the logarithm is $2N$. For an $(N+1)$-class problem, there are $N+1$ rows and $N+1$ columns in the corresponding confusion matrix. For class $cl_i$, the elements in the row and column of $cl_i$, i.e. $C_{i,k}$ and $C_{k,i}$, except for $C_{i,i}$, indicate how the samples are misclassified. That is, the samples with true class label $cl_i$ may be misclassified only to the $N$ other classes, and only the samples of the $N$ other classes may be misclassified to class $cl_i$. Therefore there are $2N$ kinds of possible misclassifications for each class of the $(N+1)$-class problem. Take Table 1 as an example. For class $cl_2$, the elements in row $cl_2$ and column $cl_2$ except for $C_{2,2}$ convey the misclassification information. The elements $C_{2,1}$, $C_{2,3}$ and $C_{2,4}$ indicate how many samples with true class label $cl_2$ have been misclassified to classes $cl_1$, $cl_3$ and $cl_4$; the elements $C_{1,2}$, $C_{3,2}$ and $C_{4,2}$ indicate how many samples of classes $cl_1$, $cl_3$ and $cl_4$ have been misclassified to class $cl_2$. Hence the base of the logarithm is 6 (not 4 or 8) for computing the confusion entropy of class $cl_2$ and of all the other classes. Eq. (4) can be simply rewritten as

$$CEN_j = -\sum_{k=1, k \neq j}^{N+1} P^{j}_{j,k} \log_{2N} P^{j}_{j,k} \;-\; \sum_{k=1, k \neq j}^{N+1} P^{j}_{k,j} \log_{2N} P^{j}_{k,j} \tag{11}$$

The first item on the right side of the equation can be taken as the confusion entropy corresponding to row $cl_j$; the second item can be taken as the confusion entropy corresponding to column $cl_j$. The first item takes the value of 0 if no sample of class $cl_j$ has been misclassified, and the second item takes the value of 0 if no sample of the other classes has been misclassified to class $cl_j$. Apparently, $CEN_j$ takes the value of 0 if all the samples of class $cl_j$ have been correctly classified and no sample of the other classes has been misclassified to class $cl_j$. On the other hand, the first item will take


the largest value if all the samples of class $cl_j$ have been uniformly misclassified to all the other classes, and the second item will take the largest value if the samples that have been misclassified to class $cl_j$ come uniformly from all the other classes. When all the samples of class $cl_j$ are uniformly misclassified to all the other $N$ classes and the same number of samples misclassified to class $cl_j$ come uniformly from all the other $N$ classes, $CEN_j$ takes its largest value of 1. In other words, $CEN_j$ takes the value of 1 if $C_{j,j} = 0$ and $C_{k1,j} = C_{j,k2}$ for all $k1, k2 \neq j$.

The overall confusion entropy CEN can be taken as the weighted sum of the confusion entropies of all classes. For the $P_j$ we have

$$\sum_{j} P_j = 1 \tag{12}$$

This can be easily obtained from the definition of $P_j$. Eq. (6) can be simply rewritten as

$$P_j = \frac{\sum_{k} C_{j,k}}{2\sum_{k,l} C_{k,l}} + \frac{\sum_{k} C_{k,j}}{2\sum_{k,l} C_{k,l}} \tag{13}$$

On the right side of the equation, the first item involves the row information of class $cl_j$ and the second item involves the column information of class $cl_j$. If the number of samples with true class label $cl_j$ is small, the value of $P_j$ tends to be small; if the number of samples misclassified to class $cl_j$ is small, the value of $P_j$ also tends to be small. The classification information of class $cl_j$ involves these two kinds of information, and $P_j$ is defined to describe the proportion of the classification information of class $cl_j$ in the whole classification result. It can be easily accepted that the classification result of the majority class will mainly affect the overall classification result, while classes with smaller numbers of samples will make less contribution. Consequently, if $P_j$ is larger than $P_i$, the classification information of class $cl_j$ contributes more to the overall confusion entropy than that of class $cl_i$.

From the above analysis, we have two observations about confusion entropy. One is that the measure only computes the entropy of misclassified samples; the samples that are correctly classified make no direct contribution to the calculation of the measure. Consequently, the smaller the number of misclassified samples is, the smaller the value of confusion entropy tends to be. The best case of classification is that all samples are correctly classified, i.e. all off-diagonal elements take the value of 0. The value of confusion entropy is zero in this case, and from the definitions this is the unique case in which confusion entropy takes the value of zero. This also indicates that the proposed measure tends to take a small value when the value of accuracy is large; in other words, the two measures tend to keep 'consistent' with each other. The other observation is that the more balanced the class distribution of misclassified samples is, the larger the value of confusion entropy tends to be. The worst case of classification is that no sample has been correctly classified, i.e. all diagonal elements of the confusion matrix take the value of zero, while all other elements take the same value; the value of confusion entropy in this case is 1, which can be easily computed from the above definitions. By further investigation we may find that, when the sum of the diagonal elements remains the same, the off-diagonal elements may take various values, which results in different values of confusion entropy. This apparently indicates that the measure of confusion entropy is more discriminating than accuracy. Moreover, the proposed measure is capable of measuring how well the samples of different classes have been separated from each other: we argue that the samples of different classes are well separated from each other if the class distribution of misclassified samples is imbalanced. The above analysis suggests that the measure of confusion entropy potentially has its merits in classifications.
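The two boundary cases just described can be checked numerically with the cen sketch above (the matrices below are our own illustrations; in the worst case every misclassification probability equals $1/2N$, so each $CEN_j$ is exactly 1):

```python
# Best case: all samples on the diagonal -> CEN = 0.
best = [[25 if i == j else 0 for j in range(4)] for i in range(4)]
# Worst case: empty diagonal, uniform off-diagonal counts -> CEN = 1.
worst = [[0 if i == j else 5 for j in range(4)] for i in range(4)]
print(cen(best), cen(worst))   # 0.0 1.0 (up to floating-point rounding)
```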

Yet it is still insufficient to determine whether and to what degree the measure of confusion entropy is superior for evaluating classifiers. In the following sections, we compare the proposed measure with accuracy and RCI, and determine in terms of consistency and discriminancy which measure is superior for evaluating classifier performances.

3. Some measures defined on confusion matrix

The concept of the confusion matrix is well known and has been widely exploited in related studies, and some measures have been defined on the confusion matrix for different purposes. The accuracy of a classifier can be easily obtained from its corresponding confusion matrix. In their work (Freitas, de Carvalho, & Oliveira, 2007), the authors defined a simple measure on the confusion matrix, the global performance index, for evaluating the performances of ensemble classifiers. The global performance index of a classifier for an N-class problem was defined as:

$$RR = \frac{1}{N} \sum_{i=j=1}^{N} C_{i,j}, \tag{14}$$

where $C_{i,j}$ is an element of the $N \times N$ confusion matrix. By computing the distances between each classifier and all the others, a distance-based disagreement was obtained to express the disagreement of the classifiers in an ensemble. In van Son (1994, 1995), the author defined the entropy of a confusion matrix for speech research as:

$$H_{CM} = -\sum_{i=1}^{I} \sum_{j=1}^{J} p_{i,j} \log_2 p_{i,j}, \tag{15}$$

where $I$ is the number of stimulus classes, $J$ the number of response categories, and $p_{i,j}$ the probability of response $j$ to stimulus $i$. Interestingly, in their works (Kreuz, Haas, Morelli, Abarbanel, & Politi, 2007; Victor & Purpura, 1996), the authors discussed measuring response clustering in neuronal response studies in terms of transmitted information. The measure they used, introduced in Abramson (1963), was defined as:

$$H = \frac{1}{C} \sum_{i} \sum_{j} C_{i,j} \left[ \log C_{i,j} - \log \sum_{j} C_{i,j} - \log \sum_{i} C_{i,j} + \log C \right] \tag{16}$$

$H$ was called the transmitted information of the classifier. The above three measures were all defined on the confusion matrix. Without investigating the original intentions of their works, it seems that the measures except for the first one can be exploited for evaluating classifiers. The values of the second and third measures reach their largest and smallest values, respectively, when all elements of a confusion matrix are distributed uniformly, i.e. each element takes the value of $\frac{C}{N^2}$, where $C$ is the total number of samples and $N$ the number of classes. This is apparently different from the behavior of the measure of confusion entropy: for the worst case of classification, the measure of confusion entropy takes its largest value when all diagonal elements take 0 and all off-diagonal elements take the value of $\frac{C}{N(N-1)}$ for $N$-class problems ($\frac{C}{N(N+1)}$ for $(N+1)$-class problems). Furthermore, the former two measures take their worst values while the corresponding accuracy does not take the worst value of zero; from another angle, if the accuracy is zero, the two measures do not take their worst values. This is intuitively hard to accept, and it further implies that much inconsistency will be observed for the cases between the two extremes. On the other hand, the value of $H_{CM}$ reaches its smallest value when the samples of each class are classified to only one class, which is not necessarily the correct class. If all samples are correctly classified, $H_{CM}$ takes its best value. However, it will


also take the best value if the samples of each class are classified to a unique wrong class. The measure of $H$ reaches its largest value of $\log N$ when the samples of each class are classified to only one class while no sample of the other classes is classified to the same class. As one may notice, for the best cases of classification the two measures also differ from the measure of confusion entropy, which takes its smallest (best) value of 0 only when all samples have been correctly classified. In the following section, we mainly compare the proposed measure with accuracy and RCI, which are used for evaluating classifiers.
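The degenerate behavior described above is easy to reproduce. The sketch below (our own illustration; the helper names are ours, and cen is the Section 2 sketch) computes $H_{CM}$ and $H$ for a perfect two-class classifier and for one that sends every sample to the unique wrong class; both measures score the two matrices identically, whereas confusion entropy yields 0 and 1:

```python
import math

def _plog2(p):
    # Convention: 0 * log 0 = 0
    return p * math.log2(p) if p > 0 else 0.0

def hcm(C):
    """Confusion-matrix entropy of van Son (1994), Eq. (15)."""
    total = sum(map(sum, C))
    return -sum(_plog2(v / total) for row in C for v in row)

def transmitted_info(C):
    """Transmitted information of Abramson (1963), Eq. (16), in bits."""
    total = sum(map(sum, C))
    rows = [sum(r) for r in C]
    cols = [sum(r[j] for r in C) for j in range(len(C[0]))]
    h = sum(C[i][j] * (math.log2(C[i][j]) - math.log2(rows[i])
                       - math.log2(cols[j]) + math.log2(total))
            for i in range(len(C)) for j in range(len(C[0])) if C[i][j] > 0)
    return h / total

perfect = [[25, 0], [0, 25]]
swapped = [[0, 25], [25, 0]]   # every sample sent to the unique wrong class
print(hcm(perfect), hcm(swapped))                            # 1.0 1.0
print(transmitted_info(perfect), transmitted_info(swapped))  # 1.0 1.0
print(cen(perfect), cen(swapped))                            # 0.0 1.0
```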

4. Comparisons

In this section, we first compare the measure of confusion entropy with accuracy and with RCI on some extreme cases. Subsequently, we compare the measures statistically to reach some conclusions about the proposed measure.

4.1. Confusion entropy versus accuracy

As is well known, the value of accuracy can be computed directly from confusion matrices. When the diagonal elements of a matrix are known, the accuracy is uniquely determined no matter what values the off-diagonal elements take, whereas the value of confusion entropy changes in such cases. We use the extreme cases shown in Tables 2 and 3 to show the difference between the two measures. The accuracy is 0.7 for each of the four cases, while the confusion entropies for the four cases are 0.2299, 0.105, 0.2825 and 0.1774, respectively. The results show that the accuracy does not differentiate the four cases, while the confusion entropy can clearly tell them apart. It is easy to accept that case b ranks the highest, because all samples of the majority class are correctly classified and no sample of the other classes is misclassified to this class. It is worth noticing that case d ranks second: compared with the other two cases, in case d the samples of two classes are correctly classified. It may be a little more difficult to compare cases a and c. Case a is a trivial classification result, while case c involves more misclassification information than case a. From these cases, we find that the measure of confusion entropy prefers the cases, such as case a, with less misclassification information. Moreover, one may notice that it is worthwhile to measure the differences between classifiers especially when the accuracy cannot differentiate them.

Table 2
Two extreme cases a and b.

(a)                               (b)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1    70    0    0    0          cl1    70    0    0    0
cl2     5    0    0    0          cl2     0    0    5    0
cl3    10    0    0    0          cl3     0    0    0   10
cl4    15    0    0    0          cl4     0   15    0    0

Table 3
Two extreme cases c and d.

(c)                               (d)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1    50    5   15    0          cl1    50   20    0    0
cl2     5    0    0    0          cl2     0    5    0    0
cl3     0    0   10    0          cl3     0   10    0    0
cl4     0    5    0   10          cl4     0    0    0   15
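The four values above can be reproduced with the cen sketch from Section 2 (the dictionary layout below is our own; rows are true classes):

```python
cases = {
    "a": [[70, 0, 0, 0], [5, 0, 0, 0], [10, 0, 0, 0], [15, 0, 0, 0]],
    "b": [[70, 0, 0, 0], [0, 0, 5, 0], [0, 0, 0, 10], [0, 15, 0, 0]],
    "c": [[50, 5, 15, 0], [5, 0, 0, 0], [0, 0, 10, 0], [0, 5, 0, 10]],
    "d": [[50, 20, 0, 0], [0, 5, 0, 0], [0, 10, 0, 0], [0, 0, 0, 15]],
}
for name, m in cases.items():
    acc = sum(m[i][i] for i in range(4)) / sum(map(sum, m))
    print(name, acc, round(cen(m), 4))
# a 0.7 0.2299 | b 0.7 0.105 | c 0.7 0.2825 | d 0.7 0.1774
```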

4.2. Confusion entropy versus RCI

In Sindhwani et al. (2001), the authors proposed another measure, called relative classifier information (RCI in short), for evaluating classifier performances. It was introduced in view of the observation that classification accuracy is incapable of unfolding how the samples of different classes have been separated from each other. The measure was also defined on the confusion matrix and was designed directly for multi-class problems. For comparison, we first review the definition of the measure. Given a confusion matrix, each element $C_{i,j}$ of the matrix is the number of samples with true class label $cl_i$ that have been classified to class $cl_j$. The uncertainty of the problem was defined as (here $N$ denotes the total number of samples):

$$H_d = -\sum_{i} \frac{\sum_{l} C_{i,l}}{N} \log \frac{\sum_{l} C_{i,l}}{N} \tag{17}$$

The overall uncertainty after observation was defined as:

$$H_O = \sum_{j} \frac{\sum_{k} C_{k,j}}{N} H_{o_j}, \tag{18}$$

where

$$H_{o_j} = -\sum_{i} \frac{C_{i,j}}{\sum_{k} C_{k,j}} \log \frac{C_{i,j}}{\sum_{k} C_{k,j}}, \tag{19}$$

which is called the uncertainty after observing output $cl_j$. The amount of uncertainty removed by the classifier was therefore:

$$H_c = H_d - H_O \tag{20}$$

$H_c / H_d$ was called the relative classifier information. From the definition, we notice that $H_{o_j}$ computes the entropy of the class distribution of the samples observed in the output of class $cl_j$. If the observed samples come uniquely from one class, $H_{o_j}$ takes the value of 0. Consequently, the measure of RCI takes its largest value of 1 when the samples of each class are classified to one class and no sample of the other classes is classified to this class; on the other hand, it takes its smallest value of 0 when the samples of each class are classified uniformly to all classes. It is easy to compute the values of RCI for the four cases shown in Tables 2 and 3, and it can be seen that RCI is also able to tell the four cases apart. Nevertheless, compared with the computation of confusion entropy, we find that the measure of RCI only takes into account the column information of a confusion matrix and ignores the row information, which indicates how the samples of one class have been misclassified to other classes. The two extreme cases shown in Table 4 are used to compare the measure of RCI with confusion entropy. RCI takes the value of 1 for cases e and f, for case b in Table 2, and also for the case in which all samples are correctly classified. The values of confusion entropy for cases e and f are 0.2744 and 0.2416, respectively. It is certainly acceptable that case b is better than case f and case f is better than case e.

Table 4
Two extreme cases e and f.

(e)                               (f)
      cl1  cl2  cl3  cl4                cl1  cl2  cl3  cl4
cl1     0    0    0   70          cl1     0    0    0   70
cl2     0    0    5    0          cl2     0    5    0    0
cl3     0   10    0    0          cl3    10    0    0    0
cl4    15    0    0    0          cl4     0    0   15    0
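Eqs. (17)-(20) are likewise straightforward to implement. A minimal sketch follows (the function name rci is ours, and cen from the Section 2 sketch is assumed to be in scope; the base of the logarithm cancels in the ratio $H_c/H_d$):

```python
import math

def rci(C):
    """Relative classifier information of Sindhwani et al. (2001), Eqs. (17)-(20)."""
    m = len(C)
    n = sum(map(sum, C))                       # N: total number of samples

    def h(ps):                                 # entropy with 0 * log 0 = 0
        return -sum(p * math.log2(p) for p in ps if p > 0)

    hd = h(sum(row) / n for row in C)          # Eq. (17)
    ho = 0.0
    for j in range(m):
        col = sum(C[i][j] for i in range(m))   # samples observed as output cl_j
        if col > 0:                            # Eqs. (18) and (19)
            ho += (col / n) * h(C[i][j] / col for i in range(m))
    return (hd - ho) / hd                      # Eq. (20)

case_e = [[0, 0, 0, 70], [0, 0, 5, 0], [0, 10, 0, 0], [15, 0, 0, 0]]
case_f = [[0, 0, 0, 70], [0, 5, 0, 0], [10, 0, 0, 0], [0, 0, 15, 0]]
print(rci(case_e), rci(case_f))                     # 1.0 1.0: RCI cannot separate e and f
print(round(cen(case_e), 4), round(cen(case_f), 4)) # 0.2744 0.2416
```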

4.3. Statistical comparisons on synthetic data

Although the above extreme examples show that confusion entropy is more competent than accuracy and RCI in some cases, we still need to examine whether the proposed measure is superior for evaluating classifiers in general. In this section, we compare the measures statistically in terms of the consistency and discriminancy criteria introduced in Huang and Ling (2005). The original definitions can be found in their paper; for understanding the meanings of the criteria, we review the definitions of the degree of consistency and the degree of discriminancy.

Degree of consistency (Huang & Ling, 2005): for two measures $f$ and $g$ on domain $\Psi$, let $R = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) > g(b)\}$ and $S = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) < g(b)\}$. The degree of consistency of $f$ and $g$ is $C = \frac{|R|}{|R| + |S|}$, with $0 \le C \le 1$.

Degree of discriminancy (Huang & Ling, 2005): for two measures $f$ and $g$ on domain $\Psi$, let $P = \{(a,b) \mid a, b \in \Psi, f(a) > f(b), g(a) = g(b)\}$ and $Q = \{(a,b) \mid a, b \in \Psi, g(a) > g(b), f(a) = f(b)\}$. The degree of discriminancy of $f$ over $g$ is $D = \frac{|P|}{|Q|}$.

They argued (Huang & Ling, 2005) that if $C > 0.5$ and $D > 1$, the measure $f$ can be determined to be superior to $g$. The criterion of consistency is used to determine statistically to what degree the two measures $f$ and $g$ make the same or similar conclusions about the classification results within a domain; in other words, it determines to what degree the two measures simultaneously suggest that the classification result with respect to one confusion matrix is better or worse than that with respect to another. Measure $f$ is strictly consistent with measure $g$ if $f$ makes the same conclusion in all possible cases in which $g$ determines the result of one confusion matrix to be better or worse than that of another. The degree of consistency $C = 1$ when the two measures are strictly consistent with each other, and $C > 0.5$ means that the two measures are consistent in over 50% of the cases. For computing the degree of consistency of accuracy and confusion entropy with respect to a certain problem, we first enumerate all of its possible confusion matrices. For each pair of confusion matrices $a$ and $b$, we compute the accuracy and confusion entropy of both. If the accuracy of $a$ is greater (smaller) than that of $b$ and the confusion entropy of $a$ is smaller (greater) than that of $b$, i.e. the two measures both suggest that $a$ is better than $b$ (or that $b$ is better than $a$), the pair $(a,b)$ is put into set $R$; otherwise the pair is put into $S$. After investigating all possible pairs of confusion matrices of the problem, we can compute the degree of consistency of accuracy and confusion entropy.

The criterion of discriminancy is used to determine to what degree one measure is more discriminating than the other. The degree of discriminancy $D > 1$ if measure $f$ can discriminate more classification cases than measure $g$. In fact, the two criteria can be used to determine whether one measure is more precise than another: if measure $f$ is consistent with measure $g$ in the majority of cases while $f$ can discriminate more cases than $g$, then $f$ can be said to be more precise, and consequently better, than $g$. For statistically comparing two measures in terms of consistency and discriminancy for a problem, we need to exhaustively enumerate all possible confusion matrices, which is of course very time-consuming.
For an $N$-class problem with $C_1$ to $C_N$ samples in the respective classes, the total number of possible confusion matrices is:

$$\prod_{j=1}^{N} \frac{\prod_{i=1}^{N-1} (C_j + i)}{(N-1)!} \tag{21}$$
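Each factor in Eq. (21) is the number of ways $\binom{C_j + N - 1}{N - 1}$ of distributing the $C_j$ samples of class $cl_j$ over the $N$ predicted classes. The enumeration and pair counting can be sketched as follows (our own illustration of the protocol, not the authors' code; it is only practical for tiny problems, and it assumes both denominators are non-zero):

```python
from itertools import combinations, product

def compositions(n, k):
    """All ways to split n samples over k ordered predicted classes."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def degrees(class_sizes, f, g):
    """Degrees of consistency C and discriminancy D of f over g
    (Huang & Ling, 2005). Both f and g are oriented 'higher is better',
    so pass e.g. lambda M: -cen(M) for confusion entropy."""
    k = len(class_sizes)
    rows = [list(compositions(n, k)) for n in class_sizes]
    vals = [(f(list(m)), g(list(m))) for m in product(*rows)]
    r = s = p = q = 0
    for (fa, ga), (fb, gb) in combinations(vals, 2):
        df, dg = fa - fb, ga - gb
        if df * dg > 0: r += 1            # same ordering: consistent pair
        elif df * dg < 0: s += 1          # opposite ordering: inconsistent pair
        elif df != 0 and dg == 0: p += 1  # only f discriminates the pair
        elif dg != 0 and df == 0: q += 1  # only g discriminates the pair
    return r / (r + s), p / q

accuracy = lambda M: sum(M[i][i] for i in range(len(M))) / sum(map(sum, M))
# degrees((2, 2, 2), lambda M: -cen(M), accuracy) enumerates 6**3 = 216
# matrices, matching Eq. (21): each class contributes (3*4)/2! = 6.
```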

Furthermore, it is certainly impossible to compare the measures for every possible number of classes. In view of this observation, we merely report the exhaustive results on some 3-class problems and try to draw conclusions about problems with larger numbers of samples and classes from the observations on the reported problems. In the statistical experiments, the problems were generated as follows. The number of samples of each class $cl_i$ ($i = 1, 2, 3$) was generated randomly; for controlling the scale of each problem, the number of samples of each class ranged from 2 to 10. We generated 12 such 3-class problems. Each generated problem is denoted as $i$-$j$-$k$, where $i$ is the number of samples of class $cl_1$, $j$ the number of samples of class $cl_2$, and $k$ the number of samples of class $cl_3$; for example, the problem 2-4-3 shown in Table 5 has 2, 4 and 3 samples in the three classes. After generating the 12 problems, we exhaustively enumerated all possible confusion matrices of each problem; that is, for a given problem, the number of samples of one class classified to each of the classes changed from 0 to the maximal number. Take problem 2-4-3 as an example: the two samples of class $cl_1$ may both be classified to class $cl_1$, $cl_2$ or $cl_3$, or one sample to class $cl_1$ and the other to class $cl_2$ or $cl_3$, etc. Tables 5 and 6 show the statistical results of consistency and discriminancy of confusion entropy against the other two measures on the 12 synthetic data sets.

Table 5
Consistency comparison of different pairs of measures on the 12 problems.

Prob.    cen-acc   cen-rci
2-4-3    0.82      0.6057
2-5-3    0.7957    0.6108
2-6-3    0.7741    0.6144
3-4-5    0.8065    0.6221
3-5-5    0.8015    0.6256
3-5-7    0.783     0.6295
6-3-6    0.7892    0.63
6-4-5    0.8003    0.6305
4-8-4    0.776     0.6311
4-8-6    0.7849    0.6343
4-10-5   0.7619    0.6348
6-5-8    0.7916    0.6355

Table 6
Discriminancy comparison of different pairs of measures on the 12 problems.

Prob.    cen-acc    cen-rci
2-4-3    246.824    19.5
2-5-3    409.582    26.663
2-6-3    625.355    27.36
3-4-5    1202.49    33.123
3-5-5    1120.832   32.297
3-5-7    2475.681   42.837
6-3-6    2384.607   53.525
6-4-5    2258.042   36.664
4-8-4    2869.262   45.695
4-8-6    5069.248   41.207
4-10-5   6844.032   46.681
6-5-8    5812.044   39.408

From the results, we have the following observations:

(1) The measure of confusion entropy remains relatively consistent with accuracy and RCI: higher accuracy or RCI tends to correspond to lower confusion entropy.
(2) The measure of confusion entropy is more discriminating for evaluating classifiers than the other two measures.
(3) As the number of samples grows, the measure of confusion entropy tends to become even more discriminating than the other two measures.

The first two observations indicate that the measure of confusion entropy is statistically superior to the other two measures for the 3-class problems in terms of consistency and discriminancy. The last observation implies that as the number of classes grows from three to larger numbers, the measure of confusion entropy will become even more discriminating; in view of this, we do not report simulation results for problems with larger numbers of classes.

5. Experiments on some real data sets

In this section, we further compare the measure of confusion entropy with accuracy on some real data sets. In real applications, validation techniques should be exploited to evaluate different classification models. For a problem with both training and test data sets, a classifier induced from the training data can be tested on the independent test data; however, few problems supply appropriate independent test data sets for validating the constructed classifiers, so cross validation methods are usually employed when only training data sets are available. There are many research works on cross validation techniques (Bengio & Grandvalet, 2004; Cawley & Talbot, 2003; Kohavi, 1995); k-fold cross validation and leave-one-out cross validation are both pervasively used. In this paper, we employed k-fold cross validation. In the experiments, we ran different classification methods on the training data sets of different problems with different fold numbers. The aim of the experiments is to compare the measures from different aspects, not to compare different cross validation techniques. We also ran the classification methods on the training data sets and then tested the induced classifiers on the test data sets. We used the measures of accuracy and confusion entropy to evaluate the classifiers induced by five popular classification methods implemented in the Weka 3.5.7 package (Witten & Frank, 2005): 'J48' is the Java version of C4.5 (Quinlan, 1986, 1993), 'RF' is RandomForest, 'SC' is SimpleCart, 'NB' is NaiveBayes, and 'RBF' is RBFNetwork. We report here the experimental results of the five methods on 13 multi-class problems from the UCI machine learning repository, each of which has both training and test data sets. A detailed description of the data sets is given in Table 7, where 'Tr-N' is the number of training samples, 'Te-N' the number of test samples, 'Attributes' the number of condition attributes, and 'Classes' the number of classes. For the problem 'horse', some attributes are removed according to the original description of the data set, and attribute 'Outcome' is taken as its class attribute. The accuracies under different fold numbers and the accuracies of the classifiers induced from the training data sets and tested on the test data sets, denoted as Tr/Te, are listed in Tables 8–12.
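The paper's experiments use Weka; as a rough illustration of the same protocol, the sketch below uses scikit-learn stand-ins for some of the learners and evaluates each fold setting with both accuracy and the cen sketch from Section 2 (the data set, the chosen estimators and all names here are our own substitutions, not the paper's setup):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in for a UCI training set
methods = {"J48-like": DecisionTreeClassifier(random_state=0),
           "RF": RandomForestClassifier(random_state=0),
           "NB": GaussianNB()}
for folds in (2, 3, 5, 10):
    for name, clf in methods.items():
        pred = cross_val_predict(clf, X, y, cv=folds)
        M = confusion_matrix(y, pred).tolist()
        print(folds, name,
              round(accuracy_score(y, pred), 4), round(cen(M), 4))
```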

Table 7
The 13 multi-class problems.

Data sets   Tr-N    Te-N    Attributes   Classes
Anneal      798     100     38           6
Hayes       132     28      5            3
Horse       300     68      21           3
Image       210     2100    19           7
Pen         7494    3498    16           10
Soybean     307     376     35           19
Landsat     4435    2000    36           6
Shuttle     43500   14500   9            7
allbp       2800    972     29           3
allhyper    2800    152     29           5
allhypo     2800    972     29           5
allrep      2800    972     29           4
ann         3772    3428    21           3

Table 8
Accuracy of J48 on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.8822   0.8885   0.9148   0.9261   0.95
Hayes      0.6742   0.7121   0.7045   0.7045   0.9286
Horse      0.6589   0.6187   0.6421   0.6455   0.6567
Image      0.881    0.8429   0.881    0.8905   0.91
Pen        0.9513   0.9565   0.9584   0.9617   0.9205
Soybean    0.7752   0.8371   0.8502   0.8339   0.867
Landsat    0.8546   0.8539   0.8627   0.8616   0.8535
Shuttle    0.9994   0.9995   0.9995   0.9997   0.9995
allbp      0.9668   0.9671   0.9725   0.9739   0.9784
allhyper   0.9868   0.9864   0.9868   0.9861   1
allhypo    0.9936   0.9936   0.9939   0.9936   0.9949
allrep     0.9896   0.9871   0.9918   0.9926   0.9907
ann        0.9966   0.9979   0.9976   0.9971   0.9939

Table 9
Accuracy of RandomForest on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.9323   0.9424   0.9424   0.9524   0.98
Hayes      0.7348   0.7879   0.75     0.7424   0.9286
Horse      0.6355   0.6622   0.6622   0.6722   0.7015
Image      0.8905   0.8524   0.9143   0.8952   0.9457
Pen        0.9823   0.9843   0.9853   0.9873   0.9577
Soybean    0.7948   0.8534   0.8632   0.8697   0.9229
Landsat    0.8938   0.8992   0.9017   0.9033   0.9015
Shuttle    0.9996   0.9997   0.9997   0.9998   0.9988
allbp      0.9689   0.97     0.9721   0.9736   0.9753
allhyper   0.9857   0.9825   0.9832   0.9854   0.9934
allhypo    0.9846   0.9846   0.9893   0.99     0.9959
allrep     0.9789   0.9818   0.9839   0.9854   0.9856
ann        0.9913   0.9952   0.9963   0.9936   0.9921

Table 10
Accuracy of SimpleCart on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.8922   0.9085   0.9123   0.9198   0.94
Hayes      0.8485   0.8333   0.8258   0.8182   0.8571
Horse      0.6522   0.6254   0.689    0.679    0.7164
Image      0.8714   0.8333   0.881    0.8714   0.8667
Pen        0.9389   0.9537   0.9589   0.9598   0.9168
Soybean    0.8013   0.8469   0.8665   0.8665   0.891
Landsat    0.8514   0.8564   0.8629   0.8602   0.859
Shuttle    0.9993   0.9993   0.9995   0.9994   0.9998
allbp      0.9636   0.9725   0.9714   0.9725   0.9733
allhyper   0.9864   0.9864   0.9868   0.9864   1
allhypo    0.9921   0.9929   0.9939   0.995    0.9969
allrep     0.9896   0.9904   0.9921   0.9921   0.9897
ann        0.9966   0.9976   0.9976   0.9976   0.9936

Table 11
Accuracy of NaiveBayes on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.787    0.792    0.7907   0.7932   0.79
Hayes      0.8409   0.8409   0.8409   0.8182   0.8929
Horse      0.6689   0.679    0.679    0.679    0.6866
Image      0.781    0.7571   0.7762   0.7762   0.801
Pen        0.879    0.8804   0.8795   0.8798   0.8213
Soybean    0.8632   0.886    0.8958   0.899    0.883
Landsat    0.795    0.7966   0.7948   0.795    0.796
Shuttle    0.919    0.9126   0.9129   0.9163   0.9221
allbp      0.9439   0.9425   0.9407   0.9389   0.9403
allhyper   0.9643   0.9571   0.9586   0.9586   0.9474
allhypo    0.9489   0.95     0.9525   0.9507   0.965
allrep     0.9475   0.9443   0.9414   0.94     0.9424
ann        0.9563   0.9571   0.9541   0.9552   0.9498

Table 12
Accuracy of RBFNetwork on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.896    0.9073   0.9298   0.9223   0.89
Hayes      0.7121   0.7803   0.7879   0.7652   0.9643
Horse      0.6656   0.6656   0.6656   0.6823   0.7164
Image      0.281    0.2714   0.2714   0.281    0.291
Pen        0.9514   0.9568   0.9582   0.9588   0.9174
Soybean    0.8143   0.8143   0.8436   0.8371   0.8431
Landsat    0.8446   0.846    0.8485   0.8492   0.8465
Shuttle    0.9803   0.9808   0.9827   0.9804   0.9817
allbp      0.9507   0.9582   0.9571   0.9546   0.9619
allhyper   0.9718   0.9736   0.9764   0.9771   0.9737
allhypo    0.95     0.9446   0.9511   0.9496   0.9588
allrep     0.9704   0.9704   0.9696   0.9718   0.9619
ann        0.9571   0.9586   0.9618   0.9613   0.9711

In the tables, '2 fs', '3 fs', '5 fs' and '10 fs' indicate the accuracies under 2-, 3-, 5- and 10-fold cross validation, and 'Tr/Te' indicates the Tr/Te accuracies of the classifiers. The confusion entropies of the different methods with different fold numbers and the confusion entropies of Tr/Te are listed in Tables 13–17, with the columns labeled in the same way. The average Tr/Te accuracy of the classifiers induced by J48 is 0.9264, and the average accuracies of the classifiers induced by RandomForest, SimpleCart, NaiveBayes and RBFNetwork are 0.9445, 0.9231, 0.8721 and 0.8675, respectively. From these results, we can roughly partition the five methods into three groups: RandomForest ranks the highest, followed by J48 and SimpleCart, while NaiveBayes and RBFNetwork form the third group.

Table 13
Confusion entropy of J48 on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1557   0.1546   0.1258   0.1122   0.0743
Hayes      0.5173   0.46     0.4691   0.4727   0.1338
Horse      0.4227   0.5605   0.5322   0.5114   0.4432
Image      0.1592   0.1942   0.163    0.1577   0.1293
Pen        0.0864   0.0785   0.0757   0.07     0.1127
Soybean    0.1713   0.1307   0.1119   0.1121   0.0897
Landsat    0.2243   0.2162   0.2132   0.2138   0.2164
Shuttle    0.0016   0.0015   0.0014   0.0011   0.0009
allbp      0.0747   0.0785   0.0697   0.0679   0.0578
allhyper   0.0247   0.0263   0.0263   0.0276   0
allhypo    0.0139   0.0149   0.0138   0.0137   0.013
allrep     0.0198   0.025    0.0183   0.0153   0.0137
ann        0.0141   0.0095   0.0102   0.0122   0.0231

Table 14
Confusion entropy of RandomForest on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1001   0.0897   0.0905   0.0776   0.0352
Hayes      0.4398   0.3931   0.4193   0.4266   0.1341
Horse      0.504    0.4911   0.5073   0.4675   0.4736
Image      0.1404   0.1825   0.1197   0.133    0.0866
Pen        0.0353   0.0317   0.0299   0.0261   0.064
Soybean    0.1605   0.1136   0.1039   0.0971   0.0657
Landsat    0.1671   0.161    0.1561   0.1545   0.1528
Shuttle    0.0009   0.0008   0.0005   0.0004   0.0034
allbp      0.0702   0.07     0.0644   0.063    0.0594
allhyper   0.0264   0.0307   0.0293   0.0263   0.009
allhypo    0.0339   0.0339   0.0249   0.0239   0.011
allrep     0.0392   0.0371   0.0318   0.0309   0.0284
ann        0.0311   0.0196   0.0149   0.0232   0.0258

Table 15
Confusion entropy of SimpleCart on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1457   0.1318   0.1301   0.1222   0.0912
Hayes      0.2533   0.2592   0.2732   0.2807   0.2541
Horse      0.5597   0.5693   0.5083   0.4984   0.4433
Image      0.1723   0.1997   0.1531   0.1658   0.1435
Pen        0.1059   0.082    0.0762   0.074    0.1244
Soybean    0.1479   0.1124   0.0952   0.0989   0.0746
Landsat    0.2228   0.2178   0.2073   0.2117   0.1996
Shuttle    0.002    0.002    0.0013   0.0016   0.0005
allbp      0.083    0.0658   0.0725   0.0677   0.0663
allhyper   0.0266   0.0271   0.0265   0.027    0
allhypo    0.017    0.0156   0.0135   0.0107   0.0081
allrep     0.0208   0.0231   0.0186   0.0191   0.0157
ann        0.0134   0.0098   0.0102   0.0098   0.0218

Table 16
Confusion entropy of NaiveBayes on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.211    0.2033   0.2055   0.2025   0.1928
Hayes      0.3145   0.2575   0.2575   0.2807   0.169
Horse      0.5478   0.5283   0.5205   0.518    0.5031
Image      0.213    0.2226   0.2003   0.2048   0.1709
Pen        0.1442   0.1426   0.143    0.1427   0.1685
Soybean    0.0978   0.0842   0.0788   0.0758   0.0878
Landsat    0.2563   0.2551   0.2563   0.255    0.2575
Shuttle    0.0881   0.091    0.0963   0.0878   0.0831
allbp      0.1163   0.1186   0.122    0.124    0.1079
allhyper   0.0514   0.0595   0.0573   0.0567   0.0484
allhypo    0.0698   0.0671   0.0652   0.0656   0.0541
allrep     0.0857   0.0888   0.0919   0.0941   0.0891
ann        0.0922   0.0936   0.0958   0.0934   0.1032

Table 17
Confusion entropy of RBFNetwork on the 13 problems.

Datasets   2 fs     3 fs     5 fs     10 fs    Tr/Te
Anneal     0.1393   0.1285   0.1034   0.1078   0.1027
Hayes      0.465    0.3724   0.3386   0.3872   0.0849
Horse      0.5444   0.5563   0.5421   0.5334   0.4113
Image      0.4734   0.4921   0.539    0.5406   0.4004
Pen        0.0738   0.0678   0.0637   0.0638   0.0955
Soybean    0.1474   0.1435   0.1284   0.1334   0.1177
Landsat    0.2087   0.2089   0.2059   0.2057   0.2091
Shuttle    0.0355   0.0348   0.0324   0.0355   0.0328
allbp      0.104    0.0933   0.0962   0.0988   0.0819
allhyper   0.0479   0.0442   0.0412   0.0405   0.0298
allhypo    0.0717   0.0784   0.0704   0.0721   0.0618
allrep     0.0556   0.051    0.057    0.0544   0.0633
ann        0.0924   0.093    0.0863   0.0889   0.0746

The average Tr/Te confusion entropy of the classifiers induced by J48 is 0.1006, and the average confusion entropies of the classifiers induced by RandomForest, SimpleCart, NaiveBayes and RBFNetwork are 0.0884, 0.1110, 0.1566 and 0.1358, respectively. By this measure we can roughly partition the methods into four or five groups, with RandomForest again ranking the highest of all the methods. The Tr/Te results thus show that the measure of confusion entropy is more discriminating than accuracy for evaluating classifiers.

From the cross validation results, it is hard at first sight to determine which measure is better for evaluating classifiers. The majority of the overall results show that higher accuracy tends to correspond to lower confusion entropy, in accordance with the relation between accuracy and confusion entropy discussed in Section 2. Some of the results exactly showing this observation are pictured in Fig. 1.


Fig. 1. Higher (or lower) accuracy corresponds to lower (or higher) confusion entropy.

In the figure, the left-side vertical axis shows accuracy while the right-side vertical axis shows confusion entropy; the line decorated with squares exhibits the varying trend of accuracy under different fold numbers, and the line with triangles exhibits the varying trend of confusion entropy. The results also contain some cases exhibiting the difference in discriminating power between the measures of accuracy and confusion entropy; some of these cases are pictured in Figs. 2 and 3. Fig. 2 shows one case in which the confusion entropy remains unchanged whereas the accuracy changes when the fold number changes; such a case indicates that the measure of confusion entropy is incapable of differentiating the induced classifiers while the measure of accuracy can tell them apart. Fig. 3 shows some cases opposite to that of Fig. 2: the accuracy remains the same whereas the confusion entropy changes when the fold number changes, which means the measure of accuracy is incapable of differentiating these induced classifiers while the measure of confusion entropy can tell them apart. The results further contain some cases in which the two measures change in the same direction when the fold number changes; such cases are shown in Fig. 4.

Fig. 2. Accuracy changes while confusion entropy remains unchanged.

The cases shown in Fig. 4 indicate that the increase of accuracy does not necessarily result in the decrease of confusion entropy; in other words, the improvement of accuracy does not always result in the improvement of confusion entropy, or vice versa. Finally, Fig. 5 shows cases in which both the accuracy and the confusion entropy remain unchanged under different numbers of folds, implying that no improvement is made on either measure when the fold number changes. Such cases can be viewed as trivial instances of the cases shown in Fig. 1; in fact, they make no contribution to the comparison of the two measures, although they are appropriate for comparing the results of cross validation with different fold numbers to show whether one fold number is superior to the others.

Among all the 65 cases, there are 38, 1, 16, 10 and 3 cases corresponding to Figs. 1–5, respectively; some cases are not shown simply for space considerations. The cases pictured in Fig. 1 show that in over 50% of the cases the measures of accuracy and confusion entropy make the same or similar conclusions about the confusion matrices. The cases shown in Figs. 2 and 3 imply that the measure of confusion entropy is more discriminating than accuracy. The results of Fig. 4 show that in about 15% of the cases the improvement of accuracy does not result in the improvement of confusion entropy. In short, these results show that the measure of confusion entropy is more precise than accuracy for evaluating classifiers.

As aforementioned, it is crucial in some real applications to unfold subtle differences between classifications when the accuracies turn out to be the same or very similar. Among the reported data sets, allbp, allhyper, allhypo, allrep and ann are all data sets about thyroid disease. For clarity, the results on these data sets are shown in Figs. 6 and 7: Fig. 6 shows the 10-fold cross validation results of the five methods on the five training data sets, and Fig. 7 shows the Tr/Te results. As one may notice, except for NB, the four other methods make very similar classifications on each of the five data sets, and the measure of confusion entropy is clearly more capable of discriminating the induced classifiers than the measure of accuracy.

From Figs. 1–5 we may also notice that, for a given problem, the accuracies of the classification methods do not always increase or decrease monotonically when the fold number changes from 2 to 10; similarly, the confusion entropies do not always decrease or increase monotonically. For a given classification method, the accuracies on different data sets likewise do not always increase or decrease when the fold number changes.


Fig. 3. Accuracy remains unchanged while confusion entropy changes.

Fig. 4. Accuracy and confusion entropy change in the same direction.

Fig. 5. Accuracy and confusion entropy remain unchanged.


Fig. 6. The 10-fold cross validation results of five classification methods on the five data sets.

Fig. 7. The Tr/Te results of five classification methods on the five data sets.

A similar observation holds for the measure of confusion entropy. These observations imply that a different fold number may select a different classification model for a given problem. No conclusion can be drawn on how to determine the number of folds for different problems; this may need further experimental and theoretical effort. Research on cross validation techniques can be found in many works, such as Kohavi (1995), Cawley and Talbot (2003) and Bengio and Grandvalet (2004).

6. Conclusions

In this paper, we propose a new measure based upon the concept of entropy for evaluating classifier performances.

By exploiting the misclassification information of confusion matrices, the measure evaluates the confusion level of the class distribution of misclassified samples. Both theoretical analysis and statistical results show that the proposed measure is more discriminating than accuracy and RCI while remaining relatively consistent with the two measures; moreover, it is more capable of measuring how well the samples of different classes have been separated from each other. Hence the proposed measure is more precise than the two measures and can substitute for them in evaluating classifiers in classification applications. The results on some benchmark data sets from the UCI machine learning repository further confirm the theoretical analysis and statistical results and demonstrate that the proposed measure is capable of evaluating classifier performances in real applications.


Acknowledgments

This work was supported by the Science Foundation of Jilin Province Grant No. 20040529, the National 863 High Technology Research and Development Program of China Grant No. 2009AA01Z152, and the National Natural Science Foundation of China Grants No. 60703013 and 10978011.

References

Abramson, N. (1963). Information theory and coding. New York: McGraw-Hill.
Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5, 1089–1105.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145–1159.
Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36, 2585–2592.
Edwards, E., Metz, C., & Nishikawa, R. (2005). The hypervolume under the ROC hypersurface of 'near-guessing' and 'near-perfect' observers in N-class classification tasks. IEEE Transactions on Medical Imaging, 24, 293–299.
Everson, R. M., & Fieldsend, J. E. (2006). Multi-class ROC analysis from a multi-objective optimization perspective. Pattern Recognition Letters, 27, 918–927.
Ferri, C., Hernandez-Orallo, J., & Salido, M. (2003). Volume under the ROC surface for multi-class problems. In Proceedings of the 14th European conference on machine learning (pp. 108–120).
Freitas, C. O. A., de Carvalho, J. M., & Oliveira, J. J., Jr., et al. (2007). Confusion matrix disagreement for multiple classifiers. In CIARP 2007, LNCS 4756 (pp. 387–396).
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45, 171–186.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17, 299–310.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th international joint conference on artificial intelligence (pp. 1137–1143). Morgan Kaufmann.
Kreuz, T., Haas, J. S., Morelli, A., Abarbanel, H. D. I., & Politi, A. (2007). Measuring spike train synchrony. Journal of Neuroscience Methods, 165, 151–161.
Landgrebe, T., & Duin, R. (2006). A simplified extension of the area under the ROC to the multiclass domain. In Proceedings of the 17th annual symposium of the Pattern Recognition Association of South Africa.
Landgrebe, T. C. W., & Duin, R. P. W. (2008). Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 810–822.
Metz, C. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283–298.
Provost, F., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the third international conference on knowledge discovery and data mining (pp. 43–48). Menlo Park, CA: AAAI Press.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 3, 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Sindhwani, V., Bhattacharge, P., & Rakshit, S. (2001). Information theoretic feature crediting in multiclass support vector machines. In First SIAM international conference on data mining (ICDM'01), Chicago, IL, April 5–7.
Srinivasan, A. (1999). Note on the location of optimal classifiers in N-dimensional ROC space. Technical Report PRG-TR-2-99, Computing Laboratory, Oxford University.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 25, 631–643.
Swets, J. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
van Son, R. J. J. H. (1994). A method to quantify the error distribution in confusion matrices. In Proceedings, Institute of Phonetic Sciences, University of Amsterdam (Vol. 18, pp. 41–63).
van Son, R. J. J. H. (1995). The relation between the error distribution and the error rate in identification experiments. In Proceedings, Institute of Phonetic Sciences, University of Amsterdam (Vol. 19, pp. 71–82).
Victor, J., & Purpura, K. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. Journal of Neurophysiology, 76, 1310–1326.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.