PERFORMANCE MEASURES FOR CLASSIFICATION SYSTEMS WITH REJECTION
Filipe Condessa, José Bioucas-Dias, and Jelena Kovačević
Abstract—Classification systems with rejection are of paramount importance to applications where misclassifications and their effects are critical. We introduce three measures for the performance evaluation of classification systems with rejection: the nonrejected accuracy measures the ability to not reject correctly classified samples; the classification quality measures the ability to make correct joint classification/rejection decisions; and the rejection quality measures the ability to reject incorrectly classified samples rather than correctly classified ones. We formulate the three measures in different frameworks and derive their properties and bounds. We show the applicability of the measures both in the performance analysis of classification systems with rejection and in the design of such systems, through loss functions derived from the measures.
The authors gratefully acknowledge support from the NSF through award 1017278 and the CMU CIT Infrastructure Award. Work partially supported by grant SFRH/BD/51632/2011 from Fundação para a Ciência e a Tecnologia and the CMU-Portugal (ICTI) program.

Filipe Condessa is with Instituto de Telecomunicações and the Dept. of Electrical and Computer Engineering at Instituto Superior Técnico, University of Lisbon, Portugal, and with the Center for Bioimage Informatics and the Dept. of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, [email protected]. José Bioucas-Dias is with Instituto de Telecomunicações and the Dept. of Electrical and Computer Engineering at Instituto Superior Técnico, University of Lisbon, Portugal, [email protected]. Jelena Kovačević is with the Dept. of Electrical and Computer Engineering, Center for Bioimage Informatics, and the Dept. of Biomedical Engineering at Carnegie Mellon University, Pittsburgh, PA, [email protected].

I. INTRODUCTION

Automated classification systems are essential to a large number of applications. In classification systems where the consequences of misclassifications are critical and where the option not to classify is a viable course of action, introducing a rejection option is of paramount importance. This includes situations where the need to correctly classify is greater than the need to classify all samples — where it is more advantageous to withhold classifying a sample than to risk a misclassification (e.g., in automated medical diagnosis [1], [2]), and where
Fig. 1. Example of the use of rejection in the classification of histopathology images. Stained teratoma tissue (left), associated ground truth (center), and classification with rejection (right). Rejection is represented by the shading on the image, with darker samples being rejected before lighter samples. The selection of the rejected fraction and the comparison of performance are nontrivial.
samples may be of no interest to the application (e.g., in image retrieval [3]). Furthermore, a classifier with rejection can cope with unknown information, reducing the threat posed by the existence of unknown samples or nonideal training sets that inject noise into the classifier.

Classification with rejection was first analyzed in [4], where Chow's rule for the optimum error-reject trade-off was presented. Based on the knowledge of the posterior probabilities, Chow's rule determines a threshold for rejection, a rejection rule, such that the classification risk is minimized. Multiple other designs for incorporating rejection into classification exist. In binary classification, the reject option can be embedded in the classifier; this can be achieved through risk minimization with hinge loss functions [5], [6], [7], one example being the SVM with embedded reject option [8], [9], [10]. These designs can be extended to a rejection framework for multilabel problems [11].

There is no standard measure for assessing the performance of a classification system with rejection. Accuracy-rejection curves [12], [8], [13], [14] and their variants based on the analysis of the F1 score [15], [11], albeit popular in practical applications of classification with rejection, have conceptual flaws. Obtaining enough points to trace a curve might not be feasible for classifiers with
embedded reject option or for classification systems that combine classification with rejection with contextual information [2], where the cost of computing multiple points of the accuracy-rejection curve is prohibitive. This means that, in practice, accuracy-rejection curves and F1-rejection curves are not able to describe the behavior of the classification system with rejection.

In [16], a different approach to the performance analysis of classification systems with rejection is taken: a 3D ROC (receiver operating characteristic) plot of a 2D ROC surface is obtained by decomposing the false positive rate into the false positive rate for outliers belonging to known classes and the false positive rate for outliers belonging to unknown classes, with the VUC (volume under the curve) used as the performance measure. The use of ROC curves for performance analysis suffers from the same problems associated with accuracy-rejection curves.

There is no simple way to compare the performance of two classification systems with rejection when they are working at different rejection ratios. The comparison between the accuracy gains obtained by rejecting a larger fraction of the data and the losses associated with classifying a smaller fraction of the data is not clear. We thus propose a set of three measures to evaluate the performance of classification systems with rejection with regard to the rejected fraction, based on the same knowledge needed to compute the accuracy of a classification system without rejection:
• Nonrejected accuracy A provides insight into the evolution of the accuracy obtained by using rejection, that is, it allows for the analysis of the accuracy gains obtained by the use of rejection.
• Classification quality Q relates the accuracy of the nonrejected samples and the accuracy of the rejected samples, bringing insight into the behavior of the classification-rejection system as a whole, that is, it allows for the analysis of the overall correctness of the classification system with rejection.
• Rejection quality φ is an unbounded measure that provides insight into how well the rejection works in rejecting incorrect classifications, allowing for a fast assessment of whether including an option to reject is useful for the improvement of accuracy, that is, it allows for
the analysis of the ability to concentrate incorrectly classified samples in the set of rejected samples.

We reformulate the three measures such that their use in practical applications is possible, allowing performance analysis of classifiers with embedded reject option, where the nonrejected label (the label a sample would have received had it not been rejected) and the corresponding accuracy might not be available. Finally, we derive a loss function from the measures, design a classifier with rejection, and illustrate the potential of the measures in performance assessment and classifier design.

The paper is structured as follows. In Section II, we formulate the problem of classification with rejection. In Section III, we present the performance measures. In Section IV, we show that the performance measures are sufficient to completely describe the problem of classification with rejection. In Section V, we reformulate the measures for use in practical applications. In Section VI, we derive a loss function from the measures that allows for the design of classification systems with rejection. In Section VII, we illustrate the potential of the measures by applying them to real data and by designing a classifier with rejection. Section VIII concludes the paper.

II. CLASSIFICATION SYSTEM WITH REJECTION

A classification system with rejection can be seen (Fig. 2) as the coupling of a classification system C, which maps n d-dimensional feature vectors into n labels, C(x): R^{d×n} → {1, . . . , K}^n, with a rejector R, which maps the feature vectors and the labels into a binary vector, R(x, ŷ): R^{d×n} × {1, . . . , K}^n → {0, 1}^n. The output of the rejector is a binary vector b representing the decision whether to classify a sample or to reject it.
Fig. 2. General diagram of a classification system with rejection: features x are fed to the classifier C, which produces the labels ŷ; the rejector R maps x and ŷ to the decision vector b.
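As an illustration of this coupling, the following sketch (our code, not the authors'; the maximum-posterior-probability threshold is just one possible rejector, and all names are ours) shows a classifier and a rejector jointly producing the pair (ŷ, b) from class probabilities.

```python
import numpy as np

def classify_with_rejection(probs, threshold=0.9):
    """Couple a classifier C with a simple confidence-threshold rejector R.

    probs : (n, K) array of class probabilities produced by some classifier C.
    Returns (y_hat, b): labels ŷ in {1, ..., K} and a binary decision vector b,
    with b[i] = 1 meaning "reject sample i" (the convention used for r later on).
    """
    y_hat = probs.argmax(axis=1) + 1            # labels ŷ = C(x)
    confidence = probs.max(axis=1)              # rejector input derived from (x, ŷ)
    b = (confidence < threshold).astype(int)    # decision vector b = R(x, ŷ)
    return y_hat, b
```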
We use the general formulation to introduce the problem and derive bounds; the binary classification formulation to prove that the behavior of the classification system with rejection can be perfectly reconstructed from the proposed measures; and the probabilistic formulation to provide an intuition into the measures.

A. Measure description

We now present a formulation of the rejection as a very general problem, constructing a general framework that will allow us to draw important properties for the measures. Let a be a binary n-dimensional accuracy vector such that a_i measures whether the ith sample is classified correctly, and c an n-dimensional confidence vector such that c_i measures the confidence that the ith classification is correct. Rejection consists of a set of n binary choices to output the classification result (with a corresponding accuracy vector a) or withhold it, based on the confidence vector c. Let Λ_k be the support containing the largest k elements of c, the subvector of the samples with highest classification confidence — thus not rejected, and Ω_{n−k} the support containing the smallest n − k elements of c, the subset of the samples with lowest classification confidence — thus rejected, such that c_i ≥ c_j for all i ∈ Λ_k, j ∈ Ω_{n−k}.¹ Our goal is to separate the accuracy vector a into two subvectors (a_Λ and a_Ω), based on the confidence vector c, such that all incorrect classifications are in the a_Ω subvector and all correct classifications are in the a_Λ subvector.

This formulation is very general and allows us to find the performance measures associated with the rejection to be analogous to objective functions that evaluate the performance of the separation of a into two disjoint subvectors induced by the estimate c. The rejection problem is an instance of the general problem if we consider the supports of the subvectors as the sets of indices of rejected and nonrejected samples. Our goal is to separate a into two subvectors (a_Λ and a_Ω), based on the estimate vector c, such that all the 0 values are in the a_Ω subvector and all the 1 values are in the a_Λ subvector.

We pose this generalized separation problem as an optimization problem in which we maximize a function α of a and Λ (as we define (Λ, Ω) such that Λ ∩ Ω = ∅ and Λ ∪ Ω = {1, . . . , n}, we can define the partition into two subvectors univocally by defining Λ),

    arg max_Λ α(a, Λ).   (1)

The design of the objective function α is such that our goal of maximizing the number of nonzero elements in a_Λ and of zero elements in a_Ω is promoted. The rejected fraction r can be represented as the ratio between the size of the support Ω and the size of Λ ∪ Ω,

    r = (n − k)/n.   (2)

We should note that, since Λ and Ω are disjoint supports,

    ‖a‖ = ‖a_Λ‖ + ‖a_Ω‖.   (3)

As we only work with norms of binary vectors, we point out that ‖a‖_0 = ‖a‖_1; for simplicity, we omit the subscript.

By presenting three objective functions that promote the goal of separation, and by showing the connection between the rejection problem and the generalized problem, we show the validity of the presented measures in the evaluation of the performance of the rejector.

¹We omit the subscript when not relevant. Unless clearly stated, Λ = Λ_k and Ω = Ω_{n−k}.

III. PERFORMANCE MEASURES

Nonrejected accuracy A: The number of nonzero elements in a_Λ is an obvious choice to promote the goal of separating zero and nonzero elements into two subvectors,

    α1(a, Λ) = ‖a_Λ‖.   (4)

The value of α1 is the number of samples in a classified correctly and not rejected. If normalized by the number of nonrejected samples k, we obtain the nonrejected accuracy A,

    A(r) = α1/k = α1/((1 − r)n) = ‖a_Λ‖/((1 − r)n),   (5)

with r given in (2). The optimization problem that arises from here is a nontrivial optimization problem with regard to the support Λ: if ‖a‖ > k (the number of 1s in a is larger than the size of the support Λ), multiple supports Λ are possible such that ‖a_Λ‖ = k. Furthermore, this problem depends on the number of nonzero elements of a and on the value of k. This means that the objective function will not give any insight into whether the value of
k is adequate to the task (i.e., k does not reflect ‖a‖, the total number of correct classifications), as the maximum value of the measure can be achieved for all values of k such that k < ‖a‖.

Classification quality Q: Concentrating the largest number of nonzero elements in a_Λ is equivalent to concentrating the largest number of zero elements in a_Ω. This leads us to the second objective function, one that measures both the number of nonzero elements in a_Λ and the number of zero elements in a_Ω,

    α2 = ‖a_Λ‖ + ‖1 − a_Ω‖.   (6)

The value of α2 is the number of correct decisions (i.e., reject a sample when it is an incorrectly classified sample, and do not reject a sample when it is a correctly classified sample). This objective function, denoted as the classification quality Q, measures the fraction of joint correct decisions the classification block and the rejection block make; in other words,

    Q(r) = α2/n = ‖a_Λ‖/n + ‖1 − a_Ω‖/n,   (7)

with r given in (2). This objective function yields its maximum value n if and only if c induces a perfect separation of 1s and 0s by Λ and Ω (meaning that no 0 value is present in a_Λ and no 1 value is present in a_Ω), which happens if and only if k = ‖a‖ (meaning that the length of the support Λ must be equal to the number of nonzero elements in a). This gives us insight into what the size of the support Λ should be. The rationale behind this approach is that correctly identifying correct classifications is as important as correctly identifying incorrect classifications; in other words, it is as important to have the nonzero elements in a_Λ as it is to have the zero elements in a_Ω.

Rejection quality φ: Another approach to evaluate the performance is to measure what is lost in the process, that is, how many nonzero vs. zero elements are present in a_Ω,

    ‖(1 − a)_Ω‖ / ‖a_Ω‖.

Since this is highly dependent on ‖a‖ (the overall number of correct classifications), we normalize by the ratio of zero to nonzero elements present in a,

    α3 = [‖(1 − a)_Ω‖ / ‖a_Ω‖] · [‖a‖ / ‖1 − a‖].   (8)

The value of α3 is the ratio between the number of correct rejections (rejected and incorrectly classified samples) and the number of incorrect rejections (rejected and correctly classified samples), normalized by the total ratio of incorrectly classified to correctly classified samples; in other words,

    φ(r) = α3 = [‖(1 − a)_Ω‖ / ‖a_Ω‖] · [A(0) / (1 − A(0))].   (9)

This objective function allows us to evaluate how ‖a_Λ‖/k compares to A(0) = ‖a‖/n, that is, how the concentration of nonzero elements changes: if α3 is larger than 1 the concentration increases, if α3 is equal to 1 the concentration remains the same, and if α3 is smaller than 1 the concentration decreases. It should be noted that, for the extreme cases where ‖a‖ = n or ‖a‖ = 0, the objective function is indeterminate (as ‖1 − a‖ or ‖a_Ω‖, respectively, is zero). When ‖a‖ = n, the classification block correctly classified all the samples and thus there is no need to reject, while when ‖a‖ = 0 none of the samples is correctly classified, meaning everything should be rejected. In either case, there is no need to compute the rejection quality. The rejection quality allows a fast evaluation of the performance of the classification system when an option to reject is added.
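The three measures follow directly from the accuracy vector a and the choice of rejected support Ω. A minimal sketch (our code, with our variable names) implementing (2), (5), (7), and (9), assuming at least one rejected and one nonrejected sample:

```python
import numpy as np

def rejection_measures(a, rejected):
    """Compute r, A(r), Q(r), and phi(r) from eqs. (2), (5), (7), (9).

    a        : binary accuracy vector; a[i] = 1 iff sample i is classified correctly.
    rejected : binary vector; rejected[i] = 1 iff sample i is rejected (support Omega).
    Assumes both supports are nonempty; degenerate cases (A(0) equal to 0 or 1,
    or no correctly classified rejected samples) yield nan/inf, as discussed above.
    """
    a = np.asarray(a, dtype=float)
    rejected = np.asarray(rejected, dtype=bool)
    n = a.size
    r = rejected.mean()                                        # rejected fraction, eq. (2)
    A = a[~rejected].mean()                                    # nonrejected accuracy, eq. (5)
    Q = (a[~rejected].sum() + (1 - a[rejected]).sum()) / n     # classification quality, eq. (7)
    A0 = a.mean()                                              # accuracy with no rejection, A(0)
    phi = ((1 - a[rejected]).sum() / a[rejected].sum()) * (A0 / (1 - A0))  # eq. (9)
    return r, A, Q, phi
```

For a rejector that rejects exactly the misclassified samples, this returns A = Q = 1 while φ diverges, consistent with the discussion above.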
A. Bounds for variation of measures

Our generalized framework now allows us to compute bounds on the variation of the three performance measures. As all three measures can be expressed as functions of the rejected fraction (or as objective functions expressed as functions of the lengths of the supports), we study the bounds for variations of the value of r. This allows us to correctly estimate the possible evolutions of the measures based on results obtained for different rejected fractions.

Fundamental identity: Let us consider Λ_k and Ω_k the supports obtained for a value of k (given k, define a support Λ of dimension k), and Λ_{k′} and Ω_{k′}
the supports obtained for a value of k′. We define the accuracy of the rejected samples as

    B(r) = ‖a_{Ω_{rn}}‖ / (rn),

where r denotes the rejected fraction. We can relate the accuracy of the nonrejected samples and the accuracy of the rejected samples at different rejected fractions r and r′ as

    ‖a_{Λ_k}‖ + ‖a_{Ω_k}‖ = ‖a_{Λ_{k′}}‖ + ‖a_{Ω_{k′}}‖,   (10)

with k = (1 − r)n and k′ = (1 − r′)n, that is,

    A(r)(1 − r) + B(r)r = A(r′)(1 − r′) + B(r′)r′,   (11)

where (10) follows from (3), and (11) from (10) and from the definitions of the nonrejected accuracy A(r) in (5) and of the rejected accuracy B(r). This means that, regardless of how the rejector separates the samples into rejected and nonrejected, the total number of correctly classified samples does not change: the overall accuracy does not change by rejecting.

We can bound B(r) by

    B(r′)r′ = B(r)r + δ(r′ − r),   (12)

with r′ ≥ r and 0 ≤ δ ≤ 1, leading to

    B(r)r ≤ B(r′)r′ ≤ B(r)r + (r′ − r).

If we consider the smallest r′ such that r′ > r, corresponding to rejecting one more sample, δ represents whether the newly rejected sample was incorrectly classified (δ = 0) or correctly classified (δ = 1).

Bound on the accuracy of the nonrejected samples A: By (11), we have

    A(r′) = A(r)(1 − r)/(1 − r′) + (B(r)r − B(r′)r′)/(1 − r′)
          = A(r)(1 − r)/(1 − r′) − δ(r′ − r)/(1 − r′).

Since 0 ≤ δ ≤ 1, we have that

    A(r)(1 − r)/(1 − r′) − (r′ − r)/(1 − r′) ≤ A(r′) ≤ A(r)(1 − r)/(1 − r′).   (13)

Bound on the classification quality Q: Using the definition of Q given in (7),

    Q(r) = ‖a_Λ‖/n + ‖1 − a_Ω‖/n = A(r)(1 − r) + (1 − B(r))r,

which allows us to express B(r) as

    B(r) = 1 − (Q(r) − A(r)(1 − r))/r.

This results in the following bound for Q(r′), for r′ ≥ r,

    Q(r) + (A(r′)(1 − r′) − A(r)(1 − r)) ≤ Q(r′) ≤ Q(r) + (r′ − r) + (A(r′)(1 − r′) − A(r)(1 − r)),

which, combined with the bound obtained for A(r′) in (13), gives

    Q(r) − (r′ − r) ≤ Q(r′) ≤ Q(r) + (r′ − r).   (14)

As the classification quality measures the number of correct decisions made, the absolute difference between the classification quality at different rejected fractions is at most the fraction of new decisions made, |r′ − r|.

Bound on the rejection quality φ: Using the definition of φ given in (8),

    φ(r) = [(1 − B(r))/B(r)] · [A(0)/(1 − A(0))],

which allows us to express B(r) as

    B(r) = [A(0)/(1 − A(0))] · 1/(φ(r) + A(0)/(1 − A(0))).

Let γ = A(0)/(1 − A(0)) denote the ratio of correctly to incorrectly classified samples in the entire set; then, using (12), we have

    φ(r′) = (φ(r) + γ) r′ / (r + δ(r′ − r)(φ(r) + γ)/γ) − γ.

Since 0 ≤ δ ≤ 1, we have that, for r′ ≥ r,

    (φ(r) + γ) r′ / (r + (r′ − r)(φ(r) + γ)/γ) − γ ≤ φ(r′) ≤ (φ(r) + γ) r′ / r − γ.   (15)

From (12) and (15) it is thus possible to bound the evolution of φ(r).
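The fundamental identity (11), the relation (12), and the bound (14) can be checked numerically for any confidence ordering; a small sketch (our code, with synthetic a and c chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = (rng.random(n) < 0.7).astype(float)       # accuracy vector, A(0) around 0.7
conf = rng.random(n)                          # any confidence ordering
order = np.argsort(conf)                      # rejected first: lowest confidence

def stats(m):                                 # m = number of rejected samples
    rej, keep = order[:m], order[m:]
    r = m / n
    A = a[keep].mean()
    B = a[rej].mean() if m else 0.0
    Q = (a[keep].sum() + (1 - a[rej]).sum()) / n
    return r, A, B, Q

A0 = a.mean()
for m in range(1, n - 1):
    r, A, B, Q = stats(m)
    r2, A2, B2, Q2 = stats(m + 1)             # reject one more sample: r' = r + 1/n
    delta = a[order[m]]                       # delta = 1 iff the newly rejected sample was correct
    assert np.isclose(A * (1 - r) + B * r, A0)              # fundamental identity (11)
    assert np.isclose(B2 * r2, B * r + delta * (r2 - r))    # relation (12)
    assert abs(Q2 - Q) <= (r2 - r) + 1e-12                  # bound (14)
```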
IV. SUFFICIENCY OF MEASURES

We can consider the classification system with rejection as two coupled classifiers by considering the rejector R to be a binary classifier that classifies ŷ, the classifications obtained from C, as rejected or not rejected. Ideally, R should classify all incorrectly classified samples as rejected and all correctly classified samples as not rejected.

In this binary classification formulation, the classification quality Q becomes the accuracy of the classifier R, the accuracy of the nonrejected samples becomes the precision (positive predictive value) of the classifier R, the rejection quality φ becomes the positive likelihood ratio (ratio between the true positive rate and the false positive rate) of the classifier R, and the rejected fraction r becomes the ratio between the number of samples classified as rejected and the total number of samples.

The binary classification formulation allows us to show that the triplet of measures (r, A(r), Q(r)) is sufficient to describe the classification system with rejection. This is done by relating the triplet to the confusion matrix of the binary classifier R. As knowledge of the confusion matrix of a binary classifier describes the classifier-rejector pair, if the triplet (r, A(r), Q(r)) allows the reconstruction of the confusion matrix of R, the behavior of the classification system with rejection is perfectly described, showing that the measures introduced are sufficient to describe the system.

Theorem 1. The set of measures (r, A(r), Q(r)) is sufficient to describe the behavior of the classifier-rejector pair.

Proof: Let us consider the following confusion matrix associated with R:

    n [ R00  R01
        R10  R11 ],

where n denotes the total number of samples, R00 the fraction of samples correctly classified and not rejected, R01 the fraction of samples incorrectly classified but not rejected, R10 the fraction of samples correctly classified but rejected, and R11 the fraction of samples incorrectly classified and rejected. Given that the n binary classifications classified n samples, the confusion matrix associated with R can be uniquely obtained from the following full-rank linear system:

    1 = R00 + R01 + R10 + R11,
    r = R10 + R11,
    Q(r) = R00 + R11,
    A(r)(1 − r) = R00.

Therefore, as the set of measures and the confusion matrix are related by a full-rank system, the set of measures (r, A(r), Q(r)) perfectly describes the binary classification that represents the rejection system.
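The proof is constructive: the four linear relations can be inverted in closed form, so the confusion matrix of R is recovered directly from the triplet. A sketch (our code, with our variable names):

```python
def rejector_confusion(r, A, Q):
    """Recover the confusion-matrix fractions of the binary rejector R from (r, A(r), Q(r)),
    using 1 = R00 + R01 + R10 + R11, r = R10 + R11, Q = R00 + R11, A*(1-r) = R00.
    """
    R00 = A * (1 - r)          # correctly classified, not rejected
    R11 = Q - R00              # incorrectly classified, rejected
    R10 = r - R11              # correctly classified, rejected
    R01 = 1 - r - R00          # incorrectly classified, not rejected
    return R00, R01, R10, R11
```

Any derived quantity of the rejector, such as its true and false positive rates, then follows from (R00, R01, R10, R11).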
V. EXPERIMENTAL CONSIDERATIONS

A. Computation of the measures without access to the accuracy of rejected samples

The computation of the classification quality Q and of the rejection quality φ requires knowledge of the classification accuracy of the rejected samples. In the real world this may not be feasible if rejection is mutually exclusive with classification (e.g., rejected samples are not classified, and thus cannot be used in the computation of the accuracy of the rejected fraction). We discuss two possible solutions to this problem.

Accuracy of the entire set is known: The measures can be reformulated such that their computation is based on the knowledge of the accuracy measured on the entire set (with no rejection) and of the nonrejected accuracy for a given rejected fraction. If the accuracy measured on the entire data set A(0) is known, we can reformulate the classification quality Q(r) as

    Q(r) = 2A(r)(1 − r) + r − A(0).   (16)

Accuracy of the entire set is unknown: We can also use the differential of the measures between different rejected fractions, thus sidestepping the requirement of knowing the accuracy of the entire set. This can only be applied to the classification quality, and allows only for the comparison of performance within the same classification instance. If the accuracy measured on the entire data set A(0) is unknown, but assumed constant, we can compare the performance of the same classifier by analyzing the differential of the classification quality. Let us consider two different operating points of the classification system with rejection: a nonrejected accuracy A(r′) with a rejected fraction r′, and a nonrejected accuracy A(r) with a
rejected fraction r. The differential can be obtained as

    ΔQ_{r,r′} = Q(r′) − Q(r).   (17)

Our goal is to find an alternative formulation of the classification quality such that the differential (17) is the same. By inspection of (16), it is clear that we can reformulate the classification quality as

    Q(r) = 2A(r)(1 − r) + r − A(0) = Q′(r) − A(0).

As the differential does not change, ΔQ_{r,r′} = ΔQ′_{r,r′}, and the computation of Q′(r) does not require knowledge of A(0), we can use Q′(r) instead of Q(r) (if the assumption that A(0) is constant holds) to compare the performance of the classification system with rejection at different operating points.

B. Determination of operating regions for classifiers with reject option

Whereas the use of a reject option tends to improve the accuracy of the classifier, it should not be active in all situations. One should not deteriorate the results obtained by a very good classifier by including a reject option that marginally increases the accuracy of the nonrejected samples at the expense of rejecting a large fraction of the samples. To this extent, we can define the operating region O of a classifier with reject option by requiring the classification quality to be greater than or equal to the accuracy with no rejection (A(0) = Q(0)):

    O = {r : Q(r) ≥ Q(0)}.

In the operating region O we are guaranteed not to do harm by using the reject option. It should be clear that the operating region depends on the combination of classifier and rejector.

Based on the concept of operating region for the reject option, we can define lower bounds for the nonrejected accuracy and the rejection quality. These are easily obtained by noting that, if Q(r) ≥ Q(0), then

    A(r) ≥ (A(0) − r/2) / (1 − r),   (18)

and

    φ(r) ≥ A(0) / (1 − A(0)).   (19)

Equation (19) illustrates the interplay between the performance of the classifier and the performance needed from the rejector: the effectiveness of a rejector must be higher when the performance of the classifier is lower.
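A sketch (our code) of how (16), (18), and (19) can be used in practice: given only the accuracy with no rejection A(0) and the nonrejected accuracies over a sweep of rejected fractions, compute Q(r) and flag the operating region.

```python
import numpy as np

def operating_region(r, A_r, A0):
    """r, A_r: arrays of rejected fractions and nonrejected accuracies A(r);
    A0: accuracy with no rejection. Uses Q(r) = 2 A(r)(1 - r) + r - A(0), eq. (16).
    """
    r, A_r = np.asarray(r, float), np.asarray(A_r, float)
    Q = 2 * A_r * (1 - r) + r - A0
    in_region = Q >= A0                       # O = {r : Q(r) >= Q(0)}, with Q(0) = A(0)
    # inside O, eqs. (18)-(19) guarantee:
    A_lower = (A0 - r / 2) / (1 - r)          # A(r) >= (A(0) - r/2)/(1 - r)
    phi_lower = A0 / (1 - A0)                 # phi(r) >= A(0)/(1 - A(0))
    return Q, in_region, A_lower, phi_lower
```

The maximum of Q over the sweep and the largest r still inside O are natural operating points to report.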
VI. DESIGN OF CLASSIFIERS WITH REJECTION THROUGH MAXIMIZATION OF THE CLASSIFICATION QUALITY

The design of classifiers by maximization of the accuracy, achieved by the minimization of a 0-1 loss function and leading to a maximum a posteriori classifier, can be extended with the concept of classification quality. Instead of maximizing the number of correctly classified samples, the maximization of the classification quality leads to the joint maximization of the number of correctly classified samples not rejected and of incorrectly classified samples rejected. Let L denote the loss function derived from Q,

    L(z_i, y_i, r_i) = 0, if y_i = z_i and r_i = 0;
                       1, if y_i = z_i and r_i = 1;
                       1, if y_i ≠ z_i and r_i = 0;
                       0, if y_i ≠ z_i and r_i = 1,   (20)

where z ∈ L^n denotes the true labeling, y ∈ L^n the assigned labeling, and r ∈ {0, 1}^n a binary rejection vector (r_i = 1 corresponding to rejection of the ith sample and r_i = 0 to no rejection). The minimization of the loss function (20) becomes

    arg min_{y ∈ L^n, r ∈ {0,1}^n} Σ_{z ∈ L^n} Σ_{i=1}^{n} L(z_i, y_i, r_i) p(z_i | x).   (21)

Let p(y|x) = [p(y_1|x) . . . p(y_n|x)]^T denote the probability vector associated with the labeling y; the minimization (21) can be reformulated as

    arg max_{y ∈ L^n, r ∈ {0,1}^n} 1^T p(y|x) + 1^T r − 2 p(y|x)^T r,   (22)

which equates to the maximization of the number of accurate samples and of the number of rejected samples, and the minimization of the number of accurate rejected samples. The problem (22) can be approximated by alternating

    y^{t+1} = arg max_{y ∈ L^n} (1 − 2r^t)^T p(y|x),
    r^{t+1} = arg max_{r ∈ {0,1}^n} (1 − 2 p(y^{t+1}|x))^T r.   (23)
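A minimal sketch of the alternating scheme (23) (our code, not the authors' implementation; it assumes the marginals p(y_i = k | x) are available as a matrix and uses 0-based labels):

```python
import numpy as np

def alternating_classify_reject(P, iters=10):
    """Approximate solution of (22) by the alternating updates (23).

    P : (n, K) array, P[i, k] = p(y_i = k | x), from any probabilistic classifier.
    Returns labels y (0-based here) and a rejection vector r (r[i] = 1 means reject).
    """
    n, K = P.shape
    r = np.zeros(n, dtype=int)
    for _ in range(iters):
        # y-step: maximize (1 - 2r)^T p(y|x); per sample this picks the most
        # probable label when r_i = 0 and the least probable one when r_i = 1
        # (the label of a rejected sample does not affect the final decision).
        y = np.where(r == 0, P.argmax(axis=1), P.argmin(axis=1))
        p_y = P[np.arange(n), y]
        # r-step: maximize (1 - 2 p(y|x))^T r; reject exactly the samples with p_y < 1/2
        r_new = (p_y < 0.5).astype(int)
        if np.array_equal(r_new, r):
            break
        r = r_new
    return y, r
```

Starting from r = 0, the y-step reduces to maximum a posteriori labeling and the r-step rejects the samples whose selected-label probability falls below 1/2.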
Fig. 3. Performance measures associated with the classification with rejection of the histopathology images in Fig. 1: evolution of the nonrejected accuracy and classification quality (left) and of the rejection quality (right) with the rejected fraction. Initial (no rejection) accuracy of 66.39%; maximum classification quality of 71.07%, corresponding to a rejected fraction of 27.18% and a nonrejected accuracy of 75.72%. Operating region active between 0% and 52.37% rejected fraction, with corresponding nonrejected accuracies of 66.39% and 84.41%, respectively.
VII. EXPERIMENTAL RESULTS

Real data — performance analysis: We apply the measures to evaluate the performance of classification systems with rejection on real data. To this end, we assess the performance of a classifier with rejection in histopathology image classification (Fig. 1). The classifier with rejection follows the approach proposed in [17]: supervised classification with context arises from the combination of LORSAL [18], to learn the class models, with SegSALSA [19], to enforce the context. The resulting classification is then ordered by degree of confidence, and the rejection results from the ordering of the samples according to that confidence. The performance measures, shown in Fig. 3, allow us to conclude not only that there are performance improvements achieved by combining rejection with classification, but also to determine adequate levels of rejection.
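The curves of Fig. 3 can be reproduced for any classifier that outputs a per-sample confidence; a sketch (our code) that sweeps the rejected fraction by rejecting the least-confident samples first:

```python
import numpy as np

def measure_curves(correct, confidence):
    """Sweep the rejected fraction by rejecting the least-confident samples first.

    correct    : binary vector, 1 iff the classification of a sample is correct.
    confidence : per-sample confidence used to order the rejection.
    Returns arrays (r, A, Q) for rejected fractions r = 0, 1/n, ..., (n-1)/n.
    """
    correct = np.asarray(correct, dtype=float)
    n = correct.size
    order = np.argsort(confidence)            # rejected first: lowest confidence
    rs, As, Qs = [], [], []
    for m in range(n):                        # m = number of rejected samples
        rej = np.zeros(n, dtype=bool)
        rej[order[:m]] = True
        r = m / n
        A = correct[~rej].mean()
        Q = (correct[~rej].sum() + (1 - correct[rej]).sum()) / n
        rs.append(r); As.append(A); Qs.append(Q)
    return np.array(rs), np.array(As), np.array(Qs)
```

The rejected fraction maximizing Q and the largest rejected fraction with Q(r) ≥ A(0) give the operating points reported in the caption of Fig. 3.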
Synthetic data — design of classifiers with rejection: Following the minimization of the loss function (20) through the split optimization problem (23), we can derive a classification algorithm in which the parameters of the classifier and of the rejector are learnt from the training set. A 4-class problem is used to illustrate the potential of classifier design through the maximization of the classification quality. The samples follow a Gaussian distribution x_i ∼ N(µ_{y_i}, Σ), with µ1 = [1, 1], µ2 = [1, −1], µ3 = [−1, −1], µ4 = [−1, 1], and Σ = sI. From a training set of randomly selected samples, µ and Σ are learnt. The rejection is obtained by thresholding the class probabilities, i.e., r_i = 1 if p(y_i | x, µ, Σ) < τ. The classifier and rejector are trained following the formulation in (23), alternating between (1) learning the classification parameters that maximize the accuracy on the nonrejected training samples, and (2) learning the threshold on the class probabilities that maximizes the rejection of misclassified samples and minimizes the rejection of correctly classified samples. This results in a classifier design where the training set is pruned and the parameters of the rejector are learnt without requiring a separate training set. By varying the hardness of the classification problem (with different degrees of overlap of the Gaussians), we observe the ability of the rejector to adapt and cope with harder problems (Fig. 4 and Table I).
TABLE I
PERFORMANCE MEASURES FOR CLASSIFICATION WITH REJECTION IN FIG. 4.

noise level | initial accuracy | rej. fraction | nonrej. accuracy | class. quality | rej. quality
0.50        | 95.22%           | 0.60%         | 95.64%           | 95.24%         | 18.56
0.75        | 82.27%           | 1.04%         | 82.82%           | 82.65%         | 9.148
1.00        | 70.43%           | 9.91%         | 73.65%           | 72.22%         | 3.461
1.25        | 61.93%           | 11.12%        | 64.20%           | 63.31%         | 2.088

Fig. 4. Classifier with rejection derived from the minimization of the loss function associated with the classification quality: 4-class problem with highly overlapped classes (Σ = 0.50I, 0.75I, 1.00I, 1.25I). Ground truth (top row) and classification results with rejection (bottom row); rejected samples are colored black.
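A simplified sketch of the synthetic setup (our code; the sample size, the threshold grid, the shared learnt covariance, and the single training pass are our simplifications of the alternating procedure described above):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
s = 1.0                                        # noise level, Sigma = s * I
K, n_train = 4, 400

# generate training data
y_train = rng.integers(K, size=n_train)
x_train = mu[y_train] + np.sqrt(s) * rng.standard_normal((n_train, 2))

# (1) learn class means and a shared covariance from the training set
mu_hat = np.array([x_train[y_train == k].mean(axis=0) for k in range(K)])
sigma_hat = np.cov((x_train - mu_hat[y_train]).T)

def posteriors(x):
    """p(y_i = k | x_i) under the learnt Gaussian class models (equal priors)."""
    lik = np.array([multivariate_normal.pdf(x, mean=mu_hat[k], cov=sigma_hat)
                    for k in range(K)]).T
    return lik / lik.sum(axis=1, keepdims=True)

# (2) learn the rejection threshold tau by maximizing the classification quality Q
P = posteriors(x_train)
y_hat = P.argmax(axis=1)
p_max = P[np.arange(n_train), y_hat]
correct = (y_hat == y_train).astype(float)
best_tau, best_Q = 0.0, -np.inf
for tau in np.linspace(0.25, 1.0, 76):
    rej = p_max < tau                          # r_i = 1 if p(y_i | x) < tau
    Q = (correct[~rej].sum() + (1 - correct[rej]).sum()) / n_train
    if Q > best_Q:
        best_tau, best_Q = tau, Q
print(f"tau = {best_tau:.2f}, training Q = {best_Q:.3f}")
```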
VIII. CONCLUSIONS

We presented a set of three measures to evaluate the performance of a general classification system with rejection. The measures allow for the comparison of the performance of the classification system, of the rejection system, and of the coupled system. We derive bounds for the measures, present a reformulation of the measures for situations where it is not possible to obtain a labeling of the rejected samples, and show the adequacy of the measures to assess the usefulness of the reject option.

REFERENCES

[1] J. Quevedo, A. Bahamonde, M. Pérez-Enciso, and O. Luaces, “Disease liability prediction from large scale genotyping data using classifiers with a reject option,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 1, pp. 88–97, 2012.
[2] F. Condessa, J. Bioucas-Dias, C. A. Castro, J. A. Ozolek, and J. Kovačević, “Classification with rejection option using contextual information,” in Proc. IEEE Int. Symp. Biomed. Imag., San Francisco, CA, Apr. 2013, pp. 1340–1343.
[3] A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang, “Image classification for context-based indexing,” IEEE Transactions on Image Processing, vol. 10, no. 1, pp. 117–130, 2001.
[4] C. K. Chow, “On optimum recognition error and reject tradeoff,” IEEE Trans. Inf. Theory, vol. 16, no. 1, pp. 41–46, Jan. 1970.
[5] M. Wegkamp, “Lasso type classifiers with a reject option,” Electronic Journal of Statistics, pp. 155–168, 2007.
[6] P. Bartlett and M. Wegkamp, “Classification methods with reject option using a hinge loss,” Journal of Machine Learning Research, vol. 9, pp. 1823–1840, Aug. 2008.
[7] M. Yuan and M. Wegkamp, “Classification methods with reject option based on convex risk minimization,” Journal of Machine Learning Research, vol. 11, pp. 111–130, Mar. 2010.
[8] G. Fumera and F. Roli, “Support vector machines with embedded reject option,” in Proc. Int. Workshop on Pattern Recognition with Support Vector Machines (SVM2002), Niagara Falls, Canada, Aug. 2002, pp. 68–82, Springer-Verlag.
[9] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu, “Support vector machines with a reject option,” in Advances in Neural Information Processing Systems, pp. 537–544, 2009.
[10] M. Wegkamp and M. Yuan, “Support vector machines with a reject option,” Bernoulli, vol. 17, no. 4, pp. 1368–1385, 2011.
[11] I. Pillai, G. Fumera, and F. Roli, “Multi-label classification with a reject option,” Patt. Recogn., vol. 46, no. 8, pp. 2256–2266, 2013.
[12] G. Fumera, F. Roli, and G. Giacinto, “Reject option with multiple thresholds,” Patt. Recogn., vol. 33, no. 12, pp. 2099–2101, Dec. 2000.
[13] G. Fumera and F. Roli, “Analysis of error-reject trade-off in linearly combined multiple classifiers,” Patt. Recogn., vol. 37, no. 6, pp. 1245–1265, 2004.
[14] R. Sousa and J. Cardoso, “The data replication method for the classification with reject option,” AI Communications, vol. 26, no. 3, pp. 281–302, 2013.
[15] G. Fumera, I. Pillai, and F. Roli, “Classification with reject option in text categorisation systems,” in Proc. 12th International Conference on Image Analysis and Processing (ICIAP '03), Washington, DC, USA, 2003, pp. 582–587, IEEE Computer Society.
[16] T. Landgrebe, D. Tax, P. Paclík, and R. Duin, “The interaction between classification and reject performance for distance-based reject-option classifiers,” Pattern Recognition Letters, vol. 27, no. 8, pp. 908–917, 2006.
[17] F. Condessa, J. Bioucas-Dias, and J. Kovačević, “Robust hyperspectral image classification with rejection fields,” in Proc. IEEE Workshop Hyperspectr. Image Signal Process.: Evolution in Remote Sensing, Tokyo, June 2015, submitted.
[18] J. Li, J. Bioucas-Dias, and A. Plaza, “Hyperspectral image segmentation using a new Bayesian approach with active learning,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3947–3960, Oct. 2011.
[19] J. Bioucas-Dias, F. Condessa, and J. Kovačević, “Alternating direction optimization for image segmentation using hidden Markov measure field models,” in Proc. SPIE Conf. Image Process., San Francisco, CA, Feb. 2014.