Automatic Generation of Optimum Classification Cascades

Ezzat El-Sherif†, Sherif Abdelazeem†, M. Fathy Abu El-Yazeed‡
[email protected], [email protected], [email protected]
† Electronics Engineering Dept., The American University in Cairo
‡ Electronics and Communications Engineering Dept., Cairo University

Abstract

In this paper, we present a novel technique to automatically generate optimum classification cascades. Given a powerful classifier SF with satisfactory accuracy and a set of N classifiers, our algorithm builds the fastest cascade that achieves an accuracy not less than that of SF. The algorithm is fully automatic and has a complexity of O(N^2), which means it is fast and scalable to large values of N.

1. Introduction

Suppose we have a classification task for which we have already found a complex classification technique that achieves satisfactory accuracy. Suppose also that, while this classification technique is very powerful, its time complexity is unacceptably high. This scenario happens frequently in practice, as many powerful but very time-consuming techniques have been devised in recent years (e.g., SVMs and multi-classifier systems). Our goal is to build a system that preserves the accuracy of the complex classifier while having much better timing performance.

One solution to this problem is a classification cascade system. In such a system, all the patterns to be classified first go through a first stage; those patterns that are classified with a confidence score higher than a certain threshold leave the system with the labels given to them by the first stage. The patterns that are classified with confidence scores lower than the threshold are rejected to the second stage. In the same manner, the patterns pass through successive stages until they reach the powerful last stage, which does not reject any patterns. Figure 1 illustrates this idea.

The idea of classification cascades has been well known for a long time but has not attracted much attention in spite of its practical importance.


Recently, since the prominent work of Viola and Jones [1], the cascade idea has been attracting considerable attention in the context of object detection, which is a rare-event classification problem. To avoid any confusion, we will call the cascades used in the context of object detection "detection cascades", while we will call the cascades used in regular classification problems, which are our interest in this paper, "classification cascades". Building classification cascades can be seen as an optimization problem and might be tackled using various optimization techniques (e.g., particle swarm optimization [2] and simulated annealing [3]). The most elegant approach is that of Chellapilla et al. [4], which uses depth-first search. However, that algorithm has complexity O(Q^N), where Q is the number of threshold quantization levels [4]. This means the algorithm is very slow and does not scale to large values of N.

[Figure 1 shows a chain of stages. Each stage i asks whether the confidence score exceeds Ti: if yes, the pattern leaves with that stage's decision; if no, it is passed on to the next stage, until the final stage, which always emits a decision.]

Figure 1. Typical classification cascade system.
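To make the control flow of Figure 1 concrete, the following is a minimal sketch of cascade classification. It is illustrative only: the classifier interface (a predict method returning a label and a confidence score) and all names are our own assumptions, not the paper's code.

    # Illustrative sketch of cascade inference (assumed interface: each
    # classifier has predict(pattern) -> (label, confidence)).

    def cascade_classify(pattern, stages, thresholds, final_stage):
        for stage, T_i in zip(stages, thresholds):
            label, confidence = stage.predict(pattern)
            if confidence > T_i:
                return label              # confident enough: exit here
            # otherwise the pattern is "rejected" to the next stage
        label, _ = final_stage.predict(pattern)
        return label                      # S_F never rejects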

In this paper, we present an algorithm that automatically generates classification cascades given a set of N classifiers and a powerful classifier with satisfactory accuracy. The algorithm has complexity O(N^2), which means it is fast and scalable to large values of N.

2. Problem statement

We first present our notation. We denote an unordered set by a boldface character surrounded by curly braces, and its elements by the same character in italics, subscripted by numbers (e.g., A1, A2, A3, ... ∈ {A}). Note that the subscripts of an unordered set are arbitrary and hold no ordering significance. An ordered set (or array) is denoted by just a boldface character, and its elements by the same character in italics, subscripted by numbers according to their order in the array (e.g., A1, A2, A3, ... ∈ A, where A1 is the first element of A, A2 is the second, etc.). B ⊂ A means that all the elements of the ordered set B exist in the ordered set A in the same order. B ⊂ {A} means that all the elements of the ordered set B exist in the unordered set {A}. B ≡ {A} means that the elements of B are the same as those of {A} but ordered, i.e., B is an ordered version of {A}. We enumerate the elements of an unordered set {A} as {A} = {A1, A2, ...} and the elements of an ordered set A as A = [A1 A2 ...]. C = [A B] means that the ordered set C is the concatenation of the two ordered sets A and B. In this paper, we represent a cascade by an ordered set whose elements are the classification stages, ordered in the set from left to right.

We now state our problem. Given a set of N classifiers {S} = {S1, S2, S3, ..., SN} and a powerful classifier SF ∉ {S} that achieves satisfactory accuracy, the problem is to select an ordered set Sopt ⊂ {S} and a corresponding ordered set of thresholds Topt such that [Sopt SF], if put in a cascade structure, gives the optimal cascade. By the optimal cascade we mean the one that has the least possible complexity while achieving an accuracy not less than that of SF.

3. An algorithm for automatically generating optimal classification cascades

We start by partitioning the available dataset into three parts: a training set, a validation set, and a test set. The training set is used for training the classifiers from which the cascade is built: {S} and SF. The validation set is used by the algorithm described in this section to build the optimal cascade. The test set is used to test the performance of the overall system.

In this section, we present an algorithm that automatically generates optimal classification cascades in the sense described in the previous section. The algorithm is composed of three major steps:

i. Find the set of thresholds {T}. Each element of {T} is a threshold for the corresponding classifier in {S} such that Topt ⊂ {T}.
ii. Sort the set {S} to form Sord ≡ {S} such that Sopt ⊂ Sord. With the same ordering pattern as Sord, sort {T} to form Tord.
iii. From Sord, select Sopt.

3.1. Step 1: Find {T} for {S}

Our procedure for finding {T} for {S} is as follows. Using each classifier Si ∈ {S}, classify the patterns of the validation set and order them according to the confidence scores given to them by Si. Traverse these ordered patterns from the highest confidence score to the lowest. While traversing, monitor whether each pattern is correctly or falsely classified. Upon hitting a falsely classified pattern, check whether this same pattern is also falsely classified by SF. If it is, this pattern would not contribute additional errors relative to SF and can be safely ignored, and we continue traversing. We stop when we hit a pattern that is falsely classified by the classifier under consideration Si but correctly classified by SF. We then set the threshold Ti of classifier Si to the confidence score of the pattern we stopped at. Doing the same for all the classifiers in {S} yields the corresponding set of thresholds {T}.
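As a concrete reading of this procedure, here is a minimal sketch in Python. The data layout (numpy arrays of per-pattern confidences and correctness flags on the validation set) is our own illustrative assumption.

    import numpy as np

    def find_threshold(conf_i, correct_i, correct_F):
        # conf_i    : confidence S_i assigns to each validation pattern
        # correct_i : True where S_i labels the pattern correctly
        # correct_F : True where the powerful classifier S_F is correct
        order = np.argsort(-conf_i)            # most confident first
        for idx in order:
            # errors that S_F also makes cost the cascade nothing extra,
            # so we only stop where S_i is wrong but S_F is right
            if (not correct_i[idx]) and correct_F[idx]:
                return conf_i[idx]             # T_i = score we stopped at
        return -np.inf                         # S_i never hurts accuracy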

3.2. Step 2: Sort {S} to form Sord ≡ {S}

The criterion by which we sort {S} is based on the following assumption.

Assumption 1. Using sufficiently tough thresholds Ti and Tj for the two classifiers Si and Sj, respectively, if the classifier Sj has a lower rejection rate than Si, then Sj is said to be more powerful than Si, and Si would reject all the patterns that Sj would reject.

The "rejection rate" of a classifier Si using threshold Ti is the number of rejected patterns divided by the number of validation set patterns when the threshold Ti is applied to its output. Assumption 1, while not perfectly realistic, is reasonable: since Sj is more powerful than Si, it is unlikely that Si can confidently classify a pattern that was hard for Sj to classify. Concerning classification cascade design, Assumption 1 leads to the following conclusion: putting a stage with a high rejection rate (a weak classifier) after a stage with a low rejection rate (a strong classifier) has no effect except to increase the complexity of the cascade, because the weak classifier will reject all the patterns that reach it from the strong classifier. This means that the only reasonable criterion for ordering the classifiers in the cascade is by decreasing rejection rate.

By this principle, we sort {S} to give Sord; the correspondingly sorted set of thresholds is Tord. We are now guaranteed that Sopt ⊂ Sord.
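Continuing the sketch above (same assumed arrays, with numpy already imported there), Step 2 reduces to computing each classifier's rejection rate at its Step 1 threshold and sorting in decreasing order:

    def sort_by_rejection(stages, thresholds, confs):
        # stages     : indices of the classifiers in {S}
        # thresholds : the T_i found in Step 1, aligned with `stages`
        # confs      : confs[i] holds classifier i's validation confidences
        # a pattern is rejected by stage i when its confidence is <= T_i
        rates = [np.mean(confs[i] <= t) for i, t in zip(stages, thresholds)]
        order = sorted(range(len(stages)), key=lambda k: -rates[k])
        s_ord = [stages[k] for k in order]
        t_ord = [thresholds[k] for k in order]
        return s_ord, t_ord                    # S_ord and T_ord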

3.3. Step 3: Select Sopt out of Sord

Now we want to select Sopt ⊂ Sord. This selection process, if done exhaustively, has complexity O(2^N); hence, exhaustive search is not feasible for a large number of classifiers N. In this section, we suggest an algorithm of complexity O(N^2) for finding Sopt ⊂ Sord.

First, we deduce a formula for the average complexity of a cascade given the complexities and rejection rates of its constituent classifiers. As an example, let S′ = [S′1 S′2 S′3 ... S′M] ⊂ Sord have thresholds T′ = [T′1 T′2 T′3 ... T′M], complexities C′ = [C′1 C′2 C′3 ... C′M], and rejection rates R′ = [R′1 R′2 R′3 ... R′M]. Let CF be the complexity of SF. Figure 2 represents each stage in the cascade by a rectangle representing the validation set; the shaded portion of the rectangle is the portion that is not rejected. Assumption 1 is evident in Figure 2, as each stage's rejections include the rejections of the preceding stage. It is clear from Figure 2 that the cascade [S′ SF] has an overall complexity Ctot, where

Ctot = C′1 + R′1C′2 + R′2C′3 + ... + R′M−1C′M + R′MCF    (1)
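As a quick numeric illustration (the numbers are ours, purely for intuition): with M = 2 stages of complexities C′ = [1 4], rejection rates R′ = [0.5 0.2], and CF = 100, equation (1) gives Ctot = 1 + 0.5·4 + 0.2·100 = 23, a small fraction of the cost of running SF on every pattern. The same numbers are reused after Algorithm 1 below.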

Equation (1) suggests a very simple algorithm for finding Sopt ⊂ Sord, which we describe through an example. Assume that Sord is composed of four stages, S1ord, S2ord, S3ord, and S4ord, followed by the last stage SF. We can represent such a scheme by successive nodes in a digraph, as shown in Figure 3. For convenience, we add a dummy node S0ord before S1ord. Node S0ord is the source of all the patterns to be classified and has zero complexity and a rejection rate of 1 (i.e., C0ord = 0 and R0ord = 1). The problem now is to find the path from S0ord to SF that leads to the least complex cascade. Define the distance from node Siord to node Sjord, for j > i, to be RiordCjord. Each cascade can then be represented by some path through Sord. For example, the path indicated in Figure 3(i) has a distance of C1ord + R1ordC2ord + R2ordC4ord + R4ordCF, which is equal to that cascade's complexity. Another possible path, of complexity C2ord + R2ordC3ord + R3ordCF, is shown in Figure 3(ii). The problem of finding the least complex cascade can thus be seen as finding the shortest path in a directed acyclic graph (DAG), which is a well-known problem. However, for completeness, we present it in the context of our problem. Assume that each node tries to find the shortest path from itself to SF, as well as the distance of this path. All the nodes can easily collect this information if we start at the last stage and proceed backward to the first. For example, in Figure 3 we start with S4ord. The shortest path from S4ord to SF is obviously [S4ord SF], as this is the only possible path; its distance is R4ordCF. For node S3ord, we have two possible paths, [S3ord S4ord SF] and [S3ord SF], so we compare their distances, R3ordC4ord + R4ordCF and R3ordCF, respectively. The shortest path, together with its distance, is found and saved at node S3ord. We then proceed to S2ord. Node S2ord has just three options: jump to S3ord, jump to S4ord, or jump to SF. If we jump from S2ord to S3ord, and since we are interested only in shortest paths, the path from S3ord to SF has already been decided at S3ord and need not be recalculated; the complexity of the path in this case is R2ordC3ord plus the distance of the shortest path from S3ord to SF. The other two options for S2ord (jumping to S4ord and jumping to SF) are examined similarly, and the option with the least distance is saved at node S2ord. The same procedure is done for S1ord and S0ord. Finally, the shortest path from S0ord to SF is the cascade of least complexity. This procedure is described in Algorithm 1.

[Figure 2 depicts the stages S′1 ... S′M and SF as rectangles representing the validation set, annotated with the complexities C′1, C′2, C′3, ..., C′M, CF and the rejection rates R′1, R′2, R′3, ..., R′M; the shaded (accepted) region shrinks from stage to stage.]

Figure 2. A cascade represented by accepted and rejected patterns. The shaded region represents the accepted patterns.

Figure 3. Representing a cascade by a digraph.

Algorithm 1

for i = N down to 0 {
    distances = [ ]
    for j = i+1 up to N {    // if i+1 > N, skip the loop
        d = (distance from Siord to Sjord)
            + (shortest distance from Sjord to SF)
        distances = [distances d]
    } // end j
    d = distance from Siord to SF
    distances = [distances d]
    Find the path corresponding to the minimum element of the array
    'distances' and save this information at node Siord.
} // end i

The cascade of least complexity is the shortest path saved at node S0ord.
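Algorithm 1 is ordinary backward dynamic programming over a DAG; each node looks only at the nodes after it, so the total work is O(N^2), matching the claim above. Below is a compact runnable sketch under the same illustrative assumptions as the earlier snippets, with rejection rates and complexities held in plain Python lists.

    def optimal_cascade(C, R, C_F):
        # C   : complexities [C_0, C_1, ..., C_N], with C_0 = 0 (dummy source)
        # R   : rejection rates [R_0, ..., R_N], with R_0 = 1
        # C_F : complexity of the final, powerful classifier S_F
        N = len(C) - 1
        best = [0.0] * (N + 1)  # best[i]: shortest distance from node i to S_F
        nxt = [None] * (N + 1)  # nxt[i]: next node on that path (None = S_F)
        for i in range(N, -1, -1):
            best[i] = R[i] * C_F                 # option: jump straight to S_F
            for j in range(i + 1, N + 1):        # option: jump to stage j first
                d = R[i] * C[j] + best[j]
                if d < best[i]:
                    best[i], nxt[i] = d, j
        path, j = [], nxt[0]                     # read the path off node 0
        while j is not None:
            path.append(j)
            j = nxt[j]
        return best[0], path

For instance, optimal_cascade([0, 1, 4], [1, 0.5, 0.2], 100) returns (23.0, [1, 2]): both stages are kept, and the cost matches the numeric check after equation (1).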

4. Experiments

The dataset we used in our experiments is MNIST. MNIST has a total of 70,000 digits, which we partitioned into three parts: i) a training set of 50,000 digits, used for training the classifiers; ii) a validation set of 10,000 digits, used for optimizing the cascade system; and iii) a test set of the remaining 10,000 digits, used for final testing of the cascade. We then transformed each digit image in all sets into a 200-element feature vector using a gradient feature extraction technique [5].

To form {S}, we generated 48 classifiers of different types (linear classifiers, neural networks, RBF SVMs), with different structures and different feature subsets as inputs. The most powerful classifier (which we used as SF) was found to be the RBF SVM taking all 200 gradient features as input; it committed 66 errors on the test set (out of 10,000) and has a complexity of 5638.2 (the complexity of any classifier is measured as its recognition time divided by the lowest recognition time in {S}). Applying our algorithm to {S}, it picked an Sopt composed of 5 stages (the total number of classifiers in the cascade is then 6 after adding SF). The complexity of the resulting cascade is 187.8 and the number of committed errors is 70.

While the number of selected classifiers is small compared to the number of generated ones, introducing a large number of classifiers to the algorithm and letting it select the most appropriate ones proved effective. To validate this idea, we randomly selected 5 classifiers out of {S} (again giving 6 classifiers in the cascade after adding SF) and ran the algorithm over them. We repeated this 15 times and calculated the average complexity of the resulting cascades and the average number of corresponding errors.

The results are shown in Table 1; they are clearly inferior to the result of applying the algorithm to all the elements of {S}.

We now compare our technique with the depth-first search (DFS) devised by Chellapilla et al. [4]. Given a set of N ordered classifiers, the DFS algorithm systematically searches all possible cascade structures with Q permissible threshold values to find the optimum cascade. Using some heuristics, not all of the search space need be visited, and extensive pruning is possible. Because of the high complexity of the DFS algorithm (O(Q^N)), it was not possible to run it on all 48 classifiers. Hence, we ran DFS on the same random selections of 5 classifiers we previously used with our algorithm. The result, also shown in Table 1, is again inferior to the case where our algorithm is applied to the whole of {S}. This clarifies the importance of using a large number of classifiers N, and hence the importance of the low complexity of our proposed algorithm.

Table 1. Results on the test set

                                                         Complexity   Errors
Most powerful classifier (SF)                            5638.2       66
Our algorithm on whole {S}                               187.8        70
Our algorithm on 5 random stages from {S} (15-run avg)   376.9        67.3
DFS on 5 random stages from {S} (15-run avg)             358.7        66.5

5. Conclusion In this paper, we proposed an algorithm to automatically generate classification cascades. The algorithm is fast and scalable. Experiments showed that our algorithm is efficient and builds classification cascades that substantially reduce the overall system complexity while preserving the accuracy.

References

[1] Viola and Jones, "Rapid object detection using a boosted cascade of simple features", Proc. CVPR, pp. 511-518, 2001.
[2] Oliveira, Britto, and Sabourin, "Optimizing class-related thresholds with particle swarm optimization", Proc. IJCNN, vol. 3, pp. 1511-1516.
[3] Chellapilla, Shilman, and Simard, "Combining Multiple Classifiers for Faster Optical Character Recognition", Proc. DAS, pp. 358-367, 2006.
[4] Chellapilla, Shilman, and Simard, "Optimally Combining a Cascade of Classifiers", Proc. DRR, 2006.
[5] Liu, Nakashima, Sako, and Fujisawa, "Handwritten digit recognition: benchmarking of state-of-the-art techniques", Pattern Recognition, vol. 36, pp. 2271-2285, 2003.