Making a Shallow Network Deep: Growing a Tree from Decision Regions of a Boosting Classifier
Tae-Kyun Kim* http://mi.eng.cam.ac.uk/~tkk22
Ignas Budvytis* [email protected]
Roberto Cipolla [email protected]
Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
*indicates equal contribution.
[Figure 1 and Figure 3 graphics omitted: Figure 1 sketches a cascade of boosting classifiers H1(x), H2(x), ... and a single boosting classifier drawn as a flat network of weak-learners h1, ..., hT with weights a1, ..., aT; Figure 3 shows the decision regions R1-R8 formed by three weak-learners w1, w2, w3, their boolean table (reproduced below), and the tree built from the minimised expression.]

Figure 1: Boosting as a tree. (a) A boosting cascade is seen as an imbalanced tree, where each node is a boosting classifier. (b) A boosting classifier is a very shallow and flat network where each node is a decision stump, i.e. a weak-learner.

Figure 3: Boolean expression minimisation for an optimally short tree. (a) A boosting classifier splits a space by binary weak-learners (left). The regions are represented by the boolean table and the boolean expression is minimised (middle). An optimal short tree is built on the minimum expression (right).

Boolean table of Figure 3 (region code and class label c; "x" marks a don't-care region):

Region  w1  w2  w3   c
R1       0   0   0   0
R2       0   0   1   0
R3       0   1   0   0
R4       0   1   1   1
R5       1   0   0   1
R6       1   0   1   1
R7       1   1   0   1
R8       1   1   1   x
[Figure 2 graphics omitted: (a) decision regions of the boosting classifier versus a conventional decision tree; (b) the boosting classifier (average path length 20 weak-learners) and the converted super tree (average path length 3.8 weak-learners).]

Figure 2: Converting a boosting classifier into a tree for speeding up. (a) The decision regions of a boosting classifier (top) are smooth compared to those of a conventional decision tree (bottom). (b) The proposed conversion preserves the Boosting decision regions and has many short paths, speeding it up about 5 times.

[Figure 4 graphics omitted: example images from the MPEG-7 face data, Caltech background dataset, MIT+CMU face test set and BANCA face set, with plots comparing the boosting classifier and the super tree. The recoverable results table is reproduced below.]

Figure 4: Experimental results on the face images. Example face images are shown on the right.

No. of weak-learners | Boosting: FP / FN / APL | Fast exit (cascade): FP / FN / APL       | Super tree (cascade): FP / FN / APL
 20                  | 501 / 120 /  20         | 501 / 120 / 11.70                        | 476 / 122 / 7.51
 40                  | 264 / 126 /  40         | 264 / 126 / 23.26                        | 231 / 127 / 12.23
 60                  | 222 / 143 /  60         | 222 / 143 / 37.24                        | 212 / 142 / 14.38
100                  | 148 / 146 / 100         | 148 (144) / 146 (149) / 69.28 (37.4)     | (145) / (152) / (15.1)
200                  | 120 / 143 / 200         | 120 (146) / 143 (148) / 146.19 (38.1)    | (128) / (146) / (15.8)
(FP = false positives, FN = false negatives, APL = average path length, i.e. the average number of weak-learners evaluated per example.)
This paper presents a novel way to speed up the classification time of a boosting classifier. We make the shallow (flat) network deep (hierarchical) by growing a tree from the decision regions of a given boosting classifier. This provides many short paths for speeding up, while preserving the Boosting decision regions, which are reasonably smooth and therefore generalise well. We express the conversion as a Boolean optimisation problem.

Boosting as a tree: A cascade of boosting classifiers, which can be seen as a degenerate tree (see Figure 1(a)), effectively improves classification speed. Designing a cascade, however, involves manual effort in setting a number of parameters: the number of classifier stages, and the number of weak-learners and the threshold per stage. In this work, we propose a novel way to reduce the classification time of a boosting classifier that does not rely on a cascade design. The opportunity for improvement comes from the fact that a standard boosting classifier can be seen as a very shallow network (see Figure 1(b)), in which each weak-learner is a decision stump and all weak-learners are used to make a decision.

Conversion of a boosting classifier into a tree: Whereas a boosting classifier places decision stumps in a flat structure, a decision tree has a deep and hierarchical structure (see Figures 1(b) and 2(b)). The different structures lead to different behaviours: boosting generalises better owing to its reasonably smooth decision regions, but is not optimal in classification time. Whereas a conventional decision tree forms complex decision regions by trying to classify every training point, a boosting classifier exhibits reasonable smoothness in its decision regions (see Figure 2(a)). We propose a method to grow a tree from the decision regions of a boosting classifier. As shown in Figure 2(b), the obtained tree, called the super tree, preserves the Boosting decision regions: it places a leaf node on every region that is important for forming the identical decision boundary (i.e. preserving accuracy). At the same time, the super tree has many short paths that reduce the average number of weak-learners evaluated when classifying a data point. In the example, the super tree needs on average only 3.8 weak-learners to perform classification, whereas the boosting classifier needs 20.
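To make the source of the speed-up concrete, the following Python sketch (an illustration only, not the authors' implementation; the stump parameters, weights and tree layout are invented) contrasts the flat evaluation of a boosting classifier, which always visits all T weak-learners, with a tree whose leaves can be reached after only a few tests:

```python
import numpy as np

# Flat boosting classifier: every query evaluates all T decision stumps.
def boosting_predict(x, stumps, alphas):
    """stumps: list of (feature_index, threshold); alphas: per-stump weights.
    Returns the sign of the weighted sum H(x) = sum_i alpha_i * h_i(x)."""
    H = sum(a * (1.0 if x[f] > t else -1.0) for a, (f, t) in zip(alphas, stumps))
    return 1 if H >= 0 else -1

# Tree with short paths: a query stops as soon as it reaches a leaf.
# Hypothetical node layout: ('leaf', label) or ('split', feature, threshold, left, right).
def tree_predict(x, node, depth=0):
    if node[0] == 'leaf':
        return node[1], depth          # depth = number of tests actually evaluated
    _, f, t, left, right = node
    return tree_predict(x, right if x[f] > t else left, depth + 1)

# Toy data: 20 stumps versus a 3-level tree (all values purely illustrative).
rng = np.random.default_rng(0)
stumps = [(int(rng.integers(0, 5)), float(rng.normal())) for _ in range(20)]
alphas = rng.random(20)
tree = ('split', 0, 0.0,
        ('leaf', -1),                                            # short path: 1 test
        ('split', 1, 0.0,
         ('leaf', 1),
         ('split', 2, 0.0, ('leaf', -1), ('leaf', 1))))          # longest path: 3 tests

x = rng.normal(size=5)
print(boosting_predict(x, stumps, alphas))   # always evaluates all 20 weak-learners
print(tree_predict(x, tree))                 # (label, number of tests used for this x)
```

The average of the per-query path lengths over a test set is the "average path length" reported in the experiments.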
Boolean optimisation: A standard boosting classifier is represented by the weighted sum of binary weak-learners as $H(x) = \sum_{i=1}^{m} \alpha_i h_i(x)$, where $\alpha_i$ is the weight and $h_i \in \{-1, +1\}$ the $i$-th binary weak-learner. The boosting classifier splits the data space into $2^m$ primitive regions by its $m$ binary weak-learners. The regions $R_i$, $i = 1, \ldots, 2^m$, are expressed as boolean codes (i.e. each weak-learner $h_i$ corresponds to a binary variable $w_i$). See Figure 3 for an example, where the boolean table comprises $2^3$ regions. The class label $c$ of each region is determined by the boosting sum. Region $R_8$ in the example does not occupy the 2D input space and thus receives the don't-care label, marked "x", which is ignored when representing the decision regions. The boolean expression for the table in Figure 3 can be minimised by optimally joining the regions that share the same class label or a don't-care label:

$\overline{w}_1 w_2 w_3 \vee w_1 \overline{w}_2 \overline{w}_3 \vee w_1 \overline{w}_2 w_3 \vee w_1 w_2 \overline{w}_3 \;\longrightarrow\; w_1 \vee \overline{w}_1 w_2 w_3,$

where $\vee$ denotes the OR operator. The minimised expression has a smaller number of terms: only the two terms $w_1$ and $\overline{w}_1 w_2 w_3$ remain, representing the joined regions $R_5$-$R_8$ and the region $R_4$ respectively. A short tree is then built from the minimised boolean expression by placing the more frequent variables at the top of the tree (see Figure 3, right). Standard methods for boolean expression minimisation, previously studied for circuit design, are limited to a small number of binary variables, i.e. weak-learners, and they treat all regions with equal importance. We propose a novel boolean optimisation method for obtaining a reasonably short tree for the large number of weak-learners of a boosting classifier: the classifier information is efficiently packed using the region coding, and the tree is grown by maximising the region information gain. The paper gives further details on a better way of packing the region information and on a two-stage cascade that allows the conversion with any number of weak-learners.

Experiments: Experiments on synthetic and face image data sets show that the obtained tree significantly speeds up both a standard boosting classifier and Fast-exit, a prior art for fast boosting classification, at the same accuracy. The proposed method, as a general meta-algorithm, is also useful for a boosting cascade, since it speeds up the individual stage classifiers by different gains. Figure 4 compares the average path lengths of the methods at the fixed accuracy obtained with a threshold of 0 in the face image experiment. The proposed method is further demonstrated on rapid object tracking and segmentation problems. See the technical report at the authors' website.
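Returning to the Figure 3 example: the following Python sketch (a toy illustration rather than the paper's algorithm; the entropy-based gain below is only a simple stand-in for the region information gain described above) encodes the boolean table and greedily grows a short tree over the regions, treating don't-care regions as free:

```python
from collections import Counter
import math

# Boolean table of Figure 3: region code (w1, w2, w3) -> class label; 'x' = don't care.
table = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 1, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 'x',
}

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def grow(regions):
    """Greedy stand-in for a region-gain criterion: split on the weak-learner
    (boolean variable) whose split gives the largest entropy gain."""
    labels = [c for _, c in regions if c != 'x']   # don't-care regions never force a split
    if len(set(labels)) <= 1:
        return ('leaf', labels[0] if labels else 'x')
    best = None
    for var in range(len(regions[0][0])):
        sides = ([rc for rc in regions if rc[0][var] == 0],
                 [rc for rc in regions if rc[0][var] == 1])
        gain = entropy(labels) - sum(
            len(side) / len(regions) *
            entropy([c for _, c in side if c != 'x'] or ['x'])
            for side in sides)
        if best is None or gain > best[0]:
            best = (gain, var, sides)
    _, var, (left, right) = best
    return ('split', var, grow(left), grow(right))   # children: var = 0 branch, var = 1 branch

print(grow(list(table.items())))
# Splits on w1 first: the w1 = 1 branch becomes a single class-1 leaf (regions R5-R8), while
# the w1 = 0 branch uses w2 and w3 to isolate R4, matching the short tree of Figure 3 (right).
```

The resulting tree mirrors the minimised expression $w_1 \vee \overline{w}_1 w_2 w_3$: one test suffices for most of the space, and only the $R_4$ corner needs the full depth of three tests.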