Deep Boosting
Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Umar Syed (Google Research)
MEHRYAR MOHRI
COURANT INSTITUTE & GOOGLE RESEARCH.
Deep Boosting Essence
Ensemble Methods in ML
Combining several base classifiers to create a more accurate one.
• Bagging (Breiman 1996).
• AdaBoost (Freund and Schapire 1997).
• Stacking (Smyth and Wolpert 1999).
• Bayesian averaging (MacKay 1996).
• Other averaging schemes, e.g., (Freund et al. 2004).
Often very effective in practice. Benefit of favorable learning guarantees.
Convex Combinations
Base classifier set H:
• boosting stumps.
• decision trees with limited depth or number of leaves.
Ensemble combinations: convex hull of the base classifier set,
conv(H) = \Big\{ \sum_{t=1}^{T} \alpha_t h_t : \alpha_t \ge 0, \; \sum_{t=1}^{T} \alpha_t \le 1, \; \forall t, \, h_t \in H \Big\}.
Ensembles - Margin Bound (Koltchinskii and Panchenko, 2002)
Theorem: let H be a family of real-valued functions. Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in conv(H):
R(f) \le \hat{R}_{S,\rho}(f) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},
where \hat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}.
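For concreteness, a minimal Python sketch (not from the slides; names illustrative) of the empirical margin loss \hat{R}_{S,\rho}(f) appearing in the bound:

```python
import numpy as np

def empirical_margin_loss(f, X, y, rho):
    """Fraction of sample points with margin y_i * f(x_i) at most rho.

    f: callable returning a real-valued score; X: examples; y: labels in {-1, +1}.
    """
    margins = np.asarray(y, dtype=float) * np.array([f(x) for x in X])
    return float(np.mean(margins <= rho))
```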
Questions
Can we use a much richer or deeper base classifier set?
• richer families needed for difficult tasks.
• but the generalization bound indicates a risk of overfitting.
AdaBoost (Freund and Schapire, 1997)
Description: coordinate descent applied to
F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} \exp\Big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big).
Guarantees: ensemble margin bound.
• but AdaBoost does not maximize the margin!
• some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006)
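Not code from the slides; a minimal Python sketch of this view of AdaBoost as coordinate descent on F(\alpha), assuming the base classifiers are supplied as a finite list of ±1-valued functions. The greedy coordinate choice and the closed-form step below are the standard AdaBoost updates for the exponential loss; all names are illustrative.

```python
import numpy as np

def adaboost(H, X, y, T=100):
    """Coordinate descent on F(alpha) = sum_i exp(-y_i sum_t alpha_t h_t(x_i)).

    H: list of base classifiers h(x) returning values in {-1, +1};
    X: list of examples; y: labels in {-1, +1}. Returns mixture weights alpha.
    """
    X, y = list(X), np.asarray(y, dtype=float)
    m = len(X)
    preds = np.array([[h(x) for x in X] for h in H], dtype=float)  # |H| x m
    D = np.full(m, 1.0 / m)          # distribution over the sample
    alpha = np.zeros(len(H))
    for _ in range(T):
        errs = np.array([D[y != p].sum() for p in preds])  # weighted errors
        j = int(np.argmin(errs))                           # best coordinate
        eps = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        step = 0.5 * np.log((1 - eps) / eps)               # closed-form step
        alpha[j] += step
        D *= np.exp(-step * y * preds[j])                  # reweight the sample
        D /= D.sum()
    return alpha
```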
Suspicions
Complexity of hypotheses used:
• arc-gv tends to use deeper decision trees to achieve a larger margin.
Notion of margin:
• minimal margin perhaps not the appropriate notion.
• margin distribution is key.
Can we shed more light on these questions?
Question
Main question: how can we design ensemble algorithms that can succeed even with very deep decision trees or other complex sets?
• theory.
• algorithms.
• experimental results.
• model selection.
Theory
Base Classifier Set H
Decomposition in terms of sub-families H_1, \ldots, H_p or their union H_1 \cup H_2 \cup \cdots \cup H_p.
[Figure: sub-families H_1, ..., H_5 and the nested unions H_1, H_1 \cup H_2, ..., H_1 \cup \cdots \cup H_p.]
Ensemble Family
Non-negative linear ensembles F = conv(\cup_{k=1}^{p} H_k):
f = \sum_{t=1}^{T} \alpha_t h_t with \alpha_t \ge 0, \; \sum_{t=1}^{T} \alpha_t \le 1, \; h_t \in H_{k_t}.
[Figure: sub-families H_1, ..., H_5.]
Ideas
Use hypotheses drawn from the H_k's with larger k's but allocate more weight to hypotheses drawn from smaller k's.
• how can we determine quantitatively the amounts of mixture weights apportioned to different families?
• can we provide learning guarantees guiding these choices?
Learning Guarantee (Cortes, MM, and Syed, 2014)
Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in F:
R(f) \le \hat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t \mathfrak{R}_m(H_{k_t}) + O\left(\sqrt{\frac{\log p}{\rho^2 m}}\right).
Consequences
Complexity term with explicit dependency on the mixture weights.
• quantitative guide for controlling the weights assigned to more complex sub-families.
• bound can be used to inspire, or directly define, an ensemble algorithm.
Algorithms
Set-Up
H_1, \ldots, H_p: disjoint sub-families of functions taking values in [-1, +1].
Further assumption (not necessary): symmetric sub-families, i.e., h \in H_k \Rightarrow -h \in H_k.
Notation:
• r_j = \mathfrak{R}_m(H_{k_j}) with h_j \in H_{k_j}.
Derivation
Learning bound suggests seeking \alpha \ge 0 with \sum_{t=1}^{T} \alpha_t \le 1 to minimize
\frac{1}{m} \sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t r_t.
Convex Surrogates
Let u \mapsto \Phi(-u) be a decreasing convex function upper bounding u \mapsto 1_{u \le 0}, with \Phi differentiable.
Two principal choices:
• Exponential loss: \Phi(-u) = \exp(-u).
• Logistic loss: \Phi(-u) = \log_2(1 + \exp(-u)).
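As a small sketch (Python; not from the slides), the two surrogates, each upper bounding the 0/1 indicator u \mapsto 1_{u \le 0}:

```python
import numpy as np

def exp_loss(u):
    """Exponential surrogate: Phi(-u) = exp(-u)."""
    return np.exp(-u)

def logistic_loss(u):
    """Logistic surrogate: Phi(-u) = log_2(1 + exp(-u))."""
    return np.log2(1.0 + np.exp(-u))
```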
Optimization Problem (Cortes, MM, and Syed, 2014)
Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to:
\min_{\alpha \in \mathbb{R}^N} \; \frac{1}{m} \sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N} (\lambda r_j + \beta) |\alpha_j|,
where \lambda, \beta \ge 0, and for each hypothesis, keep either h or -h.
DeepBoost Algorithm
Coordinate descent applied to the convex objective.
• non-differentiable function.
• definition of maximum coordinate descent.
Direction & Step
Maximum direction: definition based on the error
\epsilon_{t,j} = \frac{1}{2}\Big(1 - \mathbb{E}_{i \sim D_t}[y_i h_j(x_i)]\Big),
where D_t is the distribution over the sample at iteration t.
Step:
• closed-form expressions for exponential and logistic losses.
• general case: line search.
Pseudocode
Each hypothesis h_j is penalized by \lambda r_j + \beta.
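Not the paper's pseudocode; a minimal Python sketch of the coordinate-descent loop for the exponential loss, assuming each base classifier h_j comes with a complexity estimate r_j for its sub-family. The direction rule below is a simplified penalized-edge heuristic based on \epsilon_{t,j}, and the step uses a coarse line search rather than the closed-form expressions; all names and constants are illustrative.

```python
import numpy as np

def deepboost_exp(H, r, X, y, lam=1e-4, beta=1e-3, T=100):
    """Coordinate descent on
    (1/m) sum_i exp(1 - y_i sum_j a_j h_j(x_i)) + sum_j (lam*r_j + beta)|a_j|.

    H: list of {-1,+1}-valued base classifiers (with -h also available, per the
    symmetry assumption); r[j]: complexity estimate of the sub-family of h_j.
    """
    X, y, r = list(X), np.asarray(y, dtype=float), np.asarray(r, dtype=float)
    m, N = len(X), len(H)
    preds = np.array([[h(x) for h in H] for x in X], dtype=float)  # m x N

    def objective(a):
        margins = y * (preds @ a)
        return np.mean(np.exp(1.0 - margins)) + np.sum((lam * r + beta) * np.abs(a))

    alpha = np.zeros(N)
    for _ in range(T):
        w = np.exp(1.0 - y * (preds @ alpha))          # per-example weights
        D = w / w.sum()                                # distribution D_t over sample
        eps = 0.5 * (1.0 - preds.T @ (D * y))          # eps_{t,j} for every j
        scores = np.abs(0.5 - eps) - (lam * r + beta)  # penalized edges (heuristic)
        j = int(np.argmax(scores))
        e_j = np.zeros(N)
        e_j[j] = 1.0
        grid = np.linspace(-1.0, 1.0, 201)             # coarse line search for the step
        alpha[j] += float(min(grid, key=lambda eta: objective(alpha + eta * e_j)))
    return alpha
```

With \lambda = \beta = 0 this reduces to plain coordinate descent on the AdaBoost-style objective over the union of the sub-families, matching the connections noted on the next slide.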
Connections with Previous Work
For \lambda = \beta = 0, DeepBoost coincides with
• AdaBoost (Freund and Schapire 1997), run with the union of the sub-families, for the exponential loss.
• additive Logistic Regression (Friedman et al., 1998), run with the union of the sub-families, for the logistic loss.
For \lambda = 0 and \beta \ne 0, DeepBoost coincides with
• L1-regularized AdaBoost (Raetsch, Mika, and Warmuth 2001), for the exponential loss.
• L1-regularized Logistic Regression (Duchi and Singer 2009), for the logistic loss.
Experiments
Rad. Complexity Estimates
Benefit of data-dependent analysis:
• empirical estimates of each \mathfrak{R}_m(H_k).
• example: for a kernel function K_k,
\hat{\mathfrak{R}}_S(H_k) \le \frac{\sqrt{\mathrm{Tr}[K_k]}}{m}.
• alternatively, upper bounds in terms of growth functions,
\mathfrak{R}_m(H_k) \le \sqrt{\frac{2 \log \Pi_{H_k}(m)}{m}}.
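A sketch of the two estimates (Python; helper names are illustrative):

```python
import numpy as np

def kernel_rademacher_estimate(K):
    """Data-dependent estimate: R_S(H_k) <= sqrt(Tr[K_k]) / m for a kernel family.

    K: m x m Gram matrix of the kernel K_k on the sample.
    """
    m = K.shape[0]
    return float(np.sqrt(np.trace(K)) / m)

def growth_function_bound(growth_value, m):
    """Distribution-free bound: R_m(H_k) <= sqrt(2 log Pi_{H_k}(m) / m)."""
    return float(np.sqrt(2.0 * np.log(growth_value) / m))
```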
Experiments (1)
Family of base classifiers defined by boosting stumps:
• boosting stumps H_1^{stumps} (threshold functions).
• in dimension d, \Pi_{H_1^{stumps}}(m) \le 2md, thus
\mathfrak{R}_m(H_1^{stumps}) \le \sqrt{\frac{2 \log(2md)}{m}}.
• H_2^{stumps}: decision trees of depth 2, with the same question at the internal nodes of depth 1.
• in dimension d, \Pi_{H_2^{stumps}}(m) \le (2m)^2 \frac{d(d-1)}{2}, thus
\mathfrak{R}_m(H_2^{stumps}) \le \sqrt{\frac{2 \log(2m^2 d(d-1))}{m}}.
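For instance, a short sketch plugging the two growth-function counts above into the bound (the sample size and dimension below are made up for illustration):

```python
import numpy as np

def stump_complexities(m, d):
    """Rademacher bounds for H1^stumps and H2^stumps in dimension d (d >= 2)."""
    r1 = np.sqrt(2.0 * np.log(2.0 * m * d) / m)               # H1^stumps
    r2 = np.sqrt(2.0 * np.log(2.0 * m**2 * d * (d - 1)) / m)  # H2^stumps
    return r1, r2

print(stump_complexities(m=1000, d=10))
```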
Experiments (1)
Base classifier set: H_1^{stumps} \cup H_2^{stumps}.
Data sets:
• same UCI Irvine data sets as (Breiman 1999) and (Reyzin and Schapire 2006).
• OCR data sets used by (Reyzin and Schapire 2006): ocr17, ocr49.
• MNIST data sets: ocr17-mnist, ocr49-mnist.
Experiments with exponential loss. Comparison with AdaBoost and AdaBoost-L1.
Experiments - Stumps Exp Loss (Cortes, MM, and Syed, 2014)
Table 1: Results for boosted decision stumps and the exponential loss function.

                        Error (std dev)     Avg tree size   Avg no. of trees
breastcancer
  AdaBoost H1^stumps    0.0429 (0.0248)     1               100
  AdaBoost H2^stumps    0.0437 (0.0214)     2               100
  AdaBoost-L1           0.0408 (0.0223)     1.436           43.6
  DeepBoost             0.0373 (0.0225)     1.215           21.6
ocr17
  AdaBoost H1^stumps    0.0085 (0.0072)     1               100
  AdaBoost H2^stumps    0.008 (0.0054)      2               100
  AdaBoost-L1           0.0075 (0.0068)     1.086           37.8
  DeepBoost             0.0070 (0.0048)     1.369           36.9
ionosphere
  AdaBoost H1^stumps    0.1014 (0.0414)     1               100
  AdaBoost H2^stumps    0.075 (0.0413)      2               100
  AdaBoost-L1           0.0708 (0.0331)     1.392           39.35
  DeepBoost             0.0638 (0.0394)     1.168           17.45
ocr49
  AdaBoost H1^stumps    0.0555 (0.0167)     1               100
  AdaBoost H2^stumps    0.032 (0.0114)      2               100
  AdaBoost-L1           0.03 (0.0122)       1.99            99.3
  DeepBoost             0.0275 (0.0095)     1.96            96
german
  AdaBoost H1^stumps    0.243 (0.0445)      1               100
  AdaBoost H2^stumps    0.2505 (0.0487)     2               100
  AdaBoost-L1           0.2455 (0.0438)     1.54            54.1
  DeepBoost             0.2395 (0.0462)     1.76            76.5
ocr17-mnist
  AdaBoost H1^stumps    0.0056 (0.0017)     1               100
  AdaBoost H2^stumps    0.0048 (0.0014)     2               100
  AdaBoost-L1           0.0046 (0.0013)     2               100
  DeepBoost             0.0040 (0.0014)     1.99            100
diabetes
  AdaBoost H1^stumps    0.253 (0.0330)      1               100
  AdaBoost H2^stumps    0.260 (0.0518)      2               100
  AdaBoost-L1           0.254 (0.04868)     1.9975          100
  DeepBoost             0.253 (0.0510)      1.9975          100
ocr49-mnist
  AdaBoost H1^stumps    0.0414 (0.00539)    1               100
  AdaBoost H2^stumps    0.0209 (0.00521)    2               100
  AdaBoost-L1           0.0200 (0.00408)    1.9975          100
  DeepBoost             0.0177 (0.00438)    1.9975          100
Experiments (2)
Family of base classifiers defined by decision trees of depth k. For trees with at most n nodes:
\mathfrak{R}_m(T_n) \le \sqrt{\frac{(4n + 2) \log_2(d + 2) \log(m + 1)}{m}}.
Base classifier set: \cup_{k=1}^{K} H_k^{trees}.
Same data sets as with Experiments (1). Both exponential and logistic loss. Comparison with AdaBoost and AdaBoost-L1, Logistic Regression and L1-Logistic Regression.
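A corresponding one-liner for the tree bound above, usable as the complexity estimate r_k for depth-k trees (Python; names illustrative):

```python
import numpy as np

def tree_rademacher_bound(n, d, m):
    """R_m(T_n) <= sqrt((4n + 2) * log2(d + 2) * log(m + 1) / m) for trees with <= n nodes."""
    return float(np.sqrt((4 * n + 2) * np.log2(d + 2) * np.log(m + 1) / m))
```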
Experiments - Trees Exp Loss (Cortes, MM, and Syed, 2014)

                Error (std dev)      Avg tree size   Avg no. of trees
breastcancer
  AdaBoost      0.0267 (0.00841)     29.1            67.1
  AdaBoost-L1   0.0264 (0.0098)      28.9            51.7
  DeepBoost     0.0243 (0.00797)     20.9            55.9
ocr17
  AdaBoost      0.004 (0.00316)      15.0            88.3
  AdaBoost-L1   0.003 (0.00100)      30.4            65.3
  DeepBoost     0.002 (0.00100)      26.0            61.8
ionosphere
  AdaBoost      0.0661 (0.0315)      29.8            75.0
  AdaBoost-L1   0.0657 (0.0257)      31.4            69.4
  DeepBoost     0.0501 (0.0316)      26.1            50.0
ocr49
  AdaBoost      0.0180 (0.00555)     30.9            92.4
  AdaBoost-L1   0.0175 (0.00357)     62.1            89.0
  DeepBoost     0.0175 (0.00510)     30.2            83.0
german
  AdaBoost      0.239 (0.0165)       3               91.3
  AdaBoost-L1   0.239 (0.0201)       7               87.5
  DeepBoost     0.234 (0.0148)       16.0            14.1
ocr17-mnist
  AdaBoost      0.00471 (0.0022)     15              88.7
  AdaBoost-L1   0.00471 (0.0021)     33.4            66.8
  DeepBoost     0.00409 (0.0021)     22.1            59.2
diabetes
  AdaBoost      0.249 (0.0272)       3               45.2
  AdaBoost-L1   0.240 (0.0313)       3               28.0
  DeepBoost     0.230 (0.0399)       5.37            19.0
ocr49-mnist
  AdaBoost      0.0198 (0.00500)     29.9            82.4
  AdaBoost-L1   0.0197 (0.00512)     66.3            81.1
  DeepBoost     0.0182 (0.00551)     30.1            80.9
Experiments - Trees Log Loss (Cortes, MM, and Syed, 2014)

                Error (std dev)       Avg tree size   Avg no. of trees
breastcancer
  LogReg        0.0351 (0.0101)       15              65.3
  LogReg-L1     0.0264 (0.0120)       59.9            16.0
  DeepBoost     0.0264 (0.00876)      14.0            23.8
ocr17
  LogReg        0.00300 (0.00100)     15.0            75.3
  LogReg-L1     0.00400 (0.00141)     7               53.8
  DeepBoost     0.00250 (0.000866)    22.1            25.8
ionosphere
  LogReg        0.074 (0.0236)        7               44.7
  LogReg-L1     0.060 (0.0219)        30.0            25.3
  DeepBoost     0.043 (0.0188)        18.4            29.5
ocr49
  LogReg        0.0205 (0.00654)      31.0            63.5
  LogReg-L1     0.0200 (0.00245)      31.0            54.0
  DeepBoost     0.0170 (0.00361)      63.2            37.0
german
  LogReg        0.233 (0.0114)        7               72.8
  LogReg-L1     0.232 (0.0123)        7               66.8
  DeepBoost     0.225 (0.0103)        14.4            67.8
ocr17-mnist
  LogReg        0.00422 (0.00191)     15              71.4
  LogReg-L1     0.00417 (0.00188)     15              55.6
  DeepBoost     0.00399 (0.00211)     25.9            27.6
diabetes
  LogReg        0.250 (0.0374)        3               46.0
  LogReg-L1     0.246 (0.0356)        3               45.5
  DeepBoost     0.246 (0.0356)        3               45.5
ocr49-mnist
  LogReg        0.0211 (0.00412)      28.7            79.3
  LogReg-L1     0.0201 (0.00433)      33.5            61.7
  DeepBoost     0.0201 (0.00411)      72.8            41.9
Multi-Class Learning Guarantee (Kuznetsov, MM, and Syed, 2014)
Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in F:
R(f) \le \hat{R}_{S,\rho}(f) + \frac{8c}{\rho} \sum_{t=1}^{T} \alpha_t \mathfrak{R}_m(\Pi_1(H_{k_t})) + O\left(\sqrt{\frac{\log p}{\rho^2 m}}\right),
with c the number of classes and \Pi_1(H_k) = \{x \mapsto h(x, y) : y \in Y, h \in H_k\}.
Extension to Multi-Class
Similar data-dependent learning guarantee proven for the multi-class setting.
• bound depending on mixture weights and complexity of sub-families.
Deep Boosting algorithm for multi-class:
• similar extension taking into account the complexities of sub-families.
• several variants depending on the number of classes.
• different possible loss functions for each variant.
Experiments - Multi-Class
Table 1: Empirical results for MDeepBoostSum, \Phi = exp. AB stands for AdaBoost.

                Error (std dev)    Avg tree size   Avg no. of trees
abalone
  AB.MR         0.713 (0.0130)     69.8            17.9
  AB.MR-L1      0.696 (0.0132)     31.5            13.3
  MDeepBoost    0.677 (0.0092)     23.8            15.3
handwritten
  AB.MR         0.016 (0.0047)     187.3           34.2
  AB.MR-L1      0.011 (0.0026)     240.6           21.7
  MDeepBoost    0.009 (0.0012)     203.0           24.2
letters
  AB.MR         0.042 (0.0023)     1942.6          24.2
  AB.MR-L1      0.036 (0.0018)     1903.8          24.4
  MDeepBoost    0.032 (0.0016)     1914.6          23.3
pageblocks
  AB.MR         0.020 (0.0037)     134.8           8.5
  AB.MR-L1      0.017 (0.0021)     118.3           14.3
  MDeepBoost    0.013 (0.0027)     124.9           6.6
pendigits
  AB.MR         0.008 (0.0015)     272.5           23.2
  AB.MR-L1      0.006 (0.0023)     283.3           19.8
  MDeepBoost    0.004 (0.0011)     259.2           21.4
satimage
  AB.MR         0.089 (0.0062)     557.9           7.6
  AB.MR-L1      0.081 (0.0040)     478.8           7.3
  MDeepBoost    0.073 (0.0045)     535.6           7.6
statlog
  AB.MR         0.011 (0.0059)     74.8            23.2
  AB.MR-L1      0.006 (0.0035)     79.2            17.5
  MDeepBoost    0.004 (0.0030)     61.8            17.6
yeast
  AB.MR         0.388 (0.0392)     100.6           8.7
  AB.MR-L1      0.376 (0.0431)     111.7           6.5
  MDeepBoost    0.352 (0.0402)     71.4            7.7
Experiments - Multi-Class
Table 1: Empirical results for MDeepBoostCompSum, comparison with multinomial logistic regression.

                Error (std dev)    Avg tree size   Avg no. of trees
abalone
  LogReg        0.710 (0.0170)     162.1           22.2
  LogReg-L1     0.700 (0.0102)     156.5           9.8
  MDeepBoost    0.687 (0.0104)     28.0            10.2
handwritten
  LogReg        0.016 (0.0031)     237.7           32.3
  LogReg-L1     0.012 (0.0020)     186.5           32.8
  MDeepBoost    0.008 (0.0024)     153.8           35.9
letters
  LogReg        0.043 (0.0018)     1986.5          25.5
  LogReg-L1     0.038 (0.0012)     1759.5          29.0
  MDeepBoost    0.035 (0.0012)     1807.3          27.2
pageblocks
  LogReg        0.019 (0.0035)     127.4           4.5
  LogReg-L1     0.016 (0.0025)     151.7           6.8
  MDeepBoost    0.012 (0.0022)     147.9           7.4
pendigits
  LogReg        0.009 (0.0021)     306.3           21.9
  LogReg-L1     0.007 (0.0014)     277.1           20.8
  MDeepBoost    0.005 (0.0012)     262.7           19.7
satimage
  LogReg        0.091 (0.0066)     412.6           6.0
  LogReg-L1     0.082 (0.0057)     454.6           5.8
  MDeepBoost    0.074 (0.0056)     439.6           5.8
statlog
  LogReg        0.012 (0.0054)     74.3            22.3
  LogReg-L1     0.006 (0.0020)     71.6            20.6
  MDeepBoost    0.002 (0.0022)     65.4            17.5
yeast
  LogReg        0.381 (0.0467)     103.9           14.1
  LogReg-L1     0.375 (0.0458)     83.3            9.3
  MDeepBoost    0.354 (0.0468)     117.2           9.3
Other Related Algorithms
Structural Maxent models (Cortes, Kuznetsov, MM, and Syed, ICML 2015): feature functions chosen from a union of very complex families.
Deep Cascades (DeSalvo, MM, and Syed, ALT 2015): cascade of predictors with leaf predictors and node questions selected from very rich families.
Model Selection
Model Selection
Problem: how to select the hypothesis set H?
• H too complex: no generalization bound, overfitting.
• H too simple: generalization bound, but underfitting.
• balance between estimation and approximation errors.
[Figure: hypothesis set H with the learned hypothesis h, the best-in-class h*, and the Bayes hypothesis h_Bayes.]
Structural Risk Minimization (Vapnik and Chervonenkis, 1974; Vapnik, 1995)
SRM: H = \cup_{k=1}^{\infty} H_k with H_1 \subset H_2 \subset \cdots \subset H_k \subset \cdots
• solution: f^* = \mathrm{argmin}_{h \in H_k, k \ge 1} \hat{R}_S(h) + \mathrm{pen}(k, m).
[Figure: error vs. complexity: training error decreases, the penalty increases, and their sum (training error + penalty) is minimized at an intermediate complexity.]
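A minimal sketch of the SRM rule (Python; the per-family training routines and the penalty are assumed to be supplied, all names hypothetical):

```python
def srm_select(families, penalty, X, y):
    """Pick argmin over k of [ training error of best hypothesis in H_k + pen(k, m) ].

    families: list of fit functions, fit(X, y) -> (hypothesis, training_error);
    penalty: function (k, m) -> complexity penalty of the k-th family.
    """
    m = len(X)
    best_h, best_score = None, float("inf")
    for k, fit in enumerate(families, start=1):
        h, train_err = fit(X, y)
        score = train_err + penalty(k, m)
        if score < best_score:
            best_h, best_score = h, score
    return best_h
```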
Voted Risk Minimization
Ideas:
• no selection of a specific H_k.
• instead, use all H_k's: h = \sum_{k=1}^{p} \alpha_k h_k, with h_k \in H_k.
• hypothesis-dependent penalty: \sum_{k=1}^{p} \alpha_k \mathfrak{R}_m(H_k).
Deep ensembles.
Conclusion
Deep Boosting: ensemble learning with increasingly complex families.
• data-dependent theoretical analysis.
• algorithm based on the learning bound.
• extension to multi-class.
• ranking and other losses.
• enhancement of many existing algorithms.
• compares favorably to AdaBoost and Logistic Regression or their L1-regularized variants in experiments.