Deep Boosting

Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Umar Syed (Google Research).

MEHRYAR MOHRI

MOHRI@

COURANT INSTITUTE & GOOGLE RESEARCH.

Deep Boosting Essence


Ensemble Methods in ML

Combining several base classifiers to create a more accurate one.
• Bagging (Breiman, 1996).
• AdaBoost (Freund and Schapire, 1997).
• Stacking (Smyth and Wolpert, 1999).
• Bayesian averaging (MacKay, 1996).
• Other averaging schemes, e.g., (Freund et al., 2004).

Often very effective in practice. Benefit of favorable learning guarantees.

Convex Combinations

Base classifier set H:
• boosting stumps.
• decision trees with limited depth or number of leaves.

Ensemble combinations: convex hull of the base classifier set,

conv(H) = \Big\{ \sum_{t=1}^{T} \alpha_t h_t : \alpha_t \ge 0; \; \sum_{t=1}^{T} \alpha_t \le 1; \; \forall t, h_t \in H \Big\}.
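As an illustration (my own, not from the slides), a minimal sketch of evaluating such a convex combination, assuming the base classifiers are given as Python callables returning values in {-1, +1}:

```python
import numpy as np

def convex_ensemble_scores(base_classifiers, alphas, X):
    """Evaluate f(x) = sum_t alpha_t h_t(x) for a convex combination of base classifiers.

    base_classifiers: list of callables h_t mapping an array of points to {-1, +1}.
    alphas: nonnegative mixture weights with sum at most 1.
    """
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and alphas.sum() <= 1 + 1e-12
    scores = sum(a * h(X) for a, h in zip(alphas, base_classifiers))
    return scores  # np.sign(scores) gives the predicted labels
```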

Ensembles - Margin Bound (Koltchinskii and Panchenko, 2002)

Theorem: let H be a family of real-valued functions. Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in conv(H):

R(f) \le \hat{R}_{S,\rho}(f) + \frac{2}{\rho} R_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},

where \hat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}.
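For concreteness, a small sketch (mine, not from the slides) of computing the empirical margin loss \hat{R}_{S,\rho}(f) appearing in the bound:

```python
import numpy as np

def empirical_margin_loss(scores, y, rho):
    """Fraction of points whose margin y_i * f(x_i) is at most rho.

    scores: ensemble scores f(x_i); y: labels in {-1, +1}; rho > 0.
    """
    margins = y * scores
    return float(np.mean(margins <= rho))
```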

Questions

Can we use a much richer or deeper base classifier set?
• richer families needed for difficult tasks.
• but the generalization bound indicates a risk of overfitting.

AdaBoost (Freund and Schapire, 1997)

Description: coordinate descent applied to

F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} \exp\Big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big).

Guarantees: ensemble margin bound.
• but AdaBoost does not maximize the margin!
• some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006)
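A minimal sketch (my own illustration; the list of stump callables is an assumption) of AdaBoost viewed as coordinate descent on this exponential objective:

```python
import numpy as np

def adaboost(X, y, stumps, T=100):
    """AdaBoost as coordinate descent on F(alpha) = sum_i exp(-y_i f(x_i)).

    stumps: list of callables h_j with h_j(X) in {-1, +1}^m; y: labels in {-1, +1}.
    Returns the weight vector alpha over the base classifiers.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # distribution over training examples
    H = np.array([h(X) for h in stumps])    # precomputed predictions, shape (N, m)
    alpha = np.zeros(len(stumps))
    for _ in range(T):
        errors = np.array([D[pred != y].sum() for pred in H])
        j = int(np.argmin(errors))          # coordinate with smallest weighted error
        eps = np.clip(errors[j], 1e-10, 1 - 1e-10)
        step = 0.5 * np.log((1 - eps) / eps)
        alpha[j] += step
        D *= np.exp(-step * y * H[j])       # multiplicative update of the distribution
        D /= D.sum()
    return alpha
```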

Suspicions

Complexity of hypotheses used:
• arc-gv tends to use deeper decision trees to achieve a larger margin.

Notion of margin:
• minimal margin perhaps not the appropriate notion.
• margin distribution is key.

Can we shed more light on these questions?

Question

Main question: how can we design ensemble algorithms that can succeed even with very deep decision trees or other complex sets?
• theory.
• algorithms.
• experimental results.
• model selection.

Theory


Base Classifier Set H

Decomposition in terms of sub-families H_1, ..., H_p or their unions H_1, H_1 \cup H_2, ..., H_1 \cup \cdots \cup H_p.

[Figure: sub-families H_1, ..., H_p and their nested unions.]

Ensemble Family

Non-negative linear ensembles F = conv(\cup_{k=1}^{p} H_k):

f = \sum_{t=1}^{T} \alpha_t h_t  with  \alpha_t \ge 0, \; \sum_{t=1}^{T} \alpha_t \le 1, \; h_t \in H_{k_t}.

[Figure: sub-families H_1, ..., H_5.]

Ideas

Use hypotheses drawn from the H_k's with larger k, but allocate more weight to hypotheses drawn from those with smaller k.
• how can we determine quantitatively the amounts of mixture weights apportioned to different families?
• can we provide learning guarantees guiding these choices?

Learning Guarantee (Cortes, MM, and Syed, 2014)

Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in F:

R(f) \le \hat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t R_m(H_{k_t}) + O\left(\sqrt{\frac{\log p}{\rho^2 m}}\right).
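To make the role of the mixture weights concrete, a toy computation (the numbers are mine, purely illustrative) of the complexity term (4/\rho) \sum_t \alpha_t R_m(H_{k_t}):

```python
import numpy as np

# Hypothetical Rademacher complexities of the sub-families each h_t is drawn from,
# and the corresponding mixture weights alpha_t (toy values only).
rademacher = np.array([0.01, 0.01, 0.05, 0.20])   # deeper families have larger R_m
alpha = np.array([0.50, 0.30, 0.15, 0.05])        # most weight on the simpler families
rho = 0.1

complexity_term = (4.0 / rho) * np.dot(alpha, rademacher)
print(complexity_term)  # stays small because weight is concentrated on low-complexity families
```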

Consequences

Complexity term with explicit dependency on the mixture weights.
• quantitative guide for controlling the weights assigned to more complex sub-families.
• bound can be used to inspire, or directly define, an ensemble algorithm.

Algorithms


Set-Up

H_1, ..., H_p: disjoint sub-families of functions taking values in [-1, +1].

Further assumption (not necessary): symmetric sub-families, i.e. h \in H_k \Rightarrow -h \in H_k.

Notation:
• r_j = R_m(H_{k_j}) with h_j \in H_{k_j}.

Derivation

Learning bound suggests seeking \alpha \ge 0 with \sum_{t=1}^{T} \alpha_t \le 1 to minimize

\frac{1}{m} \sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t r_t.

Convex Surrogates

Let \Phi \colon u \mapsto \Phi(-u) be a decreasing convex function upper-bounding u \mapsto 1_{u \le 0}, with \Phi differentiable.

Two principal choices:
• Exponential loss: \Phi(-u) = \exp(-u).
• Logistic loss: \Phi(-u) = \log_2(1 + \exp(-u)).
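A tiny sketch (mine, for illustration) of these two surrogates as functions of the margin u = y f(x):

```python
import numpy as np

def exponential_surrogate(u):
    """Phi(-u) = exp(-u): convex upper bound on the indicator 1_{u <= 0}."""
    return np.exp(-u)

def logistic_surrogate(u):
    """Phi(-u) = log2(1 + exp(-u)): also upper-bounds 1_{u <= 0}."""
    return np.log1p(np.exp(-u)) / np.log(2.0)
```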

Optimization Problem (Cortes, MM, and Syed, 2014)

Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to:

\min_{\alpha \in \mathbb{R}^N} \; F(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N} (\lambda r_j + \beta) |\alpha_j|,

where \lambda, \beta \ge 0, and for each hypothesis, keep either h or -h.
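A sketch (my own, assuming precomputed base-classifier predictions) of evaluating this regularized objective:

```python
import numpy as np

def deepboost_objective(alpha, H, y, r, lam, beta, Phi=np.exp):
    """F(alpha) = (1/m) sum_i Phi(1 - y_i sum_j alpha_j h_j(x_i))
                  + sum_j (lam * r_j + beta) * |alpha_j|.

    H: array of shape (N, m) with H[j, i] = h_j(x_i) in [-1, +1]; y: labels in {-1, +1}.
    r: complexity values r_j = R_m(H_{k_j}) for the family of each h_j.
    Phi: convex surrogate; np.exp gives the exponential loss,
         lambda v: np.log1p(np.exp(v)) / np.log(2) the logistic loss.
    """
    margins = y * (alpha @ H)                      # y_i * f(x_i)
    data_term = np.mean(Phi(1.0 - margins))
    penalty = np.sum((lam * np.asarray(r) + beta) * np.abs(alpha))
    return data_term + penalty
```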

DeepBoost Algorithm

Coordinate descent applied to the convex objective.
• non-differentiable function.
• definition of maximum coordinate descent.

Direction & Step

Maximum direction: definition based on the error

\epsilon_{t,j} = \frac{1}{2}\Big(1 - \mathop{\mathrm{E}}_{i \sim D_t}[y_i h_j(x_i)]\Big),

where D_t is the distribution over the sample at iteration t.

Step:
• closed-form expressions for the exponential and logistic losses.
• general case: line search.
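A sketch (mine; the paper's full direction rule also takes the penalties \lambda r_j + \beta into account, so this only shows the error computation) of evaluating \epsilon_{t,j} for all base classifiers under the current distribution:

```python
import numpy as np

def weighted_errors(D, H, y):
    """epsilon_{t,j} = (1/2) * (1 - E_{i ~ D_t}[y_i h_j(x_i)]) for every j.

    D: current distribution over the m examples (nonnegative, sums to 1).
    H: array of shape (N, m) with H[j, i] = h_j(x_i); y: labels in {-1, +1}.
    """
    correlations = H @ (D * y)        # E_{i~D}[y_i h_j(x_i)] for each j
    return 0.5 * (1.0 - correlations)
```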

Pseudocode

[Pseudocode of DeepBoost: coordinate descent on F(\alpha), with per-hypothesis penalty \lambda r_j + \beta.]
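Since the pseudocode figure did not survive extraction, here is a simplified sketch of the algorithm's structure, assuming the exponential surrogate and a crude numerical line search for the step (the paper gives closed-form steps and a careful maximum coordinate-descent rule; treat this as illustrative only):

```python
import numpy as np

def deepboost(H, y, r, lam=1e-4, beta=1e-4, T=100):
    """Simplified DeepBoost-style coordinate descent on
    F(alpha) = (1/m) sum_i exp(1 - y_i (alpha @ H)_i) + sum_j (lam*r_j + beta) |alpha_j|.

    H: (N, m) matrix of base-classifier predictions in [-1, +1];
    y: labels in {-1, +1}; r: complexity estimates r_j for each h_j.
    """
    N, _ = H.shape
    r = np.asarray(r, dtype=float)
    alpha = np.zeros(N)
    eye = np.eye(N)

    def objective(a):
        margins = y * (a @ H)
        return np.mean(np.exp(1.0 - margins)) + np.sum((lam * r + beta) * np.abs(a))

    for _ in range(T):
        # Pick the coordinate whose small positive perturbation decreases F the most
        # (a crude stand-in for the paper's maximum coordinate-descent direction).
        probe = 1e-3
        trials = np.array([objective(alpha + probe * eye[j]) for j in range(N)])
        j = int(np.argmin(trials))
        # Line search over the step along coordinate j (the step may be negative).
        steps = np.linspace(-1.0, 1.0, 201)
        values = [objective(alpha + s * eye[j]) for s in steps]
        alpha[j] += steps[int(np.argmin(values))]
    return alpha
```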

Connections with Previous Work

For \lambda = \beta = 0, DeepBoost coincides with:
• AdaBoost (Freund and Schapire, 1997), run with the union of the sub-families, for the exponential loss.
• additive Logistic Regression (Friedman et al., 1998), run with the union of the sub-families, for the logistic loss.

For \lambda = 0 and \beta \ne 0, DeepBoost coincides with:
• L1-regularized AdaBoost (Raetsch, Mika, and Warmuth, 2001), for the exponential loss.
• L1-regularized Logistic Regression (Duchi and Singer, 2009), for the logistic loss.

Experiments


Rad. Complexity Estimates

Benefit of data-dependent analysis:
• empirical estimates of each R_m(H_k).
• example: for a kernel function K_k,

  \hat{R}_S(H_k) \le \frac{\sqrt{\mathrm{Tr}[K_k]}}{m}.

• alternatively, upper bounds in terms of growth functions,

  R_m(H_k) \le \sqrt{\frac{2 \log \Pi_{H_k}(m)}{m}}.
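A small sketch (mine) of computing these two estimates, given a kernel matrix and a growth-function value:

```python
import numpy as np

def kernel_rademacher_estimate(K):
    """Empirical estimate: R_S(H_k) <= sqrt(Tr[K_k]) / m, for an m x m kernel matrix K."""
    m = K.shape[0]
    return np.sqrt(np.trace(K)) / m

def growth_function_bound(growth_value, m):
    """Upper bound: R_m(H_k) <= sqrt(2 * log(Pi_{H_k}(m)) / m)."""
    return np.sqrt(2.0 * np.log(growth_value) / m)
```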

Experiments (1)

Family of base classifiers defined by boosting stumps:
• boosting stumps H_1^{stumps} (threshold functions).
• in dimension d, \Pi_{H_1^{stumps}}(m) \le 2md, thus

  R_m(H_1^{stumps}) \le \sqrt{\frac{2 \log(2md)}{m}}.

• H_2^{stumps}: decision trees of depth 2, with the same question at the internal nodes of depth 1.
• in dimension d, \Pi_{H_2^{stumps}}(m) \le \frac{(2m)^2 d(d-1)}{2}, thus

  R_m(H_2^{stumps}) \le \sqrt{\frac{2 \log(2m^2 d(d-1))}{m}}.
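For instance (the values of m and d below are toy choices of mine), these bounds can be evaluated directly:

```python
import numpy as np

def stump_rademacher_bounds(m, d):
    """Growth-function bounds for the two stump families above (requires d >= 2)."""
    r1 = np.sqrt(2.0 * np.log(2 * m * d) / m)               # H1: threshold stumps
    r2 = np.sqrt(2.0 * np.log(2 * m**2 * d * (d - 1)) / m)  # H2: depth-2 trees, same question at depth 1
    return r1, r2

print(stump_rademacher_bounds(m=1000, d=20))  # H2 is only modestly more complex than H1
```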

Experiments (1)

Base classifier set: H_1^{stumps} \cup H_2^{stumps}.

Data sets:
• same UCI Irvine data sets as (Breiman, 1999) and (Reyzin and Schapire, 2006).
• OCR data sets used by (Reyzin and Schapire, 2006): ocr17, ocr49.
• MNIST data sets: ocr17-mnist, ocr49-mnist.

Experiments with the exponential loss. Comparison with AdaBoost and AdaBoost-L1.

Experiments - Stumps Exp Loss (Cortes, MM, and Syed, 2014)

Results for boosted decision stumps and the exponential loss function. Each cell reports error (std dev) / avg tree size / avg no. of trees.

Dataset | AdaBoost H1^stumps | AdaBoost H2^stumps | AdaBoost-L1 | DeepBoost
breastcancer | 0.0429 (0.0248) / 1 / 100 | 0.0437 (0.0214) / 2 / 100 | 0.0408 (0.0223) / 1.436 / 43.6 | 0.0373 (0.0225) / 1.215 / 21.6
ocr17 | 0.0085 (0.0072) / 1 / 100 | 0.008 (0.0054) / 2 / 100 | 0.0075 (0.0068) / 1.086 / 37.8 | 0.0070 (0.0048) / 1.369 / 36.9
ionosphere | 0.1014 (0.0414) / 1 / 100 | 0.075 (0.0413) / 2 / 100 | 0.0708 (0.0331) / 1.392 / 39.35 | 0.0638 (0.0394) / 1.168 / 17.45
ocr49 | 0.0555 (0.0167) / 1 / 100 | 0.032 (0.0114) / 2 / 100 | 0.03 (0.0122) / 1.99 / 99.3 | 0.0275 (0.0095) / 1.96 / 96
german | 0.243 (0.0445) / 1 / 100 | 0.2505 (0.0487) / 2 / 100 | 0.2455 (0.0438) / 1.54 / 54.1 | 0.2395 (0.0462) / 1.76 / 76.5
ocr17-mnist | 0.0056 (0.0017) / 1 / 100 | 0.0048 (0.0014) / 2 / 100 | 0.0046 (0.0013) / 2 / 100 | 0.0040 (0.0014) / 1.99 / 100
diabetes | 0.253 (0.0330) / 1 / 100 | 0.260 (0.0518) / 2 / 100 | 0.254 (0.04868) / 1.9975 / 100 | 0.253 (0.0510) / 1.9975 / 100
ocr49-mnist | 0.0414 (0.00539) / 1 / 100 | 0.0209 (0.00521) / 2 / 100 | 0.0200 (0.00408) / 1.9975 / 100 | 0.0177 (0.00438) / 1.9975 / 100

Experiments (2)

Family of base classifiers defined by decision trees of depth k. For trees with at most n nodes:

R_m(T_n) \le \sqrt{\frac{(4n + 2) \log_2(d + 2) \log(m + 1)}{m}}.

Base classifier set: \cup_{k=1}^{K} H_k^{trees}.

Same data sets as in Experiments (1). Both exponential and logistic loss. Comparison with AdaBoost and AdaBoost-L1, Logistic Regression and L1-Logistic Regression.
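Again for illustration (the values of n, d, m are toy choices of mine), the tree complexity bound above:

```python
import numpy as np

def tree_rademacher_bound(n, d, m):
    """R_m(T_n) <= sqrt((4n + 2) * log2(d + 2) * log(m + 1) / m) for trees with at most n nodes."""
    return np.sqrt((4 * n + 2) * np.log2(d + 2) * np.log(m + 1) / m)

print(tree_rademacher_bound(n=31, d=20, m=1000))  # grows with the tree size n
```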

Experiments - Trees Exp Loss (Cortes, MM, and Syed, 2014)

Results for boosted decision trees and the exponential loss function. Each cell reports error (std dev) / avg tree size / avg no. of trees.

Dataset | AdaBoost | AdaBoost-L1 | DeepBoost
breastcancer | 0.0267 (0.00841) / 29.1 / 67.1 | 0.0264 (0.0098) / 28.9 / 51.7 | 0.0243 (0.00797) / 20.9 / 55.9
ocr17 | 0.004 (0.00316) / 15.0 / 88.3 | 0.003 (0.00100) / 30.4 / 65.3 | 0.002 (0.00100) / 26.0 / 61.8
ionosphere | 0.0661 (0.0315) / 29.8 / 75.0 | 0.0657 (0.0257) / 31.4 / 69.4 | 0.0501 (0.0316) / 26.1 / 50.0
ocr49 | 0.0180 (0.00555) / 30.9 / 92.4 | 0.0175 (0.00357) / 62.1 / 89.0 | 0.0175 (0.00510) / 30.2 / 83.0
german | 0.239 (0.0165) / 3 / 91.3 | 0.239 (0.0201) / 7 / 87.5 | 0.234 (0.0148) / 16.0 / 14.1
ocr17-mnist | 0.00471 (0.0022) / 15 / 88.7 | 0.00471 (0.0021) / 33.4 / 66.8 | 0.00409 (0.0021) / 22.1 / 59.2
diabetes | 0.249 (0.0272) / 3 / 45.2 | 0.240 (0.0313) / 3 / 28.0 | 0.230 (0.0399) / 5.37 / 19.0
ocr49-mnist | 0.0198 (0.00500) / 29.9 / 82.4 | 0.0197 (0.00512) / 66.3 / 81.1 | 0.0182 (0.00551) / 30.1 / 80.9

Experiments - Trees Log Loss (Cortes, MM, and Syed, 2014)

Results for boosted decision trees and the logistic loss function. Each cell reports error (std dev) / avg tree size / avg no. of trees.

Dataset | LogReg | LogReg-L1 | DeepBoost
breastcancer | 0.0351 (0.0101) / 15 / 65.3 | 0.0264 (0.0120) / 59.9 / 16.0 | 0.0264 (0.00876) / 14.0 / 23.8
ocr17 | 0.00300 (0.00100) / 15.0 / 75.3 | 0.00400 (0.00141) / 7 / 53.8 | 0.00250 (0.000866) / 22.1 / 25.8
ionosphere | 0.074 (0.0236) / 7 / 44.7 | 0.060 (0.0219) / 30.0 / 25.3 | 0.043 (0.0188) / 18.4 / 29.5
ocr49 | 0.0205 (0.00654) / 31.0 / 63.5 | 0.0200 (0.00245) / 31.0 / 54.0 | 0.0170 (0.00361) / 63.2 / 37.0
german | 0.233 (0.0114) / 7 / 72.8 | 0.232 (0.0123) / 7 / 66.8 | 0.225 (0.0103) / 14.4 / 67.8
ocr17-mnist | 0.00422 (0.00191) / 15 / 71.4 | 0.00417 (0.00188) / 15 / 55.6 | 0.00399 (0.00211) / 25.9 / 27.6
diabetes | 0.250 (0.0374) / 3 / 46.0 | 0.246 (0.0356) / 3 / 45.5 | 0.246 (0.0356) / 3 / 45.5
ocr49-mnist | 0.0211 (0.00412) / 28.7 / 79.3 | 0.0201 (0.00433) / 33.5 / 61.7 | 0.0201 (0.00411) / 72.8 / 41.9

Multi-Class Learning Guarantee (Kuznetsov, MM, and Syed, 2014)

Theorem: Fix \rho > 0. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all f = \sum_{t=1}^{T} \alpha_t h_t \in F:

R(f) \le \hat{R}_{S,\rho}(f) + \frac{8c}{\rho} \sum_{t=1}^{T} \alpha_t R_m(\Pi_1(H_{k_t})) + O\left(\sqrt{\frac{\log p}{\rho^2 m}}\right),

with c the number of classes and \Pi_1(H_k) = \{x \mapsto h(x, y) : y \in Y, h \in H_k\}.

Extension to Multi-Class

Similar data-dependent learning guarantee proven for the multi-class setting.
• bound depending on the mixture weights and the complexity of the sub-families.

Deep Boosting algorithm for multi-class:
• similar extension taking into account the complexities of the sub-families.
• several variants depending on the number of classes.
• different possible loss functions for each variant.

Experiments - Multi-Class

Empirical results for MDeepBoostSum, \Phi = exp. AB stands for AdaBoost. Each cell reports error (std dev) / avg tree size / avg no. of trees.

Dataset | AB.MR | AB.MR-L1 | MDeepBoost
abalone | 0.713 (0.0130) / 69.8 / 17.9 | 0.696 (0.0132) / 31.5 / 13.3 | 0.677 (0.0092) / 23.8 / 15.3
handwritten | 0.016 (0.0047) / 187.3 / 34.2 | 0.011 (0.0026) / 240.6 / 21.7 | 0.009 (0.0012) / 203.0 / 24.2
letters | 0.042 (0.0023) / 1942.6 / 24.2 | 0.036 (0.0018) / 1903.8 / 24.4 | 0.032 (0.0016) / 1914.6 / 23.3
pageblocks | 0.020 (0.0037) / 134.8 / 8.5 | 0.017 (0.0021) / 118.3 / 14.3 | 0.013 (0.0027) / 124.9 / 6.6
pendigits | 0.008 (0.0015) / 272.5 / 23.2 | 0.006 (0.0023) / 283.3 / 19.8 | 0.004 (0.0011) / 259.2 / 21.4
satimage | 0.089 (0.0062) / 557.9 / 7.6 | 0.081 (0.0040) / 478.8 / 7.3 | 0.073 (0.0045) / 535.6 / 7.6
statlog | 0.011 (0.0059) / 74.8 / 23.2 | 0.006 (0.0035) / 79.2 / 17.5 | 0.004 (0.0030) / 61.8 / 17.6
yeast | 0.388 (0.0392) / 100.6 / 8.7 | 0.376 (0.0431) / 111.7 / 6.5 | 0.352 (0.0402) / 71.4 / 7.7

Experiments - Multi-Class

Empirical results for MDeepBoostCompSum, comparison with multinomial logistic regression. Each cell reports error (std dev) / avg tree size / avg no. of trees.

Dataset | LogReg | LogReg-L1 | MDeepBoost
abalone | 0.710 (0.0170) / 162.1 / 22.2 | 0.700 (0.0102) / 156.5 / 9.8 | 0.687 (0.0104) / 28.0 / 10.2
handwritten | 0.016 (0.0031) / 237.7 / 32.3 | 0.012 (0.0020) / 186.5 / 32.8 | 0.008 (0.0024) / 153.8 / 35.9
letters | 0.043 (0.0018) / 1986.5 / 25.5 | 0.038 (0.0012) / 1759.5 / 29.0 | 0.035 (0.0012) / 1807.3 / 27.2
pageblocks | 0.019 (0.0035) / 127.4 / 4.5 | 0.016 (0.0025) / 151.7 / 6.8 | 0.012 (0.0022) / 147.9 / 7.4
pendigits | 0.009 (0.0021) / 306.3 / 21.9 | 0.007 (0.0014) / 277.1 / 20.8 | 0.005 (0.0012) / 262.7 / 19.7
satimage | 0.091 (0.0066) / 412.6 / 6.0 | 0.082 (0.0057) / 454.6 / 5.8 | 0.074 (0.0056) / 439.6 / 5.8
statlog | 0.012 (0.0054) / 74.3 / 22.3 | 0.006 (0.0020) / 71.6 / 20.6 | 0.002 (0.0022) / 65.4 / 17.5
yeast | 0.381 (0.0467) / 103.9 / 14.1 | 0.375 (0.0458) / 83.3 / 9.3 | 0.354 (0.0468) / 117.2 / 9.3

Other Related Algorithms

Structural Maxent models (Cortes, Kuznetsov, MM, and Syed, ICML 2015): feature functions chosen from a union of very complex families.

Deep Cascades (DeSalvo, MM, and Syed, ALT 2015): cascade of predictors with leaf predictors and node questions selected from very rich families.

Model Selection


Model Selection

Problem: how to select the hypothesis set H?
• H too complex: no generalization bound, overfitting.
• H too simple: generalization bound, but underfitting.
• balance between estimation and approximation errors.

[Figure: best-in-class hypothesis h*, Bayes predictor h_Bayes, and the hypothesis set H.]

Structural Risk Minimization (Vapnik and Chervonenkis, 1974; Vapnik, 1995)

SRM: H = \cup_{k=1}^{\infty} H_k with H_1 \subset H_2 \subset \cdots \subset H_k \subset \cdots

Solution: f^* = \mathrm{argmin}_{h \in H_k, k \ge 1} \; \hat{R}_S(h) + \mathrm{pen}(k, m).

[Plot: error as a function of complexity: the training error decreases, the penalty increases, and their sum (training error + penalty) is minimized at an intermediate complexity.]
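A schematic sketch (mine; scikit-learn-style fit/score estimators and a precomputed penalty list are assumptions) of the SRM selection rule:

```python
def srm_select(nested_learners, penalties, X_train, y_train):
    """Pick the model minimizing training error + pen(k, m) over nested families H_1, H_2, ...

    nested_learners: estimators of increasing capacity (e.g. trees of growing depth).
    penalties: pen(k, m) values, one per family, precomputed from a complexity measure.
    """
    best, best_value = None, float("inf")
    for learner, pen in zip(nested_learners, penalties):
        learner.fit(X_train, y_train)
        train_error = 1.0 - learner.score(X_train, y_train)
        value = train_error + pen
        if value < best_value:
            best, best_value = learner, value
    return best
```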

Voted Risk Minimization

Ideas:
• no selection of a specific H_k.
• instead, use all the H_k's: h = \sum_{k=1}^{p} \alpha_k h_k, h_k \in H_k.
• hypothesis-dependent penalty: \sum_{k=1}^{p} \alpha_k R_m(H_k).

Deep ensembles.

Conclusion

Deep Boosting: ensemble learning with increasingly complex families.
• data-dependent theoretical analysis.
• algorithm based on the learning bound.
• extension to multi-class.
• ranking and other losses.
• enhancement of many existing algorithms.
• compares favorably to AdaBoost and Logistic Regression or their L1-regularized variants in experiments.