Top-down particle filtering for Bayesian decision trees
Balaji Lakshminarayanan (Gatsby Unit, UCL), Daniel M. Roy (University of Cambridge) and Yee Whye Teh (University of Oxford)
Outline
- Introduction
- Sequential prior over decision trees
- Bayesian inference: Top-down particle filtering
- Experiments
  - Design choices in the SMC algorithm
  - SMC vs MCMC
- Conclusion
Introduction
- Input: attributes $X = \{x_i\}_{i=1}^N$, labels $Y = \{y_i\}_{i=1}^N$ (i.i.d.)
- $y_i \in \{1, \ldots, K\}$ (classification) or $y_i \in \mathbb{R}$ (regression)
- Goal: model $p(y \mid x)$
- Assume $p(y \mid x)$ is specified by a decision tree $T$
- Bayesian decision trees:
  - Posterior: $p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$
  - Prediction: $p(y_* \mid x_*) = \sum_T p(T \mid Y, X)\, p(y_* \mid x_*, T)$
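To make the prediction rule concrete, here is a minimal Python sketch (not the authors' code) of averaging leaf predictions over a weighted set of posterior tree samples; the `trees`/`weights` inputs and the tree methods `leaf_containing` and `class_probabilities` are hypothetical stand-ins for whatever representation the inference procedure returns.

```python
import numpy as np

def posterior_predictive(x_star, trees, weights, num_classes):
    """Approximate p(y* | x*) = sum_T p(T | Y, X) p(y* | x*, T) using a
    weighted set of posterior tree samples (e.g. SMC particles)."""
    p = np.zeros(num_classes)
    for tree, w in zip(trees, weights):
        leaf = tree.leaf_containing(x_star)     # hypothetical: leaf whose block contains x*
        p += w * leaf.class_probabilities()     # hypothetical: leaf predictive over classes
    return p / p.sum()
```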
Example: Classification tree
[Figure: a tree splitting on $x_1 > 0.5$ at the root and $x_2 > 0.35$ at node 1 partitions the unit square into blocks $B_0$, $B_{10}$, $B_{11}$; $\theta$: multinomial parameters at the leaf nodes.]
Example: Regression tree
[Figure: the same splits ($x_1 > 0.5$ at the root, $x_2 > 0.35$ at node 1) partition the unit square into blocks $B_0$, $B_{10}$, $B_{11}$; $\theta$: Gaussian parameters at the leaf nodes.]
Motivation
- Classic non-Bayesian induction algorithms (e.g. CART) learn a single tree top-down using greedy heuristics (post-pruning and/or bagging necessary)
- MCMC for Bayesian decision trees [Chipman et al., 1998]: local Monte Carlo modifications to the tree structure (less prone to overfitting, but slow to mix)
- Our contribution: a Sequential Monte Carlo (SMC) algorithm that approximates the posterior in a top-down manner
- Take-home message: SMC provides a better computation vs. predictive-performance tradeoff than MCMC
Bayesian decision trees: likelihood

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$

- Assume $x_n$ falls in the $j$-th leaf node of $T$
- Likelihood for the $n$-th data point: $p(y_n \mid x_n, T, \theta) = p(y_n \mid \theta_j)$, so
  $$p(Y \mid T, X, \Theta) = \prod_n p(y_n \mid x_n, T, \theta) = \prod_{j \in \text{leaves}(T)} \prod_{n \in N(j)} p(y_n \mid \theta_j)$$
- Better: integrate out $\theta_j$ and use the marginal likelihood
  $$p(Y \mid T, X) = \prod_{j \in \text{leaves}(T)} \int_{\theta_j} \prod_{n \in N(j)} p(y_n \mid \theta_j)\, p(\theta_j)\, d\theta_j$$
  - Classification: Dirichlet-Multinomial
  - Regression: Normal - Normal-Inverse-Gamma
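For the classification case the per-leaf integral has a closed form. Below is a minimal sketch assuming a symmetric Dirichlet($\alpha$) prior on the class proportions at each leaf; the function name and the symmetric-prior assumption are illustrative and not necessarily the paper's exact parameterization.

```python
import numpy as np
from scipy.special import gammaln

def leaf_log_marginal_likelihood(counts, alpha):
    """Dirichlet-Multinomial log marginal likelihood of the class counts at one
    leaf, under a symmetric Dirichlet(alpha) prior on the multinomial parameters."""
    counts = np.asarray(counts, dtype=float)
    K = counts.shape[0]
    return (gammaln(K * alpha) - gammaln(K * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# The tree-level marginal likelihood factorizes over leaves:
# log p(Y | T, X) = sum over leaves j of leaf_log_marginal_likelihood(counts_j, alpha)
```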
Bayesian decision trees: prior

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$
Partial trees
0. Start with an empty tree.
1. Choose to split the root node with feature 1 and threshold 0.5.
2. Choose to not split node 0.
3. Choose to split node 1 with feature 2 and threshold 0.35.
4. Choose to not split node 10.
5. Choose to not split node 11.
[Figures: at each step, the current partial tree and the corresponding partition of the unit square into blocks ($B$, then $B_0, B_1$, then $B_0, B_{10}, B_{11}$), with the labelled data points overlaid.]
Sequence of random variables for a tree
For the tree above ($x_1 > 0.5$ at the root, $x_2 > 0.35$ at node 1, leaves 0, 10, 11), the generative sequence is ($\rho_j$: split indicator, $\kappa_j$: split feature, $\tau_j$: threshold at node $j$):
1. $\rho = 1$, $\kappa = 1$, $\tau = 0.5$
2. $\rho_0 = 0$
3. $\rho_1 = 1$, $\kappa_1 = 2$, $\tau_1 = 0.35$
4. $\rho_{10} = 0$
5. $\rho_{11} = 0$
Sequential prior over decision trees
- Probability of split (assuming a valid split exists):
  $$p(j \text{ split}) = \alpha_s \cdot (1 + \text{depth}(j))^{-\beta_s}, \qquad \alpha_s \in (0, 1),\ \beta_s \in [0, \infty)$$
- $\kappa_j, \tau_j$ sampled uniformly from the range of valid splits
- Prior distribution:
  $$p(T, \kappa, \tau \mid X) = \prod_{j \in \text{leaves}(T)} p(j \text{ not split}) \times \prod_{j \in \text{nonleaves}(T)} p(j \text{ split})\, p(\kappa_j, \tau_j)$$
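A minimal Python sketch of this generative process; the dict-based node representation and the restriction of "valid splits" to features that are non-constant within a node's block are assumptions made for illustration.

```python
import numpy as np

def sample_tree_prior(X, alpha_s=0.95, beta_s=0.5, rng=np.random):
    """Sample (tree structure, split features kappa, thresholds tau) from the
    sequential prior, expanding nodes in a queue starting at the root."""
    root = {'depth': 0, 'idx': np.arange(X.shape[0]), 'children': None}
    queue = [root]
    while queue:
        j = queue.pop(0)
        Xj = X[j['idx']]
        # valid splits: features whose values are not all identical within the block
        valid = [d for d in range(X.shape[1]) if Xj[:, d].min() < Xj[:, d].max()]
        p_split = alpha_s * (1 + j['depth']) ** (-beta_s) if valid else 0.0
        if rng.rand() < p_split:
            kappa = valid[rng.randint(len(valid))]          # split feature, uniform over valid features
            lo, hi = Xj[:, kappa].min(), Xj[:, kappa].max()
            tau = rng.uniform(lo, hi)                       # threshold, uniform over the valid range
            left = j['idx'][X[j['idx'], kappa] <= tau]
            right = j['idx'][X[j['idx'], kappa] > tau]
            j['children'] = (
                {'depth': j['depth'] + 1, 'idx': left, 'children': None},
                {'depth': j['depth'] + 1, 'idx': right, 'children': None})
            j.update(kappa=kappa, tau=tau)
            queue.extend(j['children'])
    return root
```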
Bayesian decision trees: posterior

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$
SMC algorithm for Bayesian decision trees
- Importance sampler: draw $T^{(c)} \sim q(\cdot)$
  $$p(Y \mid X) = \sum_T p(Y, T \mid X) \approx \frac{1}{C} \sum_{c=1}^C \underbrace{\frac{p(T^{(c)})}{q(T^{(c)})}\, p(Y \mid X, T^{(c)})}_{w^{(c)}}$$
- Normalize: $\bar{w}^{(c)} = \dfrac{w^{(c)}}{\sum_{c'} w^{(c')}}$
- Approximate posterior: $p(T \mid Y, X) \approx \sum_c \bar{w}^{(c)}\, \delta(T = T^{(c)})$
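In log space the normalization and the evidence estimate look like this (a small utility sketch, not taken from the paper's code):

```python
import numpy as np

def normalize_log_weights(log_w):
    """Given unnormalized log importance weights log w^(c), return the normalized
    weights w_bar^(c) and the estimate log p(Y | X) ~= log((1/C) sum_c w^(c))."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    w = np.exp(log_w - m)                   # numerically stable exponentiation
    log_evidence = m + np.log(w.mean())     # log of the average unnormalized weight
    return w / w.sum(), log_evidence
```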
SMC algorithm for Bayesian decision trees (contd.)
- Sequential importance sampler (SIS):
  $$p(T_n) = p(T_0) \prod_{n'=1}^{n} p(T_{n'} \mid T_{n'-1}), \qquad q(T_n) = q_0(T_0) \prod_{n'=1}^{n} q_{n'}(T_{n'} \mid T_{n'-1})$$
  $$p(Y \mid X, T_n) = p(Y \mid X, T_0)\, \frac{p(Y \mid X, T_1)}{p(Y \mid X, T_0)} \cdots \frac{p(Y \mid X, T_n)}{p(Y \mid X, T_{n-1})}$$
  so the importance weight telescopes:
  $$w = \frac{1}{C}\, \frac{p(T_n)}{q(T_n)}\, p(Y \mid X, T_n) = w_0 \prod_{n'=1}^{n} \frac{p(T_{n'} \mid T_{n'-1})}{q_{n'}(T_{n'} \mid T_{n'-1})} \underbrace{\frac{p(Y \mid X, T_{n'})}{p(Y \mid X, T_{n'-1})}}_{\text{local likelihood}}$$
- Sequential Monte Carlo (SMC): SIS + adaptive resampling steps
- Every node is processed just once: no multi-path issues
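Putting the pieces together, here is a schematic Python sketch of the top-down particle filter: each particle expands one unexpanded node per step, its weight is multiplied by the prior/proposal ratio and the local likelihood ratio, and particles are resampled when the effective sample size drops. The particle API (`new_partial_tree`, `has_unexpanded_nodes`, `propose_node` and the fields of `move`) is hypothetical and stands in for the details in the paper.

```python
import numpy as np

def top_down_smc(X, Y, num_particles, new_partial_tree, propose_node,
                 ess_threshold=0.5, rng=np.random):
    """Schematic top-down SMC for Bayesian decision trees. `new_partial_tree`
    creates a particle whose only node is the root; `propose_node` expands that
    particle's next unexpanded node and returns an object with log_prior,
    log_proposal and log_local_lik_ratio for the sampled move."""
    particles = [new_partial_tree(X, Y) for _ in range(num_particles)]
    log_w = np.zeros(num_particles)
    while any(p.has_unexpanded_nodes() for p in particles):
        for c, p in enumerate(particles):
            if not p.has_unexpanded_nodes():
                continue                     # this particle's tree is already complete
            move = propose_node(p, rng)      # sample split / no-split for the next node
            # Incremental weight: prior/proposal ratio times the local likelihood
            # ratio p(Y | X, T_n) / p(Y | X, T_{n-1}), which only involves the
            # expanded node and (if it splits) its two children.
            log_w[c] += move.log_prior - move.log_proposal + move.log_local_lik_ratio
        # Adaptive resampling: resample when the effective sample size is small.
        shifted = np.exp(log_w - log_w.max())
        w_bar = shifted / shifted.sum()
        if 1.0 / np.sum(w_bar ** 2) < ess_threshold * num_particles:
            idx = rng.choice(num_particles, size=num_particles, p=w_bar)
            particles = [particles[i].copy() for i in idx]
            log_w[:] = 0.0                   # weights reset to uniform after resampling
    shifted = np.exp(log_w - log_w.max())
    return particles, shifted / shifted.sum()   # weighted approximation of p(T | Y, X)
```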
Experimental setup
- Datasets:
  - magic-04: N = 19K, D = 10, K = 2
  - pendigits: N = 11K, D = 16, K = 10
- 70%-30% train-test split
- Numbers averaged across 10 different initializations
SMC design choices
- Proposals:
  - prior proposal: $q_n(\rho_j, \kappa_j, \tau_j) = p(\rho_j, \kappa_j, \tau_j)$
  - optimal proposal:
    $$q_n(\rho_j = \text{stop}) \propto p(j \text{ not split})\, p(Y_{N(j)} \mid X_{N(j)})$$
    $$q_n(\rho_j = \text{split}, \kappa_j, \tau_j) \propto p(j \text{ split})\, p(\kappa_j, \tau_j) \times \underbrace{p(Y_{N(j0)} \mid X_{N(j0)})}_{\text{left child}}\, \underbrace{p(Y_{N(j1)} \mid X_{N(j1)})}_{\text{right child}}$$
- Set of nodes considered for expansion at iteration n:
  - node-wise: next node
  - layer-wise: all nodes at depth n
- Multinomial resampling
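For intuition, a sketch of the one-step-optimal proposal at a single node: it weighs "stop" against each candidate split by the resulting marginal likelihoods. Because the marginal likelihood is piecewise constant in the threshold, this sketch assumes candidate (feature, threshold) pairs are enumerated from the data in the node's block; `node.candidate_splits()` and `log_ml` are hypothetical helpers, and the paper's exact treatment of the continuous threshold may differ.

```python
import numpy as np
from scipy.special import logsumexp

def optimal_proposal_at_node(node, p_split, log_ml, rng=np.random):
    """Sample the expansion of one node from the one-step-optimal proposal.
    `log_ml(labels)` returns the log marginal likelihood of a block of labels;
    `node.candidate_splits()` yields (kappa, tau, log p(kappa, tau), Y_left, Y_right)."""
    # q(stop) is proportional to p(j not split) * p(Y_N(j) | X_N(j)).
    options = [('stop', None, None, np.log1p(-p_split) + log_ml(node.Y))]
    # q(split, kappa, tau) is proportional to
    # p(j split) * p(kappa, tau) * p(Y_left | X_left) * p(Y_right | X_right).
    for kappa, tau, log_p_kt, Y_left, Y_right in node.candidate_splits():
        options.append(('split', kappa, tau,
                        np.log(p_split) + log_p_kt + log_ml(Y_left) + log_ml(Y_right)))
    log_q = np.array([opt[3] for opt in options])
    q = np.exp(log_q - logsumexp(log_q))
    q /= q.sum()                             # guard against floating-point drift
    choice = options[rng.choice(len(options), p=q)]
    return choice, log_q                     # chosen move and its log proposal scores
```

Under the prior proposal, by contrast, $q_n = p$, so the prior/proposal ratio in the SMC weight cancels and the incremental weight reduces to the local likelihood ratio.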
Effect of SMC design choices
[Figure: test log p(Y|X) on the magic-04 dataset vs. mean runtime (s) and vs. number of particles, comparing SMC optimal [node], SMC prior [node], SMC optimal [layer] and SMC prior [layer].]
Effect of irrelevant features on SMC design choices
madelon: N = 2.6K, D = 500, K = 2 (96% of the features are irrelevant)
[Figure: test log p(Y|X) on the madelon dataset vs. mean runtime (s) and vs. number of particles, comparing SMC optimal [node] and SMC prior [node].]
Predictive performance vs computation: SMC vs MCMC
- Fix hyperparameters $\alpha = 5$, $\alpha_s = 0.95$, $\beta_s = 0.5$
- MCMC [Chipman et al., 1998]: each move applies one of 4 proposals:
  - grow
  - prune
  - change
  - swap
- MCMC averages predictions over all previous trees
- Vary the number of particles in SMC and the number of MCMC iterations, and compare runtime vs. performance
[Figure: test log p(Y|X) and test accuracy on the magic-04 dataset vs. mean runtime (s), comparing SMC optimal [node], SMC prior [node], Chipman-MCMC, CART (gini) and CART (entropy).]
Take-home message
SMC (prior, node-wise) is at least an order of magnitude faster than MCMC
Conclusion
- SMC for fast Bayesian inference in decision trees:
  - mimic the top-down generative process of decision trees
  - use 'local' likelihoods + resampling steps to guide tree growth
  - for a fixed computational budget, SMC outperforms MCMC
- Future directions:
  - Particle MCMC for Bayesian Additive Regression Trees
  - Mondrian process prior: a projective and exchangeable prior for decision trees [Roy and Teh, 2009]
Thank you!
Code available at http://www.gatsby.ucl.ac.uk/~balaji
References
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, pages 935-948.
Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Advances in Neural Information Processing Systems, volume 21, pages 1377-1384.