Top-down particle filtering for Bayesian decision trees
Balaji Lakshminarayanan (Gatsby Unit, UCL), Daniel M. Roy (University of Cambridge) and Yee Whye Teh (University of Oxford)
Outline
- Introduction
- Sequential prior over decision trees
- Bayesian inference: Top-down particle filtering
- Experiments
  - Design choices in the SMC algorithm
  - SMC vs MCMC
- Conclusion
Introduction
- Input: attributes $X = \{x_i\}_{i=1}^N$, labels $Y = \{y_i\}_{i=1}^N$ (i.i.d.)
- $y_i \in \{1, \ldots, K\}$ (classification) or $y_i \in \mathbb{R}$ (regression)
- Goal: model $p(y \mid x)$
- Assume $p(y \mid x)$ is specified by a decision tree $T$
- Bayesian decision trees:
  - Posterior: $p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$
  - Prediction: $p(y_* \mid x_*) = \sum_T p(T \mid Y, X)\, p(y_* \mid x_*, T)$
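To make the prediction rule concrete, here is a minimal Python sketch (not the authors' code) of averaging leaf predictions over a weighted set of posterior tree samples; the `trees`/`weights` inputs and the tree methods `leaf_containing` and `class_probabilities` are hypothetical stand-ins for whatever representation the inference procedure returns.

```python
import numpy as np

def posterior_predictive(x_star, trees, weights, num_classes):
    """Approximate p(y* | x*) = sum_T p(T | Y, X) p(y* | x*, T) using a
    weighted set of posterior tree samples (e.g. SMC particles)."""
    p = np.zeros(num_classes)
    for tree, w in zip(trees, weights):
        leaf = tree.leaf_containing(x_star)     # hypothetical: leaf whose block contains x*
        p += w * leaf.class_probabilities()     # hypothetical: leaf predictive over classes
    return p / p.sum()
```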
Example: Classification tree
[Figure: a tree splitting on $x_1 > 0.5$ at the root and $x_2 > 0.35$ at node 1 partitions the unit square into blocks $B_0$, $B_{10}$, $B_{11}$; $\theta$: multinomial parameters at the leaf nodes.]
Example: Regression tree
[Figure: the same splits ($x_1 > 0.5$ at the root, $x_2 > 0.35$ at node 1) partition the unit square into blocks $B_0$, $B_{10}$, $B_{11}$; $\theta$: Gaussian parameters at the leaf nodes.]
Motivation
- Classic non-Bayesian induction algorithms (e.g. CART) learn a single tree top-down using greedy heuristics (post-pruning and/or bagging necessary)
- MCMC for Bayesian decision trees [Chipman et al., 1998]: local Monte Carlo modifications to the tree structure (less prone to overfitting, but slow to mix)
- Our contribution: a Sequential Monte Carlo (SMC) algorithm that approximates the posterior in a top-down manner
- Take-home message: SMC provides a better computation vs. predictive-performance tradeoff than MCMC
Bayesian decision trees: likelihood

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$

- Assume $x_n$ falls in the $j$-th leaf node of $T$
- Likelihood for the $n$-th data point: $p(y_n \mid x_n, T, \theta) = p(y_n \mid \theta_j)$, so
  $$p(Y \mid T, X, \Theta) = \prod_n p(y_n \mid x_n, T, \theta) = \prod_{j \in \text{leaves}(T)} \prod_{n \in N(j)} p(y_n \mid \theta_j)$$
- Better: integrate out $\theta_j$ and use the marginal likelihood
  $$p(Y \mid T, X) = \prod_{j \in \text{leaves}(T)} \int_{\theta_j} \prod_{n \in N(j)} p(y_n \mid \theta_j)\, p(\theta_j)\, d\theta_j$$
  - Classification: Dirichlet-Multinomial
  - Regression: Normal - Normal-Inverse-Gamma
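For the classification case the per-leaf integral has a closed form. Below is a minimal sketch assuming a symmetric Dirichlet($\alpha$) prior on the class proportions at each leaf; the function name and the symmetric-prior assumption are illustrative and not necessarily the paper's exact parameterization.

```python
import numpy as np
from scipy.special import gammaln

def leaf_log_marginal_likelihood(counts, alpha):
    """Dirichlet-Multinomial log marginal likelihood of the class counts at one
    leaf, under a symmetric Dirichlet(alpha) prior on the multinomial parameters."""
    counts = np.asarray(counts, dtype=float)
    K = counts.shape[0]
    return (gammaln(K * alpha) - gammaln(K * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# The tree-level marginal likelihood factorizes over leaves:
# log p(Y | T, X) = sum over leaves j of leaf_log_marginal_likelihood(counts_j, alpha)
```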
Bayesian decision trees: prior

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$
Partial trees
0. Start with an empty tree.
1. Choose to split the root node with feature 1 and threshold 0.5.
2. Choose to not split node 0.
3. Choose to split node 1 with feature 2 and threshold 0.35.
4. Choose to not split node 10.
5. Choose to not split node 11.
[Figures: at each step, the current partial tree and the corresponding partition of the unit square into blocks ($B$, then $B_0, B_1$, then $B_0, B_{10}, B_{11}$), with the labelled data points overlaid.]
Sequence of random variables for a tree
For the tree above ($x_1 > 0.5$ at the root, $x_2 > 0.35$ at node 1, leaves 0, 10, 11), the generative sequence is ($\rho_j$: split indicator, $\kappa_j$: split feature, $\tau_j$: threshold at node $j$):
1. $\rho = 1$, $\kappa = 1$, $\tau = 0.5$
2. $\rho_0 = 0$
3. $\rho_1 = 1$, $\kappa_1 = 2$, $\tau_1 = 0.35$
4. $\rho_{10} = 0$
5. $\rho_{11} = 0$
Sequential prior over decision trees
- Probability of split (assuming a valid split exists):
  $$p(j \text{ split}) = \alpha_s \cdot (1 + \text{depth}(j))^{-\beta_s}, \qquad \alpha_s \in (0, 1),\ \beta_s \in [0, \infty)$$
- $\kappa_j, \tau_j$ sampled uniformly from the range of valid splits
- Prior distribution:
  $$p(T, \kappa, \tau \mid X) = \prod_{j \in \text{leaves}(T)} p(j \text{ not split}) \times \prod_{j \in \text{nonleaves}(T)} p(j \text{ split})\, p(\kappa_j, \tau_j)$$
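A minimal Python sketch of this generative process; the dict-based node representation and the restriction of "valid splits" to features that are non-constant within a node's block are assumptions made for illustration.

```python
import numpy as np

def sample_tree_prior(X, alpha_s=0.95, beta_s=0.5, rng=np.random):
    """Sample (tree structure, split features kappa, thresholds tau) from the
    sequential prior, expanding nodes in a queue starting at the root."""
    root = {'depth': 0, 'idx': np.arange(X.shape[0]), 'children': None}
    queue = [root]
    while queue:
        j = queue.pop(0)
        Xj = X[j['idx']]
        # valid splits: features whose values are not all identical within the block
        valid = [d for d in range(X.shape[1]) if Xj[:, d].min() < Xj[:, d].max()]
        p_split = alpha_s * (1 + j['depth']) ** (-beta_s) if valid else 0.0
        if rng.rand() < p_split:
            kappa = valid[rng.randint(len(valid))]          # split feature, uniform over valid features
            lo, hi = Xj[:, kappa].min(), Xj[:, kappa].max()
            tau = rng.uniform(lo, hi)                       # threshold, uniform over the valid range
            left = j['idx'][X[j['idx'], kappa] <= tau]
            right = j['idx'][X[j['idx'], kappa] > tau]
            j['children'] = (
                {'depth': j['depth'] + 1, 'idx': left, 'children': None},
                {'depth': j['depth'] + 1, 'idx': right, 'children': None})
            j.update(kappa=kappa, tau=tau)
            queue.extend(j['children'])
    return root
```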
Bayesian decision trees: posterior

$$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$$
SMC algorithm for Bayesian decision trees
- Importance sampler: draw $T^{(c)} \sim q(\cdot)$
  $$p(Y \mid X) = \sum_T p(Y, T \mid X) \approx \frac{1}{C} \sum_{c=1}^C \underbrace{\frac{p(T^{(c)})}{q(T^{(c)})}\, p(Y \mid X, T^{(c)})}_{w^{(c)}}$$
- Normalize: $\bar{w}^{(c)} = \dfrac{w^{(c)}}{\sum_{c'} w^{(c')}}$
- Approximate posterior: $p(T \mid Y, X) \approx \sum_c \bar{w}^{(c)}\, \delta(T = T^{(c)})$
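In log space the normalization and the evidence estimate look like this (a small utility sketch, not taken from the paper's code):

```python
import numpy as np

def normalize_log_weights(log_w):
    """Given unnormalized log importance weights log w^(c), return the normalized
    weights w_bar^(c) and the estimate log p(Y | X) ~= log((1/C) sum_c w^(c))."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    w = np.exp(log_w - m)                   # numerically stable exponentiation
    log_evidence = m + np.log(w.mean())     # log of the average unnormalized weight
    return w / w.sum(), log_evidence
```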
SMC algorithm for Bayesian decision trees (contd.)
- Sequential importance sampler (SIS):
  $$p(T_n) = p(T_0) \prod_{n'=1}^{n} p(T_{n'} \mid T_{n'-1}), \qquad q(T_n) = q_0(T_0) \prod_{n'=1}^{n} q_{n'}(T_{n'} \mid T_{n'-1})$$
  $$p(Y \mid X, T_n) = p(Y \mid X, T_0)\, \frac{p(Y \mid X, T_1)}{p(Y \mid X, T_0)} \cdots \frac{p(Y \mid X, T_n)}{p(Y \mid X, T_{n-1})}$$
  so the importance weight telescopes:
  $$w = \frac{1}{C}\, \frac{p(T_n)}{q(T_n)}\, p(Y \mid X, T_n) = w_0 \prod_{n'=1}^{n} \frac{p(T_{n'} \mid T_{n'-1})}{q_{n'}(T_{n'} \mid T_{n'-1})} \underbrace{\frac{p(Y \mid X, T_{n'})}{p(Y \mid X, T_{n'-1})}}_{\text{local likelihood}}$$
- Sequential Monte Carlo (SMC): SIS + adaptive resampling steps
- Every node is processed just once: no multi-path issues
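Putting the pieces together, here is a schematic Python sketch of the top-down particle filter: each particle expands one unexpanded node per step, its weight is multiplied by the prior/proposal ratio and the local likelihood ratio, and particles are resampled when the effective sample size drops. The particle API (`new_partial_tree`, `has_unexpanded_nodes`, `propose_node` and the fields of `move`) is hypothetical and stands in for the details in the paper.

```python
import numpy as np

def top_down_smc(X, Y, num_particles, new_partial_tree, propose_node,
                 ess_threshold=0.5, rng=np.random):
    """Schematic top-down SMC for Bayesian decision trees. `new_partial_tree`
    creates a particle whose only node is the root; `propose_node` expands that
    particle's next unexpanded node and returns an object with log_prior,
    log_proposal and log_local_lik_ratio for the sampled move."""
    particles = [new_partial_tree(X, Y) for _ in range(num_particles)]
    log_w = np.zeros(num_particles)
    while any(p.has_unexpanded_nodes() for p in particles):
        for c, p in enumerate(particles):
            if not p.has_unexpanded_nodes():
                continue                     # this particle's tree is already complete
            move = propose_node(p, rng)      # sample split / no-split for the next node
            # Incremental weight: prior/proposal ratio times the local likelihood
            # ratio p(Y | X, T_n) / p(Y | X, T_{n-1}), which only involves the
            # expanded node and (if it splits) its two children.
            log_w[c] += move.log_prior - move.log_proposal + move.log_local_lik_ratio
        # Adaptive resampling: resample when the effective sample size is small.
        shifted = np.exp(log_w - log_w.max())
        w_bar = shifted / shifted.sum()
        if 1.0 / np.sum(w_bar ** 2) < ess_threshold * num_particles:
            idx = rng.choice(num_particles, size=num_particles, p=w_bar)
            particles = [particles[i].copy() for i in idx]
            log_w[:] = 0.0                   # weights reset to uniform after resampling
    shifted = np.exp(log_w - log_w.max())
    return particles, shifted / shifted.sum()   # weighted approximation of p(T | Y, X)
```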
Experimental setup
- Datasets:
  - magic-04: N = 19K, D = 10, K = 2
  - pendigits: N = 11K, D = 16, K = 10
- 70%-30% train-test split
- Numbers averaged across 10 different initializations
SMC design choices
- Proposals:
  - prior proposal: $q_n(\rho_j, \kappa_j, \tau_j) = p(\rho_j, \kappa_j, \tau_j)$
  - optimal proposal:
    $$q_n(\rho_j = \text{stop}) \propto p(j \text{ not split})\, p(Y_{N(j)} \mid X_{N(j)})$$
    $$q_n(\rho_j = \text{split}, \kappa_j, \tau_j) \propto p(j \text{ split})\, p(\kappa_j, \tau_j) \times \underbrace{p(Y_{N(j0)} \mid X_{N(j0)})}_{\text{left child}}\, \underbrace{p(Y_{N(j1)} \mid X_{N(j1)})}_{\text{right child}}$$
- Set of nodes considered for expansion at iteration n:
  - node-wise: next node
  - layer-wise: all nodes at depth n
- Multinomial resampling
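For intuition, a sketch of the one-step-optimal proposal at a single node: it weighs "stop" against each candidate split by the resulting marginal likelihoods. Because the marginal likelihood is piecewise constant in the threshold, this sketch assumes candidate (feature, threshold) pairs are enumerated from the data in the node's block; `node.candidate_splits()` and `log_ml` are hypothetical helpers, and the paper's exact treatment of the continuous threshold may differ.

```python
import numpy as np
from scipy.special import logsumexp

def optimal_proposal_at_node(node, p_split, log_ml, rng=np.random):
    """Sample the expansion of one node from the one-step-optimal proposal.
    `log_ml(labels)` returns the log marginal likelihood of a block of labels;
    `node.candidate_splits()` yields (kappa, tau, log p(kappa, tau), Y_left, Y_right)."""
    # q(stop) is proportional to p(j not split) * p(Y_N(j) | X_N(j)).
    options = [('stop', None, None, np.log1p(-p_split) + log_ml(node.Y))]
    # q(split, kappa, tau) is proportional to
    # p(j split) * p(kappa, tau) * p(Y_left | X_left) * p(Y_right | X_right).
    for kappa, tau, log_p_kt, Y_left, Y_right in node.candidate_splits():
        options.append(('split', kappa, tau,
                        np.log(p_split) + log_p_kt + log_ml(Y_left) + log_ml(Y_right)))
    log_q = np.array([opt[3] for opt in options])
    q = np.exp(log_q - logsumexp(log_q))
    q /= q.sum()                             # guard against floating-point drift
    choice = options[rng.choice(len(options), p=q)]
    return choice, log_q                     # chosen move and its log proposal scores
```

Under the prior proposal, by contrast, $q_n = p$, so the prior/proposal ratio in the SMC weight cancels and the incremental weight reduces to the local likelihood ratio.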
Effect of SMC design choices
[Figure: test log p(Y|X) on the magic-04 dataset vs. mean runtime (s) and vs. number of particles, comparing SMC optimal [node], SMC prior [node], SMC optimal [layer] and SMC prior [layer].]
Effect of irrelevant features on SMC design choices
madelon: N = 2.6K, D = 500, K = 2 (96% of the features are irrelevant)
[Figure: test log p(Y|X) on the madelon dataset vs. mean runtime (s) and vs. number of particles, comparing SMC optimal [node] and SMC prior [node].]
Predictive performance vs computation: SMC vs MCMC
- Fix hyperparameters $\alpha = 5$, $\alpha_s = 0.95$, $\beta_s = 0.5$
- MCMC [Chipman et al., 1998]: each move applies one of 4 proposals:
  - grow
  - prune
  - change
  - swap
- MCMC averages predictions over all previous trees
- Vary the number of particles in SMC and the number of MCMC iterations, and compare runtime vs. performance
[Figure: test log p(Y|X) and test accuracy on the magic-04 dataset vs. mean runtime (s), comparing SMC optimal [node], SMC prior [node], Chipman-MCMC, CART (gini) and CART (entropy).]
Take-home message
SMC (prior, node-wise) is at least an order of magnitude faster than MCMC
Conclusion
- SMC for fast Bayesian inference in decision trees:
  - mimic the top-down generative process of decision trees
  - use 'local' likelihoods + resampling steps to guide tree growth
  - for a fixed computational budget, SMC outperforms MCMC
- Future directions:
  - Particle MCMC for Bayesian Additive Regression Trees
  - Mondrian process prior: a projective and exchangeable prior for decision trees [Roy and Teh, 2009]
Thank you!
Code available at http://www.gatsby.ucl.ac.uk/~balaji
References
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, pages 935-948.
Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Advances in Neural Information Processing Systems, volume 21, pages 1377-1384.