Top-down particle filtering for Bayesian Decision Trees
Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh
Review by: David Carlson
Overview
- Decision tree learning is a popular approach for both classification and regression tasks.
- Bayesian formulations of these problems have been shown to give competitive performance, but are slow.
- This paper introduces a new, top-down sequential Monte Carlo (SMC) sampling scheme that gives the same performance as the MCMC schemes but is at least an order of magnitude faster.
Introduction
As a general framework, let $X = \{x_n\}_{n=1}^N$, $x_n \in \mathbb{R}^d$, be a set of input vectors, and let $Y = \{y_n\}_{n=1}^N$ be their corresponding labels.
The goal is to predict $y_n$ from its corresponding input vector $x_n$; in this paper the labels are single categories, $y_n \in \{1, \ldots, K\}$. We represent a binary decision tree that generates the data as a tuple $\mathcal{T} = (T, \kappa, \tau)$.
Some notation

A rooted, strictly binary tree T is a finite tree with a single root, designated by the empty string ε. The leaves of the tree are designated by ∂T. All internal nodes (i.e. all nodes that are not leaves, T \ ∂T) are split into a left child and a right child. For an internal node p, these are designated p0 and p1, and the children split the block B(p) into two blocks B(p0) and B(p1). Each internal node splits on a single dimension κ(p) at a cut location τ(p) along that dimension. The children blocks are then:
$$
B(p0) = B(p) \cap \{z \in \mathbb{R}^d : z_{\kappa(p)} \le \tau(p)\}, \qquad
B(p1) = B(p) \cap \{z \in \mathbb{R}^d : z_{\kappa(p)} > \tau(p)\}
\tag{1}
$$
Let N(p) be the set of data indices in node p, and let $X_{N(p)}$ and $Y_{N(p)}$ be the data points in block B(p).
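To make the notation concrete, here is a tiny Python sketch (my own illustration; the helper name `split_block` is not from the paper) that partitions the data indices N(p) of a block into N(p0) and N(p1) given a cut (κ(p), τ(p)):

```python
import numpy as np

def split_block(X, indices, kappa_p, tau_p):
    """Partition the data indices N(p) of block B(p) into N(p0) and N(p1)."""
    values = X[indices, kappa_p]
    left = indices[values <= tau_p]    # N(p0): z_{kappa(p)} <= tau(p)
    right = indices[values > tau_p]    # N(p1): z_{kappa(p)} >  tau(p)
    return left, right
```

For example, `split_block(X, np.arange(len(X)), 0, 0.5)` splits the root block along dimension 1 at 0.5.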
Decision Trees
Figure 1. A decision tree T = (T, κ, τ) represents a hierarchical partitioning of a space. Here, the space is the unit square and the tree T contains the nodes {ε, 0, 1, 10, 11}. The root node ε represents the whole space B(ε) = R^D, while its two children 0 and 1 represent the two halves of the cut (κ(ε), τ(ε)) = (1, 0.5), where κ(ε) = 1 represents the dimension of the cut and τ(ε) = 0.5 represents the location of the cut along that dimension. (The origin is at the bottom left of each figure, and the x-axis is dimension 1. The red stars and blue circles represent observed data points.) The second cut, (κ(1), τ(1)) = (2, 0.35), splits the block B(1) into the two halves B(11) and B(10). When defining the prior over decision trees given by Chipman et al. (1998), it will be necessary to refer to the "extent" of the data in a block. E.g., I^0_1 and I^0_2 are the extents of the data in dimensions 1 and 2, respectively, in block B(0). For each node p, the set D^p contains those dimensions with non-trivial extent. Here, D^0 = {1, 2}, but D^10 = {2}, because there is no variation in dimension 1.
The prior generates a decision tree in stages, beginning with the trivial tree T0 = {ε} containing only the root node. At each stage i, Ti is produced from Ti−1 by choosing one leaf in Ti−1 and either growing two children nodes or stopping the leaf. Once stopped, a leaf is ineligible for future growth. The identity of the chosen leaf is deterministic, while the choice to grow or stop is stochastic. The process proceeds until all leaves are stopped. For larger αs and smaller βs the typical trees are larger, and the deeper p is in the tree the less likely it is to be cut. If p is cut, the dimension κ(p) and then the location τ(p) of the cut are sampled uniformly from D^p and I^p_{κ(p)}, respectively. Note that the support of the distribution over cut dimensions and locations is chosen such that both children of p will, with probability one, contain at least one input vector.
Generating a Decision Tree

In a decision tree, we start at a single root node, designated by the empty string ε, and then for each node p in the tree we:
1. If the features of the data in node p do not vary, deterministically set p to be a leaf node.
2. Otherwise, split the node with probability αs / (1 + |p|)^βs; if not split, the node becomes a leaf.
3. If split, choose a feature dimension κ(p) ∈ D^p uniformly at random, where D^p denotes the set of features that vary within node p.
4. Choose a cut value τ(p) uniformly at random from I^p_{κ(p)}, the range from the smallest to the largest value in dimension κ(p) among the data in node p, X_{N(p)}.
5. At the leaves, class labels are drawn from a Dirichlet-multinomial model (a sketch of this generative process is given below).
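The following is a minimal Python sketch of this generative process (my own illustration, not the authors' code; the Dirichlet-multinomial emission of labels at the leaves is omitted):

```python
import numpy as np

def sample_tree(X, indices, depth=0, alpha_s=0.95, beta_s=0.5, rng=None):
    """Recursively sample a decision tree from the prior described above.

    Returns a nested dict: a leaf {'leaf': indices} or an internal node
    {'kappa': dim, 'tau': cut, 'left': subtree, 'right': subtree}.
    """
    rng = rng if rng is not None else np.random.default_rng()
    X_p = X[indices]
    # D^p: dimensions whose values actually vary within this node.
    varying = [d for d in range(X.shape[1]) if X_p[:, d].min() < X_p[:, d].max()]
    # Step 1: if no feature varies, the node is deterministically a leaf.
    # Step 2: otherwise split with probability alpha_s / (1 + |p|)^beta_s.
    if not varying or rng.random() >= alpha_s / (1.0 + depth) ** beta_s:
        return {'leaf': indices}
    # Step 3: cut dimension kappa(p), uniform over D^p.
    kappa = rng.choice(varying)
    # Step 4: cut location tau(p), uniform over the extent I^p_{kappa(p)}.
    tau = rng.uniform(X_p[:, kappa].min(), X_p[:, kappa].max())
    left = indices[X[indices, kappa] <= tau]
    right = indices[X[indices, kappa] > tau]
    return {'kappa': kappa, 'tau': tau,
            'left': sample_tree(X, left, depth + 1, alpha_s, beta_s, rng),
            'right': sample_tree(X, right, depth + 1, alpha_s, beta_s, rng)}
```

For example, `sample_tree(X, np.arange(len(X)))` draws one tree from the prior over the full dataset; because τ(p) lies strictly inside the extent, both children contain at least one data point with probability one.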
Conditional Distributions

Given our generative model, we can write the conditional distribution of the tree given the input vectors as:
$$
h(T, \kappa, \tau \mid X) \;=\; \prod_{p \in \partial T} \left( 1 - \frac{\mathbf{1}(|D^p| > 0)\,\alpha_s}{(1 + |p|)^{\beta_s}} \right) \times \prod_{p \in T \setminus \partial T} \frac{\alpha_s}{(1 + |p|)^{\beta_s}} \, \frac{1}{|D^p|\,|I^p_{\kappa(p)}|}
\tag{2}
$$
The data likelihood term for a Dirichlet-multinomial model (with a symmetric Dirichlet(α/K) prior) for each node p is given by:
$$
l(Y_{N(p)} \mid X_{N(p)}) \;=\; \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \, \frac{\prod_{k=1}^K \Gamma(m_{pk} + \alpha/K)}{\Gamma\!\left(\sum_{k=1}^K m_{pk} + \alpha\right)}
\tag{3}
$$
where $m_{pk}$ is the number of data points in node p with label k.
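As a quick illustration (my own sketch, not the authors' code), the log of Eq. (3) for a single node can be computed from the label counts:

```python
import numpy as np
from scipy.special import gammaln

def node_log_marginal_likelihood(labels, K, alpha=5.0):
    """Log Dirichlet-multinomial marginal likelihood (Eq. 3) for one node.

    labels : 1-D integer array of class labels (0..K-1) for the data in node p
    K      : number of classes
    alpha  : total Dirichlet concentration (symmetric alpha/K per class)
    """
    m = np.bincount(labels, minlength=K).astype(float)   # counts m_pk
    return (gammaln(alpha) - K * gammaln(alpha / K)
            + gammaln(m + alpha / K).sum()
            - gammaln(m.sum() + alpha))
```

Under the model as described here, the overall likelihood g(Y | T, X) used in the weight updates below is the product of this per-node term over the leaves ∂T.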
Sequential Monte Carlo
MCMC methods for fitting this model were proposed in the 1990s (see Buntine (1992) and Chipman et al. (1998) for examples). The contribution of this paper is a "top-down particle filtering" approach to fitting the model, which the authors claim gives roughly a 10x speedup. To understand how this method works, consider just a single path at first. Let $(T_i, \kappa_i, \tau_i)$ be the state of the tree after i steps of the generation process, and let $E_i$ denote the set of unstopped leaves (i.e. nodes that have been neither stopped nor split). At each stage i, we need to choose a set of candidate leaves $C_i \subseteq E_{i-1}$ to consider for expansion.
Sampling the next stage of the tree
Given our current truncated tree $T_{i-1}$, we want to sample the next stage of the tree, $T_i \mid T_{i-1}$. First, we choose which nodes to consider for expansion. The authors consider two choices: $C_i = E_{i-1}$, where every available node is considered for expansion (layer-wise), or $C_i$ set to the oldest available node, i.e. the one highest in the tree (node-wise). We also have to choose a proposal kernel $Q(T_i \mid T_{i-1})$.
Proposal Kernel

The first proposal kernel considered is the simple case where we use the generative prior, called "SMC Prior" in the experiments:
$$
Q(T_i \mid T_{i-1}) = P(T_i \mid T_{i-1})
\tag{4}
$$
The second proposal kernel uses the posterior probability of the tree, treating the current tree as if it were complete:
$$
Q(T_i \mid T_{i-1}) = P_{Y_i}(T_i \mid T_{i-1}),
\tag{5}
$$
$$
P_{Y_i}(T_i \mid T_{i-1}) \;\propto\; g(Y \mid T_i, X)\, P(T_i \mid T_{i-1}).
\tag{6}
$$
This is called the one-step optimal procedure, because if the tree generation ended after a single further step this would be the optimal proposal. Note that we can consider multiple candidate nodes at the same time because they are independent (ρ denotes a split event):
$$
Q_i(T_i \mid T_{i-1}) = \prod_{p \in C_i} Q_i(\rho_{i,p}, \kappa_i(p), \tau_i(p))
\tag{7}
$$
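To make the one-step-optimal idea concrete, here is a rough Python sketch of the proposal for a single candidate node p (my own simplification, not the authors' implementation). It discretizes the candidate cut locations to midpoints between sorted data values, which is an assumption made only for illustration, and it reuses `node_log_marginal_likelihood` from the sketch after Eq. (3):

```python
import numpy as np

def one_step_optimal_proposal(X, y, indices, depth, K,
                              alpha=5.0, alpha_s=0.95, beta_s=0.5, rng=None):
    """Sample ('stop', None, None) or ('split', dim, tau) for one candidate
    node, with probability proportional to prior x one-step likelihood
    (cf. Eqs. 5-6). Candidate cuts are discretized for simplicity."""
    rng = rng if rng is not None else np.random.default_rng()
    loglik = lambda idx: node_log_marginal_likelihood(y[idx], K, alpha)
    X_p = X[indices]
    varying = [d for d in range(X.shape[1]) if X_p[:, d].min() < X_p[:, d].max()]
    p_split = alpha_s / (1.0 + depth) ** beta_s if varying else 0.0

    # Option: stop the node, scoring its data as a single leaf.
    options = [('stop', None, None)]
    logw = [np.log1p(-p_split) + loglik(indices)]
    for d in varying:
        vals = np.unique(X_p[:, d])
        cuts = (vals[:-1] + vals[1:]) / 2.0            # discretized candidate tau's
        for tau in cuts:
            left = indices[X[indices, d] <= tau]
            right = indices[X[indices, d] > tau]
            # Prior mass of (d, tau): split prob x uniform over D^p x
            # (approximately) uniform over the candidate cuts in dimension d.
            log_prior = np.log(p_split) - np.log(len(varying)) - np.log(len(cuts))
            options.append(('split', d, tau))
            logw.append(log_prior + loglik(left) + loglik(right))

    logw = np.array(logw)
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    return options[rng.choice(len(probs), p=probs)]
```

The key point is that each candidate decision (stop, or split at a particular dimension and location) is weighted by the likelihood of the data as if the resulting nodes immediately became leaves.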
Sequential Monte Carlo (SMC) Framework

Now that we have the framework for building a single path down the tree, we can discuss the SMC framework used to obtain an approximate posterior. Let each particle at stage i−1 be indexed by m with corresponding weight $w_{i-1}^{(m)}$. For each particle we sample $T_i^{(m)}$ from $Q_i(T_i^{(m)} \mid T_{i-1}^{(m)})$ and then set:
$$
w_i^{(m)} = w_{i-1}^{(m)} \, \frac{P(T_i^{(m)} \mid T_{i-1}^{(m)})\, g(Y \mid T_i^{(m)}, X)}{Q_i(T_i^{(m)} \mid T_{i-1}^{(m)})\, g(Y \mid T_{i-1}^{(m)}, X)}
\tag{8}
$$
and renormalize so that $\sum_m w_i^{(m)} = 1$. If the effective sample size is small, the particles are resampled from the current truncated posterior estimate.
Algorithm 1: SMC for Bayesian decision tree learning

Inputs: training data (X, Y), number of particles M
Initialize: T_0^(m) = E_0^(m) = {ε};  κ_0^(m) = τ_0^(m) = ∅;
            w_0^(m) = f(Y | T_0^(m));  W_0 = Σ_m w_0^(m)
for i = 1 : MAX-STAGES do
    for m = 1 : M do
        Sample T_i^(m) from Q_i(· | T_{i-1}^(m)),
            where T_i^(m) := (T_i^(m), κ_i^(m), τ_i^(m), E_i^(m))
        Update weights (here P and Q_i denote their densities; cf. Eq. 8):
            w_i^(m) = w_{i-1}^(m) · [P(T_i^(m) | T_{i-1}^(m)) g(Y | T_i^(m), X)]
                                  / [Q_i(T_i^(m) | T_{i-1}^(m)) g(Y | T_{i-1}^(m), X)]
    end for
    Compute normalization: W_i = Σ_m w_i^(m)
    Normalize weights: (∀m) w̄_i^(m) = w_i^(m) / W_i
    if 1 / Σ_m (w̄_i^(m))^2 < ESS-THRESHOLD then
        (∀m) Resample indices j_m from Σ_{m'} w̄_i^(m') δ_{m'}
        (∀m) T_i^(m) ← T_i^(j_m);  w_i^(m) ← W_i / M
    end if
    if (∀m) E_i^(m) = ∅ then exit for loop
end for
return estimated marginal probability W_i / M and
       weighted samples {w_i^(m), T_i^(m), κ_i^(m), τ_i^(m)}, m = 1, ..., M
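As a rough complement to the pseudocode, the following Python sketch shows one SMC stage with the weight update of Eq. (8) and ESS-triggered multinomial resampling (my own illustration; `propose` is an assumed black box that propagates a particle and returns the log of the incremental weight ratio):

```python
import numpy as np

def smc_step(particles, log_weights, propose, ess_threshold, rng=None):
    """One SMC stage: propagate every particle, reweight, and resample if the
    effective sample size (ESS) drops below ess_threshold.

    particles   : list of particle states (trees)
    log_weights : 1-D array of unnormalized log weights w_{i-1}^(m)
    propose     : function particle -> (new_particle, log_incremental_weight),
                  the increment being the log of the ratio in Eq. (8)
    """
    rng = rng if rng is not None else np.random.default_rng()
    M = len(particles)
    new_particles, new_logw = [], np.empty(M)
    for m in range(M):
        new_p, log_inc = propose(particles[m])
        new_particles.append(new_p)
        new_logw[m] = log_weights[m] + log_inc

    # Normalized weights and effective sample size.
    w_bar = np.exp(new_logw - new_logw.max())
    w_bar /= w_bar.sum()
    ess = 1.0 / np.sum(w_bar ** 2)

    if ess < ess_threshold:
        idx = rng.choice(M, size=M, p=w_bar)            # multinomial resampling
        new_particles = [new_particles[j] for j in idx]
        # After resampling, every particle carries weight W_i / M.
        new_logw = np.full(M, np.logaddexp.reduce(new_logw) - np.log(M))
    return new_particles, new_logw
```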
Experiments
The proposed algorithms are compared to the MCMC algorithm of Chipman et al. (1998) and the CART algorithm (Breiman et al. 1984) on the following datasets:
- MAGIC gamma telescope data 2004 (magic-04): N = 19020, D = 10, K = 2
- Pen-based recognition of handwritten digits (pen-digits): N = 10992, D = 16, K = 10
- The madelon dataset: N = 2600, D = 500, K = 2
The data were split into 70% training and 30% testing sets, with α = 5.0, α_s = 0.95, and β_s = 0.5.
Figure 2. Results on pen-digits (top) and magic-04 (bottom). The left column plots test log p(y|x) vs. runtime, while the right column plots test log p(y|x) vs. the number of particles. Blue circles and red squares represent the optimal and prior proposals, respectively; solid and dashed lines represent node-wise and layer-wise proposals, respectively.
Figure 3. Results on the madelon dataset: the top and bottom rows display test log p(y|x) and test accuracy, respectively, against runtime (left) and the number of particles (right). Blue circles and red squares represent the optimal and prior proposals, respectively.
Figure 4. Results on pen-digits: test log p(y|x) (left) and accuracy (right) vs. the number of islands I and particles per island M/I, for fixed M = 2000. When M/I ≥ 100, the choices I ∈ [5, 100] outperform I = 1. Since the islands are independent, the computation across islands is 'embarrassingly parallelizable'.
Figure 5. Results on pen-digits (top row) and magic-04 (bottom row). The left column plots test log p(y|x) vs. runtime, while the right column plots test accuracy vs. runtime. Blue circles, red squares, and black diamonds represent the optimal proposal, the prior proposal, and MCMC, respectively. Because the CART implementation is highly optimized, its runtime is not directly comparable, and CART accuracy is plotted as a horizontal bar.