Top-down particle filtering for Bayesian Decision Trees
Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh
Review by: David Carlson
Overview
- Decision tree learning is a popular approach for both classification and regression tasks.
- Bayesian formulations of these problems have been shown to give competitive performance, but are slow.
- This paper introduces a new, top-down sequential Monte Carlo (SMC) sampling scheme that gives the same performance as the MCMC schemes but is at least an order of magnitude faster.
Introduction
As a general framework, let $X = \{x_n\}_{n=1}^N$, $x_n \in \mathbb{R}^d$, be a set of input vectors, and let $Y = \{y_n\}_{n=1}^N$ be their corresponding labels.
The goal is to predict $y_n$ from its corresponding input vector $x_n$; in this paper the labels are single categories, $y_n \in \{1, \ldots, K\}$. We represent a binary decision tree that generates the data as a tuple $\mathcal{T} = (T, \kappa, \tau)$.
Some notation

A rooted, strictly binary tree T is a finite tree with a single root, designated by the empty string ε. The leaves of the tree are designated by ∂T. All internal nodes (i.e. all nodes that are not leaves, T \ ∂T) are split into a left child and a right child. For an internal node p, these are designated p0 and p1, and the children split the block B(p) into two blocks B(p0) and B(p1). Each internal node splits on a single dimension κ(p) at a cut location τ(p) along that dimension. The children blocks are then:
$$
B(p0) = B(p) \cap \{z \in \mathbb{R}^d : z_{\kappa(p)} \le \tau(p)\}, \qquad
B(p1) = B(p) \cap \{z \in \mathbb{R}^d : z_{\kappa(p)} > \tau(p)\}
\tag{1}
$$
Let N(p) be the set of data indices in node p, and let $X_{N(p)}$ and $Y_{N(p)}$ be the data points in block B(p).
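To make the notation concrete, here is a tiny Python sketch (my own illustration; the helper name `split_block` is not from the paper) that partitions the data indices N(p) of a block into N(p0) and N(p1) given a cut (κ(p), τ(p)):

```python
import numpy as np

def split_block(X, indices, kappa_p, tau_p):
    """Partition the data indices N(p) of block B(p) into N(p0) and N(p1)."""
    values = X[indices, kappa_p]
    left = indices[values <= tau_p]    # N(p0): z_{kappa(p)} <= tau(p)
    right = indices[values > tau_p]    # N(p1): z_{kappa(p)} >  tau(p)
    return left, right
```

For example, `split_block(X, np.arange(len(X)), 0, 0.5)` splits the root block along dimension 1 at 0.5.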
Decision Trees
Figure 1. A decision tree T = (T, κ, τ) represents a hierarchical partitioning of a space. Here, the space is the unit square and the tree T contains the nodes {ε, 0, 1, 10, 11}. The root node ε represents the whole space B(ε) = R^D, while its two children 0 and 1 represent the two halves of the cut (κ(ε), τ(ε)) = (1, 0.5), where κ(ε) = 1 represents the dimension of the cut and τ(ε) = 0.5 represents the location of the cut along that dimension. (The origin is at the bottom left of each figure, and the x-axis is dimension 1. The red stars and blue circles represent observed data points.) The second cut, (κ(1), τ(1)) = (2, 0.35), splits the block B(1) into the two halves B(11) and B(10). When defining the prior over decision trees given by Chipman et al. (1998), it will be necessary to refer to the "extent" of the data in a block. E.g., I^0_1 and I^0_2 are the extents of the data in dimensions 1 and 2, respectively, in block B(0). For each node p, the set D^p contains those dimensions with non-trivial extent. Here, D^0 = {1, 2}, but D^10 = {2}, because there is no variation in dimension 1.
The prior generates a decision tree in stages, beginning with the trivial tree T0 = {ε} containing only the root node. At each stage i, Ti is produced from Ti−1 by choosing one leaf in Ti−1 and either growing two children nodes or stopping the leaf. Once stopped, a leaf is ineligible for future growth. The identity of the chosen leaf is deterministic, while the choice to grow or stop is stochastic. The process proceeds until all leaves are stopped. For larger αs and smaller βs the typical trees are larger, and the deeper p is in the tree the less likely it is to be cut. If p is cut, the dimension κ(p) and then the location τ(p) of the cut are sampled uniformly from D^p and I^p_{κ(p)}, respectively. Note that the support of the distribution over cut dimensions and locations is chosen such that both children of p will, with probability one, contain at least one input vector.
Generating a Decision Tree

In a decision tree, we start at a single root node, designated by the empty string ε, and then for each node p in the tree we:
1. If the features of the data in node p do not vary, deterministically set p to be a leaf node.
2. Otherwise, split the node with probability αs / (1 + |p|)^βs; if not split, the node becomes a leaf.
3. If split, choose a feature dimension κ(p) ∈ D^p uniformly at random, where D^p denotes the set of features that vary within node p.
4. Choose a cut value τ(p) uniformly at random from I^p_{κ(p)}, the range from the smallest to the largest value in dimension κ(p) among the data in node p, X_{N(p)}.
5. At the leaves, class labels are drawn from a Dirichlet-multinomial model (a sketch of this generative process is given below).
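The following is a minimal Python sketch of this generative process (my own illustration, not the authors' code; the Dirichlet-multinomial emission of labels at the leaves is omitted):

```python
import numpy as np

def sample_tree(X, indices, depth=0, alpha_s=0.95, beta_s=0.5, rng=None):
    """Recursively sample a decision tree from the prior described above.

    Returns a nested dict: a leaf {'leaf': indices} or an internal node
    {'kappa': dim, 'tau': cut, 'left': subtree, 'right': subtree}.
    """
    rng = rng if rng is not None else np.random.default_rng()
    X_p = X[indices]
    # D^p: dimensions whose values actually vary within this node.
    varying = [d for d in range(X.shape[1]) if X_p[:, d].min() < X_p[:, d].max()]
    # Step 1: if no feature varies, the node is deterministically a leaf.
    # Step 2: otherwise split with probability alpha_s / (1 + |p|)^beta_s.
    if not varying or rng.random() >= alpha_s / (1.0 + depth) ** beta_s:
        return {'leaf': indices}
    # Step 3: cut dimension kappa(p), uniform over D^p.
    kappa = rng.choice(varying)
    # Step 4: cut location tau(p), uniform over the extent I^p_{kappa(p)}.
    tau = rng.uniform(X_p[:, kappa].min(), X_p[:, kappa].max())
    left = indices[X[indices, kappa] <= tau]
    right = indices[X[indices, kappa] > tau]
    return {'kappa': kappa, 'tau': tau,
            'left': sample_tree(X, left, depth + 1, alpha_s, beta_s, rng),
            'right': sample_tree(X, right, depth + 1, alpha_s, beta_s, rng)}
```

For example, `sample_tree(X, np.arange(len(X)))` draws one tree from the prior over the full dataset; because τ(p) lies strictly inside the extent, both children contain at least one data point with probability one.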
Conditional Distributions

Given our generative model, we can write the conditional distribution of the tree given the input vectors as:
$$
h(T, \kappa, \tau \mid X) \;=\; \prod_{p \in \partial T} \left( 1 - \frac{\mathbf{1}(|D^p| > 0)\,\alpha_s}{(1 + |p|)^{\beta_s}} \right) \times \prod_{p \in T \setminus \partial T} \frac{\alpha_s}{(1 + |p|)^{\beta_s}} \, \frac{1}{|D^p|\,|I^p_{\kappa(p)}|}
\tag{2}
$$
The data likelihood term for a Dirichlet-multinomial model (with a symmetric Dirichlet(α/K) prior) for each node p is given by:
$$
l(Y_{N(p)} \mid X_{N(p)}) \;=\; \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \, \frac{\prod_{k=1}^K \Gamma(m_{pk} + \alpha/K)}{\Gamma\!\left(\sum_{k=1}^K m_{pk} + \alpha\right)}
\tag{3}
$$
where $m_{pk}$ is the number of data points in node p with label k.
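As a quick illustration (my own sketch, not the authors' code), the log of Eq. (3) for a single node can be computed from the label counts:

```python
import numpy as np
from scipy.special import gammaln

def node_log_marginal_likelihood(labels, K, alpha=5.0):
    """Log Dirichlet-multinomial marginal likelihood (Eq. 3) for one node.

    labels : 1-D integer array of class labels (0..K-1) for the data in node p
    K      : number of classes
    alpha  : total Dirichlet concentration (symmetric alpha/K per class)
    """
    m = np.bincount(labels, minlength=K).astype(float)   # counts m_pk
    return (gammaln(alpha) - K * gammaln(alpha / K)
            + gammaln(m + alpha / K).sum()
            - gammaln(m.sum() + alpha))
```

Under the model as described here, the overall likelihood g(Y | T, X) used in the weight updates below is the product of this per-node term over the leaves ∂T.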
Sequential Monte Carlo
MCMC methods for fitting this model were proposed in the 1990s (see Buntine (1992) and Chipman et al. (1998) for examples). The contribution of this paper is a "top-down particle filtering" approach to fitting the model, which the authors claim gives roughly a 10x speedup. To understand how this method works, consider just a single path at first. Let $(T_i, \kappa_i, \tau_i)$ be the state of the tree after i steps of the generation process, and let $E_i$ denote the set of unstopped leaves (i.e. nodes that have been neither stopped nor split). At each stage i, we need to choose a set of candidate leaves $C_i \subseteq E_{i-1}$ to consider for expansion.
Sampling the next stage of the tree
Given our current truncated tree $T_{i-1}$, we want to sample the next stage of the tree, $T_i \mid T_{i-1}$. First, we choose which nodes to consider for expansion. The authors consider two choices: $C_i = E_{i-1}$, where every available node is considered for expansion (layer-wise), or $C_i$ set to the oldest available node, i.e. the one highest in the tree (node-wise). We also have to choose a proposal kernel $Q(T_i \mid T_{i-1})$.
Proposal Kernel

The first proposal kernel considered is the simple case where we use the generative prior, called "SMC Prior" in the experiments:
$$
Q(T_i \mid T_{i-1}) = P(T_i \mid T_{i-1})
\tag{4}
$$
The second proposal kernel uses the posterior probability of the tree, treating the current tree as if it were complete:
$$
Q(T_i \mid T_{i-1}) = P_{Y_i}(T_i \mid T_{i-1}),
\tag{5}
$$
$$
P_{Y_i}(T_i \mid T_{i-1}) \;\propto\; g(Y \mid T_i, X)\, P(T_i \mid T_{i-1}).
\tag{6}
$$
This is called the one-step optimal procedure, because if the tree generation ended after a single further step this would be the optimal proposal. Note that we can consider multiple candidate nodes at the same time because they are independent (ρ denotes a split event):
$$
Q_i(T_i \mid T_{i-1}) = \prod_{p \in C_i} Q_i(\rho_{i,p}, \kappa_i(p), \tau_i(p))
\tag{7}
$$
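To make the one-step-optimal idea concrete, here is a rough Python sketch of the proposal for a single candidate node p (my own simplification, not the authors' implementation). It discretizes the candidate cut locations to midpoints between sorted data values, which is an assumption made only for illustration, and it reuses `node_log_marginal_likelihood` from the sketch after Eq. (3):

```python
import numpy as np

def one_step_optimal_proposal(X, y, indices, depth, K,
                              alpha=5.0, alpha_s=0.95, beta_s=0.5, rng=None):
    """Sample ('stop', None, None) or ('split', dim, tau) for one candidate
    node, with probability proportional to prior x one-step likelihood
    (cf. Eqs. 5-6). Candidate cuts are discretized for simplicity."""
    rng = rng if rng is not None else np.random.default_rng()
    loglik = lambda idx: node_log_marginal_likelihood(y[idx], K, alpha)
    X_p = X[indices]
    varying = [d for d in range(X.shape[1]) if X_p[:, d].min() < X_p[:, d].max()]
    p_split = alpha_s / (1.0 + depth) ** beta_s if varying else 0.0

    # Option: stop the node, scoring its data as a single leaf.
    options = [('stop', None, None)]
    logw = [np.log1p(-p_split) + loglik(indices)]
    for d in varying:
        vals = np.unique(X_p[:, d])
        cuts = (vals[:-1] + vals[1:]) / 2.0            # discretized candidate tau's
        for tau in cuts:
            left = indices[X[indices, d] <= tau]
            right = indices[X[indices, d] > tau]
            # Prior mass of (d, tau): split prob x uniform over D^p x
            # (approximately) uniform over the candidate cuts in dimension d.
            log_prior = np.log(p_split) - np.log(len(varying)) - np.log(len(cuts))
            options.append(('split', d, tau))
            logw.append(log_prior + loglik(left) + loglik(right))

    logw = np.array(logw)
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    return options[rng.choice(len(probs), p=probs)]
```

The key point is that each candidate decision (stop, or split at a particular dimension and location) is weighted by the likelihood of the data as if the resulting nodes immediately became leaves.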
Sequential Monte Carlo (SMC) Framework

Now that we have the framework for building a single path down the tree, we can discuss the SMC framework used to obtain an approximate posterior. Let each particle at stage i−1 be indexed by m with corresponding weight $w_{i-1}^{(m)}$. For each particle we sample $T_i^{(m)}$ from $Q_i(T_i^{(m)} \mid T_{i-1}^{(m)})$ and then set:
$$
w_i^{(m)} = w_{i-1}^{(m)} \, \frac{P(T_i^{(m)} \mid T_{i-1}^{(m)})\, g(Y \mid T_i^{(m)}, X)}{Q_i(T_i^{(m)} \mid T_{i-1}^{(m)})\, g(Y \mid T_{i-1}^{(m)}, X)}
\tag{8}
$$
and renormalize so that $\sum_m w_i^{(m)} = 1$. If the effective sample size is small, the particles are resampled from the current truncated posterior estimate.
Algorithm 1: SMC for Bayesian decision tree learning

Inputs: training data (X, Y), number of particles M
Initialize: T_0^(m) = E_0^(m) = {ε};  κ_0^(m) = τ_0^(m) = ∅;
            w_0^(m) = f(Y | T_0^(m));  W_0 = Σ_m w_0^(m)
for i = 1 : MAX-STAGES do
    for m = 1 : M do
        Sample T_i^(m) from Q_i(· | T_{i-1}^(m)),
            where T_i^(m) := (T_i^(m), κ_i^(m), τ_i^(m), E_i^(m))
        Update weights (here P and Q_i denote their densities; cf. Eq. 8):
            w_i^(m) = w_{i-1}^(m) · [P(T_i^(m) | T_{i-1}^(m)) g(Y | T_i^(m), X)]
                                  / [Q_i(T_i^(m) | T_{i-1}^(m)) g(Y | T_{i-1}^(m), X)]
    end for
    Compute normalization: W_i = Σ_m w_i^(m)
    Normalize weights: (∀m) w̄_i^(m) = w_i^(m) / W_i
    if 1 / Σ_m (w̄_i^(m))^2 < ESS-THRESHOLD then
        (∀m) Resample indices j_m from Σ_{m'} w̄_i^(m') δ_{m'}
        (∀m) T_i^(m) ← T_i^(j_m);  w_i^(m) ← W_i / M
    end if
    if (∀m) E_i^(m) = ∅ then exit for loop
end for
return estimated marginal probability W_i / M and
       weighted samples {w_i^(m), T_i^(m), κ_i^(m), τ_i^(m)}, m = 1, ..., M
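As a rough complement to the pseudocode, the following Python sketch shows one SMC stage with the weight update of Eq. (8) and ESS-triggered multinomial resampling (my own illustration; `propose` is an assumed black box that propagates a particle and returns the log of the incremental weight ratio):

```python
import numpy as np

def smc_step(particles, log_weights, propose, ess_threshold, rng=None):
    """One SMC stage: propagate every particle, reweight, and resample if the
    effective sample size (ESS) drops below ess_threshold.

    particles   : list of particle states (trees)
    log_weights : 1-D array of unnormalized log weights w_{i-1}^(m)
    propose     : function particle -> (new_particle, log_incremental_weight),
                  the increment being the log of the ratio in Eq. (8)
    """
    rng = rng if rng is not None else np.random.default_rng()
    M = len(particles)
    new_particles, new_logw = [], np.empty(M)
    for m in range(M):
        new_p, log_inc = propose(particles[m])
        new_particles.append(new_p)
        new_logw[m] = log_weights[m] + log_inc

    # Normalized weights and effective sample size.
    w_bar = np.exp(new_logw - new_logw.max())
    w_bar /= w_bar.sum()
    ess = 1.0 / np.sum(w_bar ** 2)

    if ess < ess_threshold:
        idx = rng.choice(M, size=M, p=w_bar)            # multinomial resampling
        new_particles = [new_particles[j] for j in idx]
        # After resampling, every particle carries weight W_i / M.
        new_logw = np.full(M, np.logaddexp.reduce(new_logw) - np.log(M))
    return new_particles, new_logw
```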
Experiments
The proposed algorithms are compared to the MCMC algorithm of Chipman et al. (1998) and the CART algorithm (Breiman et al. 1984) on the following datasets:
- MAGIC gamma telescope data 2004 (magic-04): N = 19020, D = 10, K = 2
- Pen-based recognition of handwritten digits (pen-digits): N = 10992, D = 16, K = 10
- The madelon dataset: N = 2600, D = 500, K = 2
The data were split into 70% training and 30% testing sets, with α = 5.0, α_s = 0.95, and β_s = 0.5.
Figure 2. Results on pen-digits (top) and magic-04 (bottom). The left column plots test log p(y|x) vs. runtime, while the right column plots test log p(y|x) vs. the number of particles. Blue circles and red squares represent the optimal and prior proposals, respectively; solid and dashed lines represent node-wise and layer-wise proposals, respectively.
Figure 3. Results on the madelon dataset: the top and bottom rows display test log p(y|x) and test accuracy, respectively, against runtime (left) and the number of particles (right). Blue circles and red squares represent the optimal and prior proposals, respectively.
Figure 4. Results on pen-digits: test log p(y|x) (left) and accuracy (right) vs. the number of islands I and particles per island M/I, for fixed M = 2000. When M/I ≥ 100, the choices I ∈ [5, 100] outperform I = 1. Since the islands are independent, the computation across islands is 'embarrassingly parallelizable'.
Figure 5. Results on pen-digits (top row) and magic-04 (bottom row). The left column plots test log p(y|x) vs. runtime, while the right column plots test accuracy vs. runtime. Blue circles, red squares, and black diamonds represent the optimal proposal, the prior proposal, and MCMC, respectively. Because the CART implementation is highly optimized, its runtime is not directly comparable, and CART accuracy is plotted as a horizontal bar.