CS 188: Artificial Intelligence
Decision Trees and Neural Nets
Pieter Abbeel, Dan Klein University of California, Berkeley
Today § Formalizing Learning § Consistency § Simplicity
§ Decision Trees § Expressiveness § Information Gain § Overfitting
§ Neural Nets
Inductive Learning
Inductive Learning (Science) § Simplest form: learn a function from examples § A target function: g § Examples: input-output pairs (x, g(x)) § E.g. x is an email and g(x) is spam / ham § E.g. x is a house and g(x) is its selling price § Problem: § Given a hypothesis space H § Given a training set of examples xi § Find a hypothesis h(x) such that h ~ g
§ Includes:
§ Classification (multinomial outputs) § Regression (real outputs)
§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)
Inductive Learning § Curve fitting (regression, function approximation):
§ Consistency vs. simplicity § Ockham’s razor
Consistency vs. Simplicity § Fundamental tradeoff: bias vs. variance § Usually algorithms prefer consistency by default (why?) § Several ways to operationalize “simplicity” § Reduce the hypothesis space § Assume more: e.g. independence assumptions, as in naïve Bayes § Have fewer, better features / attributes: feature selection § Other structural limitations (decision lists vs trees)
§ Regularization § Smoothing: cautious use of small counts § Many other generalization parameters (pruning cutoffs today) § Hypothesis space stays big, but harder to get to the outskirts
Decision Trees
Reminder: Features § Features, aka attributes § Sometimes: TYPE=French § Sometimes: fTYPE=French(x) = 1
Decision Trees § Compact representation of a function: § Truth table § Conditional probability table § Regression values
§ True function § Realizable: in H
Expressiveness of DTs § Can express any function of the features
§ However, we hope for compact trees
Comparison: Perceptrons § What is the expressiveness of a perceptron over these features?
§ For a perceptron, a feature’s contribution is either positive or negative
§ If you want one feature’s effect to depend on another, you have to add a new conjunction feature § E.g. adding “PATRONS=full ∧ WAIT = 60” allows a perceptron to model the interaction between the two atomic features
§ DTs automatically conjoin features / attributes
§ Features can have different effects in different branches of the tree!
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs) § Though if the interactions are too complex, may not find the DT greedily
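A minimal sketch of the conjunction-feature point above, using the slide’s restaurant example. The weights are illustrative assumptions, not learned values; the sketch shows that with the added conjunction feature, the effect of one atomic feature can depend on the other.

```python
# Sketch: modeling a feature interaction in a perceptron by adding a
# conjunction feature. Feature names follow the slide's example; the
# weights below are made-up for illustration.
def features(patrons_full, wait_60):
    return {
        "PATRONS=full": 1.0 if patrons_full else 0.0,
        "WAIT=60": 1.0 if wait_60 else 0.0,
        # The added conjunction feature:
        "PATRONS=full AND WAIT=60": 1.0 if (patrons_full and wait_60) else 0.0,
    }

def activation(weights, f):
    return sum(weights[k] * v for k, v in f.items())

# With the conjunction feature, each atomic feature alone pushes positive,
# but their co-occurrence pushes negative: a non-linear interaction.
w = {"PATRONS=full": 1.0, "WAIT=60": 1.0, "PATRONS=full AND WAIT=60": -3.0}
```

Without the third feature, no single linear weighting can make both atomic features individually positive but jointly negative.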
Hypothesis Spaces § How many distinct decision trees with n Boolean attributes? = number of Boolean functions over n attributes = number of distinct truth tables with 2^n rows = 2^(2^n) § E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
§ How many trees of depth 1 (decision stumps)?
= number of Boolean functions over 1 attribute = number of truth tables with 2 rows, times n = 4n § E.g. with 6 Boolean attributes, there are 24 decision stumps
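The counts above can be checked directly; a small sketch:

```python
# Sketch: counting hypotheses over n Boolean attributes, per the slide.
def num_trees(n):
    # A truth table over n Boolean attributes has 2**n rows, and each row
    # may independently be labeled True or False: 2**(2**n) functions.
    return 2 ** (2 ** n)

def num_stumps(n):
    # A depth-1 stump: n choices of attribute, times the 4 Boolean
    # functions of a single attribute.
    return 4 * n

print(num_trees(6))   # 18446744073709551616
print(num_stumps(6))  # 24
```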
§ More expressive hypothesis space:
§ Increases chance that target function can be expressed (good) § Increases number of hypotheses consistent with training set (bad, why?) § Means we can get better predictions (lower bias) § But we may get worse predictions (higher variance)
Decision Tree Learning § Aim: find a small tree consistent with the training examples § Idea: (recursively) choose “most significant” attribute as root of (sub)tree
Choosing an Attribute § Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”
§ So: we need a measure of how “good” a split is, even if the results aren’t perfectly separated out
Entropy and Information § Information answers questions § The more uncertain about the answer initially, the more information in the answer § Scale: bits
§ Answer to Boolean question with prior? § Answer to 4-way question with prior? § Answer to 4-way question with prior? § Answer to 3-way question with prior?
§ A probability p is typical of: § A uniform distribution of size 1/p § A code of length log 1/p
Entropy § General answer: if prior is ⟨p1, …, pn⟩: § Information is the expected code length: H = Σi pi log2(1/pi)
§ Also called the entropy of the distribution § More uniform = higher entropy § More values = higher entropy § More peaked = lower entropy § Rare values almost “don’t count”
[Figure: example distributions with entropies of 1 bit, 0.5 bit, and 0 bits]
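The entropy of a distribution, i.e. the expected code length Σ p log2(1/p), is easy to compute; a minimal sketch:

```python
import math

def entropy(probs):
    # H(p) = sum_i p_i * log2(1 / p_i); zero-probability outcomes
    # contribute nothing ("rare values almost don't count").
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # uniform Boolean prior: 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25])) # uniform 4-way prior: 2.0 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))     # fully peaked prior: 0.0 bits
```

Note how a more uniform prior gives higher entropy, and a fully peaked one gives zero.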
Information Gain § Back to decision trees! § For each split, compare entropy before and after § Difference is the information gain § Problem: there’s more than one distribution after the split!
§ Solution: use expected entropy, weighted by the number of examples
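Putting the two ideas together, a sketch of information gain using the size-weighted expected entropy of the children (the label lists are toy data, not from the slides):

```python
import math
from collections import Counter

def label_entropy(labels):
    # Entropy of the empirical label distribution in a subset.
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    # children: one list of labels per branch of the split.
    # Gain = entropy before the split minus the expected
    # (size-weighted) entropy after it.
    n = len(parent)
    expected = sum(len(ch) / n * label_entropy(ch) for ch in children)
    return label_entropy(parent) - expected

# A perfect split removes all uncertainty:
print(information_gain(["+", "+", "-", "-"], [["+", "+"], ["-", "-"]]))  # 1.0
```

An uninformative split (children with the same label mix as the parent) has gain 0.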
Next Step: Recurse § Now we need to keep growing the tree! § Two branches are done (why?) § What to do under “full”? § See what examples are there…
Example: Learned Tree § Decision tree learned from these 12 examples:
§ Substantially simpler than the “true” tree
§ A more complex hypothesis isn’t justified by the data
§ Also: it’s reasonable, but wrong
Example: Miles Per Gallon
40 Examples
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
Find the First Split § Look at information gain for each attribute § Note that each attribute is correlated with the target! § What do we split on?
Result: Decision Stump
Second Level
Final Tree
Reminder: Overfitting § Overfitting: § When you stop modeling the patterns in the training data (which generalize) § And start modeling the noise (which doesn’t)
§ We had this before: § Naïve Bayes: needed to smooth § Perceptron: early stopping
MPG Training Error
The test set error is much worse than the training set error…
…why?
Consider this split
Significance of a Split § Starting with:
§ Three cars with 4 cylinders, from Asia, with medium HP § 2 bad MPG § 1 good MPG
§ What do we expect from a three-way split? § Maybe each example in its own subset? § Maybe just what we saw in the last slide?
§ Probably shouldn’t split if the counts are so small they could be due to chance § A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance* § Each split will have a significance value, pCHANCE
Keeping it General y = a XOR b
§ Pruning: § Build the full decision tree § Begin at the bottom of the tree § Delete splits in which pCHANCE > MaxPCHANCE § Continue working upward until there are no more prunable nodes § Note: some chance nodes may not get pruned because they were “redeemed” later
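The bottom-up pruning loop above can be sketched as follows. The dict-based tree representation, and the assumption that each internal node already stores its pCHANCE and majority label, are mine, not the slides’.

```python
# Hedged sketch of bottom-up chi-squared pruning. Assumes each internal
# node is a dict with 'children', a precomputed 'p_chance', and the
# 'majority' label of its examples; leaves are dicts with just 'label'.
MAX_P_CHANCE = 0.1

def prune(node):
    if "children" not in node:   # leaf: nothing to prune
        return node
    # Work from the bottom of the tree: prune the subtrees first.
    node["children"] = {v: prune(c) for v, c in node["children"].items()}
    # Delete a split only once all its children are leaves; splits with a
    # useful sub-split below are "redeemed" and kept.
    all_leaves = all("children" not in c for c in node["children"].values())
    if all_leaves and node["p_chance"] > MAX_P_CHANCE:
        return {"label": node["majority"]}
    return node
```

A split whose deviation from chance is significant (pCHANCE ≤ MaxPCHANCE) survives; a split that could plausibly be noise collapses to its majority-label leaf.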
a b y
0 0 0
0 1 1
1 0 1
1 1 0
Pruning example § With MaxPCHANCE = 0.1:
Note the improved test set accuracy compared with the unpruned tree
Regularization § MaxPCHANCE is a regularization parameter § Generally, set it using held-out data (as usual)
[Plot: accuracy vs. MaxPCHANCE. Training accuracy keeps rising with increasing MaxPCHANCE (larger trees), while held-out / test accuracy peaks in between: small trees give high bias, large trees give high variance]
Two Ways of Controlling Overfitting § Limit the hypothesis space § E.g. limit the max depth of trees § Easier to analyze (coming up)
§ Regularize the hypothesis selection § E.g. chance cutoff § Disprefer most of the hypotheses unless data is clear § Usually done in practice
Neural Networks
Reminder: Perceptron § Inputs are feature values § Each feature has a weight § Sum is the activation
§ If the activation is: § Positive, output +1 § Negative, output -1
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feeding a sum Σ and a threshold >0?]
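The reminder above in code form, with toy weights and features for illustration:

```python
# Sketch of the perceptron reminder: the activation is the weighted sum
# of the feature values, thresholded at zero.
def perceptron_output(weights, features):
    activation = sum(w * f for w, f in zip(weights, features))
    return +1 if activation > 0 else -1

print(perceptron_output([1.0, -2.0, 0.5], [1.0, 0.0, 1.0]))  # 1
print(perceptron_output([1.0, -2.0, 0.5], [0.0, 1.0, 0.0]))  # -1
```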
Two-Layer Perceptron Network
[Diagram: inputs f1, f2, f3 feed three hidden threshold units (Σ, >0?) via weights wij; the hidden outputs feed a final threshold unit via weights w1, w2, w3]
Learning w § Training examples
§ Objective:
§ Procedure: § Hill Climbing
Hill Climbing § Simple, general idea:
§ Start wherever § Repeat: move to the best neighboring state § If no neighbors better than current, quit § Neighbors = small perturbations of w
§ What’s bad about this approach? § Complete? § Optimal?
§ What’s particularly tricky when hill-climbing for the multi-layer perceptron?
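A sketch of the idea above. The slide’s “best neighboring state” is approximated here by sampling random small perturbations; the loss function, step size, and iteration count are illustrative assumptions.

```python
import random

# Hedged sketch of hill climbing on a weight vector w.
def hill_climb(loss, w, step=0.1, iters=2000, seed=0):
    rng = random.Random(seed)
    best = loss(w)
    for _ in range(iters):
        # Neighbor = small perturbation of w.
        neighbor = [wi + rng.uniform(-step, step) for wi in w]
        n_loss = loss(neighbor)
        # Move only if strictly better; this is why the procedure can
        # stall at a local optimum (neither complete nor optimal).
        if n_loss < best:
            w, best = neighbor, n_loss
    return w

# Minimizing a smooth toy loss with its optimum at w = [3.0]:
w_final = hill_climb(lambda w: (w[0] - 3.0) ** 2, [0.0])
```

For the multi-layer perceptron the extra difficulty is that the hard ">0?" thresholds make the objective flat almost everywhere, so tiny perturbations rarely change any output.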
Two-Layer Neural Network
[Diagram: the same two-layer network, with the hard thresholds replaced by smooth activations]
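A sketch of the forward pass of such a two-layer network. Using a sigmoid as the smooth activation is my choice for illustration; the toy weights are made up.

```python
import math

# Sketch: two-layer network forward pass. Replacing the perceptron's hard
# ">0?" threshold with a smooth activation gives a smooth output, which is
# what makes hill climbing / gradient methods workable.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def two_layer(x, W_hidden, w_out):
    # Each hidden unit: smooth activation of a weighted sum of the inputs.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    # Output unit: smooth activation of a weighted sum of the hidden values.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Output is a smooth value in (0, 1) rather than a hard +1 / -1:
y = two_layer([1.0, 0.0, 1.0], [[2.0, -1.0, 0.5], [0.5, 0.5, -2.0]], [1.0, -1.0])
```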
Neural Networks Properties § Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy. § Practical considerations § Can be seen as learning the features § Large number of neurons § Danger of overfitting
§ Hill-climbing procedure can get stuck in bad local optima
Summary § Formalization of learning § Target function § Hypothesis space § Generalization
§ Decision Trees § Can encode any function § Top-down learning (not perfect!) § Information gain § Bottom-up pruning to prevent overfitting
§ Neural Networks § Learn features § Universal function approximators § Difficult to train
Next … § Applications!