Announcements: Midterm 2
§ Monday 4/20, 6-9pm
§ Rooms:
   § 2040 Valley LSB [Last names beginning with A-C]
   § 2060 Valley LSB [Last names beginning with D-H]
   § 145 Dwinelle [Last names beginning with I-L]
   § 155 Dwinelle [Last names beginning with M-Z]
§ Topics
   § Lectures 12 (probability) through 21 (perceptrons) (inclusive)
   § + corresponding homework, projects, sections
§ Midterm 2 prep page:
   § Past exams
   § Special Midterm 2 office hours
   § Practice Midterm 2 (optional)
§ One point of EC on Midterm 2 for completing the practice midterm
   § Due: Saturday 4/18 at 11:59pm (through Gradescope)
Announcements: Next Week
§ MONDAY 4/20: Office hours as posted on edX, and the exam itself
§ TUESDAY 4/21: No lecture, section, exam prep session, or office hours (we'll be grading your exams)
§ WEDNESDAY 4/22: No office hours, section, or exam prep session
§ THURSDAY 4/23: Lecture and office hours resume
§ FRIDAY 4/24: Office hours
Announcements: Final Contest! (Optional/EC)
§ MONDAY 4/20: Tentative Release Date!
§ TUESDAY 4/28 at 11:59pm: Final Submission Due
§ MONDAY 4/20 – TUESDAY 4/28: Leaderboard / Achievements / EC
§ THURSDAY 4/30 in Lecture: Announcement of winners
CS 188: Artificial Intelligence
Decision Trees and Neural Nets
Instructor: Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Today
§ Formalizing Learning
   § Consistency
   § Simplicity
§ Decision Trees
   § Expressiveness
   § Information Gain
   § Overfitting
§ Neural Nets
Inductive Learning
Inductive Learning (Science)
§ Simplest form: learn a function from examples
   § A target function: g
   § Examples: input-output pairs (x, g(x))
   § E.g. x is an email and g(x) is spam / ham
   § E.g. x is a house and g(x) is its selling price
§ Problem:
   § Given a hypothesis space H
   § Given a training set of examples xi
   § Find a hypothesis h(x) such that h ~ g
§ Includes:
   § Classification (multinomial outputs)
   § Regression (real outputs)
§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)
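To make the H / h / g notation concrete, here is a minimal sketch (a hypothetical toy setup, not course code): the learner sees only (x, g(x)) pairs and searches a tiny hypothesis space for an h consistent with them.

```python
# Hypothetical toy example of the inductive learning setup.
def g(x):                      # unknown target function (here: parity of x)
    return x % 2

examples = [(x, g(x)) for x in [1, 4, 7, 10]]   # training set of (x, g(x)) pairs

# A tiny hypothesis space H: each hypothesis maps x to a label.
H = {
    "always_0":  lambda x: 0,
    "parity":    lambda x: x % 2,
    "threshold": lambda x: 1 if x > 5 else 0,
}

# Find any hypothesis consistent with the training examples: h ~ g.
consistent = [name for name, h in H.items()
              if all(h(x) == y for x, y in examples)]
print(consistent)   # ['parity'] -- the only h in H consistent with the data
```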
Inductive Learning
§ Curve fitting (regression, function approximation):
§ Consistency vs. simplicity
§ Ockham's razor
Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance
§ Usually algorithms prefer consistency by default (why?)
§ Several ways to operationalize "simplicity"
   § Reduce the hypothesis space
      § Assume more: e.g. independence assumptions, as in naïve Bayes
      § Have fewer, better features / attributes: feature selection
      § Other structural limitations (decision lists vs trees)
   § Regularization
      § Smoothing: cautious use of small counts
      § Many other generalization parameters (pruning cutoffs today)
      § Hypothesis space stays big, but harder to get to the outskirts
Decision Trees
Reminder: Features
§ Features, aka attributes
   § Sometimes: TYPE=French
   § Sometimes: fTYPE=French(x) = 1
Decision Trees
§ Compact representation of a function:
   § Truth table
   § Conditional probability table
   § Regression values
§ True function
§ Realizable: in H
Expressiveness of DTs
§ Can express any function of the features
§ However, we hope for compact trees
Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?
§ For a perceptron, a feature's contribution is either positive or negative
   § If you want one feature's effect to depend on another, you have to add a new conjunction feature
   § E.g. adding "PATRONS=full ∧ WAIT = 60" allows a perceptron to model the interaction between the two atomic features (see the sketch below)
§ DTs automatically conjoin features / attributes
   § Features can have different effects in different branches of the tree!
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
   § Though if the interactions are too complex, may not find the DT greedily
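As a concrete sketch of the conjunction idea (using XOR instead of the restaurant features, with hand-picked weights for illustration): no weighting of two atomic Boolean features alone reproduces XOR, but one added conjunction feature makes it linearly expressible.

```python
# Sketch: XOR is the classic interaction no weighting of the atomic features
# alone can capture; one added conjunction feature fixes it.
def perceptron_output(f, w):
    s = sum(w_i * f_i for w_i, f_i in zip(w, f))
    return +1 if s > 0 else -1

w = [1.0, 1.0, -2.0]               # y = a + b - 2*(a AND b) > 0  iff  a XOR b
for a in (0, 1):
    for b in (0, 1):
        f = [a, b, a * b]          # atomic features plus conjunction a AND b
        print(a, b, "->", perceptron_output(f, w))
# Output is +1 exactly when a XOR b = 1; drop the third feature and no
# choice of weights reproduces this table.
```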
Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
   = number of Boolean functions over n attributes
   = number of distinct truth tables with 2^n rows
   = 2^(2^n)
   § E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
§ How many trees of depth 1 (decision stumps)?
   = number of Boolean functions over 1 attribute (truth tables with 2 rows), times n attributes
   = 4n
   § E.g. with 6 Boolean attributes, there are 24 decision stumps
§ More expressive hypothesis space:
   § Increases chance that target function can be expressed (good)
   § Increases number of hypotheses consistent with training set (bad, why?)
   § Means we can get better predictions (lower bias)
   § But we may get worse predictions (higher variance)
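The counts are easy to check:

```python
# Verifying the counting arguments above for n = 6 Boolean attributes.
n = 6
print(2 ** (2 ** n))   # 18446744073709551616 distinct Boolean functions/trees
print(4 * n)           # 24 decision stumps (4 Boolean functions per attribute)
```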
Decision Tree Learning
§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
§ So: we need a measure of how "good" a split is, even if the results aren't perfectly separated out
Entropy and Information
§ Information answers questions
§ The more uncertain about the answer initially, the more information in the answer
§ Scale: bits
   § Answer to Boolean question with prior …?
   § Answer to 4-way question with prior …?
   § Answer to 4-way question with prior …?
   § Answer to 3-way question with prior …?
§ A probability p is typical of:
   § A uniform distribution of size 1/p
   § A code of length log 1/p

Entropy
§ General answer: if prior is <p1, …, pn>:
§ Information is the expected code length: H(p) = p1 log2(1/p1) + … + pn log2(1/pn)
§ Also called the entropy of the distribution
   § More uniform = higher entropy
   § More values = higher entropy
   § More peaked = lower entropy
   § Rare values almost "don't count"
[Figure: example distributions with entropies of 1 bit, 0 bits, and 0.5 bit]
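As a sketch, the entropy computation in code (the example distributions below are illustrative):

```python
# Entropy of a distribution, in bits: H(p) = sum_i p_i * log2(1 / p_i).
# Terms with p_i = 0 contribute nothing (rare values almost "don't count").
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit  (uniform Boolean question)
print(entropy([1.0, 0.0]))    # 0.0 bits (answer already known)
print(entropy([0.25] * 4))    # 2.0 bits (uniform 4-way question)
print(entropy([0.9, 0.1]))    # ~0.47 bits (more peaked => lower entropy)
```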
Information Gain
§ Back to decision trees!
   § For each split, compare entropy before and after
   § Difference is the information gain
   § Problem: there's more than one distribution after the split!
§ Solution: use expected entropy, weighted by the number of examples
§ Note: hidden problem here! Gain needs to be adjusted for large-domain splits – why?
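A sketch of information gain as expected-entropy reduction, weighted by example counts (the toy labels are illustrative):

```python
# Information gain of a split: entropy before, minus expected entropy after,
# weighted by how many examples land in each child.
from collections import Counter
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def label_entropy(labels):
    return entropy([c / len(labels) for c in Counter(labels).values()])

def information_gain(parent_labels, children_labels):
    n = len(parent_labels)
    expected = sum(len(c) / n * label_entropy(c) for c in children_labels)
    return label_entropy(parent_labels) - expected

parent = ["bad"] * 3 + ["good"] * 3                           # entropy 1.0 bit
print(information_gain(parent, [["bad"] * 3, ["good"] * 3]))  # 1.0: perfect split
print(information_gain(parent, [["bad", "bad", "good"],
                                ["bad", "good", "good"]]))    # ~0.08: weak split
```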
Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under "full"?
   § See what examples are there…
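Putting the pieces together, a minimal ID3-style recursive learner (a sketch, not the course's implementation; it assumes the information_gain helper from the previous sketch is in scope):

```python
# Greedy top-down learning: recursively split on the attribute with the
# highest information gain; examples are (feature_dict, label) pairs.
def learn_tree(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # pure leaf: this branch is done
        return labels[0]
    if not attributes:                        # no attributes left: majority
        return max(set(labels), key=labels.count)

    def gain(a):                              # gain of splitting on attribute a
        groups = {}
        for x, y in examples:
            groups.setdefault(x[a], []).append(y)
        return information_gain(labels, list(groups.values()))

    best = max(attributes, key=gain)          # "most significant" attribute
    tree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[value] = learn_tree(subset, [a for a in attributes if a != best])
    return (best, tree)

# Hypothetical usage:
# tree = learn_tree([({"patrons": "full"}, "wait"), ...], ["patrons", ...])
```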
Example: Learned Tree
§ Decision tree learned from these 12 examples:
§ Substantially simpler than the "true" tree
   § A more complex hypothesis isn't justified by the data
§ Also: it's reasonable, but wrong
Example: Miles Per Gallon
40 Examples
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
Find the First Split
§ Look at information gain for each attribute
§ Note that each attribute is correlated with the target!
§ What do we split on?
Result: Decision Stump
Second Level
Final Tree
Reminder: Overfitting
§ Overfitting:
   § When you stop modeling the patterns in the training data (which generalize)
   § And start modeling the noise (which doesn't)
§ We had this before:
   § Naïve Bayes: needed to smooth
   § Perceptron: early stopping
MPG Training Error
The test set error is much worse than the training set error…
…why?
Consider this split
Significance of a Split
§ Starting with:
   § Three cars with 4 cylinders, from Asia, with medium HP
   § 2 bad MPG
   § 1 good MPG
§ What do we expect from a three-way split?
   § Maybe each example in its own subset?
   § Maybe just what we saw in the last slide?
§ Probably shouldn't split if the counts are so small they could be due to chance
§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
§ Each split will have a significance value, pCHANCE
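For the three-car split above, one way to compute pCHANCE is a chi-squared test on the (child × class) count table; a sketch using scipy (the course's exact test may differ):

```python
# Rows: the three children of the split; columns: (bad MPG, good MPG) counts.
from scipy.stats import chi2_contingency

observed = [[1, 0],   # child 1: one bad example
            [1, 0],   # child 2: one bad example
            [0, 1]]   # child 3: one good example

chi2, p_chance, dof, expected = chi2_contingency(observed)
print(round(p_chance, 3))   # ~0.223: deviations this large arise by chance
                            # over 22% of the time, so the split looks dubious
```

At the MaxPCHANCE = 0.1 cutoff used later, a split with pCHANCE ≈ 0.22 would be pruned.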
Keeping it General
§ Pruning:
   § Build the full decision tree
   § Begin at the bottom of the tree
   § Delete splits in which pCHANCE > MaxPCHANCE
   § Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were "redeemed" later (see the sketch below)
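A minimal bottom-up pruning sketch over trees shaped like the learn_tree output above (p_chance_of is a hypothetical callback that computes pCHANCE for splitting the given examples on an attribute):

```python
# Prune children first, so a dubious split whose subtrees turn out to be
# significant is "redeemed" and survives.
def prune(tree, examples, max_p_chance, p_chance_of):
    if not isinstance(tree, tuple):           # leaf: nothing to prune
        return tree
    attribute, branches = tree
    for value, subtree in branches.items():
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        branches[value] = prune(subtree, subset, max_p_chance, p_chance_of)
    # Once all children are leaves, test this split's significance.
    if (all(not isinstance(b, tuple) for b in branches.values())
            and p_chance_of(examples, attribute) > max_p_chance):
        labels = [y for _, y in examples]
        return max(set(labels), key=labels.count)   # collapse to majority leaf
    return tree
```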
y = a XOR b

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0
Pruning example
§ With MaxPCHANCE = 0.1:
§ Note the improved test set accuracy compared with the unpruned tree
Regularization
§ MaxPCHANCE is a regularization parameter
§ Generally, set it using held-out data (as usual)
[Plot: accuracy vs. MaxPCHANCE for the training and held-out/test sets. Decreasing MaxPCHANCE gives small trees (high bias); increasing it gives large trees (high variance).]
Two Ways of Controlling Overfitting
§ Limit the hypothesis space
   § E.g. limit the max depth of trees
   § Easier to analyze
§ Regularize the hypothesis selection
   § E.g. chance cutoff
   § Disprefer most of the hypotheses unless data is clear
   § Usually done in practice
Learning Curves
§ Another important trend:
   § More data is better!
   § The same learner will generally do better with more data
   § (Except for cases where the target is absurdly simple)
Neural Networks
Reminder: Perceptron
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x)
§ If the activation is:
   § Positive, output +1
   § Negative, output -1
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a >0? threshold]
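The diagram as code (a minimal sketch):

```python
# Single perceptron: weighted sum of features, then a hard threshold.
def perceptron(f, w):
    activation = sum(w_i * f_i for w_i, f_i in zip(w, f))
    return +1 if activation > 0 else -1

print(perceptron([1.0, 0.0, 1.0], [0.5, -2.0, 0.3]))   # activation 0.8 -> +1
```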
Two-Layer Perceptron Network
[Diagram, repeated across several slides: features f1, f2, f3 feed three hidden units through weights w11…w33; each hidden unit computes a sum Σ followed by a >0? threshold, and the hidden outputs feed a final unit with weights w1, w2, w3, again followed by Σ and >0?]
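A sketch of this structure with hand-picked (not learned) weights, using 0/1 threshold units and a bias input for simplicity; it computes XOR, which no single perceptron can:

```python
# Two-layer network: hidden layer of threshold units, then an output unit.
def step(s):
    return 1 if s > 0 else 0

def two_layer(f, hidden_w, output_w):
    hidden = [step(sum(w_i * f_i for w_i, f_i in zip(w, f))) for w in hidden_w]
    s = sum(w_j * h_j for w_j, h_j in zip(output_w, hidden))
    return +1 if s > 0 else -1

# Hidden unit 1 fires on "a OR b", hidden unit 2 on "a AND b".
hidden_w = [[1.0, 1.0, -0.5],    # a + b - 0.5 > 0  <=>  a OR b
            [1.0, 1.0, -1.5]]    # a + b - 1.5 > 0  <=>  a AND b
output_w = [1.0, -2.0]           # OR minus 2*AND   <=>  XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", two_layer([a, b, 1.0], hidden_w, output_w))
# +1 exactly when a XOR b = 1
```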
Learning w
§ Training examples
§ Objective:
§ Procedure:
   § Hill Climbing
Hill Climbing
§ Simple, general idea:
   § Start wherever
   § Repeat: move to the best neighboring state
   § If no neighbors better than current, quit
   § Neighbors = small perturbations of w
§ What's bad about this approach?
   § Complete?
   § Optimal?
§ What's particularly tricky when hill-climbing for the multi-layer perceptron?
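A minimal hill-climbing sketch over weight vectors (objective stands in for whatever training objective is being maximized; the toy objective below is illustrative):

```python
# Hill climbing on w: propose small random perturbations and move to the
# best neighbor, quitting when none improves the objective.
import random

def hill_climb(w, objective, step=0.1, tries=100):
    best = objective(w)
    while True:
        neighbors = [[wi + random.uniform(-step, step) for wi in w]
                     for _ in range(tries)]
        candidate = max(neighbors, key=objective)
        score = objective(candidate)
        if score <= best:            # no neighbor better than current: quit
            return w
        w, best = candidate, score

# Toy objective with maximum at w = [1, -2]; a real run would score w by
# e.g. classification accuracy on the training examples.
print(hill_climb([0.0, 0.0], lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)))
```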
Two-Layer Neural Network
[Diagram: same two-layer structure as the perceptron network above]
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
§ Practical considerations
   § Can be seen as learning the features
   § Large number of neurons
      § Danger of overfitting
   § Hill-climbing procedure can get stuck in bad local optima
Summary
§ Formalization of learning
   § Target function
   § Hypothesis space
   § Generalization
§ Decision Trees
   § Can encode any function
   § Top-down learning (not perfect!)
   § Information gain
   § Bottom-up pruning to prevent overfitting
§ Neural Networks
   § Learn features
   § Universal function approximators
   § Difficult to train
Next: Advanced Applications!