CS 188: Artificial Intelligence
Decision Trees and Neural Nets
Pieter Abbeel, Dan Klein, University of California, Berkeley

Today
§ Formalizing Learning
  § Consistency
  § Simplicity
§ Decision Trees
  § Expressiveness
  § Information Gain
  § Overfitting
§ Neural Nets

Inductive Learning

Inductive Learning (Science)
§ Simplest form: learn a function from examples
  § A target function: g
  § Examples: input-output pairs (x, g(x))
    § E.g. x is an email and g(x) is spam / ham
    § E.g. x is a house and g(x) is its selling price
§ Problem:
  § Given a hypothesis space H
  § Given a training set of examples xi
  § Find a hypothesis h(x) such that h ~ g
§ Includes:
  § Classification (multinomial outputs)
  § Regression (real outputs)
§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)

Inductive Learning
§ Curve fitting (regression, function approximation):

Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance
§ Consistency vs. simplicity
  § Ockham's razor
§ Usually algorithms prefer consistency by default (why?)
§ Several ways to operationalize "simplicity"
  § Reduce the hypothesis space
    § Assume more: e.g. independence assumptions, as in naïve Bayes
    § Have fewer, better features / attributes: feature selection
    § Other structural limitations (decision lists vs. trees)
  § Regularization
    § Smoothing: cautious use of small counts
    § Many other generalization parameters (pruning cutoffs today)
    § Hypothesis space stays big, but harder to get to the outskirts

Decision Trees

Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: fTYPE=French(x) = 1

Decision Trees
§ Compact representation of a function:
  § Truth table
  § Conditional probability table
  § Regression values

Expressiveness of DTs
§ Can express any function of the features
§ True function
§ Realizable: in H
§ However, we hope for compact trees

Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?
§ For a perceptron, a feature's contribution is either positive or negative
  § If you want one feature's effect to depend on another, you have to add a new conjunction feature
  § E.g. adding "PATRONS=full ∧ WAIT=60" allows a perceptron to model the interaction between the two atomic features
§ DTs automatically conjoin features / attributes
  § Features can have different effects in different branches of the tree!
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
  § Though if the interactions are too complex, may not find the DT greedily

Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions over n attributes
  = number of distinct truth tables with 2^n rows
  = 2^(2^n)
  § E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
§ How many trees of depth 1 (decision stumps)?
  = number of Boolean functions over 1 attribute, times n
  = number of truth tables with 2 rows, times n
  = 4n
  § E.g. with 6 Boolean attributes, there are 24 decision stumps
§ More expressive hypothesis space:
  § Increases chance that target function can be expressed (good)
  § Increases number of hypotheses consistent with training set (bad, why?)
    § Means we can get better predictions (lower bias)
    § But we may get worse predictions (higher variance)
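These counts are easy to check numerically; a quick sketch:

```python
# Counting the hypothesis spaces from the slide: a decision tree over n
# Boolean attributes can encode any Boolean function (2^(2^n) of them),
# while a depth-1 stump picks one of 4 one-attribute functions, times n.
n = 6
num_trees = 2 ** (2 ** n)
num_stumps = 4 * n
print(num_trees)   # 18446744073709551616
print(num_stumps)  # 24
```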

Decision Tree Learning
§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose "most significant" attribute as root of (sub)tree

Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
§ So: we need a measure of how "good" a split is, even if the results aren't perfectly separated out

Entropy and Information
§ Information answers questions
§ The more uncertain about the answer initially, the more information in the answer
§ Scale: bits
§ Answer to Boolean question with prior ?
§ Answer to 4-way question with prior ?
§ Answer to 4-way question with prior ?
§ Answer to 3-way question with prior ?

Entropy
§ General answer: if prior is ⟨p1, …, pn⟩:
  § Information is the expected code length: H(p) = Σi pi log2(1/pi)
§ A probability p is typical of:
  § A uniform distribution of size 1/p
  § A code of length log2(1/p)
§ Also called the entropy of the distribution
§ More uniform = higher entropy
§ More values = higher entropy
§ More peaked = lower entropy
§ Rare values almost "don't count"
(figures: example distributions with entropies 1 bit, 0.5 bit, and 0 bits)
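The entropy definition above is a one-liner; a minimal sketch:

```python
from math import log2

def entropy(distribution):
    # Expected code length in bits: H = sum_i p_i * log2(1/p_i).
    # Rare values almost "don't count": p * log2(1/p) -> 0 as p -> 0.
    return sum(p * log2(1 / p) for p in distribution if p > 0)

# More uniform = higher entropy; more peaked = lower entropy:
print(entropy([0.5, 0.5]))                # 1.0 bit (Boolean question, uniform prior)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (4-way question, uniform prior)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits (answer already known)
```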

Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
  § Difference is the information gain
§ Problem: there's more than one distribution after the split!
§ Solution: use expected entropy, weighted by the number of examples

Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under "full"?
§ See what examples are there…
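The expected-entropy bookkeeping behind information gain can be sketched as follows (hypothetical helper names; examples are represented as dicts mapping attribute to value):

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    # Entropy of the empirical label distribution, in bits
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Entropy before the split, minus expected entropy after:
    # each child's entropy is weighted by its share of the examples.
    children = {}
    for x, y in zip(examples, labels):
        children.setdefault(x[attribute], []).append(y)
    expected_after = sum(len(ys) / len(labels) * label_entropy(ys)
                         for ys in children.values())
    return label_entropy(labels) - expected_after

examples = [{'cyl': 4}, {'cyl': 4}, {'cyl': 8}, {'cyl': 8}]
labels = ['good', 'good', 'bad', 'bad']
print(information_gain(examples, labels, 'cyl'))  # 1.0: the split separates perfectly
```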

Example: Learned Tree
§ Decision tree learned from these 12 examples:
§ Substantially simpler than "true" tree
  § A more complex hypothesis isn't justified by the data
§ Also: it's reasonable, but wrong

Example: Miles Per Gallon
(table: 40 examples with attributes mpg, cylinders, displacement, horsepower, weight, acceleration, modelyear, and maker; e.g. mpg ∈ {good, bad}, cylinders ∈ {4, 6, 8}, maker ∈ {america, asia, europe})

Find the First Split

Result: Decision Stump
§ Look at information gain for each attribute
§ Note that each attribute is correlated with the target!
§ What do we split on?

Second Level

Final Tree

MPG Training Error

Reminder: Overfitting
§ Overfitting:
  § When you stop modeling the patterns in the training data (which generalize)
  § And start modeling the noise (which doesn't)
§ We had this before:
  § Naïve Bayes: needed to smooth
  § Perceptron: early stopping
§ The test set error is much worse than the training set error… why?

Significance of a Split
§ Starting with:
  § Three cars with 4 cylinders, from Asia, with medium HP
    § 2 bad MPG
    § 1 good MPG
§ Consider this split
§ What do we expect from a three-way split?
  § Maybe each example in its own subset?
  § Maybe just what we saw in the last slide?
§ Probably shouldn't split if the counts are so small they could be due to chance
  § A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
  § Each split will have a significance value, pCHANCE
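A sketch of the chi-squared statistic behind pCHANCE (hypothetical function name; converting the statistic into an actual p-value would use the chi-squared distribution, e.g. scipy.stats.chi2.sf):

```python
def chi_squared_statistic(children, parent_fracs):
    # children: per-child class counts, e.g. [n_bad, n_good] for each subset.
    # parent_fracs: class fractions in the parent node.
    # Measures how far each child's counts deviate from what a "chance"
    # split (one that mirrors the parent's distribution) would predict.
    stat = 0.0
    for counts in children:
        n = sum(counts)
        for observed, frac in zip(counts, parent_fracs):
            expected = n * frac
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# The slide's example: the parent has 2 bad / 1 good MPG, and a three-way
# split puts each car in its own subset.
stat = chi_squared_statistic([[1, 0], [1, 0], [0, 1]], (2 / 3, 1 / 3))
```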

Keeping it General
§ Pruning:
  § Build the full decision tree
  § Begin at the bottom of the tree
  § Delete splits in which pCHANCE > MaxPCHANCE
  § Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were "redeemed" later
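The bottom-up pruning loop can be sketched with a toy tree type (hypothetical `Node` class, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                     # majority label at this node
    p_chance: float = 0.0          # significance of this node's split
    children: list = field(default_factory=list)  # empty for a leaf

def prune(node, max_p_chance):
    # Begin at the bottom: prune the children first, then delete this
    # split if its deviation from chance is not significant enough.
    if not node.children:
        return node
    node.children = [prune(c, max_p_chance) for c in node.children]
    if all(not c.children for c in node.children) and node.p_chance > max_p_chance:
        return Node(label=node.label)  # collapse the split to a leaf
    return node

# A significant root split whose right child's split looks like chance:
tree = Node('bad', 0.01, [Node('bad'),
                          Node('good', 0.5, [Node('good'), Node('bad')])])
pruned = prune(tree, max_p_chance=0.1)
```

Because children are pruned first and a node is only collapsed once all its children are leaves, a chance-level split can be "redeemed" by a significant split surviving below it.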

y = a XOR b
  a  b | y
  0  0 | 0
  0  1 | 1
  1  0 | 1
  1  1 | 0

Pruning Example
§ With MaxPCHANCE = 0.1:
§ Note the improved test set accuracy compared with the unpruned tree

Regularization

Two Ways of Controlling Overfitting
§ Limit the hypothesis space
  § E.g. limit the max depth of trees
  § Easier to analyze (coming up)
§ Regularize the hypothesis selection
  § E.g. chance cutoff
  § Disprefer most of the hypotheses unless data is clear
  § Usually done in practice
§ MaxPCHANCE is a regularization parameter
§ Generally, set it using held-out data (as usual)
(plot: training and held-out / test accuracy vs. MaxPCHANCE; small MaxPCHANCE gives small trees / high bias, large MaxPCHANCE gives large trees / high variance; training accuracy grows with tree size while held-out / test accuracy peaks in between)

Neural Networks

Reminder: Perceptron
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
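The perceptron's decision rule in a few lines (a sketch; weights and features are plain lists here):

```python
def perceptron_output(weights, features):
    # Activation = weighted sum of the feature values
    activation = sum(w * f for w, f in zip(weights, features))
    # Positive activation -> output +1, negative -> output -1
    return +1 if activation > 0 else -1

print(perceptron_output([2.0, -1.0], [1, 1]))  # 1
print(perceptron_output([2.0, -1.0], [0, 1]))  # -1
```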

Two-Layer Perceptron Network
(diagram: features f1, f2, f3 feed three hidden units through weights w11…w33; each hidden unit computes a weighted sum Σ and thresholds it (>0?); the hidden unit outputs feed an output unit through weights w1, w2, w3, which sums and thresholds them in turn)
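A sketch of the forward pass through such a network, with hand-set (hypothetical) weights that compute the XOR function a single perceptron cannot express (the third input is a constant bias of 1):

```python
def step(x):
    return 1 if x > 0 else 0

def two_layer_output(W_hidden, w_out, features):
    # Each hidden unit thresholds its weighted sum of the inputs (>0?)
    hidden = [step(sum(w * f for w, f in zip(row, features))) for row in W_hidden]
    # The output unit thresholds its weighted sum of the hidden activations
    return +1 if sum(w * h for w, h in zip(w_out, hidden)) > 0 else -1

W_hidden = [[1, 1, -0.5],   # hidden unit 1 fires iff a OR b
            [1, 1, -1.5]]   # hidden unit 2 fires iff a AND b
w_out = [1, -2]             # output fires iff OR but not AND, i.e. a XOR b

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, two_layer_output(W_hidden, w_out, [a, b, 1]))
```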

Learning w
§ Training examples
§ Objective:
§ Procedure: Hill Climbing

Hill Climbing
§ Simple, general idea:
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit
  § Neighbors = small perturbations of w
§ What's bad about this approach?
  § Complete?
  § Optimal?
§ What's particularly tricky when hill-climbing for the multi-layer perceptron?
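The hill-climbing procedure above, sketched for a weight vector and an arbitrary loss (hypothetical names; neighbors are single-coordinate perturbations):

```python
def hill_climb(loss, w, step=0.1, max_iters=1000):
    # Start wherever; repeat: move to the best neighboring state;
    # if no neighbor is better than the current state, quit.
    current, current_loss = list(w), loss(w)
    for _ in range(max_iters):
        neighbors = []
        for i in range(len(current)):
            for delta in (-step, +step):   # small perturbations of w
                n = list(current)
                n[i] += delta
                neighbors.append(n)
        best = min(neighbors, key=loss)
        if loss(best) >= current_loss:
            break   # may be stuck in a local optimum, not the global one
        current, current_loss = best, loss(best)
    return current

# Toy objective with a single optimum at w = (1, -2):
w = hill_climb(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, [0.0, 0.0])
```

On a convex toy objective this converges; the multi-layer perceptron's loss surface is not convex, which is exactly why the procedure can get stuck.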


Two-Layer Neural Network
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

Neural Networks Properties
§ Practical considerations
  § Can be seen as learning the features
  § Large number of neurons
    § Danger for overfitting
  § Hill-climbing procedure can get stuck in bad local optima

Summary
§ Formalization of learning
  § Target function
  § Hypothesis space
  § Generalization
§ Decision Trees
  § Can encode any function
  § Top-down learning (not perfect!)
  § Information gain
  § Bottom-up pruning to prevent overfitting
§ Neural Networks
  § Learn features
  § Universal function approximators
  § Difficult to train

Next …
§ Applications!