Announcements: Midterm 2
§ Monday 4/20, 6-9pm
§ Rooms:
   § 2040 Valley LSB [Last names beginning with A-C]
   § 2060 Valley LSB [Last names beginning with D-H]
   § 145 Dwinelle [Last names beginning with I-L]
   § 155 Dwinelle [Last names beginning with M-Z]
§ Topics
   § Lectures 12 (probability) through 21 (perceptrons) (inclusive)
   § + corresponding homework, projects, sections
§ Midterm 2 prep page:
   § Past exams
   § Special Midterm 2 office hours
   § Practice Midterm 2 (optional)
§ One point of EC on Midterm 2 for completing the practice midterm
   § Due: Saturday 4/18 at 11:59pm (through Gradescope)
Announcements: Next Week
§ MONDAY 4/20: Office hours as posted on edX, and the exam itself
§ TUESDAY 4/21: No lecture, section, exam prep session, or office hours (we'll be grading your exams)
§ WEDNESDAY 4/22: No office hours, section, or exam prep session
§ THURSDAY 4/23: Lecture and office hours resume
§ FRIDAY 4/24: Office hours
Announcements: Final Contest! (Optional/EC)
§ MONDAY 4/20: Tentative Release Date!
§ TUESDAY 4/28 at 11:59pm: Final Submission Due
§ MONDAY 4/20 – TUESDAY 4/28: Leaderboard / Achievements / EC
§ THURSDAY 4/30 in Lecture: Announcement of winners
CS 188: Artificial Intelligence
Decision Trees and Neural Nets
Instructor: Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Today
§ Formalizing Learning
   § Consistency
   § Simplicity
§ Decision Trees
   § Expressiveness
   § Information Gain
   § Overfitting
§ Neural Nets
Inductive Learning
Inductive Learning (Science)
§ Simplest form: learn a function from examples
   § A target function: g
   § Examples: input-output pairs (x, g(x))
   § E.g. x is an email and g(x) is spam / ham
   § E.g. x is a house and g(x) is its selling price
§ Problem:
   § Given a hypothesis space H
   § Given a training set of examples xi
   § Find a hypothesis h(x) such that h ~ g
§ Includes:
   § Classification (multinomial outputs)
   § Regression (real outputs)
§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)
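To make the H / h / g notation concrete, here is a minimal sketch (a hypothetical toy setup, not course code): the learner sees only (x, g(x)) pairs and searches a tiny hypothesis space for an h consistent with them.

```python
# Hypothetical toy example of the inductive learning setup.
def g(x):                      # unknown target function (here: parity of x)
    return x % 2

examples = [(x, g(x)) for x in [1, 4, 7, 10]]   # training set of (x, g(x)) pairs

# A tiny hypothesis space H: each hypothesis maps x to a label.
H = {
    "always_0":  lambda x: 0,
    "parity":    lambda x: x % 2,
    "threshold": lambda x: 1 if x > 5 else 0,
}

# Find any hypothesis consistent with the training examples: h ~ g.
consistent = [name for name, h in H.items()
              if all(h(x) == y for x, y in examples)]
print(consistent)   # ['parity'] -- the only h in H consistent with the data
```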
Inductive Learning
§ Curve fitting (regression, function approximation):
§ Consistency vs. simplicity
§ Ockham's razor
Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance
§ Usually algorithms prefer consistency by default (why?)
§ Several ways to operationalize "simplicity"
   § Reduce the hypothesis space
      § Assume more: e.g. independence assumptions, as in naïve Bayes
      § Have fewer, better features / attributes: feature selection
      § Other structural limitations (decision lists vs trees)
   § Regularization
      § Smoothing: cautious use of small counts
      § Many other generalization parameters (pruning cutoffs today)
      § Hypothesis space stays big, but harder to get to the outskirts
Decision Trees
Reminder: Features
§ Features, aka attributes
   § Sometimes: TYPE=French
   § Sometimes: fTYPE=French(x) = 1
Decision Trees
§ Compact representation of a function:
   § Truth table
   § Conditional probability table
   § Regression values
§ True function
§ Realizable: in H
Expressiveness of DTs
§ Can express any function of the features
§ However, we hope for compact trees
Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?
§ For a perceptron, a feature's contribution is either positive or negative
   § If you want one feature's effect to depend on another, you have to add a new conjunction feature
   § E.g. adding "PATRONS=full ∧ WAIT = 60" allows a perceptron to model the interaction between the two atomic features (see the sketch below)
§ DTs automatically conjoin features / attributes
   § Features can have different effects in different branches of the tree!
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
   § Though if the interactions are too complex, may not find the DT greedily
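As a concrete sketch of the conjunction idea (using XOR instead of the restaurant features, with hand-picked weights for illustration): no weighting of two atomic Boolean features alone reproduces XOR, but one added conjunction feature makes it linearly expressible.

```python
# Sketch: XOR is the classic interaction no weighting of the atomic features
# alone can capture; one added conjunction feature fixes it.
def perceptron_output(f, w):
    s = sum(w_i * f_i for w_i, f_i in zip(w, f))
    return +1 if s > 0 else -1

w = [1.0, 1.0, -2.0]               # y = a + b - 2*(a AND b) > 0  iff  a XOR b
for a in (0, 1):
    for b in (0, 1):
        f = [a, b, a * b]          # atomic features plus conjunction a AND b
        print(a, b, "->", perceptron_output(f, w))
# Output is +1 exactly when a XOR b = 1; drop the third feature and no
# choice of weights reproduces this table.
```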
Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
   = number of Boolean functions over n attributes
   = number of distinct truth tables with 2^n rows
   = 2^(2^n)
   § E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
§ How many trees of depth 1 (decision stumps)?
   = number of Boolean functions over 1 attribute (truth tables with 2 rows), times n attributes
   = 4n
   § E.g. with 6 Boolean attributes, there are 24 decision stumps
§ More expressive hypothesis space:
   § Increases chance that target function can be expressed (good)
   § Increases number of hypotheses consistent with training set (bad, why?)
   § Means we can get better predictions (lower bias)
   § But we may get worse predictions (higher variance)
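The counts are easy to check:

```python
# Verifying the counting arguments above for n = 6 Boolean attributes.
n = 6
print(2 ** (2 ** n))   # 18446744073709551616 distinct Boolean functions/trees
print(4 * n)           # 24 decision stumps (4 Boolean functions per attribute)
```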
Decision Tree Learning
§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
§ So: we need a measure of how "good" a split is, even if the results aren't perfectly separated out
Entropy and Information
§ Information answers questions
§ The more uncertain about the answer initially, the more information in the answer
§ Scale: bits
   § Answer to Boolean question with prior …?
   § Answer to 4-way question with prior …?
   § Answer to 4-way question with prior …?
   § Answer to 3-way question with prior …?
§ A probability p is typical of:
   § A uniform distribution of size 1/p
   § A code of length log 1/p

Entropy
§ General answer: if prior is <p1, …, pn>:
§ Information is the expected code length: H(p) = p1 log2(1/p1) + … + pn log2(1/pn)
§ Also called the entropy of the distribution
   § More uniform = higher entropy
   § More values = higher entropy
   § More peaked = lower entropy
   § Rare values almost "don't count"
[Figure: example distributions with entropies of 1 bit, 0 bits, and 0.5 bit]
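As a sketch, the entropy computation in code (the example distributions below are illustrative):

```python
# Entropy of a distribution, in bits: H(p) = sum_i p_i * log2(1 / p_i).
# Terms with p_i = 0 contribute nothing (rare values almost "don't count").
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit  (uniform Boolean question)
print(entropy([1.0, 0.0]))    # 0.0 bits (answer already known)
print(entropy([0.25] * 4))    # 2.0 bits (uniform 4-way question)
print(entropy([0.9, 0.1]))    # ~0.47 bits (more peaked => lower entropy)
```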
Information Gain
§ Back to decision trees!
   § For each split, compare entropy before and after
   § Difference is the information gain
   § Problem: there's more than one distribution after the split!
§ Solution: use expected entropy, weighted by the number of examples
§ Note: hidden problem here! Gain needs to be adjusted for large-domain splits – why?
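A sketch of information gain as expected-entropy reduction, weighted by example counts (the toy labels are illustrative):

```python
# Information gain of a split: entropy before, minus expected entropy after,
# weighted by how many examples land in each child.
from collections import Counter
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def label_entropy(labels):
    return entropy([c / len(labels) for c in Counter(labels).values()])

def information_gain(parent_labels, children_labels):
    n = len(parent_labels)
    expected = sum(len(c) / n * label_entropy(c) for c in children_labels)
    return label_entropy(parent_labels) - expected

parent = ["bad"] * 3 + ["good"] * 3                           # entropy 1.0 bit
print(information_gain(parent, [["bad"] * 3, ["good"] * 3]))  # 1.0: perfect split
print(information_gain(parent, [["bad", "bad", "good"],
                                ["bad", "good", "good"]]))    # ~0.08: weak split
```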
Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under "full"?
   § See what examples are there…
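Putting the pieces together, a minimal ID3-style recursive learner (a sketch, not the course's implementation; it assumes the information_gain helper from the previous sketch is in scope):

```python
# Greedy top-down learning: recursively split on the attribute with the
# highest information gain; examples are (feature_dict, label) pairs.
def learn_tree(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # pure leaf: this branch is done
        return labels[0]
    if not attributes:                        # no attributes left: majority
        return max(set(labels), key=labels.count)

    def gain(a):                              # gain of splitting on attribute a
        groups = {}
        for x, y in examples:
            groups.setdefault(x[a], []).append(y)
        return information_gain(labels, list(groups.values()))

    best = max(attributes, key=gain)          # "most significant" attribute
    tree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[value] = learn_tree(subset, [a for a in attributes if a != best])
    return (best, tree)

# Hypothetical usage:
# tree = learn_tree([({"patrons": "full"}, "wait"), ...], ["patrons", ...])
```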
Example: Learned Tree
§ Decision tree learned from these 12 examples:
§ Substantially simpler than the "true" tree
   § A more complex hypothesis isn't justified by the data
§ Also: it's reasonable, but wrong
Example: Miles Per Gallon
40 Examples
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
Find the First Split
§ Look at information gain for each attribute
§ Note that each attribute is correlated with the target!
§ What do we split on?
Result: Decision Stump
Second Level
Final Tree
Reminder: Overfitting
§ Overfitting:
   § When you stop modeling the patterns in the training data (which generalize)
   § And start modeling the noise (which doesn't)
§ We had this before:
   § Naïve Bayes: needed to smooth
   § Perceptron: early stopping
MPG Training Error
The test set error is much worse than the training set error…
…why?
Consider this split
Significance of a Split
§ Starting with:
   § Three cars with 4 cylinders, from Asia, with medium HP
   § 2 bad MPG
   § 1 good MPG
§ What do we expect from a three-way split?
   § Maybe each example in its own subset?
   § Maybe just what we saw in the last slide?
§ Probably shouldn't split if the counts are so small they could be due to chance
§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
§ Each split will have a significance value, pCHANCE
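For the three-car split above, one way to compute pCHANCE is a chi-squared test on the (child × class) count table; a sketch using scipy (the course's exact test may differ):

```python
# Rows: the three children of the split; columns: (bad MPG, good MPG) counts.
from scipy.stats import chi2_contingency

observed = [[1, 0],   # child 1: one bad example
            [1, 0],   # child 2: one bad example
            [0, 1]]   # child 3: one good example

chi2, p_chance, dof, expected = chi2_contingency(observed)
print(round(p_chance, 3))   # ~0.223: deviations this large arise by chance
                            # over 22% of the time, so the split looks dubious
```

At the MaxPCHANCE = 0.1 cutoff used later, a split with pCHANCE ≈ 0.22 would be pruned.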
Keeping it General
§ Pruning:
   § Build the full decision tree
   § Begin at the bottom of the tree
   § Delete splits in which pCHANCE > MaxPCHANCE
   § Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were "redeemed" later (see the sketch below)
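A minimal bottom-up pruning sketch over trees shaped like the learn_tree output above (p_chance_of is a hypothetical callback that computes pCHANCE for splitting the given examples on an attribute):

```python
# Prune children first, so a dubious split whose subtrees turn out to be
# significant is "redeemed" and survives.
def prune(tree, examples, max_p_chance, p_chance_of):
    if not isinstance(tree, tuple):           # leaf: nothing to prune
        return tree
    attribute, branches = tree
    for value, subtree in branches.items():
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        branches[value] = prune(subtree, subset, max_p_chance, p_chance_of)
    # Once all children are leaves, test this split's significance.
    if (all(not isinstance(b, tuple) for b in branches.values())
            and p_chance_of(examples, attribute) > max_p_chance):
        labels = [y for _, y in examples]
        return max(set(labels), key=labels.count)   # collapse to majority leaf
    return tree
```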
y = a XOR b

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0
Pruning example
§ With MaxPCHANCE = 0.1:
§ Note the improved test set accuracy compared with the unpruned tree
Regularization
§ MaxPCHANCE is a regularization parameter
§ Generally, set it using held-out data (as usual)
[Plot: accuracy vs. MaxPCHANCE for the training and held-out/test sets. Decreasing MaxPCHANCE gives small trees (high bias); increasing it gives large trees (high variance).]
Two Ways of Controlling Overfitting
§ Limit the hypothesis space
   § E.g. limit the max depth of trees
   § Easier to analyze
§ Regularize the hypothesis selection
   § E.g. chance cutoff
   § Disprefer most of the hypotheses unless data is clear
   § Usually done in practice
Learning Curves
§ Another important trend:
   § More data is better!
   § The same learner will generally do better with more data
   § (Except for cases where the target is absurdly simple)
Neural Networks
Reminder: Perceptron
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x)
§ If the activation is:
   § Positive, output +1
   § Negative, output -1
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed a sum Σ followed by a >0? threshold]
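The diagram as code (a minimal sketch):

```python
# Single perceptron: weighted sum of features, then a hard threshold.
def perceptron(f, w):
    activation = sum(w_i * f_i for w_i, f_i in zip(w, f))
    return +1 if activation > 0 else -1

print(perceptron([1.0, 0.0, 1.0], [0.5, -2.0, 0.3]))   # activation 0.8 -> +1
```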
Two-Layer Perceptron Network
[Diagram, repeated across several slides: features f1, f2, f3 feed three hidden units through weights w11…w33; each hidden unit computes a sum Σ followed by a >0? threshold, and the hidden outputs feed a final unit with weights w1, w2, w3, again followed by Σ and >0?]
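A sketch of this structure with hand-picked (not learned) weights, using 0/1 threshold units and a bias input for simplicity; it computes XOR, which no single perceptron can:

```python
# Two-layer network: hidden layer of threshold units, then an output unit.
def step(s):
    return 1 if s > 0 else 0

def two_layer(f, hidden_w, output_w):
    hidden = [step(sum(w_i * f_i for w_i, f_i in zip(w, f))) for w in hidden_w]
    s = sum(w_j * h_j for w_j, h_j in zip(output_w, hidden))
    return +1 if s > 0 else -1

# Hidden unit 1 fires on "a OR b", hidden unit 2 on "a AND b".
hidden_w = [[1.0, 1.0, -0.5],    # a + b - 0.5 > 0  <=>  a OR b
            [1.0, 1.0, -1.5]]    # a + b - 1.5 > 0  <=>  a AND b
output_w = [1.0, -2.0]           # OR minus 2*AND   <=>  XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", two_layer([a, b, 1.0], hidden_w, output_w))
# +1 exactly when a XOR b = 1
```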
Learning w
§ Training examples
§ Objective:
§ Procedure:
   § Hill Climbing
Hill Climbing
§ Simple, general idea:
   § Start wherever
   § Repeat: move to the best neighboring state
   § If no neighbors better than current, quit
   § Neighbors = small perturbations of w
§ What's bad about this approach?
   § Complete?
   § Optimal?
§ What's particularly tricky when hill-climbing for the multi-layer perceptron?
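A minimal hill-climbing sketch over weight vectors (objective stands in for whatever training objective is being maximized; the toy objective below is illustrative):

```python
# Hill climbing on w: propose small random perturbations and move to the
# best neighbor, quitting when none improves the objective.
import random

def hill_climb(w, objective, step=0.1, tries=100):
    best = objective(w)
    while True:
        neighbors = [[wi + random.uniform(-step, step) for wi in w]
                     for _ in range(tries)]
        candidate = max(neighbors, key=objective)
        score = objective(candidate)
        if score <= best:            # no neighbor better than current: quit
            return w
        w, best = candidate, score

# Toy objective with maximum at w = [1, -2]; a real run would score w by
# e.g. classification accuracy on the training examples.
print(hill_climb([0.0, 0.0], lambda w: -((w[0] - 1) ** 2 + (w[1] + 2) ** 2)))
```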
Two-Layer Neural Network
[Diagram: same two-layer structure as the perceptron network above]
Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
§ Practical considerations
   § Can be seen as learning the features
   § Large number of neurons
      § Danger of overfitting
   § Hill-climbing procedure can get stuck in bad local optima
Summary
§ Formalization of learning
   § Target function
   § Hypothesis space
   § Generalization
§ Decision Trees
   § Can encode any function
   § Top-down learning (not perfect!)
   § Information gain
   § Bottom-up pruning to prevent overfitting
§ Neural Networks
   § Learn features
   § Universal function approximators
   § Difficult to train
Next: Advanced Applications!