Announcements: Midterm 2
§ Monday 4/20, 6-9pm
§ Rooms:
§ 2040 Valley LSB [Last names beginning with A-C]
§ 2060 Valley LSB [Last names beginning with D-H]
§ 145 Dwinelle [Last names beginning with I-L]
§ 155 Dwinelle [Last names beginning with M-Z]

§ Topics
§ Lectures 12 (probability) through 21 (perceptrons), inclusive
§ + corresponding homework, projects, sections

§ Midterm 2 prep page:
§ Past exams
§ Special Midterm 2 office hours
§ Practice Midterm 2 (optional)

§ One point of EC on Midterm 2 for completing the practice midterm
§ Due: Saturday 4/18 at 11:59pm (through Gradescope)

Announcements: Next Week
§ MONDAY 4/20: Office hours as posted on edX, and the exam itself

§ TUESDAY 4/21: No lecture, section, exam prep session, or office hours (we'll be grading your exams)

§ WEDNESDAY 4/22: No office hours, section, or exam prep session

§ THURSDAY 4/23: Lecture and office hours resume

§ FRIDAY 4/24: Office hours

Announcements: Final Contest! (Optional/EC)

§ MONDAY 4/20: Tentative Release Date!

§ TUESDAY 4/28 at 11:59pm: Final Submission Due

§ MONDAY 4/20 – TUESDAY 4/28: Leaderboard / Achievements / EC

§ THURSDAY 4/30 in Lecture: Announcement of winners

CS 188: Artificial Intelligence

Decision  Trees  and  Neural  Nets  

Instructor: Pieter Abbeel, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Today
§ Formalizing Learning
§ Consistency
§ Simplicity

§ Decision Trees
§ Expressiveness
§ Information Gain
§ Overfitting

§ Neural Nets

Inductive Learning

Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price
§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples xi
§ Find a hypothesis h(x) such that h ~ g

§ Includes:
§ Classification (multinomial outputs)
§ Regression (real outputs)

§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)

Inductive Learning
§ Curve fitting (regression, function approximation):

§ Consistency vs. simplicity
§ Ockham's razor

Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance
§ Usually algorithms prefer consistency by default (why?)
§ Several ways to operationalize "simplicity"
§ Reduce the hypothesis space
§ Assume more: e.g. independence assumptions, as in naïve Bayes
§ Have fewer, better features / attributes: feature selection
§ Other structural limitations (decision lists vs. trees)

§ Regularization
§ Smoothing: cautious use of small counts
§ Many other generalization parameters (pruning cutoffs today)
§ Hypothesis space stays big, but harder to get to the outskirts

Decision  Trees  

Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: f_TYPE=French(x) = 1


Decision Trees
§ Compact representation of a function:
§ Truth table
§ Conditional probability table
§ Regression values

§ True function
§ Realizable: in H

Expressiveness of DTs
§ Can express any function of the features

§ However, we hope for compact trees

Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?

§ For a perceptron, a feature's contribution is either positive or negative

§ If you want one feature's effect to depend on another, you have to add a new conjunction feature
§ E.g. adding "PATRONS=full ∧ WAIT = 60" allows a perceptron to model the interaction between the two atomic features

§ DTs automatically conjoin features / attributes

§ Features can have different effects in different branches of the tree!

§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
§ Though if the interactions are too complex, may not find the DT greedily

Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
= number of Boolean functions over n attributes
= number of distinct truth tables with 2^n rows
= 2^(2^n)
§ E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

§ How many trees of depth 1 (decision stumps)?
= number of Boolean functions over 1 attribute (4 truth tables with 2 rows), times n choices of attribute
= 4n
§ E.g. with 6 Boolean attributes, there are 24 decision stumps
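As a quick sanity check on this counting, here is a short Python sketch (illustrative only; the function names are my own):

```python
# Number of distinct Boolean functions over n attributes:
# one output bit for each of the 2^n rows of the truth table.
def num_boolean_functions(n):
    return 2 ** (2 ** n)

# Decision stumps: pick 1 of the n attributes, then any Boolean
# function of that single attribute (2^2 = 4 truth tables).
def num_stumps(n):
    return 4 * n

print(num_boolean_functions(6))  # 18446744073709551616
print(num_stumps(6))             # 24
```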

§ More expressive hypothesis space:
§ Increases chance that target function can be expressed (good)
§ Increases number of hypotheses consistent with training set (bad, why?)
§ Means we can get better predictions (lower bias)
§ But we may get worse predictions (higher variance)

Decision Tree Learning
§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose "most significant" attribute as root of (sub)tree

Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

§ So: we need a measure of how "good" a split is, even if the results aren't perfectly separated out

Entropy and Information
§ Information answers questions
§ The more uncertain about the answer initially, the more information in the answer
§ Scale: bits
§ Answer to Boolean question with prior?
§ Answer to 4-way question with prior?
§ Answer to 4-way question with prior?
§ Answer to 3-way question with prior?

§ A probability p is typical of:
§ A uniform distribution of size 1/p
§ A code of length log 1/p

Entropy
§ General answer: if prior is ⟨p1, …, pn⟩:
  H = Σi pi log2(1/pi)
§ Information is the expected code length

§ Also called the entropy of the distribution
§ More uniform = higher entropy
§ More values = higher entropy
§ More peaked = lower entropy
§ Rare values almost "don't count"

[Figure: example distributions with entropies of 1 bit, 0.5 bit, and 0 bits]
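The expected-code-length formula is easy to compute directly. A minimal sketch (the `entropy` helper is my own name):

```python
import math

def entropy(probs):
    """Entropy in bits: the expected code length, sum of p * log2(1/p)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # 1.0 bit  (fair coin)
print(entropy([1.0]))                     # 0.0 bits (certain answer)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform 4-way)
```

Note that outcomes with p = 0 are skipped, matching the convention that 0 · log(1/0) = 0: rare values almost "don't count".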

Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
§ Difference is the information gain
§ Problem: there's more than one distribution after the split!

§ Solution: use expected entropy, weighted by the number of examples

§ Note: hidden problem here! Gain needs to be adjusted for large-domain splits (why?)
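The weighted expected entropy can be made concrete with a short sketch (illustrative names, not the course's code):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the empirical label distribution."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy before the split minus the expected entropy after it,
    where each child subset is weighted by its fraction of the examples."""
    n = len(parent)
    expected = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - expected

parent = ["good"] * 3 + ["bad"] * 3
perfect = [["good"] * 3, ["bad"] * 3]     # split separates the labels
useless = [["good", "bad"]] * 3           # each subset mirrors the parent mix
print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```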

Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under "full"?
§ See what examples are there…
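The recursion (pick the highest-gain attribute, split, and recurse until subsets are pure) can be sketched as an ID3-style learner. This is an illustrative sketch, not the course's implementation; `build_tree` and the tuple/dict tree representation are my own choices:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of the empirical label distribution."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Entropy before splitting on attr minus expected entropy after."""
    n = len(labels)
    expected = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        expected += len(subset) / n * entropy(subset)
    return entropy(labels) - expected

def build_tree(examples, labels, attributes):
    if len(set(labels)) == 1:        # pure subset: this branch is done
        return labels[0]
    if not attributes:               # out of attributes: majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    branches = {}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        branches[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != best])
    return (best, branches)

# Tiny demo: the label equals attribute "a", so the tree splits on "a" first.
examples = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 0}, {"a": 0, "b": 1}]
labels = ["+", "+", "-", "-"]
print(build_tree(examples, labels, ["a", "b"]))
```

Greedy growth like this is exactly why very complex interactions may not be found: each split is chosen locally, never revisited.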

Example: Learned Tree
§ Decision tree learned from these 12 examples:

§ Substantially simpler than "true" tree

§ A more complex hypothesis isn't justified by data

§ Also: it's reasonable, but wrong

Example: Miles Per Gallon
40 Examples

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Find the First Split
§ Look at information gain for each attribute
§ Note that each attribute is correlated with the target!
§ What do we split on?

Result:  Decision  Stump  

Second  Level  

Final Tree

Reminder: Overfitting
§ Overfitting:
§ When you stop modeling the patterns in the training data (which generalize)
§ And start modeling the noise (which doesn't)

§ We had this before:
§ Naïve Bayes: needed to smooth
§ Perceptron: early stopping

MPG Training Error

The test set error is much worse than the training set error… why?

Consider this split

Significance of a Split
§ Starting with:

§ Three cars with 4 cylinders, from Asia, with medium HP
§ 2 bad MPG
§ 1 good MPG

§ What do we expect from a three-way split?
§ Maybe each example in its own subset?
§ Maybe just what we saw in the last slide?

§ Probably shouldn't split if the counts are so small they could be due to chance
§ A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
§ Each split will have a significance value, pCHANCE
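As a sketch of the idea (not necessarily the exact test used in the course), the chi-squared statistic compares each subset's observed label counts with the counts expected if the split were irrelevant; the statistic is then converted to a p-value using the chi-squared distribution with (#subsets − 1) × (#labels − 1) degrees of freedom:

```python
from collections import Counter

def chi_squared_statistic(children):
    """Sum over cells of (observed - expected)^2 / expected, where
    'expected' assumes each child has the same label mix as the parent."""
    parent = [label for child in children for label in child]
    n = len(parent)
    stat = 0.0
    for child in children:
        for label, count in Counter(parent).items():
            expected = count * len(child) / n
            observed = sum(1 for l in child if l == label)
            stat += (observed - expected) ** 2 / expected
    return stat

# A split whose subsets mirror the parent mix looks like pure chance:
print(chi_squared_statistic([["good", "bad"], ["good", "bad"]]))  # 0.0
# A perfectly separating split deviates strongly from chance:
print(chi_squared_statistic([["good", "good"], ["bad", "bad"]]))  # 4.0
```

A larger statistic means the split is less likely to be chance, i.e. a smaller pCHANCE.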

Keeping it General
§ Pruning:
§ Build the full decision tree
§ Begin at the bottom of the tree
§ Delete splits in which pCHANCE > MaxPCHANCE
§ Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were "redeemed" later
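The bottom-up procedure can be sketched as follows, assuming a hypothetical tree representation in which each split node stores a precomputed pCHANCE and a majority label (both names are my own):

```python
def prune(node, max_p_chance):
    """Bottom-up pruning. A node is either a leaf label (str) or a dict:
    {"children": {value: subtree}, "p_chance": float, "majority": str},
    where p_chance is the split's precomputed significance value."""
    if isinstance(node, str):
        return node                      # leaves stay as they are
    node["children"] = {v: prune(child, max_p_chance)
                        for v, child in node["children"].items()}
    all_leaves = all(isinstance(c, str) for c in node["children"].values())
    if all_leaves and node["p_chance"] > max_p_chance:
        return node["majority"]          # split looks like chance: delete it
    return node

# A split with pCHANCE = 0.3 is pruned at MaxPCHANCE = 0.1:
weak = {"children": {"asia": "bad", "europe": "good"},
        "p_chance": 0.3, "majority": "bad"}
print(prune(weak, 0.1))  # prints: bad
```

Because a node is only considered once all of its children have been reduced to leaves, an individually weak split whose descendants carry real signal is kept, matching the "redeemed later" note above.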

y = a XOR b

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

Pruning example
§ With MaxPCHANCE = 0.1:

Note the improved test set accuracy compared with the unpruned tree

Regularization
§ MaxPCHANCE is a regularization parameter
§ Generally, set it using held-out data (as usual)

[Plot: training and held-out/test accuracy as a function of MaxPCHANCE. Decreasing MaxPCHANCE gives small trees (high bias); increasing it gives large trees (high variance).]

Two Ways of Controlling Overfitting
§ Limit the hypothesis space
§ E.g. limit the max depth of trees
§ Easier to analyze

§ Regularize the hypothesis selection
§ E.g. chance cutoff
§ Disprefer most of the hypotheses unless data is clear
§ Usually done in practice

Learning Curves
§ Another important trend:
§ More data is better!
§ The same learner will generally do better with more data
§ (Except for cases where the target is absurdly simple)

Neural  Networks  

Reminder: Perceptron
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation

§ If the activation is:
§ Positive, output +1
§ Negative, output -1

[Diagram: inputs f1, f2, f3, weighted by w1, w2, w3, feed a summing unit Σ followed by a threshold test >0?]

Two-Layer Perceptron Network

[Diagram: the features f1, f2, f3 feed three hidden perceptrons (weights w11…w33, each computing a weighted sum Σ and a threshold >0?); the hidden outputs, weighted by w1, w2, w3, feed a final Σ and threshold >0?]
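A minimal sketch of this wiring in Python (illustrative; the XOR weights below are hand-picked by me to show expressiveness that a single perceptron lacks):

```python
def step(x):
    """Hard threshold: +1 if the activation is positive, else -1."""
    return 1 if x > 0 else -1

def two_layer_perceptron(f, hidden_weights, output_weights):
    """Each hidden unit thresholds its weighted sum of the inputs;
    the output unit thresholds its weighted sum of the hidden outputs."""
    hidden = [step(sum(w_i * f_i for w_i, f_i in zip(w, f)))
              for w in hidden_weights]
    return step(sum(w_i * h_i for w_i, h_i in zip(output_weights, hidden)))

# Hand-picked weights computing XOR of the first two features
# (the third "feature" is a constant 1 acting as a bias):
W_hidden = [[1, 1, -0.5],    # fires if a OR b
            [-1, -1, 1.5]]   # fires unless a AND b
w_out = [1, 1]               # fires only if both hidden units fire
for a in (0, 1):
    for b in (0, 1):
        print(a, b, two_layer_perceptron([a, b, 1], W_hidden, w_out))
```

The output is +1 exactly when a XOR b = 1, which no single perceptron over {a, b, 1} can compute.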


Learning w
§ Training examples

§ Objective:

§ Procedure: Hill Climbing

Hill Climbing
§ Simple, general idea:
§ Start wherever
§ Repeat: move to the best neighboring state
§ If no neighbors better than current, quit
§ Neighbors = small perturbations of w
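A minimal sketch of this loop for a weight vector w, using random perturbations as neighbors (the step size, neighbor count, and iteration cap are arbitrary choices of mine):

```python
import random

def hill_climb(w, loss, step=0.1, n_neighbors=20, max_iters=1000):
    """Start wherever; repeatedly move to the best of a few random
    small perturbations of w; quit when no neighbor is better."""
    random.seed(0)  # for reproducibility of this sketch
    for _ in range(max_iters):
        neighbors = [[wi + random.uniform(-step, step) for wi in w]
                     for _ in range(n_neighbors)]
        best = min(neighbors, key=loss)
        if loss(best) >= loss(w):
            return w  # local optimum: no neighboring state is better
        w = best
    return w

# Minimize a simple quadratic in one weight; the optimum is at w = 3:
w = hill_climb([0.0], lambda w: (w[0] - 3.0) ** 2)
print(w)  # close to [3.0]
```

On a convex loss like this quadratic, the local optimum it finds is the global one; the questions below are about what happens when that is not true.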

§ What's bad about this approach?
§ Complete?
§ Optimal?

§ What's particularly tricky when hill-climbing for the multi-layer perceptron?


Two-Layer Neural Network

[Diagram: the same two-layer architecture, but without the hard threshold on the output unit, so the network's output is the weighted sum Σ of the hidden units' outputs]

Neural Networks Properties
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
§ Practical considerations
§ Can be seen as learning the features
§ Large number of neurons
§ Danger of overfitting

§ Hill-climbing procedure can get stuck in bad local optima

Summary
§ Formalization of learning
§ Target function
§ Hypothesis space
§ Generalization

§ Decision Trees
§ Can encode any function
§ Top-down learning (not perfect!)
§ Information gain
§ Bottom-up pruning to prevent overfitting

§ Neural Networks
§ Learn features
§ Universal function approximators
§ Difficult to train

Next: Advanced Applications!