CS 188: Artificial Intelligence
Decision Trees and Neural Nets
Pieter Abbeel, Dan Klein, University of California, Berkeley

Today
§ Formalizing Learning
  § Consistency
  § Simplicity
§ Decision Trees
  § Expressiveness
  § Information Gain
  § Overfitting
§ Neural Nets

Inductive Learning

Inductive Learning (Science)
§ Simplest form: learn a function from examples
  § A target function: g
  § Examples: input-output pairs (x, g(x))
    § E.g. x is an email and g(x) is spam / ham
    § E.g. x is a house and g(x) is its selling price
§ Problem:
  § Given a hypothesis space H
  § Given a training set of examples xi
  § Find a hypothesis h(x) such that h ~ g
§ Includes:
  § Classification (multinomial outputs)
  § Regression (real outputs)
§ How do perceptron and naïve Bayes fit in? (H, h, g, etc.)

Inductive Learning
§ Curve fitting (regression, function approximation):

Consistency vs. Simplicity
§ Fundamental tradeoff: bias vs. variance
§ Consistency vs. simplicity
  § Ockham's razor
§ Usually algorithms prefer consistency by default (why?)
§ Several ways to operationalize "simplicity"
  § Reduce the hypothesis space
    § Assume more: e.g. independence assumptions, as in naïve Bayes
    § Have fewer, better features / attributes: feature selection
    § Other structural limitations (decision lists vs. trees)
  § Regularization
    § Smoothing: cautious use of small counts
    § Many other generalization parameters (pruning cutoffs today)
    § Hypothesis space stays big, but harder to get to the outskirts

Decision Trees

Reminder: Features
§ Features, aka attributes
§ Sometimes: TYPE=French
§ Sometimes: fTYPE=French(x) = 1

Decision Trees
§ Compact representation of a function:
  § Truth table
  § Conditional probability table
  § Regression values

Expressiveness of DTs
§ Can express any function of the features
§ True function
§ Realizable: in H
§ However, we hope for compact trees

Comparison: Perceptrons
§ What is the expressiveness of a perceptron over these features?
§ For a perceptron, a feature's contribution is either positive or negative
  § If you want one feature's effect to depend on another, you have to add a new conjunction feature
  § E.g. adding "PATRONS=full ∧ WAIT=60" allows a perceptron to model the interaction between the two atomic features
§ DTs automatically conjoin features / attributes
  § Features can have different effects in different branches of the tree!
§ Difference between modeling relative evidence weighting (NB) and complex evidence interaction (DTs)
  § Though if the interactions are too complex, may not find the DT greedily

Hypothesis Spaces
§ How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions over n attributes
  = number of distinct truth tables with 2^n rows
  = 2^(2^n)
  § E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
§ How many trees of depth 1 (decision stumps)?
  = number of Boolean functions over 1 attribute, times n
  = number of truth tables with 2 rows, times n
  = 4n
  § E.g. with 6 Boolean attributes, there are 24 decision stumps
§ More expressive hypothesis space:
  § Increases chance that target function can be expressed (good)
  § Increases number of hypotheses consistent with training set (bad, why?)
    § Means we can get better predictions (lower bias)
    § But we may get worse predictions (higher variance)
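These counts are easy to check numerically; a quick sketch:

```python
# Counting the hypothesis spaces from the slide: a decision tree over n
# Boolean attributes can encode any Boolean function (2^(2^n) of them),
# while a depth-1 stump picks one of 4 one-attribute functions, times n.
n = 6
num_trees = 2 ** (2 ** n)
num_stumps = 4 * n
print(num_trees)   # 18446744073709551616
print(num_stumps)  # 24
```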

Decision Tree Learning
§ Aim: find a small tree consistent with the training examples
§ Idea: (recursively) choose "most significant" attribute as root of (sub)tree

Choosing an Attribute
§ Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
§ So: we need a measure of how "good" a split is, even if the results aren't perfectly separated out

Entropy and Information
§ Information answers questions
§ The more uncertain about the answer initially, the more information in the answer
§ Scale: bits
§ Answer to Boolean question with prior ?
§ Answer to 4-way question with prior ?
§ Answer to 4-way question with prior ?
§ Answer to 3-way question with prior ?

Entropy
§ General answer: if prior is ⟨p1, …, pn⟩:
  § Information is the expected code length: H(p) = Σi pi log2(1/pi)
§ A probability p is typical of:
  § A uniform distribution of size 1/p
  § A code of length log2(1/p)
§ Also called the entropy of the distribution
§ More uniform = higher entropy
§ More values = higher entropy
§ More peaked = lower entropy
§ Rare values almost "don't count"
(figures: example distributions with entropies 1 bit, 0.5 bit, and 0 bits)
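The entropy definition above is a one-liner; a minimal sketch:

```python
from math import log2

def entropy(distribution):
    # Expected code length in bits: H = sum_i p_i * log2(1/p_i).
    # Rare values almost "don't count": p * log2(1/p) -> 0 as p -> 0.
    return sum(p * log2(1 / p) for p in distribution if p > 0)

# More uniform = higher entropy; more peaked = lower entropy:
print(entropy([0.5, 0.5]))                # 1.0 bit (Boolean question, uniform prior)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (4-way question, uniform prior)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits (answer already known)
```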

Information Gain
§ Back to decision trees!
§ For each split, compare entropy before and after
  § Difference is the information gain
§ Problem: there's more than one distribution after the split!
§ Solution: use expected entropy, weighted by the number of examples

Next Step: Recurse
§ Now we need to keep growing the tree!
§ Two branches are done (why?)
§ What to do under "full"?
§ See what examples are there…
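The expected-entropy bookkeeping behind information gain can be sketched as follows (hypothetical helper names; examples are represented as dicts mapping attribute to value):

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    # Entropy of the empirical label distribution, in bits
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Entropy before the split, minus expected entropy after:
    # each child's entropy is weighted by its share of the examples.
    children = {}
    for x, y in zip(examples, labels):
        children.setdefault(x[attribute], []).append(y)
    expected_after = sum(len(ys) / len(labels) * label_entropy(ys)
                         for ys in children.values())
    return label_entropy(labels) - expected_after

examples = [{'cyl': 4}, {'cyl': 4}, {'cyl': 8}, {'cyl': 8}]
labels = ['good', 'good', 'bad', 'bad']
print(information_gain(examples, labels, 'cyl'))  # 1.0: the split separates perfectly
```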

Example: Learned Tree
§ Decision tree learned from these 12 examples:
§ Substantially simpler than "true" tree
  § A more complex hypothesis isn't justified by the data
§ Also: it's reasonable, but wrong

Example: Miles Per Gallon
(table: 40 examples with attributes mpg, cylinders, displacement, horsepower, weight, acceleration, modelyear, and maker; e.g. mpg ∈ {good, bad}, cylinders ∈ {4, 6, 8}, maker ∈ {america, asia, europe})

Find the First Split

Result: Decision Stump
§ Look at information gain for each attribute
§ Note that each attribute is correlated with the target!
§ What do we split on?

Second Level

Final Tree

MPG Training Error

Reminder: Overfitting
§ Overfitting:
  § When you stop modeling the patterns in the training data (which generalize)
  § And start modeling the noise (which doesn't)
§ We had this before:
  § Naïve Bayes: needed to smooth
  § Perceptron: early stopping
§ The test set error is much worse than the training set error… why?

Significance of a Split
§ Starting with:
  § Three cars with 4 cylinders, from Asia, with medium HP
    § 2 bad MPG
    § 1 good MPG
§ Consider this split
§ What do we expect from a three-way split?
  § Maybe each example in its own subset?
  § Maybe just what we saw in the last slide?
§ Probably shouldn't split if the counts are so small they could be due to chance
  § A chi-squared test can tell us how likely it is that deviations from a perfect split are due to chance*
  § Each split will have a significance value, pCHANCE
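A sketch of the chi-squared statistic behind pCHANCE (hypothetical function name; converting the statistic into an actual p-value would use the chi-squared distribution, e.g. scipy.stats.chi2.sf):

```python
def chi_squared_statistic(children, parent_fracs):
    # children: per-child class counts, e.g. [n_bad, n_good] for each subset.
    # parent_fracs: class fractions in the parent node.
    # Measures how far each child's counts deviate from what a "chance"
    # split (one that mirrors the parent's distribution) would predict.
    stat = 0.0
    for counts in children:
        n = sum(counts)
        for observed, frac in zip(counts, parent_fracs):
            expected = n * frac
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# The slide's example: the parent has 2 bad / 1 good MPG, and a three-way
# split puts each car in its own subset.
stat = chi_squared_statistic([[1, 0], [1, 0], [0, 1]], (2 / 3, 1 / 3))
```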

Keeping it General
§ Pruning:
  § Build the full decision tree
  § Begin at the bottom of the tree
  § Delete splits in which pCHANCE > MaxPCHANCE
  § Continue working upward until there are no more prunable nodes
§ Note: some chance nodes may not get pruned because they were "redeemed" later
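The bottom-up pruning loop can be sketched with a toy tree type (hypothetical `Node` class, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                     # majority label at this node
    p_chance: float = 0.0          # significance of this node's split
    children: list = field(default_factory=list)  # empty for a leaf

def prune(node, max_p_chance):
    # Begin at the bottom: prune the children first, then delete this
    # split if its deviation from chance is not significant enough.
    if not node.children:
        return node
    node.children = [prune(c, max_p_chance) for c in node.children]
    if all(not c.children for c in node.children) and node.p_chance > max_p_chance:
        return Node(label=node.label)  # collapse the split to a leaf
    return node

# A significant root split whose right child's split looks like chance:
tree = Node('bad', 0.01, [Node('bad'),
                          Node('good', 0.5, [Node('good'), Node('bad')])])
pruned = prune(tree, max_p_chance=0.1)
```

Because children are pruned first and a node is only collapsed once all its children are leaves, a chance-level split can be "redeemed" by a significant split surviving below it.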

y = a XOR b
  a  b | y
  0  0 | 0
  0  1 | 1
  1  0 | 1
  1  1 | 0

Pruning Example
§ With MaxPCHANCE = 0.1:
§ Note the improved test set accuracy compared with the unpruned tree

Regularization

Two Ways of Controlling Overfitting
§ Limit the hypothesis space
  § E.g. limit the max depth of trees
  § Easier to analyze (coming up)
§ Regularize the hypothesis selection
  § E.g. chance cutoff
  § Disprefer most of the hypotheses unless data is clear
  § Usually done in practice
§ MaxPCHANCE is a regularization parameter
§ Generally, set it using held-out data (as usual)
(plot: training and held-out / test accuracy vs. MaxPCHANCE; small MaxPCHANCE gives small trees / high bias, large MaxPCHANCE gives large trees / high variance; training accuracy grows with tree size while held-out / test accuracy peaks in between)

Neural Networks

Reminder: Perceptron
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
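The perceptron's decision rule in a few lines (a sketch; weights and features are plain lists here):

```python
def perceptron_output(weights, features):
    # Activation = weighted sum of the feature values
    activation = sum(w * f for w, f in zip(weights, features))
    # Positive activation -> output +1, negative -> output -1
    return +1 if activation > 0 else -1

print(perceptron_output([2.0, -1.0], [1, 1]))  # 1
print(perceptron_output([2.0, -1.0], [0, 1]))  # -1
```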

Two-Layer Perceptron Network
(diagram: features f1, f2, f3 feed three hidden units through weights w11…w33; each hidden unit computes a weighted sum Σ and thresholds it (>0?); the hidden unit outputs feed an output unit through weights w1, w2, w3, which sums and thresholds them in turn)
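A sketch of the forward pass through such a network, with hand-set (hypothetical) weights that compute the XOR function a single perceptron cannot express (the third input is a constant bias of 1):

```python
def step(x):
    return 1 if x > 0 else 0

def two_layer_output(W_hidden, w_out, features):
    # Each hidden unit thresholds its weighted sum of the inputs (>0?)
    hidden = [step(sum(w * f for w, f in zip(row, features))) for row in W_hidden]
    # The output unit thresholds its weighted sum of the hidden activations
    return +1 if sum(w * h for w, h in zip(w_out, hidden)) > 0 else -1

W_hidden = [[1, 1, -0.5],   # hidden unit 1 fires iff a OR b
            [1, 1, -1.5]]   # hidden unit 2 fires iff a AND b
w_out = [1, -2]             # output fires iff OR but not AND, i.e. a XOR b

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, two_layer_output(W_hidden, w_out, [a, b, 1]))
```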

Learning w
§ Training examples
§ Objective:
§ Procedure: Hill Climbing

Hill Climbing
§ Simple, general idea:
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit
  § Neighbors = small perturbations of w
§ What's bad about this approach?
  § Complete?
  § Optimal?
§ What's particularly tricky when hill-climbing for the multi-layer perceptron?
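The hill-climbing procedure above, sketched for a weight vector and an arbitrary loss (hypothetical names; neighbors are single-coordinate perturbations):

```python
def hill_climb(loss, w, step=0.1, max_iters=1000):
    # Start wherever; repeat: move to the best neighboring state;
    # if no neighbor is better than the current state, quit.
    current, current_loss = list(w), loss(w)
    for _ in range(max_iters):
        neighbors = []
        for i in range(len(current)):
            for delta in (-step, +step):   # small perturbations of w
                n = list(current)
                n[i] += delta
                neighbors.append(n)
        best = min(neighbors, key=loss)
        if loss(best) >= current_loss:
            break   # may be stuck in a local optimum, not the global one
        current, current_loss = best, loss(best)
    return current

# Toy objective with a single optimum at w = (1, -2):
w = hill_climb(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, [0.0, 0.0])
```

On a convex toy objective this converges; the multi-layer perceptron's loss surface is not convex, which is exactly why the procedure can get stuck.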


Two-Layer Neural Network
§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

Neural Networks Properties
§ Practical considerations
  § Can be seen as learning the features
  § Large number of neurons
    § Danger for overfitting
  § Hill-climbing procedure can get stuck in bad local optima

Summary
§ Formalization of learning
  § Target function
  § Hypothesis space
  § Generalization
§ Decision Trees
  § Can encode any function
  § Top-down learning (not perfect!)
  § Information gain
  § Bottom-up pruning to prevent overfitting
§ Neural Networks
  § Learn features
  § Universal function approximators
  § Difficult to train

Next …
§ Applications!