Reproducibility Now at Risk?

William H. Press
University of Texas at Austin
Simons Symposium on Evidence in the Natural Sciences
May 30, 2014


Reproducibility vs. Irreproducibility
• Science is based on the premise that there is an objective reality that can be exposed by reproducible experiments.
• So it strikes at the heart of science if occurrences of irreproducible experiments are increasing.
• Several recent studies have indicated that this may be the case
  – especially, but not exclusively, in biomedical research

Most of this talk is about human frailties, but some deeper foundational issues are also worth mentioning.
• "Discover" f by controlling x and measuring y.
• But f also depends on unknown parameters θ that must be determined from the data.
• Of course the result also depends on random variables R (noise), in an arbitrarily nonlinear way that we often linearize to "additive noise".
• So we are really measuring relations between expectations, if they exist at all (cf. the Cauchy distribution, which has no mean).
• Systematic errors are additional long-term random variables that don't average away.
• Finally, y may itself be intrinsically probabilistic, as in quantum measurement or classical chaos (e.g., turbulence).
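In symbols (a minimal reconstruction in my notation; the slide's original equations did not survive extraction):

\[
y = f(x) \;\longrightarrow\; y = f(x;\theta) \;\longrightarrow\; y = f(x;\theta,R) \;\approx\; f(x;\theta) + \varepsilon ,
\]

so what is actually measured is a relation between expectations,

\[
\langle y \rangle = f(x;\theta),
\]

which presumes that \(\langle y \rangle\) exists at all; for Cauchy-distributed noise it does not, and systematic errors add a further term that no amount of averaging removes.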

For complex adaptive systems (with internal state) the very notion of probability may not make sense.
• Every time you click the button, either the Red or the Green light goes on.
• By repeated clicks, estimate the probability P(Red).
• The Red fraction never converges!

[Figure: running Red fraction vs. number of clicks (note the log scale), compared with what a process converging as t^(-1/2) would look like.]

For those mathematically inclined: would you be more surprised if I told you that the internal state of the machine is exactly statistically stationary, that is, that P(state | t) does not depend on t? One mechanism producing exactly this behavior is sketched below.
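Press does not reveal the machine's internals, but one minimal mechanism with this signature (a simple stationary rule whose running Red fraction never settles) is switching between Red and Green for heavy-tailed, infinite-mean stretches. A sketch in Python; the Pareto mechanism is my illustrative assumption, not his construction:

import random

def sojourn(alpha=0.5):
    """Pareto(alpha) stretch length in clicks; alpha < 1 means infinite mean."""
    return int(random.paretovariate(alpha)) + 1

def red_fraction_trace(n_clicks=10**6, report_every=10**5):
    """Machine stays Red or Green for heavy-tailed stretches;
    print the running fraction of Red clicks."""
    is_red = random.random() < 0.5   # arbitrary starting color
    remaining = sojourn()
    reds = 0
    for t in range(1, n_clicks + 1):
        if remaining == 0:           # stretch over: switch color
            is_red = not is_red
            remaining = sojourn()
        reds += is_red
        remaining -= 1
        if t % report_every == 0:
            print(t, reds / t)       # wanders indefinitely, never converges

red_fraction_trace()
# For contrast, independent fair coin flips give a running fraction that
# converges to 1/2 with fluctuations shrinking like t**-0.5.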

Ioannidis (2005) proposed a model for the continuing decrease in reproducibility.
• As science advances, it probes an increasing number of smaller effects.
• The ratio of "true relationships" to "no relationship" therefore decreases.
• Since accepted standards for statistical significance remain constant (e.g., the p-value threshold), the fraction of false positives increases.
• A Bayesian would simply say that the prior on true relationships naturally decreases with the maturity of a field, and that we are not accounting for this by requiring stronger evidence.
• Ioannidis also attempts to model bias and discusses various sociological factors affecting it.

Ioannidis JPA (2005), PLoS Med 2(8): e124.
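The cited paper's central quantity is the positive predictive value (PPV), the post-study probability that a claimed relationship is in fact true, as a function of the pre-study odds R of a true relationship, the significance threshold α, and the power 1 − β. A short numerical sketch (the α and β values are illustrative):

def ppv(R, alpha=0.05, beta=0.20):
    """Ioannidis (2005): probability a claimed finding is true, ignoring bias."""
    return (1 - beta) * R / (R - beta * R + alpha)

for R in (1.0, 0.2, 0.05, 0.01):
    print(f"pre-study odds R = {R:5.2f}  ->  PPV = {ppv(R):.2f}")
# PPV falls from ~0.94 to ~0.14 as R drops, i.e., as a maturing field
# probes ever smaller effects at the same fixed alpha = 0.05.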

Data from Begley and others provide some empirical evidence and have attracted popular attention
• Amgen (Begley et al.) selected 53 "landmark" papers
• Attempted to reproduce the findings with a view toward clinical application
• Succeeded in reproducing 6/53 ≈ 11%
• Amgen declines to release details of its study
  – thus raising questions about its reproducibility!
• A similar study by Bayer HealthCare reported 25% reproducibility

Begley and Ellis (2012), Nature 483: 531.

In response, a panel led by Landis (2012) suggested a set of should-be-required protocols.

My take: highly templated to pre-clinical cancer research of average quality, a beneficial but repetitive type of research. Counterproductive and perhaps dangerous as a model for other kinds of research. Undervalues exploratory statistics, forbids modern Bayesian approaches, substitutes one checklist for another. Other than that, it's great.

Nature's "checklist" (as of May 2013)
• sample size
• inclusion criteria
• blinding
• randomization
• statistical tests used
• stopping criteria
"Nature journals will now employ statisticians as consultants on certain papers."

Unfortunately, slavish statistical rituals are part of the problem, not necessarily the solution.

Science has perhaps a more nuanced approach
• Reviewers being asked to flag particularly good papers
• Field-specific symposia to be held on best practices
• Future application to the editorial process then considered

However, in any particular case, it is hard to quantify factors that may (or may not) make work incorrect
• When it seems "too good to be true", how much selection bias is inherent in its publication?
• When the result is "small but statistically significant", how sensitive is it to unmodeled subclustering?
  – effective N much smaller than represented, so significance is reduced (see the sketch after this list)
• Were all alternative explanations thought of by the authors or reviewers?
  – "I didn't actually do the homework, so I'd better act especially nice."
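To make the subclustering worry concrete: for N observations in clusters of size m with intra-cluster correlation ρ, the standard "design effect" gives an effective sample size N_eff = N / (1 + (m − 1)ρ), so a p-value computed as if all N points were independent can be wildly optimistic. A sketch; the 0.1-standard-deviation effect and cluster sizes are my illustrative assumptions:

import math

def effective_n(n, cluster_size, rho):
    """Kish design effect: shrink N for intra-cluster correlation rho."""
    return n / (1 + (cluster_size - 1) * rho)

n, effect_sd = 1000, 0.1               # small effect, 0.1 sd, N = 1000
for rho in (0.0, 0.05, 0.2):
    n_eff = effective_n(n, cluster_size=50, rho=rho)
    z = effect_sd * math.sqrt(n_eff)   # z-score for a one-sample mean test
    print(f"rho = {rho:4.2f}: N_eff = {n_eff:6.1f}, z = {z:.2f}")
# rho = 0.00: N_eff = 1000.0, z = 3.16  ("significant")
# rho = 0.05: N_eff =  289.9, z = 1.70  (marginal)
# rho = 0.20: N_eff =   92.6, z = 0.96  (not significant)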

Irreproducibility is not one thing, it's a catch-all of many things, few good!
• Genuinely unknown systematic errors or confounding variables
  – the very reason that confirmatory experiments are so important!
  – science as a self-correcting enterprise
• Inadequately described experimental protocols
  – "Haec immatura a me iam frustra leguntur o.y." (Galileo's anagram concealing a discovery)
• Experimental technologies advancing more rapidly than the researchers using them
  – 5-year vs. 30-year timescale
  – big data makes much classical statistical training irrelevant
    • pointwise statistical errors less important; systematic errors and model selection biases may dominate (see the note after this list)
• Deficient training in scientific methodology
  – statistics taught by rote
    • R.A. Fisher's revenge (the p-value 0.05)
  – generational or cultural shifts in teaching self-critical analysis
    • what is most likely to be wrong? what is next-most likely? …
• Publication biases
  – striking results are published
  – lack of incentive to publish negative results
• Bad incentive structures
  – over-competitiveness (especially now internationally)
  – hiring/promotion policies that incentivize less careful work
  – financially entrepreneurial researchers
    • convince the VC, not the referee
• Intentional scientific misconduct
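On the sub-point that systematic errors come to dominate in big data, the standard error budget makes it quantitative (a textbook decomposition, not specific to these slides):

\[
\sigma_{\text{total}}^2 \;\approx\; \frac{\sigma_{\text{stat}}^2}{N} + \sigma_{\text{sys}}^2 .
\]

The first term shrinks as N grows; the second does not. Beyond N ≈ (σ_stat/σ_sys)², more data stops improving the measurement, and unmodeled systematics and model selection biases dominate the quoted error bar.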

Misconduct is only a very small, but nevertheless illuminating, part of the problem
• Retractions have increased since the 1970s, but are very rare, of order 0.01%
• So, even allowing for undetected cases, misconduct is a tiny part of the (order-unity) reproducibility problem.
• However, the drivers of misconduct (e.g., over-competition, inadequate resources, thirst for glory, desperation) seem likely also to be drivers of "honest" but insufficiently self-critical science, hence of irreproducibility

Is mathematics exempt? Are proofs "repeatable"?
• Malfatti circles (1803) accepted until 1930
• "A continuous function is almost everywhere differentiable" (Ampère 1806) accepted until Weierstrass (1872)
• "A convergent sum of continuous functions is continuous" (Cauchy 1821) accepted until Abel (1824)
• Dirichlet Principle as used by Riemann (1851) accepted until Weierstrass's counterexample (1870)
• Four color map theorem (Kempe 1879) accepted until 1890; and others
• Hilbert's 21st Problem (Plemelj 1908) accepted until 1989
• "Italian School" of algebraic geometry (1885-1935) turned out to be mostly wrong
• Jacobian Conjecture (Keller 1939) accepted until 1960s counterexamples; and others
• "Perko pair" long accepted as distinct knots until Perko (1974)

• The mathematical corpus of the 19th century consists of ~10,000 papers, of which some dozens are now recognized as both incorrect and (at the time) important.
  – not all refereed by modern standards
• arXiv holds ~10,000 mathematics papers, of which 4-6 per year are fairly quickly recognized as wrong
  – weakly reputational refereeing

The world corpus of all mathematics is ~10^6 proofs. A plausible guess is that between 0.1% and 1% are unrepeatable (i.e., wrong): of order 10^3 to 10^4 incorrect proofs.

What can we actually do about this?
• Recognize it as important
  – many communities now in a state of denial
• Reform statistics education
  – train on "systematic errors" as much as on "statistical errors"
  – develop "feel for data" in addition to standard tests
  – teach simulation capability
• Develop counters to publication bias
  – Prominent publication of the most important negative results
    • "Seal of approval"
    • Award annual prizes?
  – On-line publication (without an "importance" criterion) of all negative results
    • Subsidize page charges
  – Allow on-line commentary of alternative explanations and criticisms
    • Without sending first to the authors!
• Incentivize self-critical thinking by researchers
  – Require authors to include the "most likely three ways this paper could be wrong"
  – Referee on the depth and quality of that statement
  – NOT a requirement for additional experiments (which has gotten out of hand!)
• More complete publication of protocols, data, and analysis
  – exempt from length limits
  – encapsulated virtual machines (soup to nuts)
• Tougher standards for publication of "small but significant" effects
  – e.g., require a "theory of the case" under which the effect could possibly be large under other conditions
    • because if no such theory exists, then systematic error is always the most likely explanation
• Require higher significance (and more explicit multiple hypothesis correction) for big data experiments (see the sketch after this list)
  – because you can
  – "like physics"
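A numerical gloss on "because you can" and "like physics" (the million-hypothesis scan is an illustrative assumption; the 5σ convention is particle physics's):

from math import erf, sqrt

m, p = 10**6, 0.05                  # a big-data scan of a million hypotheses
print(f"expected false positives among true nulls: {m * p:.0f}")   # 50000

# Bonferroni: per-test threshold for a family-wise error rate of 0.05
print(f"Bonferroni per-test threshold: {0.05 / m:.1e}")            # 5.0e-08

# Physics's "5 sigma" corresponds to a one-sided p-value of ~2.9e-07
p_5sigma = 0.5 * (1 - erf(5 / sqrt(2)))
print(f"5-sigma one-sided p-value: {p_5sigma:.1e}")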