Reproducibility Now at Risk?
William H. Press, University of Texas at Austin
Simons Symposium on Evidence in the Natural Sciences, May 30, 2014
Reproducibility vs. Irreproducibility
• Science is based on the premise that there is an objective reality that can be exposed by reproducible experiments.
• So it strikes at the heart of science if occurrences of irreproducible experiments are increasing.
• Several recent studies have indicated that this may be the case
  – especially, but not exclusively, in biomedical research
Most of this talk is about human frailties, but some deeper foundational issues are also worth mentioning.
• "Discover" f by controlling x and measuring y.
• But f also depends on unknown parameters θ that must be determined from the data.
• The result also depends on random variables R (noise) in an arbitrarily nonlinear way, which we often linearize to "additive noise".
• So we are now measuring relations between expectations, if they exist (cf. the Cauchy distribution).
• Systematic errors are additional long-term random variables that don't average away.
• Finally, y may itself be intrinsically probabilistic, as in quantum measurement or classical chaos (e.g., turbulence).
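A minimal formalization of the relation sketched above, using the slide's own symbols; the extra symbol b for the systematic-error term is added here for illustration and is not in the original:

```latex
\[
  y \;=\; f(x,\theta,R) \;\approx\; f(x,\theta) + R
  \quad\Longrightarrow\quad
  \mathbb{E}[\,y \mid x\,] \;=\; f(x,\theta) + \mathbb{E}[R] + b ,
\]
% x: controlled input;  \theta: unknown parameters fit from the data;
% R: noise, linearized here to "additive noise";  b: a systematic error that
% does not average away under repetition.  The expectations need not exist at
% all (cf. the Cauchy distribution).
```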
For complex adaptive systems (with internal state) the very notion of probability may not make sense.
• Every time you click the button, either the Red or the Green light goes on.
• By repeated clicks, estimate the probability P(Red).
• Red fraction never converges!
[Figure: running Red fraction vs. number of clicks (note the log scale); a comparison panel shows what a process converging as t^(-1/2) would look like.]
For those mathematically inclined: would you be more surprised if I told you that the internal state of the machine is exactly statistically stationary, that is, P(state | t) does not depend on t?
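A minimal simulation sketch of one way such a machine could behave; the actual mechanism in the talk is not specified, and the alternating heavy-tailed regimes below are an assumption chosen to reproduce the non-converging behavior:

```python
import numpy as np

# Hypothetical mechanism (an assumption, not the machine from the talk):
# the light alternates between a "Red" regime and a "Green" regime, and each
# regime lasts a Pareto-tailed number of clicks with tail index alpha < 1,
# i.e. with infinite mean.  The running fraction of Red clicks then keeps
# wandering instead of settling down.

rng = np.random.default_rng(0)
alpha = 0.5                       # tail index < 1  ->  infinite-mean regime lengths
n_clicks = 10**7                  # total number of button clicks to simulate
checkpoints = np.logspace(2, 7, 11).astype(int)   # log-spaced, like a log-scale plot

reds, total, is_red, next_cp = 0, 0, True, 0
while total < n_clicks:
    length = int(rng.pareto(alpha)) + 1           # length of the current regime
    length = min(length, n_clicks - total)
    if is_red:
        reds += length
    total += length
    is_red = not is_red
    if next_cp < len(checkpoints) and total >= checkpoints[next_cp]:
        print(f"after {total:>9d} clicks: Red fraction = {reds / total:.3f}")
        while next_cp < len(checkpoints) and total >= checkpoints[next_cp]:
            next_cp += 1

# By contrast, an ordinary coin with P(Red) = 0.5 would converge to 0.5 with
# fluctuations shrinking like t**(-1/2).
```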
Ioannidis (2005) proposed a model for the continuing decrease in reproducibility.
• As science advances, it probes an increasing number of smaller effects.
• The ratio of "true relationships" to "no relationship" therefore decreases.
• Since accepted standards for statistical significance remain constant (e.g., the p-value), the fraction of false positives increases.
• A Bayesian would simply say that the prior on true relationships naturally decreases with the maturity of a field, and that we are not accounting for this by requiring stronger evidence.
• Ioannidis also attempts to model bias and discusses various sociological factors affecting it.
Ioannidis JPA (2005) PLoS Med 2(8): e124.
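To make the arithmetic concrete, here is a short sketch of the positive-predictive-value formula from Ioannidis (2005), neglecting the paper's bias term; the assumed study power of 0.8 is illustrative, not taken from the talk:

```python
# PPV = (1 - beta) * R / ((1 - beta) * R + alpha), where R is the prior odds
# that a probed relationship is true, alpha the significance threshold, and
# 1 - beta the study power.  With alpha held fixed, PPV falls as R falls.

alpha = 0.05          # accepted significance standard (p-value threshold)
power = 0.80          # 1 - beta, an assumed typical study power

for R in [1.0, 0.5, 0.1, 0.01, 0.001]:   # mature fields probe smaller prior odds
    ppv = power * R / (power * R + alpha)
    print(f"prior odds R = {R:>6}:  PPV of a positive result = {ppv:.2f}")
```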
Data of Begley and others provides some empirical evidence and has attracted popular attention
• Amgen (Begley et al.) selected 53 "landmark" papers
• Attempted to reproduce findings with a view towards clinical application
• Succeeded in reproducing 6/53 = 11%
• Amgen declines to release details of its study
  – thus raising questions about its reproducibility!
• A similar study by Bayer HealthCare reported 25% reproducibility

Begley and Ellis, Nature 483:531 (2012)
In response, a panel led by Landis (2012) has suggested a set of should-be-required protocols
My take: Highly templated to pre-clinical cancer research of average quality, a beneficial but repetitive type of research. Counterproductive and perhaps dangerous as a model for other kinds of research. Undervalues exploratory statistics, forbids modern Bayesian approaches, substitutes one checklist for another. Other than that, it's great.
Nature's "checklist" (as of May 2013)
• sample size
• inclusion criteria
• blinding
• randomization
• statistical tests used
• stopping criteria
• "Nature journals will now employ statisticians as consultants on certain papers"
Unfortunately, slavish statistical rituals are part of the problem, not necessarily the solution
Science has perhaps a more nuanced approach
• Reviewers being asked to flag particularly good papers
• Field-specific symposia to be held on best practices
• Future application to editorial process then considered
However, in any particular case, it is hard to quantify the factors that may (or may not) make work incorrect
• When it seems "too good to be true", how much selection bias is inherent in its publication?
• When the result is "small but statistically significant", how sensitive is it to unmodeled subclustering?
  – effective N much smaller than represented, so significance is reduced (a numerical sketch follows below)
• Were all alternative explanations thought of by the authors or reviewers?
  – "I didn't actually do the homework, so I'd better act especially nice."
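A minimal sketch, not from the talk, of the "unmodeled subclustering" point: data presented as 1000 independent measurements that really contain only 20 independent clusters make a null effect look significant far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

# Each simulated data set has 1000 nominal measurements drawn from only 20
# independent clusters, and there is NO real effect.  Treating the points as
# independent overstates the effective N and understates the standard error.

rng = np.random.default_rng(1)
n_sims, n_clusters, per_cluster = 2000, 20, 50

naive_hits = cluster_hits = 0
for _ in range(n_sims):
    cluster_means = rng.normal(0.0, 1.0, n_clusters)             # shared offsets, true mean 0
    data = np.repeat(cluster_means, per_cluster) + rng.normal(0.0, 0.2, n_clusters * per_cluster)
    # naive analysis: pretend all 1000 points are independent
    naive_hits += stats.ttest_1samp(data, 0.0).pvalue < 0.05
    # cluster-aware analysis: one number per cluster, effective N = 20
    cluster_hits += stats.ttest_1samp(data.reshape(n_clusters, -1).mean(axis=1), 0.0).pvalue < 0.05

print(f"false-positive rate, naive (N=1000):       {naive_hits / n_sims:.2f}")   # well above 0.05
print(f"false-positive rate, cluster-aware (N=20): {cluster_hits / n_sims:.2f}") # close to 0.05
```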
Irreproducibility is not one thing, it's a catch-all of many things, few good!
• Genuinely unknown systematic errors or confounding variables
  – the very reason that confirmatory experiments are so important!
  – science as a self-correcting enterprise
• Inadequately described experimental protocols
  – "Haec immatura a me iam frustra leguntur o.y." (Galileo)
• Bad incentive structures
  – over-competitiveness (especially now internationally)
  – hiring/promotion policies that incentivize less careful work
  – financially entrepreneurial researchers
    • convince the VC, not the referee
• Experimental technologies advancing more rapidly than the researchers using them
  – 5-year vs. 30-year timescale
  – big data makes much classical statistical training irrelevant
    • pointwise statistical errors less important; systematic errors and model selection biases may dominate
• Deficient training in scientific methodology
  – statistics taught by rote
    • R.A. Fisher's revenge (the p-value 0.05)
  – generational or cultural shifts in teaching self-critical analysis
    • what is most likely to be wrong? what is next-most likely? …
• Publication biases
  – striking results are published
  – lack of incentive to publish negative results
• Intentional scientific misconduct
Misconduct is only a very small, but nevertheless illuminating, part of the problem
• Retractions have increased since the 1970s, but are very rare, of order 0.01%
• So, even allowing for undetected cases, misconduct is a tiny part of the (order unity) reproducibility problem.
• However, the drivers of misconduct (e.g., over-competition, inadequate resources, thirst for glory, desperation) seem likely also to be drivers of "honest" but insufficiently self-critical science, hence of irreproducibility
Is mathematics exempt? Are proofs "repeatable"?
• Malfatti circles (1803): accepted until 1930
• A continuous function is almost everywhere differentiable (Ampère 1806): accepted until Weierstrass (1872)
• A convergent sum of continuous functions is continuous (Cauchy 1821): accepted until Abel 1824
• Dirichlet Principle used by Riemann (1851): accepted until Weierstrass' counterexample (1870)
• Four color map theorem (Kempe 1879): accepted until 1890; and others
• Hilbert's 21st Problem (Plemelj 1908): accepted until 1989
• "Italian School" of algebraic geometry (1885-1935): turned out to be mostly wrong
• Jacobian Conjecture (Keller 1939): accepted until 1960s counterexamples; and others
• "Perko pair": long accepted as distinct knots until Perko 1974
• The mathematical corpus of the 19th century consists of ~10,000 papers, of which some dozens are now recognized as incorrect and (at the time) important.
  – not all refereed by modern standards
• The arXiv holds ~10,000 mathematics papers, of which 4-6 per year are fairly quickly recognized as wrong
  – weakly reputational refereeing
• The world corpus of all mathematics is ~10^6 proofs. A plausible guess is that between 0.1% and 1% are unrepeatable (i.e., wrong).
What can we actually do about this?
• Recognize it as important
  – many communities now in a state of denial
• Reform statistics education
  – train on "systematic errors" as much as on "statistical errors"
  – develop a "feel for data" in addition to standard tests
  – teach simulation capability
• Develop counters to publication bias
  – prominent publication of the most important negative results
    • "seal of approval"
    • award annual prizes?
  – on-line publication (without an "importance" criterion) of all negative results
    • subsidize page charges
  – allow on-line commentary of alternative explanations and criticisms
    • without sending first to the authors!
• Incentivize self-critical thinking by researchers
  – require authors to include the "most likely three ways this paper could be wrong"
  – referee on the depth and quality of that statement
  – NOT a requirement for additional experiments (which has gotten out of hand!)
• More complete publication of protocols, data, and analysis
  – exempt from length limits
  – encapsulated virtual machines (soup to nuts)
• Tougher standards for publication of "small but significant" effects
  – e.g., require a "theory of the case" under which the effect could possibly be large under other conditions
    • because if no such theory exists, then systematic error is always the most likely explanation
• Require higher significance (and more explicit multiple hypothesis correction) for big data experiments (see the numerical sketch below)
  – because you can
  – "like physics"
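A back-of-envelope sketch of the last bullet; the numbers below are illustrative assumptions, not from the talk. With a million tested hypotheses, a per-test threshold of p < 0.05 guarantees tens of thousands of false positives, while a Bonferroni-style family-wise correction, or a physics-style 5-sigma standard, restores the intended meaning of "significant".

```python
from scipy import stats

n_tests = 1_000_000                 # e.g., a genome-wide or data-mining search (assumed)
alpha_family = 0.05                 # desired family-wise false-positive rate

expected_false_pos = 0.05 * n_tests             # false positives if every test uses p < 0.05
bonferroni_per_test = alpha_family / n_tests    # per-test threshold for family-wise 0.05
five_sigma_p = stats.norm.sf(5.0)               # one-sided tail probability beyond 5 sigma

print(f"expected false positives at p<0.05 over {n_tests:,} tests: {expected_false_pos:,.0f}")
print(f"Bonferroni per-test threshold for family-wise 0.05:        {bonferroni_per_test:.1e}")
print(f"particle-physics '5 sigma' convention corresponds to p ~   {five_sigma_p:.1e}")
```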