The above titled paper is one of the most widely cited in clinical epidemiology.[1] It was published in PLoS Medicine in 2005 and has 3,805 citations on Google Scholar, over 1.7m views online, and has been widely quoted in the media all over the world. The author, John Ioannidis, is arguably the world’s premier clinical epidemiologist.

The essence of Ioannidis’s argument turns on the notion of false positive study results. A false positive study result can arise in two ways: as a result of bias, or because of an alpha (type I) error. Ioannidis, in this paper, is not greatly concerned with traditional biases, such as those resulting from selection effects or lack of blinding. He is concerned, however, with dissemination bias in general, including the particular form of bias called p-hacking. P-hacking arises when an investigator performs multiple statistical tests, but selectively reports those with positive results. Since the denominator – the total number of statistical tests carried out – is not declared, a statistical adjustment for multiple comparisons is not possible, and the published findings are a skewed sample of all the findings. Ioannidis’s article is packed with examples from clinical and epidemiological research.
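To see how quickly selective reporting inflates the false positive rate, a minimal simulation (a hypothetical sketch, not taken from Ioannidis’s paper) can run many comparisons on pure noise and check whether any single one crosses the conventional 0.05 threshold:

```python
import random
import statistics
from math import sqrt

random.seed(42)
norm = statistics.NormalDist()

def null_pvalue(n=30):
    """Two-sample z-test p-value when both groups are pure noise (null is true)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (statistics.mean(a) - statistics.mean(b)) / sqrt(2 / n)
    return 2 * (1 - norm.cdf(abs(z)))

def p_hacked(n_tests=20, alpha=0.05):
    """Run n_tests null comparisons; 'succeed' if any one is significant."""
    return any(null_pvalue() < alpha for _ in range(n_tests))

n_studies = 1000
false_positive_studies = sum(p_hacked() for _ in range(n_studies))
print(f"Studies reporting a 'significant' finding: "
      f"{false_positive_studies / n_studies:.0%}")
```

Even though every individual test holds its nominal 5% error rate, a study that quietly runs 20 of them and reports the best one will claim a ‘significant’ finding roughly 1 − 0.95²⁰ ≈ 64% of the time.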

In the flurry of excitement over the somewhat sensational findings, and perhaps partly because of the awe in which the author is held, nobody seems to have asked the obvious question: why, if most research findings are false, has medical research been so spectacularly successful? How can it be that so many lives are saved by transplantation surgery, vaccination, chemotherapy, control of haemorrhage, and so on, if most research is false?

The fundamental flaw in Ioannidis’s argument is that research is not interpreted entirely by statistical convention. Scientific inference is not simply a question of acting on research results like a pilot reacting to instruments. The point of science is not to count up positive results (weighted by their size), but to generate a scientific theory. The crucial point about Semmelweis’s findings lay not in the number of postpartum deaths that he *observed*, but in the germ theory that he *inferred.*[2] The lesson is obvious: we should stand back from individual research findings, and consider research findings in the round to develop scientific theory. On a lighter note, one is reminded of Bertrand Russell’s chicken, who laid an egg each morning and was rewarded by the farmer. Empirically, the evidence pointed to a very satisfactory relationship – until Christmas day! The chicken misinterpreted the data because she did not perceive the underlying (socioeconomic) structure of which she was a part.

None of this means, of course, that Ioannidis is wrong to warn us about the perils of p-hacking, but his argument does remind us of a deep flaw in our current, and hopefully transient, way of interpreting scientific data in research – the convention of dichotomising scientific results on the notion of statistical significance. This, of course, is quite wrong, as argued elsewhere in the news blog. And to be fair, Ioannidis does take a swipe at p values along the way. What he doesn’t do is draw a proper distinction between statistical results and scientific inference: a p value is *not* a scientific finding, but an input to inference. This is not semantics – it cuts to the heart of the problem. Dichotomising results on whether or not some confidence limit excludes the null value is atheoretical and prone to mislead. The proper way to interpret a given finding is to build on theory using Bayesian ideas. Theoretical knowledge is encapsulated in a prior probability density. When the data relate directly to the parameter of interest, they are used to update this prior (if necessary after adjustment for potential bias using the method of Turner, et al.[3]). If the data are indirectly related to the parameter of interest, then the updating can be done subjectively or, if possible, through a Bayesian network analysis. Either way, when data are interpreted in this epistemologically sound way, study results are not ‘true’ or ‘false’ – they are simply the results.
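As a minimal sketch of the updating step described above (the prior and study numbers are hypothetical, purely for illustration), a conjugate normal–normal model combines a sceptical prior on a treatment effect – here a log odds ratio centred on no effect – with a study estimate by precision (inverse-variance) weighting:

```python
from math import sqrt

def normal_update(prior_mean, prior_sd, est, est_se):
    """Conjugate normal-normal update: combine a prior and a study
    estimate by precision (inverse-variance) weighting."""
    w_prior = 1 / prior_sd ** 2   # precision of the prior
    w_data = 1 / est_se ** 2      # precision of the study estimate
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est)
    return post_mean, sqrt(post_var)

# Hypothetical numbers: sceptical prior centred on no effect (log OR 0,
# sd 0.3); the study reports log OR -0.5 with standard error 0.2.
mean, sd = normal_update(prior_mean=0.0, prior_sd=0.3, est=-0.5, est_se=0.2)
print(f"posterior log OR: {mean:.2f} (sd {sd:.2f})")  # → posterior log OR: -0.35 (sd 0.17)
```

The posterior sits between the sceptical prior and the study estimate, pulled towards whichever is more precise – the result is a graded degree of belief, not a ‘true/false’ verdict on the study.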

Of course none of this is tantamount to disregarding the importance of p-hacking, which is a topic we have discussed before.[4] [5]

*— Richard Lilford, CLAHRC WM Director*

**References:**

1. Ioannidis JPA. Why Most Published Research Findings Are False. *PLoS Med*. 2005;**2**(8):e124.
2. Best M, Neuhauser D. Ignaz Semmelweis and the birth of infection control. *Qual Saf Health Care*. 2004;**13**:233-4.
3. Turner RM, Spiegelhalter DJ, Smith GC, Thompson SG. Bias modelling in evidence synthesis. *J R Stat Soc Ser A Stat Soc*. 2009;**172**(1):21-47.
4. Lilford RJ. Bullshit Detectors: Look out for ‘p-hacking’. NIHR CLAHRC West Midlands News Blog. 11 September 2015.
5. Lilford RJ. More on ‘p-hacking’. NIHR CLAHRC West Midlands News Blog. 18 December 2015.

As usual a fantastic column. I feel another factor that is increasing is the eagerness for ‘sensationalism’. This applies to both authors and editors. Negative findings are just as important as positive findings. Controversial studies get published even in big journals like this one despite obvious methodological flaws (http://www.nejm.org/doi/full/10.1056/NEJMoa072761#t=article)

The Reverend is causing muddle again, Richard. We know what we mean by research, and most of it is still false.

Lives can be “saved by transplantation surgery, vaccination, chemotherapy, control of haemorrhage, and so on” and most research still be false. That’s because for every new treatment which works there are innumerable ones which don’t, but which get a pile of P-hacking “research” studies reporting that they do, until finally a well conducted study shows correctly that they don’t.

Progesterone to prevent preterm labour or miscarriage is a recent example. Nearly a hundred P-hacking “RCTs” and about 20 systematic reviews have reported that it works – many by eminent researchers in reputable journals – but last year Coomarasamy (miscarriage) and Norman (preterm labour) showed that it doesn’t. You can never prove a negative, and it’ll take some time for the “progesterone believers” to all die off. But since those two definitive trials will likely never be repeated, for the rest of time our best estimate of truth must be that progesterone does not work.

By my estimate, that leaves a false:true ratio for progesterone treatment effectiveness research of about 120:2. And there are many more new treatments like progesterone than new treatments like renal transplantation or vaccination.

For less rigorous research methods, such as case control studies of disease aetiology, the false:true ratio is probably thousands to one.