“P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume.”
R. Nuzzo. Scientific method: Statistical Errors. Nature. 2014; 506:150-2.
Phase 1: Theoretical Practice
When I was at medical school, the prevailing idea was that all that was needed for the sound practice of medicine was a deep understanding of physiology and pathology. Our teachers had reason to put faith in this idea. They had emerged from what is sometimes called the golden age of discovery. Improved understanding of physiology, alongside technical developments, had placed in their hands powerful treatments, such as oral contraception, mechanical ventilation, kidney dialysis, cardio-pulmonary bypass, and cancer chemotherapy. Patients could be rescued from the clutches of death in the intensive care unit by following sound physiological principles.
However, these heady discoveries soon gave way to a more deliberative process of trial and error to improve the use of generic treatment types.
Phase 2: Evidence-Based Practice
The effects of these second-order interventions were not self-evident. Again and again, randomised trials showed that our intuitions, no matter how well grounded in physiology and pathology, were often completely and utterly wrong. In short, we did not know enough about physiology and pathology to predict which treatments would do more good than harm. At first the medical profession was nonplussed by this type of direct evidence, but good arguments gradually displaced bad. Evangelists such as Archie Cochrane were followed by early adopters, such as David Sackett, Iain Chalmers and Thomas C. Chalmers, and the Evidence-Based Medicine movement was born. Pick up any of the six major journals now and you will most likely be treated to a pageant of randomised trials and systematic reviews of RCTs. RCTs continue to produce iconic results – for example, the magnificent CRASH-2 trial and the endovascular aneurysm trials.
However, more and more RCTs are inconclusive, even when they have been of considerable size and very well funded. Mainly, this is because the headroom for improvement is gradually being eroded by the success of modern medicine; if you halve the absolute effect size you are looking for, then you quadruple the necessary sample size, other things remaining unchanged. Also, as pointed out in a previous blog, we face ‘question inflation’: every question we answer in science spawns a string of subsidiary questions, and the science base produces an ever-increasing number of therapeutic targets.

This is unmasking an epistemological problem at the heart of the current evidence-based healthcare movement. The results of standard statistical tests have been treated as a decision rule. As argued before, frequentist statistics does not provide a decision rule that can be used in the interpretation of a particular result. This was emphasised by the founding fathers of frequentist statistics, Neyman and Fisher. Frequentist statistics does not yield the probabilistic estimates that are required for an axiomatic theory of decision-making. Yet practitioners of evidence-based medicine reify the p-value and confidence limits, and use them as the basis for decisions. On this view of the world, confidence limits take account of the play of chance, while procedural rules, such as randomisation, take care of bias. Statisticians, the only people who really understand the problem, keep silent – Bayesian statistics, which provides the probabilities of treatment effects, was something “you do not do in front of the children”. This led Steven Goodman (himself a statistician) to make his insightful remark: “…statistics has become too important to leave only to statisticians”.
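The sample-size arithmetic above follows from the standard normal-approximation formula, in which the required sample size is inversely proportional to the square of the target effect size. A minimal sketch (all numbers illustrative, assuming a two-arm comparison of means at 5% significance and 80% power):

```python
from statistics import NormalDist

def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-arm trial comparing means.
    Uses the normal-approximation formula: n = 2 * (z_a + z_b)^2 * (sigma/delta)^2."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z(power)           # power requirement
    return 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2

n_full = n_per_arm(delta=0.5)   # target effect: half a standard deviation
n_half = n_per_arm(delta=0.25)  # halve the target effect size...
print(n_full, n_half, n_half / n_full)  # ...and the required n quadruples
```

Because n is proportional to 1/delta², halving delta multiplies the required sample size by exactly four, whatever the other parameters.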
Phase 3: Integrating Multiple Sorts of Evidence
Of course, none of this mattered when RCTs produced iconic results that swept all before them. Under those circumstances, misinterpreting a frequentist confidence limit as a Bayesian credible limit does no harm at all – they are virtually the same thing. It is only now, as we enter the fuzzier world of small effect sizes, multiple objectives, and the need to combine data of different sorts, that the intellectual flaws in using standard statistical methods as a decision rule are assuming practical importance.
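The "virtually the same thing" point can be seen numerically: for a normally distributed estimate with a flat prior, the Bayesian posterior is centred on the estimate with the standard error as its spread, so the 95% credible limits coincide with the 95% confidence limits. A small sketch with made-up numbers:

```python
from statistics import NormalDist

# Hypothetical trial result: effect estimate and its standard error (illustrative values).
estimate, se = 2.0, 0.5
z = NormalDist().inv_cdf(0.975)

# Frequentist 95% confidence interval.
ci = (estimate - z * se, estimate + z * se)

# Bayesian 95% credible interval under a flat (improper) prior:
# the posterior is Normal(estimate, se), so the limits come out identical.
posterior = NormalDist(mu=estimate, sigma=se)
cri = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print(ci, cri)  # numerically the same interval
```

The two diverge only when the prior is informative or the model less tractable, which is precisely the "fuzzier world" the paragraph describes.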
The only way out of the mess is to think completely differently. We should start with a formal analysis of the decision problem (using expected utility theory or its elaboration into cost-utility analysis). We should then collect the relevant data and analyse them in a Bayesian paradigm, so that they provide the kind of probabilities we need (the probabilities of events under one decision or the other) and so that the different probabilities and utilities can be reconciled.
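In outline, the decision-analytic step works like this: a Bayesian analysis supplies the probability of each outcome under each option, utilities are attached to the outcomes, and the option with the highest expected utility is preferred. A toy sketch; every number here is an illustrative assumption, not from the text:

```python
# Posterior probabilities of recovery under each option, as might come
# from a Bayesian analysis of trial data (hypothetical values).
p_recover = {"treat": 0.65, "no_treat": 0.55}

# Utilities on a 0-1 scale (hypothetical), plus an assumed disutility
# for treatment burden and side-effects.
utility = {"recover": 1.0, "no_recover": 0.3}
harm_of_treatment = 0.02

def expected_utility(option):
    """Probability-weighted utility of an option, net of treatment burden."""
    p = p_recover[option]
    eu = p * utility["recover"] + (1 - p) * utility["no_recover"]
    if option == "treat":
        eu -= harm_of_treatment
    return eu

best = max(p_recover, key=expected_utility)
print({o: round(expected_utility(o), 3) for o in p_recover}, "->", best)
```

The point of the exercise is that probabilities and utilities enter the same calculation, so a small treatment benefit can be weighed explicitly against harms and costs rather than left to an informal reading of a p-value.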
When I was a medical student, evidence-based medicine was not given nearly enough prominence. Subsequently, a very simplified version, based on frequentist statistics, was presented as a one-stop shop for clinical decisions. But the best solution is one which synthesises prior knowledge and all direct comparative evidence to yield probabilities that are directly referable to the decision problem. Knowledge about the theory of treatments, and of how other similar treatments have fared, is also part of the evidence for evidence-based medicine, as Bradford Hill pointed out in his famous lecture.
I have been proselytising for Bayesian statistics for 25 years. At one point in my career I spoke to a statistician who, like me, had attracted a certain amount of ridicule. His response had been to back off. Recently, a distinguished statistician whom I like and admire told me I should stop banging on about Bayes – the argument, he said, was widely accepted intellectually. Be that as it may, the world ploughs on regardless, with doctors, nurses, psychologists and many others misunderstanding conventional statistics, and statisticians, with a few exceptions, remaining silent. The CLAHRC WM Director has absolutely no intention to stop banging on about Bayes!
- CRASH-2 Trial Collaborators. Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet. 2010; 376(9734): 23-32.
- Prinssen M, et al. A randomized trial comparing conventional and endovascular repair of abdominal aortic aneurysms. NEJM. 2004; 351: 1607-18.
- EVAR Trial Participants. Endovascular aneurysm repair versus open repair in patients with abdominal aortic aneurysm (EVAR trial 1): randomised controlled trial. Lancet. 2005; 365(9478): 2179-86.
- EVAR Trial Participants. Endovascular aneurysm repair and outcome in patients unfit for open repair of abdominal aortic aneurysm (EVAR trial 2): randomised controlled trial. Lancet. 2005; 365(9478): 2187-92.
- Lilford RJ. The End of the Hegemony of Randomised Trials. 30 Nov 2012. [Online].
- Lord JM, et al. The systemic immune response to trauma: an overview of pathophysiology and treatment. Lancet. 2014; 384: 1455-65.
- Lilford RJ, Thornton JG, Braunholtz D. Clinical trials and rare diseases: a way out of a conundrum. BMJ. 1995; 311: 1621.
- Lilford RJ, Braunholtz D. Who’s afraid of Thomas Bayes? J Epidemiol Community Health. 2000; 54: 731-9.
- Lindley DV. The philosophy of statistics. Statistician. 2000; 49(3): 293-337.
- Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999; 130(12): 995-1004.
- Lilford RJ. The Messy End of Science. 16 April 2014. [Online].
- Hill AB. The environment and disease: Association or causation? Proc R Soc Med. 1965; 58(5): 295-300.