When extrapolating from trial data to a particular context, it is important to compare the trial population with the target population. Given sufficient data, treatment effects can be examined across important subgroups of patients, and the trial results can then be related to a specific subgroup, say one with less severe disease than the trial average. One problem is that trial data are collected with greater diligence than routine data. Hence the suggestion to link trial data to routine data collected on the same patients, so that subgroups of trial and non-trial patients can be compared on data recorded in a broadly similar (i.e. routine) way. This strikes me as a half-way house to the day when (most) trial data are collected by routine systems, and trials are essentially nested within routine data-collection systems.
Oh dear – the CLAHRC WM Director would so like to think that disease-specific mortality is the appropriate outcome for cancer screening trials, rather than all-cause mortality. But Black and colleagues have published a very sobering article. They found 12 trials of cancer screening (yes, only 12) in which both cancer-specific and all-cause mortality were reported. The effect size (in relative risk terms) was bigger for cancer-specific than for all-cause mortality in seven trials, about the same in four, and the other way round in one. This suggests that the benefit is greater, even in relative terms, for cancer-specific deaths than for all deaths. There are two explanations for this – one that the CLAHRC WM Director had thought of, and another that was new to him.
Investigation and treatment of false positives (including cancers that would never have presented) may increase the risk of death as a result of iatrogenesis and heightened anxiety. There is some evidence for this.
According to the ‘sticky diagnosis theory’, once a diagnostic label has been assigned, then a subsequent death is systematically more likely to be attributed to that diagnosis than if that diagnosis had not been made. There is some evidence for this hypothesis too.
And here is the thing – in screening trials only a very small proportion of people in either arm of the study die from the index disease. The corollary is that even a small increase in mortality among the large majority who were never destined to die of that disease can offset, or outweigh, the benefit among the few who were.
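A purely illustrative calculation (all the figures below are assumptions, not drawn from any particular trial) shows how little it takes for this to happen:

```python
# Illustrative only: every rate here is assumed, not taken from any trial.
n_per_arm = 100_000            # people per arm of a hypothetical screening trial
cancer_mortality = 0.005       # 0.5% of controls die of the index cancer
relative_reduction = 0.20      # screening cuts index-cancer deaths by 20%

cancer_deaths_averted = n_per_arm * cancer_mortality * relative_reduction
print(f"Index-cancer deaths averted: {cancer_deaths_averted:.0f}")   # 100

# Now suppose investigation of false positives (or misattribution of cause of
# death) raises mortality among the other 99.5% by just 0.1 percentage points.
excess_other_deaths = n_per_arm * (1 - cancer_mortality) * 0.001
print(f"Excess deaths among the rest:  {excess_other_deaths:.0f}")   # ~100

# The all-cause benefit is wiped out, even though the disease-specific
# result still looks impressive.
```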
So we have done many expensive trials, and implemented large, expensive screening programmes, yet our effects might have been nugatory. And there is a reason why so few trials have all-cause mortality outcomes – the trials have to be long and potential effects on this end-point are small and liable to be lost in the noise. Somewhere there is a ‘horizon of science’ where precision is hard to find, and where tiny biases can swamp treatment effects. At the risk of sounding nihilistic, the CLAHRC WM Director wonders whether cancer screening is such a topic.
Cluster trials very seldom use a cross-over design, because it is typically tricky to withdraw a cluster-level intervention once it has been introduced. However, as in clinical trials, the cross-over design is statistically very powerful (it yields precise estimates) in those situations where it is feasible. Such was the case in a cluster trial of methods for cardio-pulmonary resuscitation. One hundred and fourteen clusters (emergency medical services) participated. Adults with non-trauma-related cardiac arrest were managed (according to cluster and phase) with either:
Continuous chest compressions with asynchronous ventilations ten times per minute (experimental method); or
Compressions interrupted to provide ventilation at a ratio of 30 compressions to two ventilations (standard method).
Nearly 24,000 people with cardiac arrest were included in the study, and the survival rate with continuous compressions was slightly lower (at 9.0%) than with the standard interrupted method (at 9.7%). The result was not quite significant on standard statistical analysis. The CLAHRC WM Director thought the interrupted method would be the one to go for, but the accompanying editorial was equivocal – it would appear that even a trial of 24,000 participants, albeit in clusters, was not enough to resolve the issue. However, the trial methodology is certainly interesting.
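A crude, unadjusted check of the headline figures (assuming, purely for illustration, roughly 12,000 patients per arm; the published analysis accounted for clustering and covariates) gives a sense of how close to the conventional threshold the result falls:

```python
from statistics import NormalDist
from math import sqrt

# Crude two-proportion comparison of the headline survival rates,
# assuming ~12,000 patients per arm (an illustrative simplification).
n1 = n2 = 12_000
p1, p2 = 0.090, 0.097          # survival: continuous vs interrupted compressions

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"difference = {p2 - p1:.3f}, z = {z:.2f}, p = {p_value:.3f}")
# difference = 0.007, z ≈ 1.9, p ≈ 0.06 – close to, but not below, 0.05
```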
Specific treatments for specific diseases are generally evaluated in trials of modest size – high hundreds or low thousands of participants. The outcomes of interest, when binary, typically have baseline (control) rates of 5% to 10%, and worthwhile improvements (say a 20% reduction in relative risk) can be detected with reasonable precision by trials of this size. When we move to screening, vaccinations, and mass treatment programmes, however, things become more difficult, and mega (10,000–100,000 participants) or even ultra (>100,000) trials are necessary. The vitamin A trials in neonates discussed above collectively enrolled 100,038 participants, the current cohort of vitamin D trials in adults is expected to enrol in excess of 100,000 participants, and the UK Collaborative Trial of Ovarian Cancer Screening has just over 200,000 participants. Given the shape of the graph relating marginal gains in precision to marginal increases in participants (the power function), we may be reaching the ‘horizon of science’ in these topics.
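A rough sketch of why this happens, using the standard two-proportion sample-size approximation (α = 0.05 two-sided, 80% power; the baseline rates and effect size below are illustrative assumptions):

```python
from statistics import NormalDist

def n_per_arm(p_control, relative_reduction, alpha=0.05, power=0.80):
    """Approximate sample size per arm for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_treat = p_control * (1 - relative_reduction)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_a + z_b) ** 2 * var / (p_control - p_treat) ** 2

# A treatment trial: 10% baseline event rate, 20% relative reduction.
print(round(n_per_arm(0.10, 0.20)))    # ≈ 3,200 per arm – thousands suffice

# A screening trial: 0.5% baseline disease-specific mortality, same 20% reduction.
print(round(n_per_arm(0.005, 0.20)))   # ≈ 70,000 per arm – mega/ultra territory
```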
Randomised controlled trials (RCTs) are getting larger. Increased sample sizes enable researchers to achieve greater statistical precision and to detect ever smaller effect sizes against diminishing baseline rates of the primary outcomes of interest. Thus, over time, we are seeing an increase in the sample sizes of RCTs, leading to what may be termed mega-trials (>1,000 participants) and even ultra-trials (>10,000 participants). The figure below shows the minimum detectable effect size (in terms of difference from baseline) for a trial versus its sample size (with α = 0.05, power (1 − β) = 0.8, and a control group baseline of 0.1, i.e. 10%), along with a small selection of non-cluster RCTs from the New England Journal of Medicine published in the last three years. What this figure illustrates is that there are diminishing returns, in terms of statistical power, from larger sample sizes.
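The shape of that curve can be sketched with the usual normal approximation; the snippet below assumes the same parameters (α = 0.05 two-sided, 80% power, 10% control rate) and simply inverts the sample-size formula:

```python
from statistics import NormalDist
from math import sqrt

def min_detectable_difference(n_per_arm, p0=0.10, alpha=0.05, power=0.80):
    """Smallest absolute difference from the baseline rate detectable with n per arm."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Simple approximation treating both arms' variance as p0*(1-p0).
    return z * sqrt(2 * p0 * (1 - p0) / n_per_arm)

for n in (500, 1_000, 5_000, 10_000, 50_000):
    print(f"n per arm = {n:>6}: detectable difference ≈ {min_detectable_difference(n):.3f}")
# Each ten-fold increase in sample size shrinks the detectable
# difference only about three-fold – the diminishing returns in the figure.
```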
Nevertheless, with great statistical power comes great responsibility. Assuming that, because the sample size is large, a p-value greater than 0.05 is evidence that no (clinically significant) effect exists may lead to erroneous conclusions. For example, Fox et al. (2014) enrolled 19,102 participants to examine whether ivabradine improved clinical outcomes in patients with stable coronary artery disease. The estimated hazard ratio for death from cardiovascular causes or acute myocardial infarction with the treatment was 1.08, but with a p-value of 0.2, and so it was concluded that ivabradine did not improve outcomes. However, we might equally read this as evidence that ivabradine worsens outcomes. A crude calculation suggests that the minimum detectable hazard ratio in this study was 1.14, and, for a sample of this size, the results imply that almost 50 more patients died (against a baseline of 6.3%) in the treatment group. One might therefore actually see this as clinically significant.
Similarly, Roe et al. (2012) enrolled 7,243 patients to compare prasugrel and clopidogrel for acute coronary syndromes without revascularisation. The hazard ratio for death with prasugrel was 0.91, with a p-value of 0.21, and the authors concluded that prasugrel did not “significantly” reduce the risk of death. Yet, with the death rate in the clopidogrel group at 16%, a hazard ratio of 0.91 in a sample of this size represents approximately 50 fewer deaths in the prasugrel group. Again, some may argue that this is clinically significant. Notably, a quick calculation reveals that the minimum detectable effect size in this study was 0.89.
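The two back-of-envelope calculations above can be reproduced as follows (treating the hazard ratios as approximate relative risks and assuming an even split of patients between arms, both of which are simplifications):

```python
# Crude reproductions of the two calculations above.
# Hazard ratios are treated as approximate relative risks, and patients
# are assumed to be split evenly between the two arms.

def excess_events(total_n, baseline_rate, hazard_ratio):
    per_arm = total_n / 2
    expected = per_arm * baseline_rate       # events expected at the baseline rate
    return expected * (hazard_ratio - 1)     # extra (negative = fewer) events

# Fox et al. (2014): ivabradine, 19,102 patients, 6.3% baseline, HR 1.08
print(round(excess_events(19_102, 0.063, 1.08)))   # ≈ +48, almost 50 more deaths

# Roe et al. (2012): prasugrel, 7,243 patients, 16% baseline, HR 0.91
print(round(excess_events(7_243, 0.16, 0.91)))     # ≈ -52, roughly 50 fewer deaths
```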
Many authors have warned against using p-values to decide on whether an intervention has an effect or not. Mega- and ultra-trials do not reduce the folly of using p-values in this way and may even exacerbate the problem by providing a false sense of confidence.
Nearly two decades ago colleagues from the Cochrane Complementary and Alternative Medicine (CAM) Field found that a high proportion of trials originating from Eastern Asia or Eastern Europe reported positive results – nearly 100% in some cases. This was true not only of acupuncture trials but also of trials on other topics. More recently, a team led by Professor John Ioannidis, a renowned epidemiologist, conducted a meta-epidemiological study (a methodological study that examines and analyses data obtained from many systematic reviews/meta-analyses) and showed that treatment effects reported in trials conducted in less developed countries are generally larger than those reported in trials undertaken in more developed countries. Many factors could have contributed to these observations: publication bias, reporting bias, rigour of scientific conduct, differences in patient populations and disease characteristics, and genuine differences in intervention efficacy. While it is almost certain that the observation is not attributable to genuine differences in intervention efficacy alone, teasing out the influence of the various factors is not an easy task. Lately, colleagues from CLAHRC WM have compared results from cardiovascular trials conducted in Europe with those conducted in North America, and did not find a convincing difference between them. Perhaps the more interesting findings will come from the comparison between trials from Europe/America and those from Asia. The results? The paper is currently in press, so watch this space!