An increasing proportion of evaluations are based on database studies.[1] There are many good reasons for this. First, there simply is not enough capacity to run randomised comparisons of all possible treatment variables. Second, some treatment variables, such as ovarian removal during hysterectomy, are directed by patient choice rather than experimental imperative.[2] Third, certain outcomes, especially those contingent on diagnostic tests, are simply too rare to evaluate by randomised trial methodology.[3] In such cases, it is appropriate to turn to database studies. When conducting database studies, it is becoming increasingly common to use machine learning rather than standard statistical methods, such as logistic regression. This article is concerned with the strengths and limitations of machine learning when databases are used to look for evidence of effectiveness.
When conducting database studies, it is right and proper to adjust for confounders and look for interaction effects. However, there is always a risk that unknown or unmeasured confounders will result in residual selection bias. Note that two types of selection are in play:
- Selection into the study.
- Once in the study, selection into one arm or another.
Here we argue that while machine learning has advantages over RCTs with respect to the former type of bias, it cannot (completely) solve the problem of selection into one type of treatment versus another.
Selection into a Study and Induction Across Place and Time (External Validity)
A machine learning system based on accumulating data across a health system has advantages with respect to the representativeness of the sample and generalisations across time and space.
First, there are no exclusions by potential participant or clinician choice that can make the sample non-representative of the population as a whole. It is true that the selection is limited to people who have reached the point where their data become available (it cannot include people who did not seek care, for example), but this caveat aside, the problem of selection into the study is strongly mitigated. (There is also the problem of ‘survivor bias’, where people are ‘missing’ from the control group because they have died, become ineligible or withdrawn from care. We shall return to this issue.)
Second, the machine can track any change in treatment effect over time, thereby providing further information to aid induction. For example, as a higher proportion of patients and clinicians adopt a new treatment, so any change in the intervention effect can be examined. Of course, the problem is not totally solved, because the possibility of different effects in other health systems (not included in the database) still exists.
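To make this concrete, here is a minimal sketch in Python of what such tracking might look like. Everything in it is hypothetical: the data frame `df` stands in for a routine-data extract, and the columns `year`, `treated` and `outcome` are invented for illustration; a real system would compute adjusted, not crude, contrasts.

```python
import pandas as pd

# Hypothetical routine-data extract: one row per patient.
#   year    - calendar year of treatment
#   treated - 1 if the patient received the new intervention
#   outcome - 1 if the adverse outcome occurred
df = pd.DataFrame({
    "year":    [2015] * 4 + [2016] * 4,
    "treated": [0, 0, 0, 1, 0, 0, 1, 1],
    "outcome": [0, 1, 0, 0, 1, 0, 0, 1],
})

# Crude risk in each arm per year, alongside uptake of the new treatment.
summary = (
    df.pivot_table(index="year", columns="treated",
                   values="outcome", aggfunc="mean")
      .rename(columns={0: "risk_untreated", 1: "risk_treated"})
)
summary["uptake"] = df.groupby("year")["treated"].mean()
summary["risk_difference"] = summary["risk_treated"] - summary["risk_untreated"]
print(summary)
```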
Selection Once in a Study (Internal Validity)
However, the machine cannot do much about selection into intervention vs. control conditions (beyond, perhaps, enabling more confounding variables to be taken into account). This is because it cannot get around the cause-effect problem that randomisation neatly solves by ensuring that unknown variables are distributed at random (leaving only lack of precision to worry about). Thus, machine learning might create the impression that a new intervention is beneficial when it is not. If the new intervention has nasty side-effects or high costs, then many patients could end up getting treatment that does more harm than good, or which fails to maximise value for money. Stability of results across strata does not vitiate the concern, since an unmeasured confounder can bias the estimate within every stratum in the same direction.
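To show why, here is a small illustrative simulation (entirely invented, using numpy only). An unmeasured severity marker `u` drives both selection into treatment and the outcome; the true treatment effect is zero. Adjusting for the measured confounder `x` leaves a clearly spurious 'effect' in the observational data, while the randomised comparison recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)   # unmeasured severity: affects selection AND outcome
x = rng.normal(size=n)   # a measured confounder we can adjust for

# Observational 'arm': clinicians steer sicker patients towards treatment.
treated_obs = (0.8 * u + 0.5 * x + rng.normal(size=n)) > 0
# Randomised 'arm': treatment allocated by coin flip.
treated_rct = rng.random(n) < 0.5

def outcomes(treated):
    # True treatment effect is exactly zero; outcome driven by u and x only.
    return u + 0.7 * x + 0.0 * treated + rng.normal(size=n)

def adjusted_effect(y, treated):
    # OLS of outcome on treatment, adjusting for the MEASURED confounder x only.
    design = np.column_stack([np.ones(n), treated, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]   # coefficient on treatment

print("database study, adjusted for x:",
      round(adjusted_effect(outcomes(treated_obs), treated_obs), 3))  # biased away from 0
print("randomised,     adjusted for x:",
      round(adjusted_effect(outcomes(treated_rct), treated_rct), 3))  # close to 0
```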
It could be argued, however, that selection effects are likely to attenuate as the intervention is rolled out over an increasing proportion of the population. Let us try a thought experiment. Consider the finding that accident victims who receive a transfusion have worse outcomes than those who do not, even after risk-adjustment. Is this because transfusion is harmful, or because clinicians can spot those who need transfusion, net of variables captured in statistical models? Let us now suppose that, in response to the findings, clinicians subsequently reduce use of transfusion. It is then possible that changes in the control rate and in the treatment effect can provide evidence for or against cause-and-effect explanations. The problem here is that bias may change as the proportions receiving one treatment or the other change. There are thus two possible explanations for any set of results: a change in bias, or a change in effectiveness as a wider range of patients and clinicians receive the experimental intervention. It is difficult to come up with a convincing way to resolve the cause-and-effect problem. I must leave it to someone cleverer than myself to devise a theorem that might shed at least some light on the plausibility of the competing explanations, bias versus cause and effect. But I am pessimistic, for this general reason. As a treatment is rolled out (because it seems effective) or withdrawn (because it seems ineffective or harmful), so the beneficial or harmful effect (even in relative risk terms) is likely to attenuate. But the bias is also likely to attenuate, because less selection is taking place. Thus the two competing explanations may be confounded.
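The following simulation (again, all numbers invented for the purpose) shows how the two explanations can mimic one another. In 'world A' transfusion is genuinely harmful and the harm shrinks as rollout extends to lower-risk patients; in 'world B' transfusion is harmless and only the selection bias shrinks. The observed risk ratios trace similar downward paths, so the data alone cannot distinguish the worlds.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def observed_risk_ratio(selection_strength, true_log_rr):
    """Crude treated/untreated risk ratio for one stage of rollout.

    selection_strength - how strongly clinicians steer sicker patients
                         (unmeasured severity u) towards transfusion
    true_log_rr        - genuine causal effect of transfusion on death
    """
    u = rng.normal(size=n)                                    # unmeasured severity
    treated = (selection_strength * u + rng.normal(size=n)) > 0
    p_death = 1.0 / (1.0 + np.exp(-(-2.0 + u + true_log_rr * treated)))
    death = rng.random(n) < p_death
    return death[treated].mean() / death[~treated].mean()

# World A: real harm, attenuating over rollout; no selection at all.
for sel, eff in [(0.0, 1.0), (0.0, 0.6), (0.0, 0.2)]:
    print("world A (true harm):      RR =", round(observed_risk_ratio(sel, eff), 2))
# World B: no harm at all; only selection bias, attenuating over rollout.
for sel, eff in [(1.0, 0.0), (0.6, 0.0), (0.2, 0.0)]:
    print("world B (selection only): RR =", round(observed_risk_ratio(sel, eff), 2))
```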
There is also the question of whether database studies can mitigate 'survivor bias'. When the process of machine learning starts, survivor bias may exist. But, by tracking estimated treatment effect over time, the machine can recognise all subsequent 'eligible' cases as they arise. This means that the problem of survivor bias should be progressively mitigated over time.
So what do I recommend? Three suggestions:
- Use machine learning to provide a clue to things that you might not have suspected or thought of as high priority for a trial.
- Nest RCTs within database studies, so that cause and effect can be established at least under specified circumstances, and then compare the results with what you would have concluded by machine learning alone.
- Use machine learning on an open-ended basis with no fixed stopping point or stopping rule, and make data available regularly to mitigate the risk of over-interpreting a random high.[4] This approach is very different to the standard 'trial' with a fixed start and end date, data-monitoring committees, 'data-lock', and all manner of highly standardised procedures. Likewise, it is different to resource-heavy statistical analysis, which must be done sparingly. Perhaps that is the real point: machine learning is inexpensive (has low marginal costs) once an ongoing database has been established, and so we can take a 'working approach', rather than a 'fixed time point' approach, to analysis. (The sketch after this list illustrates why a random high is to be expected when the data are inspected repeatedly.)
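A minimal sketch of the 'random high' risk: under a true effect of zero, the chance that at least one of many interim looks at accumulating data crosses the conventional |z| > 1.96 threshold is far above the nominal 5%. The numbers of looks and patients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
sims, looks, per_look = 2_000, 20, 200
total = looks * per_look

hits = 0
for _ in range(sims):
    # Patient-level effect estimates under a TRUE effect of zero.
    diffs = rng.normal(0.0, 1.0, size=total)
    counts = np.arange(1, total + 1)
    z = (np.cumsum(diffs) / counts) * np.sqrt(counts)   # running z-statistic
    interim_z = z[per_look - 1 :: per_look]             # the 20 scheduled looks
    hits += bool((np.abs(interim_z) > 1.96).any())

print("chance of at least one spurious 'significant' look:", hits / sims)
# Typically around 0.25, some five times the nominal 0.05.
```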
— Richard Lilford, CLAHRC WM Director
1. Lilford RJ. The End of the Hegemony of Randomised Trials. 30 Nov 2012. [Online].
2. Mytton J, Evison F, Chilton PJ, Lilford RJ. Removal of all ovarian tissue versus conserving ovarian tissue at time of hysterectomy in premenopausal patients with benign disease: study using routine data and data linkage. BMJ. 2017; 356: j372.
3. De Bono M, Fawdry RDS, Lilford RJ. Size of trials for evaluation of antenatal tests of fetal wellbeing in high risk pregnancy. J Perinat Med. 1990; 18(2): 77-87.
4. Lilford RJ, Braunholtz D, Edwards S, Stevens A. Monitoring clinical trials—interim data should be publicly available. BMJ. 2001; 323: 441.