Category Archives: Director & Co-Directors’ Blog

A Bigger Risk Than Climate Change?

There are many future risks to our planet. The risk that seems to cause most concern is climate change. I share these concerns. However, there are some risks which, if they materialised, would be even worse than those of climate change. An asteroid strike, such as the collision 66 million years ago, is in this category. But this is a most improbable event, essentially 0% in the next 50 years.[1] There is, however, a risk that would be absolutely catastrophic, and whose probability of occurring is not remote. I speak, of course, of nuclear strike.

There are two issues to consider: the degree of the catastrophe, and its probability of occurring. Regarding the extent of the catastrophe, one can refer to the website Nukemap. Here one finds evidence-based apocalyptic predictions. In order to make sense of these it is necessary to appreciate that nuclear bombs destroy human life in three zones radiating out from the epicentre: the fireball; the shock wave; and the area of residual radiation (whose direction depends on prevailing winds). If a relatively small atomic bomb, such as a 455 kiloton warhead from a nuclear submarine, landed on Glasgow, it would kill an estimated quarter of a million people and injure half a million, not taking into account radiation damage. The 50 megaton bomb that the Soviets detonated in the upper atmosphere (Tsar Bomba) would, if it landed on London, kill over 4.5 million people and injure 3 million more (again not including the radiation damage that would most likely spread across northern Europe). In 'The Medical Implications of Nuclear War', Daugherty, Levi and Von Hippel calculate that deployment of only 1% of the world's nuclear armaments would cause up to 56 million deaths and 61 million casualties.[2] Clearly, larger conflagrations pose an existential threat that could wipe out the whole of the northern hemisphere. When I look at my lovely grandchildren, sleeping in their beds at night, I sometimes think of that. And all of the above harms exclude indirect effects resulting from collapse of law and order, financial systems, supply chains, and so on.

So, nuclear war could be catastrophic, but to calculate the net expected burden of disease and disability we need to know the probability of its occurrence. The risk of a nuclear strike must be seen as material. During, and immediately following, the Cold War there were at least three points at which missile strikes were imminent. They were all a matter of miscalculation. The most likely cause of nuclear war is a false positive signal of a strike, perhaps simulated by a terrorist group. These risks are increasing since at least eight countries now have nuclear weapons. The risk of a single incident, leading to the death of, say, 1 million people, might be as high as 50% over the next 50 years according to some models.[3] Another widely cited figure, from Hellman, is 2% per year.[4] The risk of an attack with retaliatory strikes, and hence over 50 million dead, would be lower – say 10% over the next 50 years. Estimating the risk of future events may seem quixotic, but not trying to do so is like the ostrich putting its head in the sand. Using slogans such as 'alarmist' is simply a way of avoiding uncomfortable thoughts better confronted. Let us say the risk of a strike with retaliation is indeed 10% over 50 years, and that 50 million casualties will result. If the average casualty is 40 years of age, then the expected life years lost over 50 years would be about 200,000,000 (50m x 40 x 0.1). This is without discounting, but why would one discount these lives on the basis of current time-preferences?
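The back-of-envelope arithmetic above can be made explicit. All three inputs below are the illustrative assumptions stated in the text (a 10% probability over 50 years, 50 million casualties, 40 life-years lost per casualty), not measured quantities:

```python
# Expected life-years lost = casualties x life-years lost per casualty x probability.
casualties = 50_000_000          # assumed deaths in a strike-with-retaliation scenario
life_years_per_casualty = 40     # assumed remaining life expectancy per casualty
probability = 0.10               # assumed probability of such an event over 50 years

expected_life_years_lost = casualties * life_years_per_casualty * probability
print(f"{expected_life_years_lost:,.0f}")  # 200,000,000
```

The same template makes it easy to see how sensitive the conclusion is to the assumed probability, which is the most contested input.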

Given the high expected loss of life (life years multiplied by probability), it seems that preventing nuclear war is up there with climate change, and avoiding nuclear war deserves no less attention. The effects of nuclear war are immediate and destroy infrastructure, while climate change provides plenty of warning and infrastructure can be preserved, even if at high cost. In 2014 the World Health Organization published a report estimating that climate change would be responsible for 241,000 additional deaths in the year 2030 – likely an underestimate, as their model could not quantify a number of causal pathways, such as economic damage or water scarcity.[5] But we have time to adapt and reduce this risk; nuclear war would be sudden and would disrupt coping mechanisms, leading to massive social and economic costs, along with large numbers of deaths and people diseased or maimed for life. Nuclear strike is public health enemy number one in my opinion. It is difficult to pursue the possible options to reduce this risk without entering the world of politics, so this must be pursued within the pages of your News Blog.

— Richard Lilford, CLAHRC WM Director


  1. Sentry: Earth Impact Monitoring. Impact Risk Data. 2018.
  2. Daugherty W, Levi B, Von Hippel F. Casualties Due to the Blast, Heat, and Radioactive Fallout from Various Hypothetical Nuclear Attacks on the United States. In: Solomon F & Marston RQ (eds.) The Medical Implications of Nuclear War. Washington, D.C.: National Academies Press (US); 1986.
  3. Barrett AM, Baum SD, Hostetler K. Analyzing and Reducing the Risks of Inadvertent Nuclear War Between the United States and Russia. Sci Glob Security. 2013; 21: 106-33.
  4. Hellman ME. Risk Analysis of Nuclear Deterrence. The Bent of Tau Beta Pi. 2008; Spring: 14-22.
  5. World Health Organization. Quantitative risk assessment of the effects of climate change on selected causes of death, 2030s and 2050s. Geneva: World Health Organization, 2014.



Do Poor Examination Results Predict That a Doctor Will Get into Trouble with the Regulator?

A recent paper by Richard Wakeford and colleagues [1] reports that better performance in postgraduate examinations for membership of the Royal Colleges of General Practitioners and of Physicians (MRCGP and MRCP respectively) is associated with a reduced likelihood of being sanctioned by the General Medical Council for insufficient fitness to practise. The effect was stronger for examinations of clinical skills, as opposed to those of applied medical knowledge, but was statistically significant for all five examinations studied. The unweighted mean effect size (Cohen's d) was -0.68 – i.e. doctors with sanctions had examination scores that were, on average, around two-thirds of a standard deviation below those of doctors without a sanction. The authors find a linear relationship between performance and the likelihood of a sanction, suggesting that there is no clear performance threshold at which the risk of a sanction changes abruptly.
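As a reminder of what an effect size of this kind means: Cohen's d is the difference in group means divided by the pooled standard deviation. A minimal sketch, using invented exam scores rather than the study's data (the score lists below are purely illustrative):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Sample variances (Bessel-corrected).
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical exam scores: sanctioned doctors vs. the rest.
sanctioned = [50, 60, 55, 65, 45]
not_sanctioned = [55, 65, 60, 70, 50]
print(round(cohens_d(sanctioned, not_sanctioned), 2))  # -0.63
```

A negative d of this magnitude, as in the paper, means the sanctioned group scores roughly two-thirds of a standard deviation below the comparison group on average, with substantial overlap between the two distributions.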

The main analysis does not control for the timing of the examination attempt vis-à-vis the timing of the sanction, and the authors rightly point out that having a sanction could reduce subsequent examination performance, due to the stress of being under investigation, for example. However, the results of a sub-analysis for two of the knowledge assessments (MRCGP Applied Knowledge Test, and MRCP Part 1) suggest a slightly larger effect size when only considering doctors whose examination attempt was at least two years before their sanction, so the "temporality" requirement for causation is at least partly met. We also know there is some stability in relative examination performance (and, plausibly, therefore, knowledge) over time,[2] so "reversed" timing may not be a critical bias.

This study is important as it suggests that performance on the proposed UK Medical Licensing Assessment (UKMLA), which is likely to be similar in format to both of the examinations included in this study, may be a predictor of future standards of professional practice. However, the study also suggests that it may not be possible to find a pass mark for the UKMLA that has a significant impact on the number of doctors on whom sanctions are imposed (in comparison to other possible pass marks). Given the intention to use the UKMLA as a pass/fail assessment, and the low rate of sanctions amongst doctors on the GMC register (1.6% of those on the register in January 2017 had one or more sanctions since September 2008, and the rate is even lower amongst doctors in their first decade since joining the register), it is unlikely that the introduction of the UKMLA will make a detectable difference to the rate of sanctions. As a result, other outcome measures will be needed for an evaluation of its predictive validity, even with a large sample size (around 8,000 UK candidates per year).

Nevertheless, given that at least some sanctions relate to communication (and not just clinical performance), the results of Wakeford and colleagues' study also imply that there is not necessarily a trade-off between a doctor's knowledge base and their skills relating to communication, empathy and bedside manner. This may have implications for those responsible for selection into and within the profession, as Richard Lilford and I suggested some time ago.[3] Taken to its limit, it could be argued that the expensive and often criticised situational judgement test, which is intended to evaluate the non-cognitive attributes of doctors, may not be required after all.

— Celia Brown, Associate Professor


  1. Wakeford R, Ludka K, Woolf K, McManus IC. Fitness to practise sanctions in UK doctors are predicted by poor performance at MRCGP and MRCP(UK) assessments: data linkage study. BMC Medicine. 2018; 16: 230.
  2. McManus IC, Woolf K, Dacre J, Paice E, Dewberry C. The Academic Backbone: longitudinal continuities in educational achievement from secondary school and medical school to MRCP(UK) and the specialist register in UK medical students and doctors. BMC Medicine. 2013; 11: 242.
  3. Brown CA, & Lilford RJ. Selecting medical students. BMJ. 2008; 336: 786.

Health Service and Delivery Research – a Subject of Multiple Meanings

Never has there been a topic so subject to lexicological ambiguity as that of Service Delivery Research. Many of the terms it uses are subject to multiple meanings, making communication devilishly difficult; a 'Tower of Babel' according to McKibbon, et al.[1] The result is that two people may disagree when they agree, or agree when they are fundamentally at odds. The subject is beset with 'polysemy' (one word means different things) and, to an even greater extent, 'cognitive synonyms' (different words mean the same thing).

Take the very words “Service Delivery Research”. The study by McKibbon, et al. found 46 synonyms (or near synonyms) for the underlying construct, including applied health research, management research, T2 research, implementation research, quality improvement research, and patient safety research. Some people will make strong statements as to why one of these terms is not the same as another – they will tell you why implementation research is not the same as quality improvement, for example. But seldom will two protagonists agree and give the same explanation as to why they differ, and textual exegesis of the various definitions does not support separate meanings – they all tap into the same concept, some focussing on outcomes (quality, safety) and others on the means to achieve those outcomes (implementation, management).

Let us examine some widely used terms in more detail. Take first the term “implementation”. The term can mean two quite separate things:

  1. Implementation of the findings of clinical research (e.g. if a patient has a recent onset thrombotic stroke then administer a ‘clot busting’ medicine).
  2. Implementation of the findings from HS&DR (e.g. do not use incentives when the service providers targeted by the incentive do not believe they have any control over the target).[2] [3]

Then there is my bête noire, "complex interventions". This term conflates separate ideas, such as the complexity of the intervention vs. the complexity of the system (e.g. health system) with which the intervention interacts. Alternatively, it may conflate the complexity of the intervention's components vs. the number of components it includes.

It is common to distinguish between process and outcome, à la Donabedian.[4] But this conflates two very different things – clinical process (such as prescribing the correct medicine, eliciting the relevant symptoms, or displaying appropriate affect), and service-level (upstream) process endpoints (such as favourable staff/patient ratios, or high staff morale). We have described elsewhere the methodological importance of this distinction.[5]

Intervention description is famously conflated with intervention uptake/fidelity/adaptation. The intervention description should be the intervention as planned (like the recipe), while the way the intervention is assimilated in the organisation is a finding (like the process the chef actually follows).[6]

These are just a few examples of words with multiple meanings that cause health service researchers to trip over their own feet. Some have tried to forge agreement on these various terms, but widespread agreement is yet to be achieved. In the meantime, it is important to explain precisely what is meant when we talk about implementation, processes, complexity, and so on.

— Richard Lilford, CLAHRC WM Director


  1. McKibbon KA, Lokker C, Wilczynski NL, et al. A cross-sectional study of the number and frequency of terms used to refer to knowledge translation in a body of health literature in 2006: a Tower of Babel? Implementation Science. 2010; 5: 16.
  2. Lilford RJ. Financial Incentives for Providers of Health Care: The Baggage Handler and the Intensive Care Physician. NIHR CLAHRC West Midlands News Blog. 2014 July 25.
  3. Lilford RJ. Two Things to Remember About Human Nature When Designing Incentives. NIHR CLAHRC West Midlands News Blog. 2017 January 27.
  4. Donabedian A. Explorations in quality assessment and monitoring. Health Administration Press, 1980.
  5. Lilford RJ, Chilton PJ, Hemming K, Girling AJ, Taylor CA, Barach P. Evaluating policy and service interventions: framework to guide selection and interpretation of study end points. BMJ. 2010; 341: c4413.
  6. Brown C, Hofer T, Johal A, Thomson R, Nicholl J, Franklin BD, Lilford RJ. An epistemology of patient safety research: a framework for study design and interpretation. Part 3. End points and measurement. Qual Saf Health Care. 2008; 17: 170-7.

A Casualty of Evidence-Based Medicine – Or Just One of Those Things. Balancing a Personal and Population Approach

My mother-in-law, Celia, died last Christmas. She died in a nursing care home after a short illness – a UTI that precipitated prescription of two courses of antibiotics, followed by an overwhelming C. diff infection from which she did not recover. She had suffered from mild COPD after years of cigarette smoking, although she had given up more than 35 years previously, and she also had hypertension (high blood pressure) treated with a variety of different medications (more of which later). She was an organised and sensible Jewish woman who would not let you leave her flat without a food parcel of one kind or another, and who had arranged private health insurance to have her knees and cataracts replaced in good time. Officially, medically, she had multimorbidity; unofficially her life was a full and active one, which she enjoyed. She moved house sensibly and in good time, to a much smaller warden-supervised flat with a stair lift, ready to enjoy her declining years in comfort and with support. She had a wide circle of friends, loved going out to matinées at the theatre, and was a passionate bridge player and doting grandma. So far so typical, but I wonder if indirectly she died of iatrogenesis – doctor-induced disease – and I have been worrying for some time about exactly how to understand and interpret the pattern of events that afflicted her.

A couple of weeks ago a case-control study was published in JAMA (I can already hear you say 'case control in JAMA!' Yes – and it's a good paper).[1] It helps to raise the problem of what may have happened to my son's grandma and has implications for evidence use in health care. The important issue is that my mother-in-law also suffered from recurrent syncope, or fainting, and falls. It became inconvenient – actually more than inconvenient. She would faint after getting up from a meal, after going upstairs, after rising in the morning – in fact at any time when she stood up. She fell a lot, maybe ten times that I knew about, and perhaps there were more. She badly bruised her face once, falling onto her stair lift, and on three occasions she broke bones as a result of falling: her ankle (requiring surgical intervention), her arm, and her little finger. Her GP ordered a 24-hour ECG and referred her to a cardiologist, where she had a heap of expensive investigations.

Ever the over-enthusiastic medically-qualified, meddling epidemiologist, I went with her to see her cardiologist. We had a long discussion about my presumptive diagnosis: postural hypotension – low blood pressure on standing up – and her blood pressure readings confirmed my suspicion. Postural hypotension can be caused by rare abnormalities, but one of the commonest causes is antihypertensive medication – medication for high blood pressure. The cardiologist and the GP were interested in my view, but were unhappy to change her medication. As far as they were concerned, she definitely came into the category of high blood pressure, which should be treated.

The JAMA paper describes the mortality and morbidity experience of 19,143 treated patients matched to untreated controls in the UK using CPRD data. Patients entered the study on an 'index date', defined as 12 months after the date of the third consecutive blood pressure reading in a specific range (140-159/90-99 mmHg). It says: "During a median follow-up period of 5.8 years (interquartile range, 2.6-9.0 years), no evidence of an association was found between antihypertensive treatment and mortality (hazard ratio [HR], 1.02; 95% CI, 0.88-1.17) or between antihypertensive treatment and CVD (HR, 1.09; 95% CI, 0.95-1.25). Treatment was associated with an increased risk of adverse events, including hypotension (HR, 1.69; 95% CI, 1.30-2.20; number needed to harm at 10 years [NNH10], 41), and syncope (HR, 1.28; 95% CI, 1.10-1.50; NNH10, 35)."

Translated into plain English, this implies that the high blood pressure medication did not make a difference to the outcomes that it was meant to prevent (cardiovascular disease or death). However, it did make a difference to the likelihood of getting adverse events including hypotension (low blood pressure) and syncope (fainting). The paper concludes: “This prespecified analysis found no evidence to support guideline recommendations that encourage initiation of treatment in patients with low-risk mild hypertension. There was evidence of an increased risk of adverse events, which suggests that physicians should exercise caution when following guidelines that generalize findings from trials conducted in high-risk individuals to those at lower risk.”
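For readers unfamiliar with "number needed to harm" figures, an NNH can be approximated from a hazard ratio given an assumed baseline (untreated) event risk over the same follow-up period. The sketch below uses the paper's hypotension HR of 1.69 but an entirely illustrative 4% baseline 10-year risk, which is my assumption, not a figure from the paper:

```python
def nnh_from_hr(hazard_ratio, baseline_risk):
    """Approximate number needed to harm from a hazard ratio and an
    assumed baseline event risk over the follow-up period."""
    # Survival-scale conversion: event risk in the treated group.
    treated_risk = 1 - (1 - baseline_risk) ** hazard_ratio
    absolute_risk_increase = treated_risk - baseline_risk
    return 1 / absolute_risk_increase

# HR for hypotension from the paper; the 4% baseline 10-year risk is assumed.
print(round(nnh_from_hr(1.69, 0.04)))
```

The point of the exercise is that a relative measure (HR) only translates into an absolute harm (NNH) once a baseline risk is specified, which is why the same HR can matter much more in a frail, elderly patient than in the average trial participant.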

Of course, there are plenty of possible criticisms that can never be completely ironed out of a retrospective case control study relying on routine data, even by the eagle-eyed scrutineers at CLAHRC WM and the JAMA editorial office. Were there underlying pre-existing characteristics that differentiated case and controls at inception into the study, which might affect their subsequent mortality or morbidity experience? Perhaps those who were the untreated controls were already ‘survivors’ in some way that could not be adjusted for. Was the follow-up period long enough for the participants to experience the relevant outcomes of interest? A median of 5.8 years is not long when considering the development of major cardiovascular illness. Was attention to methods of dealing with missing data adequate? For example, the study says: “Where there was no record of blood pressure lowering, statin or antiplatelet treatment, it was assumed that patients were not prescribed treatment.” Nevertheless, some patients might have been receiving prescriptions that, for whatever reason, were not properly recorded. The article is interesting, and food for thought. We must always bear in mind, however, that observational designs are subject to the play of those well-known, apparently causative variables, ‘confoundings.’[2]

What does all this mean for my mother-in-law? I did not have access to her full medical record and do not know the exact pattern of her blood pressure readings over the years. I am sure that current guidelines would clearly have stated that she should be prescribed antihypertensive medication. The risk of her having a cardiovascular event must have been high, but the falls devastated her life completely. Her individual GP and consultant took a reasonable, defensible and completely sensible decision to continue with her medication, and her falls continued. Finally, a family decision was taken that she couldn't stay in her own home – she had to be watched 24 hours a day. Her unpredictable and devastating falls were very much a factor in the decision.

Celia hated losing her autonomy and she never really agreed with the decision. From the day that the decision was taken she went downhill. She stopped eating when she went into the nursing home and wouldn't even take the family's chicken soup (the Jewish antibiotic), however lovingly prepared. It was not surprising that after a few weeks, and within days of her 89th birthday, she finally succumbed to infection and died.

How can we rationalise all this? Any prescription for any medication should be a balance of risks and benefits, and we need to assess these at both the population level, for guidelines, and at the individual level, for individuals. It’s very hard to calculate precisely how the risk of possible future cardiovascular disease (heart attack or stroke) stacked up for my mother-in-law, against the real and present danger of her falls. But I can easily see what apparently went wrong in her medical care, with the benefit of hindsight. I think that the conclusion has to be that in health care we should never lose sight of the individual. Was my mother-in-law an appropriately treated elderly woman experiencing the best of evidence-based medicine? Or was she the victim of iatrogenesis, a casualty of evidence-based medicine whose personal experiences and circumstances were not fully taken into account in the application of guidelines? Certainly, in retrospect it seems to me that I may have failed her – I wish I’d supported her more to have her health care planned around her life, rather than to have her shortened life planned around her health care.

— Aileen Clarke, Professor at Warwick Medical School


  1. Sheppard JP, Stevens S, Stevens R, et al. Benefits and Harms of Antihypertensive Treatment in Low-Risk Patients With Mild Hypertension. JAMA Intern Med. 2018.
  2. Goldacre B. Personal communication. 2018.

Evidence-Based Guidelines and Practitioner Expertise to Optimise Community Health Worker Programmes

The rapid increase in scale and scope of community health worker (CHW) programmes highlights a clear need for guidance to help programme providers optimise programme design. A new World Health Organization (WHO) guideline in this area [1] is therefore particularly welcome, and provides a complement to existing guidance based on practitioner expertise.[2] The authors of the WHO guideline undertook an overview of existing reviews (N=122 reviews with over 4,000 references included), 15 separate systematic reviews of primary studies (N=137 studies included), and a stakeholder perception survey (N=96 responses). The practitioner expertise report was developed following a consensus meeting of six CHW programme implementers, a review of over 100 programme documents, a comparison of the standard operating procedures of each implementer to identify areas of alignment and variation, and interviews with each implementer.

The volume of existing research, in terms of the number of eligible studies included in each of the 15 systematic reviews, varied widely, from no studies for the review question "Should practising CHWs work in a multi-cadre team versus in a single-cadre CHW system?" to 43 studies for the review question "Are community engagement strategies effective in improving CHW programme performance and utilization?". Across the 15 review questions, only two could be answered with "moderate" certainty of evidence (the remainder were "low" or "very low"): "What competencies should be included in the curriculum?" and "Are community engagement strategies effective?". Only three review questions had a "strong" recommendation (as opposed to "conditional"): those on Remuneration (do so financially), Contracting agreements (give CHWs a written agreement), and Community engagement (adopt various strategies). There was also a "strong" recommendation not to use marital status as a selection criterion.

The practitioner expertise report provided recommendations in eight key areas and included a series of appendices with examples of selection tools, supervision tools and performance management tools. Across the 18 design elements, there was alignment across the six implementers for 14; variation for two (Accreditation – although it is recommended that all CHW programmes include accreditation – and CHW:population ratio); and general alignment but one or more outliers for two (Career advancement – although supported by all implementers – and Supply chain management practices).

There was general agreement between the two documents in terms of the design elements that should be considered for CHW programmes (Table 1), although not including an element does not necessarily mean that the report authors do not think it is important. In terms of the specific content of the recommendations, the practitioner expertise document was generally more specific; for example, on the frequency of supervision the WHO recommend "regular support" and practitioners "at least once per month". The practitioner expertise report also included detail on selection processes, as well as selection criteria: not just what to select for, but how to put this into practice in the field. Both reports rightly highlight the need for programme implementers to consider all of the recommendations within their own local contexts; one size will not fit all. Both also highlight the need for more high-quality research. We recently found no evidence of the predictive validity of the selection tools used by Living Goods to select their CHWs,[3] although these tools are included as exemplars in the practitioner expertise report. Given the lack of high-quality evidence available to the WHO report authors, (suitably qualified) practitioner expertise is vital in the short term, and this should now be used in conjunction with the WHO report findings to agree priorities for future research.

Table 1: Comparison of design elements included in the WHO guideline and Practitioner Expertise report

[Table 1 image not reproduced here.]

— Celia Taylor, Associate Professor


  1. World Health Organization. WHO guideline on health policy and system support to optimize community health worker programmes. Geneva, Switzerland: WHO; 2018.
  2. Community Health Impact Coalition. Practitioner Expertise to Optimize Community Health Systems. 2018.
  3. Taylor CA, Lilford RJ, Wroe E, Griffiths F, Ngechu R. The predictive validity of the Living Goods selection tools for community health workers in Kenya: cohort study. BMC Health Serv Res. 2018; 18: 803.

The Same Data Set Analysed in Different Ways Yields Materially Different Parameter Estimates: The Most Important Paper I Have Read This Year

News Blog readers know that I have a healthy scepticism about the validity of econometric/regression models – in particular, about the importance of the distinction between confounding and mediating variables, the latter being variables that lie on the causal chain between explanatory and outcome variables. I therefore thank Dr Yen-Fu Chen for drawing my attention to an article by Silberzahn and colleagues.[1] They conducted a most elegant study in which 26 statistical teams analysed the same data set.

The data set concerns the game of soccer and the hypothesis that a player’s skin tone will influence propensity for a referee to issue a red card, which is some kind of reprimand to the player. The provenance of this hypothesis lies in shed loads of studies on preference for lighter skin colour across the globe and subconscious bias towards people of lighter skin colour. Based on access to various data sets that included colour photographs of players, each player’s skin colour was graded into four zones of darkness by independent observers with, as it turned out, high reliability (agreement between observers over and above that expected by chance).
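Inter-observer agreement "over and above that expected by chance" is conventionally quantified with Cohen's kappa. A minimal sketch, using invented gradings rather than the study's data (the two rating lists below are purely illustrative):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical skin-tone gradings (categories 1-4) for six players by two observers.
obs1 = [1, 2, 2, 3, 4, 4]
obs2 = [1, 2, 3, 3, 4, 4]
print(round(cohens_kappa(obs1, obs2), 2))  # 0.78
```

Kappa of 1.0 means perfect agreement; 0 means no better than chance. High reliability here matters because the graded skin tone is the study's exposure variable.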

The effect of skin colour tone on player censure by means of the red card was estimated by regression methods. Each team was free to select its preferred method, and could also select which of 16 available variables to include in its model.

The results across the 26 teams varied widely, but were positive (in the hypothesised direction) in all but one case. The ORs varied from 0.89 to 2.93, with a median estimate of 1.31. Overall, twenty teams found a significant (in each case positive) relationship. This wide variability in effect estimates was all the more remarkable given that the teams peer-reviewed each other's methods prior to analysis of the results.

All but one team took account of the clustering of players within referees, and the outlier was also the single team not to have a point estimate in the positive (hypothesised) direction. I guess this could be called a flaw in the methodology, but the remaining methodological differences between teams could not easily be classified as errors that would earn a low score in a statistics examination. Analytic techniques varied very widely, covering linear regression, logistic regression, Poisson regression, Bayesian methods, and so on, with some teams using more than one method. Regarding covariates, all teams included the number of games played under a given referee, and 69% included the player's position on the field. More than half of the teams used a unique combination of variables. Use of interaction terms does not seem to have been studied.

There was little systematic difference in estimates by the academic rank of the teams, and no association between teams' prior beliefs about what the study would show and the magnitude of the effect they estimated. This may make the results all the more remarkable, since there would have been no apparent incentive to exploit options in the analysis to produce a positive result.

What do I make of all this? First, it would seem to be good practice to use different methods to analyse a given data set, as CLAHRC West Midlands has done in recent studies,[2] [3] though this opens opportunities to selectively report methods that produce results congenial to the analyst. Second, statistical confidence limits in observational studies are far too narrow, and this should be taken into account in the presentation and use of results. Third, data should be made publicly available so that other teams can reanalyse them whenever possible. Fourth, and a point surprisingly not discussed by the authors, the analysis should be tailored to a specific scientific causal model ex ante, not ex post. That is to say, there should be a scientific rationale for the choice of potential confounders and explication of variables to be explored as potential mediating variables (i.e. variables that might be on the causal pathway).
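How much covariate choice alone can move an estimate is easy to demonstrate with a toy example (the counts below are invented and have nothing to do with the soccer data): two analysts of the same table can report materially different odds ratios depending on whether they adjust for a stratifying variable.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: exposed (a events, b non-events)
    vs. unexposed (c events, d non-events)."""
    return (a * d) / (b * c)

# Invented counts for two strata of a confounder.
# Stratum 1: exposed 10/90, unexposed 10/90 -> OR = 1.0
# Stratum 2: exposed 50/50, unexposed 5/5   -> OR = 1.0
strata = [(10, 90, 10, 90), (50, 50, 5, 5)]

adjusted = [odds_ratio(*s) for s in strata]              # analyst who stratifies
crude = odds_ratio(*[sum(col) for col in zip(*strata)])  # analyst who pools

print(adjusted)          # [1.0, 1.0]
print(round(crude, 2))   # 2.71
```

Here the stratum-specific odds ratios are both exactly 1.0, yet pooling the strata yields an odds ratio of 2.71: neither analyst has made an arithmetic error, they have simply answered different causal questions. This is the sense in which the analysis must be tied to a causal model ex ante.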

— Richard Lilford, CLAHRC WM Director


  1. Silberzahn R, Uhlmann EL, Martin DP, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv Methods Pract Psychol Sci. 2018; 1(3): 337-56.
  2. Manaseki-Holland S, Lilford RJ, Bishop JR, Girling AJ, Chen Y-F, Chilton PJ, Hofer TP; the UK Case Note Review Group. Reviewing deaths in British and US hospitals: a study of two scales for assessing preventability. BMJ Qual Saf. 2017; 26: 408-16.
  3. Mytton J, Evison F, Chilton PJ, Lilford RJ. Removal of all ovarian tissue versus conserving ovarian tissue at time of hysterectomy in premenopausal patients with benign disease: study using routine data and data linkage. BMJ. 2017; 356: j372.

Re-thinking Medical Student Written Assessment

“Patients do not walk into the clinic saying ‘I have one of these five diagnoses. Which do you think is most likely?’” (Surry et al., 2017)

The predominant form of written assessment for UK medical students is the ‘best of five multiple choice question’ (Bo5). Students are presented with a clinical scenario (usually information about a patient), a lead-in or question such as “which is the most likely diagnosis?”, and a list of five possible answers, only one of which is unambiguously correct. Bo5 questions are incredibly easy to mark, particularly in the age of computer-read answer sheets (or even computerised assessment). This is critical when results must be turned round, ratified and fed back to students in a timely manner. Because Bo5s are relatively short (UK medical schools allow a median of 72 seconds per question, compared with short answer or essay questions, for which at least 10 minutes per question would be allowed), an exam comprising Bo5 questions can cover a broad sample of the curriculum. This helps to improve the reliability of the exam: a student’s grade is not contingent on ‘what comes up in the exam’, so would have been similar had a different set of questions covering the same curriculum been used. Students not only know that their (or others’) scores are not dependent on what came up, but they are also reassured that they would get the same score regardless of who (or what) marked their paper. There are no hawk/dove issues in Bo5 marking.

On the other hand, Bo5 questions are notoriously difficult to develop. The questions used in the Medical Schools Council Assessment Alliance (MSCAA) Common Content project, in which questions are shared across UK medical schools to enable passing standards for written finals exams to be compared,[1] go through an extensive review and selection process prior to inclusion (the general process for MSCAA questions is summarised by Melville, et al. [2]). Yet the data are returned for analysis with comments such as “There is an assumption made in this question that his wife has been faithful to the man” or “Poor distractors – no indication for legionella testing”. But perhaps the greatest problem with Bo5 questions is how poorly they represent clinical practice. As the quotation at the top of this blog implies, patients do not come with a list of five possible pathologies, diagnoses, important investigations, treatment options, or management plans. While a doctor would often formulate such a list (e.g. a differential diagnosis) before determining the most likely or appropriate option, such formulation requires considerable skill. We all know that assessment drives learning, so by using Bo5 we may be inadvertently hindering students from developing the full set of clinical reasoning skills required of a doctor. There is certainly evidence that students use test-taking strategies, such as elimination of implausible answers and clue-seeking, when sitting Bo5-based exams.[3]

A new development in medical student assessment, the Very Short Answer question (VSA), therefore holds much promise. It shifts some of the academic/expert time from question development to marking but, by exploiting computer-based assessment technology, does so in a way that is not prohibitive given the turn-around times imposed by institutions. The VSA starts with the same clinical scenario as a Bo5. The lead-in changes from “Which is…?” to “What is…?” and is followed by a blank space; students are required to type between one and five words in response. A pilot of the VSA-style question showed that the list of acceptable answers for a question could be finalised by a clinical academic in just over 90 seconds for a cohort of 300 students.[4] With the finalised list automatically applied to all students’ answers, again there are no concerns regarding hawk/dove markers that would threaten the exam’s acceptability to students. While more time is required per question for VSAs than for Bo5s, the internal consistency of VSAs in the pilot was higher for the same number of questions,[4] so it should be possible to find an appropriate compromise between exam length and curriculum coverage that does not jeopardise reliability. The major gain with VSA questions is in clinical validity: these questions are more representative of actual clinical practice than Bo5s, as was reported by the students who participated in the pilot.[4]

To produce more evidence around the utility of VSAs, the MSCAA is conducting a large-scale pilot of VSA questions with final year medical students across the UK this autumn. The pilot will compare student responses and scores to Bo5 and VSA questions delivered electronically and assess the feasibility of online delivery using the MSCAA’s own exam delivery system. A small scale ‘think aloud’ study will run alongside the pilot, to compare students’ thought processes as they attempt Bo5 and VSA questions. This work will provide an initial test of the hypothesis that gains in clinical reasoning validity could be achieved with VSAs, as students are forced to think ‘outside the list of five’. There is strong support for the pilot from UK medical schools, so the results will have good national generalisability and may help to inform the design of the written component of the UK Medical Licensing Assessment.

We would love to know what others, particularly PPI representatives, think of this new development in medical student assessment.

— Celia Taylor, Associate Professor


  1. Taylor CA, Gurnell M, Melville CR, Kluth DC, Johnson N, Wass V. Variation in passing standards for graduation‐level knowledge items at UK medical schools. Med Educ. 2017; 51(6): 612-20.
  2. Melville C, Gurnell M, Wass V. #5CC14 (28171) The development of high quality Single Best Answer questions for a national undergraduate finals bank. [Abstract] Presented at: The International Association for Medical Education AMEE 2015; 2015 Oct 22; Glasgow. p. 372.
  3. Surry LT, Torre D, Durning SJ. Exploring examinee behaviours as validity evidence for multiple‐choice question examinations. Med Educ. 2017; 51(10): 1075-85.
  4. Sam AH, Field SM, Collares CF, et al. Very-short-answer questions: reliability, discrimination and acceptability. Med Educ. 2018; 52(4): 447-55.

Trials are Not Always Needed for Evaluation of Surgical Interventions: Does This House Agree?

I supported the above motion at a recent surgical trials meeting in Bristol. What were my arguments?

I argued that there were four broad categories of intervention where trials were not needed:

  1. Where causality is not in dispute

This scenario arises where, but for the intervention, a bad outcome was all but inevitable. Showing that such an outcome can be prevented in only a few cases is sufficient to put the substantive question to bed. Such an intervention is sometimes referred to as a ‘penicillin-type’ intervention. Surgical examples include heart transplantation and in vitro fertilisation (for women who have had both Fallopian tubes removed). From a philosophy of science perspective, causal thinking requires a counterfactual: what would have happened absent the intervention? In most instances a randomised trial provides the best approximation to that counterfactual. However, when the counterfactual is near-inevitable death, a few cases will be sufficient to prove the principle. Of course, this is not the end of the story. Trials of different methods within a generic class will always be needed, along with trials in cases where the indication is less clear cut, and hence where the counterfactual cannot be predicted with a high level of certainty. Nevertheless, the initial introduction of heart transplantation and in vitro fertilisation took place without any randomised trial. Nor was such a trial necessary.

  2. Speculative procedures where there is an asymmetry of outcome

This is similar to the above category, but the justification is ethical rather than scientific. I described a 15-year-old girl who was born with no vagina but a functioning uterus. She was referred to me with a pyometra, after an unsuccessful attempt to create a channel where the vagina should have been. The standard treatment in such a dire situation would have been hysterectomy. However, I offered to improvise and try an experimental procedure, using tissue expansion methods to stretch the skin at the vaginal opening and then using this skin to create a functioning channel linking the uterus to the exterior. The patient and her guardian accepted this procedure in the full knowledge that it was entirely experimental. In the event, I am glad to report that the operation was successful, producing a functional vagina and allowing regular menstruation.[1] The formal theory behind innovative practice in such dire situations comes from expected utility theory.[2] An example is explicated in the figure.

[Figure: expected utility of a risky intervention compared with no intervention and entry into an RCT]

This example relates to a person with very low life expectancy and a high-risk procedure that may either prove fatal or extend their life for a considerable time. In such a situation, the expected value of the risky procedure considerably exceeds that of doing nothing, and is preferable, from the point of view of the patient, to entry into an RCT. In fact, the expected value of the RCT (with a 1:1 randomisation ratio) is (0.5 × 0.25) + (0.5 × 1.0) = 0.625. While favourable in comparison with ‘no intervention’, this is inferior to the ‘risky intervention’.
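The arithmetic can be laid out explicitly. In this sketch the utility values (0.25 for no intervention, 1.0 for the risky intervention, on a 0-1 scale) are assumptions standing in for the figure, chosen to match the calculation above:

```python
# Expected utilities on a 0-1 scale (assumed values, matching the text's
# arithmetic). The risky procedure's mortality risk is taken to be already
# factored into its net expected value.
ev_no_intervention = 0.25  # dire prognosis without treatment
ev_risky = 1.0             # net expected value of the risky procedure

# A trial with 1:1 randomisation allocates each option with probability 0.5
p_risky_arm = 0.5
ev_rct = p_risky_arm * ev_risky + (1 - p_risky_arm) * ev_no_intervention

print(ev_rct)  # 0.625: better than doing nothing, worse than the risky option
```

From the patient's perspective, trial entry is a lottery between the two arms, so its expected value always lies between them; whenever one option is clearly preferred, randomisation is inferior to simply choosing it.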

  3. When the intervention has not been well thought through

Here my example was full frontal lobotomy. Trials and other epidemiological methods can only work out how to reach an objective, not which objective to reach or prioritise. Taking away someone’s personality is not a fair price to pay for mental calmness.

  4. When the trial is poor value for money

Trials are often expensive, and we have made them more so with extensive procedural rules; collection of end-points through routine systems is only part of the answer to this problem. Trials can therefore be a poor use of research resources: modelling shows that the value of the information they provide is sometimes exceeded by the opportunity cost.[3-5]

Of course, I am an ardent trialist. But informed consent must be fully informed, so that the preferences of the patient can come into play. I conducted an RCT of two methods of inviting patients into an RCT and showed that more and better information reduced willingness to be randomised.[6] Trial entry is justified when equipoise applies – when the ‘expected values’ of the alternative treatments are about the same.[7] The exception is when the new treatment is unlicensed. Then ‘equipoise plus’ should apply: the expected value of trial entry should equal or exceed that of standard treatment.[8]

— Richard Lilford, CLAHRC WM Director


  1. Lilford RJ, Sharpe DT, Thomas DFM. Use of tissue expansion techniques to create skin flaps for vaginoplasty. Case report. Br J Obstet Gynaecol. 1988; 95: 402-7.
  2. Lilford RJ. Trade-off between gestational age and miscarriage risk of prenatal testing: does it vary according to genetic risk? Lancet. 1990; 336: 1303-5.
  3. De Bono M, Fawdry RDS, Lilford RJ. Size of trials for evaluation of antenatal tests of fetal wellbeing in high risk pregnancy. J Perinat Med. 1990; 18(2): 77-87.
  4. Lilford R, Girling A, Braunholtz D. Cost-Utility Analysis When Not Everyone Wants the Treatment: Modeling Split-Choice Bias. Med Decis Making. 2007; 27(1): 21-6.
  5. Girling AJ, Freeman G, Gordon JP, Poole-Wilson P, Scott DA, Lilford RJ. Modeling payback from research into the efficacy of left-ventricular assist devices as destination therapy. Int J Technol Assess Health Care. 2007; 23(2): 269-77.
  6. Wragg JA, Robison EJ, Lilford RJ. Information presentation and decisions to enter clinical trials: a hypothetical trial of hormone replacement therapy. Soc Sci Med. 2000; 51(3): 453-62.
  7. Lilford RJ. Ethics of clinical trials from a Bayesian and decision analytic perspective: whose equipoise is it anyway? BMJ. 2003; 326: 980.
  8. Robinson EJ, Kerr CE, Stevens AJ, Lilford RJ, Braunholtz DA, Edwards SJ, Beck SR, Rowley MG. Lay public’s understanding of equipoise and randomisation in randomised controlled trials. Health Technol Assess. 2005; 9(8): 1-192.

Estimating Mortality Due to Low-Quality Care

A recent paper by Kruk and colleagues attempts to estimate the number of deaths caused by sub-optimal care in low- and middle-income countries (LMICs).[1] They do so by selecting 61 conditions that are highly amenable to healthcare. They estimate deaths from these conditions using the Global Burden of Disease studies. The proportion of deaths attributed to differences in health systems is estimated from the difference in death rates between LMICs and high-income countries (HICs). So if the death rate from stroke in people aged 70 to 75 is 10 per 1,000 in HICs and 20 per 1,000 in LMICs, then 10 deaths per 1,000 are deemed preventable. This ‘subtractive method’ of estimating deaths that could be prevented by improved health services simply answers the otiose question: “what would happen if low-income countries and their populations could be converted, by the wave of a wand, into high-income countries complete with populations enjoying high income from conception?” Such a reductionist approach simply replicates the well-known association between per capita GDP and life expectancy.[2]

The authors of the above paper do try to isolate the effect of institutional care from access to facilities. To make this distinction they need to estimate utilisation of services, which they do from various household surveys, conducted at selected sites around the world, containing questions about service use. So a further subtraction is performed: if half of all people deemed to be having a stroke utilise care, then half of the difference in stroke mortality is attributed to quality of care.
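As we understand it, the attribution logic reduces to two subtractions, sketched below in Python (the function and numbers are illustrative, not the authors' actual code; the stroke figures are the hypothetical ones used earlier):

```python
def split_excess_mortality(rate_lmic, rate_hic, utilisation):
    """Crude 'subtractive' attribution of excess deaths per 1,000 people.

    Excess deaths among those who reached care are blamed on poor quality;
    the remainder are blamed on failure to utilise services.
    """
    excess = rate_lmic - rate_hic  # deaths deemed preventable
    return utilisation * excess, (1 - utilisation) * excess

# Hypothetical stroke example from the text: 20 vs 10 deaths per 1,000,
# with 50% of stroke patients utilising care
quality_deaths, access_deaths = split_excess_mortality(20, 10, 0.5)
print(quality_deaths, access_deaths)  # 5.0 deaths/1,000 attributed to each
```

Laying it out this way makes the fragility plain: the split depends entirely on the utilisation estimate and assumes that those who reach care have the same prognosis at presentation as their HIC counterparts, which is exactly the assumption the points below challenge.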

Based on this methodology the authors find that the lion’s share of deaths is caused by poor-quality care, not by failure to get care. This conclusion is flawed because:

  1. The link between the databases is at a very coarse level – there is no individual linkage.
  2. As a result risk-adjustment is not possible.
  3. Further to the above, the method is crucially unable to account for delays in presentation and access to care preceding presentation that will inevitably result in large differences in prognosis at presentation.
  4. Socio-economic status and deprivation over a lifetime is associated with recovery from a condition, so differences in outcome are not due only to differences in care quality.[3]
  5. There are measurement problems at every turn. For example, Global Burden of Disease is measured in very different ways across HICs and LMICs – the latter rely heavily on verbal autopsy.
  6. Quality, as measured by crude subtractive methodologies, includes survival achieved by means of expensive high technology care. However, because of opportunity costs, introduction of effective but expensive treatments will do more harm than good in LMICs (until they are no longer LMICs).

The issue of delay in presentation is crucial. Take, for example, cancer of the cervix. In HICs the great majority of cases are diagnosed at an early, if not pre-invasive, stage. In low-income countries, however, almost all cases are already far advanced at presentation. To attribute the death rate difference to the quality of care is inappropriate. Deep in the discussion the authors state that ‘comorbidity and disease history could be different between low and high income countries which can result in some bias.’ This is an understatement, and the problem cannot be addressed by a passing mention. Later they also assert that all sensitivity analyses support the conclusion that poor healthcare is a larger driver of amenable mortality than failure to utilise services. But it is really difficult to believe such sensitivity analyses when this bias is treated so lightly.

Let us be clear: there is plenty of evidence that care is, in many respects, very sub-optimal in LMICs, and we care about trying to improve it. But we think such dramatic results, based on excessively reductionist analyses, are simply not justifiable, and in seeking attention in this way they risk undermining broader support for the important goal of improving care in LMICs. In areas from global warming to mortality during the Iraq war, we have seen the harm that marketing with unreliable methods and generalising beyond the evidence can do to a good cause, by giving fodder to those who do not want to believe that there is a problem. What is needed is careful observation and direct measurement of care quality itself, along with evaluations of the cost-effectiveness of methods to improve care. Mortality is a crude measure of care quality,[4] [5] and the extent to which healthcare reduces mortality is quite modest among older adults. The type of paper reported here topples over into marketing – it is as unsatisfying a scientific endeavour as it is sensational.

— Richard Lilford, CLAHRC WM Director

— Timothy Hofer, Professor in Division of General Medicine, University of Michigan


  1. Kruk ME, Gage AD, Joseph NT, Danaei G, García-Saisó S, Salomon JA. Mortality due to low-quality health systems in the universal health coverage era: a systematic analysis of amenable deaths in 137 countries. Lancet. 2018.
  2. Rosling H. How Does Income Relate to Life Expectancy? Gapminder. 2015.
  3. Pagano D, Freemantle N, Bridgewater B, et al. Social deprivation and prognostic benefits of cardiac surgery: observational study of 44,902 patients from five hospitals over 10 years. BMJ. 2009; 338: b902.
  4. Lilford R, Mohammed MA, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet. 2004; 363: 1147-54.
  5. Girling AJ, Hofer TP, Wu J, et al. Case-mix adjusted hospital mortality is a poor proxy for preventable mortality: a modelling study. BMJ Qual Saf. 2012; 21(12): 1052-6.

Cognitive Bias Modification for Addictive Behaviours

It can be difficult to change health behaviours. Good intentions to quit smoking or drink less alcohol, for example, do not always translate into action – or, if they do, the change doesn’t last very long. A meta-analysis of meta-analyses suggests that intentions explain, at best, a third of the variation in actual behaviour change.[1] [2] What else can be done?

One approach is to move from intentions to attention. Quite automatically, people who regularly engage in a behaviour like smoking or drinking alcohol pay more attention to smoking- and alcohol-related stimuli. To interrupt this process, ‘cognitive bias modification’ (CBM) can be used.

Amongst academics, the results of CBM have been called “striking” (p. 464),[3] prompted questions about how such a light-touch intervention can have such strong effects (p. 495),[4] and led to the development of online CBM platforms.[5]

An example of a CBM task for heavy alcohol drinkers is using a joystick to ‘push away’ pictures of beer and wine and ‘pull in’ pictures of non-alcoholic soft drinks. Alcoholic in-patients who received just an hour of this type of CBM showed a relapse rate one year later that was 13 percentage points lower than that of those who did not – 50/108 patients in the experimental group versus 63/106 in the control group.[4]
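The quoted effect can be checked directly from the raw counts (a quick sketch; the counts are as reported in the text):

```python
# Relapse proportions one year later, from the counts quoted above
relapse_cbm = 50 / 108      # experimental (CBM) group
relapse_control = 63 / 106  # control group
difference = relapse_control - relapse_cbm

print(f"CBM: {relapse_cbm:.1%}, control: {relapse_control:.1%}, "
      f"difference: {difference:.1%}")  # roughly a 13 percentage point gap
```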

Debate about the efficacy of CBM is ongoing. It appears that CBM is more effective when administered in clinical settings rather than in a lab experiment or online.[6]

— Laura Kudrna, Research Fellow


  1. Sheeran P. Intention-behaviour relations: A conceptual and empirical review. In: Stroebe W, Hewstone M (Eds.). European review of social psychology, (Vol. 12, pp. 1–36). London: Wiley; 2002.
  2. Webb TL Sheeran P. Does changing behavioral intentions engender behavior change? A meta-analysis of the experimental evidence. Psychol Bull. 2006; 132(2): 249.
  3. Sheeran P, Gollwitzer PM, Bargh JA. Nonconscious processes and health. Health Psychol. 2013; 32(5): 460.
  4. Wiers RW, Eberl C, Rinck M, Becker ES, Lindenmeyer J. Retraining automatic action tendencies changes alcoholic patients’ approach bias for alcohol and improves treatment outcome. Psychol Sci. 2011; 22(4): 490-7.
  5. London School of Economics and Political Science. New brain-training tool to help people cut drinking. 18 May 2016.
  6. Wiers RW, Boffo M, Field M. What’s in a trial? On the importance of distinguishing between experimental lab studies and randomized controlled trials: The case of cognitive bias modification and alcohol use disorders. J Stud Alcohol Drugs. 2018; 79(3): 333-43.