Tag Archives: Statistics

Causal Models Should Inform Statistical Models

CLAHRC WM News Blog readers will have read the study where 30 different statistical teams analysed data from the same association study (concerning the link between a soccer player’s skin colour and predisposition for the player to receive a red card).[1] At the end of the News Blog article Richard argued that the very different results of the analysis across the statistical teams would have been greatly attenuated had they first agreed on a causal model.[2] Even in the absence of agreement on a specific analysis protocol that operationalised this model, a shared conceptual model describing the underlying causal mechanisms linking skin colour to red card totals could have led to greater consensus concerning which covariates to include in the model, whether to include them as mediating or confounding variables, and which (of the very large number of possible) first-order interactions to examine.

In this article we provide a topical example to underscore the purpose of creating a conceptual causal model to inform the building of a statistical model. As our example, we use the putative correlation between the supportiveness of nurses’ workplace (henceforth called ‘workplace’) and clinical outcomes, such as pressure ulcers and patient satisfaction (henceforth called ‘outcome’). A recent meta-analysis covering 21 studies and 22 countries found that a scale measuring workplace supportiveness correlates with outcome, and that there was no statistical evidence of publication (small study) bias.[3]

But what should be controlled for in such association studies, and how can the underlying mechanisms (the essence of realist evaluation) be better understood?

We start with a simple model:

[Causal Model Figure 1]

Could the association be confounded? Say by hospital size? In that case it might immediately seem right to control for hospital size:

[Causal Model Figure 2]

Not so fast, we say! A confounder is a variable that is associated with both the explanatory variable (in this case workplace) and the outcome variable, but which is not on the causal chain linking explanatory variable and outcome. Economists call a variable that lies on the causal chain linking explanatory and outcome variables ‘endogenous’.

Don’t large hospitals have economies of scale? And don’t economies of scale allow them to have better nurse-patient ratios (hereafter called nurse/patient)? So a causal model relating hospital size, workplace and outcome might look like this:

[Causal Model Figure 3]

Now that we have constructed a causal model we are stimulated to theorise further. What about ‘leadership’, I hear you say? Leadership may not be randomly distributed; the larger hospitals are likely to get first pick. So we might agree on a model like this:

[Causal Model Figure 4]

It might also be necessary to control for possible confounders at the individual patient level that are not on the causal chain. Then our model may look like this:

[Causal Model Figure 5]

Presented like this, our model suggests additional variables to measure and include, as well as a need to account for variables that are likely to be on the causal chain, serving as mediating variables. Building causal conceptual models in this way can be formalised and extended using “directed acyclic graphs”, which hold out the promise “that a researcher who has scientific knowledge in the form of a structural equation model is able to predict patterns of independencies in the data, based solely on the structure of the model’s graph, without relying on any quantitative information carried by the equations or by the distributions of the errors.”[4] If those independencies are borne out by the data that are collected, this provides evidence for the model and the specified causal mechanisms. While there are challenges in building statistical models to analyse data using the more complex of these conceptual causal models, the resulting analyses are more likely to advance our theoretical understanding of the world.
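To make the confounder/mediator distinction concrete, here is a minimal simulation sketch in Python. The structure and variable names are our own toy assumptions, not the model in the figures above: hospital size acts as a confounder of the workplace-outcome association, while the nurse/patient ratio acts as a mediator. Adjusting for the confounder removes bias, whereas additionally adjusting for the mediator strips out the very effect we are trying to estimate.

```python
# A toy simulation (not the model in the post) of confounding versus mediation.
# Assumed structure:
#   hospital_size -> workplace,  hospital_size -> outcome   (confounding path)
#   workplace -> nurse_ratio -> outcome                     (mediated effect)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20_000

hospital_size = rng.normal(size=n)                     # confounder
workplace = 0.6 * hospital_size + rng.normal(size=n)   # exposure of interest
nurse_ratio = 0.8 * workplace + rng.normal(size=n)     # mediator
outcome = 0.5 * nurse_ratio + 0.4 * hospital_size + rng.normal(size=n)
# True total effect of workplace on outcome = 0.8 * 0.5 = 0.4

def workplace_coef(covariates):
    """OLS coefficient on workplace, given a list of covariate arrays (workplace first)."""
    X = sm.add_constant(np.column_stack(covariates))
    return sm.OLS(outcome, X).fit().params[1]

print("unadjusted:                 %.2f" % workplace_coef([workplace]))                      # biased upwards
print("adjusted for confounder:    %.2f" % workplace_coef([workplace, hospital_size]))       # ~0.40, total effect
print("also adjusted for mediator: %.2f" % workplace_coef([workplace, hospital_size, nurse_ratio]))  # ~0, direct effect only
```

The point of the sketch is simply that the same regression machinery gives three different answers depending on which variables are conditioned on, and only the causal model tells you which answer addresses your question.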

— Richard Lilford, CLAHRC WM Director

— Timothy Hofer, Professor of General Medicine

References:

  1. Silberzahn R, Uhlmann EL, Martin DP, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv Methods Pract Psychol Sci. 2018; 1(3): 337-56.
  2. Lilford RJ. The Same Data Set Analysed in Different Ways Yields Materially Different Parameter Estimates: The Most Important Paper I Have Read This Year. NIHR CLAHRC West Midlands News Blog. 16 November 2018.
  3. Lake ET, Sanders J, Duan R, Riman KA, Schoenauer KM, Chen Y. A Meta-Analysis of the Associations Between the Nurse Work Environment in Hospitals and 4 Sets of Outcomes. Med Care. 2019; 57(5): 353-61.
  4. Pearl J, Glymour M, Jewell NP. Causal Inference in Statistics: A Primer. Chichester: Wiley; 2016. p. 35.

Now the Journal Nature Attacks P Values

News blog readers know that I am a long-standing critic of decision making based on significance tests. ‘Nature’ has now added its voice to the long list of important organisations that have criticised the widespread use of P values to inform decision making.[1] The American Statistical Association has also published a series of articles on the inappropriate weight attached to the traditional threshold of statistical significance![2]

‘Nature’ is less clear on what should replace (or I would say augment) the standard significance test. The answer, of course, is obvious – Bayesian statistics, based on informative priors. We are specialising in the use of Bayesian networks to incorporate all salient evidence across causal chains, to estimate parameters and their credible limits.[3]
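As a concrete (and deliberately over-simplified) illustration of what augmenting an estimate with an informative prior looks like, here is a sketch in Python using a normal-normal conjugate update. The prior and trial numbers are invented for illustration; a full Bayesian network analysis of the kind described in [3] goes well beyond this.

```python
# A minimal normal-normal conjugate update (illustrative numbers only): combine an
# informative prior for a log odds ratio with a trial estimate to obtain a posterior
# and credible interval, rather than asking only whether p < 0.05.
import numpy as np
from scipy import stats

prior_mean, prior_sd = 0.0, 0.20   # assumed prior: effect probably modest either way
data_mean, data_sd = -0.25, 0.15   # assumed log odds ratio and standard error from the trial

prior_prec, data_prec = 1 / prior_sd**2, 1 / data_sd**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
post_sd = post_prec ** -0.5

lo, hi = stats.norm.ppf([0.025, 0.975], loc=post_mean, scale=post_sd)
print(f"posterior log OR: {post_mean:.3f} (95% credible interval {lo:.3f} to {hi:.3f})")
print(f"P(benefit), i.e. P(log OR < 0): {stats.norm.cdf(0, loc=post_mean, scale=post_sd):.2f}")
```

The output is a direct probability statement about the size of the effect, which is what decision makers actually need, with the prior doing the ‘augmenting’ that the standard significance test cannot.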

— Richard Lilford, CLAHRC WM Director

References:

  1. Nature. It’s Time to Talk About Ditching Statistical Significance. Nature. 2019; 567: 283.
  2. Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p < 0.05”. Am Stat. 2019; 73(s1): 1-19.
  3. Watson SI & Lilford RJ. Essay 1: Integrating Multiple Sources of Evidence: a Bayesian Perspective. In: Challenges, solutions and future directions in the evaluation of service innovations in health care and public health. Southampton (UK): NIHR Journals Library, 2016.

The Same Data Set Analysed in Different Ways Yields Materially Different Parameter Estimates: The Most Important Paper I Have Read This Year

News blog readers know that I have a healthy scepticism about the validity of econometric/regression models. In particular, I am conscious of the importance of the distinction between confounding and mediating variables, the latter being variables that lie on the causal chain between explanatory and outcome variables. I therefore thank Dr Yen-Fu Chen for drawing my attention to an article by Silberzahn and colleagues.[1] They conducted a most elegant study in which 26 statistical teams analysed the same data set.

The data set concerns the game of soccer and the hypothesis that a player’s skin tone will influence the propensity for a referee to issue a red card, which results in the player being sent off (the most severe censure a referee can impose). The provenance of this hypothesis lies in shed loads of studies on preference for lighter skin colour across the globe and subconscious bias towards people of lighter skin colour. Based on access to various data sets that included colour photographs of players, each player’s skin colour was graded into four zones of darkness by independent observers with, as it turned out, high reliability (agreement between observers over and above that expected by chance).

The effect of skin tone on player censure by means of the red card was estimated by regression methods. Each team was free to select its preferred method, and could also select which of 16 available variables to include in the model.

The results across the 26 teams varied widely but were positive (in the hypothesised direction) in all but one case. The ORs varied from 0.89 to 2.93 with a median estimate of 1.31. Overall, 20 teams found a statistically significant (in each case positive) relationship. This wide variability in effect estimates was all the more remarkable given that the teams peer-reviewed each other’s methods prior to analysis of the results.

All but one team took account of the clustering of players within referees, and the outlier was also the single team not to have a point estimate in the positive (hypothesised) direction. I guess this could be called a flaw in the methodology, but the remaining methodological differences between teams could not easily be classified as errors that would earn a low score in a statistics examination. Analytic techniques varied widely, covering linear regression, logistic regression, Poisson regression, Bayesian methods, and so on, with some teams using more than one method. Regarding covariates, all teams included the number of games played under a given referee and 69% included the player’s position on the field. More than half of the teams used a unique combination of variables. Use of interaction terms does not seem to have been studied.
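To see how easily such divergence arises, here is a small Python sketch on simulated data. The data, covariates and effect sizes are invented and bear no relation to the real data set, and referee clustering is ignored for brevity; the point is simply that three perfectly defensible specifications of ‘the effect of skin tone’ return three different numbers.

```python
# Illustrative only: simulated data showing how reasonable analytic choices on the
# same data yield different estimates of the skin-tone effect. Referee clustering
# is ignored here to keep the sketch short.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
tone = rng.integers(1, 5, size=n)           # skin-tone rating, 1 (light) to 4 (dark)
games = rng.poisson(40, size=n) + 1         # games observed per player
forward = rng.binomial(1, 0.3, size=n)      # crude "position" covariate
rate = np.exp(-4.0 + 0.10 * tone + 0.30 * forward)   # assumed red cards per game
reds = rng.poisson(rate * games)

X1 = sm.add_constant(np.column_stack([tone]))
X2 = sm.add_constant(np.column_stack([tone, forward]))

logit = sm.GLM((reds > 0).astype(int), X1, family=sm.families.Binomial()).fit()
pois = sm.GLM(reds, X2, family=sm.families.Poisson(), offset=np.log(games)).fit()
ols = sm.OLS(reds, X2).fit()

print("logistic (any red card), OR per tone step:", np.exp(logit.params[1]).round(2))
print("Poisson (rate per game), RR per tone step:", np.exp(pois.params[1]).round(2))
print("linear, extra red cards per tone step:    ", ols.params[1].round(3))
```

None of these models is ‘wrong’ in an examination sense; they simply answer subtly different questions, which is exactly why an agreed causal model and analysis plan are needed before the data are touched.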

There was little systematic difference in results by the academic rank of the teams, and no association between prior beliefs about what the study would show and the magnitude of effect estimated by the teams. This arguably makes the results all the more remarkable, since there would have been no apparent incentive to exploit options in the analysis to produce a positive result.

What do I make of all this? First, it would seem to be good practice to use different methods to analyse a given data set, as CLAHRC West Midlands has done in recent studies,[2] [3] though this opens opportunities to selectively report methods that produce results convivial to the analyst. Second, statistical confidence limits in observational studies are far too narrow, and this should be taken into account in the presentation and use of results. Third, data should be made publicly available so that other teams can reanalyse them whenever possible. Fourth, and a point surprisingly not discussed by the authors, the analysis should be tailored to a specific scientific causal model ex ante, not ex post. That is to say, there should be a scientific rationale for the choice of potential confounders and for the explication of variables to be explored as potential mediating variables (i.e. variables that might be on the causal pathway).

— Richard Lilford, CLAHRC WM Director

References:

  1. Silberzahn R, Uhlmann EL, Martin DP, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Adv Methods Pract Psychol Sci. 2018; 1(3): 337-56.
  2. Manaseki-Holland S, Lilford RJ, Bishop JR, Girling AJ, Chen Y-F, Chilton PJ, Hofer TP; the UK Case Note Review Group. Reviewing deaths in British and US hospitals: a study of two scales for assessing preventability. BMJ Qual Saf. 2017; 26: 408-16.
  3. Mytton J, Evison F, Chilton PJ, Lilford RJ. Removal of all ovarian tissue versus conserving ovarian tissue at time of hysterectomy in premenopausal patients with benign disease: study using routine data and data linkage. BMJ. 2017; 356: j372.

Twelve Deadly Misconceptions

Steven Goodman (my favourite statistician worldwide) writes a lively article on how not to interpret a p-value (or confidence interval – it is really the same kind of thing, often mis-sold as a remedy for the p-value problem).[1] How not to is one thing; how to is another. To my mind, things went wrong when we transplanted the notion of hypothesis testing from the lab to the clinic. It has no place in clinical medicine because you can be near certain that the (literally) true value is neither H0 nor HA! What we are doing is estimating the probabilities of effects/associations of different sizes. In any event, Goodman’s article is an excellent grounding in what we are doing (and not doing) when we use standard (frequentist) statistical tests.

— Richard Lilford, CLAHRC WM Director

Reference:

  1. Goodman S. A Dirty Dozen: Twelve P-Value Misconceptions. Semin Hematol. 2008; 45: 135-40.


P Values – Yet Again This Deceptively Slippery Concept

The nature of the P value has recently come up in the New England Journal of Medicine. Pocock, a statistician, is quoted as saying that “a P value of 0.05 carries a 5% risk of a false positive result.”[1]

Such a statement is obviously wrong, and Daniel Hu complains, correctly, that it is a “misconception”.[2] So Pocock and Stone reply that a p value of 0.05 carries a 5% risk of a false positive result “when there is no true difference between treatments.”[3] This is correct, provided it is understood that ‘false positive’ does not mean that the probability that the treatment is not effective is 5%. When is it reasonable to suppose that there is absolutely no true difference between treatments? Hardly ever. So the P value is not very useful to decision makers. The CLAHRC WM Director cautions statisticians not to discount prior consideration of how likely/realistic a null hypothesis is. Homeopathy aside, it is seldom a plausible prior hypothesis.
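To see why the two statements differ, here is a toy calculation in Python. The prevalence of true effects and the power are our own assumptions, chosen purely for illustration: the 5% refers to the error rate when the null is exactly true, not to the chance that a ‘significant’ finding is wrong.

```python
# Toy calculation (illustrative numbers of our own) separating two quantities that
# are often conflated: the type I error rate (5% *if* the null is exactly true) and
# the proportion of "significant" findings that are false positives.
prior_true = 0.10   # assumed share of tested hypotheses where a real effect exists
power = 0.80        # assumed power when a real effect exists
alpha = 0.05        # significance threshold

true_positives = prior_true * power
false_positives = (1 - prior_true) * alpha
false_positive_share = false_positives / (true_positives + false_positives)

print(f"Of all 'significant' results, {false_positive_share:.0%} are false positives")
# ~36% with these assumptions -- far from 5% -- and the figure depends entirely on
# how plausible the null hypotheses being tested are.
```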

— Richard Lilford, CLAHRC WM Director

References:

  1. Pocock SJ & Stone GW. The primary outcome is positive – is that good enough? N Engl J Med. 2016; 375: 971-9.
  2. Hu D. The Nature of the P Value. N Engl J Med. 2016; 375: 2205.
  3. Pocock SJ & Stone GW. Author’s Reply. N Engl J Med. 2016; 375: 2205-6.

Researchers Continue to Consistently Misinterpret p-values

For as long as there have been p-values there have been people misunderstanding p-values. Their nuanced definition eludes many researchers, statisticians included, and so they end up being misused and misinterpreted. The situation recently prompted the American Statistical Association (ASA) to produce a statement on p-values.[1] Yet they are still widely viewed as the most important bit of information in an empirical study, and careers are still built on ‘statistically significant’ findings. A paper in Management Science,[2] recently discussed on Andrew Gelman’s blog,[3] reports the results of a number of surveys of top academics about their interpretations of the results of hypothetical studies. These researchers, who include authors in the New England Journal of Medicine and American Economic Review, generally only consider whether the p-value is above or below 0.05; they consider p-values even when they are not relevant; they ignore the actual magnitude of an effect; and they use p-values to make inferences about the effect of an intervention on future subjects. Interestingly, the statistically untrained were less likely to make the same errors of judgement.

As the ASA statement and many, many other reports emphasise, p-values do not indicate the ‘truth’ of a result; nor do they imply clinical or economic significance; they are often presented for tests that are completely pointless; and they cannot be interpreted in isolation from all the other information about the statistical model and possible data analyses. It is possible that in the future the p-value will be relegated to a subsidiary statistic, where it belongs, rather than being the main result, but until that time statistical education clearly needs to improve.

— Sam Watson, Research Fellow

References:

  1. Wasserstein RL & Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. Am Stat. 2016; 70(2). [ePub].
  2. McShane BM & Gal D. Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence. Manage Sci. 2015; 62(6): 1707-18.
  3. Gelman A. More evidence that even top researchers routinely misinterpret p-values. Statistical Modeling, Causal Inference, and Social Science. 26 July 2016.

Do they think we’re stupid? The rise of statistical manipulitis and preventive measures

If there is one thing that the campaigns on the EU Referendum have taught us, it’s how the same set of data can be used to generate statistics that support two completely opposing points of view. This is beautifully illustrated in a report in the Guardian newspaper.[1] While the research community (amongst others) might accuse the campaigners of misleading the public and lament the journalists who sensationalise our findings, we are not immune from statistical manipulitis. To help control the susceptibility of researchers to statistical manipulitis, compulsory registration of trial protocols had to be instigated,[2] yet five years later the majority of studies still failed to comply, including registered trials for which reporting of results within one year of trial completion was mandated.[3] Furthermore, reporting alone provides insufficient public protection against the symptoms of statistical manipulitis. As highlighted in a previous blog, and in one of Ben Goldacre’s Bad Science blogs,[4] researchers have been known to change primary endpoints, or to select which endpoints to report. To provide a full aetiology for statistical manipulitis is beyond the scope of this blog, although Maslow’s belief that esteem (incorporating achievement, status, dominance and prestige) precedes self-actualisation (incorporating the realisation of one’s actual personal potential) provides an interesting starting point.[5] Whatever the causative mechanism, statistical manipulitis is not the only adverse consequence. For example, some professional athletes may stretch the principles underlying Therapeutic Use Exemptions to enable them to legally use substances on the World Anti-Doping Agency’s banned list, such as testosterone-based creams to treat saddle-soreness, when not all physicians would consider the athlete’s symptoms sufficiently severe to justify their use.[6]

We can also think of statistical manipulitis as pushing its victims across a balanced scale to the point at which the statistics presented become too contrived to be believed. Which side in the EU Referendum debate has travelled further from equilibrium is a moot point. While important gains could be had if those engaged with the debate knew the point at which the public’s scale is balanced, watching them succumb has injected some much-needed entertainment. The increased awareness of statistical manipulitis resulting from the debate has also provided an open door for those involved with public engagement with science to help move that tipping point and reduce the expected value of manipulation. To do so, the public need the tools and confidence to ask questions about political, scientific and other claims, as now being facilitated by the work of CLAHRC WM’s new PPIE Lead, Magdalena Skrybant, in her series entitled Method Matters. The first instalment, on regression to the mean, is featured in this blog.

Method Matters are ‘bite size’ explanations to help anyone without a degree in statistics or experience in research methods make sense of the numbers and claims that are bandied about in the media, using examples taken from real life. Certainly, we would hope that through Method Matters, more people will be able to accurately diagnose cases of statistical manipulitis and take relevant precautions.

Writing Method Matters is not an easy task: if each student in my maths class had rated my explanation of each topic, those ratings would vary both within and between students. My challenge was how to maximise the number of students leaving the class uttering those five golden words: “I get it now Miss!” Magdalena faces a tougher challenge – one size does not fit all and, unlike a “live” lesson, she cannot offer multiple explanations or answer questions in real time. However, while I had to convince 30 14-year-olds of the value of trigonometry on a windy Friday afternoon, the epidemic of statistical manipulitis highlighted by the EU Referendum debate has provided fertile ground for Method Matters. Please let us know what you think.

— Celia Taylor, Associate Professor

References:

  1. Duncan P, Gutiérrez P, Clarke S. Brexit: how can the same statistics be read so differently? The Guardian. 3 June 2016.
  2. Abbasi K. Compulsory registration of clinical trials. BMJ. 2004; 329: 637.
  3. Prayle AP, Hurley MN, Smyth AR. Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ. 2012; 344: d7373.
  4. Goldacre B. The data belongs to the patients who gave it to you. Bad Science. 2008.
  5. McLeod S. Maslow’s Hierarchy of Needs. Simply Psychology. 2007.
  6. Bassindale T. TUE – Therapeutic Use Exemptions or legitimised drug taking? We Are Forensic. 2014.

Comparing Statistical Process Control and Interrupted Time Series

Some people have tried to tell the CLAHRC WM Director that Statistical Process Control (SPC) and standard statistics are completely different ideas – one to do with special cause variation and the other with hypothesis testing. This concept is not so much wrong as it is not right! Fretheim and Tomic recently published a relevant and interesting article in BMJ Quality and Safety.[1] The CLAHRC WM Director’s take on this article is as follows. If there is no intervention, then use SPC to detect non-random variation; but if there is an intervention point (or period), use a statistical test. Since such a test uses the information that an intervention has occurred at a certain point in time, it is much more sensitive to change than SPC. Interrupted time series methods should be used to compare the slope of the lines before and after the intervention, remembering to allow for any auto-correlation, as emphasised in a previous post. However, the CLAHRC WM Director emphasises the need for contemporaneous controls whenever possible to allow for temporal trends – ‘rising tide’ situations.[2] He is also very concerned about publication bias arising from selective reporting of ‘positive’ interrupted time series studies. In the meantime, our CLAHRC has discovered that information presented to hospital boards mostly does not use SPC, and when it is used, the limits (e.g. two or three SDs) are not stated.[3]
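For readers who want to see what such an analysis looks like in practice, here is a minimal segmented-regression sketch in Python on simulated monthly data (all numbers invented). Newey-West (HAC) standard errors are used as one simple allowance for auto-correlation; other approaches (e.g. explicit ARIMA error models) exist, and, as noted above, none of this substitutes for contemporaneous controls.

```python
# A minimal interrupted time series (segmented regression) sketch on simulated data:
# level + pre-intervention slope + step change + slope change, with autocorrelation-
# robust (Newey-West/HAC) standard errors as a simple allowance for serial correlation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
t = np.arange(48)                       # e.g. 48 months of data
post = (t >= 24).astype(float)          # intervention introduced at month 24
time_since = post * (t - 24)            # months since the intervention

# Simulated outcome: secular trend, a step change, a slope change, AR(1) noise
e = np.zeros(48)
for i in range(1, 48):
    e[i] = 0.5 * e[i - 1] + rng.normal(scale=1.0)
y = 50 + 0.10 * t - 2.0 * post - 0.15 * time_since + e

X = sm.add_constant(np.column_stack([t, post, time_since]))
fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 3})
print(fit.summary(xname=["const", "time", "step", "slope_change"]))
```

The ‘step’ and ‘slope_change’ coefficients are the interrupted time series estimates of the intervention effect; the ‘time’ coefficient captures the underlying secular trend that a naive before-and-after comparison would wrongly attribute to the intervention.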

— Richard Lilford, CLAHRC WM Director

References:

  1. Fretheim A & Tomic O. Statistical Process Control and Interrupted Time Series: A Golden Opportunity for Impact Evaluation in Quality Improvement. BMJ Qual Saf. 2015. [ePub].
  2. Chen Y, Hemming K, Stevens AJ, Lilford RJ. Secular trends and evaluation of complex interventions: the rising tide phenomenon. BMJ Qual Saf. 2015. [ePub].
  3. Schmidtke K, Poots AJ, Carpio J, et al. Considering Chance in Quality and Safety Performance Measures. BMJ Qual Saf. 2016. [In Press].

More on ‘P-Hacking’

News Blog readers know that the CLAHRC WM has a large interest in dissemination biases of various kinds. We recently featured an article on ‘p-hacking’, which can be unmasked by detecting a peak of findings with P values just below the fabled 0.05 threshold, as represented below.

[Figure: Possible distribution of p-values around the null hypothesis.]

Well, Ginsel et al. analysed 2,000 abstracts selected at random from over 80,000 papers in medical journals.[1] Sure enough, there was an overrepresentation of P values just below 0.05 (between 0.049 and 0.05). The P value in this study was <0.001!
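As an illustration (not a re-analysis of Ginsel et al.), here is a small Python simulation of one mechanism that produces such a spike: repeatedly peeking at accumulating data and stopping as soon as p dips below 0.05. The sample sizes and number of ‘studies’ are arbitrary choices for the sketch.

```python
# A small simulation of one p-hacking mechanism: peeking at the data after every few
# observations and stopping as soon as p < 0.05. Among the null studies that ever
# reach "significance", the reported p-values pile up just below the 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reported = []
for _ in range(2_000):                      # 2,000 studies of a true null effect
    x = list(rng.normal(size=10))
    while len(x) <= 100:
        p = stats.ttest_1samp(x, popmean=0).pvalue
        if p < 0.05:                        # stop and report as soon as "significant"
            reported.append(p)
            break
        x.extend(rng.normal(size=5))        # otherwise collect a few more and look again

bins = np.histogram(reported, bins=[0, 0.01, 0.02, 0.03, 0.04, 0.05])[0]
print("significant studies:", len(reported), "of 2000")
print("counts in 0.01-wide bins up to 0.05:", bins)   # the 0.04-0.05 bin typically dominates
```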

— Richard Lilford, CLAHRC WM Director

Reference:

  1. Ginsel B, Aggarwal A, Xuan W, Harris I. The distribution of probability values in medical abstracts: an observational study. BMC Res Notes. 2015; 8(1): 721.

Q. When Can Evidence-Based Care do More Harm than Good?

A. When People Mistake no Evidence of Effect for Evidence of no Effect.

Imagine that you have malignant melanoma on your forearm. You can select wide margin excision or a narrow margin. The latter is obviously less disfiguring.

Results from six RCTs (n=4,233) have been consolidated in a meta-analysis.[1] In keeping with individual trials and with previous meta-analyses, the result is null for numerous outcomes. However, the point estimates all favour wider margins and the confidence limits are close to the (arbitrary) 5% significance level. For example, the hazard ratio for overall survival favouring wide margins is 1.09 (0.98-1.22). The authors state that the study shows “a 33% probability that [overall survival] is more than 10% worse” when a narrow margin excision is used. It should be added that this assumes an uninformative prior. If the prior probability estimate favoured better survival with wider excision margins, then the evidence in favour of a wider margin excision is stronger still. Moreover, the authors quote results showing that patients will not trade even small survival gains for improved cosmetic outcome. Despite loose statistical language (conflating the probability of survival given the data with the probability of the data if there were no difference in outcome), the authors have done science and practice a great service. This paper should be quoted in the context of surgical treatment of cancer generally, not just melanoma excision. For example, is sentinel node biopsy really preferable to axillary dissection in breast cancer surgery?
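As a back-of-envelope illustration (emphatically not the authors’ Bayesian analysis), the reported hazard ratio and confidence interval can be turned into a direct probability statement in a few lines of Python, assuming a flat prior and a normal approximation on the log hazard ratio scale.

```python
# Back-of-envelope only (not the authors' method): treat the reported hazard ratio of
# 1.09 (95% CI 0.98 to 1.22) as a normal likelihood on the log scale and, under a flat
# prior, read off the probability that wide margins improve overall survival.
import numpy as np
from scipy import stats

hr, lo, hi = 1.09, 0.98, 1.22
log_hr = np.log(hr)
se = (np.log(hi) - np.log(lo)) / (2 * 1.96)   # standard error from the 95% CI width on the log scale

p_narrow_worse = 1 - stats.norm.cdf(0, loc=log_hr, scale=se)   # P(true HR > 1)
print(f"Approximate probability that narrow margins worsen survival: {p_narrow_worse:.0%}")
# ~94% with this crude approximation, even though the result is conventionally "not significant".
```

This is the essence of the post’s title: a null significance test is not evidence of no effect, and re-expressing the same data as a probability of harm makes that plain.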

— Richard Lilford, CLAHRC WM Director

Reference:

  1. Wheatley K, Wilson JS, Gaunt P, Marsden JR. Surgical excision margins in primary cutaneous melanoma: A meta-analysis and Bayesian probability evaluation. Cancer Treat Rev. 2015. [ePub].