Protocols for Database Studies

We introduce an innovation for the CLAHRC WM News Blog – the online protocol for database studies. Here we describe the approach. Later in the blog we enclose a pre-protocol for a study we propose to carry out in ‘real life’. We would value your opinion on this approach.

Large, prospective studies, such as randomised clinical trials, have formal protocols that can be accessed by reviewers and readers. Publication of protocols increases transparency, thereby reducing many types of ‘dissemination bias.’ First, it reduces the risk of undetected publication bias – the risk that the literature will be skewed towards studies with positive results. Second, it reduces data-driven hypotheses masquerading as prior hypotheses – reviewers can determine whether what was planned was done, and whether what was done was planned. In many cases these protocols are published in recognised journals, most often online journals, such as Trials or BMJ Open. Increasingly, publication of protocols – placing them in the public domain – is seen as a tenet of good practice.[1]

Database studies are at particularly high risk of publication bias and other types of ‘dissemination bias’, such as selectively publishing significant findings or performing numerous correlations and selecting only those with ‘positive’ results, a topic of previous posts on p-hacking.[2] [3] Moreover, modern clinical service delivery research increasingly relies on such database studies; ‘big data’ is all the rage. This is not a criticism of such studies; CLAHRC WM is a proponent of database studies, and we have reviewed some iconic data-linkage studies, such as the study that unravelled the ‘Muslim mortality paradox’.[4] All the more important, then, to guard against dissemination bias.

Barriers to entry are low for database studies. That is to say, they can often be done without grant funding (and hence without the requirement to submit a protocol). Unlike trials, there is no requirement or expectation that a protocol will be filed (registered in the public domain). Anyone with access to the data can sit down at the computer and ‘play’. As they do so, new ideas may occur to them, or a finding may prompt further exploration of the data. That would be fine if all results were reported, but the risk here is that positive results are reported, while the denominator – the number of correlations from which published correlations are drawn – is unknown. Even the investigator might not know how many correlations were performed, because s/he may not have kept a tally. More risky still, routinely collected data may be analysed for ‘quality control’ purposes and the idea of publishing the findings may arise only when interest is piqued by a positive result. It is through this biased process that so-called “quality improvement reports” arise. Inevitably, these are a highly skewed sample of quality improvement initiatives.
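
The multiple-correlations risk described above is easy to demonstrate with a simulation. The sketch below (a hypothetical illustration, not drawn from the post; the sample sizes and the critical value of r are our own assumptions) correlates 100 pairs of purely random variables – so every ‘finding’ is noise – and counts how many reach nominal significance. Around 5 in 100 will, by construction; a report of only those five, with the other 95 unmentioned, is exactly the dissemination bias at issue.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

random.seed(1)
N_TESTS, N_OBS = 100, 30
CRITICAL_R = 0.361  # |r| above this is 'significant' at two-tailed p < 0.05 for n = 30

false_positives = 0
for _ in range(N_TESTS):
    # Two independent random variables: the true correlation is zero.
    x = [random.gauss(0, 1) for _ in range(N_OBS)]
    y = [random.gauss(0, 1) for _ in range(N_OBS)]
    if abs(pearson_r(x, y)) > CRITICAL_R:
        false_positives += 1

# Roughly 5 of the 100 null correlations will appear 'significant' by chance.
print(f"{false_positives} 'significant' results out of {N_TESTS} null tests")
```

Reporting the denominator (all 100 tests) alongside the hits is what the protocol-filing proposal below is designed to guarantee.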

The obvious response to this risk is to mandate pre-publication of a protocol for database studies and then insist that people stick to it, as one would for a large clinical trial. But this is heavy-handed. First, people often interrogate databases for service reasons, with no intent to publish. Should they be muzzled if they chance upon an interesting and important finding – say, concerning a putative negative side effect of treatment? Second, it is often necessary to refine a search as one proceeds – one may realise that the same condition can be recorded under many different sub-categories, for example. A surprise finding may prompt subsidiary questions that can be answered from the database – for example, finding an increased cancer incidence in association with some exposure raises the question, which cancers?

What can be recommended so that, on the one hand, research is not stifled, but on the other hand, the risk of dissemination bias is mitigated? We propose a half-way house between excessive straitjacketing of database searches and encouraging a free-for-all, with its risk of p-hacking and selective reporting and the consequent false positive study results. The epistemological principles we hew to are:

  1. Sharp separation of question formulation / study design from data collection.
  2. Transparency.

Our point of departure from the most stylised and formal processes, such as those properly followed in RCTs, is acceptance that there can be rapid iterations:

Design -> Data collection -> Design of subsidiary study -> Further data collection, and so on

Such a proposal is entirely compatible with philosophical argument on subjectivity and objectivity in science [5] and with Fichte’s proposition [6] of:

Thesis -> Anti-thesis -> Synthesis -> Thesis, and so on

Only one problem remains – how to operationalise such a process, i.e. how to maintain the required separation and transparency while iteratively refining questions to ask of databases. Computers to the rescue! We propose that an original protocol be filed, with a view to recording and dating amendments prior to each subsequent database search. Rather than just ‘fly a kite’ we will provide a living demonstration in the pages of the News Blog. In this issue we ‘file’ a pre-protocol. This will then be sent to the data hub where Felicity Evison and colleagues will try to ‘operationalise’ the search and will populate the database with specific searchable terms for concepts such as ‘peritonsillar abscesses.’ We may meet or telephone, but no search will be done until we have agreed the updated search protocol, which will then be filed in the News Blog alongside the pre-protocol. Such iterations will continue until we send the manuscript for publication. At this point reviewers (and future readers) will have full access to the protocol through all stages of its evolution. We hope you enjoy this ‘real time’ story as it unfolds in the pages of your News Blog. Readers are invited to contribute to the enclosed pre-protocol (and to its future evolution), and contributions that lead to a change in the protocol will be acknowledged in the final paper. We welcome contributions from patients and the public.
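
The filing-and-amendment discipline described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration of the idea (the class name, protocol text, and amendment wording are our own inventions, not CLAHRC WM's actual system): the protocol is an append-only log, so every dated amendment is preserved and the full history remains available to reviewers.

```python
import datetime

class ProtocolLog:
    """Append-only record of a study protocol and its dated amendments."""

    def __init__(self, original_text):
        # The original pre-protocol is the first, permanent entry.
        self.entries = [(self._now(), original_text)]

    @staticmethod
    def _now():
        return datetime.datetime.now(datetime.timezone.utc)

    def amend(self, amendment_text):
        # Amendments are only ever appended – never edited or deleted –
        # before the next database search is run.
        self.entries.append((self._now(), amendment_text))

    def history(self):
        """Full dated history, as reviewers and readers would see it."""
        return [f"{ts.isoformat()}  {text}" for ts, text in self.entries]

# Hypothetical usage mirroring the iteration described in the post.
log = ProtocolLog("Pre-protocol: association between exposure X and outcome Y.")
log.amend("Search refined: condition may be coded under several sub-categories.")
print("\n".join(log.history()))
```

The essential property is not the code but the discipline it encodes: each search is preceded by a filed, dated entry, so the denominator of questions asked is public even when only some answers prove interesting.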

— Richard Lilford, CLAHRC WM Director


  1. Chan A-W, Tetzlaff JM, Gøtzsche PC, Altman DG, Mann H, Berlin JA, et al. SPIRIT 2013 explanation and elaboration: guidance for protocols of clinical trials. BMJ 2013; 346: e7586.
  2. Lilford RJ. Look out for ‘P-Hacking’. NIHR CLAHRC West Midlands News Blog. 11 September 2015.
  3. Lilford RJ. More on ‘P-Hacking’. NIHR CLAHRC West Midlands News Blog. 18 December 2015.
  4. Lilford RJ. The Most Important Applied Research Paper This Year? Perhaps Any Year? NIHR CLAHRC West Midlands News Blog. 19 September 2014.
  5. Lilford RJ. Objectivity in Service Delivery Research. NIHR CLAHRC West Midlands. 19 June 2015.
  6. Fichte J. Early Philosophical Writings. Trans. and ed. Breazeale D. Ithaca, NY: Cornell University Press, 1988.

2 thoughts on “Protocols for Database Studies”

  1. Having been involved in a number of database studies, I think the suggestion of an evolving protocol with dated amendments is an excellent idea, and exactly what I was thinking as I started to read your blog. This is essentially how the CPRD currently operates, although here protocols are reviewed by a committee. I think an institutional repository, or similar, of protocols would work very well for studies using any database.
