How do you do pioneering research without ever seeing the data?

Lucy Elliss-Brookes, Associate Director for Data Curation at NHS Digital, explains the new approach that allowed researchers to show the life-changing benefits of the Human Papilloma Virus vaccine – without any direct access to patient data.

A recent study published in The Lancet showed, for the first time, that the Cervarix Human Papilloma Virus (HPV) vaccine has had a dramatic effect on cervical cancer incidence. As well as this being an historic moment for the vaccination programme, there is also an exciting data story behind the headlines.

A new approach

What makes this study novel and exciting is that the researchers never had access to any patient data during their work: the real individual-level data never left the secure National Cancer Registration and Analysis Service (NCRAS) environment and patient confidentiality was protected.

The study used cancer registration data which are collated, maintained and quality assured by the NCRAS, which was part of Public Health England when the research was carried out and is now part of NHS Digital. The work was undertaken in collaboration with the Cancer Prevention Group (which is funded by Cancer Research UK) at King's College London (KCL). The lead author was Dr Milena Falcaro and the study was overseen by Professor Peter Sasieni.

The Simulacrum

Researchers at KCL did not need direct access to patient data because they developed and tested their models using simulated data, extending the artificial cancer data in the Simulacrum, and then sent their analysis code to NCRAS to run on the real data.

The Simulacrum is synthetic cancer data which imitates some of the data held securely by NCRAS. It’s a publicly available resource that anyone can use to learn more about the structure and properties of cancer data in England without compromising patient privacy, as it does not contain any real patient information.

Also, because the Simulacrum contains the same data fields and similar data entries as the data held by NCRAS, the Simulacrum can be used to write and test queries, as it was for this study, before making a formal request (with the right permissions and ethical approval) to analyse the real data.
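As a rough illustration of this pattern, here is a minimal Python sketch (the study itself used Stata, so this is not the researchers' code). It assumes the analysis is written as a function that only ever sees the rows it is handed, so the same code can be tested on synthetic rows and later run by someone else on the real ones. The field name `cohort` and the rows are made up for the example and are not taken from the Simulacrum.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_group(records: Iterable[Mapping[str, str]], field: str) -> dict[str, int]:
    """Count records per value of `field`.

    The function depends only on the field names, not on where the rows
    come from, so it behaves the same on synthetic and real data.
    """
    return dict(Counter(record[field] for record in records))


# Made-up rows standing in for synthetic data; "cohort" is a hypothetical field.
synthetic_rows = [
    {"cohort": "A"},
    {"cohort": "A"},
    {"cohort": "B"},
]

# The researcher develops and checks the analysis against synthetic rows only.
synthetic_counts = count_by_group(synthetic_rows, "cohort")
print(synthetic_counts)
```

Because the function never opens a file or database itself, whoever holds the real data can pass in the real rows without the code's author seeing them.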

The Simulacrum was developed and built by Health Data Insight Community Interest Company as part of a partnership with NCRAS. It was generated synthetically by mathematical and computational analysis of tens of thousands of anonymous extracts from the original NCRAS data. You can find out more about the work on the Simulacrum website, but a note of caution: results from the Simulacrum should not be used for clinical decisions because it only approximates the original data.

For The Lancet study, Milena collaboratively designed a statistical analysis plan, then wrote and tested her Stata analysis code on an extension of the data in the Simulacrum. The code was given to Dr Busani Ndlela at NCRAS, who ran it on the real cancer data held securely inside Public Health England. The aggregate results were quality assured by Jennifer Lai in NCRAS, with any small numbers suppressed to further protect patient confidentiality, and then shared with the KCL researchers to interpret.
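Small-number suppression can be sketched in a few lines of Python. This is an illustration only, not the NCRAS procedure: the threshold of 5, the choice to leave zeros visible, and the group labels are all assumptions made for the example, and the article does not say what rules were applied.

```python
from typing import Optional

# Assumed threshold for illustration; the article does not state the value used.
SUPPRESSION_THRESHOLD = 5


def suppress_small_counts(
    counts: dict[str, int], threshold: int = SUPPRESSION_THRESHOLD
) -> dict[str, Optional[int]]:
    """Replace counts from 1 up to (but not including) `threshold` with None.

    Zeros and counts at or above the threshold are returned unchanged
    (keeping zeros is an assumption of this sketch).
    """
    return {
        group: (None if 0 < n < threshold else n)
        for group, n in counts.items()
    }


# Hypothetical aggregate results, as they might look before release.
aggregate = {"cohort_A": 120, "cohort_B": 3, "cohort_C": 0}
released = suppress_small_counts(aggregate)
print(released)
```

A real disclosure-control step would usually also check that a suppressed cell cannot be worked out from row or column totals, which this sketch does not attempt.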

Global impact

This research used data that have been provided by patients and collected by the NHS as part of their care and support. Working in partnership with external organisations, including academic research groups and other data experts, enabled NCRAS to make the best use of the cancer data that it is responsible for curating. Our contribution to this study, which gives new evidence that the HPV vaccine could potentially eradicate cervical cancer, shows how we can safely use patient data to improve health outcomes of women not only in England, but across the globe as well.


Last edited: 15 December 2021 4:06 pm