General Practice Extraction Service (GPES) Data for pandemic planning and research: a guide for analysts and users of the data
This guidance provides an overview of the dataset for analysts and other users of the General Practice Extraction Service (GPES) Data for Pandemic Planning and Research (GDPPR) that will provide information for coronavirus (COVID-19) planning and research.
This guidance covers:
- an overview of the generic GPES data extraction mechanism
- a description of the specific extract requirement (instance of a GPES extract) that will be used to provide the COVID-19 Planning and Research data extract
- frequency of extraction
- participation of GP Practices
- data model
- fields extracted and their types
- patient inclusion/exclusion criteria
- the coded information extracted from each patient record (if present) in the form of code clusters
- data coverage management information
It is intended for a wide range of stakeholders: clinicians, patients, Information Governance (IG) professionals and anyone who needs to understand what is being extracted for the purposes of COVID-19 planning and research and how the extract operates.
It does not replace the technical specification (extraction requirement) that GP System Suppliers (GPSS) will utilise to build the extract.
The current COVID-19 pandemic has led to urgent demand for General Practice (GP) data for planning and research from multiple sources. The British Medical Association (BMA) and Royal College of General Practitioners (RCGP) have asked NHS Digital for support to manage data requests and therefore reduce burden on General Practices.
In response, NHS Digital has created a tactical solution which uses the existing General Practice Extraction Service (GPES) to run a fortnightly data extract from General Practices into NHS Digital. The proposal has been supported by BMA and RCGP (via the Joint GP IT Committee). The legal bases for the extract are the COVID-19 Public Health Directions and Control of Patient Information Regulations. The extract will:
- respond to and manage the increased demand for data for COVID-19 planning and research
- make sure data is stored securely and disseminated appropriately and safely using our robust Data Access Request Service (DARS) and Independent Group Advising on Release of Data (IGARD) processes; BMA/RCGP representatives will provide additional approval for the releases of data
- reduce burden on GP Practices and allow GPs to focus on patient care
This guidance is aimed at clinicians, patients, Information Governance (IG) professionals and anyone with a need to understand what is being extracted for the purposes of COVID-19 planning and research and how the extract operates.
Summary of the end to end process
Coronavirus (COVID-19) has led to increased demand on general practices, including an increasing number of requests to provide patient data to inform planning and support vital research on the cause, effects, treatments and outcomes for patients of the virus.
NHS Digital was asked to provide support to general practice to help reduce the burden of data requests and allow clinicians to focus more on delivering care. Learn more about the information on the extract and its uses.
1. While the data sits within the GP System Supplier (GPSS) boundary, the GP is the data controller
The data set has been developed with stakeholders focusing on the current pandemic need. Practices would be provided with consistent and exemplary fair processing information for all data collected by NHS Digital.
Data is only taken from the Practice where a Data Provision Notice has been accepted. Not all patients are included in the extract. Only specific Coded and Structured data will be extracted by the General Practice Extraction Service (GPES) and sent to NHS Digital.
The patient data is transferred from the GP System Supplier to the NHS Digital Data Processing Service (DPS) using the Message Exchange for Social Care and Health (MESH) service for secure large file transfers.
2. Upon data landing, NHS Digital (NHSD) and the Department of Health and Social Care (DHSC) become joint data controllers
Data is passed through a secure ‘data pipeline’ where it is ingested, validated and has derivations applied before being stored separately to other data assets.
Upon landing the DPS takes the extract file from the landing zone file store and applies validation and Data Quality (DQ) checks. The DPS then calls the De-Identification Service to tokenise identifiers to the DPS internal pseudonyms ahead of storage.
3. Processed data is then held securely in an encrypted and pseudonymised form, in isolation from other data sets and NHS Digital staff
All data held is protected by system level security policies. Data sets are stored as objects in AWS S3 Buckets with controlled access via Identity and Access Management (IAM) mechanisms. Files are not publicly readable and data is encrypted at rest in S3 using AES-256.
4. Applications to request data must include clear purpose(s) and a legal basis to help ensure appropriate data-level access and use. Applications undergo independent external IGARD (Independent Group Advising on the Release of Data) and internal Data Access Request Service (DARS) assessments, to ensure:
- the data file will only contain data that has been authorised via a Data Sharing Agreement (DSA) which has been approved through the Data Access Request Service (DARS)
- the file will be sent to the recipient using a secure mechanism such as MESH
- each recipient will receive data with a different set of pseudonyms (based on the DSA)
All applications to access data must be initiated through the NHSX Single Point of Contact for triage before entering into the standard DARS and IGARD process. The NHS Digital Senior Information Risk Owner (SIRO) will have final approval before any data is released. A DSA and contract must be in place between NHS Digital and the recipient ahead of formal release.
5. Upon approving the application, data can be linked and/or re-identified ahead of dissemination where required
Once approval is granted, the data can be linked to other data sets; any further processing, including linkage, is only undertaken upon DARS approval. Data does not need to be re-identified to be linked to other data sets. Where re-identification is approved to meet a specific purpose, it is strictly controlled, monitored and fully auditable, and involves many steps and security levels.
6. NHS Digital’s responsibility for the data does not stop at the dissemination and audit. Sanctions are imposed for any organisation deemed to have breached the Data Sharing Agreement (DSA). These include:
- revocation of the DSA and access to the data
- data destruction notice
- customer being reported to the Information Commissioner's Office (ICO) for data breaches
Once all approvals have been obtained and the data prepared, it can be accessed by the requesting organisation within the Data Access Environment (DAE). DAE is a single access environment for NHS Digital and external users to access this data, and it supports a number of presentation tools. By default, users cannot download the results of queries from DAE. However, there are cases, typically involving cohort management, where this is necessary, in which case the user is granted specific permission to download data.
GPES extraction overview
The GPES is a generic data extraction service operating between NHS Digital and GPSS that allows NHS Digital to query GP systems for data in the form of specific data extractions (an extraction requirement) to meet the needs of a particular data use case. Examples of existing data extractions are those that provide the basis for GP payments or that are used for health screening, for example Diabetic Retinopathy.
The GPES provides standard mechanisms for controlling and scheduling extractions as well as targeting and controlling practice involvement (Participation). This allows control of the population (Cohort) for which data is extracted as well as, where applicable, GP Data Control of whether the extraction is authorised to take place.
This COVID-19 planning and research extract is a further extract which has been developed by GPSS and is undertaken by NHS Digital to extract the relevant data for central processing.
The actual subset of available data that is extracted in each GPES extract is defined by a set of business rules. These rules specify features such as the target cohort of patients, the patients qualifying for extraction, the coded record content for extraction and limitations such as time period cut-offs to be applied to the extracted content.
The following sections provide an overview of the business rules that specify the actual subset of patient data held by GP systems that will be included in the COVID-19 planning and research extract.
The COVID-19 Planning and Research extract is an initial extract, followed by a fortnightly extraction. Data is up to date as of the day before each extract takes place. The data available for dissemination will be approximately one week old.
The initial GDPPR extract will consist of patient demographic information and coded medical information (as per the business rules) as a snapshot in time when the first extract is undertaken. A snapshot in this context means data recorded up to the date the extract is taken, looking back through the full history of the relevant parts of the patient record stored within their GP system. Thereafter, fortnightly extracts will be taken. These will request the same data items (patient demographics and coded medical information) and snapshot as defined in the initial extract, but only for patients who meet at least one of the criteria below:
- patients who have recently registered at a GP practice in the two weeks up to and including the reporting period end date.
- patients who have any codes relevant to pandemic planning and research recorded in the month up to and including the reporting period end date
- patients who have any codes relevant to pandemic planning and research and whose date of death is in the month up to and including the reporting period end date.
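The three criteria above can be sketched as a single predicate. This is a minimal illustration, not the GPSS implementation: the `PatientSummary` layout and its field names are hypothetical, and "two weeks" and "the month" are approximated here as 14 and 31 days, which the business rules may define differently.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class PatientSummary:
    # Hypothetical fields for illustration only; not the GPES schema.
    registration_date: date
    # Dates of coded entries relevant to pandemic planning and research.
    code_dates: List[date] = field(default_factory=list)
    date_of_death: Optional[date] = None


def in_window(d: date, rped: date, days: int) -> bool:
    """True if d falls in the `days` days up to and including rped."""
    return rped - timedelta(days=days) < d <= rped


def in_fortnightly_extract(p: PatientSummary, rped: date) -> bool:
    """rped is the reporting period end date."""
    # Criterion 1: recently registered (assumed 14-day window).
    recently_registered = in_window(p.registration_date, rped, 14)
    # Criterion 2: a relevant code recorded recently (assumed 31-day window).
    recent_code = any(in_window(d, rped, 31) for d in p.code_dates)
    # Criterion 3: has any relevant code and died recently.
    recent_death = (
        bool(p.code_dates)
        and p.date_of_death is not None
        and in_window(p.date_of_death, rped, 31)
    )
    return recently_registered or recent_code or recent_death
```

A patient qualifies as soon as any one criterion holds, so the three checks are combined with `or`.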
We will see updates where:
- patients register at a new practice
- journals are added
- journals are added and removed in between reporting periods
- patients have died
We will not see updates where:
- changes are made only in the patient section of the record
- journals are only removed
- only the contents of journals are changed
- patients are deleted from practice registers
As data controllers of data in their GP systems, GP Practices are required to opt-in to this extraction via the standard GPES mechanism by accepting an offer of participation in the CQRS (Calculating Quality Reporting Service) system. Data will not be extracted for any practices that have not opted-in via CQRS.
Candidate patient records for extraction are patients with active, current registrations at participating practices and deceased patients with a date of death on or after 1 November 2019.
Data will not be extracted from patient records with a recorded dissent from secondary use of GP patient identifiable data, thereby respecting the current national Type 1 data opt-out. There are around 1.3 million people with a Type 1 opt-out. Statistics on Type 1 opt-outs can be found in the National Data Opt-out publication.
Patient records will be included where they have coded record content that matches the codes defined by the Code Clusters applicable for the COVID-19 planning and research extract.
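Taken together, the participation, candidate, opt-out and code-matching rules above amount to a chain of checks. The sketch below is one hedged reading of them: the function name and parameters are hypothetical, and only the 1 November 2019 date comes from this guidance.

```python
from datetime import date
from typing import Optional, Set

# Date of death threshold stated in the guidance.
EARLIEST_DATE_OF_DEATH = date(2019, 11, 1)


def is_extracted(
    practice_participating: bool,   # practice has opted in via CQRS
    registration_active: bool,      # active, current registration
    date_of_death: Optional[date],
    has_type1_opt_out: bool,
    patient_codes: Set[str],        # codes present in the patient's record
    cluster_codes: Set[str],        # all codes defined by the code clusters
) -> bool:
    # No data is extracted for practices that have not opted in,
    # or for patients with a Type 1 opt-out.
    if not practice_participating or has_type1_opt_out:
        return False
    # Candidate: active registration, or deceased on/after 1 November 2019.
    recently_deceased = (
        date_of_death is not None and date_of_death >= EARLIEST_DATE_OF_DEATH
    )
    if not (registration_active or recently_deceased):
        return False
    # Included only if at least one recorded code is in a code cluster.
    return bool(patient_codes & cluster_codes)
```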
General content exclusions
The extract does not include any free-text notes or documents attached to patient records.
Extract scope and content
The GPES-I standard models patient data held in GP systems via four main entities in what is commonly referred to as the ‘4 table model’.
The entities in scope for the GDPPR extraction are patients and journals only.
Patients: provides relevant details of patient demographics, for example age and sex, as well as details of a patient's registration, for example registration type and registration status.
Journals: describes the coded record entries that make up a patient record, for example a diagnosis of asthma, measurements such as blood pressure values or medications prescribed to the patient.
Inclusion of coded information is driven by the ‘Code Clusters’ specified for the ‘COVID 19 planning and research’ extract. Each cluster specifies a set of codes and where information in the patient record has been coded with clinical codes corresponding to those cluster members it is extracted. This mechanism allows both relevant patients and relevant information to be extracted, excluding patient information which is not relevant.
The data from GPSS flows through an ingestion pipeline on the NHS Digital Data Processing Service (DPS) platform, which operates on Amazon Web Services using Simple Storage Service (Amazon S3), as shown in this diagram.
Code clusters and content
The rules and logic governing patient inclusion and extracted record content are provided by the GPES Extract for pandemic planning and research_business_rules_v2.0 or later version. For the latest content of the code clusters see below.
The business rules document defines the set of code clusters setting out the inclusion criteria for coded record content in terms of SNOMED CT reference sets. The contents of each refset are available via Technology Reference data Update Distribution (TRUD)/Power BI portal. An example of a subset of the defined refsets is shown in this table.
|Cluster name||Description||SNOMED CT|
|AAA_COD||Abdominal aortic aneurysm diagnosis codes||^999016371000230105|
|ABPM_COD||Ambulatory blood pressure codes||^999016411000230109|
|ACE_COD||Angiotensin-converting enzyme (ACE) inhibitor prescription codes||^12464201000001109|
Where applicable, time-based cut offs are applied to extracted journal entries for example within 2 years of the extraction date. These time-based cut-offs are also defined in the business rules document.
This table is an example of a 2-year cut-off being applied to codes belonging to the ambulatory blood pressure code cluster. Where no time-based cut-off is applied, all instances of a qualifying code are extracted.
|Field number||Field name||Code cluster (if applicable)||Qualifying criteria||Returned fields||Non-technical decision|
| || || ||All > (RPED - 2 years) AND <= RPED||Refer to 4.4 Patient-level Extracts||The specified fields for all ambulatory blood pressure codes recorded in the 2 years up to and including the reporting period end date.|
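The qualifying criteria `> (RPED - 2 years) AND <= RPED`, where RPED is the reporting period end date, describe a window that excludes its start and includes its end. A minimal sketch follows; the function names are hypothetical, and mapping 29 February to 28 February when subtracting years is an assumption rather than something the business rules are known to specify.

```python
from datetime import date


def years_before(d: date, years: int) -> date:
    """Return the same calendar day `years` earlier."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        # 29 February with a non-leap target year: assume 28 February.
        return d.replace(year=d.year - years, day=28)


def within_cutoff(entry_date: date, rped: date, years: int = 2) -> bool:
    """True if entry_date > (rped - years) and entry_date <= rped."""
    return years_before(rped, years) < entry_date <= rped
```

With a reporting period end date of 31 May 2020, an entry dated 31 May 2018 falls outside the window, while entries dated 1 June 2018 through 31 May 2020 fall inside it.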
To give context to the code clusters used in this dataset:
- there are over 900,000 SNOMED codes in the UK and international releases including drug codes and inactive codes
- there are over 34,000 SNOMED codes used within the GDPPR dataset (all current NHS Digital GP extracts cover 36,400 SNOMED codes)
Similar SNOMED codes are grouped together into code clusters. For example, there are 18 SNOMED codes which refer to a patient receiving a seasonal influenza vaccine; these 18 SNOMED codes are grouped under the code cluster ‘Flu vaccination codes’. The same occurs with the 17 SNOMED codes which denote a patient receiving an MMR vaccine to produce the ‘MMR vaccine codes’ code cluster. These two code clusters are then grouped together under a wider cluster category, ‘Vaccines and immunisations’, along with several other relevant code clusters. The document/Power BI report below can be used to understand the hierarchical structure of SNOMED codes, code clusters and categories, and can help users decide which may be relevant to their research.
Only the individual SNOMED code is included within each journal record. Therefore, to filter the data by specific code clusters/refsets, the provided reference data must be used. For Data Access Environment (DAE) users, reference data is available in the dss_corporate database. Care must be taken when joining GDPPR data to reference data, as SNOMED codes can appear in more than one code cluster and a simple join can therefore duplicate journal records.
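The risk with codes that sit in more than one cluster can be shown with a small sketch. The reference layout, cluster names and code values below are placeholders rather than real reference data; `NHS_NUMBER` and `CODE` are GDPPR field names used elsewhere in this guidance.

```python
# Hypothetical reference rows (cluster name, code); values are placeholders.
REFERENCE = [
    ("CLUSTER_A", "code-1"),
    ("CLUSTER_B", "code-1"),  # the same code appears in a second cluster
    ("CLUSTER_B", "code-2"),
]

# Journal rows carry only the individual code.
JOURNALS = [
    {"NHS_NUMBER": "patient-1", "CODE": "code-1"},
    {"NHS_NUMBER": "patient-2", "CODE": "code-3"},
]


def naive_join(journals, reference):
    """Row-per-match join: repeats a journal once per matching cluster."""
    return [
        {**j, "CLUSTER": cluster}
        for j in journals
        for cluster, code in reference
        if code == j["CODE"]
    ]


def filter_by_clusters(journals, reference, clusters):
    """Keep each journal at most once if its code is in any wanted cluster."""
    wanted = {code for cluster, code in reference if cluster in clusters}
    return [j for j in journals if j["CODE"] in wanted]
```

Here `naive_join` returns the single journal for `patient-1` twice, once per cluster, which would inflate counts. Reducing the reference data to a distinct set of codes before filtering, as `filter_by_clusters` does, avoids that.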
This diagram shows which fields in the reference data can be used to link to the GDPPR data.
For efficiency, the two logical tables extracted via the GPES extract, JOURNALS and PATIENTS, are merged into a single combined table for use as a data set by NHS Digital. This does not alter the data extracted or compromise the security or information governance of the received data. It means that the coded information recorded against a patient (JOURNALS) and the demographic information about that patient (PATIENTS) are held in the same physical record. Both can therefore be retrieved together in a single query, without joining the two tables, which would be a less efficient and more costly operation.
Conceptually this can be thought of as each JOURNAL record contains additional columns containing the details from the PATIENTS table about the patient corresponding to the JOURNAL record.
This diagram is showing the merged view that is provided within the eventual data asset for utilisation – the CODE column is the SNOMED CT code of the journal entry, ADDRESS_5 and ETHNIC are from the PATIENT table.
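The merge can be pictured with a short sketch. The linking field `PATIENT_KEY` is hypothetical, since this guidance does not name the field that ties a journal to its patient; `CODE`, `ADDRESS_5` and `ETHNIC` are the columns mentioned above, and all values are placeholders.

```python
# Hypothetical link field; not a documented GDPPR column.
LINK_KEY = "PATIENT_KEY"


def merge_tables(journals, patients):
    """Copy each patient's columns onto every one of their journal rows."""
    by_key = {p[LINK_KEY]: p for p in patients}
    merged = []
    for j in journals:
        patient = by_key[j[LINK_KEY]]
        # Patient columns are repeated on each journal row.
        merged.append({**patient, **j})
    return merged


patients = [{"PATIENT_KEY": "k1", "ADDRESS_5": "placeholder", "ETHNIC": "placeholder"}]
journals = [
    {"PATIENT_KEY": "k1", "CODE": "code-1"},
    {"PATIENT_KEY": "k1", "CODE": "code-2"},
]
combined = merge_tables(journals, patients)
```

The result has one row per journal entry, so patient details are repeated across rows; this matters when counting patients rather than journal entries.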
The merged view, which forms the GDPPR data asset, will always contain the most up-to-date view of the data; for example, new records and the corresponding patient information will be appended to this view as they are extracted and processed. It is suggested that users utilise the available snapshots of the data, or create their own, to provide a stable dataset for analysis and to enable replication of results from previous analyses.
To view the most up-to-date version of a patient's record, users should utilise the REPORTING_PERIOD_END_DATE and JOURNAL_REPORTING_PERIOD_END_DATE fields. These fields contain the date that journal records were extracted from GP systems and can therefore be used to filter the data to only include the most recent extract date for each patient. By using the maximum JOURNAL_REPORTING_PERIOD_END_DATE for each patient, users are able to filter out journals which may have been amended or deleted, as these journals will have older dates.
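A minimal sketch of that filter, assuming the data has been loaded as a list of dictionaries keyed by the GDPPR field names (the in-memory representation is an assumption; in the DAE the same logic would normally be expressed in a query):

```python
def latest_journals(rows):
    """Keep only rows from each patient's most recent extract."""
    latest = {}
    # First pass: find the maximum extract date per patient.
    for r in rows:
        patient = r["NHS_NUMBER"]
        extracted = r["JOURNAL_REPORTING_PERIOD_END_DATE"]
        if patient not in latest or extracted > latest[patient]:
            latest[patient] = extracted
    # Second pass: keep rows whose date equals that maximum.
    return [
        r for r in rows
        if r["JOURNAL_REPORTING_PERIOD_END_DATE"] == latest[r["NHS_NUMBER"]]
    ]
```

The dates only need to be comparable, for example `datetime.date` values or ISO-format strings such as `2020-05-18`.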
In its current state, the data asset can be aggregated by grouping on any of the current fields in the data for example patient level (NHS_NUMBER), practice level (PRACTICE) or supplier level (GP_SYSTEM_SUPPLIER). To aggregate by other possible areas of interest, such as CCG (Clinical Commissioning Group) or region, users will need to join reference data to the GDPPR data asset. This process will be different depending on whether users access the asset via a physical data extract via MESH, or the DAE.
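As a hedged illustration of grouping on an existing field, the sketch below counts distinct patients per practice. It assumes rows are held as dictionaries keyed by the `PRACTICE` and `NHS_NUMBER` field names; the function name is hypothetical.

```python
from collections import defaultdict


def patients_per_practice(rows):
    """Count distinct NHS_NUMBER values for each PRACTICE."""
    seen = defaultdict(set)
    for r in rows:
        seen[r["PRACTICE"]].add(r["NHS_NUMBER"])
    return {practice: len(patients) for practice, patients in seen.items()}
```

Distinct patients are counted rather than rows because the data asset holds one row per journal entry, so a patient with many coded entries appears many times.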
Physical extract - reference data
Users with a physical extract of the GDPPR data asset can download reference data through the TRUD.
DAE reference data
Within the DAE, reference data is stored in the dss_corporate database. NHS Digital internal users can use the DSS report to understand what reference data is available, and how it should be used to filter the GDPPR data asset.
External users are advised to look at the NHS Digital data registers service to understand what reference data is available, and how it should be used to filter the GDPPR data asset.
Reference tables which are thought to be particularly useful to the GDPPR data asset are listed in this table.
|Asset name||Description||Notes||Fields to join (a = GDPPR, b = reference data)|
|ods_practice_v02||Contains practice mapping information including practice names and the codes of the CCG/Region they belong to||To get data for open and active practices this table must be filtered using: DSS_RECORD_END_DATE is null; CLOSE_DATE is null||a.PRACTICE = b.CODE|
|gp_patient_list||Contains the number of patients registered at GP practices broken down by age and gender||For the correct GP patient list size, EXTRACT_DATE should be filtered to the first of whichever month GDPPR data was most recently extracted, e.g. if data was last extracted on 2020-05-18 then EXTRACT_DATE = 2020-05-01||a.PRACTICE = b.PRACTICE_CODE|
|org_daily||Contains further mapping information||This table should be used in conjunction with ods_practice_v02 for mapping regions/CCGs etc. For the most recent information the table should be filtered using: ORG_CLOSE_DATE is null; BUSINESS_END_DATE is null; ORG_IS_CURRENT = 1. Mapping information for GP practices is available within this table but is not as frequently updated, which is why ods_practice_v02 should be used in conjunction with org_daily.||b.ORG_CODE = relevant field from ods_practice_v02|
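Two of the notes in the table lend themselves to a short sketch: deriving the EXTRACT_DATE filter for gp_patient_list, and restricting ods_practice_v02 to open practices before joining on `a.PRACTICE = b.CODE`. The function names and the `NAME` column are hypothetical, and rows are assumed to be dictionaries with `None` standing in for null.

```python
from datetime import date


def patient_list_extract_date(last_gdppr_extract: date) -> date:
    """First day of the month in which GDPPR data was last extracted."""
    return last_gdppr_extract.replace(day=1)


def open_practices(ods_rows):
    """Apply the ods_practice_v02 filter for open and active practices."""
    return [
        r for r in ods_rows
        if r["DSS_RECORD_END_DATE"] is None and r["CLOSE_DATE"] is None
    ]


def add_practice_name(gdppr_rows, ods_rows):
    """Join on a.PRACTICE = b.CODE; NAME is a hypothetical column."""
    by_code = {r["CODE"]: r for r in open_practices(ods_rows)}
    return [
        {**g, "PRACTICE_NAME": by_code[g["PRACTICE"]]["NAME"]}
        for g in gdppr_rows
        if g["PRACTICE"] in by_code
    ]
```

For the example in the table, `patient_list_extract_date(date(2020, 5, 18))` gives `date(2020, 5, 1)`. Note that `add_practice_name` drops GDPPR rows whose practice is closed or missing from the reference data; whether that is the right behaviour depends on the analysis.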
Data Coverage - Management Information
To assist users' and potential future users' understanding of the coverage and quality of the GPES Data for Pandemic Planning and Research (GDPPR) dataset, we have produced aggregate counts, proportions, and distributions of items found within the GDPPR dataset. Data quality and interpretation notes are included within the file to assist users in their understanding and interpretation of this data. This data is released as management information (MI) and should be interpreted carefully to ensure there are no misunderstandings.
This MI should be used:
- to understand the patient and practice coverage of the GDPPR dataset, as well as the distribution of that coverage
- to understand the data quality of the GDPPR dataset
- to understand the utilisation of code clusters within patient records, and practices
- in conjunction with the other information on the GDPPR analyst user guidance webpage
This MI should not be used:
- to infer epidemiological prevalence as code cluster utilisation is driven by several factors such as clinical code usage within a practice, whether a cluster contains declines/refusals, whether the cluster contains codes for other related conditions, as well as prevalence of that particular condition/observation/vaccination etc
Working with the data
Whilst the GDPPR data asset is relatively simple in terms of its data model and limited number of fields, it can be complex to use and can be used inappropriately if misunderstood. The information in the file below provides useful information and examples which will help users of the data to understand how to use it properly for the purposes of their analysis.
Data Quality Notes
As the GDPPR asset is a product which was developed rapidly in response to the coronavirus outbreak, limited quality assurance checks have been applied during data processing. Because of this, there are known Data Quality (DQ) issues within the dataset which could impact how the data is used.
The file below highlights known DQ issues which have been identified by current users of the GDPPR dataset. NHS Digital are sharing these DQ issues to:
- Inform people of the limitations of the dataset
- Prevent duplication of initial DQ checking by users of the data
- Aid potential users of the data in their understanding of whether this dataset is suitable for their needs
GDPPR subject matter experts have completed various analyses using the GDPPR dataset and are sharing code to:
- prevent duplication of work
- allow peer review of code and methodology used in analysis
- increase consistency of methodology across users
- increase general knowledge sharing
This GitHub code repository contains various analytical code such as code to categorise various patient factors such as ethnicity and BMI. If you would like to suggest changes to the available code or add your own code to the repository then please submit a pull request – all analytical code related to the GDPPR dataset is welcome.