Coronavirus as Recorded in Primary Care - EXPERIMENTAL STATISTICS

2. Data Preparation

The data includes patients with at least one COVID-19 related diagnosis in their primary care record between 1st March 2020 and 31st March 2021. Patients are included if they have at least one event recorded as a clinical or test diagnosis or a suspected case. Event types are defined through SNOMED codes.

2.1 SNOMED Codes

Primary care data systems use SNOMED codes to record events, symptoms, and diagnoses. These codes are continually updated to reflect the needs of the system, such as a new drug or a new virus. As the pandemic progressed and scientific knowledge of the virus developed, additional SNOMED codes became available for use in general practice. Others were no longer relevant, such as codes for drugs that were initially used to treat COVID-19 before other treatments were available. Section 7, Additional Information – COVID-19 SNOMED codes, lists the codes used to define the patient cohort. These were curated by internal NHS Digital clinicians with relevant expertise and labelled as clinical diagnosis, test diagnosis or suspected case.

In the initial period of the pandemic, COVID-19 specific codes and guidance on how to capture infected patients were limited. Further, some test diagnostic SNOMED codes were not available to GPs until 18 June 2020. As a result, a broader list of codes is used to define clinical or test diagnoses in the earlier months of the pandemic. There is no single date on which GPs started using the new codes; they were adopted gradually over time. For the purposes of this analysis, a cut-off date of 1 July 2020 is selected, before which the broader categorisations are used. This date was selected to allow some time for the new SNOMED codes to be incorporated widely in general practice and to coincide with the end of the first wave. Clinical judgement has been used to determine the best categorisation for each code.

2.2 Data Cleaning

The GDPPR data is event-based and contains many distinct records for each patient. The type of event is determined by the SNOMED code.

Several filters are applied to the source data to create the event-based data asset that is used for subsequent analysis in this report:

  • Filtered to the SNOMED codes relating to a COVID-19 clinical or test diagnosis or suspected case, listed in section 7, Additional Information – COVID-19 SNOMED codes.
  • Filtered to events for patients resident in England. This is based on the lower-layer super output area (LSOA) of residence on event date.
  • Filtered to events recorded as occurring between 1 March 2020 and the date the record was first received by NHS Digital’s Data Processing Service.
  • Filtered to remove event records with an unrealistic date of birth entry (removing all records indicating an age outside of the range 0 to 130 years old at the time of the event).
  • Records are deduplicated on NHS number, event date and SNOMED code (removing, for example, erroneous double entries by data suppliers or records of multiple positive tests on the same day).
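
The filters above can be sketched in a few lines of Python. This is an illustrative sketch only, not the production pipeline: the field names (`nhs_number`, `snomed_code`, `lsoa`, `event_date`, `received_date`, `dob`) are hypothetical, the date bounds are assumed to be inclusive, and residence in England is approximated by checking that the LSOA code starts with `E01`.

```python
from datetime import date


def age_in_years(dob, on):
    """Completed years between a date of birth and the event date."""
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))


def clean_events(events, covid_codes, study_start=date(2020, 3, 1)):
    """Apply the filters in list order, then deduplicate.

    `events` is an iterable of dicts with hypothetical field names.
    """
    seen = set()
    kept = []
    for e in events:
        # 1. Keep only COVID-19 related SNOMED codes.
        if e["snomed_code"] not in covid_codes:
            continue
        # 2. Keep only patients resident in England on the event date
        #    (assumption: English LSOA codes start with "E01").
        if not (e["lsoa"] or "").startswith("E01"):
            continue
        # 3. Event date between the study start and the date of receipt
        #    (assumption: both bounds inclusive).
        if not (study_start <= e["event_date"] <= e["received_date"]):
            continue
        # 4. Remove unrealistic dates of birth (age outside 0 to 130).
        if not (0 <= age_in_years(e["dob"], e["event_date"]) <= 130):
            continue
        # 5. Deduplicate on NHS number, event date and SNOMED code.
        key = (e["nhs_number"], e["event_date"], e["snomed_code"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(e)
    return kept
```

Tracking the deduplication key in a set keeps the first occurrence of each record, so the order of the input is preserved in the output.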

The resulting filtered event records are mapped to the categories of test diagnosis, clinical diagnosis or suspected case based on SNOMED codes. For a small subset of codes, the assigned category differs depending on whether the event occurred before or after 1 July 2020. Table 1 shows the number and percentage of records assigned to each category in the event-based data asset.
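
As a rough illustration of the date-dependent mapping, the sketch below stores a pair of categories for each code, one applying before the cut-off and one from the cut-off onwards. The code values are placeholders rather than real SNOMED codes, and treating an event on 1 July 2020 itself as "after" is an assumption; the text does not say which side the boundary date falls on.

```python
from datetime import date

CUTOFF = date(2020, 7, 1)

# Hypothetical mapping: code -> (category before cut-off, category from cut-off).
# Most codes have the same category in both periods.
CODE_CATEGORIES = {
    "CODE_A": ("test", "test"),
    "CODE_B": ("clinical", "clinical"),
    "CODE_C": ("clinical", "suspected"),  # broader definition early in the pandemic
}


def categorise(code, event_date, mapping=CODE_CATEGORIES, cutoff=CUTOFF):
    """Return the category for an event, honouring the cut-off date."""
    before, after = mapping[code]
    return before if event_date < cutoff else after
```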

Table 1: the number and percentage of records in the event-based data asset that have a SNOMED code in one of the three groupings (columns: test diagnosis, clinical diagnosis, suspected).

The demographics fields in the filtered data each have 100% coverage, where coverage is the percentage of non-null values. The only exception is ethnicity, which is discussed further in section 2.4.
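
Coverage as defined here is a simple ratio. A minimal sketch, assuming records are dictionaries and that a missing key counts as null:

```python
def coverage(records, field):
    """Percentage of records with a non-null value for `field`."""
    records = list(records)
    if not records:
        return float("nan")
    non_null = sum(1 for r in records if r.get(field) is not None)
    return 100.0 * non_null / len(records)
```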

From the event-based data asset, the most appropriate demographic information for each patient must be chosen. The demographics for each patient are taken from the most recent event record, except for ethnicity (see section 2.4).
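
One way to take demographics from the most recent event is sketched below. The field names are hypothetical, and the handling of two events on the same latest date (the first one encountered is kept) is an assumption, since the text does not specify a tie-break.

```python
from datetime import date


def latest_demographics(events, fields=("sex", "lsoa", "dob")):
    """Return {nhs_number: demographics} taken from each patient's latest event."""
    latest = {}
    for e in events:
        current = latest.get(e["nhs_number"])
        # Strict comparison: on a tie, the earlier-seen record is kept.
        if current is None or e["event_date"] > current["event_date"]:
            latest[e["nhs_number"]] = e
    return {pid: {f: e[f] for f in fields} for pid, e in latest.items()}
```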

2.3 Cohort selection date

The results are presented for the cohort of patients with a test or clinical diagnosis, or a suspected COVID-19 case. Patients are included in only one of these three sub-cohorts.

A key event is identified for each patient. This event determines the date at which they enter the overall cohort and which sub-cohort they are allocated to. The key event is selected according to the following prioritisation (descending order): i) test or clinical diagnosis, ii) broader test or clinical diagnosis before 1 July 2020, iii) suspected diagnosis, iv) broader suspected diagnosis before 1 July 2020. This prioritisation assumes that patients can only contract COVID-19 once and that test or clinical diagnoses are more reliable than the broader definitions used early in the pandemic or suspected cases.

For example, a patient with a positive test in August 2020 and a suspected case in March 2020 will only be included in the test-diagnosed group in August 2020 and will not be included in the suspected group. None of the patients included in the suspected group shown in this analysis have a test or clinical diagnosis within the study period. Patients with more than one test or clinical diagnosis on different dates are only counted once on the first date.
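
The prioritisation can be expressed as a sort key: lowest priority rank first, then earliest date. The sketch below is illustrative only. The event structure (`category`, `broader`, `event_date`) is hypothetical, and when a test and a clinical diagnosis share the same first date the sketch keeps whichever appears first in the input, which is an assumption.

```python
from datetime import date


def priority(event):
    """Rank an event: 0 is the highest priority, 3 the lowest."""
    confirmed = event["category"] in ("test", "clinical")
    if confirmed:
        return 1 if event["broader"] else 0
    return 3 if event["broader"] else 2


def key_event(patient_events):
    """Pick the highest-priority event, then the earliest date within that tier."""
    return min(patient_events, key=lambda e: (priority(e), e["event_date"]))


# The example from the text: suspected in March 2020, positive test in August 2020.
events = [
    {"category": "suspected", "broader": False, "event_date": date(2020, 3, 15)},
    {"category": "test", "broader": False, "event_date": date(2020, 8, 10)},
]
chosen = key_event(events)
print(chosen["category"], chosen["event_date"])  # test 2020-08-10
```

Because each patient contributes exactly one key event, a patient cannot appear in more than one sub-cohort.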

There are 5,123,640 unique patients with COVID-19 identified using this method. Table 2 shows the number of unique patients in each of the three sub-cohorts. Figure 1 displays the total patients as they are included in the analysis by case type. Figure 2 highlights the cases defined using the broader definition up to 1 July 2020. Initially, the majority of patients are identified through clinical diagnosis but this changes over time as testing becomes widely available.

Table 2: the number and percentage of unique patients included in the cohort from each of the three groupings (columns: test diagnosis, clinical diagnosis, suspected).


Figure 1: Total number of patients included in the analysis by case definition

Figure 2: Total number of patients included in the analysis by case definition highlighting those identified using a broader definition up to 1 July 2020.

2.4 Ethnicity selection

The coverage of ethnicity in the filtered GDPPR event-based data set is 44.8%. However, in cases where ethnicity is not recorded for a specific event, it is often possible to extract this information from elsewhere in a patient’s medical record. An algorithm was used to increase completeness of ethnicity data, drawing on wider medical records from both GDPPR and Hospital Episode Statistics (HES) data sets.

Ethnicity in GDPPR is recorded in two possible ways. The simplest is as a code in the ETHNIC field in each record. The second is through the patient's journal records, which contain SNOMED codes that represent specific ethnicities. By searching for these codes and mapping from the SNOMED code to the ONS ethnicity category code, a second source of ethnicity data is obtained.

The ethnicity algorithm takes a list of unique patients and searches the GDPPR data for the standard and journal ethnicities. Three HES data sets going back 5 years are also searched: inpatient, A&E, and outpatient data. The five methods and sources are combined into a single asset in which each patient may have multiple, potentially conflicting, ethnicity codes. The ethnicity of the most recent record for each patient is used. If there are multiple ethnicities for a patient on a single day, then the preferred ordering is by data set: GDPPR journal, GDPPR patient, hospital inpatient data, hospital A&E record, and hospital outpatient data. If there are conflicting ethnicities on a single day within a single data set or method, then these records are ignored and records from the next highest priority data set or method, or the next most recent day, are considered. The ethnicity coverage after applying the algorithm is 97.6%. Excluding ethnicity code Z, “not stated”, the coverage is 91.4%.
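
The selection rule for a single patient can be sketched as follows. This is one reading of the description above, not the actual implementation: the source labels are hypothetical names, records are assumed to be `(record_date, source, ethnicity_code)` tuples, and on a conflict the sketch tries the remaining sources on the same day before moving to an earlier day.

```python
from collections import defaultdict
from datetime import date

# Highest priority first, following the ordering given in the text.
SOURCE_PRIORITY = [
    "gdppr_journal", "gdppr_patient", "hes_inpatient", "hes_ae", "hes_outpatient",
]


def select_ethnicity(records):
    """Choose one ethnicity code for a patient, or None if none is usable."""
    # Group the distinct codes seen for each (day, source) pair.
    groups = defaultdict(set)
    for record_date, source, code in records:
        if code is not None:
            groups[(record_date, source)].add(code)

    rank = {s: i for i, s in enumerate(SOURCE_PRIORITY)}
    # Most recent day first; within a day, highest-priority source first.
    ordered = sorted(groups, key=lambda k: (-k[0].toordinal(), rank[k[1]]))

    for key in ordered:
        codes = groups[key]
        # Skip a (day, source) pair whose records disagree with each other.
        if len(codes) == 1:
            return next(iter(codes))
    return None
```

Code Z ("not stated") is treated here as an ordinary value, which is consistent with the two coverage figures quoted above being reported separately.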

Last edited: 19 May 2021 1:57 pm