Methodology for identifying and removing duplicate records from the HES data set

This guidance describes the methodology used to identify and remove duplicate records within the HES (Hospital Episode Statistics) data set. These are records that have been erroneously submitted to the Secondary Uses Service (SUS) multiple times; they do not appear multiple times in the providers' local IT systems.

It is not designed to detect duplicate records generated within the local IT systems that providers use to collect and record data prior to submission.

The duplicate identification and removal process runs after provider codes have been mapped (cleaned) in HES. It uses the cleaned provider code rather than the submitted provider code. This increases the accuracy of the methodology, as duplicates will be detected and removed even when the same records have been submitted under both a correct and an invalid provider code. Please note that this methodology uses a HES-specific provider code field (PROCODED) that is not available in the final HES data set. PROCODED represents the provider organisation (e.g. the NHS trust) for NHS providers, and the code from the PROCODE field for Independent Sector Health Providers (which represents a site for this type of provider).
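
The effect of matching on the cleaned provider code can be illustrated with a minimal Python sketch. The lookup table, the code RNN0X and the helper name procoded are all invented for illustration; the actual mapping applied in HES is not described in this guidance.

```python
# Hypothetical lookup from submitted provider codes to cleaned codes.
# The entries are invented for illustration only.
CLEANED = {"RNN": "RNN", "RNN0X": "RNN"}

def procoded(submitted):
    """Return the cleaned code, falling back to the submitted code if unmapped."""
    return CLEANED.get(submitted, submitted)

# The same patient record submitted under a correct and an invalid code.
first = {"PROCODE": "RNN", "LOPATID": "A1"}
second = {"PROCODE": "RNN0X", "LOPATID": "A1"}

# Comparing submitted codes misses the match; comparing cleaned codes finds it.
match_on_submitted = first["PROCODE"] == second["PROCODE"]
match_on_cleaned = procoded(first["PROCODE"]) == procoded(second["PROCODE"])
```

In this sketch the two records only compare as equal once both provider codes have been cleaned.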

Duplicate methodology

It is common practice for providers to resubmit records to SUS (for example when corrections have been made, or when additional details that were not available at the time of submission have been added to records). When a submission of data is sent by a provider to SUS, one of the key fields used to apply those records is the CDS Sender Identity (SENDER) field. All 5 characters of this code are taken into account (for example, a SENDER code of RNN is different from RNN00, and both are different from RNN01). All matching data for that SENDER code in SUS are deleted and the newly submitted records are then inserted.

If the same records are submitted to SUS using a different SENDER code, the existing records will not be removed and the new records will be inserted into the database, causing duplication.
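
The replace-by-SENDER behaviour, and the way a changed SENDER code leads to duplication, can be sketched in Python as follows. This is a simplified model only: the in-memory list, the function name apply_submission and the sample records are assumptions, and every stored record with the same SENDER is treated as matching data.

```python
def apply_submission(store, sender, records):
    """Delete every stored record for this SENDER, then insert the new records.

    Simplified stand-in for the behaviour described above, not the SUS interface.
    """
    kept = [r for r in store if r["SENDER"] != sender]
    return kept + [dict(r, SENDER=sender) for r in records]

# Invented sample data: one month containing two episodes.
january = [
    {"LOPATID": "A1", "EPISTART": "2021-01-04"},
    {"LOPATID": "B2", "EPISTART": "2021-01-09"},
]

store = apply_submission([], "RNN00", january)
store = apply_submission(store, "RNN00", january)  # same SENDER: old rows replaced
same_sender_count = len(store)

store = apply_submission(store, "RNN01", january)  # different SENDER: rows added
different_sender_count = len(store)
```

Resubmitting under RNN00 leaves two records in the store, whereas submitting the same data under RNN01 leaves four.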

When this happens, it almost always causes large numbers of duplicate records. Typically entire months of data are duplicated, and in extreme circumstances an entire year's worth of data for a provider can be duplicated.

The HES duplicate methodology is designed to identify duplication that has happened in this way. It looks for a match in key fields across two or more records, together with a difference in the SENDER field, which indicates that the duplication has occurred due to a change in SENDER.

A record is flagged as a duplicate when identical data is submitted in the following key fields:

Provider code of organisation acting as healthcare provider (PROCODED)

and either

NHS Number (NEWNHSNO)
or Local Patient Identifier (LOPATID)
or postcode of patient (HOMEADD) and Date of Birth (DOB) and Sex of patient (Sex)

and (depending on CDS type)

A&E

Arrival date (ARRIVAL DATE)
Arrival time (ARRIVAL TIME)

APC

Episode start date (EPISTART)
Episode end date (EPIEND)

OP

Appointment Date (APPTDATE)
Appointment Time (APPOINTMENT_TIME)
Main Speciality (MAINSPEF)
Treatment Speciality (TRETSPEF)
Consultant Code (CONSULT)
First Attendance (FIRSTATT)
Attended or Did Not Attend (ATTENDED)

and

non-identical data is submitted in the following key field:

CDS Sender Identity (SENDER)
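
The matching criteria above can be expressed as a pairwise check, sketched here in Python. Several points are assumptions made for this sketch: the labels AE and OP for the arrival and appointment field groups, the use of field names as dictionary keys (with SEX in upper case), the treatment of missing or empty values as non-matching, and the function names.

```python
# Key fields per data set, taken from the lists above.
# The grouping under "AE" and "OP" is an assumption of this sketch.
DATASET_KEYS = {
    "AE": ["ARRIVAL DATE", "ARRIVAL TIME"],
    "APC": ["EPISTART", "EPIEND"],
    "OP": ["APPTDATE", "APPOINTMENT_TIME", "MAINSPEF", "TRETSPEF",
           "CONSULT", "FIRSTATT", "ATTENDED"],
}

def same(a, b, fields):
    """True when every field is populated in record a and equal in both records."""
    return all(a.get(f) not in (None, "") and a.get(f) == b.get(f) for f in fields)

def is_duplicate_pair(a, b, dataset):
    """Apply the criteria: same provider, patient and event; different SENDER."""
    patient_match = (
        same(a, b, ["NEWNHSNO"])
        or same(a, b, ["LOPATID"])
        or same(a, b, ["HOMEADD", "DOB", "SEX"])
    )
    return (
        same(a, b, ["PROCODED"])
        and patient_match
        and same(a, b, DATASET_KEYS[dataset])
        and a.get("SENDER") != b.get("SENDER")
    )

# Invented example records for an admitted patient care episode.
rec_a = {"PROCODED": "RNN", "LOPATID": "A1", "EPISTART": "2021-01-04",
         "EPIEND": "2021-01-06", "SENDER": "RNN00"}
rec_b = dict(rec_a, SENDER="RNN01")      # same episode, different SENDER
rec_c = dict(rec_a)                      # same episode, same SENDER
rec_d = dict(rec_b, EPIEND="2021-01-07") # different episode end date

pair_ab = is_duplicate_pair(rec_a, rec_b, "APC")
pair_ac = is_duplicate_pair(rec_a, rec_c, "APC")
pair_ad = is_duplicate_pair(rec_a, rec_d, "APC")
```

Only the first pair satisfies the criteria: the second shares the same SENDER, and the third differs in one of the APC key fields.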

Processing and removal of duplicate records

This method assumes that, within any group of duplicates, the record with the latest submission date is the correct one. When duplicate records have been identified using the criteria described, the record with the latest Submission Date (SUBDATE) is flagged as 1 and is retained in the data set. All the other records in the group are flagged as 2 and are removed from the data set prior to publication.
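
The flagging and removal step can be sketched in Python as follows. This is an illustrative simplification rather than the production process: records are grouped on a single fixed list of key fields, the flag is stored under the hypothetical name DUP_FLAG, and when two records share the latest SUBDATE the first one encountered is retained.

```python
from collections import defaultdict
from datetime import date

def flag_duplicates(records, key_fields):
    """Flag the latest record in each duplicate group as 1 and the rest as 2.

    Returns the records that would be retained. DUP_FLAG is a hypothetical name.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)

    for group in groups.values():
        # A group only counts as duplicated when more than one SENDER is present.
        if len({r["SENDER"] for r in group}) < 2:
            continue
        latest = max(group, key=lambda r: r["SUBDATE"])
        for r in group:
            r["DUP_FLAG"] = 1 if r is latest else 2

    return [r for r in records if r.get("DUP_FLAG") != 2]

# Invented sample data: one duplicated episode and one unaffected episode.
records = [
    {"PROCODED": "RNN", "LOPATID": "A1", "EPISTART": "2021-01-04",
     "EPIEND": "2021-01-06", "SENDER": "RNN00", "SUBDATE": date(2021, 2, 1)},
    {"PROCODED": "RNN", "LOPATID": "A1", "EPISTART": "2021-01-04",
     "EPIEND": "2021-01-06", "SENDER": "RNN01", "SUBDATE": date(2021, 3, 1)},
    {"PROCODED": "RNN", "LOPATID": "B2", "EPISTART": "2021-01-09",
     "EPIEND": "2021-01-10", "SENDER": "RNN00", "SUBDATE": date(2021, 2, 1)},
]

retained = flag_duplicates(records, ["PROCODED", "LOPATID", "EPISTART", "EPIEND"])
```

Here the record submitted under RNN01 is retained with flag 1, the earlier RNN00 copy is flagged 2 and removed, and the unrelated episode is left unflagged.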

Last edited: 7 December 2021 9:48 am