
How we stealthily migrated the COVID-19 National Testing Service

At the height of the pandemic, NHS Digital took the decision to migrate the COVID-19 National Testing Service to a new platform. Mark Byrne, Head of Technology, explains how the team succeeded in doing this while processing about 1 million test results per day – without anyone really noticing.

As the country emerged from lockdown in early 2021, we were planning a significant re-engineering of the COVID-19 National Testing Service. For a year, the platform had evolved as numerous features were added to support the pandemic response.

This organic growth had led to architectural issues that were impeding our ability to continue expanding the platform. The volume of change for the service was unrelenting, with 60 releases per month being the norm, and the original architecture just wasn’t fit for purpose anymore.


As the pandemic continued to evolve, the pace of functional change would have to be maintained, and we needed near-zero downtime because lateral flow tests were in use 24/7.

In March 2021, while planning was underway, the service experienced its highest volume to that point, with peaks of 1.8 million test results per day and 291,000 per hour. Migrating a platform of this size would be a challenge under any circumstances, but migrating a key platform underpinning the pandemic response at the height of COVID-19 testing was a significant undertaking.


Which platform?

At the time, NHS Test and Trace – later the UK Health Security Agency (UKHSA) – were also developing their strategic Halo cloud platform for service delivery, which would provide improved security for the platform.

At NHS Digital, there was a strong desire to migrate to Halo so that the service was delivered from a platform owned by a government body with the improved security that Halo offered. It would also ensure that the data assets benefited from the strong governance that Halo provides.

To achieve this, the services and data would need to be migrated in a staged way between the platforms.

The Halo migration required:

  • services to be modified to adhere to the different contexts and security controls on Halo
  • migration of all subject and COVID-19 test data to Halo
  • orchestration of a complex set of external dependencies as services were migrated
  • coordination with a host of other government agencies and providers to make corresponding changes as services were migrated

Our approach

We determined that the value stream concept from the Scaled Agile Framework (SAFe) would provide a sound structure for the future. In SAFe, a value stream is dedicated to building and supporting a set of services with the minimum number of handovers, thereby improving time to market. NHS Digital set out a value stream architecture for the National Testing Service where each value stream represented a set of independently testable and deployable services that could deliver value to end users.

It was also apparent that delivery efficiencies could be realised by dividing the service into independently testable and deployable components that could be operated by separate teams, potentially from different suppliers.


Value streams

The concept of value streams was chosen so that we would have a breakdown of the service into completely independent components that allowed teams to deliver end-to-end value for users without the dependencies and complex release planning that a tightly-coupled system would require. This approach also allowed migration of the service to Halo in phases and ensured that teams were as efficient as possible. The value stream architecture required services to be re-configured:

  • so that all value streams had a consistent integration pattern (using an API gateway), which ensured that they were independent
  • so that each value stream was able to deliver end-to-end value to users
  • to move capability between value streams so that the value streams were balanced in terms of the level of effort to support and maintain them
  • so that services could continue to scale to even higher volumes of test results per day
  • to ensure that limits in the underlying cloud infrastructure were not a barrier to further functional changes

Ways of working

We worked in close partnership with suppliers and colleagues across NHS Test and Trace to ensure services were all migrated successfully. Specific migration resources were deployed and integrated into the delivery function so that there continued to be close collaboration between the functional delivery teams and the migration resources. This ensured that the migration resources remained abreast of all changes to the service during the migration. Regular formal governance and ad hoc checkpoints between NHS Digital, suppliers and NHS Test and Trace ensured that delivery remained focussed. All of this was delivered by NHS Digital, UKHSA and supplier teams working remotely due to the pandemic.

The planning and execution of the migration took 8 months to complete. Dozens of incremental migrations and several hundred functional releases took place during this time, but only a handful of days were allocated to specific migration windows.


Critical success factors

A number of factors combined to make the migration a success:

1. Planning

Each stage of the migration was planned by the team that ran that service and the architects responsible for the overall migration. This ensured that we got both the input of those who knew each component best and any learning from previous work.

2. Close collaboration

NHS Digital, suppliers and the many sub-contractors and third parties involved worked together closely and openly on issues to ensure the quickest resolution.

3. Incremental migration

The services were migrated incrementally. This was enabled by the initial work to ensure services were independent. This approach helped contain any issues and made them easier to investigate and resolve.

4. Willingness and ability to adapt

The complexity of the migration and the context of doing it during a pandemic meant that challenges occurred frequently. The close collaboration and incremental nature of the migration allowed for plans to change with limited impact.

5. Rehearse and automate

Each migration step was rehearsed to ensure it would work and automated to ensure it was always done the same way.

6. Approach to data migration

Migrating the vast history of subject and test result data was one of the most complex areas. This was completed in close collaboration with the cloud platform provider. As we were pushing the boundaries of the standard replication process, we knew that failures were possible. We therefore rehearsed and prepared a detailed timeline for the main and top-up replication processes, with checkpoints for each detailed step, so that we would be able to identify and quickly recover from any failures.
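To illustrate the idea of checkpointing each step so that a failed run can be resumed, here is a minimal Python sketch. It is not the tooling used in the migration, which the article does not describe. The function name `run_with_checkpoints`, the step names and the JSON checkpoint file are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical helper: a step is a (name, action) pair. Names and the
# checkpoint file format are assumptions made for this illustration.
Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(steps: List[Step], checkpoint_file: Path) -> List[str]:
    """Run steps in order, skipping any already recorded as complete.

    Returns the names of the steps executed in this run. If a step raises,
    the exception propagates and that step is NOT recorded, so a later run
    resumes from it.
    """
    done = set()
    if checkpoint_file.exists():
        done = set(json.loads(checkpoint_file.read_text()))

    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier run
        action()  # may raise; the checkpoint is only written on success
        done.add(name)
        checkpoint_file.write_text(json.dumps(sorted(done)))
        executed.append(name)
    return executed
```

Calling the function again with the same checkpoint file after a failure re-runs only the steps that had not yet completed, which is the recovery property the checkpoints are meant to provide.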

7. Data reconciliation

Although standard migration tools were used where possible, we also validated the outputs of these tools by sampling the migrated data. This gave us high levels of confidence moving forward after long-running data migration tasks.
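As a rough illustration of validating migrated data by sampling, the Python sketch below compares a random sample of rows between a source and a target. It assumes rows can be looked up by key and compared field by field. The function names, the in-memory dictionaries and the fingerprinting scheme are assumptions for the example, not a description of the checks actually run.

```python
import hashlib
import random
from typing import Dict, Hashable, List


def row_fingerprint(row: Dict[str, object]) -> str:
    """Hash a row's fields in a stable (sorted-by-name) order."""
    canonical = "|".join(f"{name}={row[name]}" for name in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile_sample(
    source: Dict[Hashable, Dict[str, object]],
    target: Dict[Hashable, Dict[str, object]],
    sample_size: int,
    seed: int = 0,
) -> List[Hashable]:
    """Return the sampled keys that are missing or different in the target."""
    rng = random.Random(seed)  # seeded so a check can be repeated exactly
    keys = sorted(source)
    sample = rng.sample(keys, min(sample_size, len(keys)))

    mismatches = []
    for key in sample:
        if key not in target:
            mismatches.append(key)
        elif row_fingerprint(source[key]) != row_fingerprint(target[key]):
            mismatches.append(key)
    return mismatches
```

An empty result for a sample gives confidence rather than proof. A larger sample increases the chance of catching a rare discrepancy, at the cost of a longer check.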

8. Keeping the service running

A vital platform such as the National Testing Service during a global pandemic simply couldn’t fail. We ensured continued operation by running parallel services (on the old and new platforms) and migrating the consumers of the services incrementally.
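One common way to move consumers across incrementally while both platforms run in parallel is to route a growing percentage of them to the new platform. The Python sketch below shows that general technique only. The article does not say how consumers were actually switched, and `routes_to_new_platform` and the percentage-based rollout are assumptions.

```python
import hashlib


def routes_to_new_platform(consumer_id: str, rollout_percent: int) -> bool:
    """Decide whether a consumer is served by the new platform.

    Each consumer is hashed into a fixed bucket from 0 to 99. A consumer
    moves to the new platform once rollout_percent exceeds its bucket, and
    stays there as the percentage rises.
    """
    digest = hashlib.sha256(consumer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the consumer identifier, raising the percentage never moves a consumer back to the old platform, and setting it to 0 returns everyone to the old platform if a problem is found.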


The outcome

All services and data were successfully re-configured into value streams and migrated to Halo with very little impact on the relentless roadmap of change and with very limited user impact. Over a billion rows of data were migrated without any migration-related data issues being encountered.

As the migration neared completion and most services had been migrated, the National Testing Service achieved its highest ever monthly release rate, delivering on one of the goals of creating independent services that could release more easily. Shortly after the migration completed, the National Testing Service had its busiest day ever, with nearly 2 million results reported. No platform issues were encountered.

For most platform migrations, the biggest indicator of success is going unnoticed by users. By this measure, the rebuild of the COVID-19 National Testing Service, at the height of a global pandemic, was a great success for our team.





Last edited: 10 October 2022 10:49 am