On 8 June 2021 we saw the biggest day for COVID-19 vaccination bookings, taking over 1 million bookings through the National Booking Service. At our busiest time we were taking over 150,000 bookings in an hour.
When designing the system, we didn’t know how many people would take up the vaccine or how they would be invited to get it. Crazy requirements started to be thrown about: "It must deal with 50 million people all wanting a vaccine at the same time!"
This just wasn’t realistic. The whole country wasn’t going to stop to book a vaccine when the time came. However, we had two things to consider:
- How do we protect the system against a massive surge in traffic?
- How do we regulate the traffic going into the system?
Being British, the answer had to be a queue!
However, from an architecture point of view, we didn’t want people to queue; we wanted people to get their booking as quickly as possible. If there is a queue, it’s better to come back later than wait. Overall, the system had to be designed to be as quick as possible, letting as many people in to book as it could.
Protecting against a surge
Within NHS.UK, we already use a product called Akamai as a Web Application Firewall (WAF) and Content Delivery Network (CDN). This ensures we can cache as much of the application as possible, but also provides an element of security to protect the application.
Akamai has servers distributed across the world to provide high levels of resilience. We had already implemented its Visitor Prioritization Cloudlet successfully for the COVID-19 testing solution we built when tests were first announced, and we continued to use it for the vaccine booking service. It uses the vast capacity of Akamai to hold people in a waiting area, letting through a percentage of visitors that we control.
Given the scale of Akamai, we were confident it could deal with the amount of traffic we might see and hold it for us. Because this control is percentage based, we don’t have an absolute number of people that will enter the system at any one time. For instance, if we let 0.1% of people into the system, that would be 2,000 people a second, which is far too high. On top of this, we can’t tell people how long this page will hold them for. The user experience isn’t fantastic, but it is designed to protect the system.
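To make the percentage point concrete, here is a minimal sketch of a percentage-based gate. It is not how Akamai implements the cloudlet; the function names and the arrival rates are hypothetical, chosen only to show that the number of people admitted scales with however many turn up. By simple arithmetic, 2,000 people a second at 0.1% corresponds to around 2 million arrivals a second.

```python
import random


def expected_admitted(arrivals_per_second: float, admit_percent: float) -> float:
    """Expected admissions per second for a given arrival rate and percentage."""
    return arrivals_per_second * (admit_percent / 100.0)


def admitted_per_second(arrivals: int, admit_percent: float, rng: random.Random) -> int:
    """Simulate one second of arrivals at a percentage-based gate.

    Each arrival is let through independently with probability
    admit_percent / 100; everyone else would see the waiting page.
    """
    probability = admit_percent / 100.0
    return sum(1 for _ in range(arrivals) if rng.random() < probability)


if __name__ == "__main__":
    rng = random.Random(2021)  # fixed seed so repeated runs match
    # Hypothetical arrival rates: the same 0.1% setting admits very
    # different absolute numbers depending on how many people arrive.
    for arrivals in (10_000, 100_000, 2_000_000):
        print(arrivals, expected_admitted(arrivals, 0.1))
    # A single simulated second at a smaller, hypothetical arrival rate.
    print(admitted_per_second(50_000, 0.1, rng))
```

The gate never sees an absolute limit, only a probability, which is why a second, count-based control is useful behind it.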
Setting the thresholds in Akamai is pretty easy, either through the portal or by using a command line interface. We took the approach of creating a pipeline within Azure DevOps to do this and have subsequently linked it to Microsoft Teams, so it can be triggered by a team that carries out monitoring.
This also means we can defer the deployment to a later time and date. As we know when new cohorts are being invited, we can set Akamai to deploy at a particular time in the morning to provide that surge protection, without the need to get up early!
Controlling the numbers in the queue
Once we have dealt with the surge, we then wanted a more granular way of controlling the number of people in the booking journey at any one time. This is where the second queue from Netacea comes in.
Image: the National Booking Service queue page, showing a person’s place in the vaccination queue.
We have 2 levers to pull when controlling the number of people coming into this queue: the number of people actually making a booking and the number of people in the queue. We control the number of people entering the queue by using Akamai, as above.
We control the number of people making a booking by adjusting the zone size. The zone size is the actual number of people using the booking system. At any one time, we allowed approximately 12,000 people to access the bookings. This was determined through extensive performance testing and monitoring key components within the solution.
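The zone can be pictured as a fixed-capacity gate fed by a first-in, first-out queue: someone is admitted only when a place is free, which gives the one-out, one-in behaviour of the queue. The sketch below is illustrative only and is not Netacea’s implementation; `BookingZone` and its methods are hypothetical names.

```python
from collections import deque
from typing import Optional


class BookingZone:
    """Fixed-capacity zone fed by a first-in, first-out queue (illustrative)."""

    def __init__(self, zone_size: int) -> None:
        self.zone_size = zone_size
        self.in_zone = set()    # visitor ids currently in the booking journey
        self.waiting = deque()  # visitor ids queued, oldest first

    def arrive(self, visitor_id: str) -> bool:
        """Admit the visitor if there is room and nobody is ahead; else queue them."""
        if not self.waiting and len(self.in_zone) < self.zone_size:
            self.in_zone.add(visitor_id)
            return True
        self.waiting.append(visitor_id)
        return False

    def leave(self, visitor_id: str) -> None:
        """One out, one in: free a place, then admit from the front of the queue."""
        self.in_zone.discard(visitor_id)
        self._fill()

    def resize(self, zone_size: int) -> None:
        """Change the zone size; a larger zone admits queued visitors at once."""
        self.zone_size = zone_size
        self._fill()

    def position(self, visitor_id: str) -> Optional[int]:
        """1-based place in the queue, or None if the visitor is not queued."""
        try:
            return self.waiting.index(visitor_id) + 1
        except ValueError:
            return None

    def _fill(self) -> None:
        """Move queued visitors into the zone while places are free."""
        while self.waiting and len(self.in_zone) < self.zone_size:
            self.in_zone.add(self.waiting.popleft())
```

In this picture, adjusting the zone size is a single call such as `zone.resize(12_000)`. Shrinking the zone does not eject anyone; it simply stops admissions until enough people have left.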
The chart below shows what we experienced on 8 June, when we had over 1 million vaccination bookings in a day. The queue began building in the morning and continued to build at a rate we'd not had with previous cohorts.
As the queue grew, we allowed more people into the booking journey until we hit a level that gave sufficient throughput without unduly stressing parts of the architecture.
Once the queue started to reduce, we then released more people through from Akamai into the queue. The second queue of people was cleared around 2pm.
Why was the queue bigger this time?
We have this dual queue approach because we don’t know how people are going to interact with the system.
Are we going to see a massive surge of people? Will people wait for the SMS to arrive before making their booking? Or will people hear from elsewhere that it's available and go to book straight away?
Based on the traffic profiles, it’s more likely that it’s the latter.
The journey is broken up into 2 parts: checking eligibility and actually booking a vaccination. The queue needs to protect both; it would be rather odd to have a third queue halfway through the journey after you’d been shown to be eligible.
A record-breaking day
A proportion of people were being told they weren’t eligible on 8 June, which probably meant they were spending more time in the eligibility part of the journey than normal.
This has a knock-on impact of not letting more people in, as the queue has a one-out one-in approach. If people can’t get in, they start to queue. That’s not to say we wouldn’t have had a queue at all, but it may have been smaller. This is where we then start to pull the levers available to us:
- increase the number of people in the eligibility part of the journey, which has an overall capacity greater than booking, knowing a certain percentage won’t get through but enabling a constant throughput into booking
- monitor the different parts of the system to ensure nothing is running at a level higher than we are comfortable with and then where possible increase the throughput of people
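A rough way to see why longer journeys grow the queue is Little's law: in a steady state, throughput equals the number of people in the zone divided by the average time each spends there. The sketch below applies it to the figures quoted in this post. It assumes that the zone of roughly 12,000 and the peak of 150,000 bookings an hour applied at the same moment, and that everyone leaving the zone made a booking; neither is confirmed here. The 6-minute journey is a made-up value for comparison, and the function names are hypothetical.

```python
def throughput_per_hour(zone_size: int, mean_minutes_in_journey: float) -> float:
    """Little's law rearranged: people leaving the zone per hour."""
    return zone_size * 60.0 / mean_minutes_in_journey


def implied_minutes_in_journey(zone_size: int, completions_per_hour: float) -> float:
    """Average time in the zone implied by a zone size and an hourly throughput."""
    return zone_size * 60.0 / completions_per_hour


if __name__ == "__main__":
    zone_size = 12_000       # approximate zone size quoted in the post
    peak_per_hour = 150_000  # peak hourly bookings quoted in the post

    # Assumption: both figures describe the same moment in time.
    print(implied_minutes_in_journey(zone_size, peak_per_hour))

    # Hypothetical: if people linger longer, for example in the eligibility
    # part of the journey, the same zone size lets fewer people through.
    print(throughput_per_hour(zone_size, 6.0))
```

With a one-out, one-in queue, the only ways to restore throughput are to shorten the journey or to enlarge the zone, and enlarging the zone is the lever available on the day.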
The key aim, though, is to get as many people as possible in to book a vaccine. On 8 June we did that. A record-breaking day for us all.
Last edited: 22 December 2021 11:21 am