On 23 December, Denmark was badly hit by a snowstorm. DSB, Danish Railways, worked hard to get passengers to their destinations on the busiest travel day of the year. The combination of inclement weather and a very busy day forced the company website, dsb.dk, to its knees.
Many organisations experience extreme traffic peaks on their websites, sometimes expected and sometimes not. I had a conversation with Thomas Jørgensen, system owner for the website in DSB's IT department. Thomas shared several crucial lessons on how to keep a busy site with regular peaks up and running, while also being prepared for when the unexpected happens.
The DSB website traffic challenge
DSB is the major rail operator in Denmark, so its website is generally a busy one. The website traffic pattern has its expected and regular ups and downs based on rush hour and holiday periods.
From a process perspective, Thomas told me that their work to address the traffic challenge is divided into two separate streams:
- Preparing carefully for when peaks are likely to happen, e.g. for busy travel days
- Handling unexpected peaks which require action straight away, e.g. emergencies such as a train getting stuck on a bridge or in a tunnel
For the unexpected traffic peaks, the communication department operates a 24/7 hotline. In case of an emergency, the communication department can, at the click of a button, activate a special emergency frontpage that leaves out the usual digital marketing campaigns.
DSB maintains its own hosting infrastructure and actively monitors website performance, enabling it to respond faster in case of issues. The actual website is implemented by Cap Gemini in Sweden using EPiServer as the content management system.
Keeping the DSB website up and running is not just about technology
In my conversation with Thomas, he shared a detailed account of the 23 December crash including both technical and non-technical take-aways. To quote:
"You can do everything possible at a technical level, including performance tuning, buying additional hardware and software, but most important to keep the website up are internal processes."
Specifically, Thomas is referring to internal communication procedures. The person responsible needs to know whom to contact straight away when trouble looks likely. Ideally, this empowers the web team to make quicker and better decisions as the situation unfolds.
You still fine-tune all the systems, but no matter how much hardware and software you have in place, you are still bound to run into unexpected and extreme peaks that can bring your website to a sudden halt.
DSB website traffic handling to-do list
Thomas shared a high-level list of the ongoing tasks DSB has in place:
- Make sure that the technology stack (operating system, database, application server and CMS) used for dsb.dk is tuned for performance to avoid running into bottlenecks
- Avoid scheduled administrative tasks and routines, e.g. backups and updates, in peak seasons and be prepared to stop these during unexpected peaks
- Add virtual servers as additional capacity to increase the load that the application servers can handle. This means that during peaks the capacity is significantly larger than normal
- Have a dedicated person responsible, with emergency procedures in place that are used for unexpected peaks
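The second item on the list, holding back backups and updates during peaks, can be sketched as a small guard that a scheduled job calls before it starts. This is only an illustration under stated assumptions: the peak dates, the load threshold and the function names are hypothetical and do not describe DSB's actual setup.

```python
from datetime import date, datetime

# Hypothetical planned peak periods (inclusive); illustrative dates only.
PEAK_PERIODS = [
    (date(2024, 12, 20), date(2024, 12, 27)),
]

# Hypothetical threshold in requests per second; a real value would come
# from the site's own monitoring.
LOAD_THRESHOLD_RPS = 500.0


def in_peak_period(day: date, periods=PEAK_PERIODS) -> bool:
    """Return True if the day falls inside any planned peak period."""
    return any(start <= day <= end for start, end in periods)


def may_run_maintenance(now: datetime, current_rps: float) -> bool:
    """Allow backups and updates only outside planned peaks and under the load threshold."""
    if in_peak_period(now.date()):
        return False  # planned peak: skip scheduled jobs entirely
    if current_rps >= LOAD_THRESHOLD_RPS:
        return False  # unexpected peak: hold jobs until traffic drops
    return True
```

A backup script would call `may_run_maintenance(datetime.now(), current_rps)` first and exit early when it returns `False`, which covers both the planned and the unexpected case with one check.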
Finally, DSB keeps static pages ready for emergencies, which we'll take a closer look at below.
The DSB emergency website
DSB keeps up-to-date static pages ready, so that they can quickly be put in place as a replacement for the normal dynamic DSB frontpage. Thomas explained that DSB at regular intervals tries to identify what might cause peaks and consequently what content needs to be on the static emergency page.
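One simple way to make such a swap possible "at the click of a button" is a flag file that the web tier checks before deciding which frontpage to serve. The sketch below assumes that approach; the article does not describe how DSB or EPiServer implement the switch, and the file and function names are hypothetical.

```python
from pathlib import Path


def choose_frontpage(flag_file: Path, normal_page: Path, emergency_page: Path) -> Path:
    """Return the emergency page while the flag file exists, else the normal page."""
    # Fall back to the normal page if the emergency page is missing,
    # so a misconfigured switch never serves nothing at all.
    if flag_file.exists() and emergency_page.exists():
        return emergency_page
    return normal_page


def set_emergency(flag_file: Path, active: bool) -> None:
    """Create or remove the flag file; this plays the role of the 'button'."""
    if active:
        flag_file.touch()
    else:
        flag_file.unlink(missing_ok=True)
```

Because the decision depends only on a file's existence, the switch keeps working even when the database or CMS behind the dynamic frontpage is the part that is overloaded.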
This is what the DSB website looks like on a normal day with an approximate 50/50 split between marketing campaigns and booking a ticket:
Here's how the website looked on 23 December around 13:00: 0% marketing, a brief explanation that this is an emergency page, and relevant links where visitors may be able to find what they were looking for, e.g. "is my train still running?" or "book a ticket".
DSB realized from the start that the content required on the emergency website will vary from case to case, which is why the communication department has the flexibility to easily get relevant text on the site. For the IT department's service windows, the same text can usually be reused.
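To illustrate how the text can vary from case to case while the page itself stays static, here is a minimal sketch that renders an emergency page from a fixed template plus a message and a list of links. The template, the function name and the example links are assumptions made for illustration, not DSB's actual page.

```python
from html import escape
from string import Template

# Hypothetical page skeleton: no campaigns, just a message and a few links.
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<p>$message</p>
<ul>
$links
</ul>
</body>
</html>
""")


def render_emergency_page(title: str, message: str, links: list[tuple[str, str]]) -> str:
    """Build the static HTML; all editor-supplied text is escaped."""
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for label, url in links
    )
    return PAGE.substitute(title=escape(title), message=escape(message), links=items)
```

The output can be written to a plain `.html` file ahead of time, so serving it during a peak needs no database or CMS call; only the message and links change per incident.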
Performance, failover and keeping websites up and humming are regular themes in our groups, both the business and technical groups.