Fastly outage highlights the need for unified IT practices

Friday, 11 June, 2021

Fastly's global outage, caused by an undiscovered software bug, was triggered by one customer's configuration change. The bug led 85% of Fastly's network to return errors, causing global disruption.

On 12 May 2021, Fastly began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances. On 8 June 2021, a customer pushed a valid configuration change that met those circumstances and triggered the bug. The ensuing outage was broad and severe, and Fastly has apologised for its impact on customers.

The disruption was detected within one minute. Fastly then identified and isolated the cause and disabled the configuration. Within 49 minutes, 95% of Fastly's network was operating as normal. Once the immediate effects were mitigated, Fastly focused on fixing the bug and communicating with its customers. Fastly then created a permanent fix for the bug, which will be deployed across its network as quickly and safely as possible.

Fastly will also conduct a complete post-mortem of the processes and practices that were followed during this incident, to determine why the bug was not detected during Fastly’s software quality assurance and testing processes.

Andrew Goodall, Federal Director (Australia) of Elastic, said the large-scale internet outage shows that there is a real need for organisations in Australia and globally to have a unified view across their IT estates. Goodall added that addressing major outages demands attention from both IT operations and security teams.

"By embedding technologies that facilitate a broad overview of the organisation, IT teams can detect, respond to and resolve business-impacting incidents early and proactively. Organisations operating applications on multi-cloud and hybrid cloud architectures benefit tremendously from a unified view spanning across those estates, consolidating telemetry from applications, infrastructure and security systems. Doing so is critical to proactive detection, faster response and ultimately, the resolution of incidents," Goodall said.

Goodall noted that many organisations still operate their IT technologies in silos, leaving critical blind spots. Goodall believes an observability stack will help by continuously analysing the digital exhaust from IT systems and flagging configuration errors and failures to IT teams early.

"Enriching those views with sources like IT change and release management, social media and business transaction data helps organisations understand service delivery trends in the context of business and IT changes," Goodall said.
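As a rough illustration of what placing service data in the context of IT changes can look like, the sketch below lines up an error-rate feed against a change log and lists the changes made shortly before the first spike. It is a minimal example only: the function name, the 5% threshold, the 10-minute window, the change names and all timestamps are assumptions invented here, not details from Fastly or Elastic.

```python
from datetime import datetime, timedelta


def changes_before_spike(changes, error_rates, threshold=0.05,
                         window=timedelta(minutes=10)):
    """Find the first error-rate sample at or above `threshold` and return
    (spike_time, names of changes made within `window` before it)."""
    for ts, rate in error_rates:
        if rate >= threshold:
            suspects = [name for when, name in changes
                        if timedelta(0) <= ts - when <= window]
            return ts, suspects
    return None, []


# Entirely made-up data: one old release and one recent configuration push.
t0 = datetime(2024, 1, 1, 9, 0)
changes = [
    (t0 - timedelta(hours=3), "release-a"),
    (t0 + timedelta(minutes=7), "customer-config-b"),
]
error_rates = [
    (t0 + timedelta(minutes=0), 0.001),
    (t0 + timedelta(minutes=5), 0.002),
    (t0 + timedelta(minutes=8), 0.85),   # first sample over the threshold
    (t0 + timedelta(minutes=10), 0.80),
]

spike, suspects = changes_before_spike(changes, error_rates)
print(spike, suspects)
```

In this made-up data only the configuration push falls inside the window before the spike, so it is the one change surfaced for the on-call team to examine first.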

Dr Klaus Ackermann from Monash Business School explained that in Australia, Fastly has internet connection points and infrastructure in Perth, Melbourne, Sydney and Brisbane. Access to a website is therefore faster, as a data packet only needs to travel within Australia, not across the undersea cables. In countries with no physical server, a packet needs to travel to the 'closest' node. Dr Ackermann said that businesses must be willing to invest in setting up their own distributed servers to reduce the 'ping' time between a customer and a server and prevent similar issues from occurring in future.
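The effect of distance on 'ping' can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming signals travel through optical fibre at roughly 200,000 km per second (about two-thirds of the speed of light in a vacuum). The two distances are round illustrative figures chosen here, not measured routes, and real round trips are longer because of routing, queuing and server processing.

```python
# Assumed signal speed in optical fibre: ~200,000 km/s, i.e. 200 km per millisecond.
FIBRE_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time: the signal covers the distance twice."""
    return 2 * distance_km / FIBRE_KM_PER_MS


# Illustrative round figures only (not measured cable routes).
domestic_km = 700        # on the order of a hop between two east-coast cities
overseas_km = 12_000     # on the order of a trans-Pacific path

print(min_round_trip_ms(domestic_km))   # 7.0
print(min_round_trip_ms(overseas_km))   # 120.0
```

Under these assumptions, a request served inside the country has a floor of a few milliseconds, while one that crosses an ocean cannot complete in much less than a tenth of a second, however fast the server is.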

"The investment is well worth it to optimise the customer experience and engagement. If you have a better 'ping', the higher the chance of people staying on your website and making a purchase for example," Dr Ackermann said.

Associate Professor Carsten Rudolph from the Faculty of Information Technology at Monash University added that a large content delivery network like Fastly operates servers around the world that come into play whenever content is accessed on the internet.

"During last night's outage, which impacted websites like The Age, Sydney Morning Herald, New York Times, Amazon and Gov.uk, Fastly claimed that the 'network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime'. While automatic failover is not easy, if there is a major issue, the remaining nodes might receive a very high load and either become very slow or completely fail," Associate Professor Rudolph said.
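The overload risk Associate Professor Rudolph describes can be shown with a toy model. The sketch below assumes that traffic from failed nodes is split evenly across the surviving ones and that every node has the same capacity. This is a deliberate simplification and not a description of how Fastly routes traffic; the node names, loads and capacity are hypothetical.

```python
def redistribute(load_by_node, capacity, failed):
    """Split the load of failed nodes evenly across survivors.

    Returns (new load per surviving node, sorted list of overloaded nodes).
    """
    survivors = {n: l for n, l in load_by_node.items() if n not in failed}
    if not survivors:
        raise RuntimeError("no healthy nodes left to take the traffic")
    orphaned = sum(l for n, l in load_by_node.items() if n in failed)
    share = orphaned / len(survivors)
    new_load = {n: l + share for n, l in survivors.items()}
    overloaded = sorted(n for n, l in new_load.items() if l > capacity)
    return new_load, overloaded


# Four hypothetical nodes, each running at 60% of a capacity of 100 units.
loads = {"pop-a": 60, "pop-b": 60, "pop-c": 60, "pop-d": 60}

# One node fails: each survivor carries 60 + 20 = 80, still within capacity.
_, over_one = redistribute(loads, capacity=100, failed={"pop-a"})

# Three nodes fail: the last survivor carries 60 + 180 = 240, far over capacity.
_, over_three = redistribute(loads, capacity=100,
                             failed={"pop-a", "pop-b", "pop-c"})

print(over_one, over_three)
```

In this model failover absorbs a single failure comfortably, but when most of the network fails at once, as in the scenario described above, the remaining node is pushed to more than double its capacity.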

The day after the outage, it was discovered that it was due to a misconfiguration of Fastly's 'points of presence' (POPs), servers that are distributed all over the world. Associate Professor Rudolph stated that moving from centralised solutions to distributed architectures that use a worldwide network of POPs can improve the speed and reliability of delivery. However, Fastly's outage indicates that small errors can disrupt distributed solutions just as they can centralised services.

"As far as we know this incident wasn't caused by a malicious attack; however, it's important for people to be aware that servers like Fastly are still susceptible to technical faults and misconfigurations. These types of reliability issues can potentially result in financial losses and point to the need for a proper risk analysis. Businesses need to understand exactly what services and infrastructures they rely on," Associate Professor Rudolph said.
