Navigating tech catastrophes: five key lessons from the CrowdStrike outage

DXC Technology

By Chris Drumgoole, General Manager - Cloud and Infrastructure, DXC Technology
Tuesday, 10 September, 2024


In today’s hyperconnected world, where businesses rely heavily on digital platforms, any interruption in service can be catastrophic. In July 2024, a defective CrowdStrike software update crashed machines running Microsoft’s Windows operating system at many of the world’s largest enterprises, causing a major shutdown across sectors including airlines, banks, government agencies and retailers. Some IT experts have described the incident as one of the most extensive outages in history.

As organisations continue to recover from this incident, it is essential to reflect on the key lessons learned to mitigate the effects of future disruptions.

1. Contingency planning is essential

The outage has prompted industry-wide discussions about vulnerabilities, data protection, supply chain impacts and other critical concerns. When facing such a crisis, prioritising tasks effectively is vital. Focusing on the most critical aspects first, such as restoring essential systems, can make a significant difference in minimising downtime.
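One way to make this prioritisation concrete is to agree a restoration order before a crisis begins. The sketch below is illustrative only: the `System` record, the 1-to-3 criticality scale and the system names are hypothetical, not taken from any real recovery plan. It orders systems so that the most critical are restored first, while anything they depend on is brought back ahead of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    criticality: int        # hypothetical scale: 1 = most critical
    depends_on: tuple = ()  # names of systems that must be restored first


def restoration_order(systems):
    """Return system names, most critical first, with dependencies placed before dependants.

    Assumes every dependency name refers to a system in the list.
    Dependency cycles are not detected; a real plan would need to check for them.
    """
    by_name = {s.name: s for s in systems}
    ordered, seen = [], set()

    def visit(system):
        if system.name in seen:
            return
        seen.add(system.name)
        # Restore prerequisites first, most critical prerequisite first.
        for dep in sorted(system.depends_on, key=lambda n: by_name[n].criticality):
            visit(by_name[dep])
        ordered.append(system.name)

    # Walk systems from most to least critical; ties are broken by name.
    for system in sorted(systems, key=lambda s: (s.criticality, s.name)):
        visit(system)
    return ordered


if __name__ == "__main__":
    plan = restoration_order([
        System("intranet-wiki", 3),
        System("payments-api", 1, ("identity-service",)),
        System("identity-service", 2),
        System("check-in-kiosks", 1, ("identity-service",)),
    ])
    print(plan)
```

In this example the identity service is restored first even though it is rated less critical than the payment and check-in systems, because both of them depend on it.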

This incident also highlights the need for thorough testing, risk assessments and clear communication protocols to prevent widespread disruptions. Including the entire supply chain in contingency planning is crucial, as third-party risks can significantly impact business operations during outages or cyber threats.

2. Around-the-clock commitment

IT outages do not adhere to traditional business hours, making it imperative for businesses to maintain a 24/7 response capability. Continuous network monitoring, swift incident response and effective resource management are essential for prompt restoration of services.

The ability to respond quickly, regardless of the time of day, can make all the difference in minimising the impact on customers.
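As a rough illustration of how continuous monitoring can feed an out-of-hours response, the sketch below counts consecutive failed health checks per service and escalates to an on-call engineer once a threshold is reached. The `HealthMonitor` class, the threshold of three and the returned action labels are assumptions made for this example; they do not describe any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class HealthMonitor:
    # Assumed policy: page the on-call engineer after this many consecutive failures.
    failure_threshold: int = 3
    _failures: dict = field(default_factory=dict)

    def record(self, service: str, healthy: bool) -> str:
        """Record one health-check result and return the action to take."""
        if healthy:
            self._failures[service] = 0  # any success resets the streak
            return "ok"
        count = self._failures.get(service, 0) + 1
        self._failures[service] = count
        if count >= self.failure_threshold:
            return "page-on-call"  # keeps paging until the service recovers
        return "warn"


if __name__ == "__main__":
    monitor = HealthMonitor()
    # Simulated check results for a hypothetical service.
    for result in (True, False, False, False, True):
        print(monitor.record("booking-portal", result))
```

Requiring several consecutive failures before paging is a deliberate trade-off: it delays escalation slightly, but it reduces the chance of waking staff over a single transient error.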

3. The human element is vital

While technical solutions are critical, the human touch remains essential to problem-solving. This outage underscores the challenge of applying best practices for cloud-based infrastructure while keeping people informed and equipped to carry out technology testing.

Technicians often needed to engage directly with end users, guiding them through the complex restoration process. At DXC Technology, for example, some technicians had to support non-technical users over the phone. Such moments call for patience and empathy, however high the stakes.

4. Vendor relationships are crucial

Strong relationships with vendors can be pivotal during an IT crisis. Regular contact with vendors, understanding their update processes and maintaining direct communication lines are all essential for effective incident response.

The ability to quickly collaborate with vendors can help address issues more efficiently and reduce the overall impact of the outage.

5. Effective communication is key

Clear and timely communication during a crisis is paramount. Promptly updating customers about the situation, managing expectations and providing transparent information can significantly reduce confusion and anxiety.

Establishing reliable communication channels ensures clarity and helps maintain trust with customers. Additionally, gathering feedback from customers about their experience during the incident can help refine response strategies for future preparedness.

The CrowdStrike outage serves as a powerful reminder that no system is immune to failure. However, by focusing on resilient infrastructure, transparent communication, incident response preparedness, continuous monitoring and learning from each incident, organisations can navigate technological catastrophes more effectively.


  • All content Copyright © 2024 Westwick-Farrow Pty Ltd