Preparing for 'black swan' IT events

By Leon Adato, Head Geek, SolarWinds
Monday, 14 September, 2015

About a year ago, well-known US statistician Nate Silver famously got it wrong. Really, really wrong. While known for his ability to adeptly predict everything from elections to baseball finals, Silver was completely thrown by Germany’s win over Brazil in the World Cup. As he described it, the result was completely unforeseen and unforeseeable — a ‘black swan’ event.

The tendency in the face of these things is to do as Nate did, and focus on what went wrong with the prediction rather than what caused the event.

In business, when the unforeseen occurs, management often acquires a dark obsession with post-analysis. Meetings are called under the guise of ‘lessons learned’ exercises, with the express intent of ensuring ‘this’ never happens again. Time is spent not on figuring out what went wrong but on why the supposedly informed prediction failed.

To be clear, I’m not saying that after a failure, businesses should just blithely ignore any lessons that can be learned. Far from it. But what Nate’s observation and other black swan events teach us is that one of the first things an organisation should do is determine whether the failure was predictable in the first place. If it wasn’t, your efforts and post-analysis are much better spent elsewhere.

There’s little doubt that in the face of black swan events there is a natural urge to protect ourselves, to ensure this kind of impact on our business can never again occur.

But I’m here to tell you that that urge is a waste of time and valuable resources. Don’t believe me? Let’s take a not-so-imaginary case of a company that has a single, spectacular failure that costs it $100,000. Management immediately sets up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. It takes a five-person team more than 100 hours to investigate the trigger. Let’s be conservative and say that the cost is $50 per hour times five people times 100 hours: a total of $25,000. And let’s be completely optimistic and say that at the end of the effort, the problem is not only identified but code is in place to predict the next one. The company has spent $25,000 to devise a solution that may (or may not) predict the occurrence of a black swan exactly like the one that hit before.
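For anyone who wants to plug in their own numbers, the arithmetic above can be written out as a few lines of Python. This is only a sketch of the calculation in the example; the variable and function names are mine, and the figures are the ones quoted in the text.

```python
# Figures taken from the example in the text.
HOURLY_RATE = 50        # dollars per person per hour
TEAM_SIZE = 5           # people on the task force
HOURS_EACH = 100        # hours each person spends investigating
FAILURE_COST = 100_000  # dollars lost in the original failure


def investigation_cost(rate, people, hours):
    """Labour cost of the post-mortem: rate x people x hours."""
    return rate * people * hours


cost = investigation_cost(HOURLY_RATE, TEAM_SIZE, HOURS_EACH)
share = cost / FAILURE_COST

print(f"Investigation cost: ${cost:,}")
print(f"As a share of the original loss: {share:.0%}")
```

With the figures from the text, the investigation alone comes to a quarter of the original loss, before anyone knows whether it will catch the next event.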

Compare that to a fairly common problem: disk failures. Drives fill up, or throw errors until they are unreadable, or just completely stop. But at this not-quite-fictitious company, there was no alerting for this. Disk space was monitored, but not alerted on. Alerting on disks that stopped responding or disappeared was simply not done.
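To give a sense of how little is involved, here is a minimal sketch of the kind of check that was missing, written in Python using only the standard library. It is not how any particular monitoring product does it. The function name, the 10 per cent threshold and the list of mount points are assumptions chosen for illustration, and a real deployment would send the message to a pager or ticketing system rather than print it.

```python
import shutil


def check_disk(path, min_free_fraction=0.10, usage_fn=shutil.disk_usage):
    """Return an alert message for `path`, or None if it looks healthy.

    `min_free_fraction` is an illustrative threshold, not a recommendation.
    `usage_fn` is injectable so the check can be exercised without a real disk.
    """
    try:
        usage = usage_fn(path)
    except OSError as exc:
        # Covers a path that has disappeared or returns I/O errors.
        # A mount that hangs outright may block here instead of raising;
        # handling that would need a timeout, which this sketch omits.
        return f"ALERT: {path} is unreadable or missing ({exc})"

    if usage.total == 0:
        return f"ALERT: {path} reports zero capacity"

    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        return f"ALERT: {path} has only {free_fraction:.0%} free space"
    return None


if __name__ == "__main__":
    # Hypothetical list of mount points to watch.
    for mount in ["/"]:
        message = check_disk(mount)
        if message:
            print(message)
```

Run on a schedule, a check like this turns disk space that is merely monitored into disk space that is alerted on.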

A fairly simple set of alerts could save a moderately sized company as much as $140,000 per year. And disk failures are no black swan. Even Nate Silver would agree they are a sure thing.

Leon Adato is a Head Geek and technical evangelist at SolarWinds, and is a Cisco Certified Network Associate (CCNA), MCSE and SolarWinds Certified Professional. His career includes key roles at Rockwell Automation, Nestlé, PNC and CardinalHealth, providing server standardisation, support and network management and monitoring.


  • All content Copyright © 2024 Westwick-Farrow Pty Ltd