The “Glitch” - Is It Really the Culprit Behind IT Outages?
- 17 April, 2019 12:10
When things go wrong, responsible people step up to the plate and take – responsibility. Unfortunately, there aren't that many responsible people in this world, which is why, when things go wrong, most people look for someone or something to blame. In other words - a scapegoat.
Tech has not escaped this phenomenon. When banking apps and websites go down, or when airlines cancel flights, the go-to excuse is often a "technical glitch." Those glitches can be caused by all sorts of issues, from an incompatible software update to a malware attack to a simple fat-finger error.
Of course, all of these have been responsible at one time or another for IT and service outages, some of them major. Companies issue statements blaming these incidents on glitches, assuaging the concerns of investors and customers. And in many organizations, outages are far from a one-time issue; sites that report service outages show that some of the largest organizations are subject to outages time and again.
But by naming the culprit as a glitch, organizations are being a bit disingenuous; it displaces responsibility for the problem onto an ephemeral force that supposedly has nothing to do with them – and outages have everything to do with them. We would – and should – expect organizations to be prepared for these catastrophes. Did they not stress-test their systems? What about disaster recovery programs? Why aren't those systems and processes more effective at preventing these problems?
Most likely the organizations were properly prepared – or so they believed. For decades now, every mission-critical component in enterprise IT has been configured with full redundancy: from redundant power supplies and network connections at the server level, through redundant core networking and storage infrastructure, all the way to clustering, load balancing, elastic computing, and so on. And even if all of those layers of defense fail (for example, when an entire datacenter is lost, or a cloud provider serving an entire region goes down), almost all enterprises also implement disaster recovery solutions that enable critical applications to restart quickly.
If it's not the hardware, perhaps it's the software. But here, too, companies don't act foolishly. While certainly more challenging than hardware issues, mitigating software issues also entails best practices that IT teams rigorously follow. Before performing software updates, teams run thorough quality assurance testing, and anticipating that some issues will still escape testing, organizations maintain elaborate rollback plans that allow them to revert to a previous, tried-and-true environment.
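The update-then-rollback discipline described above can be sketched in a few lines. This is a minimal illustration, not any vendor's deployment tool; all the function names here (`apply_version`, `health_check`) are hypothetical stand-ins.

```python
# Sketch of a deploy-with-rollback routine. If the post-deploy health
# check fails, we revert to the previously known-good version.
# All names here are illustrative, not a specific product's API.

def deploy_with_rollback(current_version, new_version, apply_version, health_check):
    """Apply new_version; if the health check fails, roll back.

    Returns the version left running afterwards.
    """
    apply_version(new_version)
    if health_check():
        return new_version          # update succeeded; keep it
    apply_version(current_version)  # revert to the tried-and-true environment
    return current_version
```

The point of the sketch is simply that the rollback path is planned in advance – the "previous, tried-and-true environment" is captured before the update is ever applied.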
And yet. If organizations are so well prepared for outage disasters and cyber attacks, how is it that things still go wrong? The answer is – must be – that what we thought was "right" really wasn't. What appears to be a working system, especially on the software side, may itself contain the seeds of disaster. Outages are, in fact, very similar to violent natural or biological phenomena, such as erupting volcanoes or sudden heart attacks: they are almost always the culmination of multiple faults that built up over time, lying in wait for the right set of circumstances to materialize.
For example, the deployment of an update might go well in an IT staging environment, with no indication of an inherent problem. But when the same update lands in a complex production configuration and interacts with other configuration changes, the result is an outage – meaning the problem was there all along, but remained dormant until conditions were just right (or wrong). Or take the hardware side. An airline might blame a blown power switch for a service outage lasting days. Those components were certainly configured for high availability (clustered, load-balanced, and so on); if they failed anyway, there was an inherent problem that was not identified in advance.
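The staging-versus-production scenario above boils down to silent divergence between the two environments' configurations. A hypothetical sketch of how such drift could be surfaced (the settings shown are made up for illustration):

```python
# Illustrative only: an update validated against the staging config can
# still misbehave in production if the two configurations have diverged.

def config_drift(staging, production):
    """Return keys whose values differ between the two environments."""
    keys = set(staging) | set(production)
    return {k: (staging.get(k), production.get(k))
            for k in keys
            if staging.get(k) != production.get(k)}

staging = {"db_pool_size": 20, "tls_version": "1.3", "feature_x": True}
production = {"db_pool_size": 200, "tls_version": "1.2", "feature_x": True}

# Any non-empty result is a latent mismatch: a fault lying dormant
# until conditions are just right (or wrong).
drift = config_drift(staging, production)
```

Here the drift report would flag `db_pool_size` and `tls_version` – exactly the kind of dormant mismatch that testing in staging alone can never catch.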
And so on. The point is that in today’s hybrid IT environment, change is the new normal, and it’s impossible for a human being to test for issues following each and every change.
Instead, what's needed is a system that tracks those changes automatically, validating each one to ensure service continuity. Such a system would examine IT configurations, settings, and dependencies, and verify that they are set up according to vendor resilience best practices. If the system determines that a problem is developing, it could alert IT teams, giving them an opportunity to intervene and prevent the outage. Glitches will always be with us – but by taking control of our systems and proactively identifying issues, we will no longer have to blame glitches for our service outages, because there will be far fewer of them.
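The kind of automated validation described above can be sketched as a simple rule engine: each rule encodes one resilience best practice, and any violation becomes an alert before a dormant fault can become an outage. The rules and settings below are invented for illustration, not a real product's checks.

```python
# Minimal sketch of automated configuration validation.
# Each rule is (description, check); the checks and thresholds
# here are hypothetical examples of vendor best practices.

RULES = [
    ("redundant power supplies present",
     lambda cfg: cfg.get("power_supplies", 0) >= 2),
    ("cluster has at least one standby node",
     lambda cfg: cfg.get("standby_nodes", 0) >= 1),
    ("disaster recovery tested in the last 30 days",
     lambda cfg: cfg.get("days_since_dr_test", 999) <= 30),
]

def validate(cfg):
    """Return descriptions of the best-practice rules this config violates."""
    return [name for name, check in RULES if not check(cfg)]

server = {"power_supplies": 2, "standby_nodes": 0, "days_since_dr_test": 45}
alerts = validate(server)  # non-empty: IT can intervene before an outage
```

Running such checks after every change – rather than once, at deployment time – is what turns "we were prepared, or so we believed" into continuous, verified preparedness.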