What's not to like about the cloud? From IT staff to C-suite executives, the cloud can seem heaven-sent. No more major expenditures on hardware, no more long software release cycles, no more hassle of managing multiple physical data centers. All that stuff is now someone else's headache. And they guarantee things will work!
The fact is that the cloud does work – but let's not get overenthusiastic. The cloud has its problems, as well-known cloud adopters regularly experience. Take, for example, Netflix, which suffered a series of issues – “glitches,” as the company explained to customers – affecting subscribers in the UK, the US and Australia. Considering that downtime costs the average S&P 500 company at least $100,000 an hour, it behooves companies to think carefully about whether to move to the cloud, which cloud service to choose, and how best to design and integrate their new cloud platform.
We've been down this road before. In its hype-cycle model, Gartner has aptly captured the universal human tendency to develop a “peak of inflated expectations” for every new disruptive technology, only to slide later into the “trough of disillusionment.” Just over the past decade we've seen this with virtualization, workload-orchestration solutions, and active-active storage and compute clustering. All were amazing technological advancements – but none materially reduced risk.
The benefits of the cloud – offloading responsibility for operations (thus freeing up resources for organizational projects), agility and elasticity, “guaranteed” uptime, and significant savings on hardware deployment – are well known. Cloud providers offer a rich and sophisticated set of building blocks for forming a resilient infrastructure, but the responsibility to use them wisely still lies with the end user. The realistic expectation, at best, is a “shared-responsibility” model for resilience, security, and uptime.
When an outage occurs, you probably won't get a definitive answer from your service provider on what exactly happened. Interestingly, a University of Chicago study on cloud outages found that the most commonly reported root cause is “unknown.” That in itself should raise alarms for anyone shopping around for a cloud service.
Is there any way to completely prevent cloud outages? Probably not, but there are definitely things you can do to build more quality and resiliency into your cloud infrastructure. The immediate questions would be:
What level of resiliency do you really need? Is it sufficient to use multiple availability zones? Is multi-region protection required? Multi-cloud? It’s all about balancing risk and cost. Surprisingly, many organizations never think all the way through these fundamental questions. Availability zones can and will fail – that is “by design” and well documented by the cloud vendors. And while regions should not fail, press coverage makes clear that they sometimes do. Does your design truly support the resilience level you need?
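To make the risk/cost trade-off concrete, a back-of-the-envelope availability calculation can help. The sketch below uses illustrative figures – not any vendor's actual SLA – and assumes failure domains fail independently, which real-world correlated outages routinely violate, so treat the results as an optimistic upper bound:

```python
# Back-of-the-envelope composite availability across independent
# failure domains (availability zones, regions). Illustrative only:
# real failures are often correlated, so these are upper bounds.

def composite_availability(single_domain_availability: float, domains: int) -> float:
    """Probability that at least one of `domains` independent copies is up."""
    p_all_down = (1 - single_domain_availability) ** domains
    return 1 - p_all_down

def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1 - availability) * 365 * 24

# Hypothetical figures: a single AZ at 99.9%, a region at 99.95%.
for label, avail, n in [
    ("single AZ  ", 0.999, 1),
    ("two AZs    ", 0.999, 2),
    ("two regions", 0.9995, 2),
]:
    a = composite_availability(avail, n)
    print(f"{label}: {a:.6f} availability, "
          f"~{downtime_hours_per_year(a):.2f} hours of downtime/year")
```

Even this crude model shows why the question matters: a single 99.9% zone implies roughly 8.8 hours of downtime a year, while two independent copies push the theoretical figure down by orders of magnitude – and why paying for a second region only makes sense if the residual risk actually justifies the cost.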
Given that, how often do you test the ability of your infrastructure to withstand fault scenarios? Modern deployment paradigms – CI/CD, deployment-automation tools, and elastic compute – produce dozens to hundreds of configuration changes per day in the average environment, and cloud providers themselves introduce hundreds of new and improved features every year. Do you understand the impact of each change on your resilience? Do you test frequently enough? Bear in mind that testing gets progressively more challenging as your resilience requirements grow: validating that your application will survive an entire region failover could be, at best, extremely disruptive to your business – and at worst could cause actual production downtime.
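One practical way to exercise fault scenarios continuously is to inject failures in a test harness rather than waiting for real outages. The sketch below is a minimal, self-contained illustration of the idea – the endpoint classes and failover routine are hypothetical stand-ins for your real infrastructure, not any vendor's API:

```python
# Minimal fault-injection sketch: verify that a client fails over to a
# secondary endpoint when the primary is down. All names here are
# hypothetical stand-ins for real infrastructure components.
import random

class EndpointDown(Exception):
    pass

class FlakyEndpoint:
    """Simulates an endpoint that fails with a given probability."""
    def __init__(self, name: str, failure_rate: float, seed: int = 0):
        self.name = name
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded for repeatable tests

    def call(self) -> str:
        if self._rng.random() < self.failure_rate:
            raise EndpointDown(self.name)
        return f"ok from {self.name}"

def call_with_failover(endpoints) -> str:
    """Try each endpoint in order; surface an error only if all fail."""
    last_error = None
    for ep in endpoints:
        try:
            return ep.call()
        except EndpointDown as exc:
            last_error = exc
    raise RuntimeError("all endpoints down") from last_error

# Fault scenario: primary AZ hard-down, secondary healthy.
primary = FlakyEndpoint("az-1", failure_rate=1.0)
secondary = FlakyEndpoint("az-2", failure_rate=0.0)
print(call_with_failover([primary, secondary]))  # ok from az-2
```

Because a test like this runs in seconds, it can be wired into the same CI/CD pipeline that produces those daily configuration changes, so every change is checked against at least the basic fault scenarios.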
Finally, have you paid sufficient attention to ensuring recoverability from cyber attacks? Unlike “traditional” causes of outages (human error, equipment failures, etc.), cyber events often deliberately target your ability to recover: a sophisticated attacker can lock or corrupt your data, or even infect your recovery instances and images. Planning – and, even more importantly, validating – that you can recover requires specific attention to protecting your instance images, orchestration, and data. This could involve creating an “air gap” between production and the vaulted recovery infrastructure, as well as keeping tertiary archives outside your cloud infrastructure altogether. Here again, actually validating and proving that your recovery mechanism and process work is even more challenging.
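One simple, automatable piece of that validation is integrity-checking the vaulted copies themselves: recompute a cryptographic digest of each backup and compare it against the digest recorded at backup time, so silent corruption or tampering is caught before you actually need to restore. The sketch below uses an in-memory dictionary as a hypothetical stand-in for air-gapped storage:

```python
# Recoverability check sketch: verify vaulted backups against digests
# recorded at backup time. The in-memory "vault" is a hypothetical
# stand-in for real air-gapped storage.
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest of a backup object's contents."""
    return hashlib.sha256(data).hexdigest()

def verify_vault(manifest, fetch):
    """Return names of vaulted objects whose current digest mismatches
    the digest recorded in the manifest at backup time."""
    corrupted = []
    for name, expected in manifest.items():
        if digest(fetch(name)) != expected:
            corrupted.append(name)
    return corrupted

# One object intact, one tampered with after backup.
vault = {"db-image": b"snapshot-bytes", "app-image": b"tampered"}
manifest = {"db-image": digest(b"snapshot-bytes"),
            "app-image": digest(b"app-bytes")}
print(verify_vault(manifest, vault.__getitem__))  # ['app-image']
```

A digest check only proves the bytes are intact, of course – it is a complement to, not a substitute for, periodically performing a full restore drill.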
From closely following some of the most mature and sophisticated IT cloud shops in the world, we find that their most striking shared traits are:
Clear and realistic resilience strategy
Clear KPIs for everything related to resilience and data protection
Automating the quality check process – not just the deployment
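Automating the quality check can be as straightforward as a resilience “gate” in the pipeline: compare the KPIs measured in the latest recovery drill against agreed targets, and fail the build when any target is breached. The KPI names and threshold values below are illustrative assumptions, not a standard:

```python
# Sketch of an automated resilience quality gate. KPI names and target
# values are illustrative assumptions, not a standard.

TARGETS = {
    "rto_minutes": 30,              # max acceptable time to recover
    "rpo_minutes": 5,               # max acceptable data-loss window
    "failover_success_rate": 0.99,  # min acceptable success rate
}

def evaluate(measured):
    """Return a list of human-readable KPI violations (empty = pass)."""
    violations = []
    if measured["rto_minutes"] > TARGETS["rto_minutes"]:
        violations.append(f"RTO {measured['rto_minutes']}m exceeds {TARGETS['rto_minutes']}m")
    if measured["rpo_minutes"] > TARGETS["rpo_minutes"]:
        violations.append(f"RPO {measured['rpo_minutes']}m exceeds {TARGETS['rpo_minutes']}m")
    if measured["failover_success_rate"] < TARGETS["failover_success_rate"]:
        violations.append("failover success rate below target")
    return violations

# Results from a hypothetical recovery drill.
drill = {"rto_minutes": 42, "rpo_minutes": 3, "failover_success_rate": 0.995}
for v in evaluate(drill):
    print("GATE FAILED:", v)
```

Running the gate on every deployment keeps the KPIs honest: a regression in recovery time blocks the change that caused it, instead of surfacing months later during an actual outage.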
Taking a holistic approach to resilience and quality will ensure even better agility (far fewer redeployments, retests, etc.), significantly less downtime, more effective DevOps cycles, and higher predictability.