Who gets blame for Amazon outage?

Reliability of cloud services makes customers complacent; many don't plan for worst-case scenarios

Amazon.com has promised to provide a "detailed post-mortem" on the root causes of the prolonged outage of its cloud services in recent days. Users of the Amazon services, meanwhile, may also have to explain how they got caught up in the outage.

The ensuing conversations may be uncomfortable for both Amazon and its cloud customers -- perhaps even more so for users of the services.

Cloud services overall have been remarkably reliable, which may be fostering a dangerous complacency among customers who are putting too much trust in them. This is another old and familiar story of technology hubris, one that was famously illustrated by another tech marvel, the unsinkable Titanic.

In this case, it is IT managers who will have to explain to their users -- and to their company's executives -- why they didn't have a lifeboat.

Amazon's partial outage, which began Thursday and seemed largely resolved today, was an exceptional event.

Data compiled by AppNeta on the uptime reliability of 40 of the largest providers of cloud-based services, including Amazon, Google, Azure and Salesforce.com, shows how well cloud providers are delivering uninterrupted services. The performance management and network monitoring firm, known as Apparent Networks until this week, captures minute-by-minute uptime and other data from cloud providers used by its customers.

The overall industry yearly average of uptime for all the cloud services providers monitored by AppNeta is 99.9948 per cent, which is equal to 273 minutes or 4.6 hours of unavailability per year.

The worst providers clock in at 99.992 per cent or 420 minutes or 7 hours of unavailability a year.

The best providers are at 99.9994 per cent or 3 minutes or 0.05 hours of unavailability a year.
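For readers who want to check uptime figures like these, the conversion between an uptime percentage and minutes of downtime per year can be sketched in a few lines of Python. This is a minimal illustration that assumes a 365-day year; the function names are made up for this example and are not part of any AppNeta tool. Computed this way, 99.9994 per cent works out to about 3 minutes a year, while 273 and 420 minutes correspond to roughly 99.948 and 99.92 per cent, so some of the quoted percentages may carry an extra digit.

```python
# Assumption: a 365-day (non-leap) year, i.e. 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of unavailability per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


def uptime_percent_for(downtime_minutes: float) -> float:
    """Uptime percentage implied by a yearly downtime in minutes."""
    if not 0.0 <= downtime_minutes <= MINUTES_PER_YEAR:
        raise ValueError("downtime_minutes must fit within one year")
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # Percentages quoted in the article.
    for pct in (99.9948, 99.992, 99.9994):
        print(f"{pct} per cent -> {downtime_minutes_per_year(pct):.1f} minutes a year")
    # Downtime figures quoted in the article.
    for minutes in (273, 420, 3):
        print(f"{minutes} minutes a year -> {uptime_percent_for(minutes):.4f} per cent")
```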

The takeaway for cloud users looking at the AppNeta data is often that the risk of an outage is very low.

But that's not how the world works.

For example, Ken Brill, founder of the Uptime Institute, which researches data center issues, points to Japan's Fukushima Nuclear Power Plant. For 40 years, there were no problems at the plant. Then an earthquake and tsunami that hit in March disabled the facility with catastrophic consequences.

Brill expects a post-mortem on the nuclear plant will show at least 10 things that could have been done to help avoid that failure, reduce the magnitude of damage and make it easier or faster to recover from the disaster.

The Amazon post-mortem will likely show something similar, said Brill.

Despite the redundancies and backups built into the Amazon cloud, "you hit a combination of events for which the backups don't work," he said.

Users see the promise of cloud technology as a way to reduce costs and be greener, but "that [also] means concentrating processing in fewer, bigger places," said Brill. Thus, when something goes wrong, "it has a bigger impact."

Meanwhile, the promise of reliable cloud uptime is putting protection advocates -- the IT people who champion more internal reliability and safeguards -- at a disadvantage, he added. "There will always be an advocate for how it can be done cheaper," but "if you haven't had a failure for five years -- who is the advocate for reliability?"

"My prediction is that in the years ahead we will see more failures than we have been seeing because people have forgotten what we had to do to get to where we are," Brill added.

AppNeta runs its company on Amazon's cloud technology and was thus affected by the outage. However, its problems were short-lived because its service is architected to respond to a data center failure in Amazon's cloud.

Matt Stevens, the chief technology officer of AppNeta, said its system was able to fall back to an alternative availability zone in another data center in Amazon's cloud.

"You still need to plan for worst-case scenarios," said Stevens, who said Amazon advises its customers to plan for a potential data center interruption. "It was actually their guidance that helped us avoid this from being more painful."

Amazon has built the system with multiple levels of disaster recovery, including a design for high availability across virtual infrastructure within a zone, such as the ability to fail over between servers, as well as planning to fail over to another data center, as AppNeta did.
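The zone-level failover described above can be pictured with a short Python sketch. It is an illustration only, not AppNeta's or Amazon's actual implementation: the zone names are hypothetical, and the health check is a stub standing in for whatever monitoring a real deployment would use.

```python
from typing import Callable, Iterable, Optional


def choose_zone(zones: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first zone, in order of preference, whose health check passes.

    Returns None when every zone fails its check.
    """
    for zone in zones:
        if is_healthy(zone):
            return zone
    return None


# Hypothetical zone names; "zone-a" plays the part of the failed data center.
preferred_order = ["zone-a", "zone-b"]
unavailable = {"zone-a"}

# Stubbed health check: a zone is healthy unless it is in the unavailable set.
active_zone = choose_zone(preferred_order, lambda zone: zone not in unavailable)
print(active_zone)
```

The order of the list expresses preference, so traffic moves to the second zone only when the first fails its check. The sketch leaves out the harder part, which is keeping the data available in the second location.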

AppNeta has redundant mirroring of its data in Amazon's S3 storage service, which allowed it to pull that data into a second data center. Its problem was limited to a couple of hours Thursday morning, said Stevens.

Stevens believes that the Amazon outage will cause people to step back and ask some questions about their internal architecture, as well as ask whether to adopt a multi-cloud strategy to do more to spread the risk. "That's certainly got to be top of mind for a lot of CIOs today," he said.

Patrick Thibodeau covers SaaS and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov, or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.

Read more about cloud computing in Computerworld's Cloud Computing Topic Center.
