Lessons learned from a recent Amazon outage

Another Amazon cloud-services outage occurred on Sunday, August 7th, at a data center in Dublin, Ireland. A lightning strike hit a transformer near the data center, causing an explosion and fire that knocked out all utility services and led to a total data center outage. Amazon's only European data center is located there.

My initial thoughts concern disaster recovery and Amazon services. In Amazon's last significant outage, in April, a network configuration change led to an outage of services in the eastern United States. This outage raises other questions. Why isn't Amazon deploying a redundant power source, such as a diesel-powered backup? Maybe it did, but the fire blew out a portion of that utility service, so a more serious disaster emerged from the initial transformer explosion.


How could this be addressed? How about failing over to services in another geographic location in Europe? This didn't happen. I can only guess that building out another data center is cost-prohibitive at this time, and that is why Amazon doesn't have another European data center. The report on the outage says that it will take Amazon up to two more days to bring up the remaining servers.

It also notes that restarting all of the servers is taking a significant amount of time, and it states that Microsoft, which has services in the same data center, does not have the same weakness. I wonder why this is; data replication should be a high priority, especially when Amazon lacks full-scale data center disaster recovery.

On Monday, August 8th, Amazon mentioned that a software error is slowing the recovery of the data within the European data center. This points to another failure: a lack of business continuity testing. This testing is necessary precisely because conditions like these occur rarely, so recovery procedures are seldom exercised in practice. It also points to the fact that complex configurations make it hard to test various scenarios. Only deploying and testing a minimum number of application configurations is realistic; otherwise there are too many permutations to test. See a previous disaster recovery article that argues products should have standard configurations, similar to the way a car model is offered with standard engine configurations.
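To make the permutation problem concrete, here is a minimal sketch of the arithmetic. The configuration dimensions and their counts are hypothetical, chosen only for illustration; they do not describe Amazon's environment.

```python
from math import prod

# Hypothetical configuration dimensions for one hosted application.
# Each value is the number of supported choices; all are illustrative.
OPTION_COUNTS = {
    "operating_system": 4,
    "database_version": 3,
    "application_server": 3,
    "storage_layout": 5,
    "network_topology": 4,
}


def configuration_count(option_counts):
    """Return how many distinct configurations the choices allow."""
    return prod(option_counts.values())


if __name__ == "__main__":
    full_matrix = configuration_count(OPTION_COUNTS)
    standard_configurations = 3  # hypothetical number of standard builds
    print(f"Every combination: {full_matrix} recovery scenarios to test")
    print(f"Standard builds only: {standard_configurations} scenarios to test")
```

Even with a handful of choices per dimension, the full matrix reaches hundreds of combinations, which is why a small set of standard configurations is the only realistic test target.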

So, it looks as if Amazon has more cloud-services weaknesses that are surfacing under operational stress. How can mid-sized and small businesses that outsource their web applications to Amazon's cloud protect themselves? It's clear that Amazon supports cloud applications where profitable. I suggest that those firms create a very detailed, per-application SLA (Service Level Agreement) that lists global up-time, performance, and penalties when service isn't meeting objectives.

In my last couple of articles, I outlined questions to ask a service provider that reveal an application's current architecture. These questions can be asked about every application that a company wants a cloud provider to manage. This information, along with up-time requirements and performance statistics, can be combined to form the SLA.
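As a rough illustration of how up-time, performance, and penalties might fit together in a per-application SLA, consider the sketch below. Every name and figure in it (the `ApplicationSla` fields, the 99.9% target, the 10% credit) is a hypothetical assumption, not a term from any actual Amazon agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationSla:
    """Hypothetical per-application SLA terms; all figures are illustrative."""
    application: str
    uptime_target_pct: float      # required monthly up-time, e.g. 99.9
    max_response_ms: float        # performance objective
    credit_pct_per_breach: float  # share of the monthly fee credited per missed objective


def uptime_pct(downtime_minutes, days_in_month=30):
    """Convert minutes of downtime into a monthly up-time percentage."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def monthly_credit(sla, downtime_minutes, response_ms, monthly_fee):
    """Return the fee credit owed for the month under the assumed terms."""
    breaches = 0
    if uptime_pct(downtime_minutes) < sla.uptime_target_pct:
        breaches += 1  # up-time objective missed
    if response_ms > sla.max_response_ms:
        breaches += 1  # performance objective missed
    return monthly_fee * sla.credit_pct_per_breach / 100.0 * breaches


if __name__ == "__main__":
    sla = ApplicationSla("web-store", 99.9, 500.0, 10.0)
    two_days = 2 * 24 * 60  # a two-day outage, in minutes
    print(f"Up-time for the month: {uptime_pct(two_days):.2f}%")
    print(f"Credit owed: {monthly_credit(sla, two_days, 250.0, 1000.0):.2f}")
```

Under these assumed terms, a two-day outage drops monthly up-time to roughly 93%, far below a 99.9% target, so the penalty clause applies even though the performance objective was met.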

It is likely that Amazon and other major cloud providers will not support extensive disaster recovery plans until the SLAs penalize them into delivering that service well. Well-defined SLAs support global trade growth because they help ensure that business runs well across regions. This business handshake leads to trust between the two parties. And we all know that 'Trust is Trade.'

Stories by Gregory Machler
