Startup claims it saw early signs of Amazon's cloud outage

Almost two hours before Amazon Web Services publicly acknowledged an outage that brought down websites such as Reddit, Imgur and Heroku, application monitoring startup Boundary claims it started noticing the problem.

The company is developing an early-warning system for cloud outages, and it says AWS's most recent incident proved for the first time that its service works at scale.


Boundary is an application performance management (APM) tool that installs an agent that monitors second-by-second performance of virtual machines running in a public cloud or data center. The information from the VMs is sent into Boundary's cloud where it is analyzed. Boundary relays information to customers about the health of their system and aggregates data from its hundreds of customers to monitor trends in the cloud.

On Monday, almost two hours before AWS officially announced an issue in its Elastic Block Store (EBS), a volume storage service used in conjunction with its Elastic Compute Cloud (EC2), Boundary started noticing abnormal activity in AWS's cloud, the company says.

During the next two hours, nearly a third of the agents among Boundary's more than 300 customers stopped reporting back to Boundary's cloud at one time or another. Data transfer from AWS's cloud to Boundary's data analysis servers dropped 27% from 38Mbps to 28Mbps.

Roundtrip latencies between the agents and Boundary's cloud increased by three times their normal levels, the company says. The latency in VM reporting continued until 2 a.m. PT on Tuesday, when AWS reported that the issue had mostly been resolved. Stamos detailed what the company found in a blog post.
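The kind of check described above can be sketched in a few lines. This is only an illustration, not Boundary's method: the function names, the 20% and 2x thresholds and the 100 ms latency baseline are all assumptions made for the example, and the reported latency increase is read here as a tripling. Only the 38Mbps and 28Mbps figures come from the company's account.

```python
def throughput_drop(baseline_mbps, current_mbps):
    """Fractional drop relative to the baseline (0.25 means 25% lower)."""
    return (baseline_mbps - current_mbps) / baseline_mbps


def latency_ratio(baseline_ms, current_ms):
    """How many times higher the current latency is than the baseline."""
    return current_ms / baseline_ms


def looks_abnormal(drop, ratio, max_drop=0.20, max_ratio=2.0):
    """Flag a sample when either metric crosses its (hypothetical) threshold."""
    return drop >= max_drop or ratio >= max_ratio


# Throughput figures as reported; the latency baseline is a made-up 100 ms.
drop = throughput_drop(38.0, 28.0)
ratio = latency_ratio(100.0, 300.0)

print(f"throughput drop: {drop:.1%}, latency ratio: {ratio:.1f}x")
print("abnormal:", looks_abnormal(drop, ratio))
```

Run on the rounded figures, the drop works out to about 26%, close to the 27% the company cites. A real system would derive its baselines and thresholds from history rather than hard-coding them.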

Boundary only tracks the performance of the VMs, so she says there's no way to know what caused the issue on Monday. The decreased network traffic could mean customers were experiencing performance problems on their own instances, which were then being reflected in Boundary's tools, or that there was a problem with the VMs' ability to send tracking data to Boundary from AWS. Either way, it was enough of an abnormal spike, Stamos says, that they knew something was up. "There's no way to go inside Amazon's infrastructure, what we're trying to do is be a leading indicator, alerting customers that there is a problem developing," she says.

Boundary hasn't fully developed that functionality yet, though. In the most recent incident, Boundary didn't actually inform customers that AWS was experiencing abnormal activity; it just reported the results on its website. The company hopes to use this data in the future to create that early-warning system for users.

Having that knowledge could be critical, she argues. If customers were alerted to performance issues within an Availability Zone, they could switch workloads out of it and into another unaffected Availability Zone, into another cloud provider, or into their own data center.


None of this is easy to do, or to do right, says Jim Frey, managing research director for Enterprise Management Associates. The predictive analytics industry is still very young. Frey says that if he had seen the change in activity that Boundary noticed on Monday, it may have raised an eyebrow for him, but it's difficult to predict whether such anomalies will lead to a significant event, such as Monday's outage. "Many cases in IT there's smoke before there's fire," Frey says. "The problem is when you cry wolf."

Boundary is not alone in offering predictive analytics capabilities for the cloud either. The company does take a unique approach to the issue though. Traditionally, APMs have measured the output performance of virtual machines. A variety of players do this, from NetScout to Riverbed to Network Instruments. Big name tech companies such as IBM, HP and CA have APM tools as well.

Boundary, by contrast, installs an agent that tracks individual VMs, which Frey says provides both a more intricate and holistic view of a cloud environment. Plus, the agent is able to follow the VM wherever it goes, whether on a public cloud or in the data center.

Also, if an administrator does want to transfer workloads out of a certain Availability Zone, or across to another cloud provider, the system has to be architected to support that transition beforehand. Load balancers have to be in place, the application has to be horizontally scalable and new VM instances have to be quick to bring online. If the system has been architected that way, then at the first sign of a problem, the workloads could theoretically be transferred out of the impacted Availability Zone. Just how many cloud users have such a system set up today is unclear, Frey says.

As for the false positives, Boundary is collecting a lot of data, and the more data it collects, the better it will be at knowing which issues are real problems and which are insignificant hiccups.
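One common way to cut down on false alarms, which is not necessarily what Boundary does, is to alert only when an anomaly persists across several consecutive samples instead of firing on a single blip. A minimal sketch of that idea follows; the function name and the three-sample window are hypothetical choices for illustration.

```python
def first_sustained_alert(flags, min_consecutive=3):
    """Return the index at which an alert would fire, or None.

    `flags` is a sequence of booleans, one per sample, marking whether
    that sample looked abnormal. An alert fires only once
    `min_consecutive` abnormal samples have occurred in a row.
    """
    run = 0
    for index, flagged in enumerate(flags):
        run = run + 1 if flagged else 0  # any normal sample resets the run
        if run >= min_consecutive:
            return index
    return None


# A lone blip is ignored; a sustained run of abnormal samples triggers an alert.
print(first_sustained_alert([True, False, True, False]))
print(first_sustained_alert([True, False, True, True, True, False]))
```

The trade-off is latency: the longer the required run, the fewer false positives, but the later a genuine problem is reported.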

Boundary released its product in April, and it hopes to roll out additional features in the coming weeks. Customers currently receive a 10-minute history of their system's performance at one-second intervals. The goal is to offer a 24-hour look-back of system performance, plus one month's worth of minute-by-minute data, or a year's worth of hourly data. Alerts warning customers of potential outage events are a goal of the company as well.
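To get a feel for why longer histories are usually kept at coarser resolution, the sketch below counts the data points each retention tier would hold per metric and shows a simple averaging rollup from one-second samples to coarser buckets. It is an illustration only: the helper names are made up, a month is assumed to be 30 days and a year 365, and the 24-hour look-back is left out because its resolution is not stated.

```python
from statistics import mean

MINUTE, HOUR, DAY = 60, 3600, 86400  # durations in seconds


def points(duration_seconds, resolution_seconds):
    """Number of samples needed to cover a duration at a given resolution."""
    return duration_seconds // resolution_seconds


def downsample(samples, factor):
    """Average each consecutive group of `factor` samples.

    A trailing partial group, if any, is averaged as well.
    """
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


tiers = {
    "10 minutes at 1 s": points(10 * MINUTE, 1),
    "30 days at 1 min": points(30 * DAY, MINUTE),
    "365 days at 1 h": points(365 * DAY, HOUR),
}

for name, count in tiers.items():
    print(f"{name}: {count} points per metric")

# Rolling six one-second readings up into two three-second averages.
print(downsample([1, 2, 3, 4, 5, 6], 3))
```

Averaging is only one possible rollup; keeping the minimum, maximum or a percentile per bucket preserves spikes that a mean would smooth away.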

Meanwhile, AWS has not yet released details of exactly what caused this week's incident, but the outage represents the third significant incident in two years. About two weeks after the company's last major outage in July, it issued a detailed post-mortem report explaining the power outages, bugs and bottlenecks that caused the problem, which the company may do for this one as well.

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be found on Twitter at @BButlerNWW.
