Cloud architecture: Questions to ask for reliability

How do you know your cloud provider has the right web-services architecture? Gregory Machler offers these questions to ask.

I've been an architect on some complex applications, and I have a significant concern about assessing architectural risk for public/private cloud applications. Traditional risk assessments focus on external/internal access to confidential information such as social security numbers, credit card numbers and, for banks, ATM PINs. Access controls and network protection are high priorities because they reduce that risk.

I'm interested in something a little different -- I'll call it architectural reliability. The goal is to avoid single points of failure for critical applications so that catastrophic errors don't occur; those lead to huge financial losses and a diminished corporate brand. So, where would I start to shore up the architecture? Here are some storage and networking diagnostic questions I would ask for the top-10 applications within a corporation. Note that some questions apply to all applications, while others apply to each application individually. I'm going to focus on just the storage and networking product domains that support the top-10 applications.

[See also: Five cloud security trends experts see for 2011]

Storage Architecture -- All Applications

Is only one SAN vendor used for storage of all of the applications?

How is data de-duplication addressed?

Is only one SAN switch vendor used for all of the applications?

Is only one data replication vendor used?

Is only one encryption vendor used to encrypt data for all of the applications?

Which encryption algorithm is used for a given encryption tool?

Is only one PKI vendor used to manage certificates?

Where are the certificates related to data at rest encryption stored?

Storage Architecture -- Each Application

What storage subsystem does the application run on?

Which other applications run on the same subsystem?

Is the data on the storage subsystem replicated elsewhere or is this the only copy?

How is the need for more data storage addressed for a given application?

What SAN switch is used for traffic to/from the storage subsystem?

What network components are used to replicate SAN data from one data center to another remote data center?

What is the application that performs data replication?

What is the software version and release for the data replication application?

Which encryption vendor is used to encrypt Confidential data on a given storage subsystem?

Does the storage for the encryption tool also run on a SAN shared with other applications?

Can corruption of the encryption data affect multiple applications or just this application?

What PKI vendor is used?

What version and release of PKI software is deployed?

Network Architecture -- All Applications

Is there only one switch/router vendor?

Is there only one firewall vendor?

Is there only one Intrusion Protection System/Intrusion Detection System (IPS/IDS) vendor?

Is there only one load balancer vendor?

Is there just one telecommunications vendor to the internet and/or WAN (Wide Area Network)?
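The single-vendor questions above (for both the storage and networking domains) reduce to a simple inventory check: list the vendors in use per product domain and flag any domain that depends on exactly one. A minimal, hypothetical sketch -- the domain and vendor names below are illustrative placeholders, not from any real inventory:

```python
# Hypothetical vendor inventory, keyed by product domain.
product_vendors = {
    "SAN storage":    {"VendorA"},
    "SAN switches":   {"VendorA"},
    "firewalls":      {"VendorB", "VendorC"},
    "load balancers": {"VendorD"},
}

def single_vendor_domains(vendors_by_domain):
    """Return product domains that rely on exactly one vendor."""
    return sorted(d for d, v in vendors_by_domain.items() if len(v) == 1)

print(single_vendor_domains(product_vendors))
# ['SAN storage', 'SAN switches', 'load balancers']
```

Each flagged domain is a candidate for either a second vendor or, at minimum, extra scrutiny of that vendor's release quality under load.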

Network Architecture -- Each Application

Which switch/routers are used within the data center?

Which switch/router models are used?

Are the switch/routers in an architecturally redundant design?

What embedded software version and hardware model are used in the switch/router deployment?

Which firewall vendor is used?

What models of firewalls are deployed in the data center?

Is there a limited number of firewall permutations deployed (embedded OS version, hardware model, features)?

What intrusion protection/detection products are deployed?

Which intrusion protection/detection vendors are used?

What permutations of IPS/IDS are deployed in the data center?

What version of IPS/IDS software is deployed?

Which vendor's load balancers are used?

Which load balancer model is used?

What is the version of the load balancer's embedded software and model of hardware?

Are the load balancers used to steer traffic between different global data centers?

Are the load balancers redundant, so that one could instantly take the place of another?

What telecommunications vendors are used for internet access?

What WAN telecommunications vendor is used for traffic between data centers?

What WAN telecommunications vendor is used for traffic between offices and the data center?

Is the telecommunications equipment redundant?

Are the underground telecommunications fiber routes physically separate?

These questions cover a large chunk of the storage and networking diagnostics. I'm sure I've missed some, but this should give a flavor of what the critical web applications are using within the infrastructure cloud layer. These questions give insight into whether a failure in a given product would affect multiple applications. They help companies design and tune the architecture so that redundancy can be built into every product where possible; then the failure of a given product does not cascade to multiple critical applications. It is very likely much cheaper to over-engineer, thereby anticipating and reacting well to failure, than it is to suffer very expensive cloud services downtime.

The questions about whether only one vendor is used for a given product type reveal a potential enterprise weakness. Full reliance on one vendor can lead to significant failure if a flaw in a specific hardware/software release surfaces only under stressful conditions. Then all cloud applications that use that product would be impacted negatively. The other questions address what I'll call use congestion: multiple applications share the same component (storage subsystem, server, or firewall), so a failure of that component affects all of those applications simultaneously.
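The use-congestion idea can be sketched as a small dependency audit: map each infrastructure component to the applications that depend on it, then flag any component whose failure would hit more than one application. The application and component names below are purely illustrative:

```python
from collections import defaultdict

# Hypothetical inventory: each application and the components it depends on.
app_components = {
    "payments":  ["san-array-1", "fw-pair-a", "lb-cluster-1"],
    "web-store": ["san-array-1", "fw-pair-a", "lb-cluster-2"],
    "reporting": ["san-array-2", "fw-pair-b", "lb-cluster-2"],
}

def shared_components(inventory):
    """Return components used by two or more applications --
    the candidates for cascading, multi-application failures."""
    users = defaultdict(set)
    for app, components in inventory.items():
        for c in components:
            users[c].add(app)
    return {c: sorted(apps) for c, apps in users.items() if len(apps) > 1}

print(shared_components(app_components))
# san-array-1 and fw-pair-a are each shared by payments and web-store;
# lb-cluster-2 is shared by web-store and reporting.
```

Any component that appears in the result is a spot where redundancy (or moving one application onto separate hardware) would keep a single product failure from cascading.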

In summary, this article focuses on architectural reliability. It presents a set of questions focused on products within the storage domain, encryption of data at rest, and the networking domain. Since products cost far less than application downtime, over-engineering is encouraged where possible. The need to deploy more product vendors must be balanced against the need to limit product and feature permutations so that realistic disaster recovery scenarios can be tested; please see a previous article I wrote on this. I'll cover diagnostic questions for other cloud layers in the next article.
