Can data lakes solve cloud security challenges?

"Data Lake" is a proprietary term. "We have built a series of big data platforms that enable clients to inject any type of data and to secure access to individual elements of data inside the platform. We call that architecture the data lake," says Peter Guerra, Principal, Booze, Allen, Hamilton. Yet, these methods are not exclusive to Booze, Allen, Hamilton.

"I have read what's available about it," says Dr. Stefan Deutscher, Principal, IT Practice, Boston Consulting Group, speaking of the data lake; "I don't see what's new. To me, it seems like re-vetting available security concepts with a name that is more appealing." Still, the approach is gaining exposure under that name.

In fact, enterprises are showing enough interest that vendors are slapping the moniker on competing solutions. Such is the case with the Capgemini/Pivotal collaboration on the "business data lake," where the vendors use the name to highlight how their offering differs from others.

This enterprise curiosity stems from real big data ills that need equally genuine cures. Enterprises of every size, from government agencies and large corporations on down, use big data inside public multitenant cloud environments. All the risks of multitenancy apply in these scenarios, including the vulnerabilities that come with another tenant's weaker security, potential access by users of an adjacent tenant, PII/PHI exposure, and other regulatory non-compliance. Data lakes promise to protect big data from these perils of the public cloud.

But while Defense agencies need the protection data lakes offer for each individual data element, the typical enterprise does not. Nor can most enterprises afford the performance hit that comes with using data lakes in this way. That's why some vendors use data lakes to protect big data as a whole rather than piece by piece, avoiding the performance lag of the element-level approach. Enterprises in the market for solutions to the security challenges of the public cloud should consider one or both data lake approaches.

Securing data elements

"The overarching concept is the ability to pull in different types of data, tag that data, and enable users and administrators to secure the individual data elements within the data lake," says Guerra. Rather than deidentifying PII/PHI and providing data privacy on the whole, this data lake approach determines what pieces of data are sensitive and what pieces are not and works from there.

"We like to bring all the data into the data lake in its rawest format," says Guerra; "we don't do any extraction or transformation of data ahead of time." Instead, this approach tags each data element with a set of metadata tags and attributes that describes the data and how the IAM systems that access it should handle it.

According to Guerra, the IAM system enforces the security of individual data elements using rules based on XACML (the eXtensible Access Control Markup Language). An administrator or system writes rules in the IAM system, which enforces those rules when a user authenticates. The system passes the user's security authorizations to the big data architecture. "The big data architecture then matches the individual security authorizations with the XACML rules and returns only the appropriate data," says Guerra.
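The match-and-filter flow Guerra describes can be rendered as a toy decision function. Real XACML policies are XML documents evaluated by a policy decision point; the dictionary "rules" below only mimic their subject/resource/action structure for illustration:

```python
# Hypothetical XACML-flavored rules: each permits an action on elements whose
# sensitivity tag matches, for subjects holding the required authorization.
RULES = [
    {"action": "read", "sensitivity": "public", "required_auth": "employee"},
    {"action": "read", "sensitivity": "PII", "required_auth": "privacy_officer"},
]

LAKE = [
    {"value": "Alice", "tags": {"field": "name", "sensitivity": "public"}},
    {"value": "123-45-6789", "tags": {"field": "ssn", "sensitivity": "PII"}},
]

def permitted(user_auths: set, tags: dict, action: str) -> bool:
    """True if any rule lets a subject with these authorizations act on the element."""
    return any(
        r["action"] == action
        and r["sensitivity"] == tags["sensitivity"]
        and r["required_auth"] in user_auths
        for r in RULES
    )

def query(user_auths: set) -> list:
    """Return only the elements the subject's authorizations unlock."""
    return [e["value"] for e in LAKE if permitted(user_auths, e["tags"], "read")]

print(query({"employee"}))                      # ['Alice']
print(query({"employee", "privacy_officer"}))   # ['Alice', '123-45-6789']
```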

Pros and cons

Data lakes still require role-based access, policies, and policy enforcement. "You use PKI to ensure the person is who they say they are and to bind their attributes to the platform that stores the individual data attributes to ensure that security is complete," says Guerra. The system needs policies, and enforcement of those policies, to permit or limit access based on the metadata tags and attributes; a brokering technology sits in front of data access requests to enforce the security policy.

"It's very difficult to implement those systems and attribute enforcements throughout the data lake platform stack," says Guerra. But Guerra has worked closely with clients to define policies, he says.

With this kind of system, a data assailant would have to break through the perimeter security around the data lake and through the security protecting the individual data elements in order to retrieve anything. The system uses PKI to cryptographically sign and enforce security tags for the data elements. "You can't change them nor can you break them. An attacker would have to break each tag in order to gain access to all data elements," says Guerra.
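One way to picture cryptographically signed tags is the sketch below, which uses the third-party Python cryptography package (pip install cryptography; not necessarily what Booz Allen uses) to sign a tag set with an RSA private key and reject any element whose tags have been altered:

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Hypothetical setup: in a real PKI the key pair comes from a managed CA.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
PSS = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)

def sign_tags(tags: dict) -> bytes:
    """Sign the canonical form of a tag set so it cannot be altered undetected."""
    payload = json.dumps(tags, sort_keys=True).encode()
    return private_key.sign(payload, PSS, hashes.SHA256())

def verify_tags(tags: dict, signature: bytes) -> bool:
    """Reject any element whose tags fail signature verification."""
    payload = json.dumps(tags, sort_keys=True).encode()
    try:
        public_key.verify(signature, payload, PSS, hashes.SHA256())
        return True
    except InvalidSignature:
        return False

tags = {"field": "ssn", "sensitivity": "PII"}
sig = sign_tags(tags)
print(verify_tags(tags, sig))                               # True
print(verify_tags({**tags, "sensitivity": "public"}, sig))  # False: tampered tag
```

Because verification fails on any modified tag set, an attacker who gets past the perimeter still cannot quietly relabel sensitive elements as public.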

However, this kind of approach requires an IAM system with attribute-based access control (ABAC). A number of ABAC vendors are on the market, but system scalability and performance remain concerns with ABAC systems, according to NIST Special Publication 800-162, "Guide to Attribute Based Access Control (ABAC) Definition and Considerations" (January 2014).

But ABAC IAM systems in an unstructured data lake work differently than existing structured systems and legacy security solutions do, says Jerry Irvine, CIO of Prescient Solutions and a member of the National Cyber Security Task Force. "Access and authorization controls within the data lake are distributed across multiple categories of service and systems," says Irvine. That distribution offsets the potential for these IAM systems to hit load and performance problems at a single point of failure.

How data lakes identify and tag data from legacy platforms is another concern. "Most applications don't provide sufficient meta-information about data they generate," says Dr. Deutscher. This can make it difficult for data lakes to know how to tag data elements with attributes.

"We've handled that a couple of ways," says Guerra. One method is to query legacy systems and apply tagged attributes to the results. Another way is to classify legacy systems as a whole. A small subset of people can read an older financial transaction system, for example. "We integrate the output from that legacy system and pull it into the data lake," says Guerra. The data becomes part of the lake while retaining access rights for the appropriate people.

Ultimately, data lakes enable enterprises to swiftly ingest variegated data types, making the data easier to process and exploit. "Because all the data is stored unaltered, queries provide a more accurate report with a greater depth of information reported about the data," says Irvine. Data lakes provide higher levels of information to executive management, revealing correlations between data that they may have overlooked and allowing them to make more intelligent decisions, Irvine notes.

Securing only the lake

"Data lakes can act as repositories of log file information, user information, and behavioral and transactional information about the user," says Steve Jones, Strategy Director, Big Data & Analytics, Capgemini. Enterprises can use massive amounts of data to establish a robust baseline of expected user behavior. With a fine grain model of normal behavior, data lakes can quickly and precisely detect anomalous behavior, intrusions, IP theft, and data leakage.

This data lake approach avoids the costs and performance lags of the element-level approach, which stem from enriching every single piece of data with the right metadata and from validating every query and hit on every piece of information against the security policy, explains Jones.

While the level of security detail in the other data lake approach is laudable, says Jones, it is probably too expensive for most enterprises. "The raw data that data lakes can store is, however, useful in securing a cloud approach by performing threat, intrusion, and anomalous behavior analysis," says Jones.

CSOs need to know what they are trying to achieve. "Is it fine-grained security in the defense sector or simply a better way to create a 360-degree view of internal and external threats?" asks Jones. "Understanding the real business challenge will help them undertake the right approach."

For many, the simpler solution is the right one.
