How to Secure Big Data in Hadoop

The potential is enormous: as businesses transform into data-driven machines, the data held by your enterprise is likely to become the key to your competitive advantage. As a result, security for both your data and your infrastructure becomes more important than ever before.

Big Data Could Be Toxic Data If Lost

In many cases, organizations will wind up with what Forrester Research calls "toxic data." For instance, imagine a wireless company that is collecting machine data (who's logged onto which towers, how long they're online, how much data they're using, whether they're moving or staying still) that can be used to provide insight into user behavior.

That same wireless company may have lots of user-generated data as well: credit card numbers, social security numbers, data on buying habits and patterns of usage (any information that a human has volunteered about their experience). The capability to correlate that data and draw inferences from it could be valuable, but it is also toxic because if that correlated data were to go outside the organization and wind up in someone else's hands, it could be devastating both to the individual and the organization.

With Big Data, Don't Forget Compliance and Controls

9 Tips for Securing Big Data

1. Think about security before you start your big data project. You don't lock your doors after you've already been robbed, and you shouldn't wait for a data breach incident before you secure your data. Your IT security team and others involved in your big data project should have a serious data security discussion before installing and feeding data into your Hadoop cluster.

2. Consider what data may get stored. If you're planning to use Hadoop to store and run analytics against data subject to regulation, you will likely need to comply with specific security requirements. Even if the data you're storing doesn't fall under regulatory jurisdiction, assess your risks (including loss of goodwill and potential loss of revenue) if data like personally identifiable information (PII) is lost.

3. Centralize accountability. Right now, your data probably resides in diverse organizational silos and data sets. Centralizing the accountability for data security ensures consistent policy enforcement and access control across these silos.

4. Encrypt data both at rest and in motion. Add transparent data encryption at the file layer. SSL encryption can protect big data as it moves between nodes and applications. "File encryption addresses two attacker methods for circumventing normal application security controls," says Adrian Lane, analyst and CTO of security research and advisory firm Securosis. "Encryption protects in case malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. It is transparent to both Hadoop and calling applications and scales out as the cluster grows. This is a cost-effective way to address several data security threats."

5. Separate your keys and your encrypted data. Storing your encryption keys on the same server as your encrypted data is similar to locking your front door and then leaving the keys dangling from the lock. A key management system allows you to store your encryption keys safely and separately from the data you're trying to protect.

6. Use the Kerberos network authentication protocol. You need to be able to govern which people and processes can access data stored within Hadoop. "This is an effective method for keeping rogue nodes and applications off your cluster," Lane says. "And it can help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications take work. But without bi-directional trust establishment, it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting the introduction of malicious nodes, which can then add, alter or extract data. Kerberos is one of the most effective security controls at your disposal, and it's built into the Hadoop infrastructure, so use it."

7. Use secure automation. You're dealing with a multi-node environment, so deployment consistency can be difficult to ensure. Automation tools like Chef and Puppet can help you stay on top of patching, application configuration, updating the Hadoop stack, collecting trusted machine images, certificates and platform discrepancies. "Building the scripts takes some time up front but pays for itself in reduced management time later, and additionally ensures that each node comes up with baseline security in place."

8. Add logging to your cluster. "Big data is a natural fit for collecting and managing log data," Lane says. "Many web companies started with big data specifically to manage log files. Why not add logging onto your existing cluster? It gives you a place to look when something fails, or if someone thinks perhaps you've been hacked. Without an event trace you are blind. Logging MR requests and other cluster activity is easy to do and increases storage and processing demands by a small fraction, but the data is indispensable when you need it."

9. Implement secure communication between nodes and between nodes and applications. To do this, you'll need an SSL/TLS implementation that protects all network communications rather than just a subset. Some Hadoop providers, like Cloudera, already do this, as do many cloud providers. If your setup doesn't have this capability, you'll need to integrate the services into your application stack.
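Several of the tips above (Kerberos authentication, encrypted traffic between nodes, and automated baseline checks) come down to configuration settings that a script can verify on every node. The following is a minimal Python sketch of such a check, not a definitive implementation. The property names are standard Hadoop configuration keys, but the set of required values in `BASELINE`, the helper functions `parse_hadoop_config` and `audit_config`, and the sample XML are all assumptions made for illustration. A real cluster splits these properties across core-site.xml and hdfs-site.xml, and `hadoop.rpc.protection` may hold a comma-separated list that this simple comparison does not handle.

```python
import xml.etree.ElementTree as ET

# Baseline settings to check. The property names are standard Hadoop
# configuration keys; which values count as "required" is an assumption
# made for this sketch, not a recommendation from the article.
BASELINE = {
    "hadoop.security.authentication": "kerberos",  # tip 6: Kerberos
    "hadoop.security.authorization": "true",       # service-level authorization
    "hadoop.rpc.protection": "privacy",            # tip 9: encrypt RPC traffic
    "dfs.encrypt.data.transfer": "true",           # tip 9: encrypt block transfers
    "dfs.http.policy": "HTTPS_ONLY",               # tip 9: TLS for web endpoints
}


def parse_hadoop_config(xml_text):
    """Return {name: value} from the text of a Hadoop *-site.xml file."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            settings[name.strip()] = (prop.findtext("value") or "").strip()
    return settings


def audit_config(settings, baseline=BASELINE):
    """Return (name, expected, actual) for every setting that deviates."""
    findings = []
    for name, expected in baseline.items():
        actual = settings.get(name)
        # A missing property counts as a deviation (actual is None).
        if actual is None or actual.lower() != expected.lower():
            findings.append((name, expected, actual))
    return findings


# Hypothetical sample configuration used only to exercise the check.
SAMPLE_SITE_XML = """<configuration>
  <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
  <property><name>hadoop.security.authorization</name><value>true</value></property>
  <property><name>hadoop.rpc.protection</name><value>authentication</value></property>
  <property><name>dfs.encrypt.data.transfer</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    settings = parse_hadoop_config(SAMPLE_SITE_XML)
    for name, expected, actual in audit_config(settings):
        print(f"{name}: expected {expected!r}, found {actual!r}")
```

Run against the sample, the script flags `hadoop.rpc.protection` (set to `authentication` rather than `privacy`) and `dfs.http.policy` (missing). A check like this could be called from a Chef or Puppet run so that a node with a weakened configuration is reported before it joins the cluster.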

Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers. Follow Thor on Twitter @ThorOlavsrud.
