Big Data security, privacy concerns remain unanswered

Big Data creates new security and privacy challenges that deidentification can't meet

Approaches to storing, managing, analyzing and mining Big Data are new, and they introduce new security and privacy challenges. Big Data systems transmit and process an individual's personally identifiable information (PII) as part of a mass of data--millions to trillions of entries--flowing swiftly through new junctions, each with its own vulnerabilities.


Deidentification masks PII, separating information that identifies someone from the rest of his or her data. The hope is that this process protects people's privacy, keeping information that would kindle biases and other misuse under wraps.

Reidentification science, which pieces PII back together and reattaches it to the individual, thwarts the deidentification approaches meant to protect Big Data. That makes it unrealistic to believe deidentification can really maintain the security and privacy of personal information in Big Data scenarios.

Vulnerabilities, Exposure and Deidentification

Enterprises manage Big Data using large, complex systems that must execute hand-offs from system to system. "Typically an ETL procedure (extract, transform, load) loads Big Data from a traditional RDBMS data warehouse onto a Hadoop cluster. Since most of that data is unstructured, the system runs a job in order to structure the data. Then the system hands it off to a relational database to serve it up, to a BI analyst, or to another data warehouse running Hadoop for storage, reference, and retrieval," explains Brian Christian, CTO, Zettaset. Every one of those hand-offs or moves crosses a vulnerable junction.
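The chain Christian describes can be written down as a list of hops; the point is simply that each hop is a junction an attacker can target. A minimal sketch (the stage names follow his quote; the `junctions` helper is hypothetical):

```python
# Hypothetical outline of the hand-off chain described above.
# Each tuple is (source, destination); every hop is a junction where
# data in transit can be exposed and so needs its own protection.
PIPELINE = [
    ("RDBMS data warehouse", "Hadoop cluster"),           # ETL load
    ("Hadoop cluster", "structuring job"),                # structure raw data
    ("structuring job", "relational database"),           # serve it up
    ("relational database", "BI analyst / Hadoop warehouse"),  # downstream use
]

def junctions(pipeline):
    """Count the hand-offs -- each one a point to secure (for example,
    encryption in transit plus access control on both endpoints)."""
    return len(pipeline)

print(junctions(PIPELINE))  # 4 hops, 4 junctions to secure
```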

Creators of Big Data solutions never intended many of them to do what they do today. Take MapReduce, for example. "Google invented MapReduce to store public links so people can search them," says Christian. There were no worries about security because the links were public. Now enterprises run MapReduce and NoSQL systems on medical and financial records, which should remain private. Because security is not inherent, enterprises and vendors have to retrofit these systems with it. "That's a big problem," says Christian. "Vendors did not design firewalls and IDS for distributed computing architectures." These architectures scale to extremes beyond what traditional firewalls and intrusion detection systems can natively address.


According to the Stanford Law Review article, vulnerabilities that expose PII subject people to scrutiny, raising concerns about profiling, discrimination and exclusion based on an individual's demographics. Such abuses can cost individuals control over their own data. While brands use PII to market to customers to their benefit, those same vendors, as well as law enforcement, government agencies and other third parties, could also interpret and apply that personal data to the individual's detriment.

To prevent such abuses, organizations charged with protecting private data have traditionally used deidentification methods including anonymization, pseudonymization, encryption, key-coding and data sharding to distance PII from real identities, according to the Stanford Law Review article. Anonymization protects privacy by removing names, addresses and Social Security numbers; pseudonymization replaces this information with nicknames, pseudonyms and artificial identifiers. Key-coding encodes the PII and establishes a key for decoding it. Data sharding breaks off part of the data in a horizontal partition, providing enough data to work with but not enough to reidentify an individual.
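Pseudonymization with key-coding can be illustrated in a few lines: real identifiers are swapped for random tokens, and a separate key table maps tokens back to the originals. This is a minimal sketch with made-up record fields, not any vendor's implementation:

```python
import secrets

def pseudonymize(records, pii_fields):
    """Replace PII values with random tokens.

    Returns the masked records plus the key table (token -> original value),
    which is the 'key' in key-coding: whoever holds it can re-link identities,
    so it must be stored and guarded separately from the data.
    """
    key_table = {}
    masked_records = []
    for rec in records:
        masked = dict(rec)
        for field in pii_fields:
            value = masked.get(field)
            if value is None:
                continue
            token = secrets.token_hex(8)   # opaque artificial identifier
            key_table[token] = value
            masked[field] = token
        masked_records.append(masked)
    return masked_records, key_table

# Hypothetical record: the analytic field (diagnosis) survives untouched.
patients = [{"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "flu"}]
masked, key = pseudonymize(patients, ["name", "ssn"])
```

The usefulness for analytics is exactly the weakness the article goes on to describe: the non-PII fields remain intact, and those are what reidentification science works with.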

Reconstituting Identities

However, computer scientists have shown they can use data that is not PII to reconstitute the associated person's identity. "There are many ways to piece data back together once you have even one type of data to work with," says Keith Carter, Adjunct Professor, The Business School of the National University of Singapore. If a brand or government acquired a list of GPS records covering one year, it could use that data to learn a great deal about a person, including his or her identity.

"You would easily be able to find out who they are by identifying the address they regularly come from at seven or eight in the morning. You would be able to see the school or office where they then show up. You would be able to learn where they went back to in the evening," says Carter, a speaker at the "Big Data World Asia 2013" conference.


From that, someone could get the person's name and address with a high degree of accuracy using a public address lookup tool. With the family name in hand, they could determine which family member the trail belongs to by where that person ends up after leaving home in the morning, whether at a primary or secondary school or at a certain place of work.
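The heuristic Carter describes needs nothing more than frequency counting over timestamped location pings. A minimal sketch, using invented coordinates and an assumed `(hour, location)` input format:

```python
from collections import Counter

def infer_home_and_work(pings):
    """Guess home and work locations from timestamped GPS pings.

    pings: list of (hour_of_day, rounded_lat_lon) tuples.
    Hypothetical heuristic illustrating Carter's point: the most frequent
    early-morning location is probably home; the most frequent midday
    location is probably a school or workplace.
    """
    morning = Counter(loc for hour, loc in pings if 6 <= hour < 9)
    midday = Counter(loc for hour, loc in pings if 10 <= hour < 17)
    home = morning.most_common(1)[0][0] if morning else None
    work = midday.most_common(1)[0][0] if midday else None
    return home, work

# Invented year-of-GPS excerpt: two morning pings, two midday, one evening.
pings = [(7, "1.30,103.85"), (8, "1.30,103.85"),
         (12, "1.28,103.84"), (13, "1.28,103.84"),
         (22, "1.30,103.85")]
home, work = infer_home_and_work(pings)
# home -> "1.30,103.85", work -> "1.28,103.84"
```

Feeding the inferred home coordinates into a public address lookup then yields a name, which is why location traces stripped of explicit PII still deanonymize people.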

Losing Faith

The Stanford Law Review article suggests that the ability to reidentify people from pieces of data has a negative impact on privacy policy and undermines faith in anonymization. Further, the article suggests that deidentification is a key component of business models, especially in healthcare, online behavioral advertising, and cloud computing. One implication is that enterprises entrenched in deidentification as a privacy solution could be hard-pressed to find and fund an alternative. So abuses that result from reidentification could go on for a long time.

But this assumes that governments and businesses had faith in anonymization in the first place, according to Carter, who has held roles at Accenture, Goldman Sachs and Estee Lauder. It also assumes that businesses and governments have spent a lot of money on something that doesn't deliver business value, Carter notes. In fact, what governments and businesses have done is give themselves safe harbor by using deidentification and anonymization. And even when companies don't use deidentification, the legal repercussions amount to a slap on the wrist, Carter confirms.

The truth is there may never be an adequate solution for Big Data privacy concerns, affordable or otherwise. There may only be solutions that protect enterprises and other entities from liability while pacifying people whose data is at risk. Unfortunately, for the individual, this means that abuses will indeed go on, regardless of the solution at hand.


Stories by David Geer
