Big Data without good analytics can lead to bad decisions

These could undermine the competitiveness of an enterprise or damage the personal lives of individuals

Big Data does not necessarily mean Good Data. And that, as a growing number of experts are saying with increasing insistence, means Big Data does not automatically yield good analytics.

If the data is incomplete, out of context or otherwise contaminated, it can lead to decisions that could undermine the competitiveness of an enterprise or damage the personal lives of individuals.

One of the classic stories of how data out of context can lead to distorted conclusions comes from Harvard University professor Gary King, director of the Institute for Quantitative Social Science. A Big Data project was attempting to use Twitter feeds and other social media posts to predict the US unemployment rate, by monitoring key words like "jobs," "unemployment," and "classifieds."

Using an analytics technique called sentiment analysis, the group collected tweets and other social media posts that included these words to see if there were correlations between an increase or decrease in them and the monthly unemployment rate.

While monitoring them, the researchers noticed a huge spike in the number of tweets containing one of those key words. But, as King noted, they later discovered it had nothing to do with unemployment. "What they hadn't noticed was Steve Jobs died," he said.

In the telling, it's a somewhat humorous story, outside of the tragedy of Jobs' untimely passing. But the lesson is a deadly serious one for those looking to rely on the magic of Big Data to guide their decisions.

King said the mix-up over the dual meanings of "jobs" is "just one of many similar anecdotes. Anyone working in this area has had similar experiences."

"Lists of keywords, curated by human beings, work OK for the short run, but tend to fail catastrophically over the long run," he said. "You can fix it up by adding exceptions, but there's a lot of human labor involved."

He said it is easy for anyone to create their own example just by entering a keyword into the Bing Social page.

"You'll see some relevant things and some irrelevant. If you don't change the query and watch over time, you will often find the conversation veering away in some way -- sometimes a little, sometimes not at all for a while, and sometimes dramatically," he said.

But King said that overall there are many examples of big data analytics producing useful things, "so failures tend not to appear in the literature."
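The failure King describes is easy to reproduce in a few lines of Python. What follows is only a rough sketch, not the researchers' actual method: the sample posts are invented, and the names `KEYWORDS`, `EXCEPTIONS` and `count_matches` are hypothetical. It counts posts containing one of the three key words, then shows how a hand-written exception removes the Steve Jobs posts from the tally.

```python
import re

# The three key words named in the story.
KEYWORDS = {"jobs", "unemployment", "classifieds"}

# Hypothetical, hand-curated exception: posts about Steve Jobs
# should not count as posts about employment.
EXCEPTIONS = [re.compile(r"\bsteve jobs\b", re.IGNORECASE)]

# Invented sample posts, for illustration only.
SAMPLE_POSTS = [
    "Any jobs open in Ohio? Laid off last week",
    "RIP Steve Jobs, you changed the world",
    "Filed for unemployment today",
    "Steve Jobs has died",
    "Great weather this morning",
]


def matches_keyword(post):
    """Return True if the post contains any key word as a whole word."""
    words = set(re.findall(r"[a-z]+", post.lower()))
    return bool(words & KEYWORDS)


def count_matches(posts, use_exceptions=False):
    """Count key word matches, optionally skipping posts that hit an exception."""
    total = 0
    for post in posts:
        if not matches_keyword(post):
            continue
        if use_exceptions and any(p.search(post) for p in EXCEPTIONS):
            continue
        total += 1
    return total


if __name__ == "__main__":
    print("naive count:", count_matches(SAMPLE_POSTS))
    print("with exceptions:", count_matches(SAMPLE_POSTS, use_exceptions=True))
```

The naive count treats every mention of "jobs" as an employment signal. The exception fixes this one case, but each new ambiguity needs another rule written by hand, which is the human labor King is talking about.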

Kim Jones, senior vice president and CSO of Vantiv, said this is not a new problem, but one that can be magnified if people think massive amounts of data are magically going to produce good analytics.

"The Jobs example was a classic case of data without context. Data by itself doesn't equal intelligence," he said.

King agrees that context is key. He is co-founder and chief scientist of Crimson Hexagon, a big-data analytics firm that, in the words of Wayne St. Amand, its executive vice president of marketing, seeks to provide "context, meaning and structure to online conversations."

Yet there are increasing examples of data without context driving decisions. The Wall Street Journal reported in February on health insurance companies using Big Data to create profiles of their members. Among the things the companies tracked was a history of buying plus-sized clothes, which could lead to a mandatory referral to weight-loss programs.

Few people would argue with encouraging people to live healthier lives, but the privacy implications are disturbing. It is possible the person buying those clothes might have been doing so for another family member. And it is not always so benign. Bloomberg BusinessWeek reported in 2008 on individuals being denied health insurance based on a history of prescription drug purchases that suggested even minor mental health conditions.

Adam Frank, writing on the National Public Radio blog, noted that in some cases banks will deny a loan to someone based in part on their contacts on the employment networking site LinkedIn or the social networking behemoth Facebook. If your "friends" are deadbeats, your credit-worthiness may be based on their reliability.

Frank quoted Jay Stanley, senior policy analyst at the ACLU, noting on that group's blog that "Credit card companies sometimes lower a customer's credit limit based on the repayment history of the other customers of stores where a person shops. Such 'behavioral scoring' is a form of economic guilt-by-association based on making statistical inferences about a person that go far beyond anything that person can control or be aware of."

Kim Jones said the tendency to jump to a conclusion from correlations without further analysis could have affected him personally. "During the late '80s and early '90s, data showed that Hispanic and Black males between the ages of 20 and 27 who were driving an entry-level luxury car on the I-95 corridor were likely to be drug runners," he said.

"I fit some of that profile -- I'm African American, I was that age and at that time I was driving a car like that. But if I had been stopped, the police would have seen that I was wearing an Army uniform with Second Lt. bars and had a West Point ring," he said.

The point, he said, is that "it's always bad to rely just on data analytics. When you take the human element out of the equation, you by definition create a higher error rate."

In short, Big Data is a tool, but should not be considered the solution. "It can help you narrow something down from millions to perhaps 150," Jones said, "but the temptation is to let the computer do it all, and that is what is going to get you in trouble."


Stories by Taylor Armerding
