Scraping away profits

Web scraping is on the rise. How to detect scrapers and protect your intellectual property.

At technology company Graphiq, web-scraping bots were becoming more than just a nuisance. They were impacting its bottom line.

The company collects and interprets billions of data points from thousands of online data sources and turns them into visual graphs that website visitors can use for free. Scrapers were extracting data from hundreds of millions of these pages and building duplicate sites. 

“We don’t want people to reuse [our data] commercially for free because there is a cost associated with creating that content,” says Ivan Bercovich, vice president of engineering. “It also undermines the value of our content” and steals traffic from the company’s site. Then there are the operational costs associated with blocking those web-scraping attempts. “We may have months where we block 5% to 6% of all requests,” Bercovich says. “For a site of our volume, about 30 million visitors a month, that’s a lot of wasted requests.”

Web scraping is on the rise, especially attacks on businesses to steal intellectual property or competitive intelligence. Scraping attacks increased 17 percent in 2014, up for the fifth year in a row, according to a report from ScrapeSentry, an anti-scraping service. Some 22 percent of all site visitors are considered to be scrapers, according to the report.

“One of the big things driving this up is that it’s just getting easier,” says Gus Cunningham, CEO at ScrapeSentry. Would-be scrapers can pay for commercial scraping services, write the code themselves using step-by-step online tutorials or even get free automated tools that do all the work.

Web-scraping tools are out in the open because web scraping is legal in some cases, such as gathering data for personal use. But it also creates a loophole for nefarious scrapers and a security hole for companies that don’t update their legal terms or their IT security processes.

“A lot of people are under the misconception that this kind of thing is considered ‘fair use,’ which is absolutely incorrect,” says Michael R. Overly, a partner and intellectual property lawyer focusing on technology at Foley & Lardner LLP in Los Angeles. “They think that because [the data provider’s] website doesn’t require any payment of fees there’s this exception. In general, if you (the scraper) are selling ads on your site, even if your end users don’t pay you any money, you’re getting revenues from ad displays. It’s a commercial purpose, so it’s highly unlikely it’s going to be fair use.”


Ticketmaster and Massachusetts Institute of Technology have successfully gone after scrapers of their data who claimed that their actions were fair use or didn’t violate copyright laws.

Today’s botnets take web scraping to a whole new and elusive level. “The reward-to-risk ratio in cybercrime is actually pretty good, and that’s why we’re seeing an uptick in volume in web scraping,” says Morgan Gerhart, vice president of product marketing at cyber security company Imperva. “Generally 60 percent of traffic that hits a website is bots. Not all bot traffic is bad, but we know half of that is from somebody that is up to no good.” Random spikes in bot traffic reduce website performance and increase infrastructure costs, which affects the user’s experience.

What you can do

Web scraping is not going away, Bercovich says, but companies can take several steps to fight back. Botnets come fast and furious in large volumes and usually slow down systems. “If they’re filtering at superhuman speeds, or paginating quickly, or never scroll down the page,” then it’s probably a bot, he says. Even after bots are detected and blocked, the fight is rarely over. “They’ll try a few variations to see if they can escape detection, but by then we’re totally on top of it,” Bercovich adds.
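The behavioral signals Bercovich describes can be checked with simple heuristics. Here is a minimal sketch; the thresholds and session fields are illustrative assumptions, not Graphiq's actual detection rules:

```python
# Illustrative bot-detection heuristic based on the behavioral signals in the
# article: superhuman request rates, rapid pagination, and no scrolling.
# Thresholds and the Session fields are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class Session:
    requests_per_minute: float  # page requests in the last minute
    pages_paginated: int        # sequential result pages fetched
    scroll_events: int          # scroll events reported by the client

def looks_like_bot(s: Session) -> bool:
    """Flag sessions that filter at superhuman speed, paginate
    rapidly, or never scroll down the page."""
    if s.requests_per_minute > 60:   # faster than a human can click
        return True
    if s.pages_paginated > 20 and s.scroll_events == 0:
        return True                  # deep pagination, zero scrolling
    return False

print(looks_like_bot(Session(120, 5, 0)))   # superhuman rate: True
print(looks_like_bot(Session(10, 30, 0)))   # paginates, never scrolls: True
print(looks_like_bot(Session(8, 3, 12)))    # plausibly human: False
```

In production these signals would feed a score rather than a hard rule, since any single threshold is easy for a scraper to duck under.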

A multi-layer defense is the best offense to combat web-scraping bots, Gerhart says.

Application level intelligence

A product with application level intelligence can look at traffic to determine if it’s a browser on the other end or a bot. “People who are good at [scraping] can make it look like a browser,” Gerhart says. “That’s when you get into more sophisticated behavior.”
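One basic form of application-level intelligence is checking whether a request carries the headers a real browser always sends. A minimal sketch follows; the header set is an illustrative assumption, and as Gerhart notes, good scrapers forge these, which is why behavioral analysis is the next layer:

```python
# Sketch: flag requests missing headers that mainstream browsers always send.
# The header list is an assumption for illustration; determined scrapers can
# forge all of these, so this is only a first filter.
BROWSER_HEADERS = {"user-agent", "accept", "accept-language", "accept-encoding"}

def missing_browser_headers(headers: dict) -> set:
    """Return the expected browser headers absent from a request."""
    present = {k.lower() for k in headers}
    return BROWSER_HEADERS - present

# A bare scripting client typically sends far fewer headers than a browser.
curl_like = {"User-Agent": "python-requests/2.31", "Accept": "*/*"}
print(missing_browser_headers(curl_like))
```

A request missing `Accept-Language` or `Accept-Encoding` while claiming to be a browser is a cheap early signal before heavier behavioral checks run.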

Rate limiting

Not all web scraping is inherently bad, Gerhart says, “but you don’t want web scraping traffic to interfere with other users.” On a per-connection basis, limit a user’s actions to no more than a set number of actions in a given amount of time, he says. “Even if you’re OK with scraping of your site, you may not want it at a rapid pace so it overwhelms your CPUs.”
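The per-connection limit Gerhart describes is commonly implemented as a sliding window. A minimal sketch, with illustrative limits (10 requests per second per client) and an in-memory store standing in for whatever a real deployment would use:

```python
# Sketch of per-client rate limiting: allow at most MAX_ACTIONS requests in
# any WINDOW-second span. The limits and in-memory store are illustrative.
import time
from collections import defaultdict, deque

MAX_ACTIONS = 10   # "no more than a set number of actions..."
WINDOW = 1.0       # "...in a given amount of time" (seconds)

_history = defaultdict(deque)  # client id -> timestamps of recent requests

def allow(client: str, now: float = None) -> bool:
    """Sliding-window check; returns False once the client exceeds the limit."""
    now = time.monotonic() if now is None else now
    q = _history[client]
    while q and now - q[0] > WINDOW:   # drop timestamps outside the window
        q.popleft()
    if len(q) >= MAX_ACTIONS:
        return False                    # over the limit: throttle or block
    q.append(now)
    return True

# A client bursting 12 requests at once: the first 10 pass, the rest fail.
results = [allow("10.0.0.1", now=100.0) for _ in range(12)]
print(results.count(True), results.count(False))   # 10 2
```

This throttles well-behaved but aggressive scrapers without blocking them outright; distributed botnets evade per-IP limits, which is where the broader defenses below come in.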

Obfuscate data

Render data meaningless to the person who is scraping it. Displaying web content as images or Flash files can deter site scrapers, “although more sophisticated scrapers can get around it,” Gerhart says. Another option: applications can compile text using JavaScript or style sheets, since most scraping tools cannot interpret JavaScript or CSS.
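As a sketch of the style-sheet approach (an illustrative technique, not any specific product's): the server can emit empty spans and inject each character through CSS, so a parser that reads only the HTML body extracts nothing:

```python
# Sketch: render text via CSS ::before content so the HTML body contains no
# literal data. Naive scrapers that parse only the markup come away empty;
# as noted in the article, sophisticated scrapers can still defeat this.
def obfuscate(text: str) -> tuple:
    """Return (html_fragment, stylesheet) that together display `text`."""
    spans, rules = [], []
    for i, ch in enumerate(text):
        cls = f"c{i}"
        spans.append(f'<span class="{cls}"></span>')
        # note: real code must escape quotes and backslashes in `ch`
        rules.append(f'.{cls}::before {{ content: "{ch}"; }}')
    return "".join(spans), "\n".join(rules)

body, css = obfuscate("$1,299")
print("$1,299" in body)   # False: the price never appears in the markup
```

The trade-off is real: this also hides the data from search engines and screen readers, so it tends to be reserved for the most-scraped fields, such as prices.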

Other deterrents include constantly changing HTML tags to deter repeated scraping attacks, and using fake web content, images, or links to catch site scrapers who republish the content, Gerhart says.
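The fake-link trap can be as simple as a link no human ever sees or clicks: any client that fetches it has identified itself as a crawler. A minimal sketch, where the trap path and in-memory blocklist are illustrative assumptions:

```python
# Sketch of a honeypot trap: a link hidden from humans (e.g. via CSS) that
# only an automated crawler following every href will request. The trap path
# and in-memory blocklist are assumptions for illustration.
TRAP_PATH = "/internal/do-not-follow"   # hypothetical hidden URL
blocked_ips = set()

def hidden_link_html() -> str:
    # Invisible to users; present in the markup for link-following bots.
    return f'<a href="{TRAP_PATH}" style="display:none">archive</a>'

def handle_request(path: str, client_ip: str) -> int:
    """Return an HTTP status; trap hits get the client blocked."""
    if client_ip in blocked_ips:
        return 403
    if path == TRAP_PATH:
        blocked_ips.add(client_ip)      # the scraper outed itself
        return 403
    return 200

print(handle_request("/", "203.0.113.9"))          # 200: normal page view
print(handle_request(TRAP_PATH, "203.0.113.9"))    # 403: bot followed trap
print(handle_request("/", "203.0.113.9"))          # 403: now blocked
```

A real deployment would also exclude the trap path in robots.txt so legitimate search-engine crawlers are not caught.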

Safety in numbers

Some companies combat distributed botnets by partnering with large service providers that have exposure to a big portion of all of the requests on the Internet. They’re able to see attack patterns, collect those IP addresses and block them for all of their clients. Graphiq chose to outsource its bot protection to a provider with broader knowledge of scraping attacks.

Legal protection

Scrapers and botnet users are extremely hard to find and prosecute, security experts say. Still, companies have to lay the groundwork for legal action by clearly stating in their website’s terms and conditions of use that web scraping or automated cataloging is prohibited, Overly says.

The second line of a legal defense is copyright law. When scrapers make off with material on a site, they are infringing on that copyright. Website owners don’t even have to prove that scraping led to any real harm, Overly says. “They can simply show that it was intentional, and they get mandated damages from the copyright act, which can be very substantial.”

Today, Graphiq “rarely if ever” has its data stolen by web scrapers, Bercovich says, but they’ll never be able to eliminate botnet attempts. “You can only detect and block them so they don’t get your content,” he says. “The more effectively you can do that, the more understanding and good reporting that you have, the more quickly you can act.”
