Hackathon geared toward the 'liberation' of data from public PDF documents

The Sunlight Foundation and others will sponsor a three-day hackathon starting Friday

Massive amounts of unstructured data are held in PDF documents, but extracting key figures and words from PDFs programmatically can be difficult and costly. This poses a challenge to public-interest groups, journalists and others who want to run large-scale analyses on PDF documents.

In a hackathon set for this week, participants will work on ways to improve the open-source software tools available for PDF data extraction.

"Say, for example, you want to model student loan securitizations," wrote Marc Joffe, principal consultant at Public Sector Credit Solutions and an organizer, along with the Sunlight Foundation and others, of the PDF Liberation Hackathon, in a guest post on the Mathbabe blog. "A corporation or well funded research institution can purchase an expensive, enterprise-level ETL (Extract-Transform-Load) tool to migrate data from the PDFs into a database. But this is not much help to insurgent modelers who want to produce open source work."

"Data journalists face a similar challenge," he added. "They often need to extract bulk data from PDFs to support their reporting. Examples include IRS Form 990s filed by non-profits and budgets issued by governments at all levels."

Data journalists have developed open-source PDF harvesting tools such as Tabula, Joffe added.

"Unfortunately, the free and low cost tools available to modelers, data journalists and transparency advocates have limitations that hinder their ability to handle large scale tasks," he wrote. "If, like me, you want to submit hundreds of PDFs to a software tool, press 'Go' and see large volumes of cleanly formatted data, you are out of luck."

The hackathon runs from Friday through Sunday and will be held at six sites, including the Sunlight Foundation's headquarters in Washington, D.C., according to the event's website. Remote participation is also possible.

Contestants will be able to work on "a PDF extraction challenge provided by one of our sponsoring organizations, can work on their own challenges or develop enhancements to an open source PDF extraction tool," according to the site.

While the use of open-source tools is encouraged, commercial tools are allowed as long as licensing costs less than US$1,000 and an unlimited trial is available.

It's true that some of the best tools for PDF extraction are proprietary and expensive, said analyst Curt Monash of Monash Research, who closely tracks the database and data-analysis market as well as public policy on technology.

"One of the leading filter/extraction libraries was bought by Verity, which was bought by Autonomy, which was bought by HP," he said via email on Thursday. "Another one, with a somewhat different orientation, was developed by Xerox, which spun it out as Inxight, which was bought by Business Objects, which was bought by SAP."

"It's worth remembering that there's a multi-stage process here," Monash added. "For example, a PDF can be converted to text (and image) data, (Name, value) pairs can be extracted. Those can have their spelling corrected. Then the company names can be regularized. In real life, there can be tens of steps."
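The multi-stage process Monash describes can be illustrated with a minimal Python sketch. It is not taken from any tool mentioned in this story. It assumes the PDF has already been converted to plain text, and the sample text, the spelling table and the list of company suffixes are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """\
Issuer: Acme Holdings, Inc.
Total Revnue: 1,250,000
Fiscal Year: 2013
"""

# Hypothetical correction table; a real pipeline would need a far larger one.
SPELLING_FIXES = {"revnue": "revenue"}

# Hypothetical list of legal suffixes to strip from company names.
COMPANY_SUFFIXES = re.compile(r",?\s+(inc|corp|llc|ltd)\.?$", re.IGNORECASE)


def extract_pairs(text):
    """Stage 2: pull 'Name: value' lines into (name, value) tuples."""
    pairs = []
    for line in text.splitlines():
        match = re.match(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def correct_spelling(name):
    """Stage 3: lowercase the field name and fix known misspellings."""
    words = [SPELLING_FIXES.get(w.lower(), w.lower()) for w in name.split()]
    return " ".join(words)


def regularize_company(value):
    """Stage 4: strip a trailing legal suffix and uppercase the name."""
    return COMPANY_SUFFIXES.sub("", value).strip().upper()


def run_pipeline(text):
    """Chain the stages over already-extracted text (stage 1 is assumed)."""
    record = {}
    for name, value in extract_pairs(text):
        key = correct_spelling(name)
        if key == "issuer":
            value = regularize_company(value)
        record[key] = value
    return record


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_TEXT))
```

Each stage here is a few lines, but each is also where real documents tend to break the rules, which is consistent with Monash's point that production pipelines can run to tens of steps.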

As for the hackathon's potential value, "a large fraction of the world's interesting information is on paper, or in paper-like formats such as PDF," he added. "Of course it's worthwhile to make all that more accessible."

Chris Kanaracus covers enterprise software and general technology breaking news for The IDG News Service. Chris' email address is Chris_Kanaracus@idg.com
