Scraping and tabling

Today I was reading an excellent post called Get Started With Scraping – Extracting Simple Tables from PDF Documents by Tony Hirst (@psychemedia). Now being a big fan of Apache Tika which I use a lot at work, I immediately tweeted him and suggested it as a possible tool (he’d used scraperwiki).

Over a pot of tea I thought I’d give it a go and see if it was actually was easier. However, given his source pdf from Varner-Gruppen the problem I hadn’t recognised was that Tika returns just data, rather than formatting. That’s not usually a problem for us, we deal with blocks of text in paragraphs, so vertical order is usually our issue. However, it’s a problem if the data has a horizontal, visual order, which isn’t part of the actual text. For example, if I have a table in a pdf that’s expressed via the layout:

ColumnA      ColumnB                ColumnC
TextA        TextB which is longer  TextC

And I extracted the data via Tika, what I’ll get is something like this:

ColumnA ColumnB ColumnC
TextA TextB which is longer TextC

Now, where does the ColumnC data start? at ‘which’, or ‘longer’? Without the format, there isn’t enough information to work out where the split is.

Enter my second favourite tool for this sort of thing, pdfToText. What this has, that Tika doesn’t, is a brilliant -layout option that pads the extracted text in an approximation of how it originally looked. As a test we can let it work it’s magic Tony’s original example file :-

pdftotext -layout Fabrikklister_VarnerGruppen_2013.pdf tmp.txt

If I pull the resulting tmp.txt file into TextMate and turn on hidden characters then all the spaces that approximate the layout are apparent as little dots.

Untitled

Looking at this it’s obvious that although there are spaces within the text, inter-field there are at least 3 spaces. Next step then is to replace them with something more useful, say TAB. That’s pretty simple in a Unix-like OS as on my MacBook and there’s a few ways to go – sed is my choice of tool here. While I’m there, I’ll trim off the top few lines above the table with tail as well:

Note. On a Mac, cmdline programs working on unicode files can behave rather weirdly. The LANG=C fixes it, don’t ask me how.

tail -n +5 tmp.txt | LANG=C sed 's/ \{3,\}/*/g'

The sed command looks for all occurrences of at least three spaces, i.e. s/{3,} (without escaping), and replaces them with a star *. Now the text looks like this:

Untitled

We’re pretty much done in two lines of code. At this point the file could be imported into SQL, a spreadsheet etc without more than a little cleaning up of things like the repeated table headers and the odd place where the data has overflowed onto a second line in the original pdf. The morals here are probably, don’t tweet off the cuff and different tools for different purposes.

Advertisements