The Easy Path
So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.
If your PDF is just filled with text, this becomes really easy:
pdftotext pdfname.pdf
You can find pdftotext for most operating systems.
How do you know that it's just text? If you open it up in Acrobat/Preview/XPDF/etc and can highlight the text, then pdftotext should work fine.
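The highlight test can also be approximated from the command line. This is a minimal sketch, assuming pdftotext is on the PATH; the function name has_text_layer is made up for illustration. It asks pdftotext to write to standard output (the `-` argument) and checks whether anything other than whitespace comes back.

```shell
# has_text_layer FILE: succeed if pdftotext extracts any non-whitespace text.
# Hypothetical helper name; "-" tells pdftotext to write to standard output.
has_text_layer() {
    # Page breaks come out as form feeds, which count as whitespace here.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Something like `has_text_layer pdfname.pdf && pdftotext pdfname.pdf` would then only extract when there is text to extract. A scanned PDF with a poor hidden text layer could still pass this check, so treat it as a rough filter.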
But if you can't do that, then the author probably made an image and embedded it in a PDF file. You then have to use OCR, which gives you output that isn't always right. A Google-sponsored tool called [tesseract][] does a good job with this OCR stuff. I remember that it used to stink, but it doesn't anymore. Simply:
tesseract pdfname.pdf textpat
That will try to do an OCR scan of pdfname.pdf and save the text of every page into a file called textpat.txt.
But, of course, the path isn't always easy.
The Long and Winding Road
which have to be typed in. Lucky me. We have a scanner on-site and I asked if it does OCR, and I was told that it doesn't. I'm even getting luckier.
But I've parsed PDFs before. I should be able to handle it.
I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were. . . disappointing:
$ tesseract pdfname.pdf out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Unsupported image type.
So a quick Google search shows that either tesseract doesn't have the right libraries installed, or the PDF wasn't well-formed. Since tesseract told me it found [Leptonica][], I have to assume the proper libraries are there. So our scanner is making improper PDFs. This is great.
After some googling and head scratching, I discovered that tesseract works very well on TIFF files. I used Preview to export the PDF to a TIFF and -- success!
$ tesseract pdfname.tiff out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
Page 1
Page 2
$ ls out*
out.txt
OK, I didn't want to open all of these files in Preview. How could I convert them from the command line? Well, the first tool to think of is convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting TIFF file had horrid resolution, which made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX, though most people haven't heard of it. [The usage is a bit arcane][] but it uses the OSX libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box -- except that it doesn't handle multi-page PDFs. Ugh.
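For reference, the sips call for this kind of conversion looks roughly like the sketch below. The wrapper name pdf_first_page_to_tiff is invented for illustration, and this is not necessarily the exact command I ran. It only makes sense on OSX, and as noted it only covers the first page.

```shell
# pdf_first_page_to_tiff IN.pdf OUT.tiff
# Hypothetical wrapper around the OSX sips tool; only the first page of a
# multi-page PDF ends up in the output.
pdf_first_page_to_tiff() {
    sips -s format tiff "$1" --out "$2"
}
```

Calling `pdf_first_page_to_tiff pdfname.pdf pdfname.tiff` would write a TIFF that tesseract can read directly.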
How does one break up a PDF into pages? More googling, and I found [pdftk][], which is a little Swiss army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:
$ pdftk pdfname.pdf burst
Unhandled Java Exception:
java.lang.NullPointerException
at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)
That's not good. A few searches showed someone else with that same problem. The cause? A bad PDF, of course! The very thing that started me down this path! I could still extract the PDF a page at a time . . . but that seemed like a bad approach to me.
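If going a page at a time did turn out to be necessary, it might look something like this sketch. The function names and the page_NNN.pdf naming are made up for illustration. It leans on pdftk's dump_data output for the page count and on the cat operation to copy one page per call. I have not checked whether dump_data survives the same broken PDFs that crash burst.

```shell
# pdf_page_count FILE: print the page count reported by pdftk's dump_data.
pdf_page_count() {
    pdftk "$1" dump_data | awk '/^NumberOfPages:/ { print $2 }'
}

# split_pages FILE OUTDIR: write OUTDIR/page_001.pdf, page_002.pdf, ...
# (hypothetical naming scheme), one pdftk call per page.
split_pages() {
    local src="$1" outdir="$2" total page=1
    total=$(pdf_page_count "$src")
    [ -n "$total" ] || return 1          # no page count, nothing to do
    mkdir -p "$outdir"
    while [ "$page" -le "$total" ]; do
        pdftk "$src" cat "$page" output \
            "$(printf '%s/page_%03d.pdf' "$outdir" "$page")" || return 1
        page=$((page + 1))
    done
}
```

Running `split_pages pdfname.pdf pages` would leave one single-page PDF per page in the pages directory.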
OK, time to refocus. I thought, "What am I trying to accomplish?" And that was converting the broken PDFs to TIFFs so I can run tesseract. So let's focus back on the PDF->TIFF part. I did more searching and found [a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract][], where someone posted a nice recipe for using Ghostscript:
/usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
-g6120x7920 -sCompression=lzw in.pdf
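The two numeric options in that recipe go together: -r is the resolution in dots per inch and -g is the page size in pixels. 6120x7920 at 720 dpi works out to 8.5 x 11 inches, so the recipe appears to assume US Letter paper. A small sketch for recomputing the geometry at another resolution, with a made-up function name and Letter size as the stated assumption:

```shell
# letter_geometry DPI: print WIDTHxHEIGHT in pixels for 8.5in x 11in paper.
# Assumes US Letter; other paper sizes need different multipliers.
letter_geometry() {
    local dpi="$1"
    # 8.5 inches is 17/2, which keeps everything in integer arithmetic.
    echo "$((dpi * 17 / 2))x$((dpi * 11))"
}
```

`letter_geometry 720` prints 6120x7920, matching the recipe, and `letter_geometry 300` prints 2550x3300, which could be passed as `-r300x300 -g2550x3300`.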
And I got a TIFF file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this TIFF -- much longer than the one from Preview. Most of that processing time was spent on the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:
pdftk pdfname.pdf cat 2-end output nocover.pdf
So that makes another PDF from the second page on (these PDFs have a variable number of pages).
Running the PDF->TIFF conversion command on nocover.pdf gave some errors. But then I ran tesseract on the resulting TIFF file and I had no problems.
Just for fun, I ran tesseract on the nocover.pdf that pdftk created -- same error as the first time. I figured as much but it was worth a shot.
So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:
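# Strip the directory and the .pdf extension to get a bare name
name=$(basename "$1" .pdf)
pdf="nocover/$name.pdf"
tiff="tiffs/$name.tiff"
text="extracted/$name"
# Make sure the output directories exist
mkdir -p nocover tiffs extracted
# Drop the cover page, render to TIFF, then OCR (tesseract appends .txt)
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"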
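Since the script takes one PDF at a time, handling a whole pile of scans needs a loop around it. Here is a minimal sketch of one. The function name ocr_all is made up, and the script path passed to it stands for wherever you saved the script above. It keeps going when one file fails and reports failure at the end.

```shell
# ocr_all SCRIPT DIR: run SCRIPT once per *.pdf in DIR, continuing past
# failures. SCRIPT is a placeholder for wherever the script above was saved.
ocr_all() {
    local script="$1" dir="$2" pdf failed=0
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # no matches: the glob stays literal
        "$script" "$pdf" || { echo "failed: $pdf" >&2; failed=$((failed + 1)); }
    done
    # Succeed only if every PDF was processed without error.
    [ "$failed" -eq 0 ]
}
```

For example, `ocr_all ./ocr-pdf.sh scans` would process every PDF in a scans directory, where both names are placeholders.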
And that, my dear readers, is how to put a PDF through an OCR process.
[a StackOverflow entry that talked about the problem I had with