Command-line gems: pdftotext

Jul 6, 02:51 pm

Say what you like about PDF files but they’re here to stay. A visit to virtually any government site will prove this. Want a tax form? Want to renew your driving license? You’ll be downloading a PDF file. They’ve become the lingua franca, if you like, of document transfer.

From an open source perspective PDF has good and bad points. It’s good in that it’s an open standard that’s freely implementable, but bad because it remains a proprietary format under the control of Adobe.

What few Linux users realize is the sheer number of PDF tools available at the command line. Need to convert a PDF to HTML, plain text or PostScript? It’s easy when you know how.

Perhaps the simplest and most useful tool is pdftotext, which is used like this:

pdftotext filename.pdf

This will output a .txt file with the same filename as the original, in the current directory.
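If you want more control, pdftotext also accepts an output filename as a second argument, and its -layout option tries to preserve the physical layout of the page. Here is a rough sketch that wraps both in a pair of shell functions. The names pdf_to_txt and pdf_dir_to_txt are made up for this example and are not part of the tool.

```shell
# Convert one PDF to text, writing the result next to the original.
# pdf_to_txt is a hypothetical wrapper name; -layout is a pdftotext option.
pdf_to_txt() {
    src=$1
    out="${src%.pdf}.txt"         # report.pdf -> report.txt
    pdftotext -layout "$src" "$out" && printf '%s\n' "$out"
}

# Convert every .pdf file in a directory (default: the current directory).
pdf_dir_to_txt() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, so skip it
        pdf_to_txt "$f"
    done
}
```

With these loaded into your shell, pdf_to_txt report.pdf converts a single file and pdf_dir_to_txt ~/forms works through a whole directory, printing the name of each text file it creates. To peek at a document without creating a file at all, pass - as the output name and add a page range with -f and -l, as in pdftotext -f 1 -l 3 report.pdf -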

Don’t expect miracles. Images are ignored and complex formatting can fox the converter. The resulting text files nearly always require clean-up of some kind.
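Some of that clean-up can be automated. As a minimal sketch, assuming the usual nuisances are the form-feed characters pdftotext places at page breaks, trailing whitespace and long runs of blank lines, a small filter like this takes care of them. The name clean_text is hypothetical, and your documents may well need different rules.

```shell
# clean_text: hypothetical helper that reads text on stdin
# and writes a tidied copy to stdout.
clean_text() {
    tr -d '\f' |                  # drop the form feeds found at page breaks
    sed 's/[[:space:]]*$//' |     # strip trailing whitespace from every line
    awk 'NF { blank = 0; print; next } !blank++'   # keep one blank line per run
}
```

Because pdftotext writes to standard output when the output name is -, the two chain together: pdftotext filename.pdf - | clean_text > filename.txt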

However, using pdftotext is definitely easier than copying and pasting straight from the original document, particularly since Acrobat Reader has a nasty habit of inserting a paragraph break at the end of every line of pasted text.

pdftohtml is slightly more advanced. It will convert PDF files to HTML and attempt to carry across elementary formatting as well as images. Using the -c command switch, which activates “complex document” mode, is a good idea:

pdftohtml -c filename.pdf

Again, it’s not perfect, and don’t think that you can simply create a batch job to convert PDF files before throwing them online. But if you need to quickly make a PDF file more accessible, it’s a good choice.

There are other PDF conversion tools you can experiment with, and most come installed by default on Linux.



    1. Interesting post. Two other PDF-related applications I use are the PDF Download extension, which lets you choose between viewing and downloading a PDF from within Firefox, and PDFCreator, which lets you easily save Word documents in PDF format.



    1. I forgot to mention that there’s a Win32 port of pdftohtml available. It even has a GUI for those of you who like the finer things in life. Here’s the URL:

      http://guiguy.wminds.com/downloads/pdf2htmlgui/



    1. The real benefit of software like this is in extracting text for storage in an index such as Lucene. None of the open source indexing systems can index these complex binary files on their own, so these little helper programs come in really handy. It would be really nice if there were a single library out there that could extract text from the entire suite of Microsoft file types as well as PDF, and perhaps a few others, such as the header info in images.



