First, with regard to PDF files, the main Python library for opening them is PDFMiner. Several other libraries are essentially wrappers around PDFMiner, including Slate. Slate is significantly simpler to use than PDFMiner, but that simplicity comes at the cost of offering only very basic functionality. I tried Slate first, but it did not perform well on the PDFs I was working with: it did not fully respect the original spacing between words, cutting some words into multiple fragments and concatenating others. I therefore switched to PDFMiner because of its customizability. With the pdf2txt.py command-line utility, PDFMiner initially showed a similar word-spacing problem. However, this turned out to be extremely easy to tune using the word margin option of pdf2txt.py. Specifically, I ran the following on the command line:
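A minimal sketch of this kind of invocation, wrapped in Python so it can be scripted. It assumes pdf2txt.py is on the PATH and uses its -W (word margin) and -o (output file) flags. The margin value, the file names, and the helper names build_pdf2txt_command and pdf_to_text are illustrative placeholders, not the exact settings from my run.

```python
import subprocess


def build_pdf2txt_command(pdf_path, txt_path, word_margin=0.2):
    """Return the argument list for a pdf2txt.py call with a custom word margin."""
    # -W sets the word margin; -o names the output text file.
    # The default of 0.2 is only a placeholder and needs tuning per document.
    return ["pdf2txt.py", "-W", str(word_margin), "-o", txt_path, pdf_path]


def pdf_to_text(pdf_path, txt_path, word_margin=0.2):
    """Run pdf2txt.py; raises CalledProcessError if the tool exits with an error."""
    # Assumes pdf2txt.py (shipped with PDFMiner) is installed and on the PATH.
    subprocess.run(build_pdf2txt_command(pdf_path, txt_path, word_margin), check=True)


# Example call (hypothetical file names):
# pdf_to_text("report.pdf", "report.txt", word_margin=0.2)
```

As far as I understand the option, the word margin controls how large a gap between characters must be before PDFMiner inserts a space, so raising it tends to rejoin fragmented words and lowering it tends to split concatenated ones. The right value depends on the document.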
For Word 2007 .docx files, the Python library that worked well for me was python-docx. I used it from the command line as follows:
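As a rough sketch of the same extraction done from Python, the snippet below assumes the current python-docx API (Document and its paragraphs attribute), which may differ from the interface of older python-docx releases. The helper names join_paragraphs and docx_to_text are hypothetical.

```python
def join_paragraphs(texts):
    """Join paragraph strings into one text, one paragraph per line."""
    return "\n".join(texts)


def docx_to_text(docx_path):
    """Extract the body paragraphs of a .docx file as plain text."""
    # Imported lazily so the module loads even where python-docx is missing.
    # Assumes the current python-docx package (pip install python-docx).
    from docx import Document

    document = Document(docx_path)
    # Only body paragraphs are collected; text inside tables is not included.
    return join_paragraphs(p.text for p in document.paragraphs)


# Example call (hypothetical file name):
# print(docx_to_text("letter.docx"))
```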
For older Word documents (for example, Word 2003 .doc files), the python-docx library does not work, so I ended up using the C-based antiword utility. Originally a Linux utility, antiword (version 0.37) can be installed on Mac OS X as follows:
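Once antiword is installed, by whatever route, a quick check from Python can confirm that the executable is visible on the PATH before any conversion is attempted. This is only a convenience sketch, and the helper name tool_available is hypothetical.

```python
import shutil


def tool_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


# Example check:
# if not tool_available("antiword"):
#     raise RuntimeError("antiword was not found on the PATH")
```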
From within Python, I was then easily able to convert a .doc document to text:
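A minimal sketch of such a conversion, assuming antiword is on the PATH and writes the extracted text to standard output. The helper names and the UTF-8 decoding are assumptions; the actual output encoding depends on how antiword is configured.

```python
import subprocess


def build_antiword_command(doc_path):
    """Return the argument list for converting one .doc file with antiword."""
    return ["antiword", doc_path]


def doc_to_text(doc_path):
    """Run antiword on a .doc file and return its text output as a string."""
    result = subprocess.run(
        build_antiword_command(doc_path),
        check=True,              # raise if antiword exits with an error
        stdout=subprocess.PIPE,  # capture the text antiword prints
    )
    # Decoding as UTF-8 is an assumption; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


# Example call (hypothetical file name):
# text = doc_to_text("old_report.doc")
```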