Best way to convert .doc and .pdf to text?

echan00 · May 21, 2018, 9:45am

I’m looking to start a project relating to classification of legal documents. What are good tools to convert .doc and .pdf files to text?

Since legal documents sometimes have very particular spacing, ordering, and numbering, does anybody have tips for what I should or should not keep in my raw text data?

EDIT: https://github.com/deanmalmgren/textract seems pretty great

Kasianenko · January 25, 2019, 11:31am

Hi Erik,
How is your project going?
I took a look at the tool you suggested, and it looks really good. If the structure of the document is critical for your task, you might add special tokens for tabs and spaces.
It is an interesting question how to represent numbered lists or tables for NLP, and which tokens to use to mark the start and end of paragraphs and so on.
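
To make the special-token idea concrete, here is a minimal sketch of one way it could look. The token names (`<tab>`, `<gap>`, `<para>`, `<nl>`) and the function `mark_structure` are made up for illustration; they do not come from textract or any other library, and whether a model benefits from them depends on the tokenizer and the task.

```python
import re

# Hypothetical token names, chosen only for this example.
TOKENS = {
    "para": "<para>",  # one or more blank lines (paragraph break)
    "tab": "<tab>",    # a literal tab character
    "gap": "<gap>",    # a run of two or more spaces (indentation, column gaps)
    "nl": "<nl>",      # a single line break inside a paragraph
}

# One pass over the text; the alternatives are tried in this order,
# so a blank-line run is seen as a paragraph break before a plain newline.
_STRUCTURE = re.compile(
    r"(?P<para>\n(?:[ \t]*\n)+)"
    r"|(?P<tab>\t)"
    r"|(?P<gap> {2,})"
    r"|(?P<nl>\n)"
)


def mark_structure(text: str) -> str:
    """Replace layout whitespace with explicit tokens, then normalize spacing."""
    marked = _STRUCTURE.sub(lambda m: f" {TOKENS[m.lastgroup]} ", text)
    # Collapse the padding added around tokens into single spaces.
    return " ".join(marked.split())


if __name__ == "__main__":
    sample = "1.\tDefinitions\n\n    (a)  text\nend"
    print(mark_structure(sample))
```

On the sample string this keeps the tab after the clause number, the paragraph break, and the indentation before "(a)" as visible tokens instead of losing them when whitespace is normalized.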
Looking forward to hearing from you.

echan00 · January 25, 2019, 12:09pm

Hi @Kasianenko, my project is going great. I ended up using the pdftotext library and an assortment of regex-heavy custom scripts to process my text data.
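
For anyone landing here later, a rough sketch of what that kind of pipeline can look like. These are not the actual scripts from the project: the cleanup rules and the function names `pdf_to_text` and `clean_extracted_text` are assumptions for illustration, and real legal documents will need their own rules. The extraction step follows the usage shown in the pdftotext README and needs that package installed; the cleanup step uses only the standard library.

```python
import re


def pdf_to_text(path: str) -> str:
    """Extract text with the pdftotext package (imported lazily; must be installed)."""
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
    # Iterating over the PDF object yields one string per page.
    return "\n\n".join(pdf)


# Assumed page-footer shape: a line such as "Page 3" or "Page 3 of 12".
_PAGE_LINE = re.compile(r"^[ \t]*page \d+(?: of \d+)?[ \t]*$\n?", re.M | re.I)
# A word split across a line break, e.g. "Agree-\nment".
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
_TRAILING_WS = re.compile(r"[ \t]+$", re.M)
_MANY_NEWLINES = re.compile(r"\n{3,}")


def clean_extracted_text(text: str) -> str:
    """Apply a few regex cleanups to raw extracted text."""
    text = text.replace("\f", "\n")           # form feeds mark page breaks
    text = _PAGE_LINE.sub("", text)           # drop "Page N of M" lines
    # Caution: this also joins genuine hyphenated words that happen
    # to break across lines, such as "non-\ncompete".
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    text = _TRAILING_WS.sub("", text)         # strip trailing spaces and tabs
    text = _MANY_NEWLINES.sub("\n\n", text)   # keep at most one blank line
    return text.strip()


if __name__ == "__main__":
    raw = "This Agree-\nment is made.   \nPage 1 of 2\n\f\n\n\nSection 2"
    print(clean_extracted_text(raw))
```

Keeping each rule as its own named pattern makes it easier to switch one off when it turns out to damage clause numbering or spacing that matters for classification.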