eDiscovery and TAR using NLP

Hi fellow members,
I am exploring the use of NLP to classify the relevance of a pool of documents (Word files, PDFs and emails) for legal evidence. The NLP lesson loads its data from CSV files, and I am wondering whether there is an easier way. There could be thousands, if not hundreds of thousands, of documents, and converting them all to CSV seems laborious and very difficult. Is there a simpler way to classify them directly from their raw formats? Can someone enlighten me?
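
To make the question concrete, here is a minimal sketch of one possible direction, not a tested solution. It assumes the documents sit in one folder per label (the folder names `relevant` and `not_relevant` are hypothetical) and walks that tree to build the text and label lists directly, with no CSV step. The helper names `read_text`, `read_email`, `load_corpus` and the `EXTRACTORS` table are invented for illustration. Only plain-text and `.eml` files are handled, using the Python standard library. PDF and Word files would need a third-party text extractor registered in the same table.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def read_text(path):
    """Read a plain-text file, replacing undecodable bytes."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def read_email(path):
    """Return the subject line plus the plain-text body of an .eml file."""
    with open(path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return f"{msg['subject'] or ''}\n{text}"


# Hypothetical dispatch table: file extension -> text extractor.
# Assumption: PDF and Word extractors (third-party) would be added here.
EXTRACTORS = {".txt": read_text, ".eml": read_email}


def load_corpus(root):
    """Walk root/<label>/... and return parallel lists of texts and labels.

    Assumes the top-level folder name under `root` is the label.
    Files directly in `root` or with an unknown extension are skipped.
    """
    texts, labels = [], []
    for path in sorted(Path(root).rglob("*")):
        extractor = EXTRACTORS.get(path.suffix.lower())
        parts = path.relative_to(root).parts
        if path.is_file() and extractor is not None and len(parts) > 1:
            texts.append(extractor(path))
            labels.append(parts[0])
    return texts, labels
```

The two lists returned by `load_corpus` could then be handed to whatever text classifier the lesson uses, in place of the columns it would otherwise read from a CSV.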

