Extracting specific information from documents: NLP

I have been exploring techniques for extracting specific information of interest from documents, for example:

  • Extract important clauses from a legal document
  • Understand the R&D budget from a company’s annual report
  • etc

Key phrase extraction and document summarization won't help here, because neither lets me selectively pick certain parts out of the whole document. I am not sure what this task is called. Has anyone worked in this space, or could you point me to resources?


First, extract the text from the documents.
Apache Tika is the way to go for that step.
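
To make the idea concrete, here is a minimal sketch of that first step followed by a very naive way to pull out only the passages of interest. It assumes the `tika` Python bindings (installed with `pip install tika`, and needing a Java runtime). The helper names `extract_text` and `find_passages`, the keyword list, and the sample text are all made up for illustration. Keyword matching is only a stand-in for whatever selection method you end up using.

```python
import re


def extract_text(path):
    """Extract plain text from a file using the tika-python bindings.

    Assumption: the `tika` package and a Java runtime are available.
    """
    from tika import parser  # imported lazily so the rest runs without Tika

    parsed = parser.from_file(path)
    # "content" can be None for files with no extractable text.
    return parsed.get("content") or ""


def find_passages(text, keywords):
    """Return the sentences that mention any keyword (case-insensitive)."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(
        "|".join(re.escape(k) for k in keywords), re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]


# Hypothetical sample standing in for text extracted from an annual report.
sample = (
    "The company grew revenue this year. "
    "Research and development spending was 12 million dollars. "
    "The board met four times."
)
hits = find_passages(sample, ["research and development", "R&D"])
print(hits)
```

On a real file you would call `find_passages(extract_text("report.pdf"), [...])`, where `report.pdf` is a placeholder path. The sentence splitter is deliberately crude and will likely need replacing for legal text with many abbreviations and numbered clauses.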