Extracting only actual sentences from noisy text files, using LM/DL?

msp · May 3, 2018, 2:32pm

I have a project involving text in various formats (scraped HTML and PDF files), and the extracted .txt files are quite messy. They often contain things like menu entries or other headers at the beginning and end, with lots of garbage in between.

Now I would like to ‘clean’ the text by keeping only the parts of each file that look like actual sentences.
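
To make "look like actual sentences" concrete, a purely rule-based baseline might look roughly like the sketch below. It is not an LM or DL approach, just an illustration of the filtering goal. The function names and thresholds (`looks_like_sentence`, `keep_sentences`, `min_words`, `min_alpha_ratio`) are made up for this example, and it assumes English text with one candidate sentence per line.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
def looks_like_sentence(line, min_words=5, min_alpha_ratio=0.7):
    """Crude check: enough words, mostly letters, sentence-like start and end."""
    text = line.strip()
    if len(text.split()) < min_words:
        return False  # also rejects empty lines, menu entries, short headers
    # Share of characters that are letters or spaces; garbage lines score low.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    if alpha / len(text) < min_alpha_ratio:
        return False
    # Require a capitalized start (optionally after a quote or bracket)
    # and terminal sentence punctuation.
    starts_ok = re.match(r"^[\"'(]?[A-Z]", text) is not None
    ends_ok = text.endswith((".", "!", "?"))
    return starts_ok and ends_ok


def keep_sentences(raw_text):
    """Return only the lines of raw_text that pass the heuristic."""
    return [ln.strip() for ln in raw_text.splitlines() if looks_like_sentence(ln)]


if __name__ == "__main__":
    sample = (
        "Home | About | Contact\n"
        "The extracted files are often quite messy and hard to read.\n"
        "#### 12 // 0x3f ---- >>>\n"
        "Login\n"
    )
    for sentence in keep_sentences(sample):
        print(sentence)
```

Rules like these will miss wrapped sentences and wrongly keep well-formed boilerplate, which is the gap a language model score or a trained classifier would be expected to close.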

So I was wondering: has anyone used a language model (LM) or deep learning (DL) approach for this kind of cleaning? If so, any lessons learned?

Cheers!