Importance of Text Pre-Processing

Hello,
Following a discussion with my manager at work, I would love to hear your thoughts.
Since I’m a “follower” of fast.ai, and as explained in the lessons, for each NLP task I do text pre-processing: separating words, removing repeated characters, lowercasing, etc.
My manager claims that all of this is unnecessary, because a large, strong model can learn everything on its own.
So my question is:
Is pre-processing a workaround for a lack of data / computational power, or is it an essential part of DL?

A concrete example:
At work, we extract text from documents with OCR and want to build a classification model.
Should the input be raw text, with the words assembled into sentences, or each word plus its coordinates, on the assumption that the model will learn the sentence structure from there?

Thanks!

If you haven’t already, you may want to look at the fast.ai text preprocessing documentation here to see the special tokens that cover the examples you mention, and more. There is also the option to customise the tokeniser if necessary.
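
For instance, here is a minimal sketch of what those default rules produce (assuming fastai v2’s text API; the exact tokens can vary between releases):

```python
from fastai.text.all import *

# fastai's default rules don't simply discard information: capitalisation
# and character repetition are replaced by special tokens (xxmaj, xxup,
# xxrep, ...) so the model can still learn from them.
tkn = Tokenizer(WordTokenizer())
print(tkn("I LOVED this movie, it was soooo good!!!"))
# roughly: ['xxbos', 'xxmaj', 'i', 'xxup', 'loved', 'this', 'movie', ',',
#           'it', 'was', 's', 'xxrep', '4', 'o', 'good', 'xxrep', '3', '!']
```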

As covered in the new fast.ai NLP course, the trend has certainly moved away from discarding information (for example by removing stop words, stemming, or lemmatizing), since we can now build more complex models.
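
To make the “discarding information” point concrete, here is a small illustration using NLTK’s stock stop-word list and Porter stemmer (the toolkit choice is just for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

words = "the acting was not amazing".split()
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Classic pipeline: drop stop words, then stem what's left.
kept = [stemmer.stem(w) for w in words if w not in stop]
print(kept)  # ['act', 'amaz'] -- "not" is a stop word, so the negation is lost
```

A modern tokeniser keeps that signal in the input and lets the model decide what matters.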