From lesson 4: Preprocessing of text for the Language model

Hi, I would like to know which preprocessing steps are applied to the text before the language model data object ‘md’ is created. This is with reference to the sentiment classification task. Any information would be really helpful.

Eagerly awaiting a response to the above query.

I don’t think Jeremy did any pre-processing in Lesson 4.

Perhaps he or @sebastianruder, who co-authored the FitLaM paper with Jeremy, can enlighten us with respect to what, if any, pre-processing would be helpful for such models.

OK, thanks for the clarification. I was also curious whether multi-class classification is possible. I have a set of documents that belong to 4 different categories, and I was hoping to use this approach to categorise them.

The approach in lesson 4 is entirely usable for multiple classes.
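To illustrate the general idea (this is not the lesson 4 code): with 4 categories the classifier simply produces one score per category instead of two, and the prediction is the highest-scoring one. A minimal sketch in plain Python, where the category names and the example scores are made up for illustration:

```python
import math

# Hypothetical category names; replace with your own 4 labels.
CATEGORIES = ["finance", "legal", "medical", "sports"]

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_category(scores):
    """Return the most likely category and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]

# Made-up scores standing in for the model's output on one document.
label, confidence = predict_category([0.2, 2.5, -1.0, 0.3])
print(label, round(confidence, 3))
```

Nothing here depends on the number of classes being 2, which is why the same approach carries over to your 4 categories.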

Another question I have is about the format of the input files. I noticed that each review is on a single line (with no line feeds or carriage returns). So if I were to collect a few review samples of my own, would I need to format each of them onto one line as well?
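For reference, collapsing a multi-line review onto one line would be easy enough. Here is a minimal sketch, assuming that one review per line really is what the loader expects (nothing in this thread confirms that); `to_single_line` is just a helper name I made up:

```python
def to_single_line(review: str) -> str:
    """Collapse all line breaks and repeated whitespace into single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # including "\r\n", "\n" and tabs, and drops leading/trailing whitespace.
    return " ".join(review.split())

raw = "Great film.\r\nThe pacing was slow,\nbut the ending paid off.\n"
print(to_single_line(raw))
```

This only normalises whitespace; it does not touch punctuation, casing, or anything else in the text.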