From lesson 4: Preprocessing of text for the Language model

Prithviraj · January 19, 2018, 1:46pm

Hi, I would like to know which preprocessing steps are applied to the text before the language model data object ‘md’ is created. This is with reference to the sentiment classification task. Any information would be really helpful.

Eagerly awaiting a response to the above query.

wgpubs · January 23, 2018, 7:36pm

I don’t think Jeremy did any pre-processing in Lesson 4.

Perhaps he or @sebastianruder, who co-authored the FitLaM paper with Jeremy, can enlighten us with respect to what, if any, pre-processing would be helpful for such models.

Prithviraj · February 1, 2018, 6:50am

Ok, thanks for the clarification. I was also curious whether multi-class classification is possible. I have a set of documents that belong to 4 different categories, and I was hoping to use this approach to categorise them.

jeremy · February 1, 2018, 12:36pm

The approach in lesson 4 is entirely usable for multiple classes.

Prithviraj · February 8, 2018, 6:23am

Another question I have relates to the format of the input files. I noticed that the text of each review is all on one line (no line feed or carriage return). So if I were to collect a few review samples of my own, do I need to format each of them to be on one line?
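
In case single-line input does turn out to be needed, here is a minimal sketch of how a multi-line review could be collapsed into one line. `collapse_to_one_line` is a hypothetical helper written for illustration, not something from the lesson or from fastai, and whether the lesson 4 pipeline actually requires this is an assumption.

```python
import re


def collapse_to_one_line(text: str) -> str:
    """Hypothetical helper: replace every run of whitespace,
    including \\r\\n and blank lines, with a single space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Great movie.\r\nWould watch again.\n\nLoved the ending."
print(collapse_to_one_line(sample))
# prints: Great movie. Would watch again. Loved the ending.
```

This also collapses tabs and repeated spaces, so it may be worth checking that nothing meaningful is lost if the reviews contain deliberate formatting.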