Fine Tuning Language Model vs Adding To The Corpus

At the end of lesson four, @jeremy said we can pre-train on a Wikipedia corpus, then fine-tune the language model on IMDB, and then do sentiment analysis on top of that. My questions are:

What if some words in the IMDB vocabulary do not exist in the Wikipedia vocabulary?

Why don't we add the IMDB comments to the Wikipedia corpus and create a bigger corpus from the beginning, instead of first training on Wikipedia and then fine-tuning on IMDB?

When we want to try this text classification on a small dataset of our own, one too small to train a good language model on, should we add our corpus to an existing big corpus (like Wikipedia or IMDB or something else), or should we first train on the big corpus and then fine-tune the language model on our dataset? And what if some of our words do not exist in the big corpus we used?


Frankly, none of this has really been studied or solved AFAIK. My guess is that it's better to pretrain on a big corpus and then fine-tune on your little one; otherwise your little one will be "drowned out" by the big one.

Handling transfer across different vocabs isn't something we've built into fastai yet, and it's the main issue stopping us from doing this kind of transfer learning. It won't be hard to add, but unless one of the students here in class implements it, I don't think I'll have time until part 2 of the course.
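For what it's worth, one common way to bridge two vocabularies is to build a new embedding matrix for the target vocab: tokens that also appear in the pretrained vocab keep their pretrained row, and unseen tokens get some default initialisation such as the mean of all pretrained vectors. This is only a sketch of that idea, not fastai's API; the function name and the string-to-index / index-to-string structures are illustrative assumptions.

```python
import numpy as np

def transfer_embeddings(old_vecs, old_stoi, new_itos):
    """Map pretrained embedding rows onto a new vocabulary.

    old_vecs: (old_vocab_size, emb_dim) pretrained embedding matrix
    old_stoi: dict mapping old-vocab token -> row index in old_vecs
    new_itos: list of tokens in the new vocabulary, in index order

    Tokens present in the old vocab keep their pretrained row;
    unseen tokens are initialised to the mean pretrained vector.
    (Hypothetical helper for illustration only.)
    """
    mean_vec = old_vecs.mean(axis=0)
    return np.stack([
        old_vecs[old_stoi[tok]] if tok in old_stoi else mean_vec
        for tok in new_itos
    ])

# Toy example: pretrained vocab {"the", "movie"}, target vocab adds "unwatchable"
old_vecs = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
old_stoi = {"the": 0, "movie": 1}
new_itos = ["the", "unwatchable", "movie"]
new_vecs = transfer_embeddings(old_vecs, old_stoi, new_itos)
print(new_vecs)  # "unwatchable" gets the mean vector [0.5, 0.5]
```

After building `new_vecs` you would load it into the model's embedding layer (and the tied output layer, if the LM uses weight tying) before fine-tuning on the small corpus.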

(These are great questions BTW!)