Continue training a language model on new dataset

Hi all,

I am trying to continue training the language model generated by the lesson-3-imdb notebook on a different dataset. I would like to continue from the fine tuned model that Jeremy trained and not start from a new Ulmfit model. The dataset that I would like to continue training it on is the Yelp customer reviews dataset that comes with fastai as well.

Things seem to go well when I use the entire training dataset, and I believe this is working properly:

data_lmYelp = (TextList.from_csv(path=Path('yelp/yelp_review_polarity_csv/train/'), csv_name='train.csv', cols=1)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))
data_lmYelp.save('data_lmYelp_newnb2.pkl')

data_lmYelp = load_data(path,'train/data_lmYelp_newnb2.pkl', bs)
data_lmYelp.show_batch()

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

and everything seems to work properly. That fine_tuned is the fine_tuned.pth file from the imdb notebook.

However, when I create a smaller version of the dataset, either by creating a new csv with 10,000 samples from the Yelp dataset or by loading a subset of the dataframe into the databunch, I get errors.

So the code is the same as above; I’m just importing a csv with only 10,000 samples instead of the entire training set (I can share the from_df code if that matters, but it’s the same error). After the last line above I get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-63-67473bdf2756> in <module>
      1 learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
----> 2 learn.load('lm/models/fine_tuned')

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in load(self, file, device, strict, with_opt, purge, remove_module)
    271             model_state = state['model']
    272             if remove_module: model_state = remove_module_load(model_state)
--> 273             get_model(self.model).load_state_dict(model_state, strict=strict)
    274             if ifnone(with_opt,True):
    275                 if not hasattr(self, 'opt'): self.create_opt(defaults.lr, self.wd)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    828         if len(error_msgs) > 0:
    829             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 830                                self.__class__.__name__, "\n\t".join(error_msgs)))
    831         return _IncompatibleKeys(missing_keys, unexpected_keys)
    832 

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
	size mismatch for 0.encoder.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 1.decoder.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 1.decoder.bias: copying a param with shape torch.Size([60000]) from checkpoint, the shape in current model is torch.Size([12136]).

Does anyone know what is going on?
Thanks!

Sure, you have a different vocab you’re using here. What you should do is save away the original vocab so that, when you make your new databunch, you can specify the vocab to be your old one; this will fix that problem right away. (Look at the downstream task from the LM to the Classifier in the ULMFiT tutorial for an example.)
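
Roughly something like this, just as a sketch (assuming fastai v1, with data_lm being the IMDB databunch from the lesson-3-imdb notebook; the file name is a placeholder):

import pickle
from fastai.text import Vocab

# Save the IMDB vocab (its itos list is all you need) so it can be reused later.
# data_lm is the IMDB databunch from the lesson-3-imdb notebook.
pickle.dump(data_lm.vocab.itos, open('imdb_itos.pkl', 'wb'))

# Later, rebuild a Vocab from it and pass it in when you create the Yelp
# TextList/databunch, so the new model keeps the same 60,000-token embedding
# that the fine_tuned.pth checkpoint expects.
imdb_vocab = Vocab(pickle.load(open('imdb_itos.pkl', 'rb')))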

Thanks @muellerzr, I’ll try that out. Do you know why it wasn’t throwing an error on the entire training set as well, then?

The issue is a model architecture mismatch. This happens at the Learner level, not at the DataLoader level. The DataLoader doesn’t know that it’s continuing from previous work unless you give it a vocab to use; otherwise it’ll just make a new one 🙂

Sorry, I’m definitely missing something here. Wouldn’t the large yelp dataset require a new vocabulary as well? The only difference between them is the size of the dataset. Should I have expected that to crash as well?

With regard to the downstream classifier task, isn’t that only loading the encoder? Would I still be able to continue training the LM with only the encoder loaded as is done in the classifier section?

Thanks for the help

Yes, which means you need to modify the internal language model, because its vocabulary would be wrong. There is a function (it’s been brought up literally twice in the last 48 hours) with which fastai converts the WikiText weights over to weights we can use, and it’s called inside language_model_learner. I think there should be a parameter you can pass in to use a custom model.

The LM is the encoder, or at least the embeddings. So yes, you should be able to, but you again need to transfer your transfer-learned weights into your new model, which language_model_learner does automatically when you point it at the file to use. You can’t use learn.load for this task (or even learn.load_encoder, I believe).
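
For reference, here’s the rough shape of that (an untested sketch; pretrained_fnames expects the weight file and a pickled itos list, named without extensions, sitting in the learner’s models directory, and the file names below are placeholders):

import pickle

# Dump the IMDB itos list next to the fine-tuned weights; adjust the path to
# wherever learn.path/'models' resolves for your data.
pickle.dump(data_lm.vocab.itos, open('models/itos_imdb.pkl', 'wb'))

# pretrained_fnames=(weights, itos) makes language_model_learner convert those
# weights to the new databunch's vocab instead of the WikiText-103 ones.
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3,
                               pretrained_fnames=['fine_tuned', 'itos_imdb'])
learn.unfreeze()  # then continue training the LM on Yelp as usual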

@muellerzr I’m trying to load the old vocabulary but I still seem to be getting the same error. Am I doing this correctly? The file data_lm.pkl is the one generated by the imdb notebook. If I’m not doing this right, can you show me the correct syntax? I can’t seem to find good examples of this.

oldLM = load_data(path=path, file='../data_lm.pkl')
data_lmYelp.vocab = oldLM.vocab
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

and the error I get is the same

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-78-52a22de73057> in <module>
      1 oldLM = load_data(path=path, file='../data_lm.pkl')
      2 data_lmYelp.vocab = oldLM.vocab
----> 3 learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
      4 learn.load('lm/models/fine_tuned')

/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in language_model_learner(data, arch, config, drop_mult, pretrained, pretrained_fnames, **learn_kwargs)
    217             model_path = untar_data(meta[url] , data=False)
    218             fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
--> 219         learn = learn.load_pretrained(*fnames)
    220         learn.freeze()
    221     return learn

/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in load_pretrained(self, wgts_fname, itos_fname, strict)
     80         if 'model' in wgts: wgts = wgts['model']
     81         wgts = convert_weights(wgts, old_stoi, self.data.train_ds.vocab.itos)
---> 82         self.model.load_state_dict(wgts, strict=strict)
     83         return self
     84 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    828         if len(error_msgs) > 0:
    829             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 830                                self.__class__.__name__, "\n\t".join(error_msgs)))
    831         return _IncompatibleKeys(missing_keys, unexpected_keys)
    832 

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
	size mismatch for 0.encoder.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 1.decoder.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 1.decoder.bias: copying a param with shape torch.Size([12136]) from checkpoint, the shape in current model is torch.Size([60000]).

Ok, so I changed the code to

oldLM = load_data(path=path, file='../data_lm.pkl')
data_lmYelp.vocab.itos = oldLM.vocab.itos
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

(I added .itos) and now it at least runs. However, all of the text in the yelp dataset has now been replaced with text from the imdb dataset, which is not what I want at all (please correct me if this is what I want).
I am trying to take the pretrained-on-imdb language model and continue training it on the new dataset, the yelp one.
Is this doable in fastai?

@muellerzr Ok, I think I figured out what was throwing me off. When I replace the vocab.itos as shown above, that dictionary gets replaced as you mentioned, so when I then do a show_batch on the new databunch it uses the vocab from the movies dataset and the rows look like entries from the imdb dataset. For example:

[screenshot of show_batch output omitted]

Which looks imdb-ish…
Can I continue training the language model with this by loading it with

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

?

Yes, it needs the vocab. However (after giving it some thought), if you want it to be similar to how we do WikiText -> IMDB (so you’d do WikiText -> IMDB -> your dataset), there should be a way to pass in a custom pre-trained model when you make the LMLearner, and you’d use that instead of WikiText.

Are you saying that it needs more than vocab.itos (such as doing data_lmYelp.vocab = data_lm.vocab), or is what I did above good enough?

Is that related to the pretrained_fnames argument at https://docs.fast.ai/text.learner.html#language_model_learner? I can’t find much information about what that does. I’ll dig through the code though to see if I can figure it out.

You need to pass in the vocab of the old databunch (in this case IMDB) when creating the new databunch (Yelp) in this way: data_new = (TextList.from_folder(path, vocab=data_old.vocab)...). I think the difference is that rather than replacing the vocab (as data_new.vocab = data_old.vocab would do), it actually aligns the new vocab with the old one and also expands the vocab with tokens that haven’t appeared in your old corpus but are part of your new corpus. Hope this helps.
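
In code it would look roughly like this (a sketch I haven’t run, reusing the csv path, bs and fine_tuned file from your earlier posts, with data_lm.pkl being the IMDB databunch saved by the notebook):

# Build the Yelp databunch with the IMDB vocab so the model is sized to match
# the fine_tuned.pth checkpoint.
data_old = load_data(path=path, file='../data_lm.pkl')      # IMDB databunch

data_lmYelp = (TextList.from_csv(path=Path('yelp/yelp_review_polarity_csv/train/'),
                                 csv_name='train.csv', cols=1,
                                 vocab=data_old.vocab)       # reuse the old vocab
               .split_by_rand_pct(0.1)
               .label_for_lm()
               .databunch(bs=bs))

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')   # shapes now match the checkpoint
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)         # e.g. continue training the LM on Yelp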

On a different note, are you sure you actually want to fine-tune your Yelp language model from the IMDB model instead of the original Wikitext model? To me it seems that transferring learned representations from the more general Wikitext model should work better than using the IMDB model, which is very specialized in understanding movie reviews.

That makes a ton of sense; I was wondering how that was going to play out. I tried it out and it looks good for now, and it loads!

Yeah, I’m running a bunch of experiments to figure out the best way to retrain a model on text which is slightly outside of its domain, hence the imdb review -> yelp review setup (kind of similar but not exactly the same). It’s very possible that it won’t do much, and that’s valuable information. But there is also a chance that it will either perform better on the new dataset or perform worse but with quicker training time, which is valuable for my (company’s) use case. I just want to have evidence either way.

That sounds reasonable. I was just thinking that in general for that kind of use case, a language model pre-trained on the much larger and probably more diverse Yelp reviews dataset is probably a better backbone for further fine-tuning downstream models. But of course it depends on your specific objective. Btw, the Amazon reviews dataset might be another good source for your experiments.
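
If it’s useful, it should be downloadable the same way as the Yelp data that ships with fastai (assuming the URLs constants in the fastai v1 datasets module; worth double-checking the exact name in your version):

from fastai.text import *

# Amazon reviews (polarity variant) from the fastai datasets collection.
path_amzn = untar_data(URLs.AMAZON_REVIEWS_POLARITY)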

Thanks! I’ll be sure to check it out as well. Thanks for the help!
Also thanks to @muellerzr for all the guidance!