Continue training a language model on new dataset

Hi all,

I am trying to continue training the language model generated by the lesson-3-imdb notebook on a different dataset. I would like to continue from the fine tuned model that Jeremy trained and not start from a new Ulmfit model. The dataset that I would like to continue training it on is the Yelp customer reviews dataset that comes with fastai as well.

Things seem to go well when I use the entire training dataset, and I believe this is working properly:

data_lmYelp = (TextList.from_csv(path=Path('yelp/yelp_review_polarity_csv/train/'), csv_name='train.csv', cols=1)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))
data_lmYelp.save('data_lmYelp_newnb2.pkl')

data_lmYelp = load_data(path,'train/data_lmYelp_newnb2.pkl', bs)
data_lmYelp.show_batch()

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

and everything seems to work properly. That fine_tuned is the fine_tuned.pth file from the imdb notebook.

However, when I create a smaller version of the dataset, either by creating a new csv with 10,000 samples from the Yelp dataset or by loading a subset of the dataframe into the databunch, I get errors.

So the code is the same as above; I’m just importing a csv with only 10,000 samples instead of the entire training set (I can share the from_df code if that matters, but it’s the same error). After the last line above I get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-63-67473bdf2756> in <module>
      1 learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
----> 2 learn.load('lm/models/fine_tuned')

/usr/local/lib/python3.7/dist-packages/fastai/basic_train.py in load(self, file, device, strict, with_opt, purge, remove_module)
    271             model_state = state['model']
    272             if remove_module: model_state = remove_module_load(model_state)
--> 273             get_model(self.model).load_state_dict(model_state, strict=strict)
    274             if ifnone(with_opt,True):
    275                 if not hasattr(self, 'opt'): self.create_opt(defaults.lr, self.wd)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    828         if len(error_msgs) > 0:
    829             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 830                                self.__class__.__name__, "\n\t".join(error_msgs)))
    831         return _IncompatibleKeys(missing_keys, unexpected_keys)
    832 

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
	size mismatch for 0.encoder.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 1.decoder.weight: copying a param with shape torch.Size([60000, 400]) from checkpoint, the shape in current model is torch.Size([12136, 400]).
	size mismatch for 1.decoder.bias: copying a param with shape torch.Size([60000]) from checkpoint, the shape in current model is torch.Size([12136]).

Does anyone know what is going on?
Thanks!

Sure, you have a different vocab you’re using here. What you should do is save away the original vocab so that, when you make your new databunch, you can specify the vocab to be your old one; this will fix that problem right away. (Look at the downstream task from the LM to the Classifier in the ULMFiT tutorial for an example.)
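
Roughly something like this, just as a sketch (assuming fastai v1, with data_lm being the IMDB databunch from the lesson-3-imdb notebook; the file name is a placeholder):

import pickle
from fastai.text import Vocab

# Save the IMDB vocab (its itos list is all you need) so it can be reused later.
# data_lm is the IMDB databunch from the lesson-3-imdb notebook.
pickle.dump(data_lm.vocab.itos, open('imdb_itos.pkl', 'wb'))

# Later, rebuild a Vocab from it and pass it in when you create the Yelp
# TextList/databunch, so the new model keeps the same 60,000-token embedding
# that the fine_tuned.pth checkpoint expects.
imdb_vocab = Vocab(pickle.load(open('imdb_itos.pkl', 'rb')))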

Thanks @muellerzr, I’ll try that out. Do you know why it wasn’t throwing an error on the entire training set as well, then?

The issue is a model architecture mismatch. This happens at the Learner level, not at the DataLoader level. The DataLoader doesn’t know that it’s continuing from previous work unless you give it a vocab to use; otherwise it’ll just make a new one 🙂

Sorry, I’m definitely missing something here. Wouldn’t the large yelp dataset require a new vocabulary as well? The only difference between them is the size of the dataset. Should I have expected that to crash as well?

With regard to the downstream classifier task, isn’t that only loading the encoder? Would I still be able to continue training the LM with only the encoder loaded as is done in the classifier section?

Thanks for the help

Yes, which means you need to modify the internal language model, because its vocabulary would be wrong. There is a function (it’s been brought up literally twice in the last 48 hours) with which fastai converts the WikiText weights over to weights we can use, and it’s called inside language_model_learner. I think there should be a parameter you can pass in to use a custom model.

The LM is the encoder, or at least the embeddings. So yes, you should be able to, but you again need to transfer your transfer-learned weights into your new model, which language_model_learner does automatically when you point it at the file to use. You can’t use learn.load for this task (or even learn.load_encoder, I believe).
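
For reference, here’s the rough shape of that (an untested sketch; pretrained_fnames expects the weight file and a pickled itos list, named without extensions, sitting in the learner’s models directory, and the file names below are placeholders):

import pickle

# Dump the IMDB itos list next to the fine-tuned weights; adjust the path to
# wherever learn.path/'models' resolves for your data.
pickle.dump(data_lm.vocab.itos, open('models/itos_imdb.pkl', 'wb'))

# pretrained_fnames=(weights, itos) makes language_model_learner convert those
# weights to the new databunch's vocab instead of the WikiText-103 ones.
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3,
                               pretrained_fnames=['fine_tuned', 'itos_imdb'])
learn.unfreeze()  # then continue training the LM on Yelp as usual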

@muellerzr I’m trying to load the old vocabulary but I still seem to be getting the same error. Am I doing this correctly? The file data_lm.pkl is the one generated by the imdb notebook. If I’m not doing this right, can you show me the correct syntax? I can’t seem to find good examples of this.

oldLM = load_data(path=path, file='../data_lm.pkl')
data_lmYelp.vocab = oldLM.vocab
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

and the error I get is the same

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-78-52a22de73057> in <module>
      1 oldLM = load_data(path=path, file='../data_lm.pkl')
      2 data_lmYelp.vocab = oldLM.vocab
----> 3 learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
      4 learn.load('lm/models/fine_tuned')

/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in language_model_learner(data, arch, config, drop_mult, pretrained, pretrained_fnames, **learn_kwargs)
    217             model_path = untar_data(meta[url] , data=False)
    218             fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
--> 219         learn = learn.load_pretrained(*fnames)
    220         learn.freeze()
    221     return learn

/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in load_pretrained(self, wgts_fname, itos_fname, strict)
     80         if 'model' in wgts: wgts = wgts['model']
     81         wgts = convert_weights(wgts, old_stoi, self.data.train_ds.vocab.itos)
---> 82         self.model.load_state_dict(wgts, strict=strict)
     83         return self
     84 

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    828         if len(error_msgs) > 0:
    829             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 830                                self.__class__.__name__, "\n\t".join(error_msgs)))
    831         return _IncompatibleKeys(missing_keys, unexpected_keys)
    832 

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
	size mismatch for 0.encoder.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 1.decoder.weight: copying a param with shape torch.Size([12136, 400]) from checkpoint, the shape in current model is torch.Size([60000, 400]).
	size mismatch for 1.decoder.bias: copying a param with shape torch.Size([12136]) from checkpoint, the shape in current model is torch.Size([60000]).

Ok, so I changed the code to

oldLM = load_data(path=path, file='../data_lm.pkl')
data_lmYelp.vocab.itos = oldLM.vocab.itos
learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

(I added .itos) and now it at least runs. However, all of the text in the yelp dataset has now been replaced with text from the imdb dataset, which is not what I want at all (please correct me if this is what I want).
I am trying to take the pretrained-on-imdb language model and continue training it on the new dataset, the yelp one.
Is this doable in fastai?

@muellerzr Ok, I think I figured out what was throwing me off. When I replace the vocab.itos as shown above, that dictionary gets replaced as you mentioned, so when I then do a show_batch on the new databunch it uses the vocab from the movies dataset and the rows look like entries from the imdb dataset. For example:

[screenshot of show_batch output omitted]

Which looks imdb-ish…
Can I continue training the language model with this by loading it with

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')

?

Yes, it needs the vocab. However (after giving it some thought), if you want it to be similar to how we do WikiText -> IMDB (so you’d do WikiText -> IMDB -> your dataset), there should be a way to pass in a custom pre-trained model when you make the LMLearner, and you’d use that instead of WikiText.

Are you saying that it needs more than vocab.itos (such as doing data_lmYelp.vocab = data_lm.vocab), or is what I did above good enough?

Is that related to the pretrained_fnames argument at https://docs.fast.ai/text.learner.html#language_model_learner? I can’t find much information about what that does. I’ll dig through the code though to see if I can figure it out.

You need to pass in the vocab of the old databunch (in this case IMDB) when creating the new databunch (Yelp) in this way: data_new = (TextList.from_folder(path, vocab=data_old.vocab)...). I think the difference is that rather than replacing the vocab (as data_new.vocab = data_old.vocab would do), it actually aligns the new vocab with the old one and also expands the vocab with tokens that haven’t appeared in your old corpus but are part of your new corpus. Hope this helps.
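
In code it would look roughly like this (a sketch I haven’t run, reusing the csv path, bs and fine_tuned file from your earlier posts, with data_lm.pkl being the IMDB databunch saved by the notebook):

# Build the Yelp databunch with the IMDB vocab so the model is sized to match
# the fine_tuned.pth checkpoint.
data_old = load_data(path=path, file='../data_lm.pkl')      # IMDB databunch

data_lmYelp = (TextList.from_csv(path=Path('yelp/yelp_review_polarity_csv/train/'),
                                 csv_name='train.csv', cols=1,
                                 vocab=data_old.vocab)       # reuse the old vocab
               .split_by_rand_pct(0.1)
               .label_for_lm()
               .databunch(bs=bs))

learn = language_model_learner(data_lmYelp, AWD_LSTM, drop_mult=0.3)
learn.load('lm/models/fine_tuned')   # shapes now match the checkpoint
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)         # e.g. continue training the LM on Yelp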

On a different note, are you sure you actually want to fine-tune your Yelp language model from the IMDB model instead of the original Wikitext model? To me it seems that transferring learned representations from the more general Wikitext model should work better than using the IMDB model, which is very specialized in understanding movie reviews.

That makes a ton of sense; I was wondering how that was going to play out. I tried it out and it looks good for now, and it loads!

Yeah, I’m running a bunch of experiments to figure out the best way to retrain a model on text which is slightly outside of its domain, hence the imdb review -> yelp review setup (kind of similar but not exactly the same). It’s very possible that it won’t do much, and that’s valuable information. But there is also a chance that it will either perform better on the new dataset or perform worse but with quicker training time, which is valuable for my (company’s) use case. I just want to have evidence either way.

That sounds reasonable. I was just thinking that in general for that kind of use case, a language model pre-trained on the much larger and probably more diverse Yelp reviews dataset is probably a better backbone for further fine-tuning downstream models. But of course it depends on your specific objective. Btw, the Amazon reviews dataset might be another good source for your experiments.
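
If it’s useful, it should be downloadable the same way as the Yelp data that ships with fastai (assuming the URLs constants in the fastai v1 datasets module; worth double-checking the exact name in your version):

from fastai.text import *

# Amazon reviews (polarity variant) from the fastai datasets collection.
path_amzn = untar_data(URLs.AMAZON_REVIEWS_POLARITY)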

Thanks! I’ll be sure to check it out as well. Thanks for the help!
Also thanks to @muellerzr for all the guidance!