How to fix 'Error(s) in loading state_dict for AWD_LSTM' when using fast-ai

I am using the fastai library to train on a sample of the IMDB reviews dataset. I want to use it for sentiment analysis, and I trained the model on a VM by following this tutorial.

I saved the data_lm and data_clas models, then the encoder ft_enc, and after that I saved the classifier learner sentiment_model. I then copied those four files from the VM to my machine, wanting to use the pretrained models to classify sentiment locally.
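For context, the saving on the VM looked roughly like this (following the tutorial; the exact calls and filenames are from memory, so treat this as a sketch — learn_lm and learn_clas stand for the language-model and classifier learners):

# Persist the processed DataBunches (vocab included)
data_lm.save('data_lm.pkl')
data_clas.save('data_clas.pkl')

# Save the fine-tuned encoder from the language-model learner
learn_lm.save_encoder('ft_enc')

# Save the classifier weights
learn_clas.save('sentiment_model')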

This is what I did:

# Use the IMDB_SAMPLE file
path = untar_data(URLs.IMDB_SAMPLE)

# Language model data
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')

# Sentiment classifier model data
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', 
                                       vocab=data_lm.train_ds.vocab, bs=32)

# Build a classifier using the tuned encoder (tuned in the VM)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')

# Load the trained model
learn.load('sentiment_model')

After that, I wanted to use the model to predict the sentiment of a sentence, but when executing this code I ran into the following error:

RuntimeError: Error(s) in loading state_dict for AWD_LSTM:
   size mismatch for encoder.weight: copying a param with shape torch.Size([8731, 400]) from checkpoint, the shape in current model is torch.Size([8888, 400]).
   size mismatch for encoder_dp.emb.weight: copying a param with shape torch.Size([8731, 400]) from checkpoint, the shape in current model is torch.Size([8888, 400]). 

And the Traceback is:

Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/SentAn/mainApp.py", line 51, in <module>
    learn = load_models()
  File "C:/Users/user/PycharmProjects/SentAn/mainApp.py", line 32, in load_models
    learn.load_encoder('ft_enc')
  File "C:\Users\user\Desktop\py_code\env\lib\site-packages\fastai\text\learner.py", line 68, in load_encoder
    encoder.load_state_dict(torch.load(self.path/self.model_dir/f'{name}.pth'))
  File "C:\Users\user\Desktop\py_code\env\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))

So the error occurs when loading the encoder. But I also tried removing the load_encoder line, and the same error then occurred at the next line, learn.load('sentiment_model').

I searched through the fast-ai forum and noticed that others have had this issue too, but I found no solution. In this post the user says it might have to do with different preprocessing, though I couldn't understand why that would happen.

Could anyone please help me with this issue? I have tried many different things but cannot find a way past it. Any help would be appreciated.


I've encountered the same issue. I suspect it's due to the validation split: different data ending up in the training set results in a different vocab after tokenization. You can try keeping the training and validation data the same and see if that works.

However, this brings up a problem with using the fastai library. We can't realistically assume that the vocab for the language model and the classifier will always be the same, can we? Or do we always have to use the exact same training/validation data if we want to load a network we've trained?


I had exactly the same problem. It is extremely frustrating. What is the purpose of the learner.save method at all if learner.load fails because of some discrepancy outside of the learner? It seems like export is the only way to have model persistence from one session to the next.

Isn't there any way to simply lock down the vocabulary that was used when training the model, or even just lock down the exact train/validation split that was used in the original training data bunch? I'm a newcomer to the fast.ai API, and because so many of the implementation details are abstracted away from the user, I don't know how to even begin solving this problem.
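The closest thing I've found so far is saving the whole DataBunch, which seems to carry the vocab and the split with it (a sketch, assuming the fastai v1 save/load_data API works the way the docs suggest):

from fastai.text import *

# Build once, then persist the processed data (vocab and split included)
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
data_lm.save('data_lm_export.pkl')

# In a later session: reload instead of rebuilding, so the vocab and
# train/valid split stay exactly as they were at training time
data_lm = load_data(path, 'data_lm_export.pkl')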


It seems the vocabulary sizes of data_clas and data_lm are different. I guess the problem is caused by different preprocessing used in data_clas and data_lm. To check my guess I simply added

data_clas.vocab.itos = data_lm.vocab.itos

before the following line:

learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)

This fixed the error.
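Putting it together with the snippets from the original post, the workaround looks like this (a sketch, using the same file and encoder names):

data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv',
                                       vocab=data_lm.train_ds.vocab, bs=32)

# Force the classifier data to use exactly the LM vocab
data_clas.vocab.itos = data_lm.vocab.itos

learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)
learn_c.load_encoder('ft_enc')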


Unfortunately I still have the problem even after this :( Has anyone else come up with a solution?

I encountered the same problem when using cross-validation after loading a pretrained language model. Even though I set data_clas.vocab.itos = data_lm.vocab.itos, I get the error "IndexError: list index out of range". Any ideas?

Hello guys, I still have the same problem. If anyone has any idea how to solve it, please share it… @Tendo, @msivanes, @shahnoza

Try doing
data_clas.vocab.stoi = data_lm.vocab.stoi
as well, after
data_clas.vocab.itos = data_lm.vocab.itos
and then create your learner. I haven't come across this error before, but try this and see if it fixes it. Also confirm that
len(data_clas.vocab.itos) == len(data_lm.vocab.itos)


@rajibmondal
I was facing the same error yesterday and finally figured out what the problem is. Posting it here so future learners can reuse it, and so nobody makes the same mistake and gets stuck again.

Issue

Facing "Error(s) in loading state_dict for AWD_LSTM" when creating a text_classifier_learner and loading it with a fine-tuned encoder by calling load_encoder.

Root Cause

  • Check how you are saving the encoder created from the language-modeling task:
    lang_learner.save_encoder should be used instead of lang_learner.save when you are saving the encoder. See the sketch below.
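A minimal sketch of the difference (lang_learner stands for the language-model learner from the tutorial):

# Wrong: saves the full learner state (encoder plus LM head);
# the classifier's load_encoder cannot consume this checkpoint
lang_learner.save('ft_enc')

# Right: saves only the encoder weights, which is exactly what
# load_encoder('ft_enc') on the classifier expects
lang_learner.save_encoder('ft_enc')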

Good work

Having the same issue. I saved my encoder using .save_encoder in a separate script. In a new script I want to build only the classifier, but each time I load the saved encoder I get the error. I've run it a dozen times with the same result.

BUT

Each time the shape of the current model is different, with values ranging from 25528 to 25744, while the saved encoder is 25432. Looks like some random function is at work.

Will report back if I figure it out.

@PCJimmmy - It's hard for me to guess what might be causing the issue. Possible diagnostic steps: a) check the fastai version, b) check whether the seed is initialized properly.

I have the same problem. I set the vocab the same as suggested, but I get "list index out of range".

The answer is to set np.random.seed to the same value before creating the two DataBunches.
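Concretely, something like this (a sketch for v1; 42 is an arbitrary seed):

import numpy as np

np.random.seed(42)
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')

np.random.seed(42)  # same seed again, so the random valid split is identical
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv',
                                       vocab=data_lm.train_ds.vocab, bs=32)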

For v2, make sure that the length of the vocab you're loading is the same as the one you saved for your LM. Check the vocab length in the classification learner with something like len(learn.dls.train.vocab[0]).
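For example (a sketch; learn_lm is assumed to be the v2 language-model learner):

lm_vocab = learn_lm.dls.vocab           # vocab the encoder was trained with
clas_vocab = learn.dls.train.vocab[0]   # text vocab of the classifier dls

# These must match before load_encoder will succeed
assert len(lm_vocab) == len(clas_vocab), (len(lm_vocab), len(clas_vocab))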

I am having the same error:

RuntimeError: Error(s) in loading state_dict for SentenceEncoder:

but I am not loading any saved model or encoder at this stage. In fact, I am only using:

dls = TextDataLoaders.from_csv(path=path,
                               csv_fname='train.csv',
                               text_col='text',
                               label_col='target',
                               valid_pct=0.2)

config = awd_lstm_clas_config.copy()
config['bidir'] = True
learn = text_classifier_learner(dls, AWD_LSTM, config=config, drop_mult=0.5, wd=1e-2, pretrained=True, metrics = F1Score())

The error happens because I use the bidirectional configuration while also setting pretrained=True, presumably because the shipped AWD_LSTM weights are unidirectional and do not fit a bidirectional model.
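So the workaround on my side is to drop the pretrained weights when going bidirectional (a sketch):

config = awd_lstm_clas_config.copy()
config['bidir'] = True

# Train the bidirectional model from scratch instead of trying to load
# the (unidirectional) pretrained weights
learn = text_classifier_learner(dls, AWD_LSTM, config=config,
                                drop_mult=0.5, wd=1e-2,
                                pretrained=False, metrics=F1Score())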

Thank you. Worked perfectly for me. I used

dls_clas.vocab = dls_lm.vocab

then

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()