RuntimeError: Error(s) in loading state_dict for Sequential:
size mismatch for 1.8.weight: copying a param with shape torch.Size([618, 512]) from checkpoint, the shape in current model is torch.Size([616, 512]).
size mismatch for 1.8.bias: copying a param with shape torch.Size([618]) from checkpoint, the shape in current model is torch.Size([616]).
Since you are using a random validation set, your vocabulary changed between the two data objects (it's computed on the training set only). That's why you get that mismatch. You should save your data object or the vocabulary you're using.
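A minimal sketch of what "save your vocabulary" means in practice, using plain Python (the `build_vocab` helper and file name are hypothetical, not fastai API): persist the class list computed from the training data, and reload it before rebuilding the model, so the final layer is sized to the same number of classes as the checkpoint.

```python
import json

def build_vocab(tokens):
    # The vocabulary is derived from the *training* data only,
    # so it must be saved alongside the model weights.
    return sorted(set(tokens))

train_tokens = ["cat", "dog", "cat", "bird"]
vocab = build_vocab(train_tokens)

# Persist the vocabulary next to the saved weights.
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# Later / on another machine: reload it before rebuilding the model,
# so the head's output size is len(vocab), matching the checkpoint.
with open("vocab.json") as f:
    restored = json.load(f)

print(len(restored))  # size of the final layer
```

With a random validation split, rebuilding the data object from scratch can shuffle different items into the training set, producing a vocabulary of a different size (618 vs 616 in the error above), which is exactly the head-dimension mismatch reported.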
I have spent a long time training a model and saved it using `learn.save('model_name')`. Since I am using a random validation set and have not saved the data object, is there no option but to start training again from scratch?
Oh please answer this… I've spent days here… what exactly does "save your dataset" mean? I have my dataset in folders… There is nothing in the docs or in the forum… are we really the only ones struggling with this issue? It should be pretty basic…
edit: okay, I may have gotten somewhere with this in this topic. I don't actually have a clue what's going on, though.
Hi, I have encountered the same issue. I trained a large model in Google Colab and am trying to continue where I left off, but I am not able to load the model. Any solution?
I found this solution on another page of the fastai forum. It seems the issue can be solved by adding `remove_module=True`. There seems to be a mismatch between the tensor sizes of the new model I create from resnet18 and the one that I saved. The following code might help:
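If I understand the fastai v1 option correctly, `learn.load('model_name', remove_module=True)` strips the `module.` prefix that `nn.DataParallel` adds to every parameter name, so the checkpoint keys match an unwrapped model again. A framework-free sketch of that same idea, using a toy dict in place of a real state dict:

```python
# Toy checkpoint as saved from a model wrapped in nn.DataParallel:
# every parameter key carries a "module." prefix.
checkpoint = {
    "module.0.weight": [[0.1, 0.2]],
    "module.0.bias": [0.3],
}

def strip_module_prefix(state_dict, prefix="module."):
    # Drop the DataParallel prefix so keys match an unwrapped model.
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

cleaned = strip_module_prefix(checkpoint)
print(sorted(cleaned))  # ['0.bias', '0.weight']
```

Note this only fixes *key-name* mismatches; it does not help when the parameter *shapes* differ, as in the vocabulary-size error above.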
In my case it ended up being incorrect data loading: I was loading unlabeled test data, and the loader assumed there was only one class. So when I initialized the model, it thought the data had a single class and initialized the last layer with the wrong dimension.
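A small sketch of how this happens with folder-per-class loaders (the `infer_classes` helper and directory names are hypothetical): labels are inferred from subdirectory names, so a flat, unlabeled test folder silently yields a single class, and any model head built from it has the wrong output dimension.

```python
import os
import tempfile

def infer_classes(path):
    # Folder-per-class layout: each subdirectory name is a label.
    return sorted(d for d in os.listdir(path)
                  if os.path.isdir(os.path.join(path, d)))

root = tempfile.mkdtemp()
# Labeled training layout: two classes.
os.makedirs(os.path.join(root, "train", "cats"))
os.makedirs(os.path.join(root, "train", "dogs"))
# Unlabeled test layout: everything dumped under one folder.
os.makedirs(os.path.join(root, "test", "unlabeled"))

print(infer_classes(os.path.join(root, "train")))  # ['cats', 'dogs']
print(infer_classes(os.path.join(root, "test")))   # ['unlabeled'] -> one class
```

Checking the inferred class list before building the model makes this failure mode obvious immediately, instead of surfacing later as a `load_state_dict` size mismatch.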
I was facing a similar size-mismatch problem when loading weights (from the same model, learning, and dataset specs in the same notebook, just trained on a different machine), but `remove_module` is no longer a recognized kwarg, and just adding `strict=False` had no effect.
I could go into details of how I traced the error, but TL;DR: what fixed it in this case was just restarting the kernel on Colab and trying again.
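One reason `strict=False` has no effect here: it only tolerates missing or unexpected *keys*, while PyTorch still raises on *shape* mismatches. A framework-free sketch of explicitly dropping mismatched entries before loading (plain dicts of shape tuples stand in for tensors; with real tensors you would compare `tensor.shape` instead):

```python
# Shape tuples stand in for tensors; names echo the error above.
checkpoint = {
    "0.weight":   (512, 512),  # matches the current model
    "1.8.weight": (618, 512),  # vocab-sized layer from the old data object
    "1.8.bias":   (618,),
}
model_state = {
    "0.weight":   (512, 512),
    "1.8.weight": (616, 512),  # current model was built with a smaller vocab
    "1.8.bias":   (616,),
}

def filter_matching(checkpoint, model_state):
    # Keep only parameters whose key exists and whose shape agrees,
    # since strict=False does not skip shape mismatches for you.
    return {k: v for k, v in checkpoint.items()
            if k in model_state and v == model_state[k]}

loadable = filter_matching(checkpoint, model_state)
print(sorted(loadable))  # ['0.weight'] -- the mismatched head is dropped
```

The dropped head would then be left randomly initialized, so this only makes sense as a partial warm start; the real fix for the vocabulary mismatch remains saving and restoring the data object or vocabulary, as noted earlier in the thread.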