RuntimeError: Error(s) in loading state_dict for Sequential:
size mismatch for 1.8.weight: copying a param with shape torch.Size([618, 512]) from checkpoint, the shape in current model is torch.Size([616, 512]).
size mismatch for 1.8.bias: copying a param with shape torch.Size([618]) from checkpoint, the shape in current model is torch.Size([616]).
Since you are using a random validation set, your vocabulary changed between the two data objects (it's computed on the training set only). That's why you get that mismatch. You should save your data object or the vocabulary you're using.
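A minimal sketch of what "save your vocabulary" means in practice, using plain Python (the `build_vocab` helper and file name are hypothetical, not fastai API): persist the class list computed from the training data, and reload it before rebuilding the model, so the final layer is sized to the same number of classes as the checkpoint.

```python
import json

def build_vocab(tokens):
    # The vocabulary is derived from the *training* data only,
    # so it must be saved alongside the model weights.
    return sorted(set(tokens))

train_tokens = ["cat", "dog", "cat", "bird"]
vocab = build_vocab(train_tokens)

# Persist the vocabulary next to the saved weights.
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# Later / on another machine: reload it before rebuilding the model,
# so the head's output size is len(vocab), matching the checkpoint.
with open("vocab.json") as f:
    restored = json.load(f)

print(len(restored))  # size of the final layer
```

With a random validation split, rebuilding the data object from scratch can shuffle different items into the training set, producing a vocabulary of a different size (618 vs 616 in the error above), which is exactly the head-dimension mismatch reported.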
I have spent a long time training a model and saved it using `learn.save('model_name')`. Since I am using a random validation set and have not saved the data object, is there no option but to start training again from scratch?
Oh please answer this… I've spent days here… what exactly does "save your dataset" mean? I have my dataset in folders… There is nothing in the docs or in the forum… are we really the only ones struggling with this issue? It should be pretty basic…
edit: okay, I may have gotten somewhere with this in this topic. I don't actually have a clue what's going on, though.
Hi, I have encountered the same issue. I trained a large model in Google Colab and am trying to continue where I left off, but I am not able to load the model. Any solution?
I found this solution on another page of the fastai forum. It seems the issue can be solved by adding `remove_module=True`. There seems to be a mismatch between the tensor sizes of the new model I create from resnet18 and the one that I saved. The following code might help:
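If I understand the fastai v1 option correctly, `learn.load('model_name', remove_module=True)` strips the `module.` prefix that `nn.DataParallel` adds to every parameter name, so the checkpoint keys match an unwrapped model again. A framework-free sketch of that same idea, using a toy dict in place of a real state dict:

```python
# Toy checkpoint as saved from a model wrapped in nn.DataParallel:
# every parameter key carries a "module." prefix.
checkpoint = {
    "module.0.weight": [[0.1, 0.2]],
    "module.0.bias": [0.3],
}

def strip_module_prefix(state_dict, prefix="module."):
    # Drop the DataParallel prefix so keys match an unwrapped model.
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

cleaned = strip_module_prefix(checkpoint)
print(sorted(cleaned))  # ['0.bias', '0.weight']
```

Note this only fixes *key-name* mismatches; it does not help when the parameter *shapes* differ, as in the vocabulary-size error above.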
In my case it ended up being incorrect data loading: I was loading unlabeled test data, and the loader assumed there was only one class. So when I initialized the model, it thought the data had a single class and initialized the last layer with the wrong dimension.
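A small sketch of how this happens with folder-per-class loaders (the `infer_classes` helper and directory names are hypothetical): labels are inferred from subdirectory names, so a flat, unlabeled test folder silently yields a single class, and any model head built from it has the wrong output dimension.

```python
import os
import tempfile

def infer_classes(path):
    # Folder-per-class layout: each subdirectory name is a label.
    return sorted(d for d in os.listdir(path)
                  if os.path.isdir(os.path.join(path, d)))

root = tempfile.mkdtemp()
# Labeled training layout: two classes.
os.makedirs(os.path.join(root, "train", "cats"))
os.makedirs(os.path.join(root, "train", "dogs"))
# Unlabeled test layout: everything dumped under one folder.
os.makedirs(os.path.join(root, "test", "unlabeled"))

print(infer_classes(os.path.join(root, "train")))  # ['cats', 'dogs']
print(infer_classes(os.path.join(root, "test")))   # ['unlabeled'] -> one class
```

Checking the inferred class list before building the model makes this failure mode obvious immediately, instead of surfacing later as a `load_state_dict` size mismatch.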
I was facing a similar size-mismatch problem when loading weights (from the same model, learning, and dataset specs in the same notebook, just trained on a different machine), but `remove_module` is no longer a recognized kwarg, and just adding `strict=False` had no effect.
I could go into details of how I traced the error, but TL;DR: what fixed it in this case was just restarting the kernel on Colab and trying again.
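One reason `strict=False` has no effect here: it only tolerates missing or unexpected *keys*, while PyTorch still raises on *shape* mismatches. A framework-free sketch of explicitly dropping mismatched entries before loading (plain dicts of shape tuples stand in for tensors; with real tensors you would compare `tensor.shape` instead):

```python
# Shape tuples stand in for tensors; names echo the error above.
checkpoint = {
    "0.weight":   (512, 512),  # matches the current model
    "1.8.weight": (618, 512),  # vocab-sized layer from the old data object
    "1.8.bias":   (618,),
}
model_state = {
    "0.weight":   (512, 512),
    "1.8.weight": (616, 512),  # current model was built with a smaller vocab
    "1.8.bias":   (616,),
}

def filter_matching(checkpoint, model_state):
    # Keep only parameters whose key exists and whose shape agrees,
    # since strict=False does not skip shape mismatches for you.
    return {k: v for k, v in checkpoint.items()
            if k in model_state and v == model_state[k]}

loadable = filter_matching(checkpoint, model_state)
print(sorted(loadable))  # ['0.weight'] -- the mismatched head is dropped
```

The dropped head would then be left randomly initialized, so this only makes sense as a partial warm start; the real fix for the vocabulary mismatch remains saving and restoring the data object or vocabulary, as noted earlier in the thread.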