Downloading trained models and using them for further training

I downloaded the .pth file of my trained model. When I returned to training it (I am working on Colab, had uploaded the file to /content/models, and had already created the learner), I tried loading the state into the learner with:

learn.load('stage_x')

I got the error:

unexpected EOF. The file might be corrupted.

Could you help me with this?

EOF means “End of File” so your file transfer might have been interrupted. Did you download the .pth to your hard drive and later upload it back to Colab, and did you ensure the file was transferred completely? Downloading and uploading stuff in Colab is prone to disconnections.
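One quick way to check whether the transfer completed is to compare the file size and a checksum on both ends. A minimal sketch (the path in the comment is a placeholder for wherever your .pth ended up):

```python
import hashlib
import os

def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, MD5 hex digest) so two copies can be compared."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so large .pth files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()

# Run this on both copies; a truncated upload shows up immediately:
# file_fingerprint('/content/models/stage_x.pth')
```

If either the size or the digest differs between the two machines, the file did not transfer completely.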


I now realised that the file sizes on Colab and on my hard drive do not match.
I also tried copying the file to my Drive, but the sizes do not match there either.
Is there a way to ensure the complete file is downloaded?

The most reliable way in my experience is to copy the file to Google Drive. First you need to mount Drive to Colab with the following lines and verify your Google account.

from google.colab import drive
drive.mount('/content/gdrive')

Then you can browse your Google Drive within Colab and copy the file to your drive using !cp such as:

!cp /content/models/stage_x.pth gdrive/'My Drive'/models/stage_x.pth

To use the model in the next session, mount Google Drive again. You can then either load the model directly from Drive by pointing the learner at it, learn.load('gdrive/My Drive/models/stage_x'), or copy the model back from Drive to your Colab models directory:

!cp gdrive/'My Drive'/models/stage_x.pth /content/models/stage_x.pth
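The same copy can also be done from Python with shutil, which lets you create the target folder first (!cp fails if the destination folder doesn't exist on Drive yet). A sketch with placeholder paths:

```python
import os
import shutil

def copy_model(src, dst):
    """Copy a saved checkpoint, creating the destination folder if needed,
    then verify the two copies have the same size."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(src, dst)
    assert os.path.getsize(src) == os.path.getsize(dst), 'copy was truncated'
    return dst

# e.g. copy_model('/content/models/stage_x.pth',
#                 'gdrive/My Drive/models/stage_x.pth')
```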

Hope this works.


Thanks for your help!! It's working!


Now when I load the trained model from my Drive, I get the error:

size mismatch for 0.encoder.weight: copying a param with shape torch.Size([37944, 400]) from checkpoint, the shape in current model is torch.Size([37912, 400]).
size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([37944, 400]) from checkpoint, the shape in current model is torch.Size([37912, 400]).
size mismatch for 1.decoder.weight: copying a param with shape torch.Size([37944, 400]) from checkpoint, the shape in current model is torch.Size([37912, 400]).
size mismatch for 1.decoder.bias: copying a param with shape torch.Size([37944]) from checkpoint, the shape in current model is torch.Size([37912]).

I tried multiple times; the checkpoint always has 37944 where the current model has 37912. Any particular reason why?
Thanks.

Take a look at learn.model on both machines. Any difference?

I had previously saved that model in a different session, which has now expired, so I cannot compare.
What are the possible reasons for the difference in models?

Hard to tell. Did you create the model architecture yourself or is it generated by fastai? If the latter, are you using the same version of fastai?

Model by fastai. Yup same version of fastai.

Maybe you can recreate the original learner on the first machine (without having to train), and see if the .pth file loads there; if it does, check whether there is any difference between the models.

Hard to find what’s wrong without digging into it.

Where does the 37944/37912 come from? Is that a parameter of the model that fastai decides based on the data? Vocabulary size? Maybe there’s a difference in your data between the two machines?
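For context: in a fastai language model, that first dimension of the embedding is the vocabulary size, and the vocab is built from the training data (typically with a frequency cutoff and a size cap), so even a slightly different data split or shuffle can change it. A toy illustration of the mechanism, using a plain Counter rather than fastai's actual tokenizer:

```python
from collections import Counter

def build_vocab(tokens, min_freq=2, max_vocab=60000):
    """Keep tokens seen at least min_freq times, capped at max_vocab,
    mirroring how text vocabularies are typically built."""
    counts = Counter(tokens)
    return [t for t, c in counts.most_common(max_vocab) if c >= min_freq]

corpus_a = 'the cat sat on the mat the cat sat'.split()
corpus_b = 'the dog ran the dog'.split()

# Different data -> different vocab sizes, hence the embedding size mismatch.
print(len(build_vocab(corpus_a)), len(build_vocab(corpus_b)))  # → 3 2
```

If the checkpoint was trained on data that produced a 37944-token vocab, loading it into a learner whose data produced a 37912-token vocab will fail exactly as shown above.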