How can I load a pretrained model on Kaggle using fastai?

I struggled with this for a while, so here are some tips for Kaggle kernels, where you don’t have internet access to download pretrained weights:

  1. Under draft-environment in your kernel, click the “add-data” button and search for the relevant PyTorch model. For example, I wanted to use ResNet-50, so I added the ResNet-50 PyTorch (not the Keras) model to my kernel (click “Add”).
  2. This will give you a new ResNet-50 directory with the *.pth weight file inside it. Now you need to copy the *.pth file into the torch model directory using the same filename torch is looking for. One way to find that filename is to run the model without copying anything. I got an error like:

failed download to /tmp/.torch/models/resnet50-19c8e357.pth

  3. So we need to copy the resnet50.pth file from the ResNet50 directory that was just added into the directory where the download failed. My resnet50 file was in:

../input/resnet50/resnet50.pth

  4. Therefore, run a copy command that takes the file you have and puts it where torch is looking for it, using the same model-sha_hash naming convention.

!cp ../input/resnet50/resnet50.pth /tmp/.torch/models/resnet50-19c8e357.pth

  5. Remember that your …/input directory is read-only, and the models are going to be changed during the learning process, so you need to go up one level when creating your learner:
    ./ : put path here
    ./input : read-only, don’t put path here
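
The copy step can also be done from Python instead of with shell commands. The following is only a minimal sketch, not part of fastai or PyTorch: `target_from_error` and `stage_weights` are hypothetical helpers, and the parsing assumes the error text has the same "failed download to <path>" form as the message above.

```python
import re
import shutil
from pathlib import Path


def target_from_error(message: str) -> Path:
    """Extract the path torch tried to download to from the error text.

    Assumes the message looks like 'failed download to <path>'.
    """
    match = re.search(r"failed download to (\S+)", message)
    if match is None:
        raise ValueError(f"no download path found in: {message!r}")
    return Path(match.group(1))


def stage_weights(src: Path, target: Path) -> Path:
    """Copy the dataset's weight file to the exact path torch expects."""
    # The cache directory may not exist yet in a fresh kernel.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, target)
    return target
```

Calling `stage_weights(Path("../input/resnet50/resnet50.pth"), target_from_error("failed download to /tmp/.torch/models/resnet50-19c8e357.pth"))` would mirror the `!cp` command in the list, with the directory created first if it is missing.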

Thanks a lot! I will just mention that, in order to change the learner’s directory, one should add the model_dir kwarg to the create_cnn function, so it looks like this:
learn = create_cnn(data, models.resnet34, metrics=error_rate, model_dir='/tmp/models')

There’s a Kaggle dataset for wt103:
https://www.kaggle.com/mnpinto/fastai-wt103-1
But when I used the parameter fnames="…/input/fastai-wt103-1/wt103-1.tgz" for language_model_learner, it reported “FileNotFoundError: [Errno 2] No such file or directory: ‘…/models/…pkl’”. How can I deal with that?

One thing to note is that `/tmp/.torch/models/` has to exist, so here is what I do:

!mkdir -p /tmp/.torch/models
!cp /kaggle/input/resnet50/resnet50.pth /tmp/.torch/models/resnet50-19c8e357.pth
learn = cnn_learner(data, models.resnet50, path='/kaggle/working/', ....)

Thanks! This was very helpful. I wonder if it is worth updating the learner code to not force download a model if we are giving it a path to a local model.

!cp /kaggle/input/resnet34/resnet34.pth /tmp/.cache/torch/checkpoints/resnet34333f7ec4.pth

learn = cnn_learner(data, models.resnet34, metrics=error_rate, 
model_dir = Path('..input/working'),
                   path = Path(".")) 

The files seem to copy and everything runs fine until I try to commit in a Kaggle script kernel (which attempts to run the whole file in one go). Then I get an error because cnn_learner is attempting to download resnet.

Any help is appreciated. Thank you

This does not work for me, I am working in Kaggle and it is still trying to download the model. Can you please guide me?

@amyku Did you find a solution to your problem? I am having the same problem. I am able to run the kernel successfully, but once I try and commit it, it attempts to download the resnet101 model, even though I have saved it in /kaggle/working/models/ folder.

learn = cnn_learner(data, models.resnet101, metrics=[error_rate, accuracy], model_dir="/kaggle/working/models")

Yes, it eventually worked for me.

First try restarting the kernel. It’s probably better not to save the model in working since it’s temporary and I think the content is deleted after your kernel session ends. Take a look at what I did in this link

https://www.kaggle.com/aminyakubu/aptos-2019-blindness-detection-fast-ai
You can simply go to Line 3

@amyku Thanks for the fast response. I followed your method in the kernel, and I managed to commit it successfully.

Thanks again!

Hi,
maybe my approach will be useful for someone.

  1. Download the weights to input (…/input/resnet34…)

import torch
from torchvision.models import resnet34

def my_resnet(pretrained=False, progress=True, **kwargs):
    # Build the architecture without downloading, then load the local weights.
    m = resnet34(pretrained=False, progress=True, **kwargs)
    m.load_state_dict(torch.load("…/input/resnet34/resnet34.pth"))
    return m

learn = cnn_learner(data, my_resnet, metrics=accuracy)

(based on what I’ve figured out from pytorch and fastai code)

Thanks!

!mkdir -p '/tmp/.cache/torch/checkpoints'
!cp ../input/fastai-pretrained-models/densenet121-a639ec97.pth /tmp/.cache/torch/checkpoints/densenet121-a639ec97.pth

learn_cd = cnn_learner(data_cd, models.densenet121, metrics=[error_rate, accuracy],model_dir = Path('../kaggle/working'),path=Path('.'),).to_fp16()

I am still getting GAIError while trying to commit. Any advice?

I want to use the AWD LSTM pretrained model but the competition doesn’t allow internet access. I have added the model as external data but I don’t know where to move it or how to load it from the directory

Create your learner (with pretrained=False if it has that option), then use learn.load(path/to/your/model) to load the pretrained weights

Alright, it worked, but now I am getting an error; apparently the sizes of the weights have changed.

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
size mismatch for 0.encoder.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7224, 400]).
size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7224, 400]).
size mismatch for 0.rnns.0.weight_hh_l0_raw: copying a param with shape torch.Size([4600, 1150]) from checkpoint, the shape in current model is torch.Size([4608, 1152]).
size mismatch for 0.rnns.0.module.weight_ih_l0: copying a param with shape torch.Size([4600, 400]) from checkpoint, the shape in current model is torch.Size([4608, 400]).
size mismatch for 0.rnns.0.module.weight_hh_l0: copying a param with shape torch.Size([4600, 1150]) from checkpoint, the shape in current model is torch.Size([4608, 1152]).
size mismatch for 0.rnns.0.module.bias_ih_l0: copying a param with shape torch.Size([4600]) from checkpoint, the shape in current model is torch.Size([4608]).
size mismatch for 0.rnns.0.module.bias_hh_l0: copying a param with shape torch.Size([4600]) from checkpoint, the shape in current model is torch.Size([4608]).
size mismatch for 0.rnns.1.weight_hh_l0_raw: copying a param with shape torch.Size([4600, 1150]) from checkpoint, the shape in current model is torch.Size([4608, 1152]).
size mismatch for 0.rnns.1.module.weight_ih_l0: copying a param with shape torch.Size([4600, 1150]) from checkpoint, the shape in current model is torch.Size([4608, 1152]).
size mismatch for 0.rnns.1.module.weight_hh_l0: copying a param with shape torch.Size([4600, 1150]) from checkpoint, the shape in current model is torch.Size([4608, 1152]).
size mismatch for 0.rnns.1.module.bias_ih_l0: copying a param with shape torch.Size([4600]) from checkpoint, the shape in current model is torch.Size([4608]).
size mismatch for 0.rnns.1.module.bias_hh_l0: copying a param with shape torch.Size([4600]) from checkpoint, the shape in current model is torch.Size([4608]).
size mismatch for 0.rnns.2.module.weight_ih_l0: copying a param with shape torch.Size([1600, 1150]) from checkpoint, the shape in current model is torch.Size([1600, 1152]).
size mismatch for 1.decoder.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7224, 400]).
size mismatch for 1.decoder.bias: copying a param with shape torch.Size([60002]) from checkpoint, the shape in current model is torch.Size([7224]).

I found this thread: “Language_model_learner not working as before?”, which removes the errors about loading weights with shape 1152. Now I am getting:

RuntimeError: Error(s) in loading state_dict for SequentialRNN:
size mismatch for 0.encoder.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7248, 400]).
size mismatch for 0.encoder_dp.emb.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7248, 400]).
size mismatch for 1.decoder.weight: copying a param with shape torch.Size([60002, 400]) from checkpoint, the shape in current model is torch.Size([7248, 400]).
size mismatch for 1.decoder.bias: copying a param with shape torch.Size([60002]) from checkpoint, the shape in current model is torch.Size([7248]).
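
One way to read an error like this is to compare parameter shapes by name. The sketch below is illustrative only: `shape_mismatches` is a hypothetical helper that works on plain name-to-shape dictionaries (with PyTorch these could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`), and the shapes are copied from the error message above. Every remaining mismatch is in the first dimension of the encoder and decoder, i.e. the vocabulary size (60002 in the checkpoint, 7248 in the current model), which suggests the model was built with a different vocabulary than the one the pretrained weights were trained with.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def shape_mismatches(
    checkpoint: Dict[str, Shape], model: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every differing parameter."""
    return [
        (name, shape, model[name])
        for name, shape in checkpoint.items()
        if name in model and model[name] != shape
    ]


# Shapes taken from the error message above.
checkpoint_shapes = {
    "0.encoder.weight": (60002, 400),
    "1.decoder.weight": (60002, 400),
    "1.decoder.bias": (60002,),
}
model_shapes = {
    "0.encoder.weight": (7248, 400),
    "1.decoder.weight": (7248, 400),
    "1.decoder.bias": (7248,),
}

for name, ckpt, cur in shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt} vs model {cur}")
```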

I searched for a long time for a solution to this issue. None of the options here helped me much (most likely because I did not look closely at the logic behind the solutions and simply tried to tweak the code). What did help was looking into the approach used by the Kaggle user whose notebooks are linked below.
The idea is that you have two notebooks: one for training the model (which can use the internet) and a second for inference, which uses the output of the first as its input. Please see the examples below:
Notebook 1: https://www.kaggle.com/bjoernholzhauer/fastai-how-to-set-up-efficientnet-b4-0-945-lb
This notebook downloads and trains the model and outputs only the model.
Notebook 2: https://www.kaggle.com/bjoernholzhauer/inference-for-trained-fastai-efficientnet-b4
This notebook uses the model trained in notebook 1 as input (without any internet access) and only runs inference.

The following worked for me. I wanted to use a pretrained resnet18 model in a Kaggle competition.

  1. I added the pretrained-pytorch-models dataset (“Pretrained PyTorch models” on Kaggle, which includes a pretrained resnet18) to my notebook.

  2. !mkdir -p /root/.cache/torch/hub/checkpoints/
     !cp /kaggle/input/pretrained-pytorch-models/resnet18-5c106cde.pth /root/.cache/torch/hub/checkpoints/resnet18-5c106cde.pth

The shell commands copy over the pretrained resnet18 model to the location that Torch expects on Kaggle. The location can be determined from the message torch gives when it is downloading models over the internet.

After this I was able to submit my model without being connected to the internet.
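
As a rough sketch of where that location comes from: recent PyTorch versions cache downloaded weights in a `checkpoints` folder under the hub directory (the one `torch.hub.get_dir()` reports), using the last component of the download URL as the filename. The helper below, `expected_checkpoint_path`, is hypothetical and only reproduces that naming with the standard library. The URL and hub directory are assumptions based on the filename and path in this post, so check them against the message torch prints in your own kernel.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def expected_checkpoint_path(url: str, hub_dir: str) -> Path:
    """Build the cache path for a weight URL: <hub_dir>/checkpoints/<url basename>."""
    filename = os.path.basename(urlparse(url).path)
    return Path(hub_dir) / "checkpoints" / filename


# Assumed values: the hub directory matches the path used in this post,
# and the URL is the one torchvision has used for these resnet18 weights.
path = expected_checkpoint_path(
    "https://download.pytorch.org/models/resnet18-5c106cde.pth",
    "/root/.cache/torch/hub",
)
print(path)
```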

The following worked for me on Kaggle:

  1. Uploaded the model to the Kaggle notebook.

  2. Copied the uploaded model into the kaggle/working dir, in a directory named /models:
    !cp …/input/tps-feb-22-xresnet18-model-fastai/xres18.pth ./models

  3. Created a learner:
    learn = Learner(dls, xresnet18(n_out=10), metrics=accuracy)

  4. Loaded the model:
    learn.load('xres18')
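
This relies on `learn.load('xres18')` looking for the file under the learner's `path` and `model_dir` with a `.pth` extension added, which with the defaults assumed here (`path='.'`, `model_dir='models'`) is `./models/xres18.pth`. The sketch below is not fastai code: `resolve_weights_path` and `check_weights` are hypothetical helpers that mimic that lookup with the standard library, so the location can be checked before calling `learn.load`.

```python
from pathlib import Path


def resolve_weights_path(name: str, path: str = ".", model_dir: str = "models") -> Path:
    """Return the file a learner with this path/model_dir would be expected to load."""
    return Path(path) / model_dir / f"{name}.pth"


def check_weights(name: str, path: str = ".", model_dir: str = "models") -> Path:
    """Raise a clear error if the weights are not where the lookup expects them."""
    target = resolve_weights_path(name, path, model_dir)
    if not target.is_file():
        raise FileNotFoundError(f"expected weights at {target}")
    return target
```

Note that if `./models` does not exist yet, `cp` creates a file named `models` rather than a directory, so running `mkdir -p ./models` first is a safe precaution.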