learner.load() error when doing progressive resizing on Unet?

Hi there!
I’m working on satellite imagery in a Kaggle kernel, using primarily sample code from the class. I am training a segmentation learner and can reach a DICE score of about 0.80 after a few minutes. However, when I try to swap the data out for the same images at 2x resolution, seg_learn.load('stage-2') fails! In particular, it complains as follows:

RuntimeError: Error(s) in loading state_dict for DynamicUnet:
	Missing key(s) in state_dict: "layers.10.layers.0.0.weight", "layers.10.layers.0.0.bias", "layers.10.layers.1.0.weight", "layers.10.layers.1.0.bias", "layers.11.0.weight", "layers.11.0.bias".
	Unexpected key(s) in state_dict: "layers.12.0.weight", "layers.12.0.bias", "layers.11.layers.0.0.weight", "layers.11.layers.0.0.bias", "layers.11.layers.1.0.weight", "layers.11.layers.1.0.bias".

To my untrained eye, it seems like the save has failed to prepare the file in a way that the loader expects. Am I doing something wrong? Have you seen this before? Can you help me get past this blocker?
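Roughly what my notebook does, in case that helps (a condensed sketch; get_data here is just a stand-in for my DataBunch-building code, and the exact numbers are trimmed):

# train at half resolution, save, then rebuild at full resolution and load
data_small = get_data(size=src_size // 2)                      # stand-in helper
seg_learn = unet_learner(data_small, models.resnet34, metrics=dice, wd=wd)
seg_learn.fit_one_cycle(10)                                    # reaches ~0.80 DICE
seg_learn.save('stage-2')

data_big = get_data(size=src_size)                             # same images, 2x resolution
seg_learn = unet_learner(data_big, models.resnet34, metrics=dice, wd=wd)
seg_learn.load('stage-2')                                      # <- RuntimeError here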

Thank you!
Jona

I was having the exact same problem (just on layers.5 instead of layers.10) when loading an old model. After a bit of trial and error, I realized that I had enabled self-attention recently. Once I disabled it, the old model loaded correctly.

#learn = unet_learner(data, arch, pretrained=True, self_attention=True, loss_func=F.l1_loss, blur=True, norm_type=NormType.Weight)

learn = unet_learner(data, arch, pretrained=True, loss_func=F.l1_loss, blur=True, norm_type=NormType.Weight)
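If you are not sure which layer is the odd one out, one way to see the mismatch before calling load is to diff the key sets by hand. This is just a sketch: it assumes the checkpoint was written by learn.save('stage-2'), which puts the file under learn.path/learn.model_dir and stores the weights under the 'model' key.

import torch

state = torch.load(learn.path/learn.model_dir/'stage-2.pth', map_location='cpu')
saved_keys = set(state['model'].keys())              # keys stored in the checkpoint
model_keys = set(learn.model.state_dict().keys())    # keys the current model expects

print('missing from checkpoint: ', sorted(model_keys - saved_keys))
print('unexpected in checkpoint:', sorted(saved_keys - model_keys))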

Hello @jona! Did you solve this problem, and how? I’m encountering the same issue when doing progressive resizing on Unet.

Oof!
I’m sorry that I don’t remember. I’ll see if I can dig it up, but it’s unlikely to be of much help. Sorry!

Hi Sebastian,
I went back and the old notebook runs without error now, so I must have solved it somehow. I do not see any reference to self_attention, so I don’t think Avio’s solution was my salvation.
Instead, what I see in my code is:

#… [train seg_learn on size=src_size//2]
seg_learn.save('stage-2')
size = src_size
src = (SegItemListCustom.from_folder(path_img).split_by_rand_pct(valid_pct=0.2).label_from_func(get_y_fn, classes=codes))
data = (src.transform(tfms, size=size, tfm_y=True).databunch(bs=bs, num_workers=0).normalize(imagenet_stats))
seg_learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd, model_dir='/tmp/models')
seg_learn.load('stage-2');

So it looks like I essentially rebuilt the learner from the ground up and then loaded the saved weights into it. I hope this helps you!

Strange, I’m rebuilding my model too, but it still fails; the only difference is that I’m using resnet18. But thanks a lot for the rapid response, I’ll continue testing.

Update: I tested with resnet34 and I’m still getting the error. I don’t know whether the fastai version Google Colab runs is different from the one that allows Unet to do progressive resizing that way…

The error I’m getting is:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in load(self, file, device, strict, with_opt, purge, remove_module)
    271             model_state = state['model']
    272             if remove_module: model_state = remove_module_load(model_state)
--> 273             get_model(self.model).load_state_dict(model_state, strict=strict)
    274             if ifnone(with_opt,True):
    275                 if not hasattr(self, 'opt'): self.create_opt(defaults.lr, self.wd)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    828         if len(error_msgs) > 0:
    829             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 830                                self.__class__.__name__, "\n\t".join(error_msgs)))
    831         return _IncompatibleKeys(missing_keys, unexpected_keys)
    832 

RuntimeError: Error(s) in loading state_dict for DynamicUnet:
	Missing key(s) in state_dict: "layers.11.layers.0.0.weight", "layers.11.layers.0.0.bias", "layers.11.layers.1.0.weight", "layers.11.layers.1.0.bias", "layers.12.0.weight", "layers.12.0.bias". 
	Unexpected key(s) in state_dict: "layers.10.layers.0.0.weight", "layers.10.layers.0.0.bias", "layers.10.layers.1.0.weight", "layers.10.layers.1.0.bias", "layers.11.0.weight", "layers.11.0.bias". 

@sgaseretto I have the exact same problem (maybe mirrored) as you.

	Missing key(s) in state_dict: "layers.10.layers.0.0.weight", "layers.10.layers.0.0.bias", "layers.10.layers.1.0.weight", "layers.10.layers.1.0.bias", "layers.11.0.weight", "layers.11.0.bias". 
	Unexpected key(s) in state_dict: "layers.12.0.weight", "layers.12.0.bias", "layers.11.layers.0.0.weight", "layers.11.layers.0.0.bias", "layers.11.layers.1.0.weight", "layers.11.layers.1.0.bias".

I’m creating the U-Net as usual with:

learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd, pretrained=True, self_attention=True)

But the saved model doesn’t load. How did you solve that?

Because it’s different!

The same U-Net, instantiated with the same line of Python code, produces two different model structures. That’s weird!

EDIT:

And here we have the culprit!

This line in the DynamicUnet class changes the structure of the network according to the image size provided. But I can’t understand why…

if imsize != x.shape[-2:]: layers.append(Lambda(lambda x: F.interpolate(x, imsize, mode='nearest')))

EDIT2:

I’ve restarted the notebook, and now the DynamicUnet constructor creates the lambda layer also in the first U-Net allocation, the one with the smallest image size (51, 100). So strange…

I’ve also added:

   print(x.shape)
   print(imsize)

to the DynamicUnet constructor and the output is:

torch.Size([1, 96, 52, 100])
torch.Size([51, 100])
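My reading of that print (an inference, I haven’t traced the whole class): the first stride-2 convolution maps an odd side like 51 up to ceil(51/2) = 26, the decoder’s final upsample doubles it back to 52, and since 52 != 51 the extra lambda/interpolate layer gets appended, which shifts every later layer index in the state_dict. A tiny check of that idea:

import math

# A side only survives the down/up round trip if it is even; odd sides overshoot
# by one pixel, which is exactly the 51 -> 52 mismatch printed above.
def round_trip(side):
    return math.ceil(side / 2) * 2

for side in (51, 52, 100):
    print(side, '->', round_trip(side), ' extra lambda layer:', round_trip(side) != side)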

I didn’t dig into the code like you did (and I should have), but I figured out that the problem was with the image sizes too. I discovered that if the sizes used to initialize the DataBunch were odd numbers, I always got that error, while images with 2^n sizes always worked. I didn’t try other even numbers, because I resized the images from 128 to 256 and then to 512, but if all the image sizes are even numbers you will probably stop getting those errors.
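One way to apply that observation (a sketch of my own, not something from the fastai docs): snap each side down to an even number, or to a multiple of 32 to be extra safe, before building the DataBunch, so both stages get the same structure and the state_dicts line up.

# Hypothetical helper: round each side down so the stride-2 stages divide cleanly.
def snap_size(size, multiple=2):
    return tuple((s // multiple) * multiple for s in size)

size = snap_size((51, 100))                          # -> (50, 100)
data = (src.transform(tfms, size=size, tfm_y=True).databunch(bs=bs, num_workers=0).normalize(imagenet_stats))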

Yep, that seems to be the explanation, but it’s not deterministic. Now I’m getting the opposite: the first U-Net is created with the lambda layer, but the second one doesn’t get it. Same code, same image sizes.

EDIT:

The other super weird thing is that with the lambda layer I get these losses:

[image: training losses with the lambda layer]

Without the lambda layer I was getting these losses:

[image: training losses without the lambda layer]

So it looks like, without the lambda layer, the network behaves… weirdly…