Error with DenseNet 121 finetuning

Yep, I did. Still the same error. In fact, none of the DenseNet models are working!

Sounds like I better look into it…

Hi Jeremy - was the issue with densenet fixed?

No, turns out I had to teach 7 classes this week, which has kept me busy :open_mouth: It’s high on my priority list though.

Any update on a fix for densenet? I’ve been using an older version of fastai so that I can keep using it, since it happens to perform really well compared to other models.

I’m not sure I’ll have time this week. I’ll try! (But feel free to have a go yourself at a fix since I think you’d find it an interesting area of fastai to learn about…)

The DenseNet implementation in PyTorch has repeated modules. The layers are stored in an OrderedDict, and the repeated module key names cause the optimizer to throw this error:

ValueError: some parameters appear in more than one parameter group

I found this PyTorch issue helpful in understanding the problem.
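
For readers unfamiliar with this error, here is a minimal pure-Python sketch of the kind of disjointness check that produces it. It only mimics the behaviour: `check_param_groups` and the stand-in parameter objects are hypothetical and are not PyTorch code.

```python
def check_param_groups(groups):
    """Raise ValueError if any parameter object is shared between groups.

    `groups` is a list of lists. Identity (not equality) decides whether two
    entries are the same parameter, as it would for tensors.
    """
    seen = set()
    for group in groups:
        ids = {id(p) for p in group}
        if not seen.isdisjoint(ids):
            raise ValueError("some parameters appear in more than one parameter group")
        seen |= ids


# Stand-ins for parameters; plain objects are enough for the illustration.
w1, w2, w3 = object(), object(), object()

check_param_groups([[w1], [w2, w3]])  # disjoint groups: passes silently

try:
    check_param_groups([[w1, w2], [w2, w3]])  # w2 is in both groups
except ValueError as err:
    message = str(err)
    print(message)
```

Any layer-group split that hands the same parameter to two groups would trip a check like this, whatever the reason for the overlap.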

I did the following to fix the error:

  1. Modified the torchvision.models.densenet code to append the block numbers to the layer names.
  2. Copied the weights from torchvision.models.densenet models to the models with updated layer names.
  3. Saved the state_dict for the updated model.

With these changes I am able to load and train all of the DenseNet models.

Here is the code I used to transfer the model weights:

import torch
from densenet import *
import torchvision
from collections import OrderedDict
from tqdm import tqdm

dn_models = {
    'densenet121': densenet121,
    'densenet169': densenet169,
    'densenet201': densenet201,
    'densenet161': densenet161,
}

torch_models = {
    'densenet121': torchvision.models.densenet121,
    'densenet169': torchvision.models.densenet169,
    'densenet201': torchvision.models.densenet201,
    'densenet161': torchvision.models.densenet161,
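}

# Output paths for the converted weights. `model_locs` was not defined in the
# original post; these file names are placeholders (an assumption).
model_locs = {name: f"{name}_fixed.pth" for name in dn_models}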

for m in tqdm(dn_models.keys()):
    print(f"Fixing {m}")
    # densenet with layer names fixed
    dnetm = dn_models[m]()
    # original densenet
    dnet = torch_models[m](True).eval()

    # get the state dict of the original densenet
    dnet_sdict = dnet.state_dict()
    d_keys = dnet_sdict.keys()
    dm_keys = dnetm.state_dict().keys() # modified densenet keys

    dnetm.load_state_dict(OrderedDict(zip(dm_keys, dnet_sdict.values())))
    dnetm.eval()
    dnetm_sdict = dnetm.state_dict()

    for k1, k2 in zip(d_keys, dm_keys):
        assert torch.equal(dnet_sdict[k1], dnetm_sdict[k2]), f"{k1}!={k2}"

    torch.save(dnetm.state_dict(), model_locs[m])
    print(f"Saving to {model_locs[m]}\n")

print("Done!")
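
The weight transfer above pairs the two state dicts purely by position, so it relies on both models listing their parameters in the same order. Here is a minimal sketch of that idea with an added shape check, using plain tuples as stand-ins for tensor shapes. `remap_by_position` and the key names are made up for illustration.

```python
from collections import OrderedDict


def remap_by_position(src_state, dst_keys, dst_shapes):
    """Rename src_state's entries to dst_keys, pairing them by position.

    Raises ValueError when the entry counts differ or when a paired entry
    has a different shape, which would indicate the orderings do not match.
    """
    if len(src_state) != len(dst_keys):
        raise ValueError("state dicts have different numbers of entries")
    remapped = OrderedDict()
    for (src_key, value), dst_key in zip(src_state.items(), dst_keys):
        if value["shape"] != dst_shapes[dst_key]:
            raise ValueError(f"shape mismatch: {src_key} -> {dst_key}")
        remapped[dst_key] = value
    return remapped


# Hypothetical key names; each value carries only a shape for the illustration.
old_state = OrderedDict([
    ("block1.layer1.conv.weight", {"shape": (32, 64, 3, 3)}),
    ("block2.layer1.conv.weight", {"shape": (32, 96, 3, 3)}),
])
new_shapes = {
    "block1.layer1_1.conv.weight": (32, 64, 3, 3),
    "block2.layer2_1.conv.weight": (32, 96, 3, 3),
}

new_state = remap_by_position(old_state, list(new_shapes), new_shapes)
print(list(new_state))
```

A shape check will not catch two same-shaped layers being swapped, so the value-by-value assert in the loop above is still worth keeping.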

Modified DenseNet code
Fixed DenseNet weights

@jeremy Does this look like a valid solution? Or is there a better way of fixing this issue?

you rock, thanks so much @vikram!!

Thanks and welcome!

I’ve fixed this in fastai. It wasn’t actually the reason that @vikram suggested (although still awesome that his fix worked anyway!) but was due to a bug in how layer groups were created. I’ve fixed it in a rather hacky way for now, which works in my testing - but let me know if anyone sees any problems. I’ll endeavor to find a cleaner API for creating layer groups in the future…

@jeremy Glad that there was a simple fix! However, I see this error upon calling learn.unfreeze after the latest pull.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-932c6b65ac49> in <module>()
----> 1 learn.unfreeze()

~/vikram/fast_ai/fastai/courses/dl1/fastai/conv_learner.py in unfreeze(self)
    182             None
    183         """
--> 184         self.freeze_to(0)
    185         self.precompute = False

~/vikram/fast_ai/fastai/courses/dl1/fastai/learner.py in freeze_to(self, n)
     64         c=self.get_layer_groups()
     65         for l in c:     set_trainable(l, False)
---> 66         for l in c[n:]: set_trainable(l, True)
     67 
     68     def unfreeze(self): self.freeze_to(0)

~/vikram/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/container.py in __getitem__(self, idx)
     51 
     52     def __getitem__(self, idx):
---> 53         if not (-len(self) <= idx < len(self)):
     54             raise IndexError('index {} is out of range'.format(idx))
     55         if idx < 0:

TypeError: '<=' not supported between instances of 'int' and 'slice'

I figured out that this arises from the updated get_layer_groups function, specifically when we ask for the fully connected block.

        if do_fc:
            return self.fc_model 

Compared to the previous implementation, the updated code returns a torch.nn.modules.container.Sequential where earlier it returned a list. Maybe torch.nn.modules.container.Sequential does not support slicing? I see the error when I do c[n:].

Anyway, I made an easy fix by returning a list of the children of fc_model instead, and unfreeze now works.

        if do_fc:
            return list(children(self.fc_model))
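
The TypeError in the traceback can be reproduced without PyTorch. The sketch below uses a hypothetical `IntOnlyContainer` whose `__getitem__` performs the same integer range check shown in the traceback. It is an illustration only, not the PyTorch class.

```python
class IntOnlyContainer:
    """Toy container whose __getitem__ assumes an integer index."""

    def __init__(self, *items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)

    def __getitem__(self, idx):
        # A slice cannot be compared with an int, so this line raises TypeError.
        if not (-len(self) <= idx < len(self)):
            raise IndexError('index {} is out of range'.format(idx))
        return self._items[idx]


fc = IntOnlyContainer("bn", "dropout", "linear")

try:
    fc[0:]
except TypeError as err:
    error_text = str(err)
    print(error_text)

# Converting to a plain list first makes slicing work.
layers = list(fc)
print(layers[1:])
```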

I hope I got this right this time.

What is the right way of getting the layer groups? Can you help me figure out how I can improve the code?

Ah sorry - I was running out of time before class so it seems I didn’t test it properly… Anyway I’ve wrapped it in a list now and it’s OK again. (There’s no need to make each child of fc_model a separate layer - we can treat it all as one layer).
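
To illustrate why a single wrapped group is enough, here is a pure-Python sketch of the freeze_to logic shown in the traceback. `Group` and the free function `freeze_to` are hypothetical stand-ins for fastai's modules and set_trainable.

```python
class Group:
    """Stand-in for a layer group; only tracks a trainable flag."""

    def __init__(self, name):
        self.name = name
        self.trainable = False


def freeze_to(groups, n):
    """Freeze every group, then unfreeze the groups from index n onwards."""
    for g in groups:
        g.trainable = False
    for g in groups[n:]:  # needs a sliceable sequence such as a list
        g.trainable = True


fc_model = Group("fc_model")

# Wrapping the single head in a list gives one layer group that can be sliced.
groups = [fc_model]

freeze_to(groups, 0)  # the call that unfreeze() makes
state_after_unfreeze = [(g.name, g.trainable) for g in groups]
print(state_after_unfreeze)

freeze_to(groups, 1)  # nothing at index 1 or later, so everything stays frozen
state_after_freeze = [(g.name, g.trainable) for g in groups]
print(state_after_freeze)
```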

I don’t know - if I knew I would have implemented it! :wink:

@vikram While working on that Chestnet dataset, did you get this error: Metrics for multilabel dataset: accuracy_multi() missing 1 required positional argument: 'thresh'? I get this error using a resnet model (I noticed in a previous post that you used resnet before moving to densenet).
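
The message suggests the metric is being called with only predictions and targets. One common way to supply the extra argument is functools.partial. In the sketch below, `accuracy_multi_sketch` is a hypothetical pure-Python stand-in, not fastai's function, and the 0.5 threshold is arbitrary.

```python
from functools import partial


def accuracy_multi_sketch(preds, targs, thresh):
    """Fraction of label slots where (pred > thresh) matches the 0/1 target."""
    correct = 0
    total = 0
    for pred_row, targ_row in zip(preds, targs):
        for p, t in zip(pred_row, targ_row):
            correct += int((p > thresh) == bool(t))
            total += 1
    return correct / total


# Bind the threshold so the metric can be called as metric(preds, targs).
metric = partial(accuracy_multi_sketch, thresh=0.5)

preds = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
targs = [[1, 0, 0], [0, 1, 1]]
score = metric(preds, targs)
print(score)
```

Whether a bound metric like this fits a given fastai version's metrics list is an assumption to check against that version.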

When using densenet121 (the same method as used by Ng, https://arxiv.org/pdf/1711.05225.pdf), I’m getting another error altogether:

RuntimeError: invalid argument 2: 3D or 4D (batch mode) tensor expected for input, but got: [64 x 1000] at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THCUNN/generic/SpatialAdaptiveMaxPooling.cu:22

This is the complete stack trace:

RuntimeError                              Traceback (most recent call last)
<ipython-input-27-2cf15c2f4ff1> in <module>()
      1 # determine learning rate using learning rate finder
----> 2 lrf=learn.lr_find()
      3 learn.sched.plot()

~/fastai/courses/dl1/fastai/learner.py in lr_find(self, start_lr, end_lr, wds)
    249         layer_opt = self.get_layer_opt(start_lr, wds)
    250         self.sched = LR_Finder(layer_opt, len(self.data.trn_dl), end_lr)
--> 251         self.fit_gen(self.model, self.data, layer_opt, 1)
    252         self.load('tmp')
    253 

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    158         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    159         fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 160             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    161 
    162     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
     84             batch_num += 1
     85             for cb in callbacks: cb.on_batch_begin()
---> 86             loss = stepper.step(V(x),V(y))
     87             avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
     88             debias_loss = avg_loss / (1 - avg_mom**batch_num)

~/fastai/courses/dl1/fastai/model.py in step(self, xs, y)
     38     def step(self, xs, y):
     39         xtra = []
---> 40         output = self.m(*xs)
     41         if isinstance(output,(tuple,list)): output,*xtra = output
     42         self.opt.zero_grad()

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
     65     def forward(self, input):
     66         for module in self._modules.values():
---> 67             input = module(input)
     68         return input
     69 

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

~/fastai/courses/dl1/fastai/layers.py in forward(self, x)
      8         self.ap = nn.AdaptiveAvgPool2d(sz)
      9         self.mp = nn.AdaptiveMaxPool2d(sz)
---> 10     def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], 1)
     11 
     12 class Lambda(nn.Module):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/pooling.py in forward(self, input)
    820 
    821     def forward(self, input):
--> 822         return F.adaptive_max_pool2d(input, self.output_size, self.return_indices)
    823 
    824     def __repr__(self):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in adaptive_max_pool2d(input, output_size, return_indices)
    383         return_indices: whether to return pooling indices
    384     """
--> 385     return _functions.thnn.AdaptiveMaxPool2d.apply(input, output_size, return_indices)
    386 
    387 

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/thnn/pooling.py in forward(ctx, input, output_size, return_indices)
    501         backend.SpatialAdaptiveMaxPooling_updateOutput(backend.library_state,
    502                                                        input, output, indices,
--> 503                                                        ctx.output_size[1], ctx.output_size[0])
    504         if ctx.return_indices:
    505             ctx.save_for_backward(input, indices)

RuntimeError: invalid argument 2: 3D or 4D (batch mode) tensor expected for input, but got: [64 x 1000] at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THCUNN/generic/SpatialAdaptiveMaxPooling.cu:22

I appreciate any help!

Thanks

p.s. I’m not sure what causes RuntimeError: invalid argument 2: 3D or 4D (batch mode) tensor expected for input, but got: [64 x 1000]. I understand that 64 is my default batch size, but why the 1000?
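
One plausible reading, offered as an assumption since the thread does not confirm it: 1000 is the number of ImageNet classes, so a [64 x 1000] tensor looks like the output of the pretrained network's final classifier. If the backbone was not cut before that classifier, the pooling layer in the custom head would receive a 2D tensor. The sketch below tracks shapes only. The function names are hypothetical, and the 1024 x 7 x 7 feature shape assumes densenet121 with 224 x 224 inputs.

```python
def adaptive_pool2d_shape(shape, out_size=1):
    """Output shape of adaptive 2D pooling; input must be 3D or 4D."""
    if len(shape) not in (3, 4):
        raise ValueError(
            f"3D or 4D tensor expected for input, but got: {list(shape)}"
        )
    return tuple(shape[:-2]) + (out_size, out_size)


def concat_pool_shape(shape, out_size=1):
    """Shape after concatenating max and average pooling along dimension 1."""
    pooled = list(adaptive_pool2d_shape(shape, out_size))
    pooled[1] *= 2  # concatenation along dimension 1 doubles that dimension
    return tuple(pooled)


feature_maps = (64, 1024, 7, 7)   # backbone cut before the classifier
classifier_out = (64, 1000)       # backbone left intact: ImageNet logits

pooled_shape = concat_pool_shape(feature_maps)
print(pooled_shape)

try:
    concat_pool_shape(classifier_out)
except ValueError as err:
    shape_error = str(err)
    print(shape_error)
```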

When I call data.classes I get the correct 15 classes (including “No Finding”)

['Atelectasis',
 'Cardiomegaly',
 'Consolidation',
 'Edema',
 'Effusion',
 'Emphysema',
 'Fibrosis',
 'Hernia',
 'Infiltration',
 'Mass',
 'No Finding',
 'Nodule',
 'Pleural_Thickening',
 'Pneumonia',
 'Pneumothorax']

Could you please post how you are instantiating ConvLearner? I suspect this is because of the loss function. Since this is a multi-label classification problem, we should use something like the f-beta score. Check planet.ipynb from fastai.
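
For reference, the F-beta score combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R). A minimal pure-Python sketch for thresholded multi-label predictions follows. `fbeta_sketch` is a hypothetical helper averaged over samples. It is not the f2 function from the planet notebook, and the 0.5 threshold is arbitrary.

```python
def fbeta_sketch(preds, targs, beta=2.0, thresh=0.5):
    """Mean per-sample F-beta for multi-label predictions.

    preds: rows of probabilities; targs: rows of 0/1 labels.
    Uses the count form (1+b^2)*TP / ((1+b^2)*TP + b^2*FN + FP), which is
    equivalent to the precision/recall form. Samples with no predicted and
    no true labels score 0.0 here.
    """
    b2 = beta ** 2
    scores = []
    for pred_row, targ_row in zip(preds, targs):
        tp = fp = fn = 0
        for p, t in zip(pred_row, targ_row):
            predicted = p > thresh
            if predicted and t:
                tp += 1
            elif predicted and not t:
                fp += 1
            elif not predicted and t:
                fn += 1
        denom = (1 + b2) * tp + b2 * fn + fp
        scores.append((1 + b2) * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


preds = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
targs = [[1, 0, 0], [0, 1, 1]]
f2 = fbeta_sketch(preds, targs)
print(round(f2, 4))
```

With beta = 2, recall counts for more than precision, so missed labels cost more than spurious ones.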

My ConvLearner looks like this: learn = ConvLearner.pretrained(arch, data, precompute=False).

I also tried the metrics from planet and did the following: learn = ConvLearner.pretrained(arch, data, metrics=metrics), where metrics = [f2], just like Lesson 2 shows: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson2-image_models.ipynb

Also, in the planet notebook the tensor used is [torch.FloatTensor of size 64x17], while mine for chestnet is [torch.FloatTensor of size 32x15]. However

I am trying dn121 on a medical dataset from Kaggle, but I am getting the following error:


I checked data.classes, read and plotted an image from my training set, and checked img.shape as well. They all look okay. How do I go about this?

I am getting the same error on a different problem.
Did you find a solution?

Late to the party but would like to thank you a million times @vikram!
