Use of Learner and partial causes CUDA out of memory

kshitijpatil09 · April 2, 2020, 8:55pm

I’m working on Colab and experimenting with several of the tweaks that XResNet offers. I’ve run into some strange behaviour that prompted me to write this post.

When I choose the clean approach of cnn_learner with no tweaks in the model, I’m able to train the network without any memory error.

# arch=xresnet50
learn = cnn_learner(dls,arch,opt_func=ranger,metrics=error_rate)

Now, since I want to customize the architecture, I switched to Learner and tried instantiating the model as done by @ducha-aiki in this notebook:

model = xresnet50(n_out=dls.c)
learn = Learner(dls,model,opt_func=ranger,metrics=error_rate,
                    splitter=lambda m: L(m[0][:3],m[0][3:],m[1:]).map(params))

This causes the following error:

TypeError                                 Traceback (most recent call last)

<ipython-input-38-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

4 frames

/usr/local/lib/python3.6/dist-packages/fastai2/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
    221     n_epoch = num_it//len(self.dls.train) + 1
    222     cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 223     with self.no_logging(): self.fit(n_epoch, cbs=cb)
    224     if show_plot: self.recorder.plot_lr_find()
    225     if suggestions:

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    180     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):
    181         with self.added_cbs(cbs):
--> 182             if reset_opt or not self.opt: self.create_opt()
    183             if wd is None: wd = self.wd
    184             if wd is not None: self.opt.set_hypers(wd=wd)

/usr/local/lib/python3.6/dist-packages/fastai2/learner.py in create_opt(self)
    129     def _bn_bias_state(self, with_bias): return bn_bias_params(self.model, with_bias).map(self.opt.state)
    130     def create_opt(self):
--> 131         self.opt = self.opt_func(self.splitter(self.model), lr=self.lr)
    132         if not self.wd_bn_bias:
    133             for p in self._bn_bias_state(True ): p['do_wd'] = False

<ipython-input-34-1c418864616f> in <lambda>(m)
      1 learn = Learner(dls,model,opt_func=ranger,metrics=metrics,                
----> 2                     splitter=lambda m: L(m[0][:3],m[0][3:],m[1:]).map(params))

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in __getitem__(self, idx)
     66     def __getitem__(self, idx):
     67         if isinstance(idx, slice):
---> 68             return self.__class__(OrderedDict(list(self._modules.items())[idx]))
     69         else:
     70             return self._get_item_by_idx(self._modules.values(), idx)

TypeError: __init__() missing 1 required positional argument: 'nf'
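The last frame of the traceback shows a slice being rebuilt through self.__class__(...), which fails when the sliced object is a subclass whose constructor needs extra positional arguments. The dependency-free sketch below reproduces only that mechanism. MiniSequential and MiniConvLayer are hypothetical stand-ins, not PyTorch or fastai classes, and the (ni, nf) signature is an assumption inferred from the 'nf' in the error message.

```python
from collections import OrderedDict


class MiniSequential:
    """Hypothetical stand-in for a Sequential-style container (container logic only)."""

    def __init__(self, *layers):
        # Accept either positional layers or a single OrderedDict of named layers.
        if len(layers) == 1 and isinstance(layers[0], OrderedDict):
            self._modules = OrderedDict(layers[0])
        else:
            self._modules = OrderedDict((str(i), l) for i, l in enumerate(layers))

    def __len__(self):
        return len(self._modules)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # Same pattern as the container.py frame in the traceback:
            # the slice is rebuilt by calling the *subclass* constructor.
            return self.__class__(OrderedDict(list(self._modules.items())[idx]))
        return list(self._modules.values())[idx]


class MiniConvLayer(MiniSequential):
    """Hypothetical subclass whose constructor requires two positional arguments."""

    def __init__(self, ni, nf):
        super().__init__(f"conv({ni}->{nf})", f"bn({nf})", "relu")


def try_slice(container):
    """Return the sliced container, or the error text if slicing fails."""
    try:
        return container[:2]
    except TypeError as e:
        return f"TypeError: {e}"


plain = MiniSequential("a", "b", "c")
layer = MiniConvLayer(3, 32)

print(len(try_slice(plain)))  # slicing the plain container works
print(try_slice(layer))       # slicing the subclass raises: 'nf' is missing
```

If that reading is right, the splitter lambda is slicing a single layer (m[0]) rather than a body made of several layers, which would happen when the model is not wrapped as a body/head pair.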

I’m not sure, but I thought this had something to do with the head of the model, so I went ahead and wrote a custom create_cnn_model-style function, as shown by @muellerzr in this notebook:

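def create_custom_model(arch):
  # create_body needs a callable, so wrap arch together with the custom arguments
  def get_arch(pretrained=True):
    return arch(sa=True, pool=MaxPool, act_cls=Mish, pretrained=pretrained)
  # arch = partial(xresnet50, sa=True, pool=MaxPool, act_cls=Mish)
  body = create_body(get_arch, cut=-4)  # cut off the last 4 layers of the architecture
  nf = num_features_model(body) * 2  # number of input features passed to the head
  body = convert_MP_to_blurMP(body, nn.MaxPool2d)
  head = create_head(nf, dls.c)  # dls is taken from the enclosing (global) scope
  model = nn.Sequential(body, head)
  return model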

Note: create_body needs a callable, so one option was to use partial. I tried partial with all the required args predefined, which led to an “out of memory” error. Then I tried:

removing Mish
removing all the custom args, i.e. partial(xresnet50), to check whether any parameter was causing the issue
defining a closure get_arch, as shown above

Nothing helped. So it seems that either partial or Learner is somehow causing the “out of memory” errors, since cnn_learner has no such issue and I’m able to train the model with it.
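As a side note on partial versus a closure, here is a small stdlib-only sketch showing that both forward the same arguments when called with just a pretrained flag. fake_arch and fake_create_body are hypothetical stand-ins, and the assumption that create_body calls arch(pretrained=pretrained) is mine. The sketch only compares the calls; it says nothing about where the memory actually goes.

```python
from functools import partial


def fake_arch(pretrained=False, **kwargs):
    """Hypothetical stand-in for an architecture function such as xresnet50."""
    # Return the received arguments so the two wrappers can be compared.
    return {"pretrained": pretrained, **kwargs}


def fake_create_body(arch, pretrained=True):
    """Hypothetical stand-in: calls arch with only the pretrained flag."""
    return arch(pretrained=pretrained)


# Option 1: freeze the custom arguments with partial.
arch_partial = partial(fake_arch, sa=True, act_cls="Mish")


# Option 2: a closure, in the style of get_arch from the post.
def arch_closure(pretrained=True):
    return fake_arch(sa=True, act_cls="Mish", pretrained=pretrained)


built_with_partial = fake_create_body(arch_partial)
built_with_closure = fake_create_body(arch_closure)

# Both wrappers end up calling fake_arch with identical arguments.
print(built_with_partial == built_with_closure)
```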

So please discuss the ideal ways of dealing with custom architectures that won’t cause memory errors.

muellerzr · April 2, 2020, 9:03pm

You shouldn’t use cnn_learner for xresnet. It’s not designed to work with that architecture because, unless we’re using xresnet50, it’s not pretrained (and I’ve found the pretrained one has issues too). But you said you have it working, so let’s work under that assumption. You should use Learner and pass in the full model instead, as shown in the ImageNette/Woof notebooks. I can’t get into GitHub as it’s currently down, else I’d show some examples. Also, regarding the split that raised the error: unless you explicitly use a pretrained model, those splits don’t matter much, and if you do, you should be using default_split.

Edit: Looks like GitHub and nbviewer are up again, just not GitHub’s in-site renderer.
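For reference, a minimal sketch of what passing the full model to Learner might look like, reusing only the calls from the original post and leaving out the custom splitter. The function name build_learner is hypothetical, and the import path is an assumption: the traceback in the question shows the older fastai2 package, which was later released as fastai.

```python
def build_learner(dls):
    """Build a Learner around a full xresnet50; `dls` is an existing DataLoaders."""
    # Imported lazily so that defining this function does not require fastai.
    # Assumption: with the older package from the thread this would be
    # `fastai2.vision.all` instead.
    from fastai.vision.all import Learner, xresnet50, ranger, error_rate

    model = xresnet50(n_out=dls.c)  # whole network, head included
    # No splitter argument: Learner falls back to its default parameter grouping.
    return Learner(dls, model, opt_func=ranger, metrics=error_rate)
```

It would be called as learn = build_learner(dls), followed by learn.lr_find() as in the original post.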

sgugger · April 2, 2020, 9:08pm

GitHub is down, but the documentation is not. You can check the new, fleshed-out Imagenette tutorial, which has all you need.