Learn.load: RecursionError: maximum recursion depth exceeded

Hi, hope I’m posting this in the right forum :slight_smile:

I am running into a problem with the learn.load function (using a cnn_learner) and was wondering if anyone else gets this too. I keep getting a "RecursionError: maximum recursion depth exceeded" error when running learn.load(base_dir + '[model_name]').

I don't really understand what is happening here. I've seen this happen a few times with other library calls too, and it seems to occur regularly when I re-run functions on the model after the initial training run. To reproduce my particular case at the moment:

  1. I can save a model initially.
  2. Then load the model for the first time, which also works fine.
  3. But when I then try to re-load the model after changing some parameters (for example the number of epochs in a fit cycle), I get that error. I've been trying to find a solution but can't seem to get past this. Roughly, the whole sequence looks like the sketch below.
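To make that concrete, here is roughly what the sequence looks like (the data bunch setup, transforms and epoch counts here are placeholders rather than my exact code):

from fastai.vision import *

# placeholder data bunch and learner setup, not my exact pipeline
data = ImageDataBunch.from_folder(base_dir, ds_tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet34, metrics=error_rate).to_fp16()

learn.fit_one_cycle(4)
learn.save(base_dir + 'stage-1-50')    # 1. saving works
learn.load(base_dir + 'stage-1-50')    # 2. the first load works

learn.fit_one_cycle(2, max_lr=1e-4)    # change some parameters and train again
learn.load(base_dir + 'stage-1-50')    # 3. RecursionError here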

The only way I can really recover from this is to create a new data bunch and a new learner, which is not ideal since I'd like to resume from my save point rather than starting from scratch every time.

I am on the latest version of fastai (1.0.51). I'm running my own ML rig on Ubuntu 18 with an RTX 2070 (I ssh in from my macOS laptop and connect to the running Jupyter server), and I'm training with fp16. I've pasted the stack trace I'm getting below in case it helps provide more context.

Thanks everyone!


RecursionError                            Traceback (most recent call last)
<ipython-input-…> in <module>
----> 1 learn.load(base_dir + 'stage-1-50')

~/fastai/fastai/basic_train.py in load(self, name, device, strict, with_opt, purge, remove_module)
    259                 remove_module:bool=False):
    260         "Load model and optimizer state (if `with_opt`) `name` from `self.model_dir` using `device`."
--> 261         if purge: self.purge(clear_opt=ifnone(with_opt, False))
    262         if device is None: device = self.data.device
    263         elif isinstance(device, int): device = torch.device('cuda', device)

~/fastai/fastai/basic_train.py in purge(self, clear_opt)
    310 
    311         tmp_file = get_tmp_file(self.path/self.model_dir)
--> 312         torch.save(state, open(tmp_file, 'wb'))
    313         for a in attrs_del: delattr(self, a)
    314         gc.collect()

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol)
    217         >>> torch.save(x, buffer)
    218     """
--> 219     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    220 
    221 

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in _with_file_like(f, mode, body)
    142         f = open(f, mode)
    143     try:
--> 144         return body(f)
    145     finally:
    146         if new_fd:

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in <lambda>(f)
    217         >>> torch.save(x, buffer)
    218     """
--> 219     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    220 
    221 

~/anaconda3/lib/python3.7/site-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    290     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    291     pickler.persistent_id = persistent_id
--> 292     pickler.dump(obj)
    293 
    294     serialized_storage_keys = sorted(serialized_storages.keys())

~/fastai/fastai/callback.py in __getattr__(self, k)
     61 
     62     #Passthrough to the inner opt.
---> 63     def __getattr__(self, k:str)->Any: return getattr(self.opt, k, None)
     64     def __setstate__(self,data:Any): self.__dict__.update(data)
     65 

... last 1 frames repeated, from the frame below ...

~/fastai/fastai/callback.py in __getattr__(self, k)
     61 
     62     #Passthrough to the inner opt.
---> 63     def __getattr__(self, k:str)->Any: return getattr(self.opt, k, None)
     64     def __setstate__(self,data:Any): self.__dict__.update(data)
     65 

RecursionError: maximum recursion depth exceeded
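For what it's worth, my (possibly wrong) reading of the trace is that the repeated frame is the __getattr__ passthrough in callback.py: once purge has removed the wrapped optimizer, looking up self.opt inside __getattr__ fails, which calls __getattr__ again, and so on until Python hits the recursion limit. A stripped-down illustration of that general Python pattern (not the actual fastai class):

class OptWrapperSketch:
    # illustration only: a passthrough __getattr__ like the one in callback.py
    def __init__(self, opt=None):
        if opt is not None: self.opt = opt
    # if 'opt' is missing from __dict__, evaluating self.opt below fails,
    # so Python calls __getattr__('opt') again, recursing forever
    def __getattr__(self, k): return getattr(self.opt, k, None)

w = OptWrapperSketch()   # no inner opt, similar to a purged learner
w.anything               # RecursionError: maximum recursion depth exceeded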

Actually, I was experimenting a little just now, and it seems I can avoid this error by instantiating a new learner object and then calling load again. This works pretty well:

learn = cnn_learner(data, models.resnet34, metrics=error_rate).to_fp16()
learn.load(base_dir + '[stage_model_name]')

Still, it would be good to know why I have to instantiate a new learner. Is this the normal way to use load?
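One other thought: the load signature in the traceback shows a purge argument, and the error comes out of self.purge(), so I'm guessing (untested on my side) that skipping the purge step might also sidestep it:

learn.load(base_dir + '[stage_model_name]', purge=False)   # untested guess based on the signature in the traceback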


Hey, did you ever find a solution to this? Running into the same issue.


No, I didn't find a solution to this, but for some reason it is working now :man_shrugging:t2: Sorry, I know that's not very helpful.

I do have the notebook up on GitHub where I "had" the error happening, but the current version up there no longer triggers the error on my end. So maybe you can take a look there for some clues?

I suspect it may have had something to do with one of the following:

  1. Not restarting the kernel
  2. The order of execution
  3. Possibly using to_fp16() (maybe try it both with and without it? See the rough sketch below.)
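For point 3, something along these lines would let you compare the two (just a rough, untested sketch on my part, the model name is a placeholder):

learn_fp32 = cnn_learner(data, models.resnet34, metrics=error_rate)
learn_fp32.load(base_dir + '[stage_model_name]')

learn_fp16 = cnn_learner(data, models.resnet34, metrics=error_rate).to_fp16()
learn_fp16.load(base_dir + '[stage_model_name]')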