Exception occured in `LRFinder` when calling event `after_fit`

I’m running the 05_pet_breeds notebook on a GPU machine. Strangely, I run into this error:

RuntimeError                              Traceback (most recent call last)
Cell In[8], line 3
      1 learn = vision_learner(dls, resnet34, metrics=error_rate)
      2 # learn.remove_cb(ProgressCallback)
----> 3 lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

File /venv/lib/python3.8/site-packages/fastai/callback/schedule.py:293, in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggest_funcs)
    291 n_epoch = num_it//len(self.dls.train) + 1
    292 cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 293 with self.no_logging(): self.fit(n_epoch, cbs=cb)
    294 if suggest_funcs is not None:
    295     lrs, losses = tensor(self.recorder.lrs[num_it//10:-5]), tensor(self.recorder.losses[num_it//10:-5])

File /venv/lib/python3.8/site-packages/fastai/learner.py:264, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    262 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    263 self.n_epoch = n_epoch
--> 264 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File /venv/lib/python3.8/site-packages/fastai/learner.py:201, in Learner._with_events(self, f, event_type, ex, final)
    199 try: self(f'before_{event_type}');  f()
    200 except ex: self(f'after_cancel_{event_type}')
--> 201 self(f'after_{event_type}');  final()

File /venv/lib/python3.8/site-packages/fastai/learner.py:172, in Learner.__call__(self, event_name)
--> 172 def __call__(self, event_name): L(event_name).map(self._call_one)

File /venv/lib/python3.8/site-packages/fastcore/foundation.py:156, in L.map(self, f, *args, **kwargs)
--> 156 def map(self, f, *args, **kwargs): return self._new(map_ex(self, f, *args, gen=False, **kwargs))

File /venv/lib/python3.8/site-packages/fastcore/basics.py:840, in map_ex(iterable, f, gen, *args, **kwargs)
    838 res = map(g, iterable)
    839 if gen: return res
--> 840 return list(res)

File /venv/lib/python3.8/site-packages/fastcore/basics.py:825, in bind.__call__(self, *args, **kwargs)
    823     if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    824 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 825 return self.func(*fargs, **kwargs)

File /venv/lib/python3.8/site-packages/fastai/learner.py:176, in Learner._call_one(self, event_name)
    174 def _call_one(self, event_name):
    175     if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 176     for cb in self.cbs.sorted('order'): cb(event_name)
...
    175                        'to an existing device.')
    176 return device

RuntimeError: Exception occured in `LRFinder` when calling event `after_fit`:
	Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.

when I simply run learn.lr_find(). Does anybody know what could be causing this?
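
For reference, the after_fit step that fails here reloads a temporary checkpoint through torch.load, and the error is torch refusing to map the saved CUDA storages onto a GPU it can no longer see. A minimal sketch of what the message itself suggests (the checkpoint path is illustrative):

import torch

# By default torch.load restores tensors to the device they were saved from;
# if that CUDA device is no longer visible, it raises the error above.
state = torch.load('models/_tmp.pth', map_location='cpu')  # remap storages onto the CPU
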
Steps to reproduce:

! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastai.vision.all import *
from fastbook import *
path = untar_data(URLs.PETS)
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files, 
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")
learn = vision_learner(dls, resnet34, metrics=error_rate)
# learn.remove_cb(ProgressCallback)
lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

I’m having this error too. I can train the model using fit, though.

I tried to run these lines on Google Colab and my Paperspace instance, but I am not getting the error.

Yeah, somehow it has to do with the machine I’m running on. It doesn’t happen on other machines with GPUs that I tried either. Not sure what’s different about that machine, but lr_find will throw that error.

You can use mamba to set up fastai again from scratch and see if that fixes the problem. Or you can switch to another environment, such as Paperspace.

Jeremy has live-coding videos where he goes over how to set up an environment on Paperspace, along with general programming tips, like using vim and the debugger. The series is pretty long, but very helpful.

I cannot install mamba in the environment where this error is appearing… so yes, it seems to be something related to the specific machine itself.

One solution is to patch the after_fit callback in LRFinder to load the temporary model onto the CPU (courtesy of ChatGPT for this solution):

from fastcore.basics import patch_to  # explicit import, in case it isn't already in scope
from fastai.callback.schedule import LRFinder

@patch_to(LRFinder)
def after_fit(self):
    self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
    tmp_f = self.path/self.model_dir/self.tmp_p/'_tmp.pth'
    if tmp_f.exists():
        # device='cpu' maps the saved CUDA storages onto the CPU instead of the
        # (now invisible) GPU, which is exactly what the error message asks for
        self.learn.load(f'{self.tmp_p}/_tmp', with_opt=True, device='cpu')
        self.tmp_d.cleanup()
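
With the patch in place, the original call from the notebook should go through again, e.g.:

learn = vision_learner(dls, resnet34, metrics=error_rate)
lr_min, lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

Note that device='cpu' only changes where torch deserializes the saved storages; in standard PyTorch, load_state_dict then copies those weights into the model’s existing parameters, so a model that already sits on the GPU stays there.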

I found out that apparently it’s a problem with torch itself: it expects an integer GPU identifier but in this case receives a different identifier format (I think a UUID). The workaround I found is to run this command before activating the environment:

export CUDA_VISIBLE_DEVICES=`nvidia-smi -L |grep \`echo $CUDA_VISIBLE_DEVICES\` |awk '{print $2}' |sed 's/://'`

which rewrites $CUDA_VISIBLE_DEVICES from the GPU’s UUID to its integer index. Worked for me!
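
A quick sanity check that the fix took effect (note that torch reads CUDA_VISIBLE_DEVICES only when CUDA is first initialized, so run this in a fresh process with the variable already set):

import os
import torch

print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # expect an integer index such as "0"
print(torch.cuda.device_count())               # expect >= 1 instead of the 0 from the error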

Thanks!

To run this in a Jupyter notebook and keep the environment variable across cells, I had to execute the command from Python like this (courtesy of GPT-4):

import os
import subprocess

# Pull the integer GPU index out of `nvidia-smi -L` ("GPU 0: ..." -> "0").
# Note: this extracts every index it finds, so it assumes a single visible GPU.
result = subprocess.check_output("nvidia-smi -L | grep -oE '[0-9]+:' | tr -d ':'", shell=True).decode("utf-8").strip()
os.environ['CUDA_VISIBLE_DEVICES'] = result

print(os.environ['CUDA_VISIBLE_DEVICES'])
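
Since the grep above grabs every index nvidia-smi lists, on a multi-GPU machine a sketch closer to the shell one-liner may be safer: it looks up the UUID currently in CUDA_VISIBLE_DEVICES (assuming nvidia-smi -L prints lines like "GPU 0: <name> (UUID: GPU-...)"):

import os
import subprocess

uuid = os.environ.get('CUDA_VISIBLE_DEVICES', '')
for line in subprocess.check_output(['nvidia-smi', '-L']).decode().splitlines():
    if uuid and uuid in line:
        os.environ['CUDA_VISIBLE_DEVICES'] = line.split()[1].rstrip(':')  # "GPU 0:" -> "0"
        break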