Exception occured in `LRFinder` when calling event `after_fit`

I’m running the 05_pet_breeds notebook on a GPU machine. Strangely, I run into this error:

RuntimeError                              Traceback (most recent call last)
Cell In[8], line 3
      1 learn = vision_learner(dls, resnet34, metrics=error_rate)
      2 # learn.remove_cb(ProgressCallback)
----> 3 lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

File /venv/lib/python3.8/site-packages/fastai/callback/schedule.py:293, in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggest_funcs)
    291 n_epoch = num_it//len(self.dls.train) + 1
    292 cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 293 with self.no_logging(): self.fit(n_epoch, cbs=cb)
    294 if suggest_funcs is not None:
    295     lrs, losses = tensor(self.recorder.lrs[num_it//10:-5]), tensor(self.recorder.losses[num_it//10:-5])

File /venv/lib/python3.8/site-packages/fastai/learner.py:264, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt, start_epoch)
    262 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    263 self.n_epoch = n_epoch
--> 264 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File /venv/lib/python3.8/site-packages/fastai/learner.py:201, in Learner._with_events(self, f, event_type, ex, final)
    199 try: self(f'before_{event_type}');  f()
    200 except ex: self(f'after_cancel_{event_type}')
--> 201 self(f'after_{event_type}');  final()

File /venv/lib/python3.8/site-packages/fastai/learner.py:172, in Learner.__call__(self, event_name)
--> 172 def __call__(self, event_name): L(event_name).map(self._call_one)

File /venv/lib/python3.8/site-packages/fastcore/foundation.py:156, in L.map(self, f, *args, **kwargs)
--> 156 def map(self, f, *args, **kwargs): return self._new(map_ex(self, f, *args, gen=False, **kwargs))

File /venv/lib/python3.8/site-packages/fastcore/basics.py:840, in map_ex(iterable, f, gen, *args, **kwargs)
    838 res = map(g, iterable)
    839 if gen: return res
--> 840 return list(res)

File /venv/lib/python3.8/site-packages/fastcore/basics.py:825, in bind.__call__(self, *args, **kwargs)
    823     if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    824 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 825 return self.func(*fargs, **kwargs)

File /venv/lib/python3.8/site-packages/fastai/learner.py:176, in Learner._call_one(self, event_name)
    174 def _call_one(self, event_name):
    175     if not hasattr(event, event_name): raise Exception(f'missing {event_name}')
--> 176     for cb in self.cbs.sorted('order'): cb(event_name)
...
    175                        'to an existing device.')
    176 return device

RuntimeError: Exception occured in `LRFinder` when calling event `after_fit`:
	Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.

when I simply run learn.lr_find(). Does anybody know what could be causing this?
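
For reference, the after_fit step that fails here reloads a temporary checkpoint through torch.load, and the error is torch refusing to map the saved CUDA storages onto a GPU it can no longer see. A minimal sketch of what the message itself suggests (the checkpoint path is illustrative):

import torch

# By default torch.load restores tensors to the device they were saved from;
# if that CUDA device is no longer visible, it raises the error above.
state = torch.load('models/_tmp.pth', map_location='cpu')  # remap storages onto the CPU
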
Steps to reproduce:

! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastai.vision.all import *
from fastbook import *
path = untar_data(URLs.PETS)
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files, 
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")
learn = vision_learner(dls, resnet34, metrics=error_rate)
# learn.remove_cb(ProgressCallback)
lr_min,lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

I’m having this error too. I can train the model using fit, though.

I tried to run these lines on Google Colab and my Paperspace instance, but I am not getting the error.

Yeah, somehow it has to do with the machine I’m running on. It doesn’t happen on other machines with GPUs that I tried either. Not sure what’s different about that machine, but lr_find will throw that error.

You can use mamba to set up fastai again from scratch and see if that fixes the problem. Or you can switch to another environment, such as Paperspace.

Jeremy has live-coding videos where he goes over how to set up an environment on Paperspace, along with general programming tips, like using vim and the debugger. The series is pretty long, but very helpful.

I cannot install mamba in the environment where this error is appearing… so yes, it seems to be something related to the specific machine itself.

One solution is to patch the after_fit callback in LRFinder to load the temporary model onto the CPU (courtesy of ChatGPT for this solution):

from fastcore.basics import patch_to  # explicit import, in case it isn't already in scope
from fastai.callback.schedule import LRFinder

@patch_to(LRFinder)
def after_fit(self):
    self.learn.opt.zero_grad() # Needed before detaching the optimizer for future fits
    tmp_f = self.path/self.model_dir/self.tmp_p/'_tmp.pth'
    if tmp_f.exists():
        # device='cpu' maps the saved CUDA storages onto the CPU instead of the
        # (now invisible) GPU, which is exactly what the error message asks for
        self.learn.load(f'{self.tmp_p}/_tmp', with_opt=True, device='cpu')
        self.tmp_d.cleanup()
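
With the patch in place, the original call from the notebook should go through again, e.g.:

learn = vision_learner(dls, resnet34, metrics=error_rate)
lr_min, lr_steep = learn.lr_find(suggest_funcs=(minimum, steep))

Note that device='cpu' only changes where torch deserializes the saved storages; in standard PyTorch, load_state_dict then copies those weights into the model’s existing parameters, so a model that already sits on the GPU stays there.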

I found out that apparently it’s a problem with torch itself: it expects an integer GPU identifier but in this case receives a different identifier format (I think a UUID). The workaround I found is to run this command before activating the environment:

export CUDA_VISIBLE_DEVICES=`nvidia-smi -L |grep \`echo $CUDA_VISIBLE_DEVICES\` |awk '{print $2}' |sed 's/://'`

which rewrites $CUDA_VISIBLE_DEVICES from the GPU’s UUID to its integer index. Worked for me!
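
A quick sanity check that the fix took effect (note that torch reads CUDA_VISIBLE_DEVICES only when CUDA is first initialized, so run this in a fresh process with the variable already set):

import os
import torch

print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # expect an integer index such as "0"
print(torch.cuda.device_count())               # expect >= 1 instead of the 0 from the error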

Thanks!

To run this in a Jupyter notebook and keep the environment variable across cells, I had to execute the command from Python like this (courtesy of GPT-4):

import os
import subprocess

# Pull the integer GPU index out of `nvidia-smi -L` ("GPU 0: ..." -> "0").
# Note: this extracts every index it finds, so it assumes a single visible GPU.
result = subprocess.check_output("nvidia-smi -L | grep -oE '[0-9]+:' | tr -d ':'", shell=True).decode("utf-8").strip()
os.environ['CUDA_VISIBLE_DEVICES'] = result

print(os.environ['CUDA_VISIBLE_DEVICES'])
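
Since the grep above grabs every index nvidia-smi lists, on a multi-GPU machine a sketch closer to the shell one-liner may be safer: it looks up the UUID currently in CUDA_VISIBLE_DEVICES (assuming nvidia-smi -L prints lines like "GPU 0: <name> (UUID: GPU-...)"):

import os
import subprocess

uuid = os.environ.get('CUDA_VISIBLE_DEVICES', '')
for line in subprocess.check_output(['nvidia-smi', '-L']).decode().splitlines():
    if uuid and uuid in line:
        os.environ['CUDA_VISIBLE_DEVICES'] = line.split()[1].rstrip(':')  # "GPU 0:" -> "0"
        break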