CUDA Runtime Error (801) with learn.predict() in Jupyter

Hey there! :innocent:

I come to share an issue I'm facing. Maybe some of you are in the same situation, and I hope we can find an answer.

As a beginner I'm practicing fastai.vision on “Urban Sound Classification” by converting the audio files to spectrograms. I hit a first CUDA runtime error when creating the ImageDataLoaders, and the solution was using num_workers=0. Everything went well with the training, and now I want to make some predictions with my test set. For now I've just tried to predict a single image, and here's what I got:

img = PILImage.create('../Urban Sound Classification/Data Sources/test_spectrogram/416.png')
img = img.resize((300,300))  # resize returns a new image, so assign the result
learn.predict(TensorImage(image2tensor(img)))  # not sure if it's the correct way to pass an image
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-fb76c5c3289b> in <module>
----> 1 learn.predict(TensorImage(image2tensor(img)))

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\learner.py in predict(self, item, rm_type_tfms, with_input)
    246     def predict(self, item, rm_type_tfms=None, with_input=False):
    247         dl = self.dls.test_dl([item], rm_type_tfms=rm_type_tfms, num_workers=0)
--> 248         inp,preds,_,dec_preds = self.get_preds(dl=dl, with_input=True, with_decoded=True)
    249         i = getattr(self.dls, 'n_inp', -1)
    250         inp = (inp,) if i==1 else tuplify(inp)

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\learner.py in get_preds(self, ds_idx, dl, with_input, with_decoded, with_loss, act, inner, reorder, cbs, n_workers, **kwargs)
    233         if with_loss: ctx_mgrs.append(self.loss_not_reduced())
    234         with ContextManagers(ctx_mgrs):
--> 235             self._do_epoch_validate(dl=dl)
    236             if act is None: act = getattr(self.loss_func, 'activation', noop)
    237             res = cb.all_tensors()

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\learner.py in _do_epoch_validate(self, ds_idx, dl)
    186         if dl is None: dl = self.dls[ds_idx]
    187         self.dl = dl
--> 188         with torch.no_grad(): self._with_events(self.all_batches, 'validate', CancelValidException)
    189 
    190     def _do_epoch(self):

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
    153 
    154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
    156         except ex: self(f'after_cancel_{event_type}')
    157         finally:   self(f'after_{event_type}')        ;final()

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\learner.py in all_batches(self)
    159     def all_batches(self):
    160         self.n_iter = len(self.dl)
--> 161         for o in enumerate(self.dl): self.one_batch(*o)
    162 
    163     def _do_one_batch(self):

~\miniconda3\envs\tensorflow\lib\site-packages\fastai\data\load.py in __iter__(self)
    101         self.randomize()
    102         self.before_iter()
--> 103         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
    104             if self.device is not None: b = to_device(b, self.device)
    105             yield self.after_batch(b)

~\miniconda3\envs\tensorflow\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    735             #     before it starts, and __del__ tries to join but will get:
    736             #     AssertionError: can only join a started process.
--> 737             w.start()
    738             self._index_queues.append(index_queue)
    739             self._workers.append(w)

~\miniconda3\envs\tensorflow\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

~\miniconda3\envs\tensorflow\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

~\miniconda3\envs\tensorflow\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

~\miniconda3\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)

~\miniconda3\envs\tensorflow\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

~\miniconda3\envs\tensorflow\lib\site-packages\torch\multiprocessing\reductions.py in reduce_tensor(tensor)
    238          ref_counter_offset,
    239          event_handle,
--> 240          event_sync_required) = storage._share_cuda_()
    241         tensor_offset = tensor.storage_offset()
    242         shared_cache[handle] = StorageWeakRef(storage)

RuntimeError: cuda runtime error (801) : operation not supported at ..\torch/csrc/generic/StorageSharing.cpp:247

The full project is on GitHub here, and the error is at the bottom of the notebook, if that helps: https://github.com/loicrouillermonay/Machine-Learning-Projects/blob/urban-sound-classification/Urban%20Sound%20Classification/Urban%20Sound%20Classification.ipynb
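
For reference, the num_workers=0 workaround I mention above looked roughly like this when building the loaders. This is only a sketch: the path, label_func, and item_tfms below are placeholders for illustration, not the notebook's actual code.

from fastai.vision.all import *

# On Windows, num_workers=0 keeps the DataLoaders single-process, so no
# worker processes get spawned (and nothing CUDA-related gets pickled).
# The path and labelling scheme below are hypothetical placeholders.
path = Path('../Urban Sound Classification/Data Sources/train_spectrogram')

def label_func(fname): return fname.stem.split('-')[0]  # hypothetical labels

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=label_func, item_tfms=Resize(224),
    num_workers=0)  # the crucial argument on Windows
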

Hey Loïc,

Looks like a really cool project! Have you tried just passing the full filename (path to file) to predict as is done in the docs tutorials?

Thanks for your time and your answer. :relieved: I also tried to pass the image as in the docs tutorial and got the same error. The code I tried is below:

files = get_image_files('../Urban Sound Classification/Data Sources/test_spectrogram')
# files[0] -> Path('../Urban Sound Classification/Data Sources/test_spectrogram/1002.png')
learn.predict(files[0])

I'm getting the same error. Earlier in the module it was fixed by passing num_workers=0 into ImageDataLoaders.from_name_func, but learn.predict doesn't take num_workers as a parameter…
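
One thing worth trying is to skip predict and do what it does internally, forcing single-process loading at both steps. This is just a sketch built from the test_dl and get_preds calls visible in the traceback above:

# Build the one-item test DataLoader without workers, then run inference
# on it directly, passing n_workers=0 to get_preds as well (its signature,
# visible in the traceback, accepts that parameter).
dl = learn.dls.test_dl([files[0]], num_workers=0)
preds, _, dec_preds = learn.get_preds(dl=dl, with_decoded=True, n_workers=0)
print(preds[0], dec_preds[0])  # raw probabilities and the decoded prediction
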

Same problem, stuck here. Did you figure it out?

I feel as if this is a recent issue, maybe related to a recent version? I'm seeing cases where passing num_workers=0 into the learner was sufficient for learn.predict, but for us in this thread it isn't.

I couldn't find any solution to make my code work on Windows. However, I have just downloaded exactly the same code on macOS (a second computer) and it works. In the meantime I'll finish my project this way: training the model on Windows and making predictions on macOS :upside_down_face:

Call get_preds() with n_workers=0 inside predict() in fastai\learner.py; then it works on Windows.

Remember to reload the fastai module if you test it in a Jupyter notebook.
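
If you'd rather not edit the installed file, a notebook-level monkey-patch that changes the default should achieve the same thing. A sketch, relying only on get_preds accepting n_workers (as the traceback above shows):

from fastai.learner import Learner

# Default n_workers to 0 so every internal call to get_preds, including
# the one inside predict, runs single-process on Windows.
_orig_get_preds = Learner.get_preds

def _get_preds_nw0(self, *args, n_workers=0, **kwargs):
    return _orig_get_preds(self, *args, n_workers=n_workers, **kwargs)

Learner.get_preds = _get_preds_nw0
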


Thank you for the response.

To do that, does one have to:

1. git clone https://github.com/fastai/fastai
2. modify line 248 of fastai/learner.py to read:
   inp,preds,_,dec_preds = self.get_preds(dl=dl, with_input=True, with_decoded=True, n_workers=0)
3. pip install -e "fastai[dev]"

?


That’s probably the correct way to do it, but I just located Lib\site-packages\fastai\learner.py in my Python folder (in my case, the folder for the conda env I’m using) and edited it there.

Obviously this will be overwritten if I update the fastai package, but hopefully a future update to fastai will fix the issue anyway :slightly_smiling_face:


Thanks… for anyone else who encounters this and is as green with Python/Anaconda as I am: I didn't have a Lib\site-packages\fastai\learner.py in the conda env I was using, but I did have two fastai* folders in C:\users\<my username>\.conda\pkgs, and after making the learner.py change in both of them it worked. I'm sure I would only have needed to change one of them, but I did both just to be safe. I then switched away from and back to the conda env, restarted the Chapter 01 Jupyter notebook, and did a 'Restart and Run All' from the Kernel menu, after which the learn.predict function worked as expected! Cool.

The two fastai* folders I had were:
fastai-2.0.9-pyh39e3cac_0
fastai-2.0.10-py_0

I also tried pip install -e "fastai[dev]" after the git clone, but got this error (which I'm not overly concerned about, given that things are 'working'). If anyone has any ideas, I'm all ears:

(anotherenv) C:\Users\Owner\anotherenv>pip install -e "fastai[dev]"
Obtaining file:///C:/Users/Owner/anotherenv/fastai
Requirement already satisfied: pip in c:\users\owner\.conda\envs\anotherenv\lib\site-packages (from fastai==2.0.14) (20.2.2)
Collecting packaging
  Using cached packaging-20.4-py2.py3-none-any.whl (37 kB)
Collecting fastcore>=1.0.5
  Using cached fastcore-1.0.15-py3-none-any.whl (40 kB)
ERROR: Could not find a version that satisfies the requirement torchvision>=0.7 (from fastai==2.0.14) (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3, 0.5.0)
ERROR: No matching distribution found for torchvision>=0.7 (from fastai==2.0.14)

Also, I'm running into another issue later in the module:

path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames = get_image_files(path/"images"),
    label_func = lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes = np.loadtxt(path/'codes.txt', dtype=str)
)

learn = unet_learner(dls, resnet34)
learn.fine_tune(8)

This gives the error below. Any suggestions as to where I should add n_workers/num_workers=0 here?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-21-b0aa8ba5c292> in <module>
  7 
  8 learn = unet_learner(dls, resnet34)
----> 9 learn.fine_tune(8)

~\.conda\envs\thefinalenv\lib\site-packages\fastcore\utils.py in _f(*args, **kwargs)
471         init_args.update(log)
472         setattr(inst, 'init_args', init_args)
--> 473         return inst if to_return else f(*args, **kwargs)
474     return _f
475 

~\.conda\envs\thefinalenv\lib\site-packages\fastai\callback\schedule.py in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
159     "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
160     self.freeze()
--> 161     self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
162     base_lr /= 2
163     self.unfreeze()

~\.conda\envs\thefinalenv\lib\site-packages\fastcore\utils.py in _f(*args, **kwargs)
471         init_args.update(log)
472         setattr(inst, 'init_args', init_args)
--> 473         return inst if to_return else f(*args, **kwargs)
474     return _f
475 

~\.conda\envs\thefinalenv\lib\site-packages\fastai\callback\schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
111     scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
112               'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 113     self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
114 
115 # Cell

~\.conda\envs\thefinalenv\lib\site-packages\fastcore\utils.py in _f(*args, **kwargs)
471         init_args.update(log)
472         setattr(inst, 'init_args', init_args)
--> 473         return inst if to_return else f(*args, **kwargs)
474     return _f
475 

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
205             self.opt.set_hypers(lr=self.lr if lr is None else lr)
206             self.n_epoch,self.loss = n_epoch,tensor(0.)
--> 207             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
208 
209     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
153 
154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
156         except ex: self(f'after_cancel_{event_type}')
157         finally:   self(f'after_{event_type}')        ;final()

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _do_fit(self)
195         for epoch in range(self.n_epoch):
196             self.epoch=epoch
--> 197             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
198 
199     @log_args(but='cbs')

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
153 
154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
156         except ex: self(f'after_cancel_{event_type}')
157         finally:   self(f'after_{event_type}')        ;final()

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _do_epoch(self)
189 
190     def _do_epoch(self):
--> 191         self._do_epoch_train()
192         self._do_epoch_validate()
193 

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _do_epoch_train(self)
181     def _do_epoch_train(self):
182         self.dl = self.dls.train
--> 183         self._with_events(self.all_batches, 'train', CancelTrainException)
184 
185     def _do_epoch_validate(self, ds_idx=1, dl=None):

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
153 
154     def _with_events(self, f, event_type, ex, final=noop):
--> 155         try:       self(f'before_{event_type}')       ;f()
156         except ex: self(f'after_cancel_{event_type}')
157         finally:   self(f'after_{event_type}')        ;final()

~\.conda\envs\thefinalenv\lib\site-packages\fastai\learner.py in all_batches(self)
159     def all_batches(self):
160         self.n_iter = len(self.dl)
--> 161         for o in enumerate(self.dl): self.one_batch(*o)
162 
163     def _do_one_batch(self):

~\.conda\envs\thefinalenv\lib\site-packages\fastai\data\load.py in __iter__(self)
101         self.randomize()
102         self.before_iter()
--> 103         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
104             if self.device is not None: b = to_device(b, self.device)
105             yield self.after_batch(b)

~\.conda\envs\thefinalenv\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
735             #     before it starts, and __del__ tries to join but will get:
736             #     AssertionError: can only join a started process.
--> 737             w.start()
738             self._index_queues.append(index_queue)
739             self._workers.append(w)

~\.conda\envs\thefinalenv\lib\multiprocessing\process.py in start(self)
119                'daemonic processes are not allowed to have children'
120         _cleanup()
--> 121         self._popen = self._Popen(self)
122         self._sentinel = self._popen.sentinel
123         # Avoid a refcycle if the target function holds an indirect

~\.conda\envs\thefinalenv\lib\multiprocessing\context.py in _Popen(process_obj)
222     @staticmethod
223     def _Popen(process_obj):
--> 224         return _default_context.get_context().Process._Popen(process_obj)
225 
226 class DefaultContext(BaseContext):

~\.conda\envs\thefinalenv\lib\multiprocessing\context.py in _Popen(process_obj)
325         def _Popen(process_obj):
326             from .popen_spawn_win32 import Popen
--> 327             return Popen(process_obj)
328 
329     class SpawnContext(BaseContext):

~\.conda\envs\thefinalenv\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
 91             try:
 92                 reduction.dump(prep_data, to_child)
---> 93                 reduction.dump(process_obj, to_child)
 94             finally:
 95                 set_spawning_popen(None)

~\.conda\envs\thefinalenv\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
 58 def dump(obj, file, protocol=None):
 59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
 61 
 62 #

~\.conda\envs\thefinalenv\lib\site-packages\torch\multiprocessing\reductions.py in reduce_tensor(tensor)
238          ref_counter_offset,
239          event_handle,
--> 240          event_sync_required) = storage._share_cuda_()
241         tensor_offset = tensor.storage_offset()
242         shared_cache[handle] = StorageWeakRef(storage)

RuntimeError: cuda runtime error (801) : operation not supported at ..\torch/csrc/generic/StorageSharing.cpp:247

Unfortunately the segmentation example always fails with an out-of-GPU-memory error on my machine, so I can't tell you anything more about how to get that one working :frowning:

Re the errors you see with the pip install command: remember that pip is an alternative dependency management system to conda, so you can't always rely on the two working well together. Normally you can set up an environment with conda as far as possible and then install further packages into it with pip without trouble, but you may then run into problems if you try to add or upgrade things with conda after that. Of course, the advantage of using environments is that if all else fails you can just start again…

Hey, in general you want to pass num_workers=0 at the end of your DataLoaders call (refer).

I can personally vouch for the aforementioned trick working in your case; it's from page 42, right?

For clarity, the following code will work (I'm on 64-bit Windows 10 with a GTX 960):

from fastai.vision.all import *

path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames = get_image_files(path/"images"),
    label_func = lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes = np.loadtxt(path/'codes.txt', dtype=str),
    num_workers=0)
learn = unet_learner(dls, resnet34)
learn.fine_tune(13)

Note that the only real difference between your code and the code above is the added num_workers=0 in the third-from-last line.

Hope this helps :slight_smile:


This code fixed my problem.