Unable to find a valid cuDNN algorithm to run convolution

Hi all!

I have been struggling for about three days now, first to get fastbook working via Anaconda on Windows 10, which I believe I have done (many thanks to threads like Installation of fastai v2 on Windows). Now I have a problem that seems specific to my dataset. I am fairly new, and am trying to follow the advice of the first few lectures by working on my own problem as I go through the course/book (which just arrived!). I get the error message in the title when attempting to train a simple foreground/background classifier using a ResNet, and searching for that error on the forum returns no results.

Output of show_install:
=== Software === 
python        : 3.10.4
fastai        : 2.6.3
fastcore      : 1.4.4
fastprogress  : 1.0.2
torch         : 1.11.0
nvidia driver : 496.76
torch cuda    : 11.3 / is available
torch cudnn   : 8200 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : NVIDIA GeForce RTX 3090

=== Environment === 
platform      : Windows-10-10.0.19044-SP0
conda env     : fastai2
python        : D:\Anaconda\envs\fastai2\python.exe
sys.path      : D:\pytorch
D:\Anaconda\envs\fastai2\python310.zip
D:\Anaconda\envs\fastai2\DLLs
D:\Anaconda\envs\fastai2\lib
D:\Anaconda\envs\fastai2

D:\Anaconda\envs\fastai2\lib\site-packages
D:\Anaconda\envs\fastai2\lib\site-packages\win32
D:\Anaconda\envs\fastai2\lib\site-packages\win32\lib
D:\Anaconda\envs\fastai2\lib\site-packages\Pythonwin

My immediate goal is hopefully fairly straightforward: I am trying to adapt the cell in 01_intro that performs semantic segmentation of an image to work on my own images. My setup is somewhat simpler. I have 16-bit single-channel images in .tif format (512x512) that contain grayscale fluorescence data of zebrafish, and 8-bit mask files with either 0 for background or 1 for foreground (in the images and labels folders, respectively). My total code is only:

import fastbook
fastbook.setup_book()
from fastbook import *
path = Path("D:/pytorch/data/2D_Zebrafish")
codes = ['Background', 'Zebrafish']

dls = SegmentationDataLoaders.from_label_func(
    path, bs=2, fnames=get_image_files(path/"images"),
    label_func=lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
    codes=codes, num_workers=0  # num_workers=0 avoids multiprocessing issues on Windows
)

learn = unet_learner(dls, resnet18)
learn.fine_tune(8)

I get the error listed in the title the first time I try to run this code.

Does anyone have any suggestions? I can provide a sample image/label if that helps.
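In case it helps narrow things down, here is the kind of sanity check I would run on the DataLoaders themselves (a minimal sketch; the expected shapes are my assumptions for this data):

# Inspect one batch: shapes, dtypes, and the label values actually seen
xb, yb = dls.one_batch()
print(xb.shape, xb.dtype)  # expecting [2, channels, 512, 512] floats
print(yb.shape, yb.dtype)  # expecting [2, 512, 512] integer masks
print(yb.unique())         # should be tensor([0, 1]) for two codes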

Full error message:
D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  ret = func(*args, **kwargs)

 0.00% [0/1 00:00<00:00]
epoch	train_loss	valid_loss	time

 0.00% [0/77 00:00<00:00]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [7], in <cell line: 8>()
      1 dls = SegmentationDataLoaders.from_label_func(
      2     path, bs=2, fnames = get_image_files(path/"images"),
      3     label_func = lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
      4     codes =codes, num_workers=0
      5 )
      7 learn = unet_learner(dls, resnet18)
----> 8 learn.fine_tune(1)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:161, in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
    159 "Fine tune with `Learner.freeze` for `freeze_epochs`, then with `Learner.unfreeze` for `epochs`, using discriminative LR."
    160 self.freeze()
--> 161 self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    162 base_lr /= 2
    163 self.unfreeze()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:116, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    113 lr_max = np.array([h['lr'] for h in self.opt.hypers])
    114 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    115           'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:222, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt)
    220 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    221 self.n_epoch = n_epoch
--> 222 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:213, in Learner._do_fit(self)
    211 for epoch in range(self.n_epoch):
    212     self.epoch=epoch
--> 213     self._with_events(self._do_epoch, 'epoch', CancelEpochException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:207, in Learner._do_epoch(self)
    206 def _do_epoch(self):
--> 207     self._do_epoch_train()
    208     self._do_epoch_validate()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:199, in Learner._do_epoch_train(self)
    197 def _do_epoch_train(self):
    198     self.dl = self.dls.train
--> 199     self._with_events(self.all_batches, 'train', CancelTrainException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:170, in Learner.all_batches(self)
    168 def all_batches(self):
    169     self.n_iter = len(self.dl)
--> 170     for o in enumerate(self.dl): self.one_batch(*o)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:195, in Learner.one_batch(self, i, b)
    193 b = self._set_device(b)
    194 self._split(b)
--> 195 self._with_events(self._do_one_batch, 'batch', CancelBatchException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:181, in Learner._do_one_batch(self)
    179 if not self.training or not len(self.yb): return
    180 self('before_backward')
--> 181 self.loss_grad.backward()
    182 self._with_events(self.opt.step, 'step', CancelStepException)
    183 self.opt.zero_grad()

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:355, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    308 r"""Computes the gradient of current tensor w.r.t. graph leaves.
    309 
    310 The graph is differentiated using the chain rule. If the tensor is
   (...)
    352         used to compute the attr::tensors.
    353 """
    354 if has_torch_function_unary(self):
--> 355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
    358         self,
    359         gradient=gradient,
    360         retain_graph=retain_graph,
    361         create_graph=create_graph,
    362         inputs=inputs)
    363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\overrides.py:1394, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1388     warnings.warn("Defining your `__torch_function__ as a plain method is deprecated and "
   1389                   "will be an error in PyTorch 1.11, please define it as a classmethod.",
   1390                   DeprecationWarning)
   1392 # Use `public_api` instead of `implementation` so __torch_function__
   1393 # implementations can do equality/identity comparisons.
-> 1394 result = torch_func_method(public_api, types, args, kwargs)
   1396 if result is not NotImplemented:
   1397     return result

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\torch_core.py:341, in TensorBase.__torch_function__(self, func, types, args, kwargs)
    339 convert=False
    340 if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 341 res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
    342 if convert: res = convert(res)
    343 if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1139     return NotImplemented
   1141 with _C.DisableTorchFunction():
-> 1142     ret = func(*args, **kwargs)
   1143     if func in get_default_nowrap_functions():
   1144         return ret

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:363, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    354 if has_torch_function_unary(self):
    355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
   (...)
    361         create_graph=create_graph,
    362         inputs=inputs)
--> 363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\autograd\__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    168     retain_graph = create_graph
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

In my Python console window I also get assertion messages at the same time as the first error (screenshot not reproduced here), which suggests that maybe my label images are set up incorrectly. However, I have checked that the background values are 0 and the label values are 1. Other posts I have seen asking about assertions related to the number of classes involved binary masks of 0 and 255, or non-adjacent label values; neither of those is the case here.
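The check I ran looks roughly like this (the file name below is just a made-up example):

# Verify the values actually stored in a label file (example file name)
from PIL import Image
import numpy as np

mask = np.array(Image.open(path/'labels'/'fish01_annotationLabels.tif'))
print(mask.dtype, np.unique(mask))  # expecting something like: uint8 [0 1]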

If I attempt to run the code again, I get:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I then need to restart the kernel in order to change settings and try again.
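As I understand it, CUDA_LAUNCH_BLOCKING has to be set before CUDA is initialized for the suggestion above to take effect, i.e. before torch is imported; something like this in the first cell:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call
import torch  # kernels now launch synchronously, so asserts point at the real call site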

I have looked around and found the following:

- RuntimeError: no valid convolution algorithms available in CuDNN · Issue #3031 · fastai/fastai · GitHub, although this seems to concern an earlier version; I have created three environments so far and the issue persists.
- https://towardsdatascience.com/cuda-error-device-side-assert-triggered-c6ae1c8fa4c3 suggests the input to my loss function may be at fault, although that relates only to the error that shows up when running the code a second time.
- Several PyTorch posts on Stack Exchange concern insufficient VRAM or old video cards, but I did not see anything I could relate to my current setup other than reducing the batch size; I tried batch sizes from 1 to 8 without any effect on the outcome.

The error seems to occur in learn.fine_tune(), and I have tried several values there without success.

I'd appreciate any help - I post a bit on the image.sc forum, but I am totally lost here!

Cheers,
Mike

In case anyone else runs across this: the error message is a bit misleading. A colleague pointed out that the real problem is that unet_learner needs to be changed quite a bit, as it does not adapt automatically to inputs that differ from its defaults (it assumes 3-channel images). It also does not generate a warning that the unet settings are not going to work!

The unet_learner call that worked:

# n_in=1: single-channel (grayscale) input; n_out=1 plus BCE loss: one logit per pixel for a binary mask
learn = unet_learner(dls, resnet18, n_in=1, n_out=1, loss_func=BCEWithLogitsLossFlat())
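I believe keeping the two codes as explicit classes should also work, with two output channels and fastai's usual segmentation loss, though I have not tested this variant myself:

# Untested alternative: one output channel per code, cross-entropy over the channel axis
learn = unet_learner(dls, resnet18, n_in=1, n_out=2, loss_func=CrossEntropyLossFlat(axis=1))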