Unable to find a valid cuDNN algorithm to run convolution

Hi all!

I have been struggling for about three days now, first to get fastbook working via Anaconda on Windows 10, which I believe I have done (many thanks to threads like Installation of fastai v2 on Windows). Now I have a problem that seems to be specific to my dataset. I am fairly new, and am trying to follow the advice of the first few lectures by working on my own problem as I go through the course/book (which just arrived!). I get the error message in the title when attempting to train a simple foreground/background classifier using a ResNet. Searching for that error on the forum returns no results.

Output of show_install
=== Software === 
python        : 3.10.4
fastai        : 2.6.3
fastcore      : 1.4.4
fastprogress  : 1.0.2
torch         : 1.11.0
nvidia driver : 496.76
torch cuda    : 11.3 / is available
torch cudnn   : 8200 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : NVIDIA GeForce RTX 3090

=== Environment === 
platform      : Windows-10-10.0.19044-SP0
conda env     : fastai2
python        : D:\Anaconda\envs\fastai2\python.exe
sys.path      : D:\pytorch


My immediate goal is hopefully fairly straightforward: I am trying to adapt the cell in 01_intro that performs semantic segmentation of an image to work on my own images. My own setup is somewhat simpler. I have 16-bit single-channel images in .tif format (512x512) that contain grayscale fluorescent data of zebrafish, and 8-bit mask files with either 0 for background or 1 for foreground (in the images and labels folders, respectively). My total code is only:

import fastbook
from fastbook import *
path =Path("D:/pytorch/data/2D_Zebrafish")
codes = ['Background', 'Zebrafish']

dls = SegmentationDataLoaders.from_label_func(
    path, bs=2, fnames = get_image_files(path/"images"),
    label_func = lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
    codes =codes, num_workers=0
)

learn = unet_learner(dls, resnet18)
learn.fine_tune(1)

My goal is to train a simple foreground/background classifier, but I get the error listed in the title the first time I run the code.

Does anyone have any suggestions? I can provide a sample image/label if that helps.

Full error message
D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  ret = func(*args, **kwargs)

 0.00% [0/1 00:00<00:00]
epoch	train_loss	valid_loss	time

 0.00% [0/77 00:00<00:00]
RuntimeError                              Traceback (most recent call last)
Input In [7], in <cell line: 8>()
      1 dls = SegmentationDataLoaders.from_label_func(
      2     path, bs=2, fnames = get_image_files(path/"images"),
      3     label_func = lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
      4     codes =codes, num_workers=0
      5 )
      7 learn = unet_learner(dls, resnet18)
----> 8 learn.fine_tune(1)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:161, in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
    159 "Fine tune with `Learner.freeze` for `freeze_epochs`, then with `Learner.unfreeze` for `epochs`, using discriminative LR."
    160 self.freeze()
--> 161 self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    162 base_lr /= 2
    163 self.unfreeze()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:116, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
    113 lr_max = np.array([h['lr'] for h in self.opt.hypers])
    114 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
    115           'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:222, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt)
    220 self.opt.set_hypers(lr=self.lr if lr is None else lr)
    221 self.n_epoch = n_epoch
--> 222 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:213, in Learner._do_fit(self)
    211 for epoch in range(self.n_epoch):
    212     self.epoch=epoch
--> 213     self._with_events(self._do_epoch, 'epoch', CancelEpochException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:207, in Learner._do_epoch(self)
    206 def _do_epoch(self):
--> 207     self._do_epoch_train()
    208     self._do_epoch_validate()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:199, in Learner._do_epoch_train(self)
    197 def _do_epoch_train(self):
    198     self.dl = self.dls.train
--> 199     self._with_events(self.all_batches, 'train', CancelTrainException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:170, in Learner.all_batches(self)
    168 def all_batches(self):
    169     self.n_iter = len(self.dl)
--> 170     for o in enumerate(self.dl): self.one_batch(*o)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:195, in Learner.one_batch(self, i, b)
    193 b = self._set_device(b)
    194 self._split(b)
--> 195 self._with_events(self._do_one_batch, 'batch', CancelBatchException)

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
    163 def _with_events(self, f, event_type, ex, final=noop):
--> 164     try: self(f'before_{event_type}');  f()
    165     except ex: self(f'after_cancel_{event_type}')
    166     self(f'after_{event_type}');  final()

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:181, in Learner._do_one_batch(self)
    179 if not self.training or not len(self.yb): return
    180 self('before_backward')
--> 181 self.loss_grad.backward()
    182 self._with_events(self.opt.step, 'step', CancelStepException)
    183 self.opt.zero_grad()

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:355, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    308 r"""Computes the gradient of current tensor w.r.t. graph leaves.
    310 The graph is differentiated using the chain rule. If the tensor is
    352         used to compute the attr::tensors.
    353 """
    354 if has_torch_function_unary(self):
--> 355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
    358         self,
    359         gradient=gradient,
    360         retain_graph=retain_graph,
    361         create_graph=create_graph,
    362         inputs=inputs)
    363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\overrides.py:1394, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1388     warnings.warn("Defining your `__torch_function__ as a plain method is deprecated and "
   1389                   "will be an error in PyTorch 1.11, please define it as a classmethod.",
   1390                   DeprecationWarning)
   1392 # Use `public_api` instead of `implementation` so __torch_function__
   1393 # implementations can do equality/identity comparisons.
-> 1394 result = torch_func_method(public_api, types, args, kwargs)
   1396 if result is not NotImplemented:
   1397     return result

File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\torch_core.py:341, in TensorBase.__torch_function__(self, func, types, args, kwargs)
    339 convert=False
    340 if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 341 res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
    342 if convert: res = convert(res)
    343 if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1139     return NotImplemented
   1141 with _C.DisableTorchFunction():
-> 1142     ret = func(*args, **kwargs)
   1143     if func in get_default_nowrap_functions():
   1144         return ret

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:363, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    354 if has_torch_function_unary(self):
    355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
    361         create_graph=create_graph,
    362         inputs=inputs)
--> 363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File D:\Anaconda\envs\fastai2\lib\site-packages\torch\autograd\__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    168     retain_graph = create_graph
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

In my Python window I also get an assertion message at the same time as the first error, which suggests that maybe my label image is set up incorrectly, but I have checked that the background values are 0 and the label values are 1. Other posts I have seen asking about assertions related to the number of classes have been from using binary masks of 0 and 255, or nonadjacent label values. Neither of those is the case here.
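
One way to double-check the label values is to list which pixel values in a mask fall outside the expected class range. This is only a sketch: `out_of_range_labels` is a hypothetical helper (not part of fastai), and the small arrays below stand in for real masks, which could be loaded with something like `np.array(PIL.Image.open(mask_path))`.

```python
import numpy as np

def out_of_range_labels(mask, n_classes):
    """Return sorted unique pixel values in `mask` that are not valid class indices."""
    values = np.unique(np.asarray(mask))
    return [int(v) for v in values if v < 0 or v >= n_classes]

# Stand-in for an 8-bit mask with 0 = background, 1 = foreground.
good_mask = np.array([[0, 0, 1], [0, 1, 1]], dtype=np.uint8)
# A 0/255 mask, the failure mode mentioned in other posts.
bad_mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)

print(out_of_range_labels(good_mask, n_classes=2))  # []
print(out_of_range_labels(bad_mask, n_classes=2))   # [255]
```

An empty list for every mask file would support the claim that the labels themselves are fine and the problem lies elsewhere.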

If I attempt to run the code again, I get

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I then need to restart the kernel in order to change settings and try again.
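
The message above suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from a notebook, assuming this runs in a freshly restarted kernel before torch or fastai is imported (the variable is picked up when CUDA is initialized, so setting it later in the session may have no effect):

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback
# points at the call that actually failed.
# Assumption: this cell runs first, before `import torch` / `from fastbook import *`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Training will run slower with this set, so it is best removed once the failing call has been identified.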

I have looked around and found

RuntimeError: no valid convolution algorithms available in CuDNN · Issue #3031 · fastai/fastai · GitHub, although this seems to be for an earlier version; I have installed three environments so far and the issue persists.
https://towardsdatascience.com/cuda-error-device-side-assert-triggered-c6ae1c8fa4c3 suggests that it may be the input to my loss function, although this is only in relation to the error that shows up when attempting to run the code a second time.
There are several PyTorch posts on Stack Exchange that have to do with insufficient VRAM or old video cards, but I did not see anything I could relate to my current setup other than reducing the batch size. I tried batch sizes from 1 to 8 without any effect on the outcome.

The error seems to be in learn.fine_tune(), and I have tried several values there without success.

Appreciate any help - I post a bit on the image.sc forum, but I am totally lost here!


In case anyone else runs across this: the error message is a bit misleading. A colleague pointed out that the real problem is that unet_learner needs to be configured explicitly, as it does not adapt automatically to different inputs. It also does not generate a warning stating that the unet settings are not going to work!

The unet_learner call that worked:

learn = unet_learner(dls, resnet18, n_in=1, n_out=1, loss_func=BCEWithLogitsLossFlat())
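
For reference, here is a rough NumPy sketch of what a binary cross-entropy on logits computes when the network has a single output channel and the mask holds 0/1 values. It illustrates the formula only and is not fastai's implementation; `bce_with_logits` is a hypothetical name.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (numerically stable form)."""
    x = np.asarray(logits, dtype=np.float64).ravel()   # flatten predictions
    t = np.asarray(targets, dtype=np.float64).ravel()  # flatten 0/1 mask to match
    # max(x, 0) - x*t + log(1 + exp(-|x|)) equals
    # -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(x)
    return float(np.mean(np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x)))))

# One logit per pixel (a single output channel) against a 0/1 mask.
logits = np.zeros((1, 1, 2, 2))       # shape: batch, channel, height, width
mask = np.array([[[0, 1], [1, 0]]])   # shape: batch, height, width
print(bce_with_logits(logits, mask))  # log(2), about 0.693
```

With one output channel, each pixel gets a single logit that is compared directly against the 0/1 mask value, which is why n_out=1 and a binary loss go together in the call above.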