Hi all!
I have been struggling for about three days now, first to get fastbook working via Anaconda on Windows 10, which I believe I have done (many thanks to threads like "Installation of fastai v2 on Windows"). Now I have a problem that seems to be specific to my dataset. I am fairly new, and am trying to follow the advice of the first few lectures by working on my own problem as I go through the course/book (which just arrived!). I get the error message in the title when attempting to train a simple foreground/background classifier using a ResNet. Searching for that error on the forum returns no results.
Output of show_install
=== Software ===
python : 3.10.4
fastai : 2.6.3
fastcore : 1.4.4
fastprogress : 1.0.2
torch : 1.11.0
nvidia driver : 496.76
torch cuda : 11.3 / is available
torch cudnn : 8200 / is enabled
=== Hardware ===
nvidia gpus : 1
torch devices : 1
- gpu0 : NVIDIA GeForce RTX 3090
=== Environment ===
platform : Windows-10-10.0.19044-SP0
conda env : fastai2
python : D:\Anaconda\envs\fastai2\python.exe
sys.path : D:\pytorch
D:\Anaconda\envs\fastai2\python310.zip
D:\Anaconda\envs\fastai2\DLLs
D:\Anaconda\envs\fastai2\lib
D:\Anaconda\envs\fastai2
D:\Anaconda\envs\fastai2\lib\site-packages
D:\Anaconda\envs\fastai2\lib\site-packages\win32
D:\Anaconda\envs\fastai2\lib\site-packages\win32\lib
D:\Anaconda\envs\fastai2\lib\site-packages\Pythonwin
My immediate goal is hopefully fairly straightforward: I am trying to adapt the cell in 01_intro that performs semantic segmentation of an image to work on my own images. My own setup is somewhat simpler. I have 16-bit single-channel images in .tif format (512x512) that contain grayscale fluorescent data of zebrafish, and 8-bit mask files with either 0 for background or 1 for foreground (in the images and labels folders, respectively). My total code is only:
import fastbook
fastbook.setup_book()
from fastbook import *
path = Path("D:/pytorch/data/2D_Zebrafish")
codes = ['Background', 'Zebrafish']
dls = SegmentationDataLoaders.from_label_func(
    path, bs=2, fnames=get_image_files(path/"images"),
    label_func=lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
    codes=codes, num_workers=0
)
learn = unet_learner(dls, resnet18)
learn.fine_tune(8)
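Since the label_func above depends on every image having a label file named `<stem>_annotationLabels.tif`, a quick pairing check may be worth running before training. This is only a minimal sketch: the helper name `find_missing_labels` is hypothetical, and it assumes all images sit directly in the images folder with a .tif extension (unlike `get_image_files`, it does not search subfolders).

```python
from pathlib import Path

def find_missing_labels(image_dir, label_dir, suffix="_annotationLabels.tif"):
    """Return the image files that have no matching label file.

    Hypothetical helper: assumes flat folders and .tif images, and the
    same naming convention as the label_func in the post.
    """
    image_dir, label_dir = Path(image_dir), Path(label_dir)
    missing = []
    for img in sorted(image_dir.glob("*.tif")):
        # Same mapping as the lambda passed to from_label_func
        label = label_dir / f"{img.stem}{suffix}"
        if not label.is_file():
            missing.append(img)
    return missing
```

Calling `find_missing_labels(path/"images", path/"labels")` should return an empty list if every image has a label under that convention.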
My goal is to train a simple foreground/background classifier, but I get the error listed in the title the first time I run the code.
Does anyone have any suggestions? I can provide a sample image/label if that helps.
Full error message
D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
ret = func(*args, **kwargs)
0.00% [0/1 00:00<00:00]
epoch train_loss valid_loss time
0.00% [0/77 00:00<00:00]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [7], in <cell line: 8>()
1 dls = SegmentationDataLoaders.from_label_func(
2 path, bs=2, fnames = get_image_files(path/"images"),
3 label_func = lambda o: path/'labels'/f'{o.stem}_annotationLabels.tif',
4 codes =codes, num_workers=0
5 )
7 learn = unet_learner(dls, resnet18)
----> 8 learn.fine_tune(1)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:161, in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
159 "Fine tune with `Learner.freeze` for `freeze_epochs`, then with `Learner.unfreeze` for `epochs`, using discriminative LR."
160 self.freeze()
--> 161 self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
162 base_lr /= 2
163 self.unfreeze()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\callback\schedule.py:116, in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
113 lr_max = np.array([h['lr'] for h in self.opt.hypers])
114 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
115 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 116 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:222, in Learner.fit(self, n_epoch, lr, wd, cbs, reset_opt)
220 self.opt.set_hypers(lr=self.lr if lr is None else lr)
221 self.n_epoch = n_epoch
--> 222 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
163 def _with_events(self, f, event_type, ex, final=noop):
--> 164 try: self(f'before_{event_type}'); f()
165 except ex: self(f'after_cancel_{event_type}')
166 self(f'after_{event_type}'); final()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:213, in Learner._do_fit(self)
211 for epoch in range(self.n_epoch):
212 self.epoch=epoch
--> 213 self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
163 def _with_events(self, f, event_type, ex, final=noop):
--> 164 try: self(f'before_{event_type}'); f()
165 except ex: self(f'after_cancel_{event_type}')
166 self(f'after_{event_type}'); final()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:207, in Learner._do_epoch(self)
206 def _do_epoch(self):
--> 207 self._do_epoch_train()
208 self._do_epoch_validate()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:199, in Learner._do_epoch_train(self)
197 def _do_epoch_train(self):
198 self.dl = self.dls.train
--> 199 self._with_events(self.all_batches, 'train', CancelTrainException)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
163 def _with_events(self, f, event_type, ex, final=noop):
--> 164 try: self(f'before_{event_type}'); f()
165 except ex: self(f'after_cancel_{event_type}')
166 self(f'after_{event_type}'); final()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:170, in Learner.all_batches(self)
168 def all_batches(self):
169 self.n_iter = len(self.dl)
--> 170 for o in enumerate(self.dl): self.one_batch(*o)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:195, in Learner.one_batch(self, i, b)
193 b = self._set_device(b)
194 self._split(b)
--> 195 self._with_events(self._do_one_batch, 'batch', CancelBatchException)
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:164, in Learner._with_events(self, f, event_type, ex, final)
163 def _with_events(self, f, event_type, ex, final=noop):
--> 164 try: self(f'before_{event_type}'); f()
165 except ex: self(f'after_cancel_{event_type}')
166 self(f'after_{event_type}'); final()
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\learner.py:181, in Learner._do_one_batch(self)
179 if not self.training or not len(self.yb): return
180 self('before_backward')
--> 181 self.loss_grad.backward()
182 self._with_events(self.opt.step, 'step', CancelStepException)
183 self.opt.zero_grad()
File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:355, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
308 r"""Computes the gradient of current tensor w.r.t. graph leaves.
309
310 The graph is differentiated using the chain rule. If the tensor is
(...)
352 used to compute the attr::tensors.
353 """
354 if has_torch_function_unary(self):
--> 355 return handle_torch_function(
356 Tensor.backward,
357 (self,),
358 self,
359 gradient=gradient,
360 retain_graph=retain_graph,
361 create_graph=create_graph,
362 inputs=inputs)
363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File D:\Anaconda\envs\fastai2\lib\site-packages\torch\overrides.py:1394, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
1388 warnings.warn("Defining your `__torch_function__ as a plain method is deprecated and "
1389 "will be an error in PyTorch 1.11, please define it as a classmethod.",
1390 DeprecationWarning)
1392 # Use `public_api` instead of `implementation` so __torch_function__
1393 # implementations can do equality/identity comparisons.
-> 1394 result = torch_func_method(public_api, types, args, kwargs)
1396 if result is not NotImplemented:
1397 return result
File D:\Anaconda\envs\fastai2\lib\site-packages\fastai\torch_core.py:341, in TensorBase.__torch_function__(self, func, types, args, kwargs)
339 convert=False
340 if _torch_handled(args, self._opt, func): convert,types = type(self),(torch.Tensor,)
--> 341 res = super().__torch_function__(func, types, args=args, kwargs=kwargs)
342 if convert: res = convert(res)
343 if isinstance(res, TensorBase): res.set_meta(self, as_copy=True)
File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:1142, in Tensor.__torch_function__(cls, func, types, args, kwargs)
1139 return NotImplemented
1141 with _C.DisableTorchFunction():
-> 1142 ret = func(*args, **kwargs)
1143 if func in get_default_nowrap_functions():
1144 return ret
File D:\Anaconda\envs\fastai2\lib\site-packages\torch\_tensor.py:363, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
354 if has_torch_function_unary(self):
355 return handle_torch_function(
356 Tensor.backward,
357 (self,),
(...)
361 create_graph=create_graph,
362 inputs=inputs)
--> 363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File D:\Anaconda\envs\fastai2\lib\site-packages\torch\autograd\__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
In my Python window I also get the following at the same time as the first error.
This suggests that my label images may be set up incorrectly, but I have checked that the background values are 0 and the foreground values are 1. Other posts I have seen asking about assertions related to the number of classes involved binary masks of 0 and 255, or nonadjacent label values. Neither of those is the case here.
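For anyone wanting to repeat the mask check, here is one way it could be written. It is a sketch rather than the exact check used for this post, and the function names `check_mask_values` and `load_mask` are hypothetical. It assumes the masks load as integer arrays and that valid class indices run from 0 to len(codes) - 1.

```python
import numpy as np

def check_mask_values(mask, n_classes):
    """Return (sorted unique values, whether all lie in [0, n_classes))."""
    values = np.unique(np.asarray(mask))
    ok = bool(values.min() >= 0 and values.max() < n_classes)
    return values.tolist(), ok

def load_mask(path):
    """Load a mask file as a NumPy array (assumes Pillow is installed)."""
    from PIL import Image  # imported lazily so the check above works without Pillow
    return np.array(Image.open(path))
```

With two codes, a valid mask should give `([0, 1], True)` or a subset of those values; a 0/255 mask would give `([0, 255], False)`. Running this over every file in the labels folder, not just one sample, would catch a single stray value.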
If I attempt to run the code again, I get
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
After that, I need to restart the kernel in order to change settings and try again.
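The CUDA_LAUNCH_BLOCKING hint in that message refers to an environment variable. As far as I understand, it has to be set before CUDA is initialized, so a minimal sketch would put it at the very top of a freshly restarted kernel, before fastbook or torch is imported:

```python
import os

# Make CUDA kernel launches synchronous so the traceback points at the
# call that actually failed. Set this before importing torch/fastbook;
# setting it after CUDA has been initialized is assumed to have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... only now: import fastbook, build the DataLoaders, and train
```

This slows training down, so it is only meant for debugging runs.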
I have looked around and found
RuntimeError: no valid convolution algorithms available in CuDNN · Issue #3031 · fastai/fastai · GitHub, although this seems to be for an earlier version, and the issue has persisted across the three environments I have installed so far.
https://towardsdatascience.com/cuda-error-device-side-assert-triggered-c6ae1c8fa4c3 suggests that it may be the input of my loss function, although this is only in relation to the error that shows up attempting to run the code a second time.
There are several PyTorch posts on Stack Exchange about insufficient VRAM or old video cards, but I did not see anything I could relate to my current setup other than reducing the batch size. I tried values from 1 to 8 without any effect on the outcome.
The error seems to be in learn.fine_tune(), and I have tried several values there without success.
Appreciate any help - I post a bit on the image.sc forum, but I am totally lost here!
Cheers,
Mike