RuntimeError: CUDA error: device-side assert triggered when using unet_learner

andandandand · March 10, 2019, 10:26pm

Please have a look at the following notebook.

We’re getting a RuntimeError: CUDA error: device-side assert triggered when invoking unet_learner and can’t figure out why.

A Stackoverflow answer says that:

In general, when encountering CUDA runtime errors, it is advisable to run your program again using the CUDA_LAUNCH_BLOCKING=1 flag to obtain an accurate stack trace.

Is there a way to do this from Google Colab?

Also:

the targets of your data were too high (or low) for the specified number of classes.

Which doesn’t seem applicable.

Edit: just found this answer and trying it.
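
On the Colab question above: CUDA_LAUNCH_BLOCKING is an environment variable, so one way to set it in a notebook is from Python. This is a minimal sketch. The key assumption is that the cell runs first in a freshly restarted runtime, before torch or fastai initialize CUDA; setting it later is not expected to take effect.

```python
import os

# Run this in the first cell of a freshly restarted runtime,
# before anything initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

With the variable set, kernel launches run synchronously, so the Python stack trace should point at the operation that actually failed.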

Pak · March 12, 2019, 7:12pm

I’ve encountered “device-side assert triggered” errors a couple of times in recent days. The solution was always to run the failing cell on the CPU rather than the GPU (with .cpu(), for example). That didn’t make the error go away, but it made the message clearer: it showed the actual error that went wrong. As I understand it, “device-side assert triggered” is just a general message indicating that some error occurred in the GPU calculations, without saying which one.
Hope that helps you move toward a solution.
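
To illustrate the CPU trick, here is a small sketch in plain PyTorch rather than fastai. The helper name loss_error_on_cpu and the toy tensors are made up for the example. It computes a cross-entropy loss on the CPU with a target value outside the valid class range and returns whatever error message PyTorch raises; the exact wording of that message depends on the PyTorch version.

```python
import torch
import torch.nn.functional as F

def loss_error_on_cpu(logits, target):
    """Return the error message raised by cross-entropy on CPU, or None if it succeeds."""
    try:
        F.cross_entropy(logits.cpu(), target.cpu())
    except (IndexError, RuntimeError) as err:
        return str(err)
    return None

logits = torch.zeros(4, 2)                  # 4 samples, 2 classes
bad_target = torch.tensor([0, 1, 1, 255])   # 255 is not a valid class index
good_target = torch.tensor([0, 1, 1, 0])

print(loss_error_on_cpu(logits, bad_target))
print(loss_error_on_cpu(logits, good_target))
```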

uwaisiqbal · March 30, 2019, 1:41pm

I ran into a similar error and it was because I had label values in my mask that were greater than the number of classes I supplied to the learner. Have a look and see if there is a mismatch between the class label values of your mask, and the number of classes you are supplying.
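
A quick way to look for such a mismatch is to list the label values in a mask that are not valid class indices. This is a sketch using NumPy on an already loaded mask array; the function name out_of_range_labels and the toy mask are illustrative, not part of fastai.

```python
import numpy as np

def out_of_range_labels(mask, n_classes):
    """Return the sorted label values in `mask` that are not valid class indices."""
    values = np.unique(mask)
    return [int(v) for v in values if v < 0 or v >= n_classes]

# Toy mask: 0 and 1 are valid for two classes, 2 and 255 are not.
mask = np.array([[0, 0, 1], [1, 2, 255]], dtype=np.uint8)
bad = out_of_range_labels(mask, n_classes=2)
print(bad)
```

An empty list means every pixel value can be used as a class index for the given number of classes.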

mr.ashutoshraj · May 25, 2019, 3:31pm

I ran into the same problem. I have a single class. My labeled data is a .jpg file with the object masked in green and the rest of the area in white. I tried passing both the object and background labels as codes and got the same error. If I put only the object name in codes, it throws:

~/.conda/envs/fastai/lib/python3.7/site-packages/fastai/vision/data.py in process(self, ds)
    370     "PreProcessor that stores the classes for segmentation."
    371     def __init__(self, ds:ItemList): self.classes = ds.classes
--> 372     def process(self, ds:ItemList): ds.classes,ds.c = self.classes,len(self.classes)
    373
    374 class SegmentationLabelList(ImageList):

TypeError: len() of unsized object

Please suggest something to get rid of this error.

mr.ashutoshraj · May 25, 2019, 7:56pm

Got solved !!

SiddharthGadekar · June 4, 2019, 6:27pm

Hi,

Could anyone please take a look at my notebook and suggest how to resolve this error?
It seems to be memory related.

https://gist.github.com/GadekarSid/1b3275b32bb54d734c806ba211f9db35

IndoorSegmentation.ipynb

mr.ashutoshraj · August 28, 2019, 8:59am

The mask size must equal the image size.
The number of pixel levels in the mask must equal the number of classes.
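
Both conditions can be checked on loaded arrays. This is a sketch that assumes images are H x W x C NumPy arrays and masks are H x W arrays; the helper names are invented for the example.

```python
import numpy as np

def mask_matches_image(image, mask):
    """True if the mask has the same height and width as the image."""
    return image.shape[:2] == mask.shape[:2]

def n_pixel_levels(mask):
    """Number of distinct pixel values present in the mask."""
    return len(np.unique(mask))

image = np.zeros((4, 6, 3), dtype=np.uint8)   # H x W x RGB
good_mask = np.zeros((4, 6), dtype=np.uint8)  # same H x W as the image
good_mask[:, 3:] = 1                          # two pixel levels: 0 and 1
bad_mask = np.zeros((2, 3), dtype=np.uint8)   # wrong resolution

print(mask_matches_image(image, good_mask), n_pixel_levels(good_mask))
print(mask_matches_image(image, bad_mask))
```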

the_ccalderon · September 5, 2019, 7:48pm

Hi @SiddharthGadekar, were you able to solve this error? I can’t see anything wrong with your notebook.

serain · October 12, 2019, 10:15pm

Could you give some more details? I’m in the same situation.

Were you trying to work with the Airbus Ship Detection dataset?

BelAir · October 31, 2019, 8:07pm

Why don’t you explain how you did it to help others?

mr.ashutoshraj · November 11, 2019, 1:46pm

Sorry for the late reply. Please follow this repository for an error-free fastai dynamic U-Net implementation for binary classification.

mr.ashutoshraj · November 11, 2019, 1:48pm

Please go through my notebook. It might help you.

mr.ashutoshraj · November 11, 2019, 1:52pm

Sorry for the late reply bro. Please go through my implementation.

And please make your mask properly: the number of pixel levels (0-255) in the mask must equal the number of classes in the RGB image.

BelAir · November 12, 2019, 2:52am

Thank you.

keyurparalkar · March 14, 2020, 2:48pm

Thanks, I faced the same issue. This solved it.

KKulikovskis · April 3, 2020, 11:24am

Hi,

I’ve encountered the same error, “CUDA assert error”. Following suggestions from other threads, I ran the same model on the CPU and got a possibly more informative error:

Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:97

This seems to suggest that my class list is smaller than the number of pixel levels in the mask. But my mask only has 2 pixel values: 0 and 255.

The classes are provided as classes = ["card","background"]

here is an example of my mask:

I can’t seem to find the issue, as I generated the masks myself from polyline data, and specifically made them only with 2 pixel values.

Just to double check, running

img_arr = cv2.imread(str(get_y_fn(img_f)))
np.unique(img_arr)

also returns only 2 values array([ 0, 255], dtype=uint8)

mr.ashutoshraj · April 3, 2020, 12:10pm

After writing the image, you can read it back and check what the values are at the edges. Or you can simply use 0 and 1 instead of 0 and 255.

KKulikovskis · April 3, 2020, 12:46pm

Thanks for the answer.

I found the solution on Stack Overflow.

Since I saved the image as black and white so that I could see the mask visually, I needed to normalize it when adding it to the data. Using the solution in that thread made everything work for me.
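
The linked answer is not reproduced in this thread, so the following is only a sketch of the general idea: turn a 0/255 mask into 0/1 class indices before it reaches the loss. It uses a threshold rather than a division by 255, which is a deliberate substitution so that near-white values (for example from lossy compression) also map to 1. The function name and the threshold of 127 are assumptions made for the example.

```python
import numpy as np

def to_class_indices(mask, threshold=127):
    """Map a black-and-white mask (0/255) to class indices (0/1)."""
    return (mask > threshold).astype(np.uint8)

mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)
indexed = to_class_indices(mask)
print(np.unique(indexed))
```

In fastai v1, open_mask also accepts a div argument that divides the mask values by 255 on load, which may be the route the linked answer takes.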

Amritpal · June 1, 2020, 2:59pm

I am new to fastai and deep learning. I am trying image segmentation on 2-class data, but I get an error as soon as I start training the model!

About data being used - "The annotation images for segmentation task are binary images in which pixels are either 1 for the foreground or 0 for the background. The annotation images named as “xxx.png”. Where “xxx” presents patient ID (from 001 to 3644). "

path_img = '/content/00000TNSCUI2020_train/TNSCUI2020_train/image'
path_lbl = '/content/00000TNSCUI2020_train/TNSCUI2020_train/mask'

fnames = get_image_files(path_img)
fnames[:3]
lbl_names = get_image_files(path_lbl)
lbl_names[:3]
img_f = fnames[0]
img = open_image(img_f)
img.show(figsize=(5,5))

codes = ['others ,thyroid']

from pathlib import Path
import os
def get_y_fn(filename):
   b = Path(path_lbl).joinpath(Path(filename).name)
   return b

mask = open_mask(get_y_fn(img_f))
mask.show(figsize=(5,5), alpha=1)

src_size = np.array(mask.shape[1:])
src_size,mask.data

size = src_size//2
bs = 4
src = (SegmentationItemList.from_folder(path_img)
       .split_by_fname_file('/content/valid.txt')
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

metrics=accuracy
wd=1e-2
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)

lr=3e-3
learn.fit_one_cycle(10, slice(lr), pct_start=0.9)

RuntimeError: CUDA error: device-side assert triggered

The entire error was:
RuntimeError Traceback (most recent call last)

<ipython-input-24-c2c63f494cee> in <module>()
     44 learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)
     45 lr=3e-3
---> 46 learn.fit_one_cycle(10, slice(lr), pct_start=0.9)

4 frames

/usr/local/lib/python3.6/dist-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr,  moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     31 
     32     if opt is not None:
---> 33         loss,skip_bwd = cb_handler.on_backward_begin(loss)
     34         if not skip_bwd:                     loss.backward()
     35         if not cb_handler.on_backward_end(): opt.step()

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_backward_begin(self, loss)
    288     def on_backward_begin(self, loss:Tensor)->Tuple[Any,Any]:
    289         "Handle gradient calculation on `loss`."
--> 290         self.smoothener.add_value(loss.float().detach().cpu())
    291         self.state_dict['last_loss'], self.state_dict['smooth_loss'] = loss, self.smoothener.smooth
    292         self('backward_begin', call_mets=False)

RuntimeError: CUDA error: device-side assert triggered

Please help. Thanks in advance.
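
One thing that stands out in the code above: codes = ['others ,thyroid'] is a list containing a single string, so it holds one class name instead of two. The learner would then be built for one class while the masks contain two values (0 and 1), which would be consistent with the label and class count mismatch described earlier in this thread. This has not been checked against the dataset, so treat it as a likely cause and not a confirmed one.

```python
codes_as_posted = ['others ,thyroid']   # one string that happens to contain a comma
codes_split = ['others', 'thyroid']     # two separate class names

# The first list has a single entry, the second has two.
print(len(codes_as_posted), len(codes_split))
```

It may also be worth confirming with np.unique that the mask PNGs really store 0 and 1, and not 0 and 255.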

cuongnc · August 26, 2020, 9:32am

Hi Amritpal,

I see you mentioned the TNSCUI2020 dataset. Is it the thyroid cancer dataset?
Could you share this dataset with me? I cannot download it.

Thank you very much