Can't do Segmentation: CUDA error: device-side assert triggered

xnet · February 7, 2019, 9:09pm

Could anyone help with troubleshooting the "CUDA error: device-side assert triggered" error?

I’ve figured it’s due to some -1 value in my data, but I don’t know where. I suspect it’s the masks I’ve created, since the camvid datasets work well.

Basically, I manually create masks in numpy and save them as such:
img = Image.fromarray((mask * 255).astype('uint8'), mode='L')
img.save(savefile, bit=1)
where mask is the numpy array

Then, I use these PNGs as my segmentation masks. There are only 2 classes, so the masks should contain only 0s and 1s, and I manually pass in the codes as
codes = np.asarray(['void','seam'])

There’s also some output error on the command line, but I’m not sure how to troubleshoot this:

/opt/conda/conda-bld/pytorch_1549287501208/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed.

I should also add that the masks look fine when I visualize them with data.show_batch()
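
The assertion in the log above requires every target value t to satisfy 0 <= t < n_classes. A minimal sketch of the same check on the CPU, using NumPy only, could look like the following. The helper name out_of_range_labels and the toy masks are hypothetical and not part of fastai or PyTorch.

```python
import numpy as np

def out_of_range_labels(mask, n_classes):
    """Return the sorted unique values in `mask` that are not valid class indices."""
    values = np.unique(np.asarray(mask))
    return values[(values < 0) | (values >= n_classes)]

codes = ['void', 'seam']          # two classes, so the valid labels are 0 and 1

good_mask = np.array([[0, 1], [1, 0]], dtype=np.uint8)   # toy mask (assumption)
bad_mask = good_mask * 255        # the same mask after multiplying by 255

print(out_of_range_labels(good_mask, len(codes)))   # nothing out of range
print(out_of_range_labels(bad_mask, len(codes)))    # reports the value 255
```

Running a check like this over the mask files before training may point to the offending values faster than the CUDA assert does.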

imanol · February 20, 2019, 10:22am

I have the same issue, also using PNG files as masks

sunhwan · February 22, 2019, 5:15am

I think there should not be any gap between your mask's integer values. Because you are multiplying by 255, you will end up with the values 0 and 255 (if your input mask array contains only 0 and 1). Check with np.unique(img.getdata()) and make sure it returns [0, 1].
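
As a rough illustration of that check, the sketch below saves a toy mask to an in-memory PNG with and without the multiplication by 255 and prints the unique values that come back. The helper roundtrip_png and the toy array are assumptions made for this example; only NumPy and Pillow are used.

```python
import io

import numpy as np
from PIL import Image

mask = np.array([[0, 1, 1], [0, 0, 1]], dtype=np.uint8)  # toy 0/1 mask

def roundtrip_png(array):
    """Save a 2-D uint8 array as a grayscale PNG in memory and load it back."""
    buffer = io.BytesIO()
    Image.fromarray(array).save(buffer, format='PNG')
    buffer.seek(0)
    return np.array(Image.open(buffer))

scaled = roundtrip_png(mask * 255)   # what the original post saves
unscaled = roundtrip_png(mask)       # class indices stored directly

print(np.unique(scaled))     # values stored when multiplying by 255
print(np.unique(unscaled))   # values stored without the multiplication
```

A PNG that stores 0 and 1 directly looks almost completely black in an ordinary image viewer, which is likely why masks are often multiplied by 255 in the first place.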

renato · February 22, 2019, 5:28am

When defining the DataBunch, have you specified the classes?

imanol · February 22, 2019, 7:35am

I have found the solution. The problem is that my masks have the following values: [0, 255], and this is not supported by default by fast.ai. The problem is that the fast.ai function open_mask works with small mask pixel values like 0,1,2,3 by default. This function is called by the SegmentationLabelList class. To work with 255 values we should call the function open_mask with div=True: open_mask(fn, div=True). It divides the mask pixel values by 255.

In order to change the default behaviour of open_mask, I did the following:

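from fastai.vision import *  # fastai v1: provides SegmentationItemList, SegmentationLabelList, open_mask

# Label list that divides mask pixel values by 255 when opening, so 255 becomes class 1
class SegLabelListCustom(SegmentationLabelList):
    def open(self, fn): return open_mask(fn, div=True)

# Item list that uses the custom label list for its masks
class SegItemListCustom(SegmentationItemList):
    _label_cls = SegLabelListCustom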

codes = ['0','1']
src = (SegItemListCustom.from_folder(path_img)
       .random_split_by_pct(valid_pct=0.2, seed=33)
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

Hope this helps. Thank you for your replies!
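
To make the effect of div=True concrete, here is a small NumPy sketch of the idea described above: divide the pixel values by 255, then treat the result as integer class indices. The function to_class_indices is a hypothetical stand-in written for this example; it is not fastai's implementation of open_mask.

```python
import numpy as np

def to_class_indices(pixels, div=False):
    """Hypothetical stand-in for mask loading: optionally divide by 255, then cast to integers."""
    values = np.asarray(pixels, dtype=np.float32)
    if div:
        values = values / 255.0
    return values.astype(np.int64)   # fractional values are truncated toward zero

mask_0_255 = np.array([[0, 255], [255, 0]], dtype=np.uint8)  # toy masks (assumptions)
mask_0_1 = np.array([[0, 1], [1, 0]], dtype=np.uint8)

print(np.unique(to_class_indices(mask_0_255)))            # 255 is kept as a label
print(np.unique(to_class_indices(mask_0_255, div=True)))  # 255 becomes 1
print(np.unique(to_class_indices(mask_0_1, div=True)))    # 1/255 truncates to 0
```

In this sketch, a mask that already holds 0 and 1 collapses to all zeros when divided, so the division only suits masks stored as 0 and 255.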

alex_zhang · March 12, 2019, 5:58am

Hi, I used your code and the error is gone, but now when I view data.show_batch there is no mask. And if I try "lr_find(learn); learn.recorder.plot()" after creating the learner, the loss is 0, which confirms that the mask is gone…
How can I fix that? Thanks

alex_zhang · March 12, 2019, 5:59am

I used "get_y_fn = lambda x: pathmask_train/f'{x.stem}_json.png'" before the modification

hasif · March 13, 2019, 8:26am

Thanks @imanol, it works!

ingbiodanielh · March 23, 2019, 9:13pm

I have the same issue. Did you solve it?

alex_zhang · March 25, 2019, 2:45am

Hey, guys! I've just partly fixed the problem (by using the COCO dataset format). Now I can use lr_find(learn) and learn.recorder.plot() without error. Also see this link: Image Segmentation on COCO dataset - summary, questions and suggestions.

I think there are two problems behind this error: 1) The values in the mask array should be {0,1}, not {0,255} or anything else, as many others have said. The link above shows ways to generate {0,1} masks, but I haven't tried fixing it without converting the data to COCO form. 2) The class list. Mine is a binary segmentation project (to segment "row"), so CATEGORY_NAMES=[0, 'row']. With CATEGORY_NAMES=['row'], I found the loss was always 0 when using lr_find(learn) and learn.recorder.plot().
I think there is still more to figure out, but for now at least I can use the unet and lr_find(learn).
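
One way to see why a one-entry class list gives a loss of 0: assuming the learner uses a standard softmax cross-entropy loss (the ClassNLLCriterion in the earlier log suggests this, but it is an assumption here), a single output channel always gets probability 1, so the loss is 0 whatever the network predicts. The NumPy sketch below uses made-up logits and a hand-written cross_entropy function, not fastai's loss.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy for logits of shape (n_pixels, n_classes) and integer targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)

# One class: every pixel can only be labelled 0, so the loss carries no signal.
one_class_loss = cross_entropy(rng.normal(size=(6, 1)), np.zeros(6, dtype=int))

# Two classes (background plus 'row'): the loss depends on the predictions.
two_class_loss = cross_entropy(rng.normal(size=(6, 2)), np.array([0, 1, 1, 0, 1, 0]))

print(one_class_loss, two_class_loss)
```

This is consistent with the observation above that adding a background entry to the class list made the loss meaningful again.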

alex_zhang · March 25, 2019, 2:49am

I converted the data to COCO format and added 0 to the class list (e.g. CATEGORY_NAMES=[0, 'row'] instead of ['row']), and it works.
I also updated fastai to 1.0.48, and then the data.show_batch error disappeared, though I don't know why.

massaros · April 1, 2019, 7:03pm

It works! thank you

hotessy · April 15, 2019, 7:22pm

It works!
Shouldn't it be reported as a bug?

ptrampert · April 15, 2019, 7:37pm

Have a look here for those that still have problems:

nithinnivi · June 5, 2019, 7:30am

Thanks so much, much needed one

eljas1 · June 16, 2019, 7:04am

class SegLabelListCustom(SegmentationLabelList):
    def open(self, fn): return open_mask(fn, div=True)

class SegItemListCustom(SegmentationItemList):
    _label_cls = SegLabelListCustom

codes = ['0','1']
src = (SegItemListCustom.from_folder(path_img)
       .random_split_by_pct(valid_pct=0.2, seed=33)
       .label_from_func(get_y_fn, classes=codes))

data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

Is it normal for this code to take quite a while to run? I’ve had it running for well over an hour with just 5000 128x128 images in the dataset. Hoping it will work but I’m not sure what’s going on. I only changed the paths.

moeinh77 · July 24, 2019, 3:32pm

Did you manage to fix the show_batch problem?

alex_zhang · July 24, 2019, 9:47pm

Just update fastai; then show_batch should be OK.

moeinh77 · July 25, 2019, 5:51am

I updated fastai to 1.0.55, but still no mask is shown.

alex_zhang · July 25, 2019, 5:53am

Sorry, I didn't figure out why it worked, but it worked when I was using a Kaggle kernel.