Using fastai for Segmentation, receiving a CUDA device-side assertion error

Using fastai v1.0.20:

How should a learner be created for segmentation? I am currently attempting segmentation for this challenge: https://www.kaggle.com/c/tgs-salt-identification-challenge

Right now I have separated the images and masks into separate PNG files.
Following the camvid example, I attempted to create a learner with:

data = (ImageFileList.from_folder(path_trn)                
    .label_from_func(get_y_fn)                         
    .random_split_by_pct()                             
    .datasets(SegmentationDataset, classes=codes)      
    .transform(get_transforms(), size=96, tfm_y=True)  
    .databunch(bs=64))
learn = Learner.create_unet(data, models.resnet34)

However, when calling learn.fit, I get a CUDA error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-24-bd510e07cfec> in <module>
      1 CUDA_LAUNCH_BLOCKING=1
----> 2 learn.fit(1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     80             cb_handler.on_epoch_begin()
     81 
---> 82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
     84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_data.py in __iter__(self)
     85             y = b[1][0] if is_listy(b[1]) else b[1]
     86             if not self.skip_size1 or y.size(0) != 1:
---> 87                 yield self.proc_batch(b)
     88 
     89     def one_batch(self)->Collection[Tensor]:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_data.py in proc_batch(self, b)
     76     def proc_batch(self,b:Tensor)->Tensor:
     77         "Proces batch `b` of `TensorImage`."
---> 78         b = to_device(b, self.device)
     79         for f in listify(self.tfms): b = f(b)
     80         return b

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in to_device(b, device)
     85     "Ensure `b` is on `device`."
     86     device = ifnone(device, defaults.device)
---> 87     if is_listy(b): return [to_device(o, device) for o in b]
     88     return b.to(device)
     89 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in <listcomp>(.0)
     85     "Ensure `b` is on `device`."
     86     device = ifnone(device, defaults.device)
---> 87     if is_listy(b): return [to_device(o, device) for o in b]
     88     return b.to(device)
     89 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in to_device(b, device)
     86     device = ifnone(device, defaults.device)
     87     if is_listy(b): return [to_device(o, device) for o in b]
---> 88     return b.to(device)
     89 
     90 def data_collate(batch:ItemsList)->Tensor:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch-nightly_1541497631545/work/aten/src/THC/generic/THCTensorCopy.cpp:20

The data seems fine: I was able to view it properly with data.show_batch(), and the learner was created without errors, but for some reason it fails when calling learn.fit(). I am also not sure how to create a proper metric for segmentation, although I would like to fix the CUDA error first.


Hey,

I’ve also been getting an error when defining a metric for segmentation. The inbuilt accuracy metric breaks for segmentation for some reason, but if I use the code from the camvid notebook and just remove the void mask, the metric works. Something like this:

def seg_accuracy(input, target):
    # Compare the per-pixel argmax of the predictions with the (squeezed) target mask.
    target = target.squeeze(1)
    return (input.argmax(dim=1)==target).float().mean()
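To use it, something along these lines should work (just a sketch; it assumes create_unet forwards keyword arguments such as metrics to the Learner):

learn = Learner.create_unet(data, models.resnet34, metrics=[seg_accuracy])
# or, on an already-created learner:
learn.metrics = [seg_accuracy]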

Your error message looks like you had the CUDA error before. Note that any time you get a CUDA error, you have to restart your kernel to see if anything you did solved it (very annoying, but that’s how it is :frowning: )


Are your mask values 255 or 1? The TGS Challenge is a binary classification problem, so the values for non-background pixels should be 1. If the mask tensor contains 255, set div=True in open_mask().
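For example, a quick check of how one of your masks is encoded (a sketch; mask_fn is a placeholder path to one of your mask PNGs):

from fastai.vision import open_mask

mask = open_mask(mask_fn)            # values kept as stored in the PNG
print(mask.data.unique())            # tensor([0, 255]) means you need div=True
mask = open_mask(mask_fn, div=True)  # divides by 255 so the values become 0 and 1
print(mask.data.unique())            # tensor([0, 1])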


Removing the void mask from the metric did not solve the issue. The mask tensor contains 0s and 1s.

I received this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-bd510e07cfec> in <module>
      1 CUDA_LAUNCH_BLOCKING=1
----> 2 learn.fit(1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     23 
     24     if opt is not None:
---> 25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()
     27         cb_handler.on_backward_end()

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/callback.py in on_backward_begin(self, loss)
    222         self.state_dict['last_loss'], self.state_dict['smooth_loss'] = loss, self.smoothener.smooth
    223         for cb in self.callbacks:
--> 224             a = cb.on_backward_begin(**self.state_dict)
    225             if a is not None: self.state_dict['last_loss'] = a
    226         return self.state_dict['last_loss']

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in on_backward_begin(self, smooth_loss, **kwargs)
    264         self.losses.append(smooth_loss)
    265         if self.pbar is not None and hasattr(self.pbar,'child'):
--> 266             self.pbar.child.comment = f'{smooth_loss:.4f}'
    267 
    268     def on_epoch_end(self, epoch:int, num_batch:int, smooth_loss:Tensor,

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/torch/tensor.py in __format__(self, format_spec)
    376     def __format__(self, format_spec):
    377         if self.dim() == 0:
--> 378             return self.item().__format__(format_spec)
    379         return object.__format__(self, format_spec)
    380 

RuntimeError: CUDA error: device-side assert triggered

I tried learn.pred_batch() instead of learn.fit() to see whether the issue is only with learn.fit(). I get an output that is all NaN. Why would there be an output here, while learn.fit() gives a CUDA error?

I also have this error appearing in the terminal when the CUDA error occurs:

`/opt/conda/conda-bld/pytorch-nightly_1542102561606/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T* , T *, long* , T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [15,0,0], thread: [49,0,0] Assertion `t >= 0 && t < n_classes` failed.`

In addition, the fastai function SegmentationItemList.from_folder() is not recognized; even with the latest version of fastai (1.0.24) it still says that such a function does not exist. Instead, I have been using ImageFileList.
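For reference, that assertion fires when a target pixel value falls outside [0, n_classes). A quick sketch for checking the first few training masks (assuming data is the DataBunch built above):

import torch

# With two classes, anything other than 0 and 1 in the targets will trip the assert.
vals = torch.cat([data.train_ds[i][1].data.view(-1) for i in range(10)])
print(vals.unique())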

Did you fix the error? I have the same problem with the accuracy metric on binary segmentation.

No, I was not able to fix the error.

The accuracy metric no longer works for image segmentation with the latest fastai version. You can try using other metrics like dice or iou.

def dice(pred, targs):
    # Binarise the predictions (assumes `pred` is a single-channel output where
    # positive values mean foreground), then compute the Dice score.
    pred = (pred>0).float()
    return 2. * (pred*targs).sum() / (pred+targs).sum()

def iou(input:Tensor, targs:Tensor) -> Rank0Tensor:
    "IoU coefficient metric for binary target."
    n = targs.shape[0]
    # Take the per-pixel argmax over the class dimension and flatten per image.
    input = input.argmax(dim=1).view(n,-1)
    targs = targs.view(n,-1)
    intersect = (input*targs).sum().float()
    union = (input+targs).sum().float()
    return intersect / (union-intersect+1.0)
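As a quick sanity check, the iou metric above can be run on a dummy batch (a sketch; it assumes the definitions above have already been executed, which requires fastai's Tensor/Rank0Tensor types to be in scope):

import torch

# One 4x4 "image", two classes: predict class 1 on the top half, background elsewhere.
pred = torch.zeros(1, 2, 4, 4)
pred[:, 1, :2, :] = 1.0
targ = torch.zeros(1, 4, 4).long()
targ[:, :2, :] = 1
print(iou(pred, targ))  # ~0.89: perfect overlap, slightly below 1 because of the +1.0 smoothing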

Thanks! Yes, I ended up using iou and dice.

Hi,
None of this solved the problem for me.
Any suggestions to debug this issue?


A CUDA device-side assertion is super generic and just means you have a bad index somewhere. It’s impossible to debug without seeing your code.
One thing that might help is to run the same thing on the CPU first, because you’ll get a clearer error message.
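For example, something like this before building the DataBunch forces everything onto the CPU (a sketch for fastai v1):

import torch
from fastai.torch_core import defaults

# With the default device set to CPU, the failing op raises a readable Python
# error instead of an asynchronous device-side assert.
defaults.device = torch.device('cpu')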

Could you post how we should instantiate a model from folders “images” and “masks” like the carvana example?
I am trying to do this, but it does not seem to work:

src = (SegmentationItemList.from_folder(path_img)
       .random_split_by_pct(0.2)
       .label_from_func(get_y_fn, classes=[0,1]))

Hi @tcapelle,
This is what I did:


Where my classes are: [‘background’, ‘person’]

If you like, you can join the thread I opened on image segmentation: Image Segmentation on COCO dataset - summary, questions and suggestions

Maybe this can help both of us and also other people :wink:

Can you plot the output of src, please?

Yes, here it is:

As you can see in the post I linked, I may have some problems with the data because they have different shapes (?)

If I disable the GPU to trace back the error, I get this:

RuntimeError                              Traceback (most recent call last)

<ipython-input-19-f2e08e2ffc17> in <module>()
----> 1 learn.lr_find(); learn.recorder.plot()

/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    176         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    177         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 178             callbacks=self.callbacks+callbacks)
    179 
    180     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/utils/mem.py in wrapper(*args, **kwargs)
     83 
     84         try:
---> 85             return func(*args, **kwargs)
     86         except Exception as e:
     87             if "CUDA out of memory" in str(e) or tb_clear_frames=="1":

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     98     except Exception as e:
     99         exception = e
--> 100         raise e
    101     finally: cb_handler.on_train_end(exception)
    102 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     88             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     89                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 90                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     91                 if cb_handler.on_batch_end(loss): break
     92 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     22 
     23     if not loss_func: return to_detach(out), yb[0].detach()
---> 24     loss = loss_func(out, *yb)
     25 
     26     if opt is not None:

/usr/local/lib/python3.6/dist-packages/fastai/layers.py in __call__(self, input, target, **kwargs)
    229         if self.floatify: target = target.float()
    230         input = input.view(-1,input.shape[-1]) if self.is_2d else input.view(-1)
--> 231         return self.func.__call__(input, target.view(-1), **kwargs)
    232 
    233 def CrossEntropyFlat(*args, axis:int=-1, **kwargs):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py in forward(self, input, target)
    902     def forward(self, input, target):
    903         return F.cross_entropy(input, target, weight=self.weight,
--> 904                                ignore_index=self.ignore_index, reduction=self.reduction)
    905 
    906 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   1968     if size_average is not None or reduce is not None:
   1969         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1970     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   1971 
   1972 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1788                          .format(input.size(0), target.size(0)))
   1789     if dim == 2:
-> 1790         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   1791     elif dim == 4:
   1792         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93

So it is not choosing the right loss, as it should be binary cross-entropy (predicting between 0 and 1).

Try to reduce batch size

You can predict 0 or 1 with CrossEntropy (as long as it can’t be 0 and 1 at the same time, which is the case in segmentation). Your problem is the same as everyone else’s: your mask is encoded with 0s and 255s, not 0s and 1s, so you need to subclass SegmentationItemList and its open method to return open_mask(bla, div=True).
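A rough sketch of that subclassing approach (fastai v1; the class names here are only placeholders, and attribute names may differ slightly between versions):

from fastai.vision import SegmentationItemList, SegmentationLabelList, open_mask

class BinarySegLabelList(SegmentationLabelList):
    # Open masks with div=True so 0/255 pixel values become class indices 0/1.
    def open(self, fn): return open_mask(fn, div=True)

class BinarySegItemList(SegmentationItemList):
    _label_cls = BinarySegLabelList

Then build the data block with BinarySegItemList.from_folder(...) exactly as before.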
