Using fastai for Segmentation, receiving a CUDA device-side assertion error

clarkeaa13 · November 13, 2018, 9:12pm

Using fastai v1.0.20:

How should a learner be created for segmentation? I am currently attempting segmentation for this challenge: https://www.kaggle.com/c/tgs-salt-identification-challenge

Right now I have saved the images and the masks as separate PNG files.
Following the examples in camvid I attempted to create a learner with:

data = (ImageFileList.from_folder(path_trn)                
    .label_from_func(get_y_fn)                         
    .random_split_by_pct()                             
    .datasets(SegmentationDataset, classes=codes)      
    .transform(get_transforms(), size=96, tfm_y=True)  
    .databunch(bs=64))
learn = Learner.create_unet(data, models.resnet34)

However, when doing a learn.fit, I get a cuda error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-24-bd510e07cfec> in <module>
      1 CUDA_LAUNCH_BLOCKING=1
----> 2 learn.fit(1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     80             cb_handler.on_epoch_begin()
     81 
---> 82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
     84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_data.py in __iter__(self)
     85             y = b[1][0] if is_listy(b[1]) else b[1]
     86             if not self.skip_size1 or y.size(0) != 1:
---> 87                 yield self.proc_batch(b)
     88 
     89     def one_batch(self)->Collection[Tensor]:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_data.py in proc_batch(self, b)
     76     def proc_batch(self,b:Tensor)->Tensor:
     77         "Proces batch `b` of `TensorImage`."
---> 78         b = to_device(b, self.device)
     79         for f in listify(self.tfms): b = f(b)
     80         return b

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in to_device(b, device)
     85     "Ensure `b` is on `device`."
     86     device = ifnone(device, defaults.device)
---> 87     if is_listy(b): return [to_device(o, device) for o in b]
     88     return b.to(device)
     89 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in <listcomp>(.0)
     85     "Ensure `b` is on `device`."
     86     device = ifnone(device, defaults.device)
---> 87     if is_listy(b): return [to_device(o, device) for o in b]
     88     return b.to(device)
     89 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/torch_core.py in to_device(b, device)
     86     device = ifnone(device, defaults.device)
     87     if is_listy(b): return [to_device(o, device) for o in b]
---> 88     return b.to(device)
     89 
     90 def data_collate(batch:ItemsList)->Tensor:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch-nightly_1541497631545/work/aten/src/THC/generic/THCTensorCopy.cpp:20

The data seems fine: I was able to view it properly with data.show_batch(), and the learner was created without errors, but learn.fit() fails. I am also not sure how to create a proper metric for segmentation, although I would like to fix the CUDA error first.

uwaisiqbal · November 13, 2018, 10:58pm

Hey,

I’ve also been getting an error when defining a metric for segmentation. The inbuilt accuracy metric breaks for segmentation for some reason. But if I use the code from the camvid notebook and just remove the void mask the metric works. Something like this:

def seg_accuracy(input, target):
    target = target.squeeze(1)
    return (input.argmax(dim=1)==target).float().mean()

sgugger · November 14, 2018, 2:16pm

Your error message looks like you had already hit the CUDA error before this run. Note that any time you get a CUDA error, you have to restart your kernel to see whether anything you did solved it (very annoying, but that’s how it is).
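Since a restart is needed anyway, the top of the fresh session is also a convenient place to switch on synchronous kernel launches. In the traceback above, `CUDA_LAUNCH_BLOCKING=1` appears as the first line of the notebook cell, where it is only a Python variable assignment and does not reach the CUDA runtime. A minimal sketch, assuming the variable is set before anything initialises CUDA in the process:

```python
import os

# Must run before the first CUDA call in the process, i.e. at the top of a
# freshly restarted kernel, ideally before `import torch` / `import fastai`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# A bare `CUDA_LAUNCH_BLOCKING=1` in a cell only creates a Python variable;
# the CUDA runtime reads the process environment instead.
launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print(launch_blocking)
```

With synchronous launches the Python traceback should point at the call that actually failed, rather than at a later, unrelated line such as `b.to(device)`.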

Jamie · November 14, 2018, 3:45pm

Are your mask values 255 or 1? The TGS Challenge is a binary classification problem, so the values for non-background pixels should be 1. If the mask tensor contains 255, set div=True in open_mask().
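A quick way to answer that question is to list the distinct pixel values of a mask. The sketch below uses plain nested lists in place of an image tensor so that it runs anywhere; `raw_mask`, `distinct_labels` and `binarize` are hypothetical names for illustration, not fastai functions. For a mask that holds only 0 and 255, mapping 255 to 1 gives the same labels as dividing by 255.

```python
def distinct_labels(mask):
    """Return the sorted distinct pixel values of a 2-D mask (list of rows)."""
    return sorted({px for row in mask for px in row})

def binarize(mask, foreground=255):
    """Map `foreground` pixels to 1 and every other pixel to 0."""
    return [[1 if px == foreground else 0 for px in row] for row in mask]

# Hypothetical 2x3 mask as it might come out of an 8-bit PNG.
raw_mask = [[0, 255, 255],
            [0,   0, 255]]

print(distinct_labels(raw_mask))            # [0, 255]
print(distinct_labels(binarize(raw_mask)))  # [0, 1]
```

If the first print shows anything other than the class indices (here 0 and 1), the loss function will see labels it cannot index.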

clarkeaa13 · November 14, 2018, 5:35pm

Removing the void mask from the metric did not solve the issue. The mask tensor contains 0s and 1s.

I received this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-bd510e07cfec> in <module>
      1 CUDA_LAUNCH_BLOCKING=1
----> 2 learn.fit(1)

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162             callbacks=self.callbacks+callbacks)
    163 
    164     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     23 
     24     if opt is not None:
---> 25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()
     27         cb_handler.on_backward_end()

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/callback.py in on_backward_begin(self, loss)
    222         self.state_dict['last_loss'], self.state_dict['smooth_loss'] = loss, self.smoothener.smooth
    223         for cb in self.callbacks:
--> 224             a = cb.on_backward_begin(**self.state_dict)
    225             if a is not None: self.state_dict['last_loss'] = a
    226         return self.state_dict['last_loss']

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/fastai/basic_train.py in on_backward_begin(self, smooth_loss, **kwargs)
    264         self.losses.append(smooth_loss)
    265         if self.pbar is not None and hasattr(self.pbar,'child'):
--> 266             self.pbar.child.comment = f'{smooth_loss:.4f}'
    267 
    268     def on_epoch_end(self, epoch:int, num_batch:int, smooth_loss:Tensor,

~/anaconda3/envs/fastai12/lib/python3.7/site-packages/torch/tensor.py in __format__(self, format_spec)
    376     def __format__(self, format_spec):
    377         if self.dim() == 0:
--> 378             return self.item().__format__(format_spec)
    379         return object.__format__(self, format_spec)
    380 

RuntimeError: CUDA error: device-side assert triggered

clarkeaa13 · November 14, 2018, 6:34pm

I tried learn.pred_batch() instead of learn.fit() to see whether the problem is specific to learn.fit(). The output is all NaN. Why does pred_batch() return an output while learn.fit() raises a CUDA error?

clarkeaa13 · November 14, 2018, 8:27pm

I also have this error appearing on terminal when the cuda error appears:

`/opt/conda/conda-bld/pytorch-nightly_1542102561606/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T* , T *, long* , T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [15,0,0], thread: [49,0,0] Assertion `t >= 0 && t < n_classes` failed.`
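The assertion in that message, `t >= 0 && t < n_classes`, says that every target pixel `t` must be a valid class index. A rough illustration of why, written in plain Python for a single pixel (the real kernel works on whole tensors; `nll_pixel` and `log_probs` are made-up names for this sketch):

```python
def nll_pixel(log_probs, target):
    """Negative log-likelihood for one pixel: pick the score of the target class."""
    n_classes = len(log_probs)
    # Same condition as the assertion in the kernel message.
    if not 0 <= target < n_classes:
        raise ValueError(f"target {target} is outside [0, {n_classes})")
    return -log_probs[target]

log_probs = [-0.2, -1.7]   # hypothetical log-probabilities for 2 classes

print(nll_pixel(log_probs, 1))   # 1.7
try:
    nll_pixel(log_probs, 255)
except ValueError as err:
    print(err)                   # target 255 is outside [0, 2)
```

The target is used as an index into the per-class scores, so a label such as 255 with only two classes has nothing to index.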

In addition, the fastai function SegmentationItemList.from_folder() is not recognized, even with the latest version of fast.ai 1.0.24 it still says that such a function does not exist. Instead, I have been using ImageFileList.

Rares · November 21, 2018, 11:52am

did you fix the error? i have the same problem with accuracy metric on binary segmentation

clarkeaa13 · November 25, 2018, 6:18pm

No, I was not able to fix the error.

Jamie · November 25, 2018, 9:51pm

The accuracy metric no longer works for image segmentation with the latest fastai version. You can try using other metrics like dice or iou.

def dice(pred, targs):
    pred = (pred>0).float()
    return 2. * (pred*targs).sum() / (pred+targs).sum()

def iou(input:Tensor, targs:Tensor) -> Rank0Tensor:
    "IoU coefficient metric for binary target."
    n = targs.shape[0]
    input = input.argmax(dim=1).view(n,-1)
    targs = targs.view(n,-1)
    intersect = (input*targs).sum().float()
    union = (input+targs).sum().float()
    return intersect / (union-intersect+1.0)
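To see what the `iou` arithmetic above produces, here is the same calculation on four pixels using plain Python lists instead of tensors. `iou_flat` and its `smooth` parameter are hypothetical names for this sketch; the values are invented.

```python
def iou_flat(pred, targ, smooth=1.0):
    """Mirror of the arithmetic in `iou` above for flat lists of 0/1 labels."""
    intersect = sum(p * t for p, t in zip(pred, targ))
    total = sum(pred) + sum(targ)          # called `union` in the metric above
    return intersect / (total - intersect + smooth)

pred = [1, 1, 0, 0]   # hypothetical predicted labels after argmax
targ = [1, 0, 0, 0]   # hypothetical ground-truth labels

print(iou_flat(pred, targ))              # 1 / (3 - 1 + 1) = 0.333...
print(iou_flat(pred, targ, smooth=0.0))  # 1 / 2 = 0.5, the unsmoothed IoU
```

The `+1.0` in the denominator avoids a division by zero on empty masks but lowers the score on small ones. Both metrics assume the targets are 0s and 1s; with 255 as the foreground value the products and sums no longer count pixels.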

Rares · November 26, 2018, 1:33am

thanks! yes i ended up using iou and dice

pietro.latorre · February 1, 2019, 1:48am

Hi,
None of this solved the problem for me.
Any suggestions to debug this issue?

sgugger · February 1, 2019, 2:31pm

A CUDA device-side assertion is super generic and just means you have a bad index somewhere. It’s impossible to debug without seeing your code.
One thing that might help is to run the same thing on the CPU first, because you’ll get a clearer error message.
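One way to try this without touching the training code is to hide the GPU from the process through the standard `CUDA_VISIBLE_DEVICES` environment variable. This is a sketch under the assumption that the library picks its default device from `torch.cuda.is_available()`; it has to run in a fresh kernel before CUDA is initialised.

```python
import os

# Hide every GPU from this process. Put this at the very top of a freshly
# restarted kernel, before `import torch` / `import fastai`.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# After this, `torch.cuda.is_available()` should report False, so models and
# batches stay on the CPU and errors surface as ordinary Python exceptions.
visible = os.environ["CUDA_VISIBLE_DEVICES"]
print(repr(visible))
```

Training on the CPU is slow, so a single `learn.fit(1)` call, interrupted after the first error, is usually enough.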

tcapelle · February 1, 2019, 4:56pm

Could you post how we should instantiate a model from folders “images” and “masks” like the carvana example?
I am trying to do this, but it does not seem to work:

src = (SegmentationItemList.from_folder(path_img)
       .random_split_by_pct(0.2)
       .label_from_func(get_y_fn, classes=[0,1]))

pietro.latorre · February 1, 2019, 5:18pm

Hi @tcapelle ,
This is what I did:

Where my classes are : [‘background’, ‘person’]

If you like you can join the thread I opened on image segmentation: Image Segmentation on COCO dataset - summary, questions and suggestions

Maybe this can help both of us and also other people

tcapelle · February 1, 2019, 5:23pm

can you plot the output of src please?

pietro.latorre · February 1, 2019, 5:54pm

Yes, here it is:

As you can see in the post I linked, I may have some problems with my data because the images have different shapes (?)

tcapelle · February 1, 2019, 6:07pm

If I disable the GPU, to trace back the error, I get this:

RuntimeError                              Traceback (most recent call last)

<ipython-input-19-f2e08e2ffc17> in <module>()
----> 1 learn.lr_find(); learn.recorder.plot()

/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    176         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    177         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 178             callbacks=self.callbacks+callbacks)
    179 
    180     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/utils/mem.py in wrapper(*args, **kwargs)
     83 
     84         try:
---> 85             return func(*args, **kwargs)
     86         except Exception as e:
     87             if "CUDA out of memory" in str(e) or tb_clear_frames=="1":

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     98     except Exception as e:
     99         exception = e
--> 100         raise e
    101     finally: cb_handler.on_train_end(exception)
    102 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     88             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     89                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 90                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     91                 if cb_handler.on_batch_end(loss): break
     92 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     22 
     23     if not loss_func: return to_detach(out), yb[0].detach()
---> 24     loss = loss_func(out, *yb)
     25 
     26     if opt is not None:

/usr/local/lib/python3.6/dist-packages/fastai/layers.py in __call__(self, input, target, **kwargs)
    229         if self.floatify: target = target.float()
    230         input = input.view(-1,input.shape[-1]) if self.is_2d else input.view(-1)
--> 231         return self.func.__call__(input, target.view(-1), **kwargs)
    232 
    233 def CrossEntropyFlat(*args, axis:int=-1, **kwargs):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py in forward(self, input, target)
    902     def forward(self, input, target):
    903         return F.cross_entropy(input, target, weight=self.weight,
--> 904                                ignore_index=self.ignore_index, reduction=self.reduction)
    905 
    906 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   1968     if size_average is not None or reduce is not None:
   1969         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 1970     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   1971 
   1972 

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1788                          .format(input.size(0), target.size(0)))
   1789     if dim == 2:
-> 1790         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   1791     elif dim == 4:
   1792         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93

So it is not choosing the right loss: it should be a binary cross-entropy (predicting between 0 and 1).

pietro.latorre · February 1, 2019, 6:34pm

Try to reduce batch size

sgugger · February 1, 2019, 6:42pm

You can predict 0 or 1 with CrossEntropy (as long as it can’t be 0 and 1 at the same time, which is the case in segmentation). Your problem is the same as everyone else’s: your mask is encoded with 0s and 255s, not 0s and 1s, so you need to subclass SegmentationItemList and its open method to return open_mask(bla, div=True).
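The first point can be checked numerically. For a single pixel, softmax cross-entropy over two class scores gives the same value as binary cross-entropy on the difference of the two scores, so a two-class CrossEntropy loss is a valid choice for 0/1 masks. The sketch below uses only the standard library; the function names and the logits are invented for illustration.

```python
import math

def cross_entropy_2class(logits, target):
    """Softmax cross-entropy for one pixel with two class scores."""
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return log_norm - logits[target]

def binary_cross_entropy(logit, target):
    """Sigmoid cross-entropy for one pixel with a single score."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(p) if target == 1 else -math.log(1.0 - p)

logits = [0.3, 1.2]   # hypothetical scores for (background, foreground)
for target in (0, 1):
    ce = cross_entropy_2class(logits, target)
    bce = binary_cross_entropy(logits[1] - logits[0], target)
    print(target, round(ce, 6), round(bce, 6))   # the two losses agree
```

Both losses only make sense for targets 0 and 1, which is why the mask encoding, not the choice of loss, is the thing to fix.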