Segmentation in V1

tcapelle · October 13, 2018, 3:22pm

I am trying to use V1 to solve a segmentation problem. I started with the carvana notebook (https://github.com/fastai/fastai_docs/blob/master/dev_nb/006_carvana.ipynb) and I am having some problems with the ImageMask class. More precisely, open_mask() is returning a 4-channel image.

im = open_mask(PATH128/masks_names_png[0])
im.shape
>>torch.Size([4, 128, 128])

Is this normal?

sgugger · October 13, 2018, 3:53pm

Not sure. It is an ImageMask object, and you should look at what the final data attribute is, since it’s what will get passed to the model.

tcapelle · October 13, 2018, 3:59pm

It’s weird: open_mask is returning the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    336             method = get_real_method(obj, self.print_method)
    337             if method is not None:
--> 338                 return method()
    339             return None
    340         else:

/usr/local/lib/python3.6/dist-packages/fastai/vision/image.py in _repr_png_(self)
    102 
    103     def __repr__(self): return f'{self.__class__.__name__} {tuple(self.shape)}'
--> 104     def _repr_png_(self): return self._repr_image_format('png')
    105     def _repr_jpeg_(self): return self._repr_image_format('jpeg')
    106 

/usr/local/lib/python3.6/dist-packages/fastai/vision/image.py in _repr_image_format(self, format_str)
    107     def _repr_image_format(self, format_str):
    108         with BytesIO() as str_buffer:
--> 109             plt.imsave(str_buffer, image2np(self.px), format=format_str)
    110             return str_buffer.getvalue()
    111 

/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py in imsave(*args, **kwargs)
   2257 @docstring.copy_dedent(_imsave)
   2258 def imsave(*args, **kwargs):
-> 2259     return _imsave(*args, **kwargs)
   2260 
   2261 

/usr/local/lib/python3.6/dist-packages/matplotlib/image.py in imsave(fname, arr, vmin, vmax, cmap, format, origin, dpi)
   1369         image.set_data(arr)
   1370         image.set_clim(vmin, vmax)
-> 1371         image.write_png(fname)
   1372     else:
   1373         fig = Figure(dpi=dpi, frameon=False)

/usr/local/lib/python3.6/dist-packages/matplotlib/image.py in write_png(self, fname)
    597         """Write the image to png file with fname"""
    598         im = self.to_rgba(self._A[::-1] if self.origin == 'lower' else self._A,
--> 599                           bytes=True, norm=True)
    600         _png.write_png(im, fname)
    601 

/usr/local/lib/python3.6/dist-packages/matplotlib/cm.py in to_rgba(self, x, alpha, bytes, norm)
    255                 if xx.dtype.kind == 'f':
    256                     if norm and xx.max() > 1 or xx.min() < 0:
--> 257                         raise ValueError("Floating point image RGB values "
    258                                          "must be in the 0..1 range.")
    259                     if bytes:

ValueError: Floating point image RGB values must be in the 0..1 range.

Strangely, if I build the mask by hand, I get a single channel:

im = PIL.Image.open(PATH128/masks_names_png[0]).convert('L')
im_t = pil2tensor(im)
im_t.shape
>>torch.Size([1, 128, 128])
im_m = ImageMask(im_t.float())
im_m.shape
>>torch.Size([1, 128, 128])

sgugger · October 13, 2018, 4:04pm

Are you sure you’re using the latest version? I know open_mask was fixed a few days ago.

tcapelle · October 13, 2018, 4:05pm

fastai version: 1.0.5?
I am in Google Colab, let me restart this thing…
pip install is pulling version 1.0.5. I changed to pip install git+https:/..... and it pulled version 1.0.6.

tcapelle · October 13, 2018, 4:18pm

1.0.6 solved the issue…
I had 1.0.6 on my computer, and I was checking the code against that version (git clone on my computer, pip install on the server).
My mistake.
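
A quick way to avoid this kind of mismatch is to print the installed version inside the environment that actually raises the error. The sketch below assumes Python 3.8 or later (for importlib.metadata, which is newer than the Python 3.6 seen in the tracebacks here); the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run this in the runtime that fails (the Colab session or the server),
# not only on the machine that holds the git clone.
print("fastai:", installed_version("fastai"))
```

If the two environments print different versions, compare the library source against the one that is actually installed.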

sgugger · October 13, 2018, 5:02pm

Thanks for letting me know.

tcapelle · October 13, 2018, 5:23pm

But it is impossible to train the model; it kills the GPU…

body = create_body(tvm.resnet34(True), 2)
model = DynamicUnet(body, n_classes=2).cuda()

learn = Learner(data, model, metrics=metrics,
                loss_fn=CrossEntropyFlat())

learn.split([model[0][6], model[1]])
learn.freeze()

I am unable to get past this block…
I find it really hard to understand this line: learn.split([model[0][6], model[1]])

sgugger · October 13, 2018, 5:51pm

I think the 2 should be -2. You’re cutting your model at the second layer instead of the second-to-last, which then doesn’t work with the split.
This split will be made easier later. Basically, it lists the layers at which to separate the model, so that you get three groups for differential learning rates.
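
To make the idea of the split concrete, here is a rough sketch in plain Python. It is not fastai's implementation: the layers are just strings, split_into_groups is a made-up helper, and it assumes model[0] is the encoder body and model[1] is the decoder part.

```python
def split_into_groups(layers, split_points):
    """Cut `layers` into consecutive groups; each split point starts a new group."""
    idxs = [0] + [layers.index(p) for p in split_points] + [len(layers)]
    return [layers[a:b] for a, b in zip(idxs[:-1], idxs[1:])]


# Stand-ins for the real modules: a ResNet-style encoder (cut before the
# pooling and classifier layers) followed by a single decoder placeholder.
encoder = ["conv1", "bn1", "relu", "maxpool", "layer1", "layer2", "layer3", "layer4"]
decoder = ["unet_decoder"]
layers = encoder + decoder

# Analogue of learn.split([model[0][6], model[1]]): two split points, three groups.
groups = split_into_groups(layers, [encoder[6], decoder[0]])
for i, group in enumerate(groups):
    print(i, group)
```

Two split points give three groups: the early encoder layers, the late encoder layers, and the decoder. Each group can then receive its own learning rate.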

MicPie · October 13, 2018, 6:13pm

I am not sure if I should start a separate thread for this, but I am currently trying to debug some strange behavior when I iterate through a DataBunch built from ObjectDetectDatasets based on PNG images with bounding boxes (so kind of segmentation):

Test code for a dummy dataset of 100 entries each for train and valid:

# Create ObjectDetectDatasets
train_ds = get_datasets(PATH_train)
valid_ds = get_datasets(PATH_valid)
size = 128
bs = 4 # bs=1 is working!

# Create DataBunch
def get_data(bs, size):
    return DataBunch.create(train_ds, valid_ds, bs=bs, size=size,ds_tfms=None, path=PATH)

data = get_data(bs, size)

# Test DataBunch DataLoader
for i in range(100):
    print(i, end=', ')
    next(iter(data.train_dl.dl))

Output:
bs = 1

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,

For bs > 1 the loop stops after a non-reproducible number of steps (I guess due to random shuffling) with this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-132-d0590afb83a6> in <module>()
      1 for i in range(100):
      2     print(i, end=', ')
----> 3     next(iter(data.train_dl.dl))

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    351                 self.reorder_dict[idx] = batch
    352                 continue
--> 353             return self._process_next_batch(batch)
    354 
    355     next = __next__  # Python 2 compatibility

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    372         self._put_indices()
    373         if isinstance(batch, ExceptionWrapper):
--> 374             raise batch.exc_type(batch.exc_msg)
    375         return batch
    376 

RuntimeError: Traceback (most recent call last):
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 114, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/paperspace/fastai/fastai/torch_core.py", line 86, in data_collate
    return torch.utils.data.dataloader.default_collate(to_data(batch))
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 198, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 175, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 2 and 3 at /opt/conda/conda-bld/pytorch-nightly_1538165619353/work/aten/src/TH/generic/THTensorMoreMath.cpp:1308

or got 3 and 2.

The error also occurs with the show_image_batch() method.

I found this thread on the PyTorch forum, which points in the direction of PNG files with different channel numbers: https://discuss.pytorch.org/t/runtimeerror-invalid-argument-0/17919/5
However, in the fastai library the open_image() function uses .convert('RGB'), and when I debug the tensor shapes I always find the same shape for each element: 3 channels x width x height.

What I don’t get is: why does it work with bs = 1?

Maybe somebody has a tip?
Maybe I am using parts of the library which are currently under development?

Thank you & best regards
Michael

PS: When I try to visualize the images with show_image_batch() and bs = 1 I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-68-7d54941a820f> in <module>
      1 # http://docs.fast.ai/vision.data.html
----> 2 show_image_batch(data.train_dl, data.train_ds.classes, rows=3, figsize=(5,5))

~/fastai/fastai/vision/data.py in show_image_batch(dl, classes, rows, figsize, denorm)
     44     x = x[:rows*rows].cpu()
     45     if denorm: x = denorm(x)
---> 46     show_images(x,y[:rows*rows].cpu(),rows, classes, figsize)
     47 
     48 def show_images(x:Collection[Image],y:int,rows:int, classes:Collection[str], figsize:Tuple[int,int]=(9,9))->None:

AttributeError: 'list' object has no attribute 'cpu'

When I define a custom show_image_batch() function without the .cpu() in the jupyter notebook I get this error:
NameError: name 'show_image' is not defined

MicPie · October 13, 2018, 6:37pm

@Tcapelle If you display tvm.resnet34() and your body, you can see where the create_body function cuts.

For example, with 2 instead of -2 you only see the input stage with the 7x7 kernel and a subsequent batchnorm layer:

Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
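
The effect of the cut can be sketched with plain list slicing. This assumes create_body keeps the first `cut` top-level children of the model (which matches the output above); body_children is a made-up helper, and the names are the top-level children of torchvision's resnet34.

```python
# Top-level children of torchvision's resnet34, in order (names only).
resnet34_children = ["conv1", "bn1", "relu", "maxpool",
                     "layer1", "layer2", "layer3", "layer4",
                     "avgpool", "fc"]


def body_children(children, cut):
    """Assumed behaviour of create_body: keep children[:cut]."""
    return children[:cut]


stem_only = body_children(resnet34_children, 2)   # conv1 and bn1 only
full_body = body_children(resnet34_children, -2)  # everything except avgpool and fc
print(stem_only)
print(full_body)

# model[0][6] from the split needs a body with at least seven children.
try:
    stem_only[6]
except IndexError:
    print("cut=2: the body has no child at index 6")
print("cut=-2: child at index 6 is", full_body[6])
```

With -2 the body keeps all the convolutional stages and drops only the pooling and classification head.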

tcapelle · October 13, 2018, 8:06pm

Thanks, I had already figured it out and changed this to -2.
Anyway, lr_find now fails:

lr_find(learn)
>> ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-98-dcd2a06c9caf> in <module>()
----> 1 lr_find(learn)
      2 # learn.recorder.plot()

/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, **kwargs)
     24     cb = LRFinder(learn, start_lr, end_lr, num_it)
     25     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 26     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     27 
     28 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    135         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    136         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 137             callbacks=self.callbacks+callbacks)
    138 
    139     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     88     except Exception as e:
     89         exception = e
---> 90         raise e
     91     finally: cb_handler.on_train_end(exception)
     92 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     78             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     79                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 80                 loss = loss_batch(model, xb, yb, loss_fn, opt, cb_handler)
     81                 if cb_handler.on_batch_end(loss): break
     82 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_fn, opt, cb_handler, metrics)
     16     if not is_listy(xb): xb = [xb]
     17     if not is_listy(yb): yb = [yb]
---> 18     out = model(*xb)
     19     out = cb_handler.on_loss_begin(out)
     20 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/fastai/vision/models/unet.py in forward(self, up_in)
     28         up_out = self.upconv(up_in)
     29         cat_x = torch.cat([up_out, self.hook.stored], dim=1)
---> 30         x = F.relu(self.conv1(cat_x))
     31         x = F.relu(self.conv2(x))
     32         return self.bn(x)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in forward(self, input)
    311     def forward(self, input):
    312         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 313                         self.padding, self.dilation, self.groups)
    314 
    315 

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    271         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    272         # Python can still get and update the process status successfully.
--> 273         _error_if_any_worker_fails()
    274         if previous_handler is not None:
    275             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 2631) is killed by signal: Bus error.

This is the call I would like to understand: learn.split([model[0][6], model[1]])

I was pretty comfortable with fastai v0.7, but I am having a hard time with this…

sgugger · October 13, 2018, 8:24pm

You didn’t specify any transform, so I’m guessing you don’t have images of the same size. The error message indicates PyTorch isn’t able to group them into a batch.
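
This also explains why bs = 1 works: a batch of one has nothing to disagree with. The sketch below mimics, in plain Python, the precondition that stacking samples into a batch imposes. It is not the torch code, and the shapes are hypothetical.

```python
def stack_shapes(shapes):
    """Return the shape of the stacked batch, or raise if the samples disagree.

    Mimics the precondition of stacking: all samples must share one shape.
    """
    first = tuple(shapes[0])
    for shape in shapes[1:]:
        shape = tuple(shape)
        if len(shape) != len(first):
            raise RuntimeError(
                f"Tensors must have same number of dimensions: "
                f"got {len(first)} and {len(shape)}")
        if shape != first:
            raise RuntimeError(f"Tensor sizes must match: got {first} and {shape}")
    return (len(shapes),) + first


# Hypothetical per-sample target shapes that differ in their number of dimensions.
sample_a = (3, 4)
sample_b = (1, 3, 4)

print(stack_shapes([sample_a]))  # a batch of one always stacks
print(stack_shapes([sample_b]))  # same here
try:
    stack_shapes([sample_a, sample_b])
except RuntimeError as err:
    print("bs=2 fails:", err)
```

Because the samples are shuffled, the first mismatched pair lands in a different batch on every run, which would fit the non-reproducible step at which the loop stops.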

sgugger · October 13, 2018, 8:27pm

This error message has nothing to do with the model being split. A quick search led me there; I don’t know if this is applicable to you or not.

tcapelle · October 13, 2018, 8:55pm

learn.model.state_dict() is not working.

tcapelle · October 13, 2018, 9:00pm

I solved this by reducing the bs. I never had this problem in v0.7 with exactly the same dataset and params.
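
A "Bus error" in a DataLoader worker is often a sign that shared memory ran out, since the workers hand batches to the main process through it. That would fit a smaller batch size helping, but it is an assumption about the cause and was not confirmed in this thread. The sketch below checks the free space on /dev/shm, the usual shared-memory mount on Linux; the helper name is made up.

```python
import os
import shutil


def shared_memory_free_mib(path="/dev/shm"):
    """Return the free space at `path` in MiB, or None if the path does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / 2**20


free = shared_memory_free_mib()
if free is None:
    print("/dev/shm not found on this system")
else:
    print(f"free shared memory: {free:.0f} MiB")
```

If the value is small compared with the size of a few batches, reducing the batch size or the number of worker processes is a reasonable thing to try.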

jeremy · October 13, 2018, 11:15pm

We need more info to help you. Stack trace and exact code and error message at least.

MicPie · October 14, 2018, 10:07am

Thank you for your fast reply!
You guys are great!

I have now added ds_tfms and tfms (see code below).
I have to specify ds_tfms as a list because otherwise I get an error that it cannot be indexed.

Calling data.train_ds.tfms, data.valid_ds.tfms, data.train_dl.tfms, and data.valid_dl.tfms returns the information on the transformations, which looks OK.

However, I get this error with show_image_batch():

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-7d54941a820f> in <module>
----> 1 show_image_batch(data.train_dl, data.train_ds.classes, rows=3, figsize=(5,5))

~/fastai/fastai/vision/data.py in show_image_batch(dl, classes, rows, figsize, denorm)
     40                      denorm:Callable=None) -> None:
     41     "Show a few images from a batch."
---> 42     x,y = next(iter(dl))
     43     if rows is None: rows = int(math.sqrt(len(x)))
     44     x = x[:rows*rows].cpu()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    602                 self.reorder_dict[idx] = batch
    603                 continue
--> 604             return self._process_next_batch(batch)
    605 
    606     next = __next__  # Python 2 compatibility

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    623         self._put_indices()
    624         if isinstance(batch, ExceptionWrapper):
--> 625             raise batch.exc_type(batch.exc_msg)
    626         return batch
    627 

AttributeError: Traceback (most recent call last):
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 137, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 137, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/paperspace/fastai/fastai/vision/data.py", line 190, in __getitem__
    x = apply_tfms(self.tfms, x, **self.kwargs)
  File "/home/paperspace/fastai/fastai/vision/image.py", line 422, in apply_tfms
    tfms = sorted(listify(tfms), key=lambda o: o.tfm.order)
  File "/home/paperspace/fastai/fastai/vision/image.py", line 422, in <lambda>
    tfms = sorted(listify(tfms), key=lambda o: o.tfm.order)
AttributeError: 'functools.partial' object has no attribute 'tfm'

This is my code:

paths = [path1, path2]

stats = ([0.4914, 0.4914, 0.4914], [0.2492, 0.2492, 0.2492])
norm, denorm = normalize_funcs(*stats)

tfms = get_transforms()

def get_tfms_datasets(size, paths, tfms):
    datasets = get_datasets(paths)
    return transform_datasets(*datasets, test_ds=None, tfms=tfms, size=size)

def get_data(bs, size, paths):
    return DataBunch.create(*get_tfms_datasets(size, tfms=tfms, paths=paths), bs=bs, size=size, ds_tfms=[norm, norm], tfms=tfms)

data = get_data(bs, size, paths)

I also upgraded to the latest pytorch-nightly.
There must still be an issue with how I apply the tfms to the data…?

Best regards
Michael

sgugger · October 14, 2018, 12:46pm

Yes, ds_tfms must be a list of two lists of transforms (one for the training set, one for the validation set), as explained in the docs.
Then, in your last DataBunch, you’re mixing up the arguments: tfms are the transforms that will be applied to the batches, so it should be [norm], and ds_tfms should be your tfms variable.
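
The AttributeError in the traceback can be reproduced without fastai. The sketch below uses stand-ins only: normalize and flip are made up, and the single assumption is the one visible in the traceback, namely that item-level transforms are sorted by o.tfm.order while norm is a functools.partial.

```python
from functools import partial
from types import SimpleNamespace


def normalize(x, mean, std):
    """Toy batch-level function, standing in for the real normalization."""
    return (x - mean) / std


# Stand-in for `norm`: a functools.partial, which has no `.tfm` attribute.
norm = partial(normalize, mean=0.4914, std=0.2492)

# Stand-in for an item-level transform: an object exposing `.tfm.order`.
flip = SimpleNamespace(tfm=SimpleNamespace(order=0))


def sort_item_tfms(tfms):
    """Same sort key as in the traceback: every entry needs `.tfm.order`."""
    return sorted(tfms, key=lambda o: o.tfm.order)


print(sort_item_tfms([flip]))   # item-level transforms sort fine
try:
    sort_item_tfms([norm])      # what happens when norm is passed as ds_tfms
except AttributeError as err:
    print(err)
```

So the error is a symptom of norm landing in the per-item transform list, which is what swapping the two arguments fixes.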

tcapelle · October 14, 2018, 2:57pm

I think the error comes from loading the weights of resnet34.

from fastai.vision.models.unet import *
body = create_body(tvm.resnet34(True), -2) #/root/.torch/models/
model = DynamicUnet(body, n_classes=2).cuda()

learn = Learner(data, model, metrics=metrics,
                loss_fn=CrossEntropyFlat())
learn.split([model[0][7], model[1]])
learn.freeze()
lr_find(learn)

>>RuntimeError                              Traceback (most recent call last)
<ipython-input-78-dcd2a06c9caf> in <module>()
----> 1 lr_find(learn)
      2 # learn.recorder.plot()

/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, **kwargs)
     24     cb = LRFinder(learn, start_lr, end_lr, num_it)
     25     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 26     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     27 
     28 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    136         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    137         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 138             callbacks=self.callbacks+callbacks)
    139 
    140     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     69     cb_handler = CallbackHandler(callbacks)
     70     pbar = master_bar(range(epochs))
---> 71     cb_handler.on_train_begin(epochs, pbar=pbar, metrics=metrics)
     72 
     73     exception=False

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in on_train_begin(self, epochs, pbar, metrics)
    186         self.state_dict = _get_init_state()
    187         self.state_dict['n_epochs'],self.state_dict['pbar'],self.state_dict['metrics'] = epochs,pbar,metrics
--> 188         self('train_begin')
    189 
    190     def on_epoch_begin(self)->None:

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in __call__(self, cb_name, **kwargs)
    180     def __call__(self, cb_name, **kwargs)->None:
    181         "Call through to all of the `CallbakHandler` functions."
--> 182         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    183 
    184     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in <listcomp>(.0)
    180     def __call__(self, cb_name, **kwargs)->None:
    181         "Call through to all of the `CallbakHandler` functions."
--> 182         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    183 
    184     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

/usr/local/lib/python3.6/dist-packages/fastai/callbacks/lr_finder.py in on_train_begin(self, **kwargs)
     22     def on_train_begin(self, **kwargs:Any)->None:
     23         "Initialize optimizer and learner hyperparameters."
---> 24         self.learn.save('tmp')
     25         self.opt = self.learn.opt
     26         self.opt.lr = self.sched.start

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in save(self, name)
    167     def save(self, name:PathOrStr):
    168         "Save model with `name` to `self.model_dir`."
--> 169         torch.save(self.model.state_dict(), self.path/self.model_dir/f'{name}.pth')
    170 
    171     def load(self, name:PathOrStr):

/usr/local/lib/python3.6/dist-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol)
    207         >>> torch.save(x, buffer)
    208     """
--> 209     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    210 
    211 

/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _with_file_like(f, mode, body)
    132         f = open(f, mode)
    133     try:
--> 134         return body(f)
    135     finally:
    136         if new_fd:

/usr/local/lib/python3.6/dist-packages/torch/serialization.py in <lambda>(f)
    207         >>> torch.save(x, buffer)
    208     """
--> 209     return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
    210 
    211 

/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    286     f.flush()
    287     for key in serialized_storage_keys:
--> 288         serialized_storages[key]._write_file(f, _should_read_directly(f))
    289 
    290 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/csrc/generic/serialization.cpp:15

I will add that when this error occurs, I am forced to restart the kernel; even a model that worked before (some lines earlier) then triggers:

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 240, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_352_228954310>

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-87-aec71e564917> in <module>()
----> 1 x,y = next(iter(md.train_dl))

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    596         while True:
    597             assert (not self.shutdown and self.batches_outstanding > 0)
--> 598             idx, batch = self._get_batch()
    599             self.batches_outstanding -= 1
    600             if idx != self.rcvd_idx:

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in _get_batch(self)
    575             # need to call `.task_done()` because we don't use `.join()`.
    576         else:
--> 577             return self.data_queue.get()
    578 
    579     def __next__(self):

/usr/lib/python3.6/multiprocessing/queues.py in get(self, block, timeout)
     92         if block and timeout is None:
     93             with self._rlock:
---> 94                 res = self._recv_bytes()
     95             self._sem.release()
     96         else:

/usr/lib/python3.6/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

/usr/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    271         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    272         # Python can still get and update the process status successfully.
--> 273         _error_if_any_worker_fails()
    274         if previous_handler is not None:
    275             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 351) is killed by signal: Bus error.
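
Regarding the "device-side assert triggered" error earlier in this post: with a cross-entropy loss, this assert is commonly caused by target values outside 0..n_classes-1 rather than by the pretrained weights. CUDA reports such asserts asynchronously and the context stays broken until the process restarts, which would match the forced kernel restart. Whether that is the cause here is a guess. The sketch below checks a hypothetical mask in plain Python; out_of_range_labels is a made-up helper.

```python
def out_of_range_labels(mask_values, n_classes):
    """Return the sorted label values that are not valid class indices."""
    return sorted({v for v in mask_values if not 0 <= v < n_classes})


# Hypothetical flattened mask read from an 8-bit PNG: background 0, foreground 255.
mask_values = [0, 0, 255, 255, 0]

bad = out_of_range_labels(mask_values, n_classes=2)
print("invalid labels:", bad)

# One possible fix for a binary mask: map every non-zero pixel to class 1.
fixed = [1 if v > 0 else 0 for v in mask_values]
print("after remapping:", out_of_range_labels(fixed, n_classes=2))
```

Running the same check on the CPU, on the real masks, usually gives a clearer error message than the CUDA assert does.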