RuntimeError: DataLoader worker is killed by signal

balnazzar · December 10, 2018, 10:02am

I made extensive tests on the dgx. The GC doesn’t make any real difference.

The problem tends to happen more frequently as you use:

More than one GPU
Bigger models
Bigger datasets

When it happens, there is still plenty of free RAM and VRAM.

Quite surprisingly, I also get that issue more frequently on the dgx than on my home machine, with the same datasets/models.

It has become a real hindrance for my workflow, as of late.

hwasiti · December 10, 2018, 10:40am

It is those 3 points you mentioned that are enormously large in FB production systems. So I am puzzled: how does FB use Pytorch in production systems then?

balnazzar · December 10, 2018, 10:46am

Indeed. Furthermore, I have updated to 1.0 stable.

I’m quite convinced we are doing something in the wrong way. Pytorch cannot be so buggy.

marcmuc · December 10, 2018, 10:48am

I have just posted a comment to the pytorch bug ticket above. I think I have figured out at least one example of what goes wrong. @balnazzar maybe you can have a look at how your data is stored/handled in the datasets/dataloaders and experiment with that. It could make a huge difference when forking out the workers. I have made a notebook gist to demonstrate this. Especially if you are dealing with text / tokens (as I seem to remember you wrote about) this could be a key issue. See here, example with 8 workers, 10 Million strings. Just changing the datatype is the difference in the memory explosion (factor 4):

Memory-Consumption in GB with fixed length string array:

Memory-Consumption in GB with object array (only change!)

It basically has nothing to do with pytorch tensors etc. It is a problem of the other data stored in the dataloaders and workers, usually consisting of lists of paths/filenames and dicts of labels (which all store strings/objects).
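
The difference described above can be reproduced in miniature. The following is a minimal sketch, not the code from the notebook gist: it assumes NumPy is installed, the file names are made up, and `to_fixed_width` is a hypothetical helper. It packs a list of Python strings into a single fixed-width byte array, so that workers see one contiguous buffer instead of many separately reference-counted objects. A commonly cited explanation for the blow-up is that merely reading a Python object in a forked worker updates its reference count, which forces the memory page holding it to be copied.

```python
import sys
import numpy as np


def to_fixed_width(strings):
    """Pack ASCII strings into one contiguous fixed-width byte array (dtype 'S<n>')."""
    return np.array(strings, dtype=np.bytes_)


# Hypothetical file list standing in for the paths a dataset would hold.
paths = [f"train/img_{i:07d}.png" for i in range(10_000)]
packed = to_fixed_width(paths)

# Rough size of the list plus every str object it points to.
object_bytes = sys.getsizeof(paths) + sum(sys.getsizeof(p) for p in paths)
# Size of the single packed buffer.
packed_bytes = packed.nbytes

print(f"list of str: {object_bytes / 1e6:.2f} MB, fixed-width array: {packed_bytes / 1e6:.2f} MB")
print(packed[5].decode("ascii"))  # items are bytes and need decoding on access
```

The trade-off is that every entry is padded to the length of the longest string and comes back as bytes, so a dataset would decode it inside `__getitem__`.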

balnazzar · December 10, 2018, 11:01am

@marcmuc

Thanks for your feedback and suggestions.
It’s not just about text data. Yesterday I went just crazy working on images. Same issue.

I’ll try and experiment with your nb (thanks), but note what I reported above: there is still plenty of free memory when I experience the error, particularly when I work on non-text data.

marcmuc · December 10, 2018, 11:40am

Okay, sorry. But then the reason you get the “killed by signal” message is probably different from the reason @devforfu or I get it, because ours was definitely related to running out of memory. Have you used his memory usage callback yet to track the consumption while running the model? Are you running multiple models or other processes that could have short “spikes/bursts” in memory consumption? That would be enough, even if your training process itself is not the culprit.
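
For reference, one way to check for such spikes without any extra library is to sample the peak resident memory of the process at intervals, for example once per epoch. This is a minimal sketch using only the standard library, not the callback mentioned above. `peak_rss_mb` is a hypothetical helper name, and it only covers the calling process, so worker processes would need to be sampled separately.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of the current process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
buffer = bytearray(50 * 1024 * 1024)  # simulate a short 50 MB burst
after = peak_rss_mb()
print(f"peak RSS before: {before:.0f} MB, after: {after:.0f} MB")
```

Because the value is a high-water mark, it still shows a burst that has already ended by the time it is read.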

balnazzar · December 10, 2018, 1:03pm

I just ran into that error. I’m working with images right now. I just increased the size of pics from 299 to 352. The dataloader was killed as soon as I ran fit_one_cycle.

Kernel restarted, tried to set 352 from the beginning. Nothing: it is killed as soon as I begin the training process.

I cannot make use of your notebook right now (I’m in the middle of my work, but I’ll log in on the dgx during the night and run a test with your nb…), but I can say that over half the RAM is unused.

Also, I’m using a single gpu.

devforfu · December 10, 2018, 1:50pm

Yes, I am also getting these errors while working with image datasets. Path objects definitely bring some overhead, but they are not at the core of this issue. I haven’t tried it yet, but there is the torchvision.datasets.ImageFolder class, which doesn’t include any sophisticated dependencies:

github.com

pytorch/vision/blob/master/torchvision/datasets/folder.py

import torch.utils.data as data

from PIL import Image

import os
import os.path
import sys


def has_file_allowed_extension(filename, extensions):
    """Checks if a file is an allowed extension.

    Args:
        filename (string): path to a file
        extensions (iterable of strings): extensions to consider (lowercase)

    Returns:
        bool: True if the filename ends with one of given extensions
    """
    filename_lower = filename.lower()

This file has been truncated.

No numpy, pandas, or pathlib, as simple as it could be. So if this guy leaks, then we probably have only two possibilities:

bug in PyTorch
problems with built-in multiprocessing as mentioned in Kaggle’s discussion

balnazzar · December 10, 2018, 5:36pm

Hi Ilia, thanks for your feedback.

Quite surprisingly, the dataloader worker gets killed by bus signal even if I set num_cpus=0 (afaik, this superseded num_workers) in the ImageDataBunch.

What makes the difference is the size of the images. Indeed, everything works fine till I set a size above 306x306. I’m still trying to figure out why that happens.

marcmuc · December 10, 2018, 9:55pm

Just to be clear, I have also only ever gotten these errors on ‘image’ datasets, not text. But what I tried to show with the notebook is that it has nothing to do with the actual content of the data: the lists of filenames, stored as strings, and the dicts of labels are enough to cause this if they are large enough. I just meant that this problem would be even worse if you were using other large lists of objects, such as tokens for language models, within the dataloaders… I have not looked at it, but I would assume the pytorch ImageFolder method also stores filenames in some sort of list; as long as that receives no special treatment, the same problems would apply.
The case of image size causing “killed by bus” cannot be explained by my statements above…

And the problem we had on quickdraw only appears when everything combined doesn’t fit in RAM. As long as everything fits, i.e. (amount of RAM per process) x (number of processes) is available, the problem doesn’t appear. That is why it never pops up with the small datasets used in the lessons, and for most people this edge case will probably not matter either.
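
As a rough illustration of that condition, the check below multiplies a per-process footprint by the number of processes and compares the result with the installed RAM. The numbers are invented for illustration, and counting the main process as one extra process on top of the workers is an assumption.

```python
def fits_in_ram(per_process_gb, num_workers, total_ram_gb):
    """Return (needed_gb, fits) assuming every process holds its own full copy."""
    num_processes = num_workers + 1  # assumption: workers plus the main process
    needed_gb = per_process_gb * num_processes
    return needed_gb, needed_gb <= total_ram_gb


# Hypothetical figures: 6 GB per process, 8 workers, 64 GB of RAM.
needed, ok = fits_in_ram(6.0, 8, 64.0)
print(f"needed: {needed:.0f} GB, fits: {ok}")

# The same per-process footprint with 12 workers no longer fits.
needed_12, ok_12 = fits_in_ram(6.0, 12, 64.0)
print(f"needed: {needed_12:.0f} GB, fits: {ok_12}")
```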

hwasiti · December 29, 2018, 1:14am

@balnazzar I am working on the whale competition where I have only ~75,000 images. I got the error:
RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault.

It is a local machine and it did not fill more than 15% of the 64GB RAM
Ubuntu 16
64GB
Cuda 10
Pytorch 1 stable

Error happened right after fit_one_cycle for a resnet50 model (image size 448), num_workers= 4

It seems that if I leave the data augmentation transforms at the fastai defaults, I do not get the error, i.e. with:
.transform(get_transforms(do_flip=False), size=SZ, resize_method=ResizeMethod.SQUISH)

instead of

.transform(get_transforms(do_flip=False, max_zoom=1.5, max_lighting=0.5, max_warp=0.7), size=SZ, resize_method=ResizeMethod.SQUISH)

Did you try sticking with the default transforms, and do you still get the error?

And I think the error that you and I are getting is different from the memory leak in the case of @marcmuc and @devforfu, where the RAM fills up before reaching the end of the epoch.

I noticed this error also in the Quick Draw comp, where I could not change anything in the default transform parameters. It does not even seem related to the number of images in the dataset: this whale competition has only a few tens of thousands of images.

Edit: it seems there are certain limits for the transform arguments that cannot be exceeded. For example, max_warp=0.6 fails immediately after learn.fit_one_cycle(8) in the pets notebook. However, if I set it to 0.5 (default is 0.2) it fails around the 2nd epoch. Keep in mind that the maximum value is applied randomly (and rarely) to images, so the run fails whenever it happens to be applied, whether in the 1st epoch or a subsequent one.

I will make tests on what limits are acceptable on the lesson1 pets notebook after resnet50 fit_one_cycle.

Reproduced the error both on GCP and local machine.

ScreenClip

Trace of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     19     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     20                                         pct_start=pct_start, **kwargs))
---> 21     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     22 
     23 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()
---> 27         cb_handler.on_backward_end()
     28         opt.step()
     29         cb_handler.on_step_end()

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in on_backward_end(self)
    229     def on_backward_end(self)->None:
    230         "Handle end of gradient calculation."
--> 231         self('backward_end', False)
    232     def on_step_end(self)->None:
    233         "Handle end of optimization step."

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in <listcomp>(.0)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in on_backward_end(self, **kwargs)
     75     def on_backward_end(self, **kwargs):
     76         "Clip the gradient before the optimizer step."
---> 77         if self.clip: nn.utils.clip_grad_norm_(self.learn.model.parameters(), self.clip)
     78 
     79 def clip_grad(learn:Learner, clip:float=0.1)->Learner:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/utils/clip_grad.py in clip_grad_norm_(parameters, max_norm, norm_type)
     30         total_norm = 0
     31         for p in parameters:
---> 32             param_norm = p.grad.data.norm(norm_type)
     33             total_norm += param_norm.item() ** norm_type
     34         total_norm = total_norm ** (1. / norm_type)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/tensor.py in norm(self, p, dim, keepdim)
    250     def norm(self, p="fro", dim=None, keepdim=False):
    251         r"""See :func: `torch.norm`"""
--> 252         return torch.norm(self, p, dim, keepdim)
    253 
    254     def btrifact(self, info=None, pivot=True):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/functional.py in norm(input, p, dim, keepdim, out)
    716             return torch._C._VariableFunctions.frobenius_norm(input)
    717         elif p != "nuc":
--> 718             return torch._C._VariableFunctions.norm(input, p)
    719 
    720     if p == "fro":

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    272         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    273         # Python can still get and update the process status successfully.
--> 274         _error_if_any_worker_fails()
    275         if previous_handler is not None:
    276             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault.

balnazzar · December 30, 2018, 7:53pm

Yes. I suspected the transforms were part of the problem, and tried to stick with defaults. The error still persisted. But if I do NOT transform anything at all, the error does not pop up. Note that without the transformations the total amount of data going around is much smaller.

Indeed. The memory of the DGX is hard to fill during a regular DL project.

However, it seems that lighting is involved too, apart from the amount of max_warp.

Since your findings are quite interesting, try and tag Jeremy. Since he’s participated in this thread, we should not be at risk of being quartered and beheaded.

hwasiti · December 31, 2018, 7:28am

Today, I was working on the whale competition and it failed even with only max_warp = 0.4 (after a few epochs). So I had to decrease it to 0.3. Returning to the fastai defaults did not give me errors.
This means even the table in my previous post is not consistent across all datasets. So perhaps the defaults can also cause this error in some cases (default max_warp = 0.2)?

Did you get the same error as mine?

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault.

If I remember correctly, in the quickdraw comp. even a slight change in max_zoom will trigger an error at some point in some epoch. The larger the argument value, the sooner the error pops up.

I bet @jeremy knows about this issue. Perhaps we should wait a bit until other, more important things in fastai v1 are settled.

balnazzar · December 31, 2018, 1:36pm

No. Mine was killed by a bus signal.

I was curious to know if they (the fastai developers) get that kind of error too…

marcmuc · December 31, 2018, 6:07pm

I have had segmentation faults before but can’t remember what solved it. But this is from bookmarks I saved then:

This last one definitely helped in some non-pytorch cases; it’s about increasing the stack that the system provides for (python) processes.

Maybe it helps.
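
As a hedged illustration of what increasing the stack can look like from inside Python on Linux (the bookmarked link is not reproduced here, so this is not necessarily the method it describes): the standard `resource` module can raise the soft stack limit up to the hard limit, and processes started afterwards inherit it. `raise_stack_limit` and the 64 MB target are illustrative choices. The shell equivalent is `ulimit -s`.

```python
import resource


def raise_stack_limit(target_bytes):
    """Try to raise the soft stack limit to target_bytes; return the (soft, hard) limits in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to raise
    # The soft limit can never exceed the hard limit.
    new_soft = target_bytes if hard == resource.RLIM_INFINITY else min(target_bytes, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform refused the change; keep the current limits
    return resource.getrlimit(resource.RLIMIT_STACK)


old_soft, old_hard = resource.getrlimit(resource.RLIMIT_STACK)
new_limits = raise_stack_limit(64 * 1024 * 1024)  # illustrative 64 MB target
print(f"stack limit before: {old_soft}, after: {new_limits[0]} (hard: {new_limits[1]})")
```

The call would have to run before the DataLoader workers are created for them to pick up the new limit.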

devforfu · January 2, 2019, 7:48am

Hm, these are interesting observations, guys. I had some issues with transformations as well, though not as critical as yours. I am trying to train a model on a face landmarks detection dataset, and in my case, the data block API raises some warnings about data inconsistency.

I am going to continue experiments on various datasets using fastai and plain torch. It is probably worth trying some other ways to retrieve the data, such as Redis caching, which @marcmuc mentioned previously in one of his links.

balnazzar · January 3, 2019, 2:32pm

Could you provide examples of such inconsistencies? Thanks!

devforfu · January 5, 2019, 2:41pm

Sure, will do! I am trying to re-write that code with the most recent version of the library and a slightly modified dataset to check if the issue still exists.

I can’t remember the exact error message (I will try to replicate it later), but it was something related to the size of the target tensor. In my case, each target is a 2D array of (y, x) face landmark coordinates, with 42 (21 times 2) elements. After transformations were applied, I got a warning that my observations could not be gathered into a batch because their shapes were different, like 19x2, 18x2, 20x2, etc. So it was as if some of the landmarks were “lost” during the transformation process.

hwasiti · February 1, 2019, 9:35pm

For those who are getting:
RuntimeError: DataLoader worker (pid 173) is killed by signal: Bus error.

Maybe increasing the shared memory of the system will solve the issue. More details here:
https://www.kaggle.com/product-feedback/72606
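
A quick way to see how much shared memory is available is to look at the tmpfs mounted at /dev/shm, which PyTorch DataLoader workers use to pass tensors to the main process on Linux. The sketch below uses only the standard library. `shm_usage_gb` is a hypothetical helper, and the path is an assumption that holds on typical Linux systems and containers.

```python
import os
import shutil


def shm_usage_gb(path="/dev/shm"):
    """Return (total, used, free) in gigabytes for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return usage.total / gb, usage.used / gb, usage.free / gb


# Fall back to the root filesystem if /dev/shm does not exist on this machine.
target = "/dev/shm" if os.path.isdir("/dev/shm") else "/"
total, used, free = shm_usage_gb(target)
print(f"{target}: total {total:.2f} GB, used {used:.2f} GB, free {free:.2f} GB")
```

If the total turns out to be very small inside a Docker container, the size can be set with the `--shm-size` option of `docker run`. Whether a hosted environment lets you change it is a separate question.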

marcello_m · February 2, 2019, 8:06pm

I am working on Kaggle and I uploaded a dataset of images of mine.
But I get the error
DataLoader worker (pid 54) is killed by signal: Bus error.
when I try to inspect my data with
data.show_batch(rows=3, figsize=(7,6))

The thread on Kaggle.com (see above) has no solution whatsoever.