RuntimeError: DataLoader worker is killed by signal

I have been reading up on this a little more after the insights from the 1st place winners. Here is the best material I could find, and I think we have the explanation now. We have compounding problems in Python itself (copy-on-access), PyTorch (the way multiprocessing is used) and fastai (too-large objects, not wrapped “correctly”).

The core is this: there is no way of storing arbitrary Python objects (such as pandas DataFrames, dicts of Path objects, or even simple lists) in shared memory in Python without triggering copy-on-write behaviour, because refcounts get written every time something reads from these objects. The refcount writes dirty one memory page at a time, which is why consumption grows slowly, whereas spawning the processes (and/or copying the entire objects at the beginning) would make it jump up immediately. Either way, the worker processes end up having all/most of the memory copied over bit by bit, which is why we get the memory overflow problem. The best description of this behaviour is here.
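
To make the “a read is really a write” point concrete, here is a tiny illustration of the refcount mechanism (plain CPython behaviour, nothing fastai-specific; the filename string is made up):

import sys

s = "train/img_000001.png"   # imagine one of millions of filename strings
print(sys.getrefcount(s))    # some small number, e.g. 2 or 3
alias = s                    # "just reading" it into another name...
print(sys.getrefcount(s))    # ...bumped the count: the interpreter WROTE to the
                             # ob_refcnt field in the string's object header
# After fork(), such a write dirties the memory page the object lives on, so the
# kernel gives the worker its own private copy of that page: copy-on-write
# triggered by what looks like a read.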

-> Hacky Workaround Solution 1: Check the memory consumption of your main process -> divide total free memory by this and set the number of workers to the resulting number (absolute maximum). This is the background for @larcat’s num_workers=2 solution working for him, I assume. It is also why you don’t ever get these problems with huge amounts of memory (like @hwasiti showed on GCP): as long as num_workers x total_mem_of_main_process < total_mem_available, everything is fine!
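
A rough sketch of that rule of thumb (psutil is used here purely for illustration, it is not something fastai does for you):

import os
import psutil

# Worst case under copy-on-access: each forked worker may eventually hold its
# own copy of (most of) the main process's memory.
main_rss = psutil.Process(os.getpid()).memory_info().rss  # memory of main process
free_mem = psutil.virtual_memory().available              # free RAM right now
max_workers = max(1, free_mem // main_rss)                # absolute upper bound
print(f"main: {main_rss/1e9:.1f} GB, free: {free_mem/1e9:.1f} GB "
      f"-> keep num_workers below {max_workers}")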

-> Hacky Workaround Solution 2: Make sure the main process occupies as little memory as possible, by a) not storing lists of Path objects (i.e. using .from_csv methods and not .from_folder, as suggested by Jeremy), b) removing any unnecessary intermediate objects/lists/stuff from your main process (i.e. using del), and c) running gc.collect() before starting the fit process, so before the workers get forked (not sure whether b) and c) really help much, but they can’t hurt).
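
For b) and c) that just means something like the following right before fitting (the variable names are hypothetical placeholders for whatever large intermediates you created, and learn is assumed to exist already):

import gc

del raw_df, tmp_label_dict   # hypothetical large objects no longer needed
gc.collect()                 # clean up the main process...
learn.fit_one_cycle(1)       # ...before the DataLoader workers get forked here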

Real Solutions (not tested yet)

-> A) Using multiprocessing like now: in order for Python multiprocessing to work without these refcount effects, the objects have to be made “shareable” by wrapping them in multiprocessing.Array before the process pool is created and the workers are forked. This supposedly ensures that the memory is really shared and no copy-on-write happens. This explains how to do it for numpy arrays, and this explains the reasoning behind it again. Don’t get confused by some false statements, even by the authors of these good answers, stating that copy-on-write makes all of this unnecessary, which is not true. One comment also points to this:

“Just to note, on Python fork() actually means copy on access (because just accessing the object will change its ref-count).”
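
A minimal sketch of the numpy + multiprocessing.Array recipe from those answers (sizes and the dummy worker are illustrative, and it relies on the default fork start method on Linux):

import multiprocessing as mp
import numpy as np

N = 1_000_000
# Allocate the backing buffer in shared memory BEFORE any workers exist.
shared = mp.Array('f', N, lock=False)            # raw shared ctypes float buffer
data = np.frombuffer(shared, dtype=np.float32)   # numpy view onto it, no copy
data[:] = np.random.rand(N)

def worker(i):
    # Reading `data` only touches the refcount of the small ndarray wrapper,
    # not one refcount per element, so the shared buffer is never copied.
    return float(data.sum())

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        print(pool.map(worker, range(4)))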

-> B) Using external tools/managers for storing the shared-access objects, instead of storing them in the main process and the forked processes. Solutions could be the Pyro library as mentioned by the winners, but something like Redis might also be interesting. @vitaliy experimented with this in the context of this competition almost 2 months ago, unfortunately without any replies; we should take a closer look at that, I think!
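
For the Redis variant the idea would be roughly this (a hypothetical sketch assuming a local Redis server and the redis-py client; the key names are made up):

import redis

# One Redis server holds the filename -> label mapping; the main process and
# each forked worker just open a connection instead of holding a big dict.
r = redis.Redis(host='localhost', port=6379, db=0)

r.set('train/img_000001.png', 'new_whale')         # populate once, up front

# inside Dataset.__getitem__ a worker asks Redis instead of touching shared Python objects
label = r.get('train/img_000001.png').decode()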

Disclaimer: I am not an expert in any of this, I just followed a lot of Stack Overflow link trails :wink:
If you think it is useful, maybe I should split this long post out into a separate topic, after working in some of your comments/corrections.

6 Likes

A great summary! I am going to use the on-the-fly Dataset implementation without pandas to see if it can help solve the problem. Will post an update as soon as it's ready.

1 Like

Also try to make your processing faster, so you need fewer workers. If you’re using JPEGs, you should do this to install an accelerated JPEG library, since that’s a major overhead otherwise:

conda uninstall --force jpeg libtiff -y
conda install -c conda-forge libjpeg-turbo
CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall pillow-simd
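
A quick (admittedly informal) way to check that it took effect: pillow-simd releases carry a “.postN” suffix on top of the Pillow version they track, so:

import PIL
print(PIL.__version__)   # e.g. '5.3.0.post0' for pillow-simd, plain '5.3.0' otherwise
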
7 Likes

@devforfu
This is slightly off topic, but there’s an example you posted in this thread where you display map3 during training. Could you maybe provide a code snippet of how you did this? I’m pretty new to the PyData stack, and the numpy array-manipulation syntax is still pretty unintuitive to me.

Thanks!

PyTorch 1.0 stable is released. Please note the release notes and particularly the bug fixes (some serious bugs). There are a few related to memory leaks and data loaders. It would be interesting to know whether this fixes your issues.

1 Like

Following a few links from the release notes’ fixes section leads to the (high-priority) bug below, which is still open. I have posted links to this thread in the comments there, along with a short summary of my post above. (I have recategorized this thread so that it is linkable from the outside, because this issue is not really course-v3 related.)

2 Likes

I ran extensive tests on the DGX. The GC doesn’t make any real difference.

The problem tends to happen more frequently as you use:

  • More than one GPU
  • Bigger models
  • Bigger datasets

As it happens, there is still plenty of free RAM and VRAM.

Quite surprisingly, I also get that issue more frequently on the DGX than on my home machine, with the same datasets/models.

It has become a real hindrance for my workflow, as of late.

1 Like


Those 3 points you mentioned are exactly the things that are enormously large in FB production systems. I am puzzled how FB uses PyTorch in production then?

2 Likes

Indeed. Furthermore, I’ve updated to 1.0 stable.

I’m quite convinced we are doing something the wrong way. PyTorch cannot be so buggy.

2 Likes

I have just posted a comment to the PyTorch bug ticket above. I think I have figured out at least one example of what goes wrong. @balnazzar, maybe you can have a look at how your data is stored/handled in the datasets/dataloaders and experiment with that; it could make a huge difference when the workers are forked out. I have made a notebook gist to demonstrate this. Especially if you are dealing with text/tokens (as I seem to remember you wrote about), this could be a key issue. See here, an example with 8 workers and 10 million strings. Just changing the datatype makes the difference in the memory explosion (a factor of 4):

Memory consumption in GB with fixed-length string array: [screenshot]

Memory consumption in GB with object array (only change!): [screenshot]

It basically has nothing to do with PyTorch tensors etc.; it is a problem of the other data stored in the dataloaders and workers, usually lists of paths/filenames and dicts of labels (which all store strings/objects).
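
I don't have the gist inlined here, but a rough sketch of the kind of comparison it makes looks like this (sizes, filenames and worker count are illustrative; it needs Linux/fork and psutil):

import multiprocessing as mp
import os

import numpy as np
import psutil

N = 5_000_000   # pretend these are the filenames held by a Dataset

def uss_gb():
    # USS = memory unique to this process, i.e. pages actually copied for it
    return psutil.Process(os.getpid()).memory_full_info().uss / 1e9

def worker(arr):
    # Iterate like a DataLoader worker that looks up filenames all epoch long.
    total = sum(len(arr[i]) for i in range(len(arr)))
    print(f"pid {os.getpid()}: unique memory = {uss_gb():.2f} GB")

if __name__ == '__main__':
    # Fixed-length dtype: one contiguous buffer, no per-item Python objects.
    filenames = np.array([f"train/img_{i:08d}.png" for i in range(N)], dtype='U24')
    # filenames = filenames.astype(object)   # <- object dtype: 5M separate str
    #                                        #    objects whose refcounts get
    #                                        #    written on every access
    procs = [mp.Process(target=worker, args=(filenames,)) for _ in range(8)]
    for p in procs: p.start()
    for p in procs: p.join()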

@marcmuc

Thanks for your feedback and suggestions.
It’s not just about text data. Yesterday I went crazy working on images. Same issue.

I’ll try and experiment with your nb (thanks), but note what I reported above: there is still plenty of free memory when I experience the error, particularly when I work on non-text data.

Okay, sorry. But then the reason you get the “killed by signal” message is probably different from why @devforfu or I get it, because in our case it was definitely related to running out of memory. Have you used his memory usage callback yet to track consumption while running the model? Are you running multiple models or other processes that could have short spikes/bursts in memory consumption? Because that would be enough, even if your training process itself is not the culprit.
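
If you don't have that callback at hand, even a crude background logger is enough to spot spikes while a model trains (just psutil in a daemon thread, not the callback mentioned above):

import os
import threading
import time

import psutil

def log_memory(interval=5):
    proc = psutil.Process(os.getpid())
    while True:
        vm = psutil.virtual_memory()
        print(f"train process: {proc.memory_info().rss/1e9:.1f} GB | "
              f"system used: {vm.percent}%")
        time.sleep(interval)

# the daemon thread dies with the kernel; start it before calling learn.fit(...)
threading.Thread(target=log_memory, daemon=True).start()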

2 Likes

I just ran into that error. I’m working with images right now. I just increased the size of the pics from 299 to 352, and the DataLoader was killed as soon as I ran fit_one_cycle.

Kernel restarted, tried to set 352 from the beginning. Nothing: it gets killed as soon as I begin the training process.

I cannot make use of your notebook right now (I’m in the middle of my work, but I’ll log in to the DGX during the night and run a test with your nb…), but I can say that over half the RAM is unused.

Also, I’m using a single gpu.

Yes, I am also getting these errors while working with image datasets. The main problem is that Path objects definitely bring some overhead, but they are not at the core of this issue. I haven’t tried it yet, but there is the torchvision.datasets.ImageFolder class, which doesn’t pull in any sophisticated dependencies:

No numpy, pandas, or pathlib, as simple as it could be (minimal usage sketch after the list below). So if this class leaks, then we probably have only two possibilities:

  1. bug in PyTorch
  2. problems with built-in multiprocessing as mentioned in Kaggle’s discussion
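
A minimal way to exercise ImageFolder in isolation, with nothing but torchvision and the standard DataLoader in the picture (the directory name and settings are made up):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfms = transforms.Compose([transforms.Resize(256),
                           transforms.CenterCrop(224),
                           transforms.ToTensor()])
ds = datasets.ImageFolder('data/train', transform=tfms)
dl = DataLoader(ds, batch_size=64, num_workers=4, shuffle=True)

for epoch in range(5):
    for xb, yb in dl:       # if memory still climbs here, fastai is off the hook
        pass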

Hi Ilia, thanks for your feedback.

Quite surprisingly, the dataloader worker gets killed by bus signal even if I set num_cpus=0 (afaik, this superseded num_workers) in the ImageDataBunch.

What makes the difference is the size of the images. Indeed, everything works fine till I set a size above 306x306. I’m still trying to figure out why that happens.

Just to be clear, I have also only ever gotten these errors on ‘image’ datasets, not text. But what I tried to show with the notebook is that it has nothing to do with the actual content of the data: the lists of filenames stored as strings and the dicts of labels are enough to cause this, if they are large enough. I just meant that this problem would be even worse if you were using other large lists of objects, such as tokens for language models, within the dataloaders… And I have not looked at it, but I would assume the PyTorch ImageFolder class also stores filenames in some sort of list; as long as that receives no special treatment, the same problems would apply.
The case of image size causing “killed by bus” cannot be explained by my statements above, though…

And the problem we had on quickdraw only appears when everything combined doesn’t fit in RAM. As long as everything fits, i.e. (amount of RAM per process) x (num processes) is available, this problem doesn’t appear, which is why it of course never pops up with the small datasets in the lessons, and probably for most people this edge case will not really matter either.

@balnazzar I am working on the whale competition where I have only ~75,000 images. I got the error:
RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault.

It is a local machine and it did not fill more than 15% of the 64GB RAM
Ubuntu 16
64GB
Cuda 10
Pytorch 1 stable

The error happened right after running fit_one_cycle for a resnet50 model (image size 448), num_workers=4.

It seems that if I leave the data augmentation transforms at the fastai defaults, I do not get the error, i.e. using:
.transform(get_transforms(do_flip=False), size=SZ, resize_method=ResizeMethod.SQUISH)

instead of

.transform(get_transforms(do_flip=False, max_zoom=1.5, max_lighting=0.5, max_warp=0.7), size=SZ, resize_method=ResizeMethod.SQUISH)

Did you try sticking with the default transforms, and did you still get the error?

And I think the error that you and I are getting is different from the memory leak in the case of @marcmuc and @devforfu, where the RAM fills up before reaching the end of the epoch.

I noticed this error also in the Quick Draw comp, where I could not change anything in the default transform parameters. It does not even seem related to the number of images in the dataset; this whale competition has only a few tens of thousands of images.

Edit: it seems there are certain limits for the transform arguments that cannot be exceeded. For example, max_warp=0.6 will fail immediately after learn.fit_one_cycle(8) in the pets notebook. However, if I set it to 0.5 (the default is 0.2), it will fail around the 2nd epoch. Keep in mind that the maximum value is only rarely (randomly) applied to images, and if by chance it is applied somewhere in the 1st epoch or a subsequent epoch, then it will fail.

I will run tests on which limits are acceptable in the lesson-1 pets notebook after the resnet50 fit_one_cycle.
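
Something along these lines is what I have in mind (a sketch only, written from memory against the lesson-1 style fastai v1 API, so names/arguments may need adjusting):

from fastai.vision import *

path = untar_data(URLs.PETS)/'images'
data = (ImageDataBunch.from_name_re(
            path, get_image_files(path), r'/([^/]+)_\d+.jpg$',
            ds_tfms=get_transforms(max_warp=0.6),   # <- the argument under test
            size=299, bs=32)
        .normalize(imagenet_stats))
learn = create_cnn(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(8)   # watch for "DataLoader worker killed by signal" here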

Reproduced the error both on GCP and local machine.


Trace of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     19     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     20                                         pct_start=pct_start, **kwargs))
---> 21     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     22 
     23 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     25         loss = cb_handler.on_backward_begin(loss)
     26         loss.backward()
---> 27         cb_handler.on_backward_end()
     28         opt.step()
     29         cb_handler.on_step_end()

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in on_backward_end(self)
    229     def on_backward_end(self)->None:
    230         "Handle end of gradient calculation."
--> 231         self('backward_end', False)
    232     def on_step_end(self)->None:
    233         "Handle end of optimization step."

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/callback.py in <listcomp>(.0)
    186         "Call through to all of the `CallbakHandler` functions."
    187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
--> 188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 
    190     def on_train_begin(self, epochs:int, pbar:PBar, metrics:MetricFuncList)->None:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/fastai/train.py in on_backward_end(self, **kwargs)
     75     def on_backward_end(self, **kwargs):
     76         "Clip the gradient before the optimizer step."
---> 77         if self.clip: nn.utils.clip_grad_norm_(self.learn.model.parameters(), self.clip)
     78 
     79 def clip_grad(learn:Learner, clip:float=0.1)->Learner:

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/nn/utils/clip_grad.py in clip_grad_norm_(parameters, max_norm, norm_type)
     30         total_norm = 0
     31         for p in parameters:
---> 32             param_norm = p.grad.data.norm(norm_type)
     33             total_norm += param_norm.item() ** norm_type
     34         total_norm = total_norm ** (1. / norm_type)

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/tensor.py in norm(self, p, dim, keepdim)
    250     def norm(self, p="fro", dim=None, keepdim=False):
    251         r"""See :func: `torch.norm`"""
--> 252         return torch.norm(self, p, dim, keepdim)
    253 
    254     def btrifact(self, info=None, pivot=True):

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/functional.py in norm(input, p, dim, keepdim, out)
    716             return torch._C._VariableFunctions.frobenius_norm(input)
    717         elif p != "nuc":
--> 718             return torch._C._VariableFunctions.norm(input, p)
    719 
    720     if p == "fro":

~/anaconda3/envs/fastai-v1/lib/python3.7/site-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    272         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    273         # Python can still get and update the process status successfully.
--> 274         _error_if_any_worker_fails()
    275         if previous_handler is not None:
    276             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault. 
2 Likes

Yes. I suspected the transforms were part of the problem, and tried to stick with the defaults. The error still persisted. But if I do NOT apply any transforms at all, the error does not pop up. Note that without the transformations the total amount of data moving around is much smaller.

Indeed. The memory of the DGX is hard to fill during a regular DL project.

However, it seems that lighting is involved too, apart from the amount of max_warp.

Since your findings are quite interesting, try tagging Jeremy. He has participated in this thread, so we should not be at risk of being quartered and beheaded.

2 Likes

Today I was working on the whale competition and it failed even with only max_warp=0.4 (after a few epochs), so I had to decrease it to 0.3. And going back to the fastai defaults did not give me errors.
Which means even the table in my previous post is not consistent across all datasets. So perhaps the defaults can also cause this error in some cases (default max_warp = 0.2)?

Did you get the same error as mine?:

RuntimeError: DataLoader worker (pid 5421) is killed by signal: Segmentation fault. 

If I remember well, in the Quick Draw comp even a slight change in max_zoom would trigger an error at some point in some epoch. The larger the argument value, the sooner the error popped up.

I bet @jeremy knows about this issue :slight_smile: . Perhaps we should wait a bit for other more important things in fastai v1 to be settled first.

1 Like

No. Mine was killed by bus signal :face_with_raised_eyebrow:

I was curious to know whether they (the fastai developers) get that kind of error too…