RuntimeError: DataLoader worker is killed by signal

Dunno if this applies, but I’ve been running for the last two days or so with num_workers = 2 and it’s been working. With the defaults I was getting the errors mentioned above… Setting it to 2 recovered most of the speed compared to 0.

Interesting. I’ll try it tomorrow and report back!

Did you try pytorch v0.4? Does it have the same issue?

Now that the competition has ended, can you share a notebook that I can run to reproduce the issue on the same GCP instance where I reported having no memory issues?

I will decrease the specs of the VM to match yours…

Although I haven’t had issues like that so far, I am worried that the bug is fundamental at the level of PyTorch itself, or worse, in Python itself… No matter how much RAM you have, if this is a memory leak then a real-life big dataset will bring any system to its knees, even a DGX-1.

So this should worry everybody.

Yeah, sure! I’ll share everything as soon as I have time to do some small cleanup and upload it to the repository. The code is actually very similar to the standard notebooks shared here and there.

No, I didn’t try v0.4, but the folks on Kaggle mentioned that they used v0.4 and that memory leaks happen no matter which multiprocessing program you’re running, due to issues with Python’s implementation. However, I’ve trained some models with Keras and can’t remember having this kind of trouble, though the datasets were not that big. So I am also going to try the Pyro approach I linked above.

Another interesting point, for anyone who uses Pandas in their Dataset classes:

Relevant to @marcmuc’s note about copy-on-write behavior.

I have been reading up on this a little more after the insights from the 1st place winners. Here is the best material I could find, and I think we have the explanation now. We have compounding problems in Python itself (copy-on-access), PyTorch (the way multiprocessing is used) and fastai (objects that are too large and not wrapped “correctly”).

The core is this: there is no way to store arbitrary Python objects (such as pandas dataframes, dicts of Path objects, or even simple lists) in shared memory in Python without triggering copy-on-write behaviour, because the refcounts are updated every time something reads from these objects. The refcount updates touch the objects memory-page by memory-page, which is why consumption grows slowly, whereas spawning the processes (and/or copying the entire objects at the start) would make it jump up immediately. Either way, the processes end up having all/most of the memory copied over bit by bit, which is why we get the memory overflow problem. The best description of this behaviour is here.
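To see the effect in isolation, here is a minimal sketch (my own illustration, not code from the thread): it forks one child that only *reads* a large list of plain Python strings, and the child’s unique memory (USS) grows anyway because every read updates a refcount. It assumes Linux (fork start method) and the third-party psutil package.

```python
import multiprocessing as mp
import os

import psutil  # third-party; used only to read per-process memory stats


def uss_mb():
    # USS = pages unique to this process, i.e. pages that had to be privately copied
    return psutil.Process(os.getpid()).memory_full_info().uss / 1024 ** 2


def reader(shared_list):
    print(f"child USS before reading: {uss_mb():.0f} MB")
    total = sum(len(s) for s in shared_list)   # read-only access...
    # ...yet fetching each string updated its refcount, dirtying the page it lives on
    print(f"child USS after reading:  {uss_mb():.0f} MB ({total} chars read)")


if __name__ == "__main__":
    big = [f"sample_{i:07d}.png" for i in range(2_000_000)]   # plain Python objects
    child = mp.get_context("fork").Process(target=reader, args=(big,))
    child.start()
    child.join()
```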

-> Hacky Workaround Solution 1: Check the memory consumption of your main process -> divide total free memory by this and set the number of workers to the resulting number (absolute maximum). This is likely why @larcat’s num_workers=2 solution works for him, I assume. It is also why you never see these problems with huge amounts of RAM (as @hwasiti showed on GCP): as long as num_workers x total_mem_of_main_process < total_mem_available, everything is fine! A rough sketch of this calculation follows.
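A back-of-the-envelope version of Workaround 1 (my own sketch, assuming psutil is available; the real ceiling also depends on what else is running on the box):

```python
import os

import psutil

main_rss = psutil.Process(os.getpid()).memory_info().rss   # what the main process holds now
available = psutil.virtual_memory().available               # free RAM at this moment

# Worst case: every forked worker eventually copies (almost) the whole main process.
absolute_max = int(available // main_rss)
num_workers = max(0, min(absolute_max, os.cpu_count() or 1))
print(f"upper bound on num_workers under worst-case copying: {num_workers}")
```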

-> Hacky Workaround Solution 2: Make sure the main process occupies as little memory as possible, by a) not storing lists of Path objects (i.e. using the .from_csv methods and not .from_folders, as suggested by Jeremy), b) removing any unnecessary intermediate objects/lists from your main process (i.e. using del), and c) running gc.collect() before starting the fit process, so before the workers get forked (not sure if b) and c) really help much, but they can’t hurt).
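In code, Workaround 2 amounts to something like this sketch (the names are illustrative; pandas stands in for whatever large object your notebook builds, and the fastai fit call is commented out because it is not defined here):

```python
import gc

import pandas as pd

# Build whatever large intermediate you need...
df = pd.DataFrame({"fname": [f"img_{i}.jpg" for i in range(1_000_000)],
                   "label": ["cat", "dog"] * 500_000})

# ...but keep only the small structures the Dataset actually uses,
labels = dict(zip(df["fname"], df["label"]))

# then drop the big object and collect BEFORE the DataLoader workers are forked.
del df
gc.collect()

# learn.fit_one_cycle(1)   # training (and the fork of the workers) would start here
```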

Real Solutions (not tested yet)

-> A) Keep using multiprocessing as now: for Python multiprocessing to work without these refcount effects, the objects have to be made “compatible with” and wrapped in multiprocessing.Array before the process pool is created and the workers are forked. This supposedly ensures that the memory is really shared and no copy-on-write happens. This explains how to do it for numpy arrays, and this explains the reasoning behind it again. Don’t get confused by some false statements, even by the authors of these good answers, claiming that copy-on-write makes all of this unnecessary, which is not true. One comment also points to this:

“Just to note, on Python fork() actually means copy on access (because just accessing the object will change its ref-count).”
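For purely numeric data, the pattern from those answers looks roughly like the sketch below (assuming a Unix fork start method; this is not how fastai currently stores things): the buffer lives in shared memory and workers read raw bytes, so no per-object refcounts get written.

```python
import ctypes
import multiprocessing as mp

import numpy as np

N = 10_000_000
shared_base = mp.RawArray(ctypes.c_float, N)            # allocated in shared memory, no lock
shared_arr = np.frombuffer(shared_base, dtype=np.float32)
shared_arr[:] = np.random.rand(N)                        # fill once, in the parent


def worker(i):
    # Reading raw floats does not update any per-object refcount,
    # so no pages get copied into the worker.
    return float(shared_arr[i::1000].sum())


if __name__ == "__main__":
    with mp.get_context("fork").Pool(4) as pool:
        print(pool.map(worker, range(4)))
```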

-> B) Use external tools/managers to store the shared-access objects, instead of keeping them in the main process and the forked processes. Solutions could be the Pyro library, as mentioned by the winners, but something like Redis might also be interesting. @vitaliy experimented with this in the context of this competition almost 2 months ago, unfortunately without any replies; we should take a closer look at that, I think!
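A sketch of what Solution B could look like with Redis (assumes a local Redis server and the `redis` Python package; I have not benchmarked this, and Pyro would follow the same idea with a remote object instead):

```python
import redis


class RedisLabelStore:
    """Keep filename -> label pairs in an external process, not in the forked workers."""

    def __init__(self, host="localhost", port=6379):
        self.conn = redis.Redis(host=host, port=port)

    def put_many(self, mapping):
        self.conn.mset(mapping)            # mapping: dict of fname -> label

    def get(self, fname):
        value = self.conn.get(fname)
        return value.decode() if value is not None else None


# store = RedisLabelStore()
# store.put_many({"img_0.jpg": "cat", "img_1.jpg": "dog"})
# ...and inside Dataset.__getitem__:  label = store.get(fname)
```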

Disclaimer: I am not an expert in any of this, I just followed a lot of Stack Overflow link trails :wink:
If you think it is useful, maybe I should split this long post out into a separate topic after working in some of your comments/corrections.

A great summary! I am going to use an on-the-fly Dataset implementation without Pandas to see if it helps solve the problem. Will post an update as soon as it’s ready.
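Something along these lines, perhaps (a sketch, not the final implementation): filenames go into a fixed-width numpy array, labels into an int array, and images are only decoded inside __getitem__, so workers hold almost no refcounted Python objects.

```python
import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class OnTheFlyImageDataset(Dataset):
    def __init__(self, filenames, labels, transform=None):
        # 'S' dtype = fixed-width bytes buffer, no per-element Python string objects
        self.filenames = np.array(filenames, dtype="S")
        self.labels = np.array(labels, dtype=np.int64)
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = self.filenames[idx].decode()
        img = Image.open(path).convert("RGB")      # decoded on demand, never cached
        if self.transform is not None:
            img = self.transform(img)
        return img, int(self.labels[idx])
```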

Also try to make your processing faster, so you need fewer workers. If you’re using JPEGs, you should do the following to install an accelerated JPEG library, since that’s a major overhead otherwise:

conda uninstall --force jpeg libtiff -y
conda install -c conda-forge libjpeg-turbo
CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall pillow-simd
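A loose way to double-check that the SIMD/turbo build is the one actually being imported (heuristics, not an official guarantee: pillow-simd tags its version with a .postN suffix, and only newer Pillow releases expose a libjpeg_turbo feature flag):

```python
import PIL

print(PIL.__version__)   # pillow-simd builds typically look like "5.3.0.post0"

try:
    from PIL import features
    print("libjpeg-turbo:", features.check_feature("libjpeg_turbo"))
except Exception:
    print("libjpeg_turbo feature flag not exposed by this Pillow version")
```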

@devforfu
This is slightly off topic, but there’s an example you posted in this thread where you display map3 during training – could you provide a code snippet showing how you did this? I’m pretty new to the PyData stack, and the numpy array manipulation syntax is still pretty unintuitive to me.

Thanks!

PyTorch 1.0 stable is released. Please note the release notes, and particularly the bug fixes (serious bugs). There are a few related to memory leaks and data loaders. It would be interesting to know whether this fixes your issues.

Following a few links from the release notes’ fixes section leads to the (high-priority) bug below, which is still open. I have posted links to this thread in the comments there, along with a short summary of my post above. (I have recategorized this thread so that it is linkable from the outside, because this issue is not really course-v3 related.)

I ran extensive tests on the DGX. The GC doesn’t make any real difference.

The problem tends to happen more frequently as you use:

  • More than one GPU
  • Bigger models
  • Bigger datasets

When it happens, there is still plenty of free RAM and VRAM.

Quite surprisingly, I also get the issue more frequently on the DGX than on my home machine, with the same datasets/models.

It has become a real hindrance for my workflow, as of late.

Those 3 points you mentioned are exactly what is enormously large in FB’s production systems. I am puzzled: how does FB use PyTorch in production, then?

Indeed. Furthermore, I have updated to 1.0 stable.

I’m quite convinced we are doing something the wrong way. PyTorch cannot be this buggy.

I have just posted a comment on the pytorch bug ticket above. I think I have figured out at least one example of what goes wrong. @balnazzar, maybe you can have a look at how your data is stored/handled in the datasets/dataloaders and experiment with that; it could make a huge difference when forking out the workers. I have made a notebook gist to demonstrate this. Especially if you are dealing with text/tokens (as I seem to remember you wrote about), this could be a key issue. See here for an example with 8 workers and 10 million strings. Just changing the datatype makes the difference in the memory explosion (a factor of 4):

Memory consumption in GB with the fixed-length string array: [memory usage chart]

Memory consumption in GB with the object array (the only change!): [memory usage chart]

Basically, it has got nothing to do with pytorch tensors etc.; it is a problem with the other data stored in the dataloaders and workers, which usually consists of lists of paths/filenames and dicts of labels (all of which store strings/objects). A rough reconstruction of the experiment is sketched below.
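The sketch (my own reconstruction, not the actual gist) is Linux-only because it relies on fork and psutil’s USS; the only change between the two runs is the numpy dtype holding the 10 million strings. Scale n down on machines with little RAM, since the object-dtype run intentionally reproduces the memory blow-up.

```python
import multiprocessing as mp
import os

import numpy as np
import psutil


def uss_mb():
    # USS = memory unique to this process, i.e. pages that were privately copied
    return psutil.Process(os.getpid()).memory_full_info().uss / 1024 ** 2


def worker(arr):
    checksum = sum(len(x) for x in arr)     # read-only access, like a DataLoader worker
    print(f"pid {os.getpid()}: USS {uss_mb():.0f} MB (checksum {checksum})")


if __name__ == "__main__":
    n = 10_000_000
    strings = [f"train/images/img_{i:08d}.png" for i in range(n)]

    fixed = np.array(strings, dtype="S32")     # one flat byte buffer, no per-item objects
    objs = np.array(strings, dtype=object)     # 10M refcounted Python strings

    for name, arr in [("fixed-length", fixed), ("object", objs)]:
        print(f"--- {name} array ---")
        procs = [mp.get_context("fork").Process(target=worker, args=(arr,))
                 for _ in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```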

@marcmuc

Thanks for your feedback and suggestions.
It’s not just about text data. Yesterday I went crazy working on images. Same issue.

I’ll try to experiment with your nb (thanks), but note what I reported above: there is still plenty of free memory when I experience the error, particularly when I work with non-text data.

Okay, sorry. But then the reason you get the “killed by signal” message is probably different from why @devforfu or I get it, because that was definitely related to running out of memory. Have you used his memory usage callback yet to track consumption while running the model? Are you running multiple models or other processes that could have short spikes/bursts in memory consumption? That would be enough, even if your training process itself is not the culprit.
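If that callback isn’t handy, even a crude framework-agnostic logger (sketch below, assuming psutil) that samples memory in a background thread during training would show such spikes:

```python
import os
import threading
import time

import psutil


def log_memory(interval=5.0, stop_event=None):
    proc = psutil.Process(os.getpid())
    while stop_event is None or not stop_event.is_set():
        rss = proc.memory_info().rss / 1024 ** 2
        avail = psutil.virtual_memory().available / 1024 ** 2
        print(f"[mem] main RSS {rss:.0f} MB | system available {avail:.0f} MB")
        time.sleep(interval)


stop = threading.Event()
threading.Thread(target=log_memory, args=(5.0, stop), daemon=True).start()
# ...run training here; spikes from other processes will show up in "available"...
# stop.set()
```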

I just ran into that error. I’m working with images right now. I just increased the image size from 299 to 352, and the DataLoader was killed as soon as I ran fit_one_cycle.

I restarted the kernel and tried setting 352 from the beginning. Nothing: it gets killed as soon as I begin the training process.

I cannot make use of your notebook right now (I’m in the middle of my work, but I’ll log in to the DGX during the night and run a test with your nb…), but I can say that over half the RAM is unused.

Also, I’m using a single gpu.

Yes, I am also getting these errors while working with image datasets. The main problem is that Path objects definitely bring some overhead, but they are not at the core of this issue. I haven’t tried it yet, but there is the torchvision.datasets.ImageFolder class, which doesn’t pull in any sophisticated dependencies:

No numpy, pandas, or pathlib; as simple as it could be. So if this class leaks, then we probably have only two possibilities (a minimal usage sketch follows the list below):

  1. bug in PyTorch
  2. problems with built-in multiprocessing as mentioned in Kaggle’s discussion
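A minimal way to exercise it in isolation (a sketch assuming the usual root/&lt;class&gt;/&lt;image&gt;.jpg layout; the path, image size, and worker count are placeholders):

```python
import torch
from torchvision import datasets, transforms

tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("train/", transform=tfms)   # plain lists of (path, class_index) pairs
dl = torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True, num_workers=4)

for xb, yb in dl:   # watch memory consumption while iterating
    pass
```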

Hi Ilia, thanks for your feedback.

Quite surprisingly, the dataloader worker gets killed by bus signal even if I set num_cpus=0 (afaik, this superseded num_workers) in the ImageDataBunch.

What makes the difference is the size of the images. Indeed, everything works fine till I set a size above 306x306. I’m still trying to figure out why that happens.