RuntimeError: DataLoader worker is killed by signal

I mentioned the link here because that forum thread has been discussing such memory issues on local and remote servers for several months. Increasing the shared memory on a local or remote server may resolve these errors. I know that the Kaggle kernel case has to be fixed by the Kaggle team, and I didn’t claim this would solve the issue for Kaggle kernels. Hopefully they will do it soon, as they promised.
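
If you cannot raise the shared-memory limit yourself (as on Kaggle), one Python-side workaround that is often suggested is to switch PyTorch’s multiprocessing sharing strategy so that workers pass tensors through temporary files instead of /dev/shm. Just a rough sketch of the generic PyTorch call, nothing Kaggle-specific:

import torch.multiprocessing as mp

# Pass tensors between DataLoader workers via temporary files rather than
# shared memory; somewhat slower, but avoids a small /dev/shm inside containers.
mp.set_sharing_strategy("file_system")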

By the way, here is a reference to the issue I had. I was thinking that it is somehow related to data augmentation, i.e., the landmarks falling outside of the image after various image transformations. However, I then got another error:

UserWarning: 
There seems to be something wrong with your dataset, can't access self.train_ds[i] for all i in 
[65237, 47545, 8078, 53990, ..., 758]

I am going to try to reproduce this issue on some small/dummy dataset to see if it still exists in the library.
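
In the meantime, a quick way to check whether the dataset itself is the problem is to index every item directly, outside of the DataLoader workers, so that any failing index produces a readable traceback. Roughly something like this, with train_ds standing in for the dataset mentioned in the warning:

bad_idxs = []
for i in range(len(train_ds)):        # train_ds: the dataset from the warning above (assumed in scope)
    try:
        _ = train_ds[i]               # force the item to load, including any transforms
    except Exception as e:
        bad_idxs.append((i, e))       # remember which indices fail and why
print(f"{len(bad_idxs)} items could not be loaded")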

You know, I thought about the transformations too. The ideal setup would be a very small dataset where you can visualize exactly all the transforms performed on every image.
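
For instance, something along these lines would make the augmentations visible; just a sketch assuming a torchvision-style pipeline and a hypothetical sample.jpg, not the exact fastai transforms being discussed:

import matplotlib.pyplot as plt
from PIL import Image
from torchvision import transforms

# A stand-in augmentation pipeline; swap in whatever transforms you actually use.
tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
])

img = Image.open("sample.jpg")        # hypothetical image from the small dataset
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax in axes:
    ax.imshow(tfms(img))              # each call draws new random augmentation parameters
    ax.axis("off")
plt.show()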

But that’s about vision. I got plenty of “killed by signal” errors back when I was working on text…

Thanks anyway: your commitment to finding a solution is commendable.

RuntimeError: DataLoader worker (pid 81) is killed by signal: Bus error.
I am getting this error while running a kernel on Kaggle. Any solution, please?

Hi David,

Please see here. I think PyTorch 1.0.1 fixed this problem.

Yijin

I think Kaggle still doesn’t have a high enough shared memory limit for their Docker containers.

Some options:

  1. Reduce your batch size, say to bs=16 instead of the default 64 (see the sketch below).
  2. Reduce the number of workers (num_workers). This will slow down your training.
  3. Train on Colab instead of Kaggle. Colab fixed this issue in fall 2018.

I would favor option #1 or #3.
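
For options #1 and #2, the knobs look roughly like this with a plain PyTorch DataLoader (in fastai the equivalent arguments are bs and num_workers when you build the DataBunch); my_dataset is just a placeholder for whatever dataset you are using:

from torch.utils.data import DataLoader

# Smaller batches and fewer workers both reduce how much shared memory
# the loader needs at any one time.
train_dl = DataLoader(
    my_dataset,        # hypothetical Dataset object
    batch_size=16,     # option 1: down from the default of 64
    num_workers=0,     # option 2: 0 loads data in the main process (slower, but no /dev/shm use)
    shuffle=True,
)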

I ran into a memory problem yesterday (with 24 hours to go in a competition, of course!) and I’m sharing it here in case it’s a useful clue. The error messages were mostly about pin memory, and seemed to be related to a few lines in one_batch(), which is in basic_data.py and is called during the normalize() step of setting up a DataBunch. The lines are:

w = dl.num_workers
dl.num_workers = 0
try:     x,y = next(iter(dl))
finally: dl.num_workers = w          # <== This is where it crashed
if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)...

In addition to resetting num_workers (and spawning/re-spawning worker processes?), it looks like there may be movement onto/off the GPU at that point in the code as well.

Commenting out the lines that save and restore num_workers fixed the memory problem, but, not surprisingly, I got a new error about not being able to re-initialize CUDA in a forked process.

UPDATE: I completely solved the problem by fixing the code in a custom callback that I was writing. I think CUDA is just very restrictive about what it will accept; in my case the memory issues were caused by a combination of not being careful enough about what was on or off the GPU, not being careful enough about tensor copy semantics, and using PyTorch functions that worked on the CPU but not on the GPU.
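
To give a flavour of the pitfalls I mean (a generic sketch, not my actual callback code): tensors created on the CPU have to be moved explicitly, and plain assignment shares storage rather than copying it.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

stats = torch.zeros(10)                  # lives on the CPU
preds = torch.randn(10, device=device)   # lives on the GPU if one is available

# Mixing devices raises a RuntimeError; move one side explicitly first.
stats = stats.to(device)
stats += preds

# Plain assignment does NOT copy: `alias` shares storage with `stats`,
# so in-place changes show up in both. Use detach().clone() for a real copy.
alias = stats
snapshot = stats.detach().clone()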

John

I was running into this same issue on my Mac. It ran fine with num_workers=0, but not with more than zero unless I set the image size to 80 pixels or lower.

I made a few changes and now things are working:

Unfortunately, I did all of these things at once, so it’s unclear which of them actually solved the issue for me. If I have some time later, I may try to pinpoint the cause, but I’m in a bit of a rush right now.

I’m now able to train with num_workers=16 and 500 pixel images without issue. I hope this helps someone.

I was also having this issue, with errors about workers being unexpectedly killed and segmentation faults. I only increased my shared memory settings as described in the link (sharing again), and it solved the issue for me.
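
If you want to check how much shared memory your container actually has before changing anything, something like this works from inside a notebook (assuming Linux with /dev/shm mounted):

import shutil

# How much shared memory do the DataLoader workers actually have to work with?
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")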