RuntimeError: DataLoader worker is killed by signal

(Haider Alwasiti) #81

I mentioned the link here because this forum thread has been discussing such memory issues on local and remote servers for several months. Increasing the shared memory on a local or remote server may resolve these errors. I know that for Kaggle kernels the fix has to come from the Kaggle team, and I didn’t say it will solve this issue there. Hopefully they will do it soon, as they promised.
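If you want to check whether shared memory is the bottleneck before resizing it, a minimal sketch (assuming a Linux machine where worker processes stage batches in `/dev/shm`) is:

```python
import os
import shutil

# DataLoader workers pass batches between processes through shared memory,
# so a "killed by signal: Bus error" often means /dev/shm is too small.
# Report how much shared memory is available (path assumed: /dev/shm).
shm_path = "/dev/shm"
if os.path.exists(shm_path):
    total, used, free = shutil.disk_usage(shm_path)
    print(f"{shm_path}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
else:
    print("No /dev/shm on this system")
```

If the free space is well below the size of a few decoded batches, increasing the shared memory (e.g. via the mount size or, in Docker, the `--shm-size` flag) is the likely fix.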

(Ilia) #82

By the way, here is a reference to the issue I had. I was thinking that it was somehow related to data augmentation, i.e., the landmarks falling outside of the image after various image transformations. However, I then got another error:

There seems to be something wrong with your dataset, can't access self.train_ds[i] for all i in 
[65237, 47545, 8078, 53990, ..., 758]

I am going to try to reproduce this issue on some small/dummy dataset to see if it still exists in the library.
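One way to narrow this down on a small dataset is to probe every index directly and record which ones raise. A minimal sketch, using a hypothetical dummy dataset in place of `self.train_ds` (the `DummyDataset` class and its failure mode are assumptions for illustration):

```python
# Any object with __len__ and __getitem__ works here, including a
# PyTorch/fastai Dataset. This dummy one simulates a single corrupt sample.
class DummyDataset:
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if i == 3:  # simulate a corrupt/unreadable sample
            raise IndexError(f"bad sample {i}")
        return i * 2

def find_bad_indices(ds):
    """Return the indices that raise when accessed, mirroring the
    "can't access self.train_ds[i] for all i in [...]" check."""
    bad = []
    for i in range(len(ds)):
        try:
            ds[i]
        except Exception:
            bad.append(i)
    return bad

print(find_bad_indices(DummyDataset(5)))  # → [3]
```

Running this over the real training set (before any workers are involved) separates genuinely broken samples from shared-memory problems in the loader itself.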

(Andrea de Luca) #83

You know, I thought about the transformations too. The ideal setup would be a very small dataset where you can visualize exactly which transformations are performed on every image.

But that’s about vision. I got plenty of “killed by signal” errors back when I was working on text…

Thanks, though: your commitment to finding a solution is commendable.

(David Mwambali) #84

RuntimeError: DataLoader worker (pid 81) is killed by signal: Bus error.
I am getting this error while running a kernel on Kaggle. Any solutions, please?

(Yijin) #85

Hi David,

Please see here. I think PyTorch 1.0.1 fixed this problem.


(Jeff Hale) #86

I think Kaggle still doesn’t have a high enough shared memory limit for their Docker containers.

Some options:

  1. Reduce your batch size, e.g. to bs=16 instead of the default 64.
  2. Reduce the number of workers. This will slow down your training, though.
  3. Train on Colab instead of Kaggle. Colab fixed this issue in fall 2018.

I would favor option #1 or #3.
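For a rough intuition of why option #1 helps: each worker stages a full decoded batch in shared memory, so the footprint scales with the batch size (and with the number of workers). A back-of-the-envelope sketch, assuming 3×224×224 float32 images (the shapes are assumptions; substitute your own):

```python
def shm_bytes_per_batch(batch_size, channels=3, height=224, width=224, dtype_bytes=4):
    """Approximate shared-memory footprint of one staged batch of
    decoded image tensors (batch_size x C x H x W at dtype_bytes each)."""
    return batch_size * channels * height * width * dtype_bytes

for bs in (64, 16):
    mb = shm_bytes_per_batch(bs) / 2**20
    print(f"bs={bs}: ~{mb:.1f} MiB per staged batch")
```

Dropping from bs=64 to bs=16 cuts the per-batch footprint by 4×, which is often enough to stay under a container’s small shared-memory limit.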