RuntimeError: DataLoader worker is killed by signal

(Andrea de Luca) #7

I work with very large datasets (indeed, my boss lets me use the dgx), and I get that error continuously. It seems that pytorch dataloader is not capable of working with real-world datasets AND multiprocessing. Going from 40 threads to 1 thread on the dgx, the difference in speed is immense. I really hope they could address that problem quickly, but judging from what I read on the forum, they haven’t any clue yet.

1 Like

(Andrea de Luca) #8

I think the vast majority of people just works with small datasets: in the end, they allow you to learn without spending too much time in processing the data and training the net on them.

1 Like

(Ilia) #9

Right, I am doing the same and mostly work on smaller datasets that fit into memory even if the leak exists. The problem was revealed only when I’ve done with experimenting and decided to train the model using more data to see how it will affect the accuracy.

0 Likes

(Ilia) #10

Ok, I’ve tried to create a simple notebook for tests, to reproduce the leakage. There is a script also in this repo but it is not synchronized with the notebook yet.

Of course, any other experiments are highly appreciated to help to track down the issue.

0 Likes

(Ilia) #11

Yeah, that’s a real trouble. I see a serious drop in performance even on my machine when switching to single-threaded execution. If this problem really comes from PyTorch’s DataLoader implementation, can’t figure out how it wasn’t spotted earlier? Don’t want to switch back to Tensorflow :sweat_smile:

1 Like

(Haider Alwasiti) #12

I found your notebook when I tried to find an example for callbacks to do Telegram notifications.
Thanks it helped me a lot…

I tried your notebook on my GCP instance, and seems I had no memory issues:

I used GCP:
skylake Xeon 8 vCPU, 52GB, hdd, V100

0 Likes

(Ilia) #13

That’s great that the callback helped!

Hm, that’s interesting, I don’t have memory errors with smaller datasets. Could you please tell me, what dataset you use? The problem is that the kernel kills my training script when I run it on huge amounts of data, i.e., a lot of training batches.

0 Likes

#14

Ilia, which platform are you using? It seems that I have the exact same error on Paperspace P5000 with 32GB CPU, and there is no such problem on GCP (exact same configuration as @hwasiti mentioned). However, I have some other issues with GCP – it works extremely slow on big datasets. htop shows that though I explicitly mention num_workers=8, there is only one process running.

1 Like

(Ilia) #15

Hi!

Thank you for your feedback! Yeah, that sounds interesting. So it seems that the GCP doesn’t use 8 workers as you ask, right? Probably that’s the reason why there is no error on that platform. Because everything works fine (though tremendously slow) with num_workers=0.

I am using a dedicated desktop with the following configuration:

i7-6800K
Corsair Hydro H45 32 GB (8x4)
GTX 1080Ti
Ubuntu 18.04
CUDA 9.2 (410 drivers)

However, I believe that I’ll see the same problem on other machines as well. I have also tried V100 on Salamander but somehow it was slower than my local machine so I’ve stopped training and didn’t have a chance to check if the problem exists here as well.

I am going to write a custom training loop to see if the problem exists when using plain PyTorch.

0 Likes

(Marc P. Rostock) #16

So, unfortunately now I also find myself in the group of people having this issue…

But, googling this I quickly found that there are potentially many different things going on, and the second, highlighted part of the error is very critical. So could everyone here check that we actually have the same error, otherwise we might be researching different things. Possible endings of the error message after the colon:

  • Killed
  • Terminated
  • Bus Error
  • Illegal Instruction

All of these may stem from different problems.

@devforfu, Thank you for the helpful callbacks, I will try to use that now for further investigation.
By the way I am having the problem on the same huge dataset you are using in your first snippet above…
Ubuntu 18.04, only 16GB RAM.

1 Like

(Ilia) #17

@marcmuc Not a problem at all, thank you for your response! It is really important to gather as much information as we can. At least, it is a good thing that we can reproduce the same error using the same dataset but on different machines.

In my case, I for sure can say that I had two of them (using almost identical training code):

  • Killed
  • Bus Error

I am not sure if I’ve ever seen Terminated, and as I can recall, I’ve never had Illegal Instruction.

Do you think that all these errors categories represent something different? Could it be the same reason but failed in different stages? Like, we have a memory leakage in some specific place of the code but as soon as these kinds of problems are highly non-deterministic (taking into account multithreading and amount of available memory on different machines), we’re getting different “flavors” of the basically same exception?

I mean, in one case, the OS just kills the process in advance when sees lack of memory, in another case the process tries to allocate already occupied memory as soon as there is no free space, and is therefore killed, etc.

However, you’re definitely right, and it is a good idea to carefully record every case.

0 Likes

(Haider Alwasiti) #18

I ran your notebook without modifications. It used the oxford pet dataset.
If you want me to check it on any other dataset, please make changes on your notebook, and I will run it and report you back.

0 Likes

(Haider Alwasiti) #19

Do you mean the GCP instance is running slow for tasks involving I/O to the drive?
Usually the preprocessing and 1st training epoch needs the drive. 2nd epoch and further will use the cache of the OS and drive will be idle.

You can use iotop to check the drive activity.

What I discovered is that the GCP drive is very slow. That’s because GCP is dedicating a limited speed for the drive depending on the the size of the hdd. 100 GB SSD will give only 50MB/s write or read. Once you increase the size in the edit of the drive you use in the VM instance specs the drive speed increase (500GB SSD -> 250 MB/s). But once you increase the size you will not be able to decrease its size. The price of 100GB SSD is $17/month which is not cheap.

1 Like

(Ilia) #20

Sorry, I’ve misunderstood your response.

Yes, you’re right, with Pets dataset everything is fine. The main idea of that notebook is to provide some basic playground to test the model’s training code on various datasets and with various training parameters. (However, even in case of that dataset, I guess we see a small increase in consumed memory during single epoch).

Originally, I’ve encountered the problem when worked on Quick Draw Doodle dataset. The dataset requires some preprocessing before training and takes about 7GB of memory in CSV format, so I didn’t include it. The plot you see at the top of this post is generated during training on that dataset.

I guess the problem is revealed when one tries to train the model on large datasets that don’t fit into RAM require > 10-15K iterations per epoch.

0 Likes

#21

@hwasiti I mean for all the notebooks I ever run on my GCPs, htop shows one thread working. I have a 250GB SSD

0 Likes

(Marc P. Rostock) #22

EDIT: At least half of my post that this replaces was probably bullshit, because I didn’t know htop by default displays threads, not processes. Will research further and post an update. So what I thought I was seeing here didn’t actually exist. Too bad. Sorry, if I confused anyone…

0 Likes

#23

I also observe this problem working with Quick Draw dataset - on GCP, with both P100 and V100 (and not just me. Not sure if it’s related, but there also seem to be a GPU memory leak as well: when trainining with large images, I get CUDA out of memory error after several large epochs.

0 Likes

(Haider Alwasiti) #24

Hey @adilism
I tried to run your notebook to replicate the error.
In cell 11 it gave me this error:

------------------------------------------------------
NameError            Traceback (most recent call last)
<ipython-input-14-18718420d75b> in <module>
----> 1 class DoodleDataset(DatasetBase):
      2     def __init__(self, df, size=256, is_test=False):
      3         self.drawings = df['drawing'].values
      4         self.size = size
      5         self.is_test = is_test

NameError: name 'DatasetBase' is not defined

I am interested to replicate this error in GCP and in my local pc. Please provide notebooks as more as possible that can be run without too much fiddling so we encourage others to reproduce this error and find whether any differences in platforms. I think if this is confirmed as a bug shown consistently on large datasets, it worths more attention. Real world datasets are always big.

0 Likes

(Haider Alwasiti) #25

That’s strange.
When I run radek’s notebook for the quickdraw comp., I have seen that all 8 processors are working on my GCP instance.

Here is before unfreeze()

Here is after unfreeze() training resnet34:

And when I tried running 2 V100 running two versions of this notebook, I can see that almost all processors are maxed out (80-100%)

Processor utilization are definitely not the same (one processor util. is more than the other), but they are all significantly working more than the idle state before running the nb.

0 Likes

(Marc P. Rostock) #26

I can confirm this. Due to the problems mentioned in this thread I set up a gcloud instance hoping to be faster and have less problems. Now even though I specify 8 workers it only uses one of 8 cores while training and the v100 setup is slower than my home computer. I makes me wanna cry… :wink:

I have set up the machine according to the official setup docs by fastai using the DL image.

Now there seem to be people here (@hwasiti?) for whom it works alright on gcloud. Could it be that this is due to differences in pytorch versions? Could you guys check your pytorch version in gcloud to see if this might make a difference? Mine is pytorch-nightly 1.0.0.dev20181024 py3.6_cuda9.2.148_cudnn7.1.4_0 according to conda list. When I try to install pytorch-nightly to a new env it would give me a release version from today apparently. So maybe something was changed/fixed in the meantime?

1 Like