RuntimeError: DataLoader worker is killed by signal

So I can confirm: upgrading pytorch to the 2018-11-30 build, without any other changes, now makes it obey the num_workers parameter. So anyone having problems on gcloud with pytorch running only a single process: upgrade pytorch, that should fix it! With the previously installed version (from the DL image), build 2018-10-24, it always used only one process; num_workers had no effect whatsoever.

1 Like

And does it help with num_workers > 0 as well?

This is worthwhile to cross post on the GCP thread for other folks.

1 Like

No, unfortunately it doesn’t. But I have continued monitoring this, and maybe what I stated before (copying the huge lists into the workers) is the actual problem. As you also mentioned, forking a process should not actually copy the entire memory; due to copy-on-write it only makes “virtual copies” until the main process or the workers modify something, and only then makes “physical” copies. So maybe that is what is actually happening here, and not a real memory leak. I also don’t understand what would cause “writing” here, but maybe that is what happens (maybe storing/“marking” the last “item delivered” for the iterator is somehow enough?). Anyway, if you look at the two screenshots below, you can see that the memory consumption has gone up (steadily over time) by 3.1 GB. But the processes still show about the same amount of virtual and reserved memory. So the processes don’t actually use more memory (from “their point of view”); rather, more memory is actually allocated for real instead of being shared (copy-on-write). This would happen memory page by memory page, I assume, which would explain why the consumption goes up steadily over time?! (Just wild guesses…)
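
One plausible mechanism (just a guess on my side, not something verified here) is CPython’s reference counting: even a “read-only” access to a Python object updates its reference count, which is a write into the object’s header and therefore dirties the copy-on-write page it lives on. A tiny illustration:

import sys

x = object()
print(sys.getrefcount(x))  # e.g. 2 (x itself plus the temporary argument)
y = x                      # a "read-only" access still writes to x's refcount field
print(sys.getrefcount(x))  # e.g. 3 -> the memory page holding x was modified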



So if anyone knows why the supposed “read only” operations actually make changes, that would be interesting to understand!

Ok, understood, thank you for helping to figure out what is going on!

I’ve decided to write a small custom training loop to see if it shows the same memory error. I ran it against MNIST and got > 96% accuracy, so I guess it is probably more or less correctly implemented :smile: So I am going to run it with my custom Quick Draw doodle dataset class to see if it also fails.

Update: So far the results are not too promising. I don’t use any fastai dependencies to train the model, only pytorch and torchvision, but I’m still getting a gradual increase in consumed RAM during the very first training epoch. I’ll share the memory consumption plot soon.

Also, I’ll share the code I am using as soon as I make it more readable than it is now, so others can check my implementation and repeat it on other machines/versions of PyTorch.

1 Like

Other folks are getting the same issues in the Google doodle competition:

3 Likes

Thank you for the link! Have posted a message on that thread.

I didn’t say that…

1 Like

Yes that’s a fair comment. For such giant datasets, you should store the filenames and labels in a CSV file, and use the from_csv methods. That will just leave things as strings.
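
A minimal sketch of that approach, assuming the fastai v1 from_csv API of that time (the paths and parameters below are placeholders, not taken from this thread):

from fastai.vision import ImageDataBunch, get_transforms

# labels.csv holds just two string columns: filename and label
data = ImageDataBunch.from_csv(
    'data/doodle',             # dataset root (placeholder path)
    folder='train',            # subfolder containing the image files
    csv_labels='labels.csv',
    ds_tfms=get_transforms(),
    size=128,
)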

5 Likes

Thanks, that is very helpful! So that means that depending on the method used for loading the data into the ItemList, it is stored in different formats/data types? That’s not something I would have thought of…

3 Likes

Ok, I’ve written a custom training loop that looks like this:

import torch


def train(model, opt, phases, callbacks, epochs, device, loss_fn):
    model.to(device)

    cb = callbacks

    cb.training_started(phases=phases, optimizer=opt)

    for epoch in range(1, epochs + 1):
        cb.epoch_started(epoch=epoch)

        for phase in phases:
            n = len(phase.loader)
            cb.phase_started(phase=phase, total_batches=n)
            is_training = phase.grad
            model.train(is_training)

            for batch in phase.loader:

                phase.batch_index += 1
                cb.batch_started(phase=phase, total_batches=n)
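                # place_and_unwrap is a helper defined elsewhere (not shown here);
                # presumably it moves the batch to `device` and splits it into (x, y)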
                x, y = place_and_unwrap(batch, device)

                with torch.set_grad_enabled(is_training):
                    cb.before_forward_pass()
                    out = model(x)
                    cb.after_forward_pass()
                    loss = loss_fn(out, y)

                if is_training:
                    opt.zero_grad()
                    cb.before_backward_pass()
                    loss.backward()
                    cb.after_backward_pass()
                    opt.step()

                phase.batch_loss = loss.item()
                cb.batch_ended(phase=phase, output=out, target=y)

            cb.phase_ended(phase=phase)

        cb.epoch_ended(phases=phases, epoch=epoch)

    cb.training_ended(phases=phases)

A couple of points about the implementation:

  • Quick Draw Dataset
  • No fastai dependencies
  • No Path objects stored in memory (reading data directly from a pd.DataFrame and rendering images on the fly; see the sketch after this list)
  • Direct usage of the torch.utils.data.DataLoader class; the transformations are taken from torchvision
  • num_workers=12
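
For reference, here is a rough sketch of the kind of on-the-fly dataset described above (the actual code isn’t posted yet, so the class and column names are assumptions):

import ast

import numpy as np
import pandas as pd
import torch
from PIL import Image, ImageDraw
from torch.utils.data import Dataset


class QuickDrawDataset(Dataset):
    """Renders doodles on the fly from a DataFrame of stroke data."""

    def __init__(self, df: pd.DataFrame, size: int = 128, transform=None):
        self.df = df              # assumed columns: 'drawing' (stroke lists), 'label' (int)
        self.size = size
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        strokes = ast.literal_eval(row['drawing'])      # e.g. "[[xs, ys], ...]" as stored in the CSV
        img = Image.new('L', (256, 256), color=0)
        draw = ImageDraw.Draw(img)
        for xs, ys in strokes:
            draw.line(list(zip(xs, ys)), fill=255, width=3)
        img = img.resize((self.size, self.size))
        if self.transform is not None:                  # e.g. torchvision.transforms.ToTensor()
            x = self.transform(img)
        else:
            x = torch.from_numpy(np.array(img, dtype=np.float32) / 255.0).unsqueeze(0)
        return x, int(row['label'])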

Here are the memory usage plots:

The process is killed in the middle of the training epoch. So we can suppose that the problem is somewhere inside the torch package. (Unless my training loop contains exactly the same bug as the one in the fastai library, which would be a very unusual coincidence :smile:)

I am going to roll back to 0.4.1 and see if the problem was introduced in the recent master or exists in the stable version as well.

I can’t claim it for sure, but it seems that, at least, pytorch-nightly has a problem with data loaders leaking memory when num_workers > 0.


I am going to share the implementation of the training loop I have. It is a bit involved because it tries to mirror at least a couple of the features that fastai includes (mostly callbacks and the cyclic schedule). However, it shows the memory issue, and it could probably be a helpful demonstration/starting point if we decide to post to PyTorch’s forums.
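
In the meantime, a stripped-down reproduction along these lines (the names and sizes here are assumptions, not my actual code) might be the easiest thing to post there:

import torch
from torch.utils.data import DataLoader, Dataset


class BigListDataset(Dataset):
    """Holds a large list of Python objects, like a big list of filenames/labels."""

    def __init__(self, n=10_000_000):
        self.items = [str(i) for i in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        _ = self.items[idx]                  # touch the Python object
        return torch.zeros(3, 32, 32), 0     # dummy sample


loader = DataLoader(BigListDataset(), batch_size=256, num_workers=4, shuffle=True)

for epoch in range(3):
    for x, y in loader:
        pass                                 # watch RSS climb across iterations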

4 Likes

For what it’s worth I believe I’m running into the same issue.

Personal machine, Ubuntu 18.04, 32 GB main RAM. Whale dataset. num_workers = 0 seems to fix it but is 3-4 times (maybe more) slower than trying to use … = 8. I’ve got 6 physical (12 logical) cores and 8 GB VRAM.

i7-8750H, 1070 Max-Q.

I’ll post lib versions tomorrow.

1 Like

This is really interesting, thanks for taking the time to check this. Just one thought: like fastai, you also use callbacks in many places, and those get triggered on every batch.
To really pinpoint the problem we should try the most simple version of the loop, so without any extra callbacks (or functionality copied into the batch loop), don’t you think?!
(and also without running e.g. tqdm or fastprogress, to rule those out as the unlikely but possible source of the problem)

The memory tracking could be done time-based from a separate process.
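
For example, something along these lines (a rough sketch assuming psutil is installed; pass it the PID of the training process):

import argparse
import csv
import time

import psutil


def track(pid, interval=1.0, out_path='memory_log.csv'):
    proc = psutil.Process(pid)
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['timestamp', 'system_used_bytes', 'main_process_rss_bytes'])
        while proc.is_running():
            writer.writerow([time.time(),
                             psutil.virtual_memory().used,   # machine-wide, like the screenshots
                             proc.memory_info().rss])        # main process only
            f.flush()
            time.sleep(interval)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('pid', type=int)
    parser.add_argument('--interval', type=float, default=1.0)
    args = parser.parse_args()
    track(args.pid, args.interval)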

Yes, sure, it is possible. My intention was to replicate the accuracy of the training as well, so I’ve added various callbacks to make the training loops more similar to each other. Actually, I’ve decided to combine testing and learning, so now I have a custom training loop :smile:

And I agree, now that the loop seems to work I can create a “plain” version without callbacks and progress bars to make the tracking process more transparent.

oops… I’m sorry if I misunderstood.

Here I asked you about lesson 1:

@jeremy In your notebook output, resnet34 learn.fit_one_cycle(4) took around 2 minutes in the main video notebook (using SageMaker),
while, as I can see, it took you around 1 minute in the github lesson 1 notebook.
Can you please mention what the specs were for the video lesson notebook training and what they were for the github notebook?

And you replied:

For github repo I’m using a 1080ti. In the lesson it’s a V100

So I thought that the ~1 min github repo run (1080ti) was faster than the ~2 min V100 run.

In the lesson-6-pets-more notebook, Jeremy used the garbage collector:
gc.collect()

Does it make any difference for your memory leak?
Any insight into why he used it?

I worked on the doodle comp for a few days and trained a variety of resnet models (resnet34, 50, 101) on ~8 million images, and had no memory issues. My GCP instance had gigantic memory for this comp (256GB) and 4 GPUs, running 2 models in parallel, each on 2 GPUs. The total max RAM usage was never more than 50GB for the 2 models, plus ~50GB cache.

My instance is new, so pytorch-nightly is a recent version.

One model has already finished, so you see only 2 GPUs are working

1 Like

Hm, worth a try! I guess the reason is that the Jupyter notebook keeps everything in memory until you delete a variable or restart the kernel. So sometimes it is worth calling the GC to free some memory.

However, I am not sure it can help in cases where the data is captured in a trickier way. For example, I had a similar issue here:

The problem was that I had incorrectly implemented a Python generator using closures (lambda functions), so it was keeping data in memory after each batch iteration. And, as far as I recall, the GC wasn’t a big help in my case.
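
Roughly this kind of thing (a made-up illustration, not the original code):

def leaky_batches(data, batch_size):
    # Builds a list of lambdas, each capturing its own batch: every batch
    # stays referenced at once, so memory keeps growing.
    closures = []
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        closures.append(lambda b=batch: b)
    return closures


def streaming_batches(data, batch_size):
    # A real generator: each batch can be freed once the consumer moves on.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]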

Yeah, that’s a problem if you only have 32 GB :smile: And as @balnazzar noted above, all 256GB of his DGX were exhausted on a huge dataset. Also, 8M records is approx. 24K records per category, while I was trying 50K per category, i.e., 17M.

I would say that the main problem here is that the framework requires you to fit all of this stuff into memory at once instead of loading things iteratively on demand. So you need an enormous amount of memory even in cases where it is not really required.

1 Like

The following discussion makes me feel sad :slight_smile:

Could somebody comment on this? Or better, do you know any practical advice on how to take part in Kaggle competitions with big datasets if we have such a limitation?

Probably I just don’t understand something, but I was thinking it is possible to train models on relatively huge datasets with limited hardware capabilities. However, if you can’t use multiprocessing to read the data, it sounds like an incredible I/O bottleneck, which effectively means that Python itself is the problem in our case. I knew about things like the GIL and multiprocessing limitations but didn’t think they could have that much of an effect.

I would really appreciate it if some experienced Kaggle practitioners could give advice about participating in modern data competitions. Do you think a local machine should be used only for experiments on relatively small datasets, and everything else should be performed on hosts with a lot of RAM to cache the data in memory?

Sorry if my statements sound a bit like panic :smile: I am just trying to figure out what is going on here and which best practices could be used to alleviate the issue (except switching into single-threaded mode).

Update: On the recommendation of the competition’s winner, here is a link to a library that I guess could help with multiprocessing. I haven’t tested it yet, but it sounds interesting.

2 Likes

Dunno if this applies, but I’ve been running the last two days or so with num_workers = 2 and it’s been working. With the defaults I was getting the mentioned errors… I picked most of the speed back up with 2 vs. 0.

Interesting. I’ll try it tomorrow and report back!

1 Like