Update: I’ve cross-posted the question to the devs section as well. It is probably worth merging these threads to keep the discussion in a single place.
This question has been raised several times on the PyTorch and fastai forums and issue trackers, but I would like to clarify: is it possible to safely use num_workers > 0
in data loaders? Can I use all available CPUs, or is it better to use fewer than that?
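To make the question concrete, here is a minimal sketch of the kind of loader configuration I mean. The dataset here is just a dummy placeholder (my real one is in the Gist linked further down):

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """Placeholder dataset; my real one reads images from disk."""

    def __init__(self, n=10_000):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Fake sample and label, just to show the loader setup
        return torch.randn(3, 224, 224), idx % 10


# The question: is it safe to set num_workers to the number of CPUs,
# or should it be smaller than that (or even 0)?
loader = DataLoader(
    DummyDataset(),
    batch_size=32,
    shuffle=True,
    num_workers=os.cpu_count(),  # e.g. all available cores
    pin_memory=True,
)
```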
The reason I am asking is that for a couple of weeks I have been struggling with the error from the title of this post. Every once in a while, the training process is killed due to a lack of RAM. I was tracking memory usage while the training process was running via free -mh
, and it shows a slow decrease in available memory during a single training epoch. I guess something is going wrong, because memory consumption shouldn’t grow indefinitely, right? It looks like a memory leak to me. Or am I doing something wrong?
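For what it’s worth, memory could also be sampled from inside the training loop instead of (or in addition to) free -mh. A minimal sketch, assuming psutil is installed (the commented loop is a placeholder for the real training step):

```python
import psutil


def log_memory(prefix=""):
    """Print system-wide available memory and the current process RSS, in MB."""
    vm = psutil.virtual_memory()
    rss = psutil.Process().memory_info().rss
    print(f"{prefix} available: {vm.available / 2**20:.0f} MB, "
          f"process RSS: {rss / 2**20:.0f} MB")


# Inside the training loop, e.g. every 100 batches:
# for i, (x, y) in enumerate(loader):
#     ...
#     if i % 100 == 0:
#         log_memory(prefix=f"batch {i}:")
```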
Here is a Gist I am using:
It seems that the PyTorch community claims the problem is in custom datasets, while the fastai forum seems to say the opposite, and the general advice is to use num_workers=0
.
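If “the problem is in custom datasets” refers to the copy-on-write behaviour of forked worker processes, the usual illustration (which may or may not apply to my Gist) is a dataset that holds a large list of Python objects: reference-count updates in the workers gradually touch the shared pages, so each worker’s memory grows over the epoch and it looks like a leak. A sketch of that suspected pattern and the commonly suggested workaround of storing the metadata in a numpy array:

```python
import numpy as np
from torch.utils.data import Dataset


class LeakyDataset(Dataset):
    """Suspected pattern: a big list of Python objects is shared with the
    worker processes via fork; refcount updates trigger copy-on-write,
    so each worker's RSS slowly grows during the epoch."""

    def __init__(self, paths):
        self.paths = list(paths)  # large list of Python strings

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.paths[idx]


class FixedDataset(Dataset):
    """Commonly suggested workaround: keep the metadata in a numpy array,
    so the shared pages are not touched by Python refcounting."""

    def __init__(self, paths):
        self.paths = np.array(list(paths))  # fixed-width string array

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return str(self.paths[idx])
```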
Also, I’ve seen advice about increasing the amount of swap space. However, I am not sure it would help, since all 32 GB of my RAM get occupied during a single epoch, and I guess adding swap can’t really fix a leak.
Probably I just don’t understand something. Has anyone else run into a similar issue? Do I just need to use a single core?
If anybody would like to know more or try to help with the issue, I am happy to provide information about my setup and hardware, gather a memory usage log, etc.