Does DataLoader have multithreading memory leak issues?

devforfu · November 26, 2018, 8:45am

It seems that I can’t train the model on a huge dataset (17M images) on my local machine. I’ve created a post here describing my problem.

Could somebody advise what to do in this case? Do I need to enforce num_workers=0? In that case, though, training becomes really slow because of the I/O bottleneck.

Is it a memory-leakage problem in PyTorch, or am I doing something wrong?

sgugger · November 26, 2018, 12:52pm

This is definitely worth investigating more. Can you put together a minimal reproducible example? It would be with a small dataset of course, but something that uses your memory callback to show memory usage growing over time would be enough for us to try to debug (and send to PyTorch if the issue is on their side).

devforfu · November 26, 2018, 1:22pm

Yes, agreed, I am going to build a reproducible snippet for this purpose. I have been fighting with this problem for a couple of weeks and would really like to figure out where the leak happens.

devforfu · November 27, 2018, 3:55pm

Ok, here is a simple Gist where I am trying to reproduce the leakage using a smaller dataset:

Also, here is a link to the repository with the notebook:

It doesn’t reproduce the issue exactly. However, as far as I can see, it shows a gradual increase in consumed memory during a single epoch (but not between epochs).

devforfu · December 2, 2018, 3:30pm

I’ve decided to implement a simple training loop to see if the problem is still there when using plain PyTorch. So far the results are not too promising: memory consumption still grows during a single training epoch.
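For anyone who wants to check for the same pattern, here is a minimal sketch of logging memory growth inside a single epoch. It is not the code from the Gist: `profile_epoch`, `leaky_step`, and `retained` are hypothetical names, and `range(500)` stands in for a real DataLoader. It uses only the standard library's `tracemalloc`, which tracks Python-level allocations in the current process only, so it will not see memory held by DataLoader worker processes or native tensor storage; for those, the process RSS would have to be measured with some other tool.

```python
import tracemalloc


def profile_epoch(batches, train_step, log_every=100):
    """Run `train_step` on every batch and record traced memory periodically.

    Returns a list of (step, current_bytes) tuples, one per `log_every` steps.
    """
    tracemalloc.start()
    history = []
    try:
        for step, batch in enumerate(batches, start=1):
            train_step(batch)
            if step % log_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                history.append((step, current))
    finally:
        tracemalloc.stop()
    return history


# Hypothetical stand-in for a training step that accidentally keeps
# a reference to something from every batch.
retained = []


def leaky_step(batch):
    retained.append(bytes(10_000))  # about 10 KB kept alive per batch


if __name__ == "__main__":
    # `range(500)` plays the role of a DataLoader here.
    for step, current in profile_epoch(range(500), leaky_step):
        print(f"step {step}: {current / 1e6:.2f} MB traced")
```

If the logged values keep climbing within one epoch, something in the loop is holding references; if they stay flat while the overall process memory still grows, the growth is more likely in worker processes or native memory.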

sshleifer · August 9, 2019, 4:59pm

I am experiencing a related issue on fastai 1.0.56 with PyTorch 1.1.

If I run a notebook that makes a TextClassifierBunch and run show_batch, 4,561 MB of GPU memory are allocated.
If I restart the notebook, 3,944 MB remain allocated with 3% GPU utilization.
If I restart the notebook server, the memory continues to be allocated.
If I restart the machine, the memory is freed.

I am on a P100 instance from GCP if that’s relevant.

Any fix ideas?

Code:

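from fastai.text import *  # fastai v1; edf, data_lm, MOD_PATH and the *_COL names are defined earlier in the notebook

data_clas = (TextList.from_df(edf, MOD_PATH, cols=[CTX_COL], vocab=data_lm.vocab)
      .split_from_df(IS_VALID).label_from_df(cols=[TARGET_COL])
      .databunch(bs=256))
data_clas.show_batch()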

sgugger · August 10, 2019, 7:03am

I have never encountered a situation where the memory stayed in use after restarting the notebook. Can you reproduce it consistently?

sshleifer · August 12, 2019, 11:27pm

Yes. Turns out this is not at all a fastai bug. If you enable JupyterLab in GCP, it starts a JupyterLab process on machine startup, and that process grabs some GPU memory without a notebook even being opened. Weird, but resolved by killing the process. Sorry to bother!
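A rough sketch of how one might find which process is holding GPU memory in a situation like this. It assumes `nvidia-smi` is on the PATH; `parse_compute_apps` and `list_gpu_processes` are hypothetical helper names, not part of fastai or PyTorch.

```python
import subprocess

# Ask nvidia-smi for one CSV row per compute process: pid, name, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into dicts with pid, name, and used_mib."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # not a "pid, name, memory" row
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        pid, name, mem = pid.strip(), name.strip(), mem.strip()
        if not pid.isdigit():
            continue
        # Memory can be reported as a non-numeric placeholder on some setups.
        used_mib = int(mem) if mem.isdigit() else None
        rows.append({"pid": int(pid), "name": name, "used_mib": used_mib})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently using the GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    try:
        for proc in list_gpu_processes():
            print(proc)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi is not available: {exc}")
```

Once the PID that still holds memory with no notebook open has been identified, that process can be stopped after confirming it is not needed.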

sgugger · August 13, 2019, 5:08pm

That’s why we recommend using Jupyter Notebook. JupyterLab is still a bit clunky.