It seems that I can’t train the model on a huge dataset (17M images) on my local machine. I’ve created a post here describing my problem.
Could somebody advise what to do in this case? Do I need to force num_workers=0? In that case, though, training becomes really slow due to the I/O bottleneck.
Is this a memory leak in PyTorch, or am I doing something wrong?
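To be concrete about the trade-off, this is roughly what I mean (a minimal sketch with a dummy dataset; DummyImageDataset, the batch size and the worker count are placeholders, not my real pipeline):

```python
# Minimal sketch: the same dataset wrapped in a single-process and a
# multi-process DataLoader, to rule the workers in or out as the source
# of the memory growth. DummyImageDataset stands in for the real
# 17M-image dataset.
import torch
from torch.utils.data import Dataset, DataLoader

class DummyImageDataset(Dataset):
    def __init__(self, n=100_000):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Fake "decoded image"; the real dataset reads and decodes from disk.
        return torch.randn(3, 224, 224), idx % 10

ds = DummyImageDataset()

# num_workers=0: all loading happens in the main process. Slow because of
# the I/O bottleneck, but there are no worker processes that could hold on
# to extra memory.
loader_single = DataLoader(ds, batch_size=64, shuffle=True, num_workers=0)

# num_workers>0: much better throughput, but every worker gets its own copy
# of the dataset object, so any per-sample caching inside the dataset is
# multiplied by the number of workers.
loader_multi = DataLoader(ds, batch_size=64, shuffle=True, num_workers=8,
                          pin_memory=True)
```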
This is definitely worth investigating more. Can you put together a minimal reproducible example? It would have to use a small dataset, of course, but something that uses your memory callback to show memory usage growing steadily would be enough for us to start debugging (and to send to PyTorch if the issue is on their side).
Yes, agreed, I am going to build a reproducible snippet for this purpose. I have been fighting this problem for a couple of weeks and would really like to figure out where the leak happens.
Ok, here is a simple Gist where I am trying to reproduce the leakage using a smaller dataset:
Also, here is a link to the repository with the notebook:
It doesn’t reproduce the issue exactly. However, it does show a gradual increase in memory consumption during a single epoch (though not between epochs).
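For reference, the measurement itself is nothing fancy (a sketch, not the exact code from the Gist): I just read the resident set size of the training process with psutil and log it after each batch.

```python
# Sketch of the memory measurement (not the exact Gist code): read the
# resident set size (RSS) of the current process so it can be logged after
# every batch and at epoch boundaries.
import os
import psutil

def rss_mb() -> float:
    """Resident memory of the current process, in megabytes."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
```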
I’ve decided to implement a simple training loop to see if the problem is still there with plain PyTorch. So far the results are not too promising: memory consumption still grows during a single training epoch.
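The loop is essentially the following (a simplified sketch with synthetic data and a tiny linear model instead of my real image dataset and CNN); it just logs the loss and the process RSS every few hundred batches so that any growth within an epoch is visible:

```python
# Simplified sketch of the plain-PyTorch loop: synthetic data and a tiny
# linear model stand in for the real image dataset and network. The only
# extra piece is the RSS logging every 200 batches.
import os
import psutil
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

class SyntheticDataset(Dataset):
    def __len__(self):
        return 50_000

    def __getitem__(self, idx):
        return torch.randn(3 * 64 * 64), idx % 10

model = nn.Linear(3 * 64 * 64, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(SyntheticDataset(), batch_size=64, shuffle=True,
                    num_workers=4)

for epoch in range(3):
    for i, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if i % 200 == 0:
            # RSS should stay roughly flat over the epoch if nothing leaks.
            print(f"epoch {epoch} batch {i}: loss={loss.item():.3f}, "
                  f"rss={rss_mb():.0f} MB")
```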
I am experiencing a related issue on fastai 1.0.56 with PyTorch 1.1.
If I run a notebook that makes a TextClassifierBunch and runs show_batch, 4,561 MB of GPU memory are allocated.
If I restart the notebook, 3,944 MB remain allocated with 3% GPU utilization.
If I restart the notebook server, the memory continues to be allocated.
If I restart the machine, the memory is freed.
I am on a P100 instance on GCP, if that’s relevant.
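In case it helps anyone debugging something similar, here is a quick way to see how much of that memory the notebook’s own PyTorch process is actually holding (a small sketch, run from inside the notebook; anything nvidia-smi shows beyond these numbers belongs to the CUDA context or to another process):

```python
# Quick check from inside the notebook: GPU memory held by *this* process
# through PyTorch. Memory reported by nvidia-smi but not reflected here is
# held by the CUDA context or by some other process.
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024 ** 2  # live tensors
    cached = torch.cuda.memory_cached() / 1024 ** 2        # PyTorch's caching allocator
    print(f"allocated: {allocated:.0f} MB, cached: {cached:.0f} MB")
```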
Yes. Turns out this is not a fastai bug at all. If you enable JupyterLab on GCP, it starts a JupyterLab process on machine startup, and that process grabs some GPU memory without a notebook even being opened. Weird, but resolved by killing the process. Sorry to bother!