GPU Memory Leaking

I’m using a lightly modified version of the train_imagenette example notebook. While testing with larger numbers of runs, I noticed that GPU memory usage kept creeping upward. This is a problem because it eventually runs out of memory, limiting how many runs I can specify.

It seems as if fastai2 is leaking GPU memory somewhere? In my main loop, the only thing that stays alive between iterations is the dataloader.

Has anyone encountered a similar bug and found a fix? Or what tools could I use to track down where the GPU memory is being leaked?

Thank you.

Do you have a snippet of code that can reproduce this from the command line? It’ll help anyone who is interested in debugging it. Thanks.

You can try inserting torch.cuda.empty_cache() and gc.collect() at the end of each run to free up some memory, and see if GPU memory usage still grows (monitor it with nvidia-smi -l 1 in another shell).
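A minimal sketch of that cleanup step (the helper name and call site are my own; the two library calls are the ones mentioned above):

```python
import gc

import torch


def cleanup_after_run():
    """Release Python-side references and PyTorch's cached GPU blocks.

    Call this at the end of each training run. gc.collect() first frees
    any cyclic references that may still be holding tensors alive;
    torch.cuda.empty_cache() then returns the now-unused cached blocks
    to the CUDA driver, so the drop becomes visible in nvidia-smi.
    """
    gc.collect()
    if torch.cuda.is_available():  # empty_cache needs a CUDA device
        torch.cuda.empty_cache()
```

Note that empty_cache() only releases memory PyTorch has cached but is no longer using; if something (e.g. the Learner from the previous run) still references a tensor, you need to delete that reference first for gc.collect() to reclaim it.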


Thank you for the suggestion! Adding those at the end of each run does indeed keep the GPU memory usage from growing. Now I don’t have to keep restarting my notebook kernels 🙂