Developer chat

Do you know if we can tell PyTorch to stop keeping a GPU memory cache, so that we can see the timeline of memory allocations?

I don't know. Perhaps there is a way to compile PyTorch with caching disabled? Ask at http://discuss.pytorch.org/ and report back your findings.

Until then, try running torch.cuda.empty_cache() at strategic points.
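Here is a minimal sketch of what I mean by "strategic points". The helper name gpu_mem_snapshot is made up for illustration, and I'm assuming a PyTorch version recent enough to have torch.cuda.memory_reserved() (older releases called it memory_cached()). It frees unreferenced Python objects, releases PyTorch's unused cached blocks, and then reports what is still held:

```python
import gc


def gpu_mem_snapshot(label):
    """Hypothetical helper: return (label, allocated_MB, reserved_MB).

    Returns None when PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    gc.collect()              # drop unreachable Python objects that still hold tensors
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver

    mb = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mb  # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved() / mb    # memory still held by the caching allocator
    return (label, allocated, reserved)


if __name__ == "__main__":
    print(gpu_mem_snapshot("before training"))
```

Call it before and after the step you care about and compare the numbers. Note that empty_cache() only returns blocks that no tensor is using, so memory held by live tensors stays allocated.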

You might also find useful the cell-by-cell GPU memory logger that I have just released: GitHub - stas00/ipygpulogger: GPU Logger for jupyter/ipython (memory usage, execution time). It's totally new, so I'm still tweaking the interface (and feedback is welcome!). The main reason I mention it is that it automatically runs torch.cuda.empty_cache() before and after each cell is run, so that GPU memory usage is measured correctly. (It also runs gc.collect(), but that can be turned off.)

I have also just discovered this PyTorch CUDA memory profiler, which may be useful to you: A CUDA memory profiler for pytorch · GitHub

I have just started a new thread, GPU Optimizations Central. Let's have that discussion over there and use that thread to compile all the knowledge we collectively discover.

When I restart the PC and/or the Jupyter notebook, I can see a surge in GPU memory when I start training a language model. I think it happens when backprop starts.

It could be this too: GitHub - stas00/ipygpulogger: GPU Logger for jupyter/ipython (memory usage, execution time). If it's the first 0.5GB, then that is certainly the case.