Understanding GPU memory usage

This thread’s intention is to help increase our collective understanding around GPU memory usage. Here are some potential subjects to discuss: NVIDIA context, pytorch memory allocator and caching, memory leaks, memory re-use and reclaim.

So if you have questions about these topics or, better still, insights gained from reading papers, forums and blog posts, or, best of all, from your own experimentation, please post them here.

If you see some subject matter being discussed where as a group we lack expertise and you know of someone who does have that knowledge, please invite them to help us gain the understanding.

This first post is also acting as a summary post, so please edit it to add links to relevant tools, online discussions and tutorials and also important posts in this thread.




Profiling GPU memory usage:

[this is a wiki post, so please improve it by editing it]


@stas I want to follow up on your excellent discussion about GPU memory on the Pytorch forum. In short, I can’t quite replicate your result and I want to know where I have gone wrong. Do you have a script/notebook that shows this “use of what is free” on the first step of the build of the MNIST model? When I try to replicate it (here is my gist) I seem to be getting use of memory beyond what is “free” when I start the learn epoch. That is, Pytorch seems to take the memory it needs. Maybe my method of using up the GPU memory, torch.ones((n,n)).cuda(), is not right? Any advice/suggestion is appreciated.

It’d be much less work if you switched your nb to ipyexperiments, as it’ll do the reporting automatically for you and those reports are much less noisy and easier to read/compare. Just split each segment you want to profile into its own cell.

And once you have done that, please tell me which cell(s) you are referring to when you ask this question. I’m just unclear at the moment what your question is. What do you mean by getting use of memory beyond what is “free”? Specific cell number(s) and memory usage numbers would help me understand your question.

Thank you.

p.s. GPUMemTrace is most useful when you need to profile parts of code that aren’t inlined in the nb. Also use its reporting tools, which again are easier to read:


Great suggestion to use ipyexperiments! Thank you.

Do you expect two experiments to have the same peak memory usage without restarting the kernel? Or is it all order dependent?

I find that when I run the “second” experiment, whatever it is, the delta-peaked is much smaller. Compare these two results here and here where I have flipped the order of cells 9 and 10 after a kernel restart.

If it is expected to restart the kernel, then I am seeing what you state in your Pytorch post: When the “free” memory is lower, the peak usage is lower and the free memory can be taken down to some min that is required for that particular batch size. If the free memory is below this min amount, then you are out of luck and need to drop down the batch size.


  • Start using the new function gpu_mem_allocate_mbs(n) (ipyexperiments.utils.mem or fastai test suite’s utils.mem) so that it’s easier to see how many mbs you were allocating. I will need to update the demo/tests to use that instead eventually.

  • Well, I was actually suggesting to use ipyexperiments’s CellLogger, so that you can see cell by cell memory consumption - i.e. not the context manager. You don’t need to do anything differently other than starting the experiment and then splitting your code into multiple cells. https://github.com/stas00/ipyexperiments/blob/master/demo.ipynb scroll to gpu experiments (anchors don’t work).
    You can also disable the experiment part and only have the profiler, like here https://github.com/stas00/ipyexperiments/blob/master/demo_cl.ipynb and I think that for what you’re trying to do you don’t need the experiment at all, just the cell profiler plus a call to learn.destroy() at the end of each experiment.

  • Your testing is a bit overcomplicated, since not only are you comparing learn.fit_one_cycle(1) with fit(1), you also throw in some other allocations. Unless you’re trying to force your card into having a fixed amount of free RAM, in which case I recommend gpu_mem_leave_free_mbs() (ipyexperiments.utils.mem or fastai test suite’s utils.mem), which lets you emulate your card’s free memory in one command. The intention is then easier to read.

    But it’s still best to compare apples to apples, so if you change the order, stick to the same fit() call. The fewer variations you use, the better the test.

  • When you do such tests you most likely need to fix the seed: https://docs.fast.ai/dev/test.html#getting-reproducible-results although it doesn’t always help. I’m currently having this exact difficulty with unet_learner, whose memory usage fluctuates quite wildly, which makes it impossible to try out optimizations.

So let me know if I understand your setup correctly. You’re comparing the memory allocations with lots of free GPU RAM available vs. just enough to run the fit() function. Correct? Change your setup to move the pretend allocation out of the experiment, so it’ll be much easier to compare all the numbers, and use gpu_mem_leave_free_mbs().

So, yes, your observation is correct. This is what that thread on the pytorch forums was discussing: when there is lots of RAM, the pytorch allocator works in a more efficient way, using more RAM in the process but returning it at the end. When there is little RAM available, it’ll use only what it needs to run subsequent batches, plus perhaps 10% extra for temp allocations (I haven’t tested the 10%, it’s just an estimate from my experiments).


In general the only reason we should care about peak memory is when we don’t have enough RAM to accommodate the peak’s need and the program fails. e.g.:

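from ipyexperiments.utils.mem import gpu_mem_allocate_mbs

# Program 1: both 1000MB blocks are alive at the same time
x1 = gpu_mem_allocate_mbs(1000)
x2 = gpu_mem_allocate_mbs(1000)
x1 = x2  # the first block is released only here

# Program 2: the first block is released before the second is allocated
x1 = gpu_mem_allocate_mbs(1000)
del x1
x2 = gpu_mem_allocate_mbs(1000)
x1 = x2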

Both give the same result, but if we only have 1GB free, the first program fails and the second succeeds.
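To make the difference in peak usage concrete without needing a GPU, here is a minimal toy model. ToyGPU is a hypothetical stand-in that only counts megabytes; it is not PyTorch and says nothing about how the real caching allocator behaves. It only illustrates why the order of freeing and allocating changes the peak.

```python
class ToyGPU:
    """Hypothetical stand-in for a card: counts used and peak MB only."""

    def __init__(self, free_mb):
        self.free_mb = free_mb
        self.used_mb = 0
        self.peak_mb = 0

    def alloc(self, mbs):
        if self.used_mb + mbs > self.free_mb:
            raise MemoryError(f"OOM: need {mbs} MB, have {self.free_mb - self.used_mb} MB")
        self.used_mb += mbs
        self.peak_mb = max(self.peak_mb, self.used_mb)
        return mbs  # the "handle" is simply the size

    def free(self, handle):
        self.used_mb -= handle


def replace_then_free(gpu):
    """Mirrors the first program: the old block is freed after the new one exists."""
    x1 = gpu.alloc(1000)
    x2 = gpu.alloc(1000)  # both blocks alive here
    gpu.free(x1)          # what rebinding x1 = x2 does to the old block
    return gpu.peak_mb


def free_then_replace(gpu):
    """Mirrors the second program: the old block is freed first (del x1)."""
    x1 = gpu.alloc(1000)
    gpu.free(x1)
    x2 = gpu.alloc(1000)
    return gpu.peak_mb
```

With ToyGPU(4000) the first function reports a peak of 2000 and the second a peak of 1000; with ToyGPU(1000) the first raises MemoryError while the second still succeeds.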

This is the problem we have for example with learn.load() which first loads the replacement and then frees up the old allocations. Practically, it’s not worth fixing it though (and the only way to fix it is for pytorch to support unloading) because if you don’t have enough GPU RAM margin to support 2 x size of the model, you won’t be able to do anything anyway once you loaded the model.

My initial shock was at how inefficient pytorch seemed, peaking at 5 times the normal usage, until it was explained to me that this only happens when that luxury can be afforded; otherwise pytorch can succeed on an extremely tight memory “budget”.

Happy to use these functions instead. I was unfamiliar with them, thanks for pointing them out!

I agree it is unclear what is going on but with things like gpu_mem_leave_free_mbs() the experiments will be much clearer.


@bfarzin, fyi, I made a whole bunch of improvements for https://docs.fast.ai/utils.mem.html#GPUMemTrace (git master required or 1.0.47 when it gets released).

This includes a new decorator https://docs.fast.ai/utils.mem.html#gpu_mem_trace, so now you can place it above methods and functions and get automatic reporting. E.g., here is some output from the unet learner debugging I’m in the process of doing:

△Used Peaked MB:      0      0 (UnetBlock.forward: exit)
△Used Peaked MB:      0      0 (UnetBlock.forward: exit)
△Used Peaked MB:      0    154 (UnetBlock.forward: exit)
△Used Peaked MB:    372     64 (UnetBlock.forward: exit)
△Used Peaked MB:    128    282 (FeatureLoss.make_features: exit)
△Used Peaked MB:  1,220      0 (FeatureLoss.make_features: exit)
△Used Peaked MB:  1,508     32 (FeatureLoss.forward: exit)

I also changed the output format to make it easier to have stacks of those.

I know that column on the left looks redundant, but remember each of these prints is unrelated to each other and various other outputs may come in between.

Here I used an assumption that 5 digits fixed width should be enough for now (6 with ,), as I don’t know anybody with 100GB+ cards yet.
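The fixed-width layout described above can be reproduced with Python's format mini-language. This is only a sketch of the layout, not the library's actual code; format_report is a hypothetical helper name.

```python
def format_report(used_mb, peaked_mb, ctx):
    # "6,d": minimum width of 6 characters (5 digits plus one comma),
    # with "," as the thousands separator.
    return f"△Used Peaked MB: {used_mb:6,d} {peaked_mb:6,d} ({ctx})"


print(format_report(0, 154, "UnetBlock.forward: exit"))
print(format_report(1220, 0, "FeatureLoss.make_features: exit"))
```

The width is a minimum, so a value of 100,000 MB or more would not be truncated; it would just push the rest of the line to the right and break the column alignment.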

Other important changes in GPUMemTrace:

  • no need to start(), it starts automatically

  • context manager prints report automatically

  • added context and subcontext in reports, so you could easily tell where the report has come from, but only need to set the main context in the constructor. Example:

    m1 = GPUMemTrace(ctx='foo')
    m2 = GPUMemTrace(ctx='bar')


    △Used Peaked MB:      0      0 (foo: sample1)
    △Used Peaked MB:      0      0 (foo: sample2)
    △Used Peaked MB:      0      0 (bar: sample1)
    △Used Peaked MB:      0      0 (bar: sample2)

Have a look at the doc, lots of examples there.

As you use it please let me know if anything could be improved. The idea is to type as little as possible and to get intelligible outputs that could quickly help find leaks and inefficient code.

For example with the decorator it should be possible to turn debug traces on and off w/o touching the code (once the decorators are in the code). Just need to tweak it some more. It’s a work in progress. So please start using it and send back feedback if you find any. Thank you.
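One way the on/off switch could look is sketched below. Everything here is an assumption for illustration: mem_trace, the GPU_MEM_TRACE environment variable and the get_used_mb callable are hypothetical names, and this is not how gpu_mem_trace is implemented. The memory reader is passed in so the sketch runs without a GPU.

```python
import functools
import os

# Hypothetical switch: tracing is on only when GPU_MEM_TRACE=1 is set.
TRACE_ENABLED = os.environ.get("GPU_MEM_TRACE", "0") == "1"


def mem_trace(get_used_mb, enabled=None, out=print):
    """Decorator factory: report the change in used memory across a call.

    get_used_mb: callable returning currently used memory in MB (an int).
    enabled: overrides the environment switch when not None.
    out: where the report line goes (print by default).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            on = TRACE_ENABLED if enabled is None else enabled
            if not on:
                # Tracing off: call straight through, no measurement.
                return fn(*args, **kwargs)
            before = get_used_mb()
            try:
                return fn(*args, **kwargs)
            finally:
                delta = get_used_mb() - before
                out(f"△Used MB: {delta:6,d} ({fn.__qualname__}: exit)")
        return wrapper
    return decorator
```

The decorators stay in the code permanently; flipping the environment variable before starting the process decides whether they print anything.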

p.s. see the note in the doc about peak measurement being unreliable due to not having control over the thread that performs that measurement. We need to get pytorch support for this to give correct numbers always - please vote for this feature request.



@stas I started using with gpu_mem_restore_ctx(). It would save some xx minutes if the doc included examples showing which import to set up. E.g. gpu_mem_restore_ctx() requires from fastai.utils.ipython import * .

Yes, will do, thank you - just tell me which doc you’re referring to.

if you have notebooks with all your gpu-experiments then that would also be super to reference them

I see. The thing is, those are the API docs, and they are all like that. So we need to set a standard and make things consistent. If I added it for this particular function, where do we draw the line and not do it for every API?

However, you’re bringing a bigger issue here, this could have been done for all APIs, by showing the relevant import line.

The code is already there, but it’s hidden.

So I think what you’re really asking for is:

which can be done for all API docs. Just need to unhide/split/hide for each nb.

if you have notebooks with all your gpu-experiments then that would also be super to reference them

I’m not sure which gpu-experiments you refer to. Perhaps some of these?

Thanks. You are right.
I guess my misstep was that I expected the utils stuff to be so basic (and good) that it would be part of fastai.basic and thus already be available when importing fastai.text

I have been thinking for a while that it would be rather convenient if PeakMemMetric could include columns for average CPU and GPU usage. These metrics are often discussed in the forums.
E.g. the transformerXL uses a lot of GPU mem but the GPU usage is very low (about 11%). So there is probably a large margin for improvement.

But, no, your question was valid and as you can see the import is shown now - just waiting to see that it’s ok with devs so we can have it in all notebooks - so thank you for that prompt, @Kaspar.

PeakMemMetric is for memory

But, of course, add a new callback and send a PR :) test + doc included please.

Implementation wise: The problem with measuring the averages is that you will have to monitor them somehow and threads are far from reliable.

Finally, let’s take this discussion into another thread, since it’s diverging from the topic.


I need some help here. There is something going on with GPU memory that I don’t understand. When I call the same experiment twice, I get two different results. The effect seems to be related to the first batch size that I attempt to push through the pipeline. If I try something very large, and that causes an OOM error, then it seems hard/impossible to recover. If I try a batch that is more reasonable, then it seems to proceed no problem. Maybe I am using the mem tools incorrectly. Any help appreciated.

Notebook is here. I print out a lot of details to help with debugging, sorry if that clouds up what is going on. Here is what I see:

  • Cell 11, I start with a ridiculous batch size, 4000, and it proceeds to tell me it is OOM and then does a binary search and comes up with 2 as the max size.
  • Cell 12, I repeat the same code, this time it fails again at 4,000 but recommends 126 as the max size.
  • Cell 13, I start with 800, and right away it says it can fit a batch of 800 on the card.

If I move the order of the cells around, I get a similar result where I can’t proceed after I have tried to put a very big batch on the card.

Is this related to:

First of all I highly recommend fixing the seed. It’s kind of crazy since there are like 6 calls to that to cover all bases:
update: looked at your nb - you did that so that is covered!

Another bit of quick feedback: sometimes it depends on the learner. With cnn I get stable results, but with unet I get crazy fluctuations even with a fixed seed. I have yet to figure out why this is happening.

FitNBatch: eventually, switch to the integrated (renamed) version, https://docs.fast.ai/callbacks.misc.html#StopAfterNBatches (it fixes the batch count too: FitNBatch runs 1 batch too many, so with huge batches it’d make things a bit faster).

wrt progress_disabled - is there a way to completely remove progress output? no epoch and loss printing - we don’t need it and it just adds to the noise.

I’ll look at your notebook a bit later and follow up.


I don’t know if this might be helpful: while I was trying to write a memory fragmentation tool, part of it required writing a function that tries to allocate the maximum memory chunk possible (which is almost always smaller than the reported free memory due to fragmentation). And that’s how it goes: it first tries the amount reported as free, then goes down if it’s OOM, then up if it’s successful, and so on. It’s a recursive search.

For some reason github doesn’t render it, but if you download it and look at the Mem fragmentation mapper section, it’s the first functions there: https://github.com/stas00/fastai-misc/blob/master/debug/pytorch/mem-frag-map.ipynb update: I see that you’re using pretty much the same search alg.

It’s half-baked, since I didn’t realize that fragmentation happens on a sub-page level and cuda reallocates pages, so my attempt was doomed to failure from the get go, but perhaps some useful bits can be salvaged there.
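The down-on-OOM, up-on-success search described above can be sketched as a plain binary search. This is an iterative variant written for illustration, not the code from the linked notebook. max_allocatable_mbs and try_alloc are hypothetical names, and the sketch assumes that if a chunk of n MB fits, every smaller chunk fits too.

```python
def max_allocatable_mbs(try_alloc, reported_free_mbs):
    """Return the largest n in [0, reported_free_mbs] for which try_alloc(n) is True.

    try_alloc(n) is assumed to attempt an n MB allocation, release it again,
    and return True on success or False on an out-of-memory failure.
    """
    # First try exactly what is reported as free.
    if try_alloc(reported_free_mbs):
        return reported_free_mbs

    lo, hi = 0, reported_free_mbs  # lo: treated as fitting, hi: known to fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_alloc(mid):
            lo = mid  # success: search upwards
        else:
            hi = mid  # OOM: search downwards
    return lo
```

On a real card, try_alloc would wrap an actual allocation attempt and catch the out-of-memory error. The same shape of search applies to finding a maximum batch size, with the caveat raised in this thread that a failed attempt may leave the card in a different state for the next probe.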
