First:

- Start using the new function `gpu_mem_allocate_mbs(n)` (from `ipyexperiments.utils.mem` or the fastai test suite's `utils.mem`) so that it's easier to see how many MBs you are allocating. I will need to update the demo/tests to use that instead eventually. Roughly like the sketch below.
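A minimal sketch of what I mean, assuming the signature I remember (`gpu_mem_allocate_mbs(n)` allocates ~n MBs on the GPU and returns the tensor holding them) - double-check against `ipyexperiments.utils.mem`:

```python
from ipyexperiments.utils.mem import gpu_mem_allocate_mbs

# pretend-allocate ~1GB of GPU RAM - the number of MBs is explicit in the test
buf = gpu_mem_allocate_mbs(1024)

# ... run the code whose memory behaviour you want to observe ...

del buf  # release the pretend allocation when the test is done
```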
- Well, I was actually suggesting to use ipyexperiments' CellLogger, so that you can see cell-by-cell memory consumption - i.e. not the context manager. You don't need to do anything differently other than starting the experiment and then splitting your code into multiple cells: https://github.com/stas00/ipyexperiments/blob/master/demo.ipynb - scroll down to the gpu experiments (anchors don't work). You can also disable the experiment part and only have the profiler, like here: https://github.com/stas00/ipyexperiments/blob/master/demo_cl.ipynb. For what you're trying to do here, I think you don't need the experiment at all - just the cell profiler, plus a learn.destroy() call at the end of each experiment. See the sketch below.
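A very rough notebook layout of what I mean (fastai v1-era API, written from memory as a sketch - see demo_cl.ipynb for how to run the cell logger without the experiment part):

```python
# cell 1: start the experiment - from here on its CellLogger prints
# per-cell CPU/GPU memory usage after every cell finishes
from fastai.vision import *
from ipyexperiments import IPyExperimentsPytorch
exp = IPyExperimentsPytorch()

# cell 2: data + learner setup, logged as its own cell
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# cell 3: the call whose memory you're measuring, logged as its own cell
learn.fit(1)

# cell 4: tear down so the next variant starts from a clean GPU
learn.destroy()
```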
- Your testing is a bit overcomplicated, since not only are you comparing learn.fit_one_cycle(1) with fit(1), you also throw in some other allocations. Unless you're trying to force your card into having a fixed amount of free RAM, in which case I recommend `gpu_mem_leave_free_mbs(n)` (from `ipyexperiments.utils.mem` or the fastai test suite's `utils.mem`), which lets you emulate your card's free memory in one command - it's easier to read the intention then (sketch below). But still, it's best to compare apples to apples, so if you change the order, stick to the same fit() call. The fewer variations you use, the better the test.
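For example (again from memory, so treat the exact name and behaviour as an assumption): the helper allocates however much it takes so that only the requested number of MBs remains free, which makes the intended "card size" explicit:

```python
from ipyexperiments.utils.mem import gpu_mem_leave_free_mbs

# emulate a card with ~2GB free: consume GPU RAM until only ~2048 MBs remain
gpu_mem_leave_free_mbs(2048)

learn.fit(1)  # then measure the exact same fit() call under the constrained setup
```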
- When you do such tests you most likely need to fix the seed: https://docs.fast.ai/dev/test.html#getting-reproducible-results. Albeit it doesn't always help - I'm currently having this exact difficulty with unet_learner, whose memory usage fluctuates quite wildly, which makes it impossible to attempt optimizations.
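The gist of that doc section, adapted from memory, is something along these lines:

```python
import random
import numpy as np
import torch

def fix_seed(seed=42):
    # python, numpy and pytorch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # trade some speed for determinism in cudnn
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_seed()  # call before building the DataBunch/Learner
```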
So let me know if I understand your setup correctly: you're comparing memory allocations with lots of free GPU RAM available vs. just enough to run the fit() call. Correct? Change your setup so the pretend allocation happens outside the experiment - it will be much easier to compare all the numbers - and use gpu_mem_leave_free_mbs().
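Something like this, continuing with the `learn` object from the earlier sketch (cell boundaries and the free-RAM number are just placeholders):

```python
# cell 1: constrain free GPU RAM *before* starting the experiment, so the
# pretend allocation doesn't pollute the numbers you're comparing
from ipyexperiments.utils.mem import gpu_mem_leave_free_mbs
gpu_mem_leave_free_mbs(2048)   # skip this cell entirely for the lots-of-free-RAM run

# cell 2: start the experiment / cell logger
from ipyexperiments import IPyExperimentsPytorch
exp = IPyExperimentsPytorch()

# cell 3: the one call you're measuring - identical in both runs
learn.fit(1)

# cell 4: clean up before the next variant
learn.destroy()
```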
So, yes, your observation is correct. This is what that thread on the pytorch forums was discussing: when there is lots of RAM, the pytorch allocator uses a more efficient strategy that consumes more RAM in the process, but returns it at the end. When there is little RAM available, it'll use only what it needs to run the subsequent batches, plus perhaps 10% extra for temp allocations (I haven't measured the 10% - it's just an estimate from my experiments).