IPyExperiments: Getting the most out of your GPU RAM in Jupyter notebooks

[this has been moved from the dev chat to support a focused discussion]

TL;DR: How can we do a lot of experimentation in a given Jupyter notebook w/o needing to restart the kernel all the time? Solution: https://github.com/stas00/ipyexperiments

Post 1: I have been contemplating how inefficiently we deal with ‘cuda: out of memory’ problems, and also with the GPU memory that “leaks” with each experiment we run in the notebook.

I’d like to explore two closely related scenarios:

  1. dealing with ‘cuda: out of memory’ by being able to roll back to a processor state where we can change the parameters to consume less memory. We already save intermediate states of data, but often that’s cumbersome, since it’s not enough to restart the kernel and load the data again - one also needs to go and re-run some parts of the notebook, which is very inefficient and error-prone.

  2. having an experiment framework, so that at the end of each experiment the GPU RAM is released.

So the theoretical idea is to have a block like ‘try/except’, but wrt objects, so that at the end of that block all objects that were created since the beginning of the block get freed, and the GPU RAM consumed by the block gets automatically released to be used again.

So when I’d like to find out some parameters for a better training outcome, or for finding the limit of the card, I’d run:

data = ...
memory_block:
    learn = ...
    learn.lr_find()
    learn.fit...

and I could go back and repeat it w/o needing to restart the kernel.

I guess implementation-wise it’d be some kind of fixture that will start recording all new objects created and then destroy them at the end of the block?

Perhaps just using a simple function will do the trick, as the local variables should get destroyed upon its exit, so there is no need to re-invent the wheel - I have yet to try it - I’m not sure under which circumstances pytorch releases its GPU memory. Except it’d go against the one-action-per-cell convention we use, as the whole experiment would need to be moved into a single cell.

Your thoughts are welcome.

Post 2:

It somewhat works, i.e. some memory gets released, but not all.

Is it possible that fastai leaks some objects?

Currently I’m trying to figure out the parameters for lesson3-imdb.ipynb so that my 8GB card can do the lesson, since as is it’s running out of GPU RAM.

# data_lm is already loaded, as in the lesson3-imdb notebook
def block():
    learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
    learn.lr_find()

block()
torch.cuda.empty_cache()

Since the cache has been emptied, in theory the GPU RAM usage should go back to the same level after running this block.

Before running this code I get 485MiB used, after - 2097MiB. If I rerun this cell it goes up to 2669MiB, on the 3rd run to 3219MiB, then 3765MiB, and so forth.

Any insights on what might be going on? Some circular references that prevent the memory release?

If I add gc.collect() before emptying torch cache:

block()
gc.collect()
torch.cuda.empty_cache()

then I can re-run the block w/o incremental RAM leakage.

Reading more on gc: gc.garbage contains a list of objects which the collector found to be unreachable but could not be freed (uncollectable objects); objects with a __del__() method don’t end up in gc.garbage. So if I add print(gc.garbage):

block()
gc.collect()
torch.cuda.empty_cache()
print(gc.garbage)

I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-25-f06f01607ed3> in <module>()
      8 gc.collect()
      9 torch.cuda.empty_cache()
---> 10 print(gc.garbage)

~/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/dataclasses.py in __repr__(self)

AttributeError: 'LanguageLearner' object has no attribute 'data'

So does this mean gc couldn’t free the LanguageLearner object? If I print the object references, indeed you can see that there is a reference cycle there:

To get this graph I did:

! pip install objgraph xdot
import objgraph
learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
objgraph.show_refs([learn])

Plus, there are 920 objects that couldn’t be freed:

print(len(gc.garbage))
920

edit: I did a few experiments with pure pytorch:

import torch
def consume(n): torch.ones((n, n)).cuda()

n = 1
consume(n)
torch.cuda.empty_cache()
# always keeps 481MiB GPU RAM consumed

n = 5*2**13 # about 7GB
consume(n)
torch.cuda.empty_cache()
# back to 481MiB GPU RAM consumed

So it looks like it’s normal for pytorch to occupy and keep tied up ~0.5GB of GPU RAM even for the tiniest thing of creating a 1x1 tensor and loading it onto the GPU.

However, it releases all the extra memory used by the process when I tried to load 7GB onto it.

But with fastai it doesn’t do that - the memory is not freed - which most likely means there is a leakage of objects that don’t get automatically freed. Yet, if you look at the code that calls gc.collect() earlier, most things do get reclaimed - so it seems the problem is that python’s gc doesn’t know to run a collection at the moment the function returns and its variables need to be released, and thus we get stuck with consumed GPU RAM even though it could be freed up. My suspicion, though, is that while gc can magically release objects with circular references, fastai should be doing that on its own.
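To see why the explicit gc.collect() is needed, here is a tiny self-contained example of a reference cycle that plain reference counting can’t free (pure python, no fastai):

import gc

class Node:
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a reference cycle: a -> b -> a
    # on return a and b go out of scope, but their refcounts never reach zero

make_cycle()
# the two Node objects are still alive here, holding whatever they reference,
# until the cyclic garbage collector runs:
print(gc.collect())  # reports > 0 unreachable objects found and collected

If objects like these hold references to CUDA tensors, the GPU memory stays tied up until that collection happens.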

13 Likes

I’ve created the thread in the end so that we can write a bit more during the investigation

@stas the memory issue is quite important to me as well; we need it fixed somehow to work on the multilingual version of ulmfit. There are also some CUDA crashes other than OOM that I hit from time to time when I play with the rnn api. So how about we create a new thread and start discussing there?

Regarding cyclic references in python, I remember that it was an issue in old versions, but nowadays it’s just a matter of executing the GC, as opposed to the quick reference-counting strategy used for acyclic graphs.

The cyclic reference comes in part from the Callbacks (Learner Callbacks): they are defined as dataclasses, they declare the learner as a property, and some of the callbacks are stored in the learner. Besides the cyclic reference, this makes it hard to list callbacks, as the whole Learner object (with the Model object) gets printed for each callback.

I would vote for changing the Callbacks signature to make learn a getter and use weak references to remove the cyclic dependency. What do you think @sgugger?

@dataclass
class LearnerCallback(Callback):
    learn: Learner                                                 # <- source of cyclic references
    def __post_init__(self):
        if self.cb_name: setattr(self.learn, self.cb_name, self)  # <- source of cyclic references

I would suggest using something along these lines:

@dataclass
class LearnerCallback(Callback):
    learn: InitVar[Learner]
    _learn: Learner = field(init=False, repr=False, compare=False)

    def __post_init__(self, learn):
        if self.cb_name: setattr(learn, self.cb_name, self)  # <- still a problem
        self._learn = weakref.ref(learn)

    @property
    def learn(self):
        return self._learn()

What remains an issue is the setattr on the learner. @sgugger, can we replace that with a weakref dictionary and some lookup function? Why do we need it, do you have an example at hand?
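Something along these lines is what I have in mind for the lookup - a rough sketch with made-up names, not the actual fastai API:

import weakref

# a registry instead of setattr(learn, cb_name, cb);
# entries vanish automatically once the callback object is gone
_callbacks = weakref.WeakValueDictionary()

def register_callback(cb_name, cb):
    _callbacks[cb_name] = cb

def lookup_callback(cb_name):
    return _callbacks.get(cb_name)  # None if never set or already collected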

1 Like

A good side-effect of your proposal is that if we save what happened to the data and not only a snapshot of data at a certain point, we get reproducibility.

I have been thinking about this for a while. I often go back in a notebook and change parameters, and after some back and forth I sometimes have no idea what I did to get a good result and cannot reproduce it from scratch (just me? I swear I have been trying to be more organized with my notebooks).

1 Like

Or we need to add a __del__ method to such classes that correctly unwinds the circular references.

Also it’s crucial to remember that if __del__ is added, it has to be correct, since gc.collect() will not attempt to free objects with circular references whose class has __del__ (note: this limitation was lifted in python 3.4 via PEP 442).

I’m thinking that we probably need a whole set of tests that verify objects get destroyed cleanly and no leakage happens. I was thinking that we need to deploy some of the gpu memory access modules discussed here: Show GPU utilization metrics inside training loop (without subprocess call!) to do the measuring. This is now documented here: https://docs.fast.ai/dev/gpu.html#accessing-nvidia-gpu-info-programmatically.
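For example, the used-memory readings can be done programmatically with pynvml - a sketch, assuming the nvidia-ml-py3 package and GPU 0:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)   # GPU 0
info = nvmlDeviceGetMemoryInfo(handle)
print(f"used: {info.used / 2**20:.0f}MiB, free: {info.free / 2**20:.0f}MiB")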

But perhaps simply counting gc.garbage after calling gc.collect should do the trick. In fact I think this is the correct way, because what if we leak objects on CPU instead of GPU - this is just as bad.

So a test would look like:

import gc

gc.collect()
garbage_before = len(gc.garbage)            # should be 0 already
fastai_operation_block_with_constructor()   # should free everything up on return
collected = gc.collect()                    # should be 0 too - != 0 means we have circular references
garbage_after = len(gc.garbage)             # again, should be 0, or == garbage_before

and add the asserts…
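e.g. something like:

assert collected == 0, f"gc.collect() reclaimed {collected} objects - reference cycles detected"
assert garbage_after == garbage_before, "uncollectable objects were leaked"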

I am not sure whether it’d be the best to have a single test module dedicated to memory leaks or to spread them out to the various test modules that test specific functions of fastai.

edit: added a test that reproduces the leak using gc - so it works well as I hoped it would. https://github.com/fastai/fastai/commit/2f9697ac4dec048a43dbb0a16ec85334ce95b069

I try to make my notebooks as close as possible to normal programs precisely for that reason - I want to be able to reproduce what I did, and jumping back and forth between cells doesn’t help.

I often group many cells into one so that I can perform blocks of operations in a consistent manner. And I have a feeling I will be moving more into functions, since most of the time nb global variables are difficult to work with - you never know whether they are up-to-date when jumping between cells, which often leads to misleading outcomes and a lot of wasted time.

3 Likes

That would be a reason why not to use __del__ . What do you have against weakref?

I’m not sure where in my last follow up you found any words that suggest that I’m against weakref.

We are exploring different options, __del__ being one of them.

To me __del__ sounds like the correct option because if the programmer purposefully created a circular reference it’s his duty to undo it.

I have yet to use weakref in python, so please kindly answer the following question: will using it result in the object being freed automatically when it goes out of scope, or will that only happen during garbage collection? If the former, then we can try to see which of the two ways is better. If the latter, then it’s not good, since gc might not be called for a long time and the leaked object will meanwhile keep tied up potentially huge amounts of memory.

Current python memory management has several layers: automatic freeing of an object as soon as its reference count drops to zero (which is what we want here), and then the real gc layer, which scans the whole object graph, magically finds circular references and other unreachable things, and cleans those up.

The unclear part to me is this:

A weak reference to an object is not enough to keep the object alive: when the only remaining references to a referent are weak references, garbage collection is free to destroy the referent and reuse its memory for something else. However, until the object is actually destroyed the weak reference may return the object even if there are no strong references to it.

It’s not clear from this text whether it refers to the automatic freeing that happens when the ref count goes to zero, or to a scheduled/manual gc.collect(). I suppose the easiest thing to try is for you to apply the changes you suggested and then run the test I have added: add a test that reproduces a memory leak in learner (skipped state) · fastai/fastai@2f9697a · GitHub - its outcome will tell us right away which is which.

Also, from reading weakref — Weak references — Python 3.12.5 documentation it doesn’t sound like it’s the intended use. But again I haven’t worked with this method myself yet, so please kindly share your experience and the merits/demerits of using weakref vs. __del__.

Thank you.

1 Like

Not sure now :). Maybe because __del__ is a rather fragile approach, so I assumed you had some bad experience with weakref to suggest something with that kind of drawback. Btw. since python 3.4 (PEP 442) the issue with __del__ has been solved and the objects will still be collected even if an exception is thrown.

I think so. Weakrefs are different from normal references - you may think of them as not existing from the GC’s point of view. When the (normal) reference count drops to zero, the object is removed and all weakrefs to it are cleared. Try:

import weakref

class AClass(): pass
a = weakref.ref(AClass())
a()  # returns None, as the AClass instance is created and destroyed immediately
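And for contrast, continuing from the snippet above: while a strong reference still exists the weakref returns the object, and it is cleared the moment the last strong reference goes away - no gc.collect() needed for acyclic objects:

obj = AClass()
r = weakref.ref(obj)
r()        # <AClass object at ...> - a strong reference still exists
del obj    # refcount drops to zero, the object is destroyed immediately
r()        # None - the weakref has been cleared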

I guess this is a version of mark & sweep, i.e. gc goes through the whole object graph marking all the objects it has access to. Once finished, it removes all the objects that weren’t marked.

This is especially a problem for pytorch, because the GC does not see the memory allocated on the GPU, and the object’s size in RAM is negligible. Normally this isn’t a big issue, because whenever your memory consumption exceeds some threshold gc will fire and clean up. Not in our case though, as a learner or model is tiny in RAM.
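A small illustration of that point (the numbers are approximate and depend on the setup):

import sys
import torch

t = torch.ones((8192, 8192), device="cuda")  # ~256MB of float32 data on the GPU
print(sys.getsizeof(t))                       # on the order of a hundred bytes of host RAM - all gc sees
print(torch.cuda.memory_allocated())          # the real cost in bytes, invisible to gc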

Weakrefs can point to objects that have cyclic references, i.e. objects that aren’t immediately collected. In such cases the object will remain reachable for some time, until it is garbage collected.

I’ve worked with weakrefs in Java and with reference counting in Objective-C, so I’m extrapolating my knowledge. But I’m pretty sure that anything designed for building caches is designed to cut cycles in references. I guess it is not stated as the primary reason in the docs because in normal circumstances you wouldn’t care whether GC fires or not, as it fires at the right moments. That is not the case when you use numpy or torch, as your real memory utilization is invisible to the gc.

I hope it helps.

1 Like

@stas, @sgugger

To implement the weakrefs I would need my first PR merged: https://github.com/fastai/fastai/pull/1138

Could you have a look there?

1 Like

I’ll have a look at this and the QRNN stuff, but probably not before tomorrow as the course needs to be ready for tonight and a big refactor is on its way.
Thanks for your work!

1 Like

I have the impression that when gc.collect() is placed just next to torch.cuda.empty_cache(), there is a yield that ensures torch.cuda.empty_cache() takes effect before the app allocates cuda memory again?

1 Like

Thank you for the detailed follow up, @piotr.czapla. Let’s put this into action and see how it works out.

The trigger for an internal gc.collect in current python doesn’t care about the size of consumed memory; it’s the difference between the number of allocated and freed objects - 700 by default, see the gc — Garbage Collector interface documentation.

edit: and there are 3 levels of that functionality to make it much more efficient, a good summary of its workings can be found at Garbage collection in Python: things you need to know | Artem Golubin.
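For reference, those thresholds can be inspected and tuned from code:

import gc

print(gc.get_threshold())  # (700, 10, 10) by default
print(gc.get_count())      # current allocations-minus-deallocations counters per generation
# collect more aggressively by lowering the gen0 threshold (just an example value):
gc.set_threshold(200, 10, 10)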

1 Like

And when it happens, and you experiment with weakref - please make sure that this new test, which currently gets skipped since it fails, starts working:

Thank you for your contribution, @piotr.czapla

I am not answering your question, but just have a remark about empty_cache().

torch.cuda.empty_cache() is only useful if you have more than one process using pytorch. Otherwise you don’t need to empty the cache - it’ll get re-used automatically. It’s a useful tool when debugging memory leaks, and you can also ask pytorch to tell you how much memory is cached: https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
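For example (a sketch; depending on the pytorch version the cache counter is torch.cuda.memory_cached() or its newer name torch.cuda.memory_reserved()):

import torch

print(torch.cuda.memory_allocated())  # bytes actually held by live tensors
print(torch.cuda.memory_cached())     # bytes held in pytorch's caching allocator
torch.cuda.empty_cache()              # hand the unused cached blocks back to the GPU
print(torch.cuda.memory_cached())     # should drop (the allocated part stays)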

1 Like

Thank you! You are really precise in your research. It explains quite a bit, and it is a better choice given libraries like numpy.

I have the first implementation working, but let’s wait for the refactor that @sgugger is preparing.

1 Like

So let’s refocus on the purpose of this thread, which got a bit side-tracked by discovered memory leaks.

Here is the summary so far:

We want to be able to re-use the GPU RAM either after some earlier experiments in the notebook have been completed, or after running into ‘cuda: out-of-memory’ and needing to roll back to some earlier state where we could use a different bs, bptt, etc., to accommodate the current card’s memory limitations.

So similar to saving intermediary data states, we want to have the same capability for processor states. The easiest way to accomplish that is by bundling several cells into a function; then, with the help of gc.collect(), we can regain the memory lost to the execution of that function. So the example I used in the first post:

def block():
    learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
    learn.lr_find()

block()
gc.collect()
torch.cuda.empty_cache()

should give us back all the memory consumed by that function (once the leaks have been fixed).

However, this is not how the teaching notebooks have been written - those are written with about one call per cell, i.e. spread out through several cells. In that case we need to go and manually destroy the objects we no longer need:

learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
learn.lr_find()
del learn
gc.collect()
torch.cuda.empty_cache()

and the same effect will be achieved.

However it’s a slow and error-prone process, trying to hunt down all the variables that were introduced in earlier cells we want to roll back, so ideally we need some mechanism that will automate that.

Other than implementing it from scratch, do you know of any python modules that can create sentinels throughout the nb code and record newly created global variables between each sentinel, so that we could then easily destroy them?

And finally, perhaps a notebook-level extension could be written that automatically records newly created global variables in each cell, so that with a click of a mouse we could roll back to any of the earlier cells - with the caveat that a variable may no longer be the same, unfortunately, since it could have been modified in later cells after its creation, but it would work correctly wrt memory release. There could probably be a lot of other issues with that too; I’m just thinking aloud here.

At the very least the teaching notebooks could have a few well positioned sentinels that a student could roll back to easily.

Really, what we are after is emulating user-defined variable scopes and self-destruction at the end of the scope. Except we can’t use functions, because we have multiple statements spread out through several cells. So the way I envision it is:

cell 1: scope1 = create_new_scope()
cell 2: learn1 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 3: learn1.lr_find()
cell 4: scope1.destroy()
cell 5: scope2 = create_new_scope()
cell 6: learn2 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 7: learn2.lr_find()
cell 8: scope2.destroy()

and the last command will delete the new variables and call gc.collect() and torch.cuda.empty_cache() - so we are emulating a sort of fixture over multiple notebook cells.

So now at any point you can go back to cell 1 or cell 5 and re-run the experiment, optionally after modifying it, and without needing to restart the kernel and re-running the setup cells at the beginning of the notebook.

and of course, if you re-use the same global variable, say learn, the previous version will automatically go out of scope, so you only need to force gc.collect() to make it free up the memory it holds, if the object has circular references.
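Here is a minimal sketch of how such a create_new_scope()/destroy() pair could work by diffing the notebook’s global namespace - hypothetical names and implementation, not the actual ipyexperiments code:

import gc
import torch
from IPython import get_ipython

class NotebookScope:
    def __init__(self):
        # snapshot the notebook's global namespace at scope creation time
        # (assumes we are running inside IPython/Jupyter)
        self.namespace = get_ipython().user_ns
        self.preexisting = set(self.namespace)

    def destroy(self):
        # delete every global created since the scope was opened
        for name in set(self.namespace) - self.preexisting:
            del self.namespace[name]
        gc.collect()               # break any leftover reference cycles
        torch.cuda.empty_cache()   # hand the freed cached blocks back to the GPU

def create_new_scope():
    return NotebookScope()

So in the example above scope1.destroy() would delete learn1 and release the GPU RAM it was holding.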

2 Likes

@stas and @piotr.czapla sorry to maybe sidetrack the discussion with a proposal.

Wouldn’t it be simpler and more useful to continue down stas’ “def block” approach by creating an Experiment and an ExperimentManager, so that you use the cells to define experiments and register them with the ExperimentManager in whatever run order you prefer?

class MyExp1(Experiment):
    def run(self): ...

em = ExperimentManager()
em.addExp(MyExp1())
em.addExp(MyExp2())
em.run()
em.report()
em.clean()

This approach would also have the advantage of making experiments more manageable for the user, and with a report function for hyperparameters and progress (losses and metrics) I would personally be in 7th heaven.

It surely is a way, and nothing stops you from doing that already. In particular, since you run them all at once in your pseudo-code, you could just write a single function in the first place - that’s your experiment, since you aren’t really taking advantage of the separate cells…

The way your pseudo-code is written it will take a lot more typing and you’re not really gaining anything by doing that.

Being able to just keep the cells as they are now would be much nicer. The only difference in my proposal is a small number of additional cells with sentinels, sprinkled at strategic places.

OK, here is the initial implementation of the concept: https://github.com/stas00/ipyexperiments

Please let me know what you think.

The demo notebook is here.

1 Like

@stas, the experiments look super cool! It is super useful to show the memory utilisation for batch size searching. I’m planning to get into the memory management of fastai v1 to figure out why I’m getting random OOM exceptions during language model training. I will also remove the cyclical reference of the callbacks.

1 Like