So let’s refocus on the purpose of this thread, which got a bit side-tracked by discovered memory leaks.
Here is the summary so far:
We want to be able to either re-use the GPU RAM after some earlier experiments in the notebook have been completed, or, after running into ‘cuda: out-of-memory’, roll back to an earlier state where we could instrument a different bs, bptt, etc., to fit within the current card’s memory limitations.
So, similar to saving intermediate data states, we want the same capability for processor states. The easiest way to accomplish that is by bundling several cells into a function; with the help of gc.collect() we can regain the memory lost to the execution of that function. Here is the example I used in the first post:
def block():
    learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
    learn.lr_find()

block()
gc.collect()
torch.cuda.empty_cache()
should give us all the memory consumed by that function back (once the leaks have been fixed).
However, this is not how the teaching notebooks are written - they use roughly one call per cell, i.e. the code is spread out over several cells. In that case we need to manually destroy the objects we no longer need:
learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
learn.lr_find()
del learn
gc.collect()
torch.cuda.empty_cache()
and the same effect will be achieved.
However, this is a slow and error-prone process - hunting down all the variables introduced in the earlier cells we want to roll back - so ideally we need some mechanism to automate that.
Other than implementing it from scratch, do you know of any python modules that can place sentinels through the notebook code, record the global variables newly created between each pair of sentinels, and then let us easily destroy them?
And finally, perhaps a notebook-level extension could be written that automatically records the global variables newly created in each cell, so that with a click of a mouse we could roll back to any of the earlier cells. The caveat is that a variable would no longer be the same object, since it could have been modified in later cells after its creation, but the memory release would work correctly. There are probably plenty of other issues with that too - I’m just thinking aloud here.
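For what it’s worth, here is a rough, untested sketch of how such an extension could track per-cell globals using IPython’s event hooks; the names cell_vars and rollback are made up for illustration, not an existing module:

import gc
import torch
from IPython import get_ipython

ip = get_ipython()
cell_vars = []              # one entry per executed cell: the names it introduced
_before = set(ip.user_ns)   # snapshot of the namespace before the next cell runs

def _pre_run_cell(info=None):
    global _before
    _before = set(ip.user_ns)

def _post_run_cell(result=None):
    cell_vars.append(set(ip.user_ns) - _before)

ip.events.register('pre_run_cell', _pre_run_cell)
ip.events.register('post_run_cell', _post_run_cell)

def rollback(n_cells):
    # delete the globals created in the last n_cells tracked cells, then free memory
    for names in cell_vars[-n_cells:]:
        for name in names:
            ip.user_ns.pop(name, None)
    del cell_vars[-n_cells:]
    gc.collect()
    torch.cuda.empty_cache()

A real extension would also want to skip IPython’s own bookkeeping names (In, Out, _, etc.), but that’s the gist.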
At the very least the teaching notebooks could have a few well positioned sentinels that a student could roll back to easily.
Really, what we are after is emulating user-defined variable scopes and self-destruction at the end of the scope. Except we can’t use functions, because we have multiple statements spread out through several cells. So the way I envision it is:
cell 1: scope1 = create_new_scope()
cell 2: learn1 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 3: learn1.lr_find()
cell 4: scope1.destroy()
cell 5: scope2 = create_new_scope()
cell 6: learn2 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 7: learn2.lr_find()
cell 8: scope2.destroy()
and the last command will delete the new variables, call gc.collect() and torch.cuda.empty_cache() - so we are emulating a sort of fixture over multiple notebook cells.
So now at any point you can go back to cell 1 or cell 5 and re-run the experiment, optionally after modifying it, without needing to restart the kernel and re-run the setup cells at the beginning of the notebook.
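If no existing module turns up, a minimal sketch of what create_new_scope() could look like, built on snapshotting globals(), might already be enough (NotebookScope is an invented name here):

import gc
import torch

class NotebookScope:
    def __init__(self):
        # remember which globals exist at the moment the scope is opened
        self._initial = set(globals())

    def destroy(self):
        # delete every global created since the scope was opened, then reclaim memory
        for name in set(globals()) - self._initial:
            del globals()[name]
        gc.collect()
        torch.cuda.empty_cache()

def create_new_scope():
    return NotebookScope()

Since scope1 itself is created after the snapshot, destroy() also removes the scope variable, which is probably what we want.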
And of course, if you re-use the same global variable, say learn, the previous version automatically goes out of scope, so you only need to force gc.collect() to make it free up the memory it holds if the object has circular references.
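To illustrate with the same fastai call as above (assuming data_lm and the rest of the setup already exist in the notebook):

learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
learn.lr_find()

# re-binding `learn` drops the last reference to the previous learner;
# gc.collect() is only needed to break reference cycles before the memory is returned
learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
gc.collect()
torch.cuda.empty_cache()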