IPyExperiments: Getting the most out of your GPU RAM in Jupyter notebooks

Thank you! You are really precise in your research. It explains quite a bit, and it is a better choice given libraries like numpy.

I have the first implementation working, but let’s wait for the refactor that @sgugger is preparing.

1 Like

So let’s refocus on the purpose of this thread, which got a bit side-tracked by the memory leaks we discovered.

Here is the summary so far:

We want to be able to re-use the GPU RAM either after some earlier experiments in the notebook have completed, or after running into ‘cuda: out-of-memory’, when we need to roll back to some earlier state and try a different bs, bptt, etc., to fit the current card’s memory limitations.

So, similar to saving intermediate data states, we want the same capability for processor states. The easiest way to accomplish that is to bundle several cells into a function; then, with the help of gc.collect(), we can regain the memory lost to the execution of that function. Here is the example I used in the first post:

import gc
import torch
from fastai.text import *  # provides language_model_learner and URLs

def block():
    # data_lm is assumed to have been created earlier in the notebook;
    # everything created here is local and goes out of scope on return
    learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
    learn.lr_find()

block()
gc.collect()              # break any circular references left behind
torch.cuda.empty_cache()  # return the freed cached memory to the GPU

should give us all the memory consumed by that function back (once the leaks have been fixed).

However, this is not how the teaching notebooks are written - they use roughly one call per cell, i.e. the code is spread out across several cells. In that case we need to manually destroy the objects we no longer need:

learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
learn.lr_find()
del learn                 # drop the last reference to the object
gc.collect()              # collect it even if it is part of a reference cycle
torch.cuda.empty_cache()  # release the now-unused cached memory back to the GPU

and the same effect will be achieved.

However, it’s a slow and error-prone process to hunt down all the variables introduced in the earlier cells we want to roll back, so ideally we need some mechanism to automate that.

Short of implementing it from scratch, do you know of any Python modules that can place sentinels through the notebook code and record the newly created global variables between each pair of sentinels, so that we could easily destroy them?

And finally, perhaps a notebook-level extension could be written that automatically records the newly created global variables in each cell, so that with a click of a mouse we could roll back to any of the earlier cells. The caveat is that a variable would unfortunately no longer hold the same value if later cells modified it, but the memory release would work correctly. And there could probably be a lot of other issues with that too. I’m just thinking aloud here.

At the very least the teaching notebooks could have a few well-positioned sentinels that a student could easily roll back to.

Really, what we are after is emulating user-defined variable scopes, with self-destruction at the end of each scope. Except we can’t use functions, because the statements are spread out through several cells. So the way I envision it is:

cell 1: scope1 = create_new_scope()
cell 2: learn1 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 3: learn1.lr_find()
cell 4: scope1.destroy
cell 5: scope2 = create_new_scope()
cell 6: learn2 = language_model_learner(data_lm, bptt=70, drop_mult=0.3, pretrained_model=URLs.WT103)
cell 7: learn2.lr_find()
cell 8: scope2.destroy

and the destroy command will delete the new variables and call gc.collect() and torch.cuda.empty_cache() - so we are emulating a sort of fixture spanning multiple notebook cells.

So now at any point you can go back to cell 1 or cell 5 and re-run the experiment, optionally after modifying it, and without needing to restart the kernel and re-running the setup cells at the beginning of the notebook.
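Here is a minimal sketch of what such a scope object could look like (the Scope class, its attributes, and the use of IPython’s user namespace are hypothetical illustrations of the idea, not an existing module):

import gc
import torch
from IPython import get_ipython

class Scope:
    def __init__(self):
        # snapshot which globals exist at the moment the scope is created
        self.start_vars = set(get_ipython().user_ns.keys())

    def destroy(self):
        ns = get_ipython().user_ns
        # delete only the globals that appeared after the snapshot
        for name in set(ns.keys()) - self.start_vars:
            del ns[name]
        gc.collect()
        torch.cuda.empty_cache()

def create_new_scope():
    return Scope()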

And of course, if you re-use the same global variable, say learn, the previous object automatically goes out of scope, so you only need to force gc.collect() to make it free the memory it holds if the object has circular references.
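To see why gc.collect() is needed in the circular-reference case, here is a tiny standalone example (nothing fastai-specific):

import gc

class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other, b.other = b, a   # the two objects now reference each other
a = b = None              # rebinding the names does not free the cycle...
gc.collect()              # ...but an explicit collection pass does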

2 Likes

@stas and @piotr.czapla, sorry to maybe sidetrack the discussion with a proposal.

Wouldn’t it be simpler and more useful to continue down stas’ “def block” approach by creating an Experiment and an ExperimentManager, so that you use the cells to define experiments which you register with the ExperimentManager in whatever run order you prefer?

class MyExp1(Experiment):
    def run(self):
        ...

em = ExperimentManager()
em.addExp(MyExp1())
em.addExp(MyExp2())
em.run()
em.report()
em.clean()

This approach would also have the advantage of making experiments more manageable for the user, and with a report function covering hyperparameters and progress (losses and metrics) I would personally be in seventh heaven.

It surely is a way, and nothing stops you from doing that already. In particular, since you run them all at once in your pseudo-code, you can just write a single function in the first place - it’s your experiment, since you aren’t really taking advantage of the separate cells…

The way your pseudo-code is written, it will take a lot more typing, and you’re not really gaining anything by doing it that way.

Being able to just keep the cells as they are now would be much nicer. The only difference in my proposal is a small number of additional cells with sentinels, sprinkled at strategic places.

OK, here is the initial implementation of the concept: https://github.com/stas00/ipyexperiments

Please let me know what you think.

The demo notebook is here.

1 Like

@stas, the experiments look super cool! It is super useful to show the memory utilisation for batch-size searching. I’m planning to get into the memory management of fastai v1 to figure out why I’m getting random OOM exceptions during language model training. I will also remove the cyclical reference in the callbacks.

1 Like

Thank you for your feedback, @piotr.czapla.

I’d like to make the amount of data it prints customizable, since some users would probably prefer it more terse - though probably never silent, since otherwise it’d be hard to tell whether it ran at all. So by default it can be terse, printing the consumed/reclaimed data in a tight one- or two-liner, and if someone wants the current verbose output, it can be enabled via a constructor argument.

Any other data to collect? Time, duration?

I would print the parameters that you remove from the global scope, to minimise the amount of surprises. Actually, we could replace the global variables with a proxy object that, when printed, would tell the user what just happened to their variable, or throw an error when a member is accessed on such an object. What do you think?
I’m not sure regarding time, as that might depend on the user.
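For illustration, such a proxy could look something like this (a hypothetical sketch; this idea was not adopted, as discussed below):

class DeletedVariableProxy:
    """Stands in for a deleted variable and explains what happened to it."""
    def __init__(self, name):
        self._name = name

    def __repr__(self):
        return f"<'{self._name}' was deleted to reclaim its memory>"

    def __getattr__(self, attr):
        raise AttributeError(f"'{self._name}' was deleted to reclaim its memory")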

1 Like

Good idea.

Actually, we could replace the global variables with a proxy object that, when printed, would tell the user what just happened to their variable, or throw an error when a member is accessed on such an object. What do you think?

I think if you access a variable and get an error that it doesn’t exist, that’s the best telltale sign, no? Replacing it with something else would be more confusing: if the user forgets that they wanted the variables annihilated and tries to use them, rather than print them, the error would be even more confusing or misleading.

Basically, it’d behave exactly as if you were to jump into the middle of a notebook and run some cell with variables that were supposed to be initialized in earlier cells - you will get the same error here. Which is very consistent. And most seasoned Jupyter notebook users will instantly ask themselves: did I run the cells above?

1 Like

OK, besides the refactoring, I added a bunch of new changes:

  1. printing which variables got deleted
  2. added .get_stats() and .finish() methods, so that the user can get the numbers programmatically for even better experimentation
  3. elapsed wallclock time report

See the 3rd experiment in the demo notebook here to see the new methods in action.

The API is still wide-open, so any suggestions for improvement are welcome.

1 Like

Added:

  1. a way to prevent some local variables from deletion
  2. context manager support

See experiments 3 and 4 in the demo notebook here to see the new features in action.
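For example, the context-manager form would look roughly like this (the class name follows the backend naming used later in this thread; exact constructor arguments may differ - see the demo notebook for the authoritative usage):

from ipyexperiments import IPyExperimentsPytorch

with IPyExperimentsPytorch() as exp:
    learn = language_model_learner(data_lm, bptt=70, drop_mult=0.3,
                                   pretrained_model=URLs.WT103)
    learn.lr_find()
# on exit the experiment is destroyed and the variables it tracked are reclaimed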

  • replaced gputil with the much faster nvidia-ml-py3

I have no idea whether it works on Windows, but I see no reason why it shouldn’t, as it accesses the nvml library directly (see the query sketch after this list).

  • added a test suite
  • made the package available on pypi and conda
  • some minor fixes
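For reference, querying the GPU via nvidia-ml-py3 looks roughly like this (standard pynvml calls; how ipyexperiments wires them in internally may differ):

import pynvml  # installed by the nvidia-ml-py3 package

pynvml.nvmlInit()                              # talks to the nvml library directly
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .used / .free / .total, in bytes
print(f"used {info.used / 2**20:.0f}MB of {info.total / 2**20:.0f}MB")
pynvml.nvmlShutdown()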
2 Likes

I’ve found some time today to have a look at the cyclic references in Learner. Adding weak references to the callbacks fixes the issue, but we still have a cyclic reference in the scipy module, which cannot be easily fixed. I’ve updated the test to reflect this. https://github.com/fastai/fastai/pull/1375
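For context, the general weak-reference pattern for breaking such a cycle looks like this (a generic sketch with made-up class names, not the actual patch in the PR):

import weakref

class Learner:
    def __init__(self):
        self.callbacks = []

class Callback:
    def __init__(self, learn):
        # the callback holds only a weak proxy, so callback -> learner
        # no longer closes a reference cycle
        self.learn = weakref.proxy(learn)

learn = Learner()
learn.callbacks.append(Callback(learn))
del learn  # refcount drops to zero immediately; no gc.collect() pass needed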

1 Like

Recent changes:

  • on GPU backend loading, report the ID, name and total RAM of the selected GPU
  • print_state now gives an easier-to-read report

Some breaking changes in the last release:

  • made the module into proper subclasses, with no more global function aliases. So now use the desired backend directly: IPyExperimentsCPU or IPyExperimentsPytorch. It should be trivial now to add other backends.
  • the get_stats method has been replaced with the data property, which now returns one or more IPyExperimentMemory named tuples, depending on the subclass used.

Latest API is here: https://github.com/stas00/ipyexperiments#api
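To illustrate the renamed API (class and property names as listed above; the exact call pattern is an assumption - the API link is authoritative):

import torch
from ipyexperiments import IPyExperimentsPytorch

exp = IPyExperimentsPytorch()                # pick the backend subclass directly
x = torch.ones(2**13, 2**13, device='cuda')  # some GPU work under the experiment
print(exp.data)                              # IPyExperimentMemory named tuple(s)
del exp                                      # destroying the experiment reclaims x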

2 Likes

It was painful to maintain two somewhat similar systems, so I integrated both into one.

So the big change is: ipygpulogger got integrated into ipyexperiments.

I’d like to finalize the API and to make sure that all the reported numbers and their names make sense and are intuitive. So if you get a chance please kindly play with the latest version and let me know if anything is unclear/confusing/can be improved/etc.

Thank you.

3 Likes

These are helpful updates. I am making more progress on these tests, and I think the notebooks look cleaner now.
I am getting an error I don’t understand, which I believe comes from IPyExperiments.

Full notebook in this link. Error in cell 16.

The error is not thrown consistently, and it has something to do with the del call, but I can’t figure out what.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/backcall/backcall.py in adapted(*args, **kwargs)
    102                 kwargs.pop(name)
    103 #            print(args, kwargs, unmatched_pos, cut_positional, unmatched_kw)
--> 104             return callback(*args, **kwargs)
    105 
    106         return adapted

~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/ipyexperiments/cell_logger.py in post_run_cell(self)
    158 
    159         if self.backend != 'cpu':
--> 160             self.gpu_mem_used_new = self.exp.gpu_ram_used()
    161 
    162             # delta_used is the difference between current used mem and used mem at the start

AttributeError: 'NoneType' object has no attribute 'gpu_ram_used'

(FYI, I moved your report and my follow-up to the thread where it belongs, so that we don’t discuss off-topic things there.)

Yup, I’ve been battling with this one for a while.

It seems to have to do with Python threads. There is a peak-memory-manager Python thread and there is the IPython callback thread. I need to be able to do an atomic check-and-act and quit the thread if the check fails, but I’m not sure how one goes about this with Python threads. What happens now is that it intermittently fails at:

if are_we_still_running:
    do_something()

where it succeeds at the conditional check, then execution immediately yields to the main thread, and when it comes back do_something() fails, because the condition is no longer true.

Because of the two overlapping contexts - cell-level and notebook-level - which reference each other, I use a weakref proxy in order to avoid circular references and have del exp do the right thing. That’s where it fails: the proxied object is gone, but the thread still wants to run. So I’m not quite sure how to resolve this race condition.

And I can’t kill the IPython thread - that would be a disaster.

And if I keep a real reference to the parent object rather than a proxy, the sub-system will prevent it from being destroyed, which defeats the purpose of the experiment.
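One standard pattern that avoids the check-then-use race (a sketch under assumed names, not the current ipyexperiments code) is to store a weakref.ref rather than a proxy: dereferencing it returns either the live object or None, and the result is held in a local, so the aliveness test and the use cannot be separated by the object’s destruction:

import weakref

class CellLogger:
    def __init__(self, exp):
        # store a weak *ref*, not a proxy: calling it yields the object or None
        self.exp_ref = weakref.ref(exp)

    def post_run_cell(self):
        exp = self.exp_ref()   # atomic: either the live object or None
        if exp is None:
            return             # experiment already destroyed - bail out
        # 'exp' is a strong reference that keeps the object alive only
        # for the remainder of this callback
        self.gpu_mem_used_new = exp.gpu_ram_used()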

Perhaps I made a design mistake and it needs to be redone.

If you have the know-how please have a look.

I’ve not worked with Python threads before. I am glad this is a known issue for you.

Maybe this snippet can show the way. Copied from here: https://opensource.com/article/17/4/grok-gil

So, despite the GIL, you still need locks to protect shared mutable state:
import threading

n = 0
lock = threading.Lock()

def foo():
    global n
    with lock:
        n += 1

2 Likes