IPyExperiments: Getting the most out of your GPU RAM in jupyter notebook

stas · November 16, 2018, 5:34pm

Thank you for your feedback, @piotr.czapla.

I’d like to make the amount of output it prints customizable, since some people would probably prefer it to be terser. It should probably never be fully silent, since otherwise it’d be hard to tell whether it was run at all. So by default it could be terse, printing the consumed/reclaimed memory in a tight one- or two-line summary, and anyone who wants the verbose output it has now could ask for it via a constructor argument.

Any other data to collect? time, duration?

piotr.czapla · November 16, 2018, 6:35pm

I would print the variables that you remove from the global scope, to minimise the number of surprises. Actually we could replace the global variables with a proxy object that, when printed, would tell the user what just happened to their variable, or throw an error when a member is accessed on such an object. What do you think?
I’m not sure regarding time, as this might depend on the user.
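
A minimal sketch of the proxy idea described above. The `DeletedVariable` class is hypothetical and only illustrates the proposal; it is not part of ipyexperiments.

```python
class DeletedVariable:
    """Hypothetical placeholder left in place of a variable that was deleted."""

    def __init__(self, name):
        self._name = name

    def __repr__(self):
        # Printing the placeholder tells the user what happened.
        return f"<variable '{self._name}' was deleted when the experiment ended>"

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. for any member that the
        # original object had but this placeholder does not.
        name = self.__dict__.get("_name", "?")
        raise AttributeError(
            f"'{name}' was deleted when the experiment ended; "
            f"cannot access '{attr}'"
        )


learn = DeletedVariable("learn")
print(learn)          # explains that 'learn' was deleted
try:
    learn.fit         # any member access raises a descriptive error
except AttributeError as err:
    print(err)
```

In a notebook, such a placeholder would be assigned back into the global namespace under the original variable name.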

stas · November 16, 2018, 6:52pm

Good idea.

Actually we could replace the global variables with a proxy object that, when printed, would tell the user what just happened to their variable, or throw an error when a member is accessed on such an object. What do you think?

I think that if you access a variable and get an error that it doesn’t exist, that’s the best telltale sign, no? Replacing it with something else would be more confusing: if the user forgets that they wanted the variables to be annihilated and tries to use them, rather than print them, the resulting error would be even more confusing or misleading.

Basically, it’d behave exactly as if you were to jump into the middle of a notebook and run some cell with variables that were supposed to be initialized in earlier cells - you would get the same error here, which is very consistent. And most seasoned jupyter notebook users will instantly ask themselves - did I run the above cells?

stas · November 17, 2018, 2:05am

OK, other than refactoring I added a bunch of new changes.

printing what vars got deleted
added .get_stats() and .finish() methods so that the user can get the numbers programmatically, for even better experimentation.
elapsed wallclock time report

See the 3rd experiment in the demo notebook here to see the new methods in action.

The API is still wide-open so any suggestions for improvement are welcome.

stas · November 17, 2018, 9:20pm

Added:

a way to prevent some local variables from deletion
context manager support

See experiments 3 and 4 in the demo notebook here to see the new features in action.
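
To illustrate how the two additions fit together, here is a minimal sketch of a context manager that deletes the variables created inside its block, except for the names it is told to keep. The `ScopedExperiment` class and its `keep` argument are hypothetical and do not reflect the actual ipyexperiments API; see the demo notebook for the real one.

```python
class ScopedExperiment:
    """Hypothetical sketch: delete variables created inside a `with` block."""

    def __init__(self, namespace, keep=()):
        self.namespace = namespace   # e.g. globals() in a notebook
        self.keep = set(keep)        # names that must survive the cleanup
        self.deleted = []

    def __enter__(self):
        # Remember which names existed before the experiment started.
        self._before = set(self.namespace)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Delete everything that appeared during the block and is not kept.
        for name in sorted(set(self.namespace) - self._before - self.keep):
            del self.namespace[name]
            self.deleted.append(name)
        return False  # do not swallow exceptions


ns = {"data": [1, 2, 3]}             # stand-in for the notebook namespace
with ScopedExperiment(ns, keep={"result"}) as exp:
    ns["model"] = object()           # created inside: will be deleted
    ns["result"] = 0.93              # created inside, but protected
print("deleted:", exp.deleted)       # prints: deleted: ['model']
```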

stas · November 19, 2018, 5:27am

replaced GPUtil with the much faster nvidia-ml-py3

I have no idea whether it works on Windows, but I see no reason why it shouldn’t - as it accesses the nvml library directly.
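
For reference, a minimal sketch of querying GPU memory through the `pynvml` module that nvidia-ml-py3 installs. The helper name `gpu_memory_mib` is made up for this example, and this is not necessarily how ipyexperiments calls the library. It returns `None` when NVML or a GPU is not available.

```python
def gpu_memory_mib(index=0):
    """Return (used, free, total) in MiB for GPU `index`, or None if unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()            # loads the NVML shared library
    except Exception:
        return None                  # no pynvml module or no NVIDIA driver
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)   # values are in bytes
        return tuple(int(x) // 2**20 for x in (info.used, info.free, info.total))
    except pynvml.NVMLError:
        return None                  # e.g. no GPU with that index
    finally:
        pynvml.nvmlShutdown()


print(gpu_memory_mib())
```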

stas · December 20, 2018, 12:54am

added a test suite
made the package available on PyPI and conda
some minor fixes

piotr.czapla · December 21, 2018, 10:54pm

I’ve found some time today to have a look at the cyclic references of learner. Adding WeakReferences to callbacks fixes the issue but we still have a cyclic reference in scipy module which cannot be easily fixed. I’ve updated the test to reflect this. https://github.com/fastai/fastai/pull/1375
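
A minimal sketch of the weak-reference approach, using stand-in `Learner` and `Callback` classes rather than the real fastai ones. The callback holds only a weak reference to its learner, so no reference cycle is formed and reference counting alone can free the learner.

```python
import gc
import weakref


class Callback:
    def __init__(self, learner):
        # Weak reference: the callback does not keep the learner alive.
        self._learner = weakref.ref(learner)

    @property
    def learner(self):
        return self._learner()       # None once the learner has been freed


class Learner:
    def __init__(self):
        self.callbacks = [Callback(self)]


gc.disable()                         # make sure the cycle collector is not what frees it
try:
    learn = Learner()
    probe = weakref.ref(learn)
    del learn
    # On CPython the learner is expected to be gone right after `del`.
    freed_without_gc = probe() is None
finally:
    gc.enable()
print("freed without gc.collect():", freed_without_gc)
```

With a strong `self.learner = learner` in the callback instead, the learner and its callback would reference each other and only the cycle collector could reclaim them.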

stas · January 4, 2019, 6:47pm

Recent changes:

on GPU backend loading report the ID, Name and Total RAM of the selected GPU
print_state now gives an easier to read report

Some breaking changes in the last release:

turned the module into proper subclasses, with no more global function aliases. So now use the desired backend directly, IPyExperimentsCPU or IPyExperimentsPytorch, as the experiments module. It should be trivial now to add other backends.
the get_stats method has been replaced with the data property, which now returns one or more IPyExperimentMemory named tuple(s), depending on the subclass used.

Latest API is here: https://github.com/stas00/ipyexperiments#api

stas · January 16, 2019, 3:30am

It was painful to maintain two somewhat similar systems, so I integrated both into one.

So the big change is: ipygpulogger got integrated into ipyexperiments.

I’d like to finalize the API and to make sure that all the reported numbers and their names make sense and are intuitive. So if you get a chance please kindly play with the latest version and let me know if anything is unclear/confusing/can be improved/etc.

Thank you.

bfarzin · March 5, 2019, 6:43pm

These are helpful updates. I am making more progress on these tests and I think the notebooks look cleaner now.
I am getting an error I don’t understand, which I believe comes from IPyExperiments.

Full notebook in this link. Error in cell 16.

The error is not thrown consistently, and it has something to do with the del call, but I can’t figure out what.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/backcall/backcall.py in adapted(*args, **kwargs)
    102                 kwargs.pop(name)
    103 #            print(args, kwargs, unmatched_pos, cut_positional, unmatched_kw)
--> 104             return callback(*args, **kwargs)
    105 
    106         return adapted

~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/ipyexperiments/cell_logger.py in post_run_cell(self)
    158 
    159         if self.backend != 'cpu':
--> 160             self.gpu_mem_used_new = self.exp.gpu_ram_used()
    161 
    162             # delta_used is the difference between current used mem and used mem at the start

AttributeError: 'NoneType' object has no attribute 'gpu_ram_used'

stas · March 5, 2019, 7:23pm

(fyi, I moved your report and my followup to the thread where it belongs so that we don’t discuss off-topic things there.)

Yup, I’ve been battling with this one for a while.

It seems to have to do with Python threads. There is a peak memory manager Python thread and there is the ipython callback thread. I need to be able to do an atomic check and quit the thread if that check fails, but I’m not sure how one goes about this with Python threads. What happens now is that it intermittently fails at:

if are_we_still_running:
   do_something()

where the conditional check succeeds, control is immediately yielded to the main process, and when it comes back do_something fails, because the condition is no longer true.

There are two overlapping contexts, cell-level and notebook-level, which reference each other. To avoid circular references and have del exp do the right thing, I use a weakref proxy. That’s where it fails: the object behind the proxy is gone and the thread still wants to run. So I’m not quite sure how to resolve this race condition.

And I can’t kill the ipython thread, that would be a disaster.

And if I keep the real parent object and not a proxy the sub-system will prevent it from being destroyed, which defeats the purpose of the experiment.

Perhaps I made a design mistake and it needs to be redone.

If you have the know-how please have a look.
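
One possible way to make the check-then-use step safe, sketched under assumptions: use `weakref.ref` instead of `weakref.proxy` and dereference it once into a local variable. That local is a temporary strong reference, so the object cannot disappear between the check and the call, yet it is released as soon as the method returns. `Experiment` and `CellLogger` are stand-ins here, with `gpu_ram_used` and `post_run_cell` borrowed from the traceback above; this is not necessarily how the issue was eventually fixed.

```python
import weakref


class Experiment:
    def gpu_ram_used(self):
        return 0                     # placeholder value for the sketch


class CellLogger:
    def __init__(self, exp):
        # weakref.ref rather than weakref.proxy: it can be turned into a
        # real (strong) reference in a single step.
        self._exp_ref = weakref.ref(exp)

    def post_run_cell(self):
        exp = self._exp_ref()        # strong reference, or None if already gone
        if exp is None:
            return None              # experiment was deleted: quietly stop
        # `exp` stays alive until this method returns, so the call is safe.
        return exp.gpu_ram_used()


exp = Experiment()
logger = CellLogger(exp)
print(logger.post_run_cell())        # experiment alive: returns the reading
del exp
print(logger.post_run_cell())        # experiment gone: returns None
```

The trade-off is that destruction of the experiment can be delayed until a callback that is already in progress returns.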

bfarzin · March 5, 2019, 7:26pm

I’ve not worked with python threads before. I am glad this is a known issue for you.

Kaspar · March 6, 2019, 3:18pm

Maybe this snippet can show the way. Copied from here: Grok the GIL: How to write fast and thread-safe Python | Opensource.com

So, despite the GIL, you still need locks to protect shared mutable state:
import threading

n = 0
lock = threading.Lock()

def foo():
    global n
    # Only one thread at a time may run the read-modify-write below.
    with lock:
        n += 1

stas · March 6, 2019, 6:42pm

Yes, I made a first attempt with a lock yesterday, and the test suite caught a deadlock, so I need to try harder. But yes, it seems to be the only way to make it thread-safe.

Thank you for the link, @Kaspar - this was an excellent article (and the comments after fix some of the incorrect things said in the article).

balnazzar · March 11, 2019, 6:48pm

Hi! First of all, thanks for having shared ipyexperiments, @stas. I find it quite useful in my everyday practice.

The tool works flawlessly on my 1080 Ti, but it has some issues on the Tesla V100 I use at work. Sometimes, it even reports negative memory usage.

See this notebook: https://github.com/terribilissimo/otherstuff/blob/master/TESTs-FP16_V100.ipynb

Any idea about that strange behaviour?

stas · March 11, 2019, 8:49pm

The negative mem report has been fixed in master. I’m just waiting to find time to fix the thread race condition, and then I will make a new release. Until then, install:

pip install git+https://github.com/stas00/ipyexperiments

what else is not working?

balnazzar · March 12, 2019, 1:30pm

Mhh, apart from the negative mem, it also reports wrong values when they are positive, as you may see in the notebook.
But maybe it’s part of the same issue as the negative mem.

Thanks!

stas · March 12, 2019, 3:22pm

Can you please be more specific? I don’t know how to tell from your notebook which value is wrong.

Perhaps re-run it after updating to the master first and then let me know which numbers are off and why you think they are so. Thank you.

balnazzar · March 12, 2019, 4:38pm

Ok, I updated ipyexperiments and then tried it again over the same notebook.

At the end of the training phase, it reports a peak memory utilisation of 85 MB, while via gpustat I saw 11602 MB during training.