IPyExperiments: Getting the most out of your GPU RAM in jupyter notebook

  • replaced gputil with much faster nvidia-ml-py3

I have no idea whether it works on Windows, but I see no reason why it shouldn't, since it accesses the nvml library directly (see the short pynvml sketch after this list).

  • added a test suite
  • made the package available on pypi and conda
  • some minor fixes
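
For the curious, querying nvml directly via the pynvml module (which nvidia-ml-py3 provides) looks roughly like this; this is the generic NVML API, not ipyexperiments' internal code:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"total={mem.total >> 20}MB used={mem.used >> 20}MB free={mem.free >> 20}MB")
    pynvml.nvmlShutdown()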
2 Likes

I’ve found some time today to have a look at the cyclic references in Learner. Adding weak references to the callbacks fixes the issue, but we still have a cyclic reference in the scipy module which cannot be easily fixed. I’ve updated the test to reflect this. https://github.com/fastai/fastai/pull/1375
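
For those following along, the general pattern is to have the callback hold a weak reference to the learner instead of a strong one, so that the learner -> callback -> learner loop is no longer a strong cycle. A simplified sketch of the idea (not the actual fastai code):

    import weakref

    class Learner:
        def __init__(self): self.callbacks = []

    class Callback:
        def __init__(self, learner):
            self.learn = weakref.ref(learner)    # weak ref breaks the strong cycle
        def on_train_end(self):
            learner = self.learn()               # None once the learner is gone
            if learner is not None:
                print(f"{len(learner.callbacks)} callbacks attached")

    learn = Learner()
    learn.callbacks.append(Callback(learn))
    del learn    # freed immediately via refcounting, no gc cycle collection needed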

1 Like

Recent changes:

  • on GPU backend loading, report the ID, name, and total RAM of the selected GPU
  • print_state now gives an easier-to-read report

Some breaking changes in the last release:

  • the module has been reworked into proper subclasses, with no more global function aliases. Now use the desired backend directly: IPyExperimentsCPU or IPyExperimentsPytorch. It should now be trivial to add other backends.
  • the get_stats method has been replaced with the data property, which returns one or more IPyExperimentMemory named tuples, depending on the subclass used (see the sketch after the API link below).

Latest API is here: https://github.com/stas00/ipyexperiments#api
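
A minimal usage sketch, going by the description above (the exact shape of what data returns may differ per backend, so treat the unpacking below as an assumption and check the API doc):

    from ipyexperiments import IPyExperimentsPytorch

    exp = IPyExperimentsPytorch()      # start an experiment scope in the notebook

    # ... run some cells that allocate CPU and GPU RAM ...

    cpu_data, gpu_data = exp.data      # IPyExperimentMemory named tuple(s)
    print(cpu_data)
    print(gpu_data)

    del exp                            # finish the experiment and reclaim its objects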

2 Likes

It was painful to maintain two somewhat similar systems, so I integrated both into one.

So the big change is: ipygpulogger got integrated into ipyexperiments.

I’d like to finalize the API and make sure that all the reported numbers and their names make sense and are intuitive. So if you get a chance, please play with the latest version and let me know if anything is unclear, confusing, or could be improved.

Thank you.

3 Likes

These are helpful updates. I am making more progress on these tests and I think the notebooks look cleaner now.
I am getting an error I don’t understand, which I believe comes from IPyExperiments.

Full notebook is in this link. The error is in cell 16.

The error is not thrown consistently, and it has something to do with the del call, but I can’t figure out what.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/backcall/backcall.py in adapted(*args, **kwargs)
    102                 kwargs.pop(name)
    103 #            print(args, kwargs, unmatched_pos, cut_positional, unmatched_kw)
--> 104             return callback(*args, **kwargs)
    105 
    106         return adapted

~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/ipyexperiments/cell_logger.py in post_run_cell(self)
    158 
    159         if self.backend != 'cpu':
--> 160             self.gpu_mem_used_new = self.exp.gpu_ram_used()
    161 
    162             # delta_used is the difference between current used mem and used mem at the start

AttributeError: 'NoneType' object has no attribute 'gpu_ram_used'

(FYI, I moved your report and my follow-up to the thread where it belongs, so that we don’t discuss off-topic things there.)

Yup, I’ve been battling with this one for a while.

It seems to have to do with python threads. There is a peak memory manager python thread and there is the ipython callback thread. I need to be able to do an atomic check and quit the thread if that check fails, but I’m not sure how one goes about this with python threads. What happens now is that it intermittently fails at:

if are_we_still_running:
   do_something()

where it succeeds at the conditional check, control gets immediately yielded back to the main thread, and when it comes back do_something fails because the condition is no longer true.

Because of the two overlapping contexts (cell-level and notebook-level), which reference each other, I use a weakref proxy in order to avoid circular references and to have del exp do the right thing. That’s where it fails: the proxy is gone, yet the thread still wants to run. So I’m not quite sure how to resolve this race condition.

And I can’t kill the ipython thread; that would be a disaster.

And if I keep the real parent object instead of a proxy, the sub-system will prevent it from being destroyed, which defeats the purpose of the experiment.
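
In case a standalone repro helps: the core problem is that any access through the proxy can fail once the parent is gone, so the thread has to treat every access as fallible (catch ReferenceError) rather than rely on a separate aliveness check. A tiny sketch with hypothetical names, not the real cell_logger code:

    import threading, time, weakref

    class Experiment:
        def gpu_ram_used(self):
            return 0                         # stand-in for the real nvml query

    def monitor(proxy, stop_event):
        while not stop_event.is_set():
            try:
                proxy.gpu_ram_used()         # may raise at any moment
            except ReferenceError:           # parent experiment already deleted
                break
            time.sleep(0.001)

    exp = Experiment()
    stop = threading.Event()
    t = threading.Thread(target=monitor, args=(weakref.proxy(exp), stop))
    t.start()
    del exp                                  # proxy dies; the thread exits via ReferenceError
    stop.set()
    t.join()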

Perhaps I made a design mistake and it needs to be redone.

If you have the know-how please have a look.

I’ve not worked with python threads before. I am glad this is a known issue for you.

Maybe this snippet can show the way. Copied from here: https://opensource.com/article/17/4/grok-gil

So, despite the GIL, you still need locks to protect shared mutable state:

import threading

n = 0
lock = threading.Lock()

def foo():
    global n
    with lock:
        n += 1

2 Likes

Yes, I made a first attempt with a lock yesterday, and thanks to the test suite it caught a deadlock, so I need to try harder. But yes, it seems to be the only way to make it thread-safe.
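
For reference, a sketch of one way to make the check-then-act atomic without deadlocking: flip the flag and do the check-plus-action under the same lock, keep the critical section short, and never hold the lock while calling out to code that might grab it again (an RLock is cheap insurance if re-entry can happen). Hypothetical names, not the actual cell_logger:

    import threading

    class Monitor:
        def __init__(self):
            self.lock = threading.RLock()    # re-entrant, in case stop() gets called from within a callback
            self.running = True

        def stop(self):
            with self.lock:
                self.running = False         # flipped under the same lock ...

        def poll(self, read_mem):
            with self.lock:                  # ... so this check-then-act is atomic
                if not self.running:
                    return None
                return read_mem()

    m = Monitor()
    print(m.poll(lambda: 42))    # -> 42
    m.stop()
    print(m.poll(lambda: 42))    # -> None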

Thank you for the link, @Kaspar - this was an excellent article (and the comments that follow correct some of the inaccurate statements in the article).

Hi! First of all, thanks for having shared ipyexperiments, @stas. I find it quite useful in my everyday practice.

The tool works flawlessly on my 1080ti, but it has some issues on the Tesla V100 I use at work. Sometimes, it even reports a negative memory usage.

See this notebook: https://github.com/terribilissimo/otherstuff/blob/master/TESTs-FP16_V100.ipynb

Any idea about that strange behaviour?

1 Like

The negative memory report has been fixed in master. I’m just waiting to find time to fix the thread race condition, and then I’ll make a new release. Until then, install:

pip install git+https://github.com/stas00/ipyexperiments

What else is not working?

1 Like

Hmm, apart from the negative memory, it reports a wrong value (when positive), as you may see in the notebook.
But maybe it’s part of the same issue as the negative memory.

Thanks!

Can you please be more specific? I don’t know how to tell from your notebook which value is wrong.

Perhaps re-run it after updating to the master first and then let me know which numbers are off and why you think they are so. Thank you.

Ok, I updated ipyexperiments and then tried it again over the same notebook.

At the end of the training phase, it reports a peak memory utilisation of 85 MB, while gpustat showed 11602 MB during training.

Thank you for running the update, @balnazzar.

Yes, this is a known caveat; please see: https://github.com/stas00/ipyexperiments/blob/master/docs/cell_logger.md#peak-memory-monitor-thread-is-not-reliable. Please vote for pytorch to implement the needed support: https://github.com/pytorch/pytorch/issues/16266

It probably has to do with Tesla being a faster card than GTX, so the monitoring thread can’t keep up with it, due to python thread implementation limitations.

Currently pytorch-1.0.1 has added a single peak-memory counter, so perhaps I’ll just try to use it for the cell logger instead of the python monitor thread. But with only one counter, the cell logger and the user’s own code can’t use it concurrently.
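
For reference, the counter in question is the peak-allocation tracker in torch.cuda. Something along these lines, with the caveat that the reset function’s name varies between pytorch releases (newer ones use torch.cuda.reset_peak_memory_stats) and that it only tracks pytorch’s own allocator, not other processes:

    import torch

    torch.cuda.reset_max_memory_allocated()      # zero the peak counter
    # ... run the code you want to measure ...
    peak = torch.cuda.max_memory_allocated()     # peak bytes allocated by pytorch since the reset
    print(f"peak pytorch allocation: {peak >> 20} MB")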

1 Like

Thanks! I’ll do what you suggest.

OK, I implemented locking and 0.1.16 has been released. Please let me know if the crashing still occurs (or any deadlocks).

Fixes negative peak memory reports too.

1 Like

Hi @stas

I have two questions in this regard.

1. In the output of IPyExperimentsPytorch, does CPU RAM refer to the whole system RAM or to the CPU cache?

2. How can I get the CPU RAM usage, GPU usage, and time taken per training epoch in my log file? Ideally I would like to have a CSV log file like this:

I will let the code answer in the best way:

    def cpu_ram_total(self): return psutil.virtual_memory().total
    def cpu_ram_avail(self): return psutil.virtual_memory().available
    def cpu_ram_used(self):  return process.memory_info().rss

Now you can look them up and see what they all mean.
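
If you want the same numbers outside of ipyexperiments, a standalone psutil sketch looks like this (virtual_memory() is machine-wide system RAM, and rss is the resident memory of the current process; neither has anything to do with the CPU cache):

    import psutil

    proc = psutil.Process()                                    # the current process
    print("total system RAM: ", psutil.virtual_memory().total)
    print("available RAM:    ", psutil.virtual_memory().available)
    print("this process (RSS):", proc.memory_info().rss)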

2. How can I get the CPU RAM usage, GPU usage, and time taken per training epoch in my log file? Ideally I would like to have a CSV log file like this:

It’s already done by:
https://docs.fast.ai/callbacks.mem.html#PeakMemMetric
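
Roughly like this, assuming the fastai v1 API (the DataBunch data and the model below are placeholders): PeakMemMetric adds per-epoch cpu/gpu memory columns to the metrics, and CSVLogger writes each epoch’s metrics row out to a csv file.

    from fastai.vision import *                      # fastai v1-era imports
    from fastai.callbacks import CSVLogger
    from fastai.callbacks.mem import PeakMemMetric

    # `data` is assumed to be an already-built DataBunch
    learn = cnn_learner(data, models.resnet34, metrics=[accuracy],
                        callback_fns=[PeakMemMetric, CSVLogger])
    learn.fit_one_cycle(3)                           # metrics end up in a csv file (history.csv by default) under learn.path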

2 Likes