- replaced gputil with the much faster nvidia-ml-py3
I have no idea whether it works on Windows, but I see no reason why it shouldn’t, as it accesses the NVML library directly.
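For context, here is a minimal sketch of how GPU memory can be read through NVML, assuming the nvidia-ml-py3 package (which installs the `pynvml` module); the `gpu_ram_used_mb` helper name is made up for illustration, and it simply returns `None` on machines without NVML:

```python
def gpu_ram_used_mb(device_index=0):
    """Return used GPU RAM in MiB via NVML, or None if NVML is unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        # no NVIDIA driver / pynvml not installed
        return None
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes
    pynvml.nvmlShutdown()
    return used // 2**20
```

Because NVML is queried directly, no `nvidia-smi` subprocess is spawned, which is where the speedup over gputil comes from.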
I’ve found some time today to have a look at the cyclic references of learner. Adding WeakReferences to callbacks fixes the issue but we still have a cyclic reference in scipy module which cannot be easily fixed. I’ve updated the test to reflect this. https://github.com/fastai/fastai/pull/1375
Recent changes:
Some breaking changes in the last release:
- IPyExperimentsCPU and IPyExperimentsPytorch now make up the experiments module. It should be trivial now to add other backends.
- The get_stats method has been replaced with the data property method, which now returns one or more IPyExperimentMemory named tuple(s) depending on the subclass used.
The latest API is here: https://github.com/stas00/ipyexperiments#api
It was painful to maintain two somewhat similar systems, so I integrated both into one.
So the big change is: ipygpulogger got integrated into ipyexperiments.
I’d like to finalize the API and to make sure that all the reported numbers and their names make sense and are intuitive. So if you get a chance please kindly play with the latest version and let me know if anything is unclear/confusing/can be improved/etc.
Thank you.
These are helpful updates. I am making more progress on these tests and I think the notebooks look cleaner now.
I am getting an error I don’t understand, which I believe comes from IPyExperiments.
Full notebook in this link. Error in cell 16.
The error is not thrown consistently, and it has something to do with the del
call, but I can’t figure out what.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/backcall/backcall.py in adapted(*args, **kwargs)
102 kwargs.pop(name)
103 # print(args, kwargs, unmatched_pos, cut_positional, unmatched_kw)
--> 104 return callback(*args, **kwargs)
105
106 return adapted
~/anaconda3/envs/fastaiv1_dev/lib/python3.7/site-packages/ipyexperiments/cell_logger.py in post_run_cell(self)
158
159 if self.backend != 'cpu':
--> 160 self.gpu_mem_used_new = self.exp.gpu_ram_used()
161
162 # delta_used is the difference between current used mem and used mem at the start
AttributeError: 'NoneType' object has no attribute 'gpu_ram_used'
(fyi, I moved your report and my followup to the thread where it belongs so that we don’t discuss off-topic things there.)
Yup, I’ve been battling with this one for a while.
It seems to have to do with Python threads. There is a peak memory manager thread and there is the IPython callback thread. I need to be able to do an atomic check and quit the thread if that check fails, but I’m not sure how one goes about this with Python threads. What happens now is that it intermittently fails at:
if are_we_still_running:
do_something()
where it succeeds at the conditional check, is immediately yielded to the main thread, and when control comes back do_something
fails, because the condition is no longer true.
Because there are two overlapping contexts, cell-level and notebook-level, which reference each other, I use a weakref proxy so as to avoid circular references and have del exp
do the right thing. That’s where it fails: the proxy is gone but the thread still wants to run, so I’m not quite sure how to resolve this race condition.
And I can’t kill the ipython thread, that would be a disaster.
And if I keep the real parent object and not a proxy the sub-system will prevent it from being destroyed, which defeats the purpose of the experiment.
Perhaps I made a design mistake and it needs to be redone.
If you have the know-how please have a look.
I’ve not worked with python threads before. I am glad this is a known issue for you.
maybe this snippet can show the way. Copied from here: https://opensource.com/article/17/4/grok-gil
So, despite the GIL, you still need locks to protect shared mutable state:
import threading

n = 0
lock = threading.Lock()

def foo():
    global n
    with lock:
        n += 1
Yes, I made a first attempt with lock yesterday and thanks to the test suite it caught a deadlock, so I need to try harder. But yes, it seems to be the only way to make it thread-safe.
Thank you for the link, @Kaspar - this was an excellent article (and the comments after fix some of the incorrect things said in the article).
Hi! First of all, thanks for having shared ipyexperiments, @stas. I find it quite useful in my everyday practice.
The tool works flawlessly on my 1080ti, but it has some issues on the Tesla V100 I use at work. Sometimes, it even reports a negative memory usage.
See this notebook: https://github.com/terribilissimo/otherstuff/blob/master/TESTs-FP16_V100.ipynb
Any idea about that strange behaviour?
the negative mem report has been fixed in master, I’m just waiting to find time to fix the thread race condition and will make a new release. until then install:
pip install git+https://github.com/stas00/ipyexperiments
what else is not working?
Mhh, apart from the negative mem, it reports a wrong value (when positive), as you may see in the notebook.
But maybe it’s part of the same issue as the negative mem.
Thanks!
Can you please be more specific? I don’t know how to tell from your notebook which value is wrong.
Perhaps re-run it after updating to the master
first and then let me know which numbers are off and why you think they are so. Thank you.
Ok, I updated ipyexperiments
and then tried it again over the same notebook.
At the end of the training phase, it reports a peak memory utilisation of 85 MB, while via gpustat I saw 11602 MB during training.
Thank you for running the update, @balnazzar.
Yes, this is a known caveat, please see: https://github.com/stas00/ipyexperiments/blob/master/docs/cell_logger.md#peak-memory-monitor-thread-is-not-reliable. Please vote for pytorch implementing the needed support. https://github.com/pytorch/pytorch/issues/16266
It probably has to do with the Tesla being a faster card than the GTX, so the monitoring thread can’t keep up with it, due to the limitations of Python’s thread implementation.
Currently pytorch-1.0.1 added a single counter, so perhaps I’ll just try to use it for the cell logger instead of the Python monitor thread. But we can’t use it concurrently.
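For reference, a minimal sketch of reading that pytorch counter, assuming `torch.cuda.max_memory_allocated()` (the high-water-mark counter in pytorch's caching allocator); the helper name is invented, and it degrades to `None` when torch or CUDA is unavailable:

```python
def peak_gpu_mem_bytes():
    """Peak GPU memory allocated by pytorch, in bytes, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # a single process-wide high-water mark, which is why two monitors
    # (e.g. a per-cell logger and user code) cannot share it concurrently
    return torch.cuda.max_memory_allocated()
```

Note it only tracks memory allocated through pytorch, not other GPU consumers, which is one reason it can disagree with tools like gpustat.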
Thanks! I’ll do what you suggest.
OK, I implemented locking and 0.1.16 has been released. Please let me know if the crashing still occurs (or any deadlocks).
Fixes negative peak memory reports too.
Hi @stas
I have two questions in this regard.
1- In the output of IPyExperimentsPytorch, does CPU RAM refer to the whole system RAM or to the CPU cache?
2- How can I get the CPU RAM, GPU usage, and time taken per training epoch in my log file? Ideally I would like to have a CSV log file like this:
I will let the code answer in the best way:
import psutil
process = psutil.Process()  # defined at module level in ipyexperiments

def cpu_ram_total(self): return psutil.virtual_memory().total
def cpu_ram_avail(self): return psutil.virtual_memory().available
def cpu_ram_used(self):  return process.memory_info().rss
now you can look them up and see what they all mean.
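In short: these are whole-system numbers from psutil, not CPU cache — `virtual_memory()` reports system-wide RAM and `Process().memory_info().rss` reports this process's resident set. A quick sketch you can run to see the values (guarded so it is a no-op where psutil isn’t installed):

```python
try:
    import psutil
except ImportError:
    psutil = None  # psutil not installed; nothing to report

if psutil is not None:
    vm = psutil.virtual_memory()
    print("total system RAM:   ", vm.total)
    print("available system RAM:", vm.available)
    print("this process RSS:    ", psutil.Process().memory_info().rss)
```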
2- how can I get the CPU, RAM, GPU usage and Time taken for training per epoch in my log file? Ideally I would like to have a CSV log file like this :
It’s already done by:
https://docs.fast.ai/callbacks.mem.html#PeakMemMetric