EDIT: Adding the bash version because it’s way easier:
Python version:
I made a little Python notebook to monitor GPU memory usage after getting a few OOMs. It uses nvidia-smi -q --display=MEMORY to query memory stats. It can only update a few times per second because I’m using matplotlib inefficiently.
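As a side note (my own addition, not from the snippet above): nvidia-smi also has a machine-readable query mode, --query-gpu with --format=csv, which is much easier to parse for plotting than the full -q report. A sketch, where gpu_memory_mb is a hypothetical helper name:

```python
from subprocess import check_output

def gpu_memory_mb(csv_text=None):
    """Return a list of (used, total) MiB pairs, one per GPU.

    If csv_text is None, query nvidia-smi directly; otherwise parse
    the given CSV text (handy for testing without a GPU).
    """
    if csv_text is None:
        csv_text = check_output(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total',
             '--format=csv,noheader,nounits']).decode()
    return [tuple(int(field) for field in line.split(','))
            for line in csv_text.strip().splitlines()]
```

The numeric pairs this returns can be fed straight into a matplotlib line plot instead of string-matching the verbose report.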
If anyone can figure out how to run it in parallel with model fitting in a single notebook, let me know!
Yes, that’s exactly what I’m doing. It would be great to have it run in parallel in a cell right above/below the training cell, but I could not figure out how to do that.
You can probably do something like what you want with IPython widgets.
This isn’t a graph, but it displays the same information. As I’ve written it you need to refresh it manually with update_widget, but I’m sure someone else will know how to get it to update at timed intervals asynchronously.
import ipywidgets as widgets
from IPython.display import display
from subprocess import check_output

def nvidia_smi(options=('-q', '-d', 'MEMORY')):
    # check_output returns bytes; decode so the Textarea can display it
    return check_output(['nvidia-smi'] + list(options)).decode()

def update_widget(w=None, new_box=False):
    if w is None:
        # first call: create the widget and display it
        w = widgets.Textarea(
            value=nvidia_smi(),
            placeholder='nvidia-smi output',
            layout=widgets.Layout(width='100%', height='300px'),
            disabled=False
        )
        display(w)
        return w
    else:
        # later calls: refresh the existing widget's contents
        w.value = nvidia_smi()
        return w if new_box else None

# display widget and get handle
w = update_widget()
# update info in widget
update_widget(w)
Although, FYI, this does not seem to be helpful with the TensorFlow backend: TF allocates all available GPU memory by default, so it gives no warning signs that it is about to run out.
If anyone has any advice on dealing with that, please let me know.
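One common workaround for the TF backend (assuming TF 1.x with Keras, which may not match your setup) is to enable allow_growth so TensorFlow allocates GPU memory incrementally instead of grabbing it all up front, which also makes the nvidia-smi numbers meaningful again. A sketch:

```python
import tensorflow as tf
import keras.backend as K

# Ask TF to grow its GPU allocation on demand (TF 1.x API)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```

Run this before building the model, since the session config only applies to sessions created afterwards.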