Profiling Keras (and TF/Theano) code

This might be a silly question, but how do we actually tell whether we’re using the GPU as efficiently as possible? So far I’ve only looked at nvidia-smi dmon, which shows things like sm/memory utilization and power consumption, but does sm at 100% mean the GPU is actually busy computing? I’m asking because on CPUs, time spent waiting for memory accesses generally counts as busy time.

For example, training on 2000 images of cats and dogs with 300 images in the validation set, and measuring the time for one epoch:

  • batch 2: time: 34s
  • batch 4: time: 29s
  • batch 8: time: 29s
  • batch 16: time: 30s
  • batch 32: time: 32s
  • batch 64: time: 37s
  • batch 128: time: 40s … on the first run, SM usage jumped back and forth between 60% and 100% and the run went out of memory at the end; when I restarted it, it ran continuously at 100% until the validation phase
  • batch 256: doesn’t work, runs out of memory (OOM) every time
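To make the numbers in the list above easier to compare, here is a minimal sketch that converts each epoch time into steps per epoch, time per step, and images per second. The helper name `per_step_stats` is made up for illustration, and it assumes the whole epoch time is spent on training steps, which overstates the step time if validation is included in the measurement.

```python
import math

# Numbers copied from the list above; 2000 training images per epoch.
TRAIN_IMAGES = 2000
EPOCH_SECONDS = {2: 34, 4: 29, 8: 29, 16: 30, 32: 32, 64: 37, 128: 40}


def per_step_stats(batch_size, epoch_seconds, n_images=TRAIN_IMAGES):
    """Return (steps per epoch, seconds per step, images per second).

    Assumption: the epoch time is attributed entirely to training steps.
    """
    steps = math.ceil(n_images / batch_size)
    return steps, epoch_seconds / steps, n_images / epoch_seconds


for batch_size, seconds in sorted(EPOCH_SECONDS.items()):
    steps, sec_per_step, img_per_sec = per_step_stats(batch_size, seconds)
    print(f"batch {batch_size:>3}: {steps:>4} steps, "
          f"{sec_per_step * 1000:7.1f} ms/step, {img_per_sec:5.1f} images/s")
```

On these measurements, throughput peaks at roughly 69 images/s for batch 4 and 8 and falls to 50 images/s at batch 128, while the time per step grows from 34 ms at batch 2 to 2.5 s at batch 128.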

Here’s a screenshot of the SM % usage on the batch 128 run. In all of the other runs the GPU was mostly at 100%, almost never dropping below 90-95%.


Looking at the results, this goes against what I’ve heard, which is that bigger batches are faster. It seems like that should be the case; otherwise the GPU is just idling while it waits for the next data to come in, right?

How can I get some more insight into what is really going on? Is there a way to get more precise measurements, something like perf_tools but for GPUs? Or is there a good way to, for example, correlate CPU and GPU usage to see whether one is waiting for the other?
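One rough way to start correlating the two is to log timestamped GPU samples next to whatever CPU-side logging is already in place. The sketch below wraps the `nvidia-smi --query-gpu` interface; the function names and the sample line are made up for illustration, and `sample_gpu` only works on a machine with an NVIDIA driver installed. Note that `utilization.gpu` is documented as the percentage of time during the sample period in which at least one kernel was executing, so 100% does not by itself mean every SM was fully occupied.

```python
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY = "utilization.gpu,utilization.memory,memory.used"


def parse_sample(line):
    """Parse one CSV line such as '97, 41, 10240' into a dict of ints."""
    gpu, mem, used = (int(field.strip()) for field in line.split(","))
    return {"gpu_util_pct": gpu, "mem_util_pct": mem, "mem_used_mib": used}


def sample_gpu(gpu_index=0):
    """Take a single utilization sample (requires nvidia-smi on the PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_sample(out.strip().splitlines()[0])


# Hypothetical sample line, only to show the parsed shape.
print(parse_sample("97, 41, 10240"))
```

Calling `sample_gpu()` once per second from a separate process during training, and printing `time.time()` with each sample, gives a log that can be lined up against CPU measurements taken over the same interval.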

This was measured on a GeForce GTX 1080 Ti with 11GB of VRAM and an i7-5820K, using the TensorFlow backend.

Check out the thread I started a few days ago: “Huge performance improvement with network training!” It will help you understand what’s going on GPU-wise.


You can also use TensorFlow’s profiler to get in-depth details: