Impressive speedup on old GPU

Since I have a Windows gaming PC sitting around with a GTX 970 (and a Core i5 @ 4GHz), I decided to try to get things running on it. I finally got it all working (including remote Jupyter login from my laptop :slight_smile: ). I am running the dogs vs cats redux example, using the Theano backend (CUDA 8, cuDNN 5.1).

On my MacBook Pro, this reports an ETA of about 25,000 sec per epoch, longer than I care to wait. For this I am using the TensorFlow backend. Incidentally, it seems to utilize the CPU quite well, with about 700% CPU usage on average during a run.

On AWS P2, it takes about 760 sec per epoch.

I expected my aging 970 to be a lot slower than the K80 on AWS. But in fact, it is more than twice as fast: about 360 sec per epoch. This uses about 51% CPU, 100% GPU, and 3.5GB of memory on the GPU, as reported by GPU-Z (though I think I read somewhere that Theano reserves all GPU memory regardless).

The only downside is that I had to limit the batch size to 16 due to CUDA out-of-memory errors :frowning: I still get 98.3% accuracy after 3 epochs. I am a bit confused about the memory usage. Do the memory needs scale linearly with batch size? In that case, it seems like one would need a GPU with 14GB to run with batch size 64…


That’s so cool! Congrats on getting it working, and thanks for sharing. :slight_smile:

Yes, pretty much - but remember it also needs room to store the model itself, so there’s a non-zero intercept on that line…
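To make that intercept concrete, here's a back-of-the-envelope sketch in Python. The 3500 MB at batch size 16 is the figure reported above; the 1500 MB fixed model cost is a made-up number purely for illustration:

```python
# Rough linear model: total memory = fixed model cost + batch_size * per-sample cost.
model_mb = 1500          # hypothetical fixed cost (weights, workspace) - assumed
total_mb_at_16 = 3500    # observed GPU memory at batch size 16 (from the post above)

per_sample_mb = (total_mb_at_16 - model_mb) / 16   # activation cost per sample
mem_at_64 = model_mb + 64 * per_sample_mb          # estimate for batch size 64

print(per_sample_mb, mem_at_64)
```

Going from batch 16 to 64 multiplies only the per-sample activation term, not the fixed model cost, which is why the naive "4 × 3.5GB = 14GB" estimate overshoots.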


Ok, cool, that makes sense.

This then leads to another question: at what point does it become a problem that your batch size is too small (because you're trying to fit the model into GPU memory)? For example, with a model that's 4x larger, I might still be able to run it with batch size 4. At what point does this become a problem, and if it does, to what extent can it be ameliorated by changing the optimization algorithm? Is even a batch size of 1 reasonable for some problems?

A batch size of 1 should work OK, but it does make the SGD bounce around a bit more, so you may need to be more careful with your learning rate.

However, just because you can only compute a small batch at a time on the GPU does not mean you also need to update the model’s weights each time. You can run multiple batches, accumulate the gradients over these batches, and do the weights update after every N batches. That is (pretty much) equivalent to using a larger batch size.

Keras does not currently support this, but it's easy enough to do in PyTorch. (And you can hack it into Keras if you feel adventurous.)
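Since the weight update only needs the averaged gradient, the equivalence is easy to check numerically. A minimal NumPy sketch, using a toy least-squares model rather than an actual network, showing that gradients accumulated over four batches of 16 match one batch of 64:

```python
import numpy as np

# Toy linear model: loss = mean((x @ w - y)**2); grad is its gradient wrt w.
def grad(w, x, y):
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
y = rng.normal(size=64)
w = rng.normal(size=3)

# One "large batch" gradient over all 64 samples.
g_large = grad(w, x, y)

# Accumulate gradients over 4 small batches of 16, then average.
g_accum = np.zeros_like(w)
for i in range(0, 64, 16):
    g_accum += grad(w, x[i:i + 16], y[i:i + 16])
g_accum /= 4

print(np.allclose(g_large, g_accum))  # the two gradients agree
```

In PyTorch the same pattern is: call `loss.backward()` on each small batch (gradients accumulate in `.grad` by default) and only call `optimizer.step()` and `optimizer.zero_grad()` every N batches. Note the equivalence is exact for plain SGD on a mean loss; layers whose statistics depend on the batch (e.g. BatchNorm) still see the small batch.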


Also, with a batch size of 1, watch out for Batch Normalization (batch renormalization is said to work better with such small batch sizes).


What is the performance on the P3 (Volta)? I'm guessing 500 sec/epoch.