Why don't larger batch sizes train faster?

I expected larger batch sizes to result in faster training epochs due to GPU parallelization. However, in my experiment, batch sizes from 2 … 64 all take ~30 seconds to train one epoch.


  • Task is image-to-image, with 10k 224x224x1 samples.
  • Model is a CNN with 4 layers of 96 filters of size 3x3 (and a single filter in a 5th layer).
  • GPU is a Tesla T4 16GB. The GPU reports about 60% memory utilization at a batch size of 64.

Related Notes:

  • In a second experiment, I got similar results with a much smaller model (32 filters instead of 96 per layer) but over 10x the data. Here, bs=2 took 1:20 and bs=64 took 1:06. That’s only about 18% less time for a batch size 32x larger.
  • Incidentally, smaller batch sizes are getting better results too, but that’s a different topic.

This suggests that something besides training the network is the bottleneck and I was hoping someone could shed light on this. Thanks!

Usually, getting the data to the GPU is the bottleneck.

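So, depending on your storage (ideally an NVMe M.2 SSD in a RAID configuration) and how much CPU preprocessing you do, loading can take quite some time. If the GPU spends most of each epoch waiting for data, a larger batch size barely changes the epoch time.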
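
One way to check whether this applies to you is to time the wait for each batch separately from the training step. Below is a minimal, framework-agnostic sketch. `profile_epoch` and `slow_loader` are hypothetical names made up for illustration, and the sleep stands in for disk reads and CPU preprocessing. With a real GPU step you would need to synchronize before the step returns (in PyTorch, for example, by calling `torch.cuda.synchronize()`), because GPU work is launched asynchronously and would otherwise be billed to the wrong bucket.

```python
import time


def profile_epoch(batches, train_step):
    """Split one epoch's wall time into data-wait time and step time.

    batches:    any iterable that yields batches (e.g. a data loader)
    train_step: callable that runs one training step on a batch; for GPU
                work it should synchronize before returning (assumption)
    """
    load_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent in the training step itself
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s


def slow_loader(n_samples, batch_size, per_sample_s):
    """Hypothetical stand-in loader with a fixed loading cost per sample."""
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        time.sleep(per_sample_s * size)  # simulated disk read + preprocessing
        yield list(range(start, start + size))


if __name__ == "__main__":
    for bs in (2, 64):
        load_s, step_s = profile_epoch(slow_loader(256, bs, 0.001), lambda b: None)
        print(f"bs={bs}: waiting for data {load_s:.2f}s, training step {step_s:.2f}s")
```

If the data-wait time dominates and stays roughly the same across batch sizes, the input pipeline is the likely bottleneck. In that case, more loader worker processes (e.g. `num_workers` in a PyTorch `DataLoader`), cheaper preprocessing, or caching the data in RAM will probably help more than a larger batch size.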
