I expect that larger batch sizes should result in faster training epochs due to GPU parallelization. However, in my experiment, batch sizes from 2 to 64 all take ~30 seconds to train one epoch.
Details:
- Task is image-to-image, with 10k 224x224x1 samples.
- Model is a CNN with 4 layers of 96 3x3 filters each, plus a 5th layer with a single filter.
- GPU is a Tesla T4 16GB. Reported GPU memory utilization is about 60% at batch size 64.
Related Notes:
- In a second experiment, I got similar results with a much smaller model (32 filters per layer instead of 96) but over 10x the data. Here, bs=2 took 1:20 and bs=64 took 1:06. That's only ~21% faster (80s vs. 66s) for a batch size 32x larger.
- Incidentally, smaller batch sizes are getting better results too, but that’s a different topic.
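To make the hypothesis concrete: a flat epoch time like this is exactly what you'd see if a fixed per-sample cost (e.g. CPU-side loading/preprocessing) dominates, since that cost doesn't shrink with batch size. A toy cost model (all costs here are made-up numbers, not measurements from my setup):

```python
def epoch_time(n_samples, batch_size, per_sample_load_s, per_batch_step_s):
    """Model an epoch where loading costs scale per sample (CPU-bound)
    while the GPU step costs roughly the same per batch (parallelized)."""
    n_batches = n_samples // batch_size
    return n_samples * per_sample_load_s + n_batches * per_batch_step_s

# Hypothetical costs: 2.8 ms to load one sample, 5 ms per GPU step.
for bs in (2, 8, 64):
    print(f"batch_size={bs:3d}: ~{epoch_time(10_000, bs, 2.8e-3, 5e-3):.1f}s per epoch")
```

With these made-up numbers, epoch time floors out near the fixed 28s loading cost no matter how large the batch gets, matching the pattern above.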
This suggests that something other than training the network itself (perhaps the input pipeline) is the bottleneck, and I was hoping someone could shed light on this. Thanks!
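One way I could check this myself (framework-agnostic sketch; `fake_loader` below is a stand-in for the real data loader) is to time a full pass over the loader with the model step removed entirely:

```python
import time

def time_loader_only(loader):
    """Time one full pass over a batch iterable without touching the model.
    If this alone accounts for most of the ~30s epoch, the input pipeline
    (disk reads, decoding, augmentation) is the bottleneck."""
    start = time.perf_counter()
    n_batches = 0
    for _batch in loader:
        n_batches += 1
    return time.perf_counter() - start, n_batches

# Stand-in loader for illustration; substitute the actual DataLoader here.
def fake_loader(n_batches):
    for i in range(n_batches):
        yield i

elapsed, n = time_loader_only(fake_loader(5000))
print(f"{n} batches in {elapsed:.3f}s (loader only)")
```

If the loader-only pass takes nearly as long as a full epoch, speeding up or parallelizing data loading (rather than changing batch size) would be the fix.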