I just finished reading a new paper and I think it is flawed. The paper is:
“On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent” at
My own results with large batch sizes on image classification with CIFAR-10 suggest that these authors did not sufficiently explore the hyper-parameter space, especially learning rates and weight decay (they are likely unaware that it is best to increase weight decay when increasing the batch size). Within the range of batch sizes from 16 to 2048, I am able to reduce the generalization gap to zero. BTW, my latest experiments indicate that there is a sweet spot for learning rate (LR), batch size (BS), weight decay (WD), and momentum, so I think the generalization gap appears when one is outside this sweet spot.
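To make the "increase weight decay with batch size" point concrete, here is a minimal sketch of one way to co-scale the hyper-parameters. The function name, the reference values, and the choice of linear scaling for both LR and WD are my own heuristic for illustration, not a recipe from the paper:

```python
# Hypothetical heuristic: scale learning rate linearly with batch size
# (a common rule of thumb for large-batch training) and grow weight
# decay alongside it. Base values are arbitrary placeholders.

def scaled_hyperparams(batch_size, base_batch=128, base_lr=0.1, base_wd=5e-4):
    """Return (lr, weight_decay) scaled from a reference batch size."""
    scale = batch_size / base_batch
    lr = base_lr * scale   # linear LR scaling with batch size
    wd = base_wd * scale   # heuristic: increase weight decay too
    return lr, wd

for bs in (16, 128, 2048):
    lr, wd = scaled_hyperparams(bs)
    print(f"batch={bs:5d}  lr={lr:.4f}  wd={wd:.2e}")
```

In practice one would still sweep around these scaled values rather than trust the rule blindly; the point is only that LR and WD should move together as BS grows.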
I know that fast.ai students and alumni have also used large batch sizes for faster training and obtained good results. Please reply if your experience contradicts the conclusions in this paper. I am especially interested in training on non-image-classification tasks, because I have very little experience with them. For example, the paper states that large-batch training breaks down for NLP tasks even more than for image classification. Have you done NLP training and seen otherwise?
Thank you in advance for answering my questions.