I’ve been wondering about optimum batch size as well. Here’s a paper I just came across on how batch size affects generalization: https://arxiv.org/pdf/1705.08741.pdf
My understanding after a first read-through is that larger batches can result in worse generalization because there are fewer gradient updates per epoch. The paper makes the case for scaling the number of training epochs with the batch size, i.e. measuring the length of training by the number of weight updates rather than the number of epochs.
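To make the "count weight updates, not epochs" idea concrete, here is a rough sketch of the arithmetic. The function names and the example numbers (50,000 samples, batch sizes 128 and 1024) are my own, chosen for illustration, and are not taken from the paper.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data (last partial batch counted)."""
    return math.ceil(num_samples / batch_size)

def epochs_for_same_updates(num_samples, base_batch_size, base_epochs, new_batch_size):
    """Epochs needed at new_batch_size to match the total number of
    weight updates of a baseline run (hypothetical helper)."""
    target_updates = base_epochs * updates_per_epoch(num_samples, base_batch_size)
    return math.ceil(target_updates / updates_per_epoch(num_samples, new_batch_size))

# Illustrative numbers: a 50,000-sample training set.
small = updates_per_epoch(50_000, 128)    # updates per epoch with small batches
large = updates_per_epoch(50_000, 1024)   # far fewer updates per epoch with large batches
epochs_needed = epochs_for_same_updates(50_000, 128, 10, 1024)
print(small, large, epochs_needed)
```

So a run that takes 10 epochs at batch size 128 would need many more epochs at batch size 1024 to see the same number of updates, which is the adjustment described above.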
Leslie’s paper on batch size:
Small batch sizes have been recommended for regularization effects (Wilson & Martinez, 2003) and others have shown there to be an optimal batch size on the order of 80 for Cifar-10 (Smith & Le, 2017). Contrary to this early work, this Section recommends using a larger batch size when using the 1cycle learning rate schedule, which is described in the above.
What I understood is that, although a small batch size can be used for regularization, a large learning rate is a better regularizer: large learning rates in 1cycle help you converge faster, whereas small batch sizes have the opposite effect and slow training down.
My point is that if we are going to use 1cycle, why not build a function that automatically gives us the optimum batch size (the biggest that fits in your GPU memory)? This would save a lot of trial and error during training.
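Something like the following could be a starting point. It is only a sketch: `fits` is a hypothetical callback that you would write yourself (for example, run one forward and backward pass on a dummy batch of that size and return False if you hit an out-of-memory error), and the search assumes that if a batch size fits, every smaller one fits too.

```python
def largest_fitting_batch_size(fits, start=1, limit=65536):
    """Find the largest batch size in [start, limit] for which fits(bs) is True.

    fits: hypothetical user-supplied callable, batch_size -> bool.
    Assumes fits is monotonic: once a size fails, all larger sizes fail.
    """
    if not fits(start):
        raise ValueError("even the starting batch size does not fit")

    # Phase 1: double until a size fails or we pass the limit.
    lo = start          # largest size known to fit
    hi = start * 2      # candidate upper bound
    while hi <= limit and fits(hi):
        lo = hi
        hi *= 2
    if hi > limit:
        hi = limit + 1  # treat anything above the limit as not fitting

    # Phase 2: binary search between lo (fits) and hi (does not fit).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Stand-in for a real memory check: pretend anything up to 300 samples fits.
print(largest_fitting_batch_size(lambda bs: bs <= 300))
```

In practice you would probably round the result down to a convenient size and leave some headroom, since memory use can vary between batches.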
Do the findings in Leslie’s paper apply to architectures other than convolutional networks? For example, is the recipe the same for a model like the Transformer?