Difference in Batchsize and effect on training

Nubbinsonfire · August 11, 2019, 5:36pm

Hi All,

I have been training a couple of networks for a skin lesion competition and have been coming across some weird issues with my batch size. I have always assumed that the larger the batch size you can fit on your GPU the better, but during my training I have found it's the opposite.

I experimented tonight with changing the batch size and nothing else on a network trained on anime character faces, and the results are not what I expected. I have placed a notebook at https://github.com/Dakini/BatchSize that shows my results. Training for 5 epochs with fit_one_cycle with a batch size of 728 got an accuracy of ~58%, while a batch size of 32 got 79%. There seemed to be no difference between the training times for them either.

Does anyone know why this is? Any comments or feedback would be great.
I was thinking that a tool like lr_find(), but for batch size, might help with training too!

fgfm · August 11, 2019, 7:11pm

Hey there @Nubbinsonfire,

Since I’m not familiar with this dataset, I cannot give you an exhaustive answer but here are a few things to keep in mind:

For hardware efficiency purposes, pick batch sizes that are a power of 2. Here is a good summary of the topic and a more detailed technical answer.
Batch size, number of epochs and learning rate are not independent with regard to your final performance (time to convergence and accuracy of predictions).
If the dataset were infinite, picking a larger batch size would simply improve your estimator of the error gradient. Since we have a finite dataset and a computing budget, there are two things to balance: the quality of the estimator and the number of updates performed in an epoch.

Let’s illustrate the latter simply with an imaginary dataset of size 2^N and two edge cases:

Batch size = 1. Each parameter update treats a single sample as representative of the dataset's overall distribution. This is stochastic gradient descent in its strictest sense, and the gradient estimate is as noisy as it can get.
Batch size = 2^N. The estimator is as good as it can get, but you’re actually using Batch Gradient Descent rather than Mini-batch Gradient Descent (often loosely called Stochastic Gradient Descent). This means that you will backpropagate the error and update your parameters only once per epoch.
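
To make the two edge cases concrete, here is a minimal sketch rather than a real training run. It assumes a toy setup in which every sample contributes one scalar "gradient" drawn from a Gaussian. The names `updates_per_epoch` and `gradient_estimate_spread`, and the value N = 10, are made up for illustration. It reports how many updates one epoch gives you and how much the mini-batch gradient estimate varies from batch to batch.

```python
import math
import random
import statistics


def updates_per_epoch(dataset_size, batch_size):
    """Number of parameter updates in one epoch (last partial batch kept)."""
    return math.ceil(dataset_size / batch_size)


def gradient_estimate_spread(per_sample_grads, batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch mean over many random batches.

    A smaller spread means the batch gradient is a more reliable estimate
    of the full-dataset gradient.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.sample(per_sample_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


N = 10                    # assumption: toy dataset of size 2^N = 1024
dataset_size = 2 ** N
data_rng = random.Random(42)
# Assumption: scalar per-sample gradients, true mean 1.0, std 2.0.
per_sample_grads = [data_rng.gauss(1.0, 2.0) for _ in range(dataset_size)]

for batch_size in (1, 32, dataset_size):
    n_updates = updates_per_epoch(dataset_size, batch_size)
    spread = gradient_estimate_spread(per_sample_grads, batch_size)
    print(f"batch size {batch_size:>4}: {n_updates:>4} updates/epoch, "
          f"spread of gradient estimate {spread:.3f}")
```

In this toy setting, a batch size of 1 gives the most updates but the noisiest estimate, and the full batch gives a single update with no sampling noise at all. Anything in between trades one for the other.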

Between those two extrema, there is a sweet spot. The best way to pick it is to understand the reasons stochasticity was introduced: speed of convergence (more frequent updates than batch gradient descent), gradient estimation (better than taking a single sample as representative of the overall distribution), and generalization (reaching the global minimum of your training loss does not mean you have reached the global minimum of your validation loss). Disclaimer: those are at least the main ones I can think of.

Furthermore, the batch size choice is a controversial topic, but there are legitimate claims about limiting it smartly. And when Yann LeCun says so, I usually think twice before going the opposite way. In Computer Vision, apart from GAN training, I cannot remember using a batch size higher than 32, or perhaps 64.

In your case, the simplest explanation is: with a fixed number of epochs (5 here), the number of parameter updates with a batch size of 32 is more than 20 times higher than with a batch size of 728 (728 / 32 ≈ 22.75).
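
As a quick sanity check of that ratio, here is a small sketch. The dataset size of 10,000 images is a hypothetical placeholder (I don't know the real size of the anime faces dataset), and `total_updates` is just an illustrative helper that assumes the last partial batch of each epoch is kept.

```python
import math


def total_updates(dataset_size, batch_size, epochs):
    """Total parameter updates over a run (last partial batch kept)."""
    return epochs * math.ceil(dataset_size / batch_size)


dataset_size = 10_000   # hypothetical: the real dataset size is not given
epochs = 5              # from the original post

small_bs_updates = total_updates(dataset_size, 32, epochs)
large_bs_updates = total_updates(dataset_size, 728, epochs)

print(f"batch size  32: {small_bs_updates} updates")
print(f"batch size 728: {large_bs_updates} updates")
print(f"ratio: {small_bs_updates / large_bs_updates:.1f}x")
```

Whatever the true dataset size, the ratio stays close to 728 / 32, so the run with the larger batch size simply gets far fewer chances to move the weights in 5 epochs.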

In conclusion, your batch size only needs to be high enough for the gradient estimator to be reasonably representative of the direction towards the global minimum. Higher than this, you might get better final results, but at the expense of a much longer (and more expensive) training. Lower than this, your training will oscillate more and become less stable.

I hope this helped, cheers!