Please explain why batch size matters

(Antonov Arseniy) #1

Hi all,
I don’t understand in detail why batch_size matters.
For example, I’m using the fit_generator function from Keras,
and I see that my results really improve when I set batch_size=32 instead of batch_size=16.

(Ramesh Sampath) #2

You may want as large a batch size as possible (usually a power of 2: 16, 32, 64, 128, 256, 512…), so that the model sees more of the training data before it updates the weights to reduce the loss. Because each weight update is based only on the images in the current batch, you want the batch to be representative of the overall training data. Shuffling and other methods help, but a larger batch size can be thought of as drawing a larger sample from the population of training data. So it would be no surprise if the larger batch size improves the performance of your model.
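To make the "larger sample from the population" point concrete, here is a minimal sketch using only the Python standard library. It does not train a model: it assumes a synthetic population of random numbers as a stand-in for per-example gradients, and the helper name `batch_mean_spread` is made up for this illustration. It measures how much the batch average jumps around from batch to batch at two batch sizes.

```python
import random
import statistics


def batch_mean_spread(population, batch_size, n_batches=2000, seed=0):
    """Standard deviation of the batch mean across many random batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)


# Synthetic stand-in for per-example gradient values (an assumption,
# not data from a real network).
data_rng = random.Random(42)
population = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

spread_16 = batch_mean_spread(population, 16)
spread_32 = batch_mean_spread(population, 32)

print(f"spread of batch mean, batch_size=16: {spread_16:.3f}")
print(f"spread of batch mean, batch_size=32: {spread_32:.3f}")
```

The spread should shrink as the batch grows (roughly with the square root of the batch size), which is one way to read "the batch is more representative". Whether that translates into a better final model depends on the rest of the setup.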

You might also want to experiment with varying the batch size. Start with a small batch size, so that updates are quick and the weights move from their random initialization into a reasonable zone, then increase it. There are a few papers written on this topic.
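As a rough sketch of what "start small, then increase" could look like, the function below doubles the batch size every few epochs up to a cap. The function name, the starting size, the cap, and the doubling interval are all assumptions picked for illustration; they are not taken from any particular paper or from Keras.

```python
def batch_size_for_epoch(epoch, start=16, max_size=256, double_every=5):
    """Hypothetical schedule: double the batch size every `double_every`
    epochs, starting at `start` and never exceeding `max_size`."""
    doublings = epoch // double_every
    return min(start * 2 ** doublings, max_size)


# Print the batch size you would use at a few epochs.
for epoch in (0, 4, 5, 10, 15, 20, 30):
    print(epoch, batch_size_for_epoch(epoch))
```

In a training loop you would call this once per epoch and rebuild your data generator with the returned batch size.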

(Sanyam Bhutani) #3

I’d like to add Jeremy’s words to this: preferably work with powers of 2, so you can always double or halve the value from there as needed.


Clarifying/elaborating: each batch is used to calculate an error value by running it through your network, getting the batch outputs, and comparing them with the targets. The error is then backpropagated and gradients are produced for each weight; the weights are updated with those gradients. The number of data points in a batch determines how many data points go into each loss value that gets backpropagated.

Five things in your batch? Then the first error/gradients come from running 5 things through the net and doing the above. The second 5 then get used, and so on. A bigger batch size captures more information in each error value and so is usually better.

Batch size is what makes this stochastic gradient descent (SGD); in normal gradient descent you use all the data before calculating an error/gradients and backpropagating. SGD is called ‘stochastic’ gradient descent because each batch is only a sample, which introduces deviations that would not be present had you used all the data to produce your error/gradients.
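The loop described above can be sketched in a few lines of plain Python. This is a toy example under stated assumptions: a one-weight model `y = w * x`, synthetic noiseless data with a true weight of 3.0, and a hypothetical function name `minibatch_sgd`. It is meant to show the mechanics (one gradient and one weight update per batch), not how Keras implements training.

```python
import random


def minibatch_sgd(xs, ys, batch_size, lr=0.5, epochs=100, seed=0):
    """Fit y = w * x with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one weight update per batch
    return w


# Synthetic data (an assumption for this sketch): true weight is 3.0.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x for x in xs]

w_small = minibatch_sgd(xs, ys, batch_size=5)        # 40 updates per epoch
w_full = minibatch_sgd(xs, ys, batch_size=len(xs))   # 1 update per epoch

print(f"batch_size=5:   w = {w_small:.4f}")
print(f"batch_size=200: w = {w_full:.4f}")
```

Setting `batch_size=len(xs)` gives the "normal" gradient descent case from the post, where all the data goes into every gradient. Any smaller value gives the stochastic version, where each gradient comes from a sample.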


I was thinking about batch size too, and I am not sure if I am oversimplifying, but it seems to me the only thing that matters for the batch size is memory.

We are generally talking about GPUs nowadays, and the cards running these models have something like 12 GB or 16 GB.

Memory is like a water pipe: the bigger it is, the wider the pipe and the more water can pass through (read: more of your operations can execute at once).

By that logic, the more memory you have, the bigger your batches can be (assuming you selected the right hyperparameters for your architecture).
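A back-of-the-envelope version of that logic: if you know roughly how much memory one sample costs during training, you can estimate the largest power-of-2 batch that fits. The helper name and the 50 MB per-sample figure below are made-up assumptions for illustration. In practice the weights, optimizer state, and framework overhead also take memory, so treat the result as an upper bound to test, not a guarantee.

```python
def max_power_of_two_batch(free_bytes, bytes_per_sample):
    """Largest power-of-2 batch size whose samples fit in `free_bytes`.

    Returns 0 if not even a single sample fits.
    """
    n = free_bytes // bytes_per_sample
    if n < 1:
        return 0
    return 1 << (n.bit_length() - 1)


GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed numbers: a 12 GB card and a hypothetical 50 MB per sample.
batch_12gb = max_power_of_two_batch(12 * GIB, 50 * MIB)
batch_16gb = max_power_of_two_batch(16 * GIB, 50 * MIB)

print("12 GB card:", batch_12gb)
print("16 GB card:", batch_16gb)
```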