Disadvantages of using very large batch size

Hello,

Just a general question about negative effects of large batch sizes.

Positive points about large size based on my newbie understanding:

A large batch seems to allow a larger learning rate and faster convergence, and there are fewer batches per epoch, so training is faster.

Question:

What are the negative effects? Is it advisable to have a batch size of 10,000 if the data can fit in memory?

I think the main disadvantage is that the parameters are only updated after each batch.
So say you have 1k images and a batch size of 100.
The parameters will be updated 10 times over the course of 1 epoch.
If you were to set it to something like 500, then the parameters would only get updated twice, making it take longer to get to good values.
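For concreteness, a minimal sketch of that arithmetic (the 1,000-image dataset and the two batch sizes are just the numbers from the example above):

```python
# Updates per epoch for a hypothetical 1,000-image dataset at two batch sizes.
dataset_size = 1_000

for batch_size in (100, 500):
    updates_per_epoch = dataset_size // batch_size
    print(f"batch_size={batch_size}: {updates_per_epoch} updates per epoch")

# batch_size=100: 10 updates per epoch
# batch_size=500: 2 updates per epoch
```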


I don’t think it’s as simple as that.
Realize that in your stochastic gradient descent (SGD), the gradient of the loss function is computed over the entire batch. If you have a very small batch size, your gradient will really be all over the place and pretty random, because learning will be image by image.
With a large batch size, you get more “accurate” gradients because now you are optimizing the loss simultaneously over a larger set of images. So while you are right that you get more frequent updates when using a smaller batch size, those updates aren’t necessarily better. Trade-off is then: Many “bad” updates versus few “good” updates.
At the extreme end you’d have batch-size = training size (probably not feasible for image problems, but when you have a small structured data set it’s certainly possible). In that case you get one update per epoch, and the update is supposed to be “globally” good, i.e., good for the entire dataset. In a way, it’s not SGD any more, but rather just GD without the “stochastic” part.
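To make the trade-off concrete, here is a small, framework-free sketch on a toy least-squares problem (every name and number here is invented for illustration): the mini-batch gradient is a noisy estimate of the full-batch gradient, and the estimation error shrinks as the batch grows, vanishing when the batch is the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)                      # current parameters

def grad(idx):
    """Gradient of the mean squared error over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(n))            # "exact" full-batch gradient

for batch_size in (10, 100, 1_000, n):
    # Average distance between the mini-batch gradient and the full gradient.
    errs = [np.linalg.norm(grad(rng.choice(n, batch_size, replace=False)) - full)
            for _ in range(100)]
    print(f"batch_size={batch_size:>6}: mean gradient error {np.mean(errs):.4f}")
```

The error shrinks roughly like 1/sqrt(batch_size), and it is exactly zero for the full batch, which is the “GD without the stochastic part” case above.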


But should we say the larger the batch size, the better the performance we will get from gradient descent (at the cost of longer training time)?

What about the common case in practice, where our cost function (objective function) is not convex and we cannot find an update that is “globally” good?

  1. for a single update step, a larger batch size is definitely better than a smaller one, because of the reduced variance of the gradient estimate;
  2. for an epoch, a larger batch size may not be better, because the number of updates is reduced (see the sketch below).
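A hedged sketch of point 2, reusing a toy least-squares setup (the learning rate, epoch count, and batch sizes are arbitrary illustrative choices): with the number of epochs fixed, the larger batch takes far fewer steps, so it can end the run further from the optimum even though each individual step uses a less noisy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def train(batch_size, lr=0.05, epochs=5):
    """Plain mini-batch SGD on mean squared error; returns final training loss."""
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * g
    return np.mean((X @ w - y) ** 2)

for bs in (32, 1_000, n):
    # Same learning rate and epoch budget; only the batch size changes.
    print(f"batch_size={bs:>6}: loss after 5 epochs {train(bs):.4f}")
```

In practice the comparison is muddier than this, because a larger batch often supports a larger learning rate (as mentioned at the top of the thread), so the learning rate is rarely held fixed when the batch size changes.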