New Google paper proposing to increase Batch Size instead of decaying Learning Rate

With a shouting title :slight_smile:

If I read this correctly, they are saying that increasing the batch size can be equivalent to decaying the learning rate, right?

But does this only help parallelism, since we do fewer parameter updates? They go up to a batch size of 65536, which probably requires an absurd amount of RAM to hold at once.
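If I have the equivalence right, it can be sketched in a few lines. This is my own toy illustration (the numbers and the `factor`/`stages` names are made up, not from the paper): the idea is that the SGD noise scale goes roughly like lr × N / batch_size, so dividing the learning rate by some factor or multiplying the batch size by the same factor should leave it unchanged:

```python
# Toy sketch of "decay lr" vs "grow batch size" schedules.
# Assumption: the relevant quantity is the noise scale ~ lr * N / B,
# so both schedules below keep it identical at every stage.

def lr_decay_schedule(lr0, batch0, stages, factor=2):
    """Classic schedule: divide the learning rate by `factor` each stage."""
    return [(lr0 / factor**s, batch0) for s in range(stages)]

def batch_increase_schedule(lr0, batch0, stages, factor=2):
    """Alternative: keep lr fixed, multiply the batch size instead."""
    return [(lr0, batch0 * factor**s) for s in range(stages)]

N = 50_000  # made-up dataset size
for (lr_a, b_a), (lr_b, b_b) in zip(
        lr_decay_schedule(0.1, 128, 4),
        batch_increase_schedule(0.1, 128, 4)):
    noise_a = lr_a * N / b_a
    noise_b = lr_b * N / b_b
    print(f"decay: lr={lr_a:.4f} B={b_a:4d} | "
          f"grow: lr={lr_b} B={b_b:4d} | "
          f"noise {noise_a:.1f} vs {noise_b:.1f}")
```

Fewer updates fall in the late stages under the batch-growing schedule, which is exactly where the parallelism win would come from.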

This is interesting to think about, but if my understanding is correct, on a single GPU one could only experiment with this on toy examples? At Google scale, or for some other massive enterprise, it could definitely be nice :slight_smile:


Let’s have discussions about new papers in #theory or #applications as appropriate, unless it’s particular to a lesson we’ve done in this course.


I could’ve sworn there was a paper that recommended using smaller batch sizes. The rationale was: a smaller batch size leads to a larger number of iterations, thus more weight updates and a better solution.

I suppose it doesn’t make sense to take an arbitrarily small batch size either (or else you’d never be able to train in a reasonable amount of time). Anyway, the rationale makes some crude sense to me…
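The "more updates" part is just arithmetic, so here is a quick sanity check (my own numbers, 50k samples is roughly a CIFAR-10-sized training set):

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # Each batch produces one weight update; the last,
    # possibly partial, batch still counts as one.
    return math.ceil(n_samples / batch_size)

N = 50_000  # made-up training-set size
for bs in (32, 256, 65536):
    print(f"batch {bs:5d} -> {updates_per_epoch(N, bs)} updates/epoch")
```

With a batch size of 65536 you would get a single update per epoch on a 50k-sample dataset, which shows why the batch-growing trick only pays off on much larger datasets (or with many passes over the data).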

Any feedback is welcome.