Don't Decay the Learning Rate, Increase the Batch Size

I came across this paper almost a year ago while working on a Kaggle competition, but I didn’t have the time to implement it.

I was wondering if anyone has tried to compare its results against the 1cycle training regime. The researchers at Google claim that a training regime where we increase the batch size during training …

[…] reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ.
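To make the quoted idea concrete, here is a minimal sketch in plain Python of a step schedule that multiplies the batch size wherever a conventional schedule would divide the learning rate. The function name `batch_size_schedule`, the milestones, the factor of 5, and the `max_bs` cap (with a fallback to ordinary learning-rate decay once the cap is reached) are all assumptions made for illustration, not values or code taken from the paper.

```python
def batch_size_schedule(epochs, milestones, factor, base_lr, base_bs, max_bs):
    """Return one (learning_rate, batch_size) pair per epoch.

    At each milestone the batch size is multiplied by `factor` as long as the
    result stays within `max_bs`; once the cap is hit, the learning rate is
    divided by `factor` instead (assumed fallback, for illustration only).
    """
    lr, bs = base_lr, base_bs
    schedule = []
    for epoch in range(epochs):
        if epoch in milestones:
            if bs * factor <= max_bs:
                bs *= factor   # larger batch instead of a smaller step
            else:
                lr /= factor   # conventional decay once the batch is capped
        schedule.append((lr, bs))
    return schedule


# Hypothetical settings: milestones at epochs 2, 4 and 6, factor 5.
sched = batch_size_schedule(
    epochs=8, milestones={2, 4, 6}, factor=5,
    base_lr=0.1, base_bs=128, max_bs=3200,
)
for epoch, (lr, bs) in enumerate(sched):
    print(f"epoch {epoch}: lr={lr:.3f}  batch_size={bs}")
```

Either branch shrinks the ratio of learning rate to batch size by the same factor at every milestone. How the resulting batch sizes get fed to a data loader is left open here.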

With gradient accumulation, this should also be possible on GPUs with a moderate amount of memory.
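As a rough illustration of why gradient accumulation helps here, the sketch below uses a toy one-parameter model in plain Python (no deep learning framework). It shows that the gradient of a large batch can be rebuilt from micro-batches that would each fit in memory. The model, the data, and the helper names `grad_mse` and `accumulated_grad` are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Gradient over the whole (xs, ys) batch, computed one micro-batch at a time."""
    total = len(xs)
    acc = 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the effective batch so the
        # accumulated value equals the mean gradient over all samples.
        acc += grad_mse(w, mx, my) * (len(mx) / total)
    return acc


# Toy data: an effective batch of 8 samples, processed 3 samples at a time.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=3)
print(full, accumulated)

# A single parameter update is applied only after all micro-batches are seen.
lr = 0.01
w_new = w - lr * accumulated
```

The effective batch size can therefore grow during training while the per-step memory footprint stays fixed; only the number of micro-batches per update changes.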

Intuitively, I think this makes sense and looks like a logical extension of the 1cycle policy: at every mini-batch, increase the learning rate and the batch size together, so that you can take bigger steps with more confidence.
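A minimal sketch of that coupling, assuming a simple triangular learning-rate schedule as a stand-in for 1cycle (it is not fastai's implementation) and a batch size kept proportional to the learning rate, B ∝ ϵ. The names `triangular_lr` and `coupled_batch_size`, the rounding to a multiple of 8, and all numeric values are hypothetical.

```python
def triangular_lr(step, total_steps, lr_min, lr_max):
    """Linear warm-up to lr_max at the midpoint, then linear cool-down to lr_min.

    A simplified stand-in for a 1cycle-style schedule.
    """
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return lr_min + (lr_max - lr_min) * frac


def coupled_batch_size(lr, base_lr, base_bs, multiple_of=8):
    """Batch size proportional to the learning rate, rounded to a multiple."""
    raw = base_bs * lr / base_lr
    return max(multiple_of, int(round(raw / multiple_of)) * multiple_of)


total_steps = 10
for step in range(total_steps + 1):
    lr = triangular_lr(step, total_steps, lr_min=0.01, lr_max=0.1)
    bs = coupled_batch_size(lr, base_lr=0.01, base_bs=32)
    print(f"step {step:2d}  lr={lr:.3f}  batch_size={bs}")
```

Note that strict proportionality also shrinks the batch during the cool-down half of the cycle. Whether to do that, or to hold the batch size at its peak, is a design choice this sketch does not settle.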

What are your thoughts?

P.S.: I need to find the time to implement this paper 🙂


It seems like a good idea to implement it.

I would also like to see what combining it with a proper learning rate scheduler would do. I don’t think they spent enough time on their hybrid training regime. It should be a lot easier for us than for them to build a proper hybrid, because of the callbacks we have.

Like this:

I wonder how this squares with the claim that the smaller your batch size, the better your model generalizes (there are a lot of papers about that).