When checking the GPU memory usage I saw that I could increase the batch size. (I understand that batch size influences the network's ability to represent and generalize.)
If I double the batch size, should I halve the learning rate in order to keep the same learning dynamics? Or is the learning rate already adjusted for batch size in fastai or PyTorch? I'm using wrn_22 and the default AdamW optimizer.
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam.
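The schedule that abstract describes can be sketched as a stepwise rule: instead of halving the learning rate at each milestone, double the batch size. The function below is an illustrative sketch, not code from the paper; the milestone epochs and base batch size are made-up values.

```python
# Sketch of the "increase the batch size instead of decaying the lr" idea:
# at each milestone epoch where one would normally halve the learning rate,
# double the batch size instead. Milestones and base_bs are illustrative.
def batch_size_at(epoch, base_bs=64, milestones=(30, 60, 80)):
    """Return the batch size to use at `epoch`: doubled at each milestone."""
    doublings = sum(1 for m in milestones if epoch >= m)
    return base_bs * (2 ** doublings)
```

In practice you would rebuild the DataLoader with the new batch size at each milestone while leaving the learning rate fixed.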
The rule of thumb is that if you double the batch size you can double the learning rate, but that only holds up to a certain point. You should use learn.lr_find() to find the best value.
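That rule of thumb is the linear scaling heuristic. A minimal sketch of it, assuming you scale the learning rate by the ratio of the new batch size to the old one (the function name and numbers are illustrative):

```python
# Linear scaling heuristic: the learning rate grows in proportion to the
# batch size, so doubling the batch size doubles the lr. Only a rough
# starting point -- verify with lr_find() rather than trusting it blindly.
def scaled_lr(base_lr, base_bs, new_bs):
    """Return a learning rate scaled linearly with the batch-size ratio."""
    return base_lr * (new_bs / base_bs)

# e.g. scaled_lr(1e-3, 64, 128) suggests trying a lr of 2e-3
```

Note that this heuristic was developed mainly for SGD; with adaptive optimizers like AdamW the appropriate scaling is less clear-cut, which is another reason to confirm with learn.lr_find().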
The plot of learning rate vs. loss gets more chaotic the more we do with the network.
For example, during the fine-tuning stage, or in scenarios where the CNN is trained in several stages with different input image sizes, the lr_find plot is more chaotic and has no distinct downward slope.
This is what one should expect after training the network for a while: the weights have descended into regions where the loss manifold has many local minima, so it is easier to jump back up than to follow a clear downward slope. But I am wondering: what are good rules of thumb for choosing learning rates in those cases?