When checking the GPU memory usage I saw that I could increase the batch size (I understand that it influences the network's ability to represent and generalize).
If I double the batch size, should I halve the learning rate in order to keep the same learning dynamics? Or is the learning rate already compensated for batch size in fastai or PyTorch? I'm using wrn_22 and the default learning rate.
Late edit: I found this paper, "Don't Decay the Learning Rate, Increase the Batch Size": https://arxiv.org/abs/1711.00489
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam.
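The paper's idea can be sketched as a schedule that grows the batch size at the epochs where a step-decay schedule would have shrunk the learning rate. This is a toy illustration, not the paper's code; the decay epochs, factor, and cap below are made-up numbers:

```python
def bs_instead_of_lr_decay(base_lr, base_bs, epoch,
                           decay_epochs=(30, 60, 80), factor=5, max_bs=5120):
    """Instead of dividing the LR by `factor` at each decay epoch,
    multiply the batch size by `factor` (capped at `max_bs`); once the
    cap is reached, fall back to decaying the LR as usual.
    All schedule constants here are illustrative."""
    lr, bs = base_lr, base_bs
    for e in decay_epochs:
        if epoch >= e:
            if bs * factor <= max_bs:
                bs *= factor
            else:
                lr /= factor
    return lr, bs
```

The appeal is that larger batches mean fewer optimizer steps per epoch, so training can finish faster on hardware with spare memory, while the effective noise scale of SGD follows roughly the same trajectory as with LR decay.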
The rule of thumb is that when you double the batch size you can also double your learning rate, but that only holds up to a certain point. You should use
learn.lr_find() to find the best value.
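That rule of thumb is the linear scaling heuristic. A minimal sketch (the helper name is mine, not a fastai API):

```python
def scaled_lr(base_lr, base_bs, new_bs):
    """Linear scaling heuristic: scale the learning rate by the same
    factor as the batch size. Only reliable up to moderate batch
    sizes; always re-check the result with learn.lr_find()."""
    return base_lr * (new_bs / base_bs)
```

For example, going from a batch size of 64 to 128 with a base LR of 1e-3 would suggest trying 2e-3 as the starting point for lr_find.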
Doubling the batch size --> doubling the learning rate?
Does learn.lr_find() still deliver a useful result even after quite a few epochs of training? I used to think it was only for freshly initialized models.
I am also wondering about this.
The plot of learning rate vs. loss gets more chaotic the more we do with the network.
For example, during the fine-tuning stage, or in scenarios where the CNN is trained in several stages with different input image sizes, the
lr_find plot is more chaotic and has no distinct downward slope.
This is what one should expect after training the network for a while: the weights have descended into regions where the loss manifold has many local minima, so it is easier for the loss to jump back up than to follow a clear downward slope. But what are good rules of thumb for choosing learning rates in those cases?
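One common heuristic in this situation is to pick the learning rate at the point of steepest descent on the lr_find curve (ideally after smoothing the losses), rather than looking for a long downward slope. A toy sketch of that idea, assuming `lrs`/`losses` arrays recorded by an LR sweep; this approximates, but is not, fastai's built-in suggestion logic:

```python
import math

def suggest_lr(lrs, losses):
    """Return the LR where the loss falls fastest on a log-LR axis.
    In practice you would smooth `losses` first, since lr_find
    curves recorded mid-training are noisy."""
    def slope(i):
        # loss change per decade of learning rate
        return (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
    steepest = min(range(1, len(lrs)), key=slope)
    return lrs[steepest]
```

With a very chaotic curve, people also often just take a value an order of magnitude below the point where the loss blows up, or keep the previous stage's LR divided by 10.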