When checking the GPU memory usage I saw that I could increase the batch size. (I understand that batch size influences the network's ability to represent and generalize.)
If I double the batch size, should I halve the learning rate in order to keep the same learning dynamics? Or is the learning rate already adjusted for batch size in fastai or PyTorch? I'm using wrn_22 and the default AdamW optimizer.
It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam.
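The schedule that abstract describes can be sketched as a stepwise rule: instead of halving the learning rate at each milestone, double the batch size. The function below is an illustrative sketch, not code from the paper; the milestone epochs and base batch size are made-up values.

```python
# Sketch of the "increase the batch size instead of decaying the lr" idea:
# at each milestone epoch where one would normally halve the learning rate,
# double the batch size instead. Milestones and base_bs are illustrative.
def batch_size_at(epoch, base_bs=64, milestones=(30, 60, 80)):
    """Return the batch size to use at `epoch`: doubled at each milestone."""
    doublings = sum(1 for m in milestones if epoch >= m)
    return base_bs * (2 ** doublings)
```

In practice you would rebuild the DataLoader with the new batch size at each milestone while leaving the learning rate fixed.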
The rule of thumb is that if you double the batch size you can double the learning rate, but that only holds up to a certain point. You should use learn.lr_find() to find the best value.
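That rule of thumb is the linear scaling heuristic. A minimal sketch of it, assuming you scale the learning rate by the ratio of the new batch size to the old one (the function name and numbers are illustrative):

```python
# Linear scaling heuristic: the learning rate grows in proportion to the
# batch size, so doubling the batch size doubles the lr. Only a rough
# starting point -- verify with lr_find() rather than trusting it blindly.
def scaled_lr(base_lr, base_bs, new_bs):
    """Return a learning rate scaled linearly with the batch-size ratio."""
    return base_lr * (new_bs / base_bs)

# e.g. scaled_lr(1e-3, 64, 128) suggests trying a lr of 2e-3
```

Note that this heuristic was developed mainly for SGD; with adaptive optimizers like AdamW the appropriate scaling is less clear-cut, which is another reason to confirm with learn.lr_find().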
The plot of learning rate vs. loss gets more chaotic the more we do with the network.
For example, during the fine-tuning stage, or in scenarios where the CNN is trained in several stages with different input image sizes, the lr_find plot is more chaotic and has no distinct downward slope.
This is what one should expect after training the network for a while: the weights have descended into regions where the loss manifold has many local minima, so it is easier to jump back up than to follow a clear downward slope. But I am wondering: what are good rules of thumb for choosing learning rates in those cases?