Paper: On the training dynamics of deep networks with L2 regularization

Jeremy recently suggested this paper that was suggested to him by Leslie Smith:

Basically, they claim you can find the optimal L2 regularization coefficient by training with a very large L2 coefficient until the model reaches its maximum performance on the test set (using early stopping). Training with a very large L2 coefficient makes the run much shorter, and they then use a formula they found empirically to calculate the optimal L2 coefficient from it.

This could basically open the door to a Weight decay finder for fastai.
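To make the idea concrete, here is a minimal sketch of what such a finder could look like, assuming the recipe described above: train with a deliberately large L2 coefficient, early-stop at peak test accuracy, then map that short run to an optimal coefficient. The paper's empirical formula is not reproduced here, so `empirical_formula` is a placeholder, and `train_one_epoch` / `evaluate` are hypothetical callbacks standing in for a real training loop.

```python
# Hypothetical "weight decay finder" sketch. `empirical_formula` is a
# placeholder for the paper's empirical mapping; `train_one_epoch` and
# `evaluate` stand in for a real training loop and test-set evaluation.

def find_weight_decay(train_one_epoch, evaluate, large_wd, empirical_formula,
                      max_epochs=100, patience=5):
    """Train with a large L2 coefficient, early-stop on test accuracy,
    then return (epoch of peak accuracy, suggested L2 coefficient)."""
    best_acc, best_epoch, bad_epochs = 0.0, 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(wd=large_wd)
        acc = evaluate()
        if acc > best_acc:
            best_acc, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping: test accuracy has peaked
    return best_epoch, empirical_formula(large_wd, best_epoch)
```

The appeal is that the expensive part (the large-L2 run) is short, so the whole search could cost a fraction of a full training run.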

Here is the relevant excerpt from the paper:


It’s an interesting paper. I hope the researchers will continue studying the joint behavior of learning rate scheduling, weight decay, and true weight decay in the future. I have some doubts about a weight decay finder (at least at this stage), as the optimal weight decay / weight decay schedule seems coupled with the learning rate / learning rate schedule; see the ImageNet results on ResNet-50 in figures S8 and S9 in the appendix.

That being said, I find the concept of weight decay scheduling to speed up training interesting. In my opinion, it may be worth investigating whether starting with a high weight decay, reducing it linearly to the optimal value by about the middle of training, and then keeping it constant speeds up training without reducing accuracy.
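The schedule described above can be sketched as a simple function of training progress. This is my own minimal version, not from the paper; `wd_high` and `wd_opt` are placeholders for values you would pick (or find) for your own problem.

```python
# Sketch of the proposed schedule: start from a high weight decay, anneal it
# linearly down to the presumed optimum by the midpoint of training, then
# hold it constant for the rest of the run.

def wd_schedule(pct, wd_high, wd_opt, anneal_until=0.5):
    """Weight decay at training progress `pct` (0.0 at start, 1.0 at end)."""
    if pct >= anneal_until:
        return wd_opt                  # constant phase after the midpoint
    frac = pct / anneal_until          # 0 -> 1 over the annealing phase
    return wd_high + frac * (wd_opt - wd_high)
```

Hooking something like this into a fastai callback (the same way learning rate schedules are applied per batch) would be the natural way to test the speed-up claim.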