Now that I’ve read the LARS paper, I can say that this thread/research project is still worth pursuing - maybe even more so than before.
My first question is: has anyone, particularly @sgugger, replicated the results in the LARS paper? That paper shows results for ImageNet, using a modified AlexNet with batch norm. But @sgugger said:
For anyone who hasn’t read the LARS paper, it suggests a layer-wise learning rate computed as:
lambda^l = eta * norm(w^l) / norm(gradient^l)
where eta is essentially a constant learning rate and the superscript ‘l’ denotes the layer, so each layer gets its own rate. This makes it a layer-wise adaptive LR.
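To make the formula concrete, here is a minimal PyTorch sketch of the per-layer rate it describes. The function name, the value of `eta`, and the `eps` guard against division by zero are illustrative choices of mine, not taken from the paper’s implementation:

```python
import torch

def lars_layer_lr(weight: torch.Tensor, grad: torch.Tensor,
                  eta: float = 0.001, eps: float = 1e-9) -> float:
    """Layer-wise LR from the formula above:
    lambda^l = eta * norm(w^l) / norm(gradient^l)."""
    w_norm = weight.norm()
    g_norm = grad.norm()
    # eps keeps the ratio finite for layers whose gradient is (near) zero.
    return (eta * w_norm / (g_norm + eps)).item()
```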
However, I immediately saw what I believe is an improvement. Based on theory, I think a better layer-wise LR is:
lambda^l = norm(MAD(w^l)) / norm(MAD(gradient^l))
where MAD means a moving average of the difference over iterations (i.e., the change from one iteration to the next). In my opinion, it is the ratio of the change in the weights to the change in the gradients that estimates the second derivative (i.e., the Hessian), and it is this curvature that indicates the appropriate learning rate. Is it clear why the rate of change of the weights and the gradient is what matters (think of the definition of the derivative in calculus: a ratio of differences)? The moving average smooths out the noise in the gradients. It would be informative to compare my version to LARS.
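Here is a rough sketch of one way this could be computed per layer. The exponential-moving-average form, the smoothing factor `beta`, the `eps` guard, and the neutral fallback of 1.0 on the first step are my illustrative choices for what “moving average of the difference” could mean in code, not a fixed specification:

```python
import torch

class MADLayerLR:
    """Sketch of lambda^l = norm(MAD(w^l)) / norm(MAD(gradient^l)),
    where MAD is a moving average of the iteration-to-iteration change."""

    def __init__(self, beta: float = 0.9, eps: float = 1e-9):
        self.beta = beta          # smoothing factor for the moving average
        self.eps = eps            # guards against division by zero
        self.prev_w = None
        self.prev_g = None
        self.mad_w = None
        self.mad_g = None

    def update(self, weight: torch.Tensor, grad: torch.Tensor) -> float:
        if self.prev_w is None:
            # No differences available yet: return a neutral LR of 1.0.
            self.prev_w = weight.detach().clone()
            self.prev_g = grad.detach().clone()
            return 1.0
        dw = weight.detach() - self.prev_w    # change in the weights
        dg = grad.detach() - self.prev_g      # change in the gradient
        if self.mad_w is None:
            self.mad_w, self.mad_g = dw, dg
        else:
            self.mad_w = self.beta * self.mad_w + (1 - self.beta) * dw
            self.mad_g = self.beta * self.mad_g + (1 - self.beta) * dg
        self.prev_w = weight.detach().clone()
        self.prev_g = grad.detach().clone()
        # Ratio of smoothed weight change to smoothed gradient change:
        # a secant-style estimate of the inverse curvature for this layer.
        return (self.mad_w.norm() / (self.mad_g.norm() + self.eps)).item()
```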
Also, I was dismayed that the LARS paper doesn’t seem to compare to Adam (or AdamW, discussed in the latest fast.ai blog post). Not comparing the method against these optimizers more thoroughly weakens the paper.
Coming back to the topic of this post, I’d say it is worthwhile to start with manually set layer-wise learning rates and add a few more experiments. Obviously, we should compare to LARS. We should also compare the manual setting to my version above. In addition, AdamW needs to be part of the experiments.
Finally, I’d like to say that I started this LLR thread for educational purposes, for any of the fast.ai students who would like to experience my way of doing research (i.e., the thought experiment, searching the literature, designing and running experiments, observing and trying to understand the results, and perhaps writing a paper). For that reason, I’d like to continue this LLR “lesson” in the public forum. Is this interesting to anyone? Should we continue?
Best,
Leslie