Techniques like cyclic LR, cyclic momentum, and AdamW are state-of-the-art methods for training neural networks. But do these techniques also transfer to classical machine learning models?
Intuitively, the reason for super-convergence is well explained in the Super-convergence paper by Leslie N. Smith, and it could carry over to machine learning models as well: in the end we are just trying to find a minimum of the loss function, which depends not on the architecture of our model but on its parameters.
Can someone tell if I am on the right path? I am thinking about making a notebook comparing classic sklearn models with and without the above-mentioned techniques.
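For reference, here is a minimal sketch of the triangular cyclic learning rate schedule that such a notebook could start from. The function name `triangular_lr` and the values for `step_size`, `base_lr` and `max_lr` are illustrative assumptions, not taken from any library.

```python
import math


def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclic LR: rise linearly from base_lr to max_lr over
    step_size iterations, then fall back to base_lr over the next step_size."""
    # Index of the current cycle (1-based); one full cycle is 2 * step_size long.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# One full cycle with illustrative (assumed) hyperparameters.
schedule = [triangular_lr(i, step_size=4, base_lr=0.001, max_lr=0.01) for i in range(9)]
```

The learning rate starts at `base_lr`, peaks at `max_lr` after `step_size` iterations, and is back at `base_lr` after `2 * step_size` iterations.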
Also, I am a bit confused about optimizers. Can people share from their experience which optimizer generally works best for them? I have been using AdamW, but many of the research papers I have read use RMSprop with momentum. Are there any theoretical benefits to using RMSprop instead of Adam?
Depends on what kind of models you are talking about. Some of them do not have a notion of learning rate, while others have concave loss surfaces that do not need advanced learning rate scheduling.
As for the second part of your question, Adam has long been the default optimizer in practice, but when people pushed for state-of-the-art results, it tended to converge to worse minima. Sebastian Ruder does a great job explaining the reasons here:
On the other hand, AdamW is quite a new optimization method and is not implemented by default in the major DL libraries, so I guess once it makes its way into PyTorch or TensorFlow, it will gain wider adoption.
Let’s set aside concave loss surfaces (we can deal with those by flipping the sign). I don’t want to comment much on this, as I have yet to experiment with it, but does it make sense to use the above techniques for models as simple as linear regression (as the loss function is still similar to the one in the super-convergence setting)?
And thanks for the article you posted. It is very intuitive.
And while we are on this topic, what about GANs? @jeremy maybe you can help here, as I have not experimented much with GANs and you could tell whether cyclic learning rate, cyclic momentum, and AdamW also apply to GANs.
We cannot use the loss curves to select the hyperparameters in this case. And one of the important points Leslie makes in his papers is to monitor the loss curves for the first few iterations to get the values of all the hyperparameters. I just cannot see how to do this for GANs. Are there any other techniques for training GANs that I am not aware of?
Well, there’s no point using an optimization procedure for linear regression, as you can solve for the best fit analytically. You could use it for logistic regression. You can look at the optimizers sklearn uses for comparison.
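To make the point about linear regression concrete, here is a small sketch comparing the closed-form least-squares fit with plain gradient descent on the same problem. The toy data, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

# Closed-form solution for simple linear regression.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The same fit found by gradient descent on the mean squared error.
w, b = 0.0, 0.0
lr = 0.05  # assumed constant learning rate
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b
```

Both routes end up at the same slope and intercept; the closed form gets there in one shot, which is why a learning rate schedule has nothing to add for a problem this small.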
For GANs, you can look at the lesson 7 WGAN notebook. It uses Adam with the momentum coefficient set to zero. This is covered in the lesson video.
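As a rough illustration of what "momentum coefficient set to zero" means, here is a sketch of a single Adam update on a scalar parameter with `beta1 = 0`. The function `adam_step` is a hypothetical helper written for this post, and the values of `lr`, `beta2` and `eps` are assumptions, not the settings from the notebook. In PyTorch, the corresponding knob would be the first entry of the `betas` argument of `torch.optim.Adam`.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.0, beta2=0.99, eps=1e-8):
    """One Adam update on a scalar parameter (hypothetical helper).

    With beta1 = 0 the first-moment estimate m is just the current gradient,
    so no momentum is carried over between steps.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# One step on f(x) = x^2 starting from x = 1, where the gradient is 2x.
param, m, v = 1.0, 0.0, 0.0
grad = 2.0 * param
param, m, v = adam_step(param, grad, m, v, t=1)
```

With `beta1 = 0`, Adam keeps only the per-parameter scaling by the running average of squared gradients, which makes it behave much like RMSprop with bias correction.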
Actually, it is often very expensive to solve for the best fit analytically. For example, inverting a matrix is often posed as an optimization problem to get a more efficient solution than the O(n^3) direct approach.
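One standard example of this idea is the conjugate gradient method, which solves A x = b for a symmetric positive-definite A by minimizing the quadratic f(x) = 1/2 x^T A x - b^T x, using only matrix-vector products instead of an explicit inverse. Below is a minimal pure-Python sketch; the helper names and the tiny 2x2 system are assumptions for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A without inverting A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # current search direction
    rs_old = dot(r, r)
    for _ in range(max_iter or 10 * n):
        if rs_old ** 0.5 < tol:
            break          # residual is small enough, stop
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                      # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction, conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small illustrative system (assumed); the exact solution is [1/11, 7/11].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)
```

Each iteration costs one matrix-vector product, so for large sparse systems a handful of iterations can be far cheaper than a direct factorization.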
I am sorry, when I said concave, I actually meant convex.