I was going over Leslie Smith’s paper: https://arxiv.org/pdf/1506.01186.pdf. He says that cyclical learning rates work best when used with SGD and do almost the same when used with Adam. I know fastai uses cyclical learning rates, and Jeremy also explained Adam in the new course. So does fastai combine Adam and cyclical learning rates, or what does it do?
I believe you can apply a cyclical learning rate to any optimization algorithm you like. I guess that by default fastai uses AdamW, which fixes Adam's issue with weight decay, together with a cyclical learning rate.
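To illustrate why the schedule is independent of the optimizer, here is a minimal sketch of the basic triangular policy from the paper. It is just a function of the iteration number, so whatever it returns can be handed to SGD, Adam, or anything else. The function name and the example values are my own choices for illustration, not fastai's API.

```python
import math

def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclical learning rate: ramps linearly from base_lr to
    max_lr over step_size iterations, then back down, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle with a half-cycle length of 4 iterations (illustrative values).
schedule = [triangular_lr(i, step_size=4, base_lr=0.001, max_lr=0.006)
            for i in range(9)]
```

The schedule starts at base_lr, peaks at max_lr after step_size iterations, and is back at base_lr after 2 * step_size iterations.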
Adam has individual learning rates for every parameter, and those learning rates are even adapted after every batch during training. The One Cycle Policy maintains a single learning rate for all the parameters, varying between two bounds. How can Adam and the One Cycle Policy work together?
CLR or the One Cycle Policy updates the learning rate before any other adjustment to it is made (e.g. by Adam, which calculates a distinct effective learning rate for each parameter from that single learning rate value and the gradients of the respective parameter).
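A rough sketch of that ordering in plain Python, in case it helps. The schedule first picks one scalar lr for the step, then Adam scales it per parameter using its running gradient statistics. The `one_cycle_lr` helper is a hypothetical simplified shape (linear up, then linear down), not fastai's actual schedule, and the gradient sequences are made up for illustration.

```python
import math

def one_cycle_lr(step, total_steps, lr_min, lr_max):
    """Hypothetical simplified one-cycle shape: linear ramp up for the
    first half of training, linear ramp down for the second half."""
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return lr_min + (lr_max - lr_min) * frac

def adam_step(params, grads, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update (t is the 1-based step count).
    Returns the step actually applied to each parameter."""
    steps = []
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g       # running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)             # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        params[i] -= step
        steps.append(step)
    return steps

total_steps = 10
params, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
lrs, step_history = [], []
for t in range(1, total_steps + 1):
    # 1) The schedule sets a single learning rate for this step.
    lr = one_cycle_lr(t, total_steps, lr_min=1e-4, lr_max=1e-2)
    # Made-up gradients: parameter 0 sees a consistent gradient,
    # parameter 1 sees one that flips sign every step.
    grads = [1.0, 1.0 if t % 2 == 1 else -1.0]
    # 2) Adam turns that single lr into a different step per parameter.
    step_history.append(adam_step(params, grads, m, v, t, lr))
    lrs.append(lr)
```

With these gradients, parameter 0 moves by roughly the scheduled lr on every step, while parameter 1 moves by much less from the second step onward, even though both were given the same lr.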
The reason Adam has separate learning rates is the division by the running std of the gradients at the end (strictly, the square root of the running average of the squared gradients). However, in the lr*grad/std step we can change lr dynamically. I don’t know how the running std will adjust, but I have seen it work.
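Written out, the update looks roughly like this. The notation is the usual Adam notation and is my assumption rather than anything from this thread: g_t is the gradient, \eta_t is the learning rate the schedule gives at step t, and "grad/std" corresponds to \hat{m}_t / \sqrt{\hat{v}_t}.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

Note that \eta_t only appears in the last line. The running averages m_t and v_t are built from the gradients alone, so changing the learning rate affects them only indirectly, through the gradients the model sees on later steps.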