What should I set as the maximum learning rate when using SGD with warm restarts?

rghosh · April 20, 2021, 8:41am

I intend to use the fit_sgdr() method to train a vision model. In my initial experiments it has performed significantly better than both fit_one_cycle() and fine_tune().

My understanding is that it is similar to cyclical learning rates, but instead of a triangular schedule it uses cosine annealing.

It starts with a high learning rate and decreases it over a cycle whose length is set by the cycle_len parameter of the method. When the cycle finishes, it restarts from the high learning rate.
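To make the schedule concrete, here is a minimal sketch of cosine annealing with warm restarts in plain Python. It is not fastai's implementation: the function name sgdr_lr, the lr_min argument, and measuring time in (possibly fractional) epochs are assumptions made for illustration. cycle_mult is the multiplier that appears in the fastai source line quoted further down.

```python
import math

def sgdr_lr(t, lr_max, cycle_len, cycle_mult=2, lr_min=0.0):
    """Learning rate at time t (in epochs, may be fractional).

    Hypothetical helper: cosine annealing from lr_max down to lr_min within
    each cycle, restarting at lr_max when a cycle ends. Each cycle is
    cycle_mult times longer than the previous one.
    """
    start, length = 0.0, float(cycle_len)
    # Walk forward until we reach the cycle that contains t
    while t >= start + length:
        start += length
        length *= cycle_mult
    pos = (t - start) / length  # position inside the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * pos))

# With cycle_len=1 and cycle_mult=2 the cycles cover epochs [0, 1), [1, 3), [3, 7)
for t in [0.0, 0.5, 0.99, 1.0, 2.0, 2.99, 3.0]:
    print(f"epoch {t:4.2f}: lr = {sgdr_lr(t, lr_max=1e-2, cycle_len=1):.6f}")
```

In this sketch the peak is the same lr_max at every restart; only the cycle length changes.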

The paper suggests that results might improve if the upper and lower bounds of the learning rate are decreased progressively. Is this implemented in fastai?

From reading the source code, I think it is:

pcts = [cycle_len * cycle_mult**i / n_epoch for i in range(n_cycles)]
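As a quick check, here is a sketch of what that line evaluates to for small inputs. It assumes n_epoch is the total number of epochs, i.e. the sum of all cycle lengths; that definition is not part of the snippet above, so treat it as an assumption. The example values for n_cycles, cycle_len and cycle_mult are arbitrary.

```python
# Arbitrary example values, chosen only for illustration
n_cycles, cycle_len, cycle_mult = 3, 1, 2

# Assumption: n_epoch is the sum of the individual cycle lengths (1 + 2 + 4 here)
cycle_lengths = [cycle_len * cycle_mult**i for i in range(n_cycles)]
n_epoch = sum(cycle_lengths)

# The line from the post, unchanged
pcts = [cycle_len * cycle_mult**i / n_epoch for i in range(n_cycles)]

print(cycle_lengths)  # length of each cycle in epochs
print(pcts)           # share of total training taken by each cycle
print(sum(pcts))      # the shares add up to (approximately) 1
```

Under that assumption the values are fractions of total training time per cycle, so on its own this line appears to set how long each cycle lasts rather than how high or low the learning rate goes within it.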

I am also new to schedules other than fine_tune(), so feel free to correct me if I am wrong anywhere.

When using fine_tune() or calling fit_one_cycle() directly, we choose the learning rate at the point with the steepest gradient, i.e. where the loss is falling most quickly. Should lr_max be chosen differently for fit_sgdr()? For example, should I pick the point where the loss starts to rise again?
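For reference, the two candidate points can be read off a learning-rate sweep as in the sketch below. The helper candidate_lrs is hypothetical and the sweep values are made up purely for illustration. They are not measurements from any model, and the sketch does not say which point is the better choice for lr_max.

```python
import math

def candidate_lrs(lrs, losses):
    """Return (steepest_lr, min_loss_lr) from a learning-rate sweep.

    Hypothetical helper. steepest_lr is the start of the interval where the
    loss drops fastest per decade of learning rate; min_loss_lr is the last
    point before the loss starts to rise again.
    """
    # Slope of the loss with respect to log10(lr) between neighbouring points
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    steepest = min(range(len(slopes)), key=slopes.__getitem__)
    lowest = min(range(len(losses)), key=losses.__getitem__)
    return lrs[steepest], lrs[lowest]

# Made-up sweep for illustration only, not measured results
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.25, 1.90, 1.20, 1.05, 3.00]

steepest_lr, min_loss_lr = candidate_lrs(lrs, losses)
print("steepest descent at lr =", steepest_lr)
print("loss minimum (just before it rises) at lr =", min_loss_lr)
```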