Intuition to LR Range Test, Cyclical LR and the One-Cycle Policy

lucha6 · March 12, 2020, 12:25am

Hello, fast.ai community! This is my first post!

I have been reading about CLR, the One-Cycle policy and the LR Range test (the one implemented by lr_find()) by @Leslie. I understand what CLR and the One-Cycle policy are, but I struggle to understand the intuition behind the LR range test method (and its plots).

My understanding is the following (please let me know if it is wrong and if so, where):
Over one or a few epochs, we train a given network while updating the learning rate (linearly or exponentially) at each iteration. At each step we calculate the loss, backpropagate it, update the parameters, and keep increasing the LR until the loss starts diverging abruptly. Additionally, as we go, we take an exponentially weighted average of the loss and perform bias correction by dividing by 1 - \beta^t, where t is the iteration at which the loss is calculated and \beta is the coefficient in the weighted average.
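To make the steps above concrete, here is a minimal sketch of an LR range test on a toy problem (1-D linear regression trained with mini-batch SGD in plain Python). It is not fastai's lr_find() implementation: the function name `lr_range_test`, the toy data, the default `beta=0.9`, and the "stop when the smoothed loss exceeds 4x the best value" rule are all assumptions chosen for illustration.

```python
import math
import random


def lr_range_test(start_lr=1e-5, end_lr=10.0, num_iter=100, beta=0.9, seed=0):
    """Toy LR range test: fit y = 3x with SGD while the LR grows exponentially.

    Returns the learning rates tried and the bias-corrected smoothed losses.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Toy dataset: y = 3x plus a little Gaussian noise.
    data = []
    for _ in range(256):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

    w = rng.uniform(-1.0, 1.0)  # random starting point for the single weight
    # Multiplier that takes the LR from start_lr to end_lr in num_iter steps.
    mult = (end_lr / start_lr) ** (1.0 / (num_iter - 1))

    lr = start_lr
    avg = 0.0            # running exponentially weighted average of the loss
    best = float("inf")  # best smoothed loss seen so far
    lrs, smoothed = [], []

    for t in range(1, num_iter + 1):
        batch = rng.sample(data, 16)
        # Mean squared error on the mini-batch and its gradient w.r.t. w.
        loss = sum((w * x - y) ** 2 for x, y in batch) / len(batch)
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

        # Exponentially weighted average with bias correction 1 - beta^t.
        avg = beta * avg + (1.0 - beta) * loss
        smooth = avg / (1.0 - beta ** t)

        # Stop once the smoothed loss is no longer finite or clearly diverges.
        if not math.isfinite(smooth) or (t > 1 and smooth > 4.0 * best):
            break
        best = min(best, smooth)
        lrs.append(lr)
        smoothed.append(smooth)

        w -= lr * grad  # SGD step with the current learning rate
        lr *= mult      # increase the learning rate for the next iteration

    return lrs, smoothed


if __name__ == "__main__":
    lrs, smoothed = lr_range_test()
    for lr, s in list(zip(lrs, smoothed))[::10]:
        print(f"lr={lr:.2e}  smoothed_loss={s:.4f}")
```

Plotting `smoothed` against `lrs` on a log-scaled x-axis gives the kind of curve the LR finder shows. Changing `seed` changes both the starting weight and the batches, which is a quick way to see how much the curve depends on them.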

What I struggle to understand is how this is an accurate (and unbiased) estimate of the effect of the learning rate on the entirety of the loss ‘landscape’. For example, given that the first iteration starts somewhere random, that would certainly affect the loss calculated at that iteration (wouldn’t it?). Additionally, what if at some point the optimizer gets stuck in a minimum (by chance/randomness), wouldn’t the loss during those iterations also be biased? And finally, are the results we get from an LR Range test batch-dependent? The only intuition I have is that all these problems are taken care of by the exponentially weighted average but if that is the case I would like a deeper explanation of it. Any help is appreciated.

I have read the explanations by @sgugger in his personal blog, and while they have helped they still do not solve my lack of intuition.

nestorDemeure · March 12, 2020, 1:42pm

What I struggle to understand is how this is an accurate (and unbiased) estimate of the effect of the learning rate on the entirety of the loss ‘landscape’.

It is not.

Hence we take a learning rate below the one at the minimum of the loss plot: we expect the value that works best for the first few batches to overestimate the value that works best for the full training. (Converting the lr_finder plot into a learning rate that performs well over the full training is still an art.)
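One way to turn "a learning rate below the minimum" into code is sketched here. The helper name `suggest_lr` and the default `divisor=10.0` are assumptions: dividing by ten is a common rule of thumb, not a guarantee, and other heuristics (such as picking the steepest downward slope) are also used.

```python
def suggest_lr(lrs, smoothed_losses, divisor=10.0):
    """Return the LR at the minimum smoothed loss, scaled down by `divisor`.

    `divisor` is a hypothetical knob for illustration; the value is a
    starting point to be checked against real training runs.
    """
    if not lrs or len(lrs) != len(smoothed_losses):
        raise ValueError("lrs and smoothed_losses must be non-empty and equal length")
    # Index of the smallest smoothed loss on the range-test curve.
    i_min = min(range(len(lrs)), key=lambda i: smoothed_losses[i])
    return lrs[i_min] / divisor


if __name__ == "__main__":
    lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
    losses = [2.0, 1.5, 0.8, 0.5, 3.0]
    # The minimum loss is at lr = 1e-1, so the suggestion is one tenth of it.
    print(suggest_lr(lrs, losses))
```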

The weighted average is just there so that the plot is not too noisy; it can be ignored when trying to understand the method.
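For reference, the smoothing and its bias correction can be written out as follows. The symbols \ell_t (raw loss at iteration t), a_t (running average, with a_0 = 0 assumed) and \hat{a}_t (the plotted value) are notation introduced here for illustration.

```latex
a_t = \beta\, a_{t-1} + (1 - \beta)\, \ell_t, \qquad a_0 = 0
\quad\Longrightarrow\quad
a_t = (1 - \beta) \sum_{i=1}^{t} \beta^{\,t-i}\, \ell_i .

\text{The weights sum to } (1 - \beta) \sum_{i=1}^{t} \beta^{\,t-i} = 1 - \beta^{t},
\text{ so }
\hat{a}_t = \frac{a_t}{1 - \beta^{t}}
\text{ is a weighted average of } \ell_1, \dots, \ell_t \text{ whose weights sum to } 1 .
```

In other words, the division by 1 - \beta^t only removes the pull toward the zero initial value in the first iterations; it does nothing about the dependence of the curve on the starting point or on the batches.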

lucha6 · March 12, 2020, 2:25pm

Thanks for your reply @nestorDemeure. Would you agree with my description of the method (below)?

Over one or a few epochs, we train a given network while updating the learning rate (linearly or exponentially) at each iteration. At each step we calculate the loss, backpropagate it, update the parameters, and keep increasing the LR until the loss starts diverging abruptly. Additionally, as we go, we take an exponentially weighted average of the loss and perform bias correction by dividing by 1 - \beta^t, where t is the iteration at which the loss is calculated and \beta is the coefficient in the weighted average.

nestorDemeure · March 12, 2020, 2:30pm

It seems accurate