Regarding the 1cycle policy blog

To understand what exactly fit_one_cycle() does behind the scenes, I read this article by @sgugger . I have a couple of questions after reading it.

  1. Say max_lr=slice(1e-6, 1e-4). Does the learning rate then increase linearly or log-linearly? Intuitively it should be log-linear, but in the blog it's written that the increase is linear.
  2. It is mentioned that the length of the cycle is slightly less than the total number of epochs.

So that would basically mean we only go through a single cycle (first increasing the LR, then decreasing it) for any n in fit_one_cycle(n)?


Regarding point 1, the increase is linear. In the original paper, Smith reports trying different ways of varying the rate and also cites some other researchers who did the same thing, with the end result being that they all worked about the same. So it made sense to go with the most straightforward option.
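To make "linear" concrete, here is a minimal sketch of linear interpolation between two learning rates over a warm-up phase (an illustrative helper, not fastai's actual code):

```python
# Illustrative sketch: linear LR warm-up, as in the original 1cycle paper.
def linear_lr(step, total_steps, lr_start, lr_end):
    """Linearly interpolate from lr_start to lr_end over total_steps."""
    pct = step / total_steps
    return lr_start + pct * (lr_end - lr_start)

# Halfway through a 100-step warm-up from 1e-4 to 1e-2:
print(linear_lr(50, 100, 1e-4, 1e-2))  # 0.00505, i.e. the arithmetic midpoint
```

A log-linear schedule would instead land on the geometric midpoint (1e-3) at the halfway mark.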

Please note that the slice refers to something different: attributing different learning rates to the layers of the model, 1e-6 for the deepest ones and 1e-4 for the head.
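One way to picture how a slice(lo, hi) could be spread across layer groups is log-even spacing, where each group's LR is a constant multiple of the previous one. Note this spacing rule and the helper name are illustrative assumptions, not fastai's exact internals:

```python
# Hypothetical helper: spread slice(lo, hi) log-evenly across n layer groups,
# giving lo to the deepest group and hi to the head.
def spread_lrs(lo, hi, n_groups):
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

print(spread_lrs(1e-6, 1e-4, 3))  # three groups: deepest, middle, head
```

For slice(1e-6, 1e-4) and three groups this yields roughly 1e-6, 1e-5, 1e-4.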

The original 1cycle policy goes from lr_max/div_factor up to lr_max with momentum decreasing, then back down from lr_max to lr_max/div_factor with momentum increasing, with a phase at the end where you decrease the learning rate even further (down to around lr_max/100).
In fastai, the default for this div_factor is 25, but you can change it when invoking learn.fit_one_cycle.
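The whole schedule can be sketched as a function of training progress. This is a toy version of the original (linear) 1cycle policy just described, not fastai's implementation; the phase fractions and the final_div parameter are assumptions for illustration:

```python
# Sketch of the original 1cycle LR schedule: linear warm-up, linear
# cool-down, then a short final phase pushing the LR even lower.
def one_cycle_lr(pct, lr_max, div_factor=25, final_div=100):
    """pct is training progress in [0, 1]."""
    lr_min = lr_max / div_factor
    if pct < 0.45:            # warm-up: lr_min -> lr_max
        return lr_min + (pct / 0.45) * (lr_max - lr_min)
    elif pct < 0.9:           # cool-down: lr_max -> lr_min
        return lr_max - ((pct - 0.45) / 0.45) * (lr_max - lr_min)
    else:                     # annihilation: lr_min -> lr_max / final_div
        return lr_min - ((pct - 0.9) / 0.1) * (lr_min - lr_max / final_div)

print(one_cycle_lr(0.0, 1.0))   # starts at lr_max / 25
print(one_cycle_lr(0.45, 1.0))  # peaks at lr_max
print(one_cycle_lr(1.0, 1.0))   # ends at lr_max / 100
```

Momentum follows the mirror image: it decreases during the warm-up and increases back during the cool-down.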

I think Jeremy will explain it tonight, but our more recent version of 1cycle is a tiny bit different.


Yes, about that difference: why did you change the linear cool-down to a cosine curve?

Because we found out it yielded better results in general.
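For reference, the cosine shape in question is the standard cosine-annealing formula; this is a sketch of the curve itself, not fastai's internals:

```python
import math

# Cosine annealing from lr_start to lr_end over the cool-down phase.
# Compared with a straight line, it stays near lr_start longer at first
# and flattens out again near lr_end.
def cosine_lr(pct, lr_start, lr_end):
    """pct is progress through the cool-down, in [0, 1]."""
    return lr_end + (lr_start - lr_end) * (1 + math.cos(math.pi * pct)) / 2

print(cosine_lr(0.0, 1e-2, 1e-4))  # ~1e-2, starts at lr_start
print(cosine_lr(1.0, 1e-2, 1e-4))  # ~1e-4, ends at lr_end
```

At pct=0.5 it passes through the arithmetic midpoint of the two rates, but with zero slope at both endpoints, which is the smoother behaviour a linear cool-down lacks.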
