Shedding some light on LR management in fastai

balnazzar · April 15, 2019, 3:24pm

But the whole point of final_div is to run some epochs with a very small learning rate. If you look at Gugger’s summary, he recommends something like one hundredth of the minimum. That is (leaving everything else at default) 1/25 * 1/100 of the maximum. If you pass a final_div of 2500, the plot should show the graph practically touching the line y=0, but this does not seem to happen.
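To make the arithmetic concrete, here is a minimal sketch in plain Python (not fastai code). The helper name one_cycle_bounds is made up for illustration, and it assumes that both div_factor and final_div divide the maximum learning rate, which is how I read the parameters:

```python
def one_cycle_bounds(max_lr, div_factor=25.0, final_div=2500.0):
    """Hypothetical helper: starting, peak and final LR of one cycle.

    Assumption: both divisors are applied to max_lr.
    final_div=2500 encodes "one hundredth of the minimum" (25 * 100).
    """
    start_lr = max_lr / div_factor   # minimum of the main cycle
    final_lr = max_lr / final_div    # value reached at the very end
    return start_lr, max_lr, final_lr


start_lr, peak_lr, final_lr = one_cycle_bounds(max_lr=0.025)
print(f"start={start_lr:.6f} peak={peak_lr:.6f} final={final_lr:.6f}")
```

With these numbers the final value is two orders of magnitude below the starting one, so on a linear-scale plot it should be visually indistinguishable from y=0.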

Also, from your example plot with final_div = 2, you can see that it alters the whole cycle, not just the final part. Looking at the typical linear example, the final part is the line segment with a slightly smaller negative slope:

The main cycle is the part of the graph above the line y=0.001; the final part is the one below that line. Setting that part should not alter the rest of the cycle.

I would not dare to tag Jeremy, but since he liked the answer, I would just ask him for clarification particularly about such matters.

Mh, why not? You can safely glue a cosine (in fact, since it starts from ~0, I would have called it sine annealing) with another function… It would be continuous nonetheless.
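A minimal sketch of what I mean by gluing, in plain Python. The function glued_schedule and all of its parameter values are hypothetical, not anything fastai provides: a cosine segment goes from the peak down to the cycle minimum, then a linear segment continues down to the final value.

```python
import math


def glued_schedule(t, max_lr=1.0, low_lr=0.04, final_lr=0.0004, split=0.8):
    """Hypothetical schedule for t in [0, 1].

    Cosine annealing from max_lr to low_lr on [0, split],
    then a straight line from low_lr to final_lr on (split, 1].
    """
    if t <= split:
        p = t / split
        return low_lr + (max_lr - low_lr) * (1 + math.cos(math.pi * p)) / 2
    p = (t - split) / (1 - split)
    return low_lr + (final_lr - low_lr) * p


# Values just before and just after the joint should agree
eps = 1e-9
left = glued_schedule(0.8 - eps)
right = glued_schedule(0.8 + eps)
print(left, right)
```

Both pieces evaluate to low_lr at the joint, so the schedule is continuous there (the slope is not, but that is no problem for a learning rate schedule).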

Just another thing: what is the point of having max_lr? It doesn’t seem to do anything useful…

I may be wrong, but I think that when you unfreeze the network, it gets split by default into three layer groups: the first half of the body, the second half of it, and the head. If you pass slice(a,b), a is applied to the first part of the body, b to the head, and something in between to the second part of the body. Indeed, rather than slice(), you can pass a list like [a,c,b], à la fastai 0.7. It’ll work.
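Here is a rough sketch of the idea with NumPy. The helper spread_lrs is hypothetical, and the geometric spacing of the middle value is my assumption about how the "something in between" is chosen; check the library source before relying on it:

```python
import numpy as np


def spread_lrs(lr, n_groups=3):
    """Hypothetical helper: one learning rate per layer group."""
    if isinstance(lr, slice):
        # Assumption: values are spaced geometrically from start to stop
        return np.geomspace(lr.start, lr.stop, n_groups)
    # An explicit list is taken as-is, one entry per group
    return np.asarray(lr, dtype=float)


from_slice = spread_lrs(slice(1e-5, 1e-3))
from_list = spread_lrs([1e-5, 3e-4, 1e-3])
print(from_slice)
print(from_list)
```

The explicit list is the way to go when you want a middle value other than the one the spacing rule would pick.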

I think not, but let’s wait and see if more informed fellas will answer.

Correct, but don’t confuse the “variation” of LR for the 1-cycle policy with differential learning rates. They are very different concepts, and as you train an unfrozen network, both of them are applied.
Suppose you run fit_one_cycle() on an unfrozen net, specifying a learning rate of lr=1 and a slice like slice(lr/a, lr). The following will happen:

It will do a cycle where the learning rate for the head varies between 1/25 and 1.
The learning rate for the first group of the body will go between 1/(25a) and 1/a.
The learning rate for the central group will vary accordingly within some middle ground. Looking at the code above, if there are more than three layer groups, the array will be spaced in proportion, unless you pass it explicitly.
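The three points above can be sketched numerically. This is plain NumPy, not fastai code; a=100 is an arbitrary example value, and the geometric spacing of the per-group peaks is again my assumption:

```python
import numpy as np

lr, a = 1.0, 100.0        # a=100 is just an example value
div_factor = 25.0         # default divisor for the start of the cycle
n_groups = 3              # first half of body, second half of body, head

# Assumption: per-group peaks are spaced geometrically between lr/a and lr
peaks = np.geomspace(lr / a, lr, n_groups)
starts = peaks / div_factor   # each group starts its cycle at peak / 25

for i, (low, high) in enumerate(zip(starts, peaks)):
    print(f"group {i}: cycles between {low:.6f} and {high:.6f}")
```

So every group follows the same one-cycle shape, just scaled by its own peak: that is the 1-cycle variation and the differential learning rates acting together.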

Read above, but ask more informed people for confirmation.

Not at all!