Varying Learning Rates for distinct layers

Hi,

I am a bit confused about how varying learning rates for distinct layers work together with 1Cycle.
In the first lesson we are introduced to 1cycle:

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))
You use this keyword in Python called slice and that can take a start value and a stop value and basically what this says is train the very first layers at a learning rate of 1e-6, and the very last layers at a rate of 1e-4, and distribute all the other layers across that (i.e. between those two values equally).
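
To make that description concrete, here is a minimal sketch of how I read it. spread_lrs is a hypothetical helper of my own, not a fastai function, and the choice of 3 layer groups is only for illustration. I am also assuming the values are spaced evenly on a log scale (a constant ratio between neighbouring groups); "equally" could also be read as linear spacing, which would give different middle values.

```python
def spread_lrs(start, stop, n_groups):
    """Hypothetical helper: one maximum learning rate per layer group.

    Assumes geometric (log-even) spacing between start and stop.
    """
    if n_groups == 1:
        return [stop]
    # Constant ratio between the learning rates of neighbouring groups.
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


# slice(1e-6, 1e-4) spread over 3 layer groups (3 is an assumption).
lrs = spread_lrs(1e-6, 1e-4, 3)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: max lr = {lr:.1e}")
```

Under these assumptions the first group gets 1e-6, the last gets 1e-4, and the middle group lands roughly at 1e-5.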

From what I have read (from the paper and here: https://sgugger.github.io/the-1cycle-policy.html#the-1cycle-policy), the learning rate range applies to the entire network, not just to some layers. The learning rate increases gradually over the first phase of training, then decreases over the second.
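
Here is a rough sketch of one way the two ideas could fit together, in case it helps frame the question: each layer group gets its own maximum learning rate, and the same 1cycle shape is applied to every group. This is only my guess at the mechanism, not confirmed library behaviour. one_cycle_lr is a hypothetical function, the linear ramps are a simplification, and div_factor=25, pct_start=0.3, the three group maxima, and the 100 steps are all assumed values for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3):
    """Simplified 1cycle shape: linear warm-up to max_lr, then linear decay.

    All parameter defaults are assumptions chosen for illustration.
    """
    low = max_lr / div_factor                      # starting learning rate
    up_steps = max(1, int(total_steps * pct_start))  # length of phase one
    if step < up_steps:
        # Phase 1: climb from low towards max_lr.
        return low + (max_lr - low) * step / up_steps
    # Phase 2: descend from max_lr back towards low.
    down_steps = max(1, total_steps - up_steps)
    return max_lr - (max_lr - low) * (step - up_steps) / down_steps


# One assumed maximum per layer group, e.g. derived from slice(1e-6, 1e-4).
group_max_lrs = [1e-6, 1e-5, 1e-4]
total = 100

# schedule[step][group] = learning rate of that group at that step.
schedule = [[one_cycle_lr(s, total, m) for m in group_max_lrs]
            for s in range(total)]

for s in (0, 30, 99):
    print(s, [f"{lr:.2e}" for lr in schedule[s]])
```

In this sketch every group rises and falls at the same moments, and the ratio between groups stays fixed throughout; only the scale differs per group.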

Could someone explain how the varying learning rates for distinct layers and 1Cycle work together? Am I missing something?

Thanks!
