Understanding max_lr for fit_one_cycle

I’ve been watching the videos and reading through the docs and feel like I have a good understanding of what fit_one_cycle is doing.

It varies the learning rate from a small value up to max_lr and then back down to a small value. When you pass a slice of two values, it spreads the learning rates across the layer groups, from the slice's start for the earliest layers to its stop for the last, so that the first layers train more slowly than the later layers.
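For anyone else puzzling over how the slice gets turned into per-layer-group rates, here's a rough sketch (not fastai's actual source; the group count and geometric spacing are my assumptions based on how the v1 library behaves) of the expansion:

```python
import numpy as np

def lr_range(lr, n_groups=3):
    """Sketch of how a learning-rate slice might be expanded into one
    rate per layer group. NOT fastai's real code; n_groups and the
    geometric spacing are assumptions for illustration."""
    if not isinstance(lr, slice):
        return [lr] * n_groups                   # one float applies to every group
    if lr.start is not None:
        # two-value slice: rates spaced at even multiples from start to stop
        return [float(x) for x in np.geomspace(lr.start, lr.stop, n_groups)]
    # stop-only slice (e.g. slice(3e-3)): earlier groups get stop/10,
    # the final group gets stop itself
    return [lr.stop / 10] * (n_groups - 1) + [lr.stop]

print(lr_range(slice(1e-5, 1e-3)))          # rates rise from 1e-5 to 1e-3
print(lr_range(slice(None, 0.003, None)))   # default: [0.0003, 0.0003, 0.003]
```

The nice property of even multiples is that each successive group's rate is the previous one multiplied by a constant factor, rather than increased by a constant amount.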

It looks like the default value is slice(None, 0.003, None).

I have a couple of questions:

  • If my model is frozen, does this learning rate scale over the whole model or just the unfrozen layers?
  • If I am training from scratch instead of using transfer learning, should I still be using a slice? Or should I train all layers at the same rate? My understanding was that we do this because the earlier layers are more likely to be “pretty good” already – but that doesn’t seem to be the case if the weights are randomly initialized.
  • In most of the notebooks we run fit_one_cycle with the default max_lr initially when the model is frozen – why do we not run lr_find on that step? Is None->0.003 just always a good choice for a frozen model?
  • What does the third None parameter to slice in the default signify? The docs say it’s a “step” – should it always be None?
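On the last question, note that this is just Python's built-in slice object, so the third field is the standard step slot that Python fills in with None when you don't supply it (whether fastai ever reads that field is a separate question I can't answer). You can check the filling-in behaviour directly:

```python
# Python's built-in slice fills any unspecified fields with None,
# so slice(0.003) and slice(None, 0.003, None) are the same object value.
s = slice(0.003)
print(s.start, s.stop, s.step)              # None 0.003 None
print(slice(0.003) == slice(None, 0.003, None))  # True
```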

@yeldarb Do you have any insight into the questions above?
I'd really appreciate the help. Some attention from @jeremy or @sgugger on these questions would be lovely too.

Hey there,

I ran into this thread earlier, and I've looked into a (partially) similar matter in the thread below with the help of @muellerzr; just dropping it here in case it's relevant to any of you.