Looking for an explanation of max_lr

Hi everyone

I’ve been trying to figure out the meaning of the max_lr parameter when passing it a slice.

Can somebody explain to me the meaning of the following line of the training process?

learn.fit_one_cycle(5, max_lr=slice(1e-6/2., 1e-5))

What is the right way to choose the parameters for max_lr=slice(left_param, right_param)?

Best regards,


Hi Jonathan,
at first I suggest you to read one cycle paper by Leslie Smith because all this is based on that. To put it simple, when using fit_one_cycle method, all the iterations that happen in the training time are divided into two parts:
1- in the first part, your learning rate is getting higher and higher. How high it gets finally and when it stops increasing? It depends somehow! If you give it two numbers, the different group layers get trained with different learning rates. The closer the layer to your input, the lower lr it gets and the highest lr these layers will have will be equal to the first number you give to the slice, 1e-6/2. in your example. For layers near the output in your model, they get trained at a higher lr and the highest lr they reach in the cycle of training is equal to the second number you give to the slice, 1e-5 in your example. The layers between the first and last layers of your model, get proportional lrs according to the groups they are in.
These trend of increasing lr is stopped at pct_start*total_iterations
2- the second part of the training is where the learning starts to file and that will make it a full cycle
I hope it makes it clearer for you
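The two phases above can be sketched in plain Python (a simplified illustration, not fastai’s internals - the real schedule interpolates with cosine rather than linearly, and the pct_start default here is just an example):

```python
# Simplified one-cycle schedule: LR rises until pct_start * total_iterations,
# then falls back down, completing the cycle. (Linear for clarity; the
# actual fastai schedule uses cosine interpolation.)

def one_cycle_lr(iteration, total_iterations, max_lr, start_lr, pct_start=0.3):
    peak = max(1, int(pct_start * total_iterations))
    if iteration <= peak:
        # Phase 1: warm up from start_lr to max_lr
        return start_lr + (max_lr - start_lr) * iteration / peak
    # Phase 2: anneal from max_lr back toward start_lr
    return max_lr - (max_lr - start_lr) * (iteration - peak) / (total_iterations - peak)

print(one_cycle_lr(0, 100, 1e-3, 1e-5))  # start of the cycle: 1e-05
```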
Regards :slight_smile:

Thanks @moein

That means, if my network architecture has 30 layers and max_lr=slice(1e-6/2., 1e-5) … explained numerically … what LR does each layer get?

I’d appreciate it if you shared the paper you mentioned. I apologize if my questions are a little newbie, but I really need to understand through discussion rather than from the docs (something I have spent a lot of time on) …

Best regards again

@PalaashAgrawal slightly incorrect.

If passing a slice, this is done to perform what’s called differential learning rates. This is very important when it comes to transfer learning, as we have learning rate groups. I.e., when we want to fine-tune a model, we don’t want to retrain the base layers as much, so we keep that LR lower, whereas we do want to train the newer layers more, so we keep that LR higher. A way to see this is that if you pass slice(1e-3, 1e-2) to, say, a tabular model, where there are no layer groups, it’ll throw an error because there is only one group.

So generally the first is a slower learning rate, as this is for our pre-trained layers, and the second is a faster learning rate for our newer layers. The number of LRs depends on how many splits you have. For instance, I’ve seen cases where people use four because they have four different layer groups within the model.

The div parameter is what controls the minimum (starting) LR used.
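As a quick numeric sketch (assuming the common convention that one-cycle starts warming up from max_lr divided by div; the default div value differs between fastai versions, so 25 here is just an example):

```python
# The starting LR of the warm-up is derived from max_lr and div.
# (The div value here is illustrative; check your fastai version's default.)
max_lr = 1e-3
div = 25.0
start_lr = max_lr / div
print(start_lr)  # ~4e-05
```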


This is explained in more detail in lecture 5 - https://youtu.be/CJKnDu2dxOE?t=983

Quoting the relevant part from notes by hiromis:

One slight tweak - to make things a little bit simpler to manage, we don’t actually give a different learning rate to every layer. We give a different learning rate to every “layer group”, which is just groups of layers we decided to put together for you. Specifically, the randomly added extra layers we call one layer group. This is by default; you can modify it. Then all the rest we split in half into two layer groups.

By default (at least with a CNN), you’ll get three layer groups. If you say slice(1e-5, 1e-3), you will get a 1e-5 learning rate for the first layer group, 1e-4 for the second, and 1e-3 for the third. So now if you go back and look at the way that we’re training, hopefully you’ll see that this makes a lot of sense.
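The 1e-5 / 1e-4 / 1e-3 spread in the quote can be reproduced with a small sketch of multiplicatively even (log-even) spacing between the slice endpoints (this reimplements the idea behind fastai’s helper, not its exact code, and the function name here is mine):

```python
def spread_lrs(start, stop, n_groups):
    """Spread n_groups learning rates multiplicatively evenly from start to stop."""
    if n_groups == 1:
        return [stop]
    # Each group's LR is the previous one times a constant multiplier
    mult = (stop / start) ** (1 / (n_groups - 1))
    return [start * mult ** i for i in range(n_groups)]

print(spread_lrs(1e-5, 1e-3, 3))  # roughly [1e-05, 1e-04, 1e-03]
```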

You’re right! Thanks for helping me revise. In fact, I was totally off!

The learning rate is differentiated layer-wise. The lower limit of the slice is assigned to the first layer group, and the upper limit is assigned to the final layer group.


Both this and this are good papers to read. You will find a lot more and gain solid intuition.

thanks @moein, @PalaashAgrawal, @amardeep and @muellerzr

In the case of a resnet34 model, how many layer groups does it have? …

Best regards

In the case of any standard fastai pretrained model where we cut off the head of the model and add custom layers (done via cnn_learner): 2. Whether it’s a resnet50, 152, 18, etc., it’s two: the pretrained encoder (the pretrained body of the model) and the fastai custom head. The encoder starts frozen while the head is unfrozen.

So, you’re telling me that a resnet34 model has two groups:

  1. Group 1: the entire pretrained and frozen model
  2. Group 2: the extra layer group added at the end of the model, which customizes my model.

So @muellerzr , this call:

learn.fit_one_cycle(5, max_lr=slice(1e-6/2., 1e-5))

Applies an LR of 1e-6/2 to the whole pretrained resnet34 model?
Applies an LR of 1e-5 to the custom head?

If I’m mistaken and you consider that I’m lacking some NN concepts … do you have any good resource?

I passed and got the certificate for Andrew Ng’s Machine Learning course on Coursera, and to practice the concepts and linear algebra, I did all the homework with NumPy and PyTorch.

I’m on Lesson 3 of version 3 of the coders’ course. However, fastai does a terrific job of abstracting away many of the complexities of working with NNs in PyTorch. That high level of abstraction doesn’t let me understand the underlying concepts … so I decided to rewrite all of Andrew Ng’s homework using PyTorch’s NN modules (not plain linear algebra this time).

I don’t know whether to continue with the fastai lessons, even though many friends have told me that in later lessons Jeremy explains many concepts (in a top-down way, the opposite of Andrew Ng’s bottom-up way).

Best regards.

Correct - however, I should have been more specific.

The pretrained model is every layer except the final layer (the 2048 → n_classes one, or however many filters that final layer outputs); instead, we pop on fastai’s custom head.

Correct - the latter applies to the head (the new layers added on top).
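Putting the exchange together as a tiny sketch (the two-group mapping is taken from the discussion above; slice here is just plain Python, not a fastai type):

```python
# learn.fit_one_cycle(5, max_lr=slice(1e-6/2., 1e-5)) with two layer groups:
# the lower bound goes to the pretrained body, the upper bound to the new head.
lr_slice = slice(1e-6 / 2.0, 1e-5)
body_lr = lr_slice.start  # -> frozen/pretrained encoder
head_lr = lr_slice.stop   # -> fastai custom head
print(body_lr, head_lr)  # 5e-07 1e-05
```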

Nope, you’re doing just fine :slight_smile: I’d scour the forums for any similar topic or question you may have, and slowly work through Jeremy’s course; this is all I’ve done :slight_smile:

100% do! At least up to lesson 5. From there it gets very heavy; however, getting through lesson 5 gives you a very strong footing. Then, when you’re ready, take the later lessons slowly, as they are much more about what it does and why.

Thanks @muellerzr :slight_smile:

Hi, can you explain to me why we use learn.fit_one_cycle(5, slice(lr)) in training? What is slice(lr) doing?