What does slice(lr) mean in fit_one_cycle()?

In Lesson 3 - planet, I saw these 2 lines of code:

lr = 0.01
learn.fit_one_cycle(5, slice(lr))

With slice(min_lr, max_lr), I understand that fit_one_cycle() will spread learning rates between min_lr and max_lr across the layer groups. (Hopefully my understanding of this is correct.)

But in this case, slice(lr) has only one argument.

What are the differences between fit_one_cycle(5, lr) and fit_one_cycle(5, slice(lr))?
And what are the benefits of using slice(lr) instead of passing lr directly?

With the former, every parameter group will use a learning rate of lr, whereas with the latter, the last parameter group will use a learning rate of lr, while the other groups will have lr/10.
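
Concretely, assuming a hypothetical model with 3 layer groups:

lr = 0.01
# learn.fit_one_cycle(5, lr)         -> all 3 groups train at 0.01
# learn.fit_one_cycle(5, slice(lr))  -> groups train at [0.001, 0.001, 0.01]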

Thanks @immarried. When you say "the former", do you mean using slice(lr)?
And by "the latter", do you mean using slice(min_lr, max_lr)?

I'm new to the AI and Python world, so I'm missing a lot of concepts.

E.g. layer_groups: is a layer group the same thing as an epoch?

And what is the syntax of these lines?

  1. if isinstance(lr, slice): return lr
  2. if lr.start: res = xxxx

Yes, say you have 3 layer groups: group 1, 2 and 3. max_lr=slice(1) means that the learning rate for group 3 is 1, and 0.1 for groups 1 and 2. max_lr=1 means the learning rate is 1 for groups 1, 2 and 3.

A model’s weights/parameters can be divided into different groups, called “parameter groups” or “layer groups”. You can give each group a different learning rate, and then during training, parameters from different groups will be updated using these different learning rates.
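
To make this concrete, here is a minimal sketch of the dispatch logic, loosely based on fastai v1's Learner.lr_range (the real method uses even_mults and reads the group count from self.layer_groups, so treat this as an approximation). The isinstance(lr, slice) and lr.start lines you asked about come from that method; in Python, an if with a colon may put its body on the same line:

import numpy as np

# Sketch: expand an lr argument into one learning rate per layer group.
def lr_range(lr, n_groups=3):
    if not isinstance(lr, slice):
        return np.full(n_groups, float(lr))   # plain lr: same rate everywhere
    if lr.start:
        # slice(start, stop): geometric progression from start to stop
        return np.geomspace(lr.start, lr.stop, n_groups)
    # slice(stop): stop/10 for every group except the last, which gets stop
    return np.array([lr.stop / 10] * (n_groups - 1) + [lr.stop])

print(lr_range(1))                  # [1. 1. 1.]
print(lr_range(slice(1)))           # [0.1 0.1 1. ]
print(lr_range(slice(1e-5, 1e-4)))  # ~[1e-05, 3.16e-05, 1e-04]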

How is this going to help us? I mean, what is the benefit of having a different lr for different groups?
Besides that, doesn't slice in Python take the parameters slice(start, stop, step)? I can't recognize these parameters in this line of code:

learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

@ZahraHkh
Applying a different lr to different groups is a technique called "discriminative layer training", which is introduced in Part 1. This technique is commonly used in both computer vision and natural language processing. You can refer to this fastai doc for more details.
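
For example, a common fastai training pattern looks like this (a sketch with placeholder learning rates, assuming learn is an already-created Learner):

learn.freeze()                              # first train only the head
learn.fit_one_cycle(3, 1e-3)
learn.unfreeze()                            # then fine-tune every layer group
learn.fit_one_cycle(3, slice(1e-5, 1e-3))   # earlier groups get smaller lrs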

I am not sure I fully understand your second question. Were you asking why slice() can be passed only 1 or 2 arguments here, when it in fact takes up to 3?

slice() can indeed be called with only 1 or 2 arguments; the missing arguments default to None. Below is a snippet of experiments for your reference:

In [9]: slice(5)
Out[9]: slice(None, 5, None)

In [10]: slice(1, 5)
Out[10]: slice(1, 5, None)

Therefore, in your last line, slice(5e-3/(2.6**4), 5e-3) is equivalent to slice(start=5e-3/(2.6**4), stop=5e-3, step=None).
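
As a quick sanity check, you can unpack the slice yourself (the 2.6**4 divisor is the discriminative fine-tuning factor popularized by the ULMFiT paper):

lr = slice(5e-3 / (2.6 ** 4), 5e-3)
print(lr.start, lr.stop, lr.step)  # ~0.000109415 0.005 None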

I hope this helps!

Thank you so much for your complete and thorough answer. 🙂
Yes, that was what I meant, and it indeed helped me a lot. 🙂

Thanks a lot @riven314 for this answer on discriminative learning; it cleared up a long-unresolved doubt of mine. Thanks also to @ZahraHkh for raising the question about the step argument.

You are welcome 🙂

Hi,

This cleared up a lot of ambiguity, but I am still stuck on using a two-argument slice for max_lr:

learn.fit_one_cycle(20, max_lr=slice(1e-5, 1e-4))

Thank you,

In this case, slice(1e-5, 1e-4) is essentially slice(start=1e-5, stop=1e-4, step=None).

From the source code (fastai2), it triggers discriminative learning: 1e-5 will be the learning rate applied to the earliest layer group (closest to the input), and 1e-4 will be the learning rate applied to the last layer group (the head). The groups in between get learning rates spread between 1e-5 and 1e-4 in a geometric progression.

See the definition of set_hyper for more details on how the learning rates are distributed across layers: link
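
For reference, the geometric spreading is done by fastai's even_mults helper; here is a minimal sketch of the idea (the real function also handles the n == 1 edge case):

import numpy as np

# Sketch of even_mults: n values stepping geometrically from start to stop.
def even_mults(start, stop, n):
    step = (stop / start) ** (1 / (n - 1))
    return np.array([start * step ** i for i in range(n)])

print(even_mults(1e-5, 1e-4, 3))
# [1.00000000e-05 3.16227766e-05 1.00000000e-04]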

Hi Riven,
I am currently working on an image classification project for melanoma classification, using fastai. I have 3 layer groups: the first two belong to the densenet161 backbone, and the last is a fully connected layer group.

In training, I first freeze the model and train it for a few epochs at a small learning rate (found with the lr_find() function). Then I unfreeze the model and train it using the slice function. I am confused about which two learning rates to choose. Is there a way to approximate these rates using lr_find()?