Okay, then I’ll say it again, but this time in more detail and addressing the more ambiguous parts separately.
Q4:
“Gradual unfreezing” and “discriminative learning rates” are two related concepts, but they are not the same. They can be used together (as in the example above), but they can also be used separately (if you want).
And as stated in my first answer, “gradual unfreezing” means we unfreeze a few layers at a time, more of them in succeeding stages of training, while “discriminative learning rates” means we use different learning rates for different parameter groups. You see, they are not the same.
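To make the distinction concrete, here is a minimal, hypothetical sketch (assuming a fastai-style learner called `learn`; the lr values are just placeholders), showing each technique used on its own:

```python
# Gradual unfreezing WITHOUT discriminative learning rates:
# unfreeze more of the network over time, but use a single lr each time.
learn.freeze()                    # train only the head first
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()                  # then make everything trainable
learn.fit_one_cycle(1, 1e-3)

# Discriminative learning rates WITHOUT gradual unfreezing:
# everything is trainable from the start, but the body gets a smaller lr than the head.
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-4, 1e-2))
```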
We also said that the deeper layers don’t need such a high learning rate, because they are closer to the input, and these lower layers have likely learned more basic features. The later layers are closer to the output.
And here we stop for a moment to clarify a few more things.
deeper = lower = closer to the input = closer to the bottom = earlier layers
higher = closer to the output = closer to the top = later layers
We build the layers from the bottom up, but the numbering starts from 0 at the bottom.
This might be confusing, but layer 0 is the deepest layer; it’s the bottom layer of the net.
Based on your questions, you might have thought that 0 was the top.
I’m linking another post of mine here, because you can see the structure of a network printed there:
other helping post
(it’s a CNN, not an AWD LSTM, but the directions are the same :))
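If it helps, here is a tiny self-contained PyTorch example (a toy net, not the AWD LSTM) showing that index 0 is the layer the input hits first:

```python
import torch.nn as nn

# A toy 3-layer net: index 0 is the bottom/deepest layer (closest to the input),
# the highest index is the top layer (closest to the output).
net = nn.Sequential(
    nn.Linear(10, 8),   # layer 0 - the input goes through this first
    nn.Linear(8, 4),    # layer 1
    nn.Linear(4, 2),    # layer 2 - closest to the output
)
print(net)  # the printout lists (0), (1), (2) from the input side towards the output
```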
Now we are pretty sure what “gradual unfreezing” and “discriminative learning rates” are, and which direction is the bottom and which is the top.
We need all of these for the following questions.
Q5:
The slice handles the body & head separately, so it gives one learning rate to group 0 (the body) and another learning rate to group 1 (the head). (And again, because we use 2 different learning rate values here, we call it “discriminative learning rates” - we could just as easily enter 2 identical values, but then it would not be discriminative.) We can also see that the body’s lr is smaller than the head’s lr here.
So we didn’t lie - the lower part got the smaller lr.
In this book, `learn` is a text_classifier_learner, which has 2 main parts: the “body” is an AWD LSTM, and above it sits a classifier “head”. This slice thing always applies to these 2 parts - the “body” and the “head” - not to the sublayers. It can be confusing.
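A minimal sketch of what that looks like (the lr values are just placeholders, not the book’s):

```python
# Two parameter groups, two learning rates:
# group 0 (body, the AWD LSTM) gets 1e-4, group 1 (head, the classifier) gets 1e-2.
learn.fit_one_cycle(1, slice(1e-4, 1e-2))

# Still valid, but no longer "discriminative" - both groups get the same lr:
learn.fit_one_cycle(1, slice(1e-2, 1e-2))
```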
The “body” is an AWD LSTM with 4 layers, if I remember correctly. That’s why we can use freeze_to(-2) first, then freeze_to(-3), and finally unfreeze all 4. So the confusing part is that freeze_to refers to layers, while fit_one_cycle’s slice refers to the body+head (AWD LSTM + classifier).
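Putting it together, the fine-tuning sequence in the book looks roughly like this (I’m writing the lr values from memory, so double-check them against the book’s notebook):

```python
# Stage 1: the learner starts out frozen, so only the head trains.
learn.fit_one_cycle(1, 2e-2)

# Stage 2: unfreeze the last two parameter groups.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))

# Stage 3: unfreeze one more group.
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3))

# Stage 4: unfreeze the whole network.
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))
```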
Q6:
After the answer to Q5, you can now see that all of these slices refer only to body+head - that’s why they have only 2 learning rates, 1 for the body and 1 for the head (the network has more sublayers than that, but every sublayer belongs either to the body or to the head, so they get these different lrs through the body and the head, if you will).
We do the “gradual unfreezing” of the network with all the learn.freeze_to() calls, and we use “discriminative learning rates” via learn.fit_one_cycle()’s max_lr slice parameter (different lrs for body & head).
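If you want to check what is actually trainable after a freeze_to() call, a quick sanity check (plain PyTorch on learn.model, nothing fastai-specific) is to count the parameters that still require gradients:

```python
# Parameters in frozen groups typically have requires_grad=False,
# so this ratio changes after each freeze_to() / unfreeze() call.
trainable = sum(p.numel() for p in learn.model.parameters() if p.requires_grad)
total = sum(p.numel() for p in learn.model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```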
Q7:
The first argument of fit_one_cycle is cyc_len (cycle length), and yes, as you said, it is the number of epochs in the cycle - so 2 epochs in this example.
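So in the last call of the example above (lr values again from memory), the 2 is the cycle length in epochs:

```python
# one 1cycle schedule stretched over 2 epochs
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))
```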
Q8:
I think in your link there is a 1cycle policy description with 3 steps - ok, it’s not a formal definition, but it’s close to one. The 1cycle policy really just means progressively increasing the lr and then progressively decreasing it.
The length of the whole cycle is given in epochs (this is the relationship between the two).
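Here is a rough, hypothetical sketch of such a schedule (not fastai’s exact implementation - the warmup percentage and the divisors are made-up defaults), just to show the lr going up and then back down over a cycle whose total number of steps comes from the number of epochs:

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_warmup=0.3, div_start=25.0, div_final=1e4):
    # Cosine ramp-up from lr_max/div_start to lr_max, then cosine anneal
    # down to lr_max/div_final. A sketch of the idea, not fastai's code.
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        t = step / max(1, warmup_steps)
        start, end = lr_max / div_start, lr_max          # going up
    else:
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        start, end = lr_max, lr_max / div_final          # going down
    # cosine interpolation between `start` and `end`
    return start + (end - start) * (1 - math.cos(math.pi * t)) / 2

# The cycle length is given in epochs: total_steps = epochs * batches_per_epoch.
# E.g. 2 epochs of 100 batches each -> one cycle of 200 steps.
lrs = [one_cycle_lr(s, 200, lr_max=1e-2) for s in range(200)]
print(f"min lr = {min(lrs):.2e}, max lr = {max(lrs):.2e}")
```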
If you want to learn more about cycles, you can read Leslie Smith’s 2018 paper linked on the site, and you can read his earlier 2017 paper too.
The 2017 version is about cycles - there is a good picture right on page 2 of the pdf.
The 2018 one is about the 1cycle policy.