Understanding gradual unfreezing of a model

I am reading the fastai book. In “Chapter 10: NLP Deep Dive: RNNs”, in the section “Fine-Tuning the Classifier”, it says:

The last step is to train with discriminative learning rates and gradual unfreezing. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:

    learn.fit_one_cycle(1, 2e-2)

We can pass -2 to freeze_to to freeze all except the last two parameter groups:

    learn.freeze_to(-2)
    learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

I have the following doubts:

  1. What is meant by “descriminative learning rate”?
  2. In “freeze all except the last two parameter groups”, what exactly is meant by “the last two parameter groups”? And what exactly does “gradual unfreezing” mean? Does it mean updating the weights of more and more neural network layers in each succeeding epoch? If yes, does learn.freeze_to(-2) mean updating the weights of only the last two layers?
  3. Why does unfreezing the whole model at once work well in computer vision, while for NLP gradual unfreezing is better suited? What insight am I missing here?

For questions 1 and 2:

Yes, just as you said, learn.freeze_to(-2) means we freeze everything except the last two parameter groups.

And yes, you already answered it: “gradual unfreezing” means we unfreeze a few layers at a time, more in each succeeding epoch.
“Discriminative learning rates” (“descriminative” is a typo) means we use different learning rates for different parameter groups. The reason is that the deepest layers don’t need such a high learning rate: they are closer to the input (the later layers are closer to the output), and these lower layers have typically learned more basic features.
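
To make that concrete, here is a one-line illustration (the learning-rate values are hypothetical, not from the book): passing a slice as the learning rate is how this is expressed in fastai, with the lower parameter groups getting the small end and the head getting the large end.

    # hypothetical values, just to illustrate discriminative learning rates
    learn.fit_one_cycle(1, slice(1e-4, 1e-2))  # lower groups get ~1e-4, the head ~1e-2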


For question 3:

In fact, gradual unfreezing is better in computer vision too for very deep models; you can see examples of this in many competitions, so I wouldn’t say this is a topic that only concerns NLP.

But it depends on a lot of things: which model, how deep it is, its structure, what data it was pretrained on, what data it is now being fine-tuned on, and so on.


(I know I am asking many follow-up questions. Very sorry in advance. Answer them as you can; I am numbering them for convenience.)

Q4. So is gradual unfreezing basically done to apply a smaller learning rate to deeper layers and a bigger learning rate to shallower (initial / input-side) layers?

I just want to give the whole code from the book:

In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:

    learn.fit_one_cycle(1, 2e-2)

In just one epoch, we get the same result as our training in Chapter 1—not too bad! We can pass -2 to freeze_to to freeze all except the last two parameter groups:

    learn.freeze_to(-2)
    learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

Then we can unfreeze a bit more and continue training:

    learn.freeze_to(-3)
    learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

And finally, the whole model!

    learn.unfreeze()
    learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

I am finding it difficult to understand what he is trying to do here.

Q5. I understand 1e-2 > 5e-3 > 1e-3. But then what exactly does slice(1e-2/(2.6**4),1e-2) do here?

Q6. Jeremy is using

  • slice(1e-2/(2.6**4),1e-2) for the last two layers
  • slice(5e-3/(2.6**4),5e-3) for the last three layers
  • slice(1e-3/(2.6**4),1e-3) for all layers

Right? If yes, then isn’t he applying the smallest learning rates to all layers in the last cycle? And doesn’t this contradict what we said in Q4 about deeper layers requiring the smaller learning rate?

Q7. What does the first argument 2 in fit_one_cycle(2,... mean? The doc says it’s the cycle length. Is it the number of epochs in the cycle?

Q8. I guess I should spend some more time understanding what exactly a “cycle” is, from here, Sylvain’s post, and Leslie’s paper linked there. But can you give a couple-of-sentences definition of a cycle (vs an epoch)? I was unable to find the definition on a quick read-through.

Okay, then I’ll say it again, but in more detail, and I’ll address the more ambiguous parts separately.

Q4:
“Gradual unfreezing” and “discriminative learning rates” are two related concepts, but they are not the same. They can be used simultaneously (as in the example above), but they can also be used separately (if you want).
And as stated in my first answer too, “gradual unfreezing” means we unfreeze a few layers at a time, more in succeeding epochs, and “discriminative learning rates” means we use different learning rates for different parameter groups. You see, they are not the same.
We also said that the deeper layers don’t need such a high learning rate, because they are closer to the input and these lower layers may have learned more basic features. The later layers are closer to the output.
And here we stop for a moment to clarify a few more things.
deeper = lower = closer to the input = closer to the bottom = earlier layers
higher = closer to the output = closer to the top = later layers
We build the layers from the bottom up, but we start from 0 at the bottom.
This might confuse you, but layer 0 is the deepest layer; it’s the bottom layer of the net.
Based on your questions, you might have thought that 0 was the top.
I’ll link my answer from another post here, because you can see the structure of a network printed there:
other helping post
(it’s a CNN, not an AWD LSTM, but the directions are the same :))
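
To make the distinction concrete, here is a small sketch (the learning-rate values are illustrative, not the book’s): the first part uses gradual unfreezing on its own, the second uses discriminative learning rates on their own.

    # illustrative values only
    # gradual unfreezing without discriminative learning rates:
    learn.freeze_to(-2)              # only the last two parameter groups are trainable
    learn.fit_one_cycle(1, 1e-2)     # a single lr for everything that is trainable

    # discriminative learning rates without gradual unfreezing:
    learn.unfreeze()                           # everything is trainable
    learn.fit_one_cycle(1, slice(1e-4, 1e-2))  # lower groups get the smaller lr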

Now we are pretty sure what “gradual unfreezing” and “discriminative learning rates” are, and which direction is the bottom and which is the top.
We need all of these for the following questions.

Q5:
Slice handles the body and head separately: it gives one learning rate to group 0 (the body) and another learning rate to group 1 (the head). (And because we use two different learning rate values here, that’s why we call it “discriminative learning rates” - we could easily enter two identical values, but then it wouldn’t be discriminative.) We also see that the body’s lr is smaller than the head’s lr here.
So we didn’t lie - the lower part got the smaller lr :)

In this book, “learn” is a text_classifier_learner, which has two main parts: the “body” is an AWD LSTM and above it sits a classifier “head”. This slice always applies to these two parts - the “body” and the “head” - not to sublayers, which can be confusing.
The “body” is an AWD LSTM with 4 layers, if I remember correctly. That’s why we can call freeze_to with -2 first, then -3, and finally unfreeze everything. So the confusing part is that freeze_to refers to layers, while fit_one_cycle's slice refers to body + head (AWD LSTM + classifier).
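
Just to make the numbers concrete (plain Python arithmetic, nothing fastai-specific):

    low_lr  = 1e-2 / (2.6 ** 4)   # ≈ 2.19e-4, the small end, for the lower part
    high_lr = 1e-2                # the large end, for the head
    print(f"{low_lr:.2e} -> {high_lr:.2e}")   # the head's lr is 2.6**4 ≈ 45.7x larger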

Q6:
After the answer to Q5, you can now see that all of these slices refer only to body + head - that’s why they have only two learning rates, one for the body and one for the head (the network has more sublayers than that, but all the sublayers live in the body or in the head, so they get these different lrs through the body and the head, if you will).
We “gradually unfreeze” the network with the learn.freeze_to() calls, and we use “discriminative learning rates” with learn.fit_one_cycle()'s max_lr slice parameter (different lrs for the body and the head).
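
If you want to check for yourself what each freeze_to call leaves trainable, one option (just a suggestion, not something the book does) is to print the learner summary after each step; it reports which parameters are currently trainable.

    learn.freeze_to(-2)
    print(learn.summary())   # shows which parameters are trainable vs frozen
    learn.freeze_to(-3)
    print(learn.summary())
    learn.unfreeze()
    print(learn.summary())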

Q7:
The first argument of fit_one_cycle is cyc_len (cycle length), and yes, as you said, it is the number of epochs in the cycle - so 2 epochs in this example.

Q8:
I think in your link there is a 1cycle policy description with 3 steps - OK, it’s not a definition, but it is close to one. The 1cycle policy really just progressively increases the lr and then progressively decreases it.
The whole cycle’s length is measured in epochs (that is the relationship between the two).
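
Here is a toy sketch of that shape (my own simplification in plain NumPy, not fastai’s actual schedule, which anneals more smoothly): the lr first ramps up towards the maximum, then back down, and the whole cycle spans the requested number of steps/epochs.

    import numpy as np

    # simplified 1cycle-style schedule: linear ramp up, then linear ramp down
    def toy_one_cycle(lr_max, n_steps, pct_start=0.25, div=25.0):
        n_up = max(1, int(n_steps * pct_start))
        up   = np.linspace(lr_max / div, lr_max, n_up)
        down = np.linspace(lr_max, lr_max / div, n_steps - n_up)
        return np.concatenate([up, down])

    print(toy_one_cycle(1e-2, 10).round(4))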

If you want to learn more about cycles, you can read Leslie Smith’s 2018 paper linked on the site, and you can also read his earlier 2017 paper.
The 2017 paper is about cycles - there is a good picture right on page 2 of the PDF.
The 2018 paper is about the 1cycle policy.
