Understanding fine-tuning

I am trying to understand the reasoning behind the steps of fine-tuning and would appreciate it if someone could give me some pointers.

Here are the steps involved and my understanding of the reason and questions…

  1. unfreeze() - tell the learner to retrain all layers
  2. fit_one_cycle(1) - train all layers
    Why just 1 epoch this time and not 4 like before?
  3. load('stage-1') - load all the weights that were saved earlier.
    Does every layer have individual weights?
    Why retrain all the layers if we already have weights?
  4. lr_find() - find learning rate
  5. recorder.plot() - plot to find a good range of learning rates
  6. unfreeze() - why unfreeze again? I thought we already did that.
  7. fit_one_cycle(2) - why 2 epochs this time?

Thanks in advance for your help.

IMO, your steps look incomplete/incorrect to me, so let me write out the steps as I learned them:

Stage 1: train the last layer

  1. Start with a pre-trained model’s weights (pre-trained on ImageNet). The last layer is removed and a new layer for our classes is added with random weights.
  2. Freeze the pre-trained weights (the default behaviour in fastai, I believe), since they are already “good” starting weights, and train only the last layer (since its weights are random).
  3. When the last layer is performing fine, i.e. its weights are also “good” starting weights, we move to stage 2.

Stage 2: unfreeze and train
Now we unfreeze all the weights and try to train the whole model.
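
For concreteness, here is roughly what that two-stage workflow looks like in fastai v1 lesson-1 style code; the dataset path, epoch counts and learning rates below are just illustrative assumptions:

```python
from fastai.vision import *

# Assumed: a folder of images organised into one sub-folder per class.
path = Path('data/my_images')
data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=224).normalize(imagenet_stats)

# Pre-trained on ImageNet; fastai replaces the last layer with a new,
# randomly initialised head for our classes and freezes the rest.
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

# Stage 1: train only the new head while the pre-trained body stays frozen.
learn.fit_one_cycle(4)
learn.save('stage-1')

# Stage 2: unfreeze everything, pick a learning rate, train the whole model.
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
```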

Your answers:

There is no particular reason for fitting for just 1 epoch; you can train for more… it depends on the use case.

I guess you are confused here. We have weights for each layer. We always have weights; it is just that untrained weights are random. We train the model to change the weights from random values to useful ones.
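
You can see this by listing the model’s parameters: every layer has its own weight tensors, and requires_grad tells you which ones are currently frozen. A quick sketch, assuming a fastai/PyTorch learner called learn:

```python
# Each layer owns its own weight (and bias) tensors. "Frozen" only means
# the optimiser is not allowed to update them (requires_grad is False).
for name, param in learn.model.named_parameters():
    print(f'{name:45s} shape={tuple(param.shape)} trainable={param.requires_grad}')
```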

Maybe you can share the notebook; there is no reason to run unfreeze again (unless it is unfreezing layer by layer, which is useful for NLP tasks, I guess). You can run fit again to train for more iterations.
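
If a notebook does unfreeze in stages (as ULMFiT does for text models), it would look something like the sketch below, where fastai’s freeze_to(n) keeps everything before layer group n frozen; the epoch counts and learning rates are assumptions:

```python
# Gradual unfreezing (ULMFiT-style): train the last layer group first,
# then unfreeze progressively deeper groups with lower learning rates.
learn.freeze_to(-1)
learn.fit_one_cycle(1)

learn.freeze_to(-2)
learn.fit_one_cycle(1, max_lr=slice(1e-4, 1e-3))

learn.unfreeze()
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-3))
```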

Thanks for taking the time to answer my questions, Abhilash. Really appreciated.

So stage 1 is using the pre-trained weights for the previous layers and just training the last layer.

And stage 2 is unfreezing all the layers to improve their weights using the new learning rate, is that correct?

I’m facing the same confusion. Hopefully someone who understands the concept can answer this question.

Not an expert, but here is my understanding of the topic:

If you train the whole network without freezing the pre-trained part, it will be unstable: the non-pre-trained layers will work badly at the start, generating big errors, which can end up damaging your good, working pre-trained layers, since they get modified too. You may eventually decrease the error until you end up as good as the other method, but you will have used more epochs than necessary, and training the whole network takes longer per epoch.

So the idea is that you assume the pre-trained part is good (which is usually the case, and the reason you are using a pre-trained network in the first place) and already works pretty well. You only train the last part, so the error only impacts that part; it trains faster and you get more stability. Once that part gets pretty good too, it’s time to unfreeze everything and train again. The loss won’t be big by this point, and you give the pre-trained layers a chance to be adjusted to better fit your particular dataset. Even then, it’s usually better to give these layers a lower learning rate than the last ones, since you don’t want to change them too much.
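
In plain PyTorch terms, that recipe is just requires_grad flags plus optimiser parameter groups; a minimal sketch, assuming a model with body (pre-trained) and head (new) sub-modules:

```python
import torch

# Stage 1: freeze the pre-trained body and train only the new head.
for p in model.body.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
# ... train the head until it is reasonable ...

# Stage 2: unfreeze everything, but give the pre-trained body a much
# smaller learning rate so its already-good weights are only nudged gently.
for p in model.body.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {'params': model.body.parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 1e-4},
])
```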

When you create a model to be used as a pre-trained model by other people, it’s usually good practice to stop a bit early: don’t keep the best model with the least error; stop a little earlier, as that will generalize better. This also helps during this last step of unfreezing the whole network and readjusting it a bit, the bit that was not done in the pre-trained one.
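
One simple way to implement that “stop a bit early” idea is patience-based early stopping on the validation loss; a rough sketch where train_one_epoch and validate are hypothetical helpers:

```python
import copy

max_epochs, patience = 20, 2
best_loss, best_state, bad_epochs = float('inf'), None, 0

for epoch in range(max_epochs):
    train_one_epoch(model)        # hypothetical: one pass over the training set
    val_loss = validate(model)    # hypothetical: loss on the validation set

    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs > patience:  # stop early rather than chasing the minimum
            break

if best_state is not None:
    model.load_state_dict(best_state)
```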
