Why do we need to unfreeze the learner every time before retraining, even if learn.fit_one_cycle() works fine without learn.unfreeze()?

(Vinay Kumar) #1

As illustrated in the lecture notebooks, we need to unfreeze the learner before we can retrain it. I understand the general reason why a model must be unfrozen before training.

Usually, if the model weights are frozen, we should not be able to train it; but even if I don’t run learn.unfreeze() before learn.fit_one_cycle(1), everything works fine and the model gets trained. How?

Does fit_one_cycle() check whether the model is frozen before retraining?





I think the default is freeze(), so only the last layer group is unfrozen.
All other layers are frozen.

def freeze(self)->None:
    "Freeze up to last layer."
    assert(len(self.layer_groups) > 1)
    self.freeze_to(-1)  # freeze every layer group except the last one

You can call unfreeze(); this calls freeze_to(0).

def unfreeze(self):
    "Unfreeze entire model."
    self.freeze_to(0)  # nothing stays frozen

In general, you can use freeze_to() to freeze up to any layer group you like.
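
To make the indexing concrete, here is a minimal sketch of the same idea in plain PyTorch. It is not the fastai implementation: the helper names and the toy layer groups are made up for illustration, and it ignores special cases such as the way fastai can treat batch-norm layers.

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # Turn gradient tracking on or off for every parameter in the module.
    for p in module.parameters():
        p.requires_grad = flag

def freeze_to(layer_groups, n: int) -> None:
    # Hypothetical helper: freeze the groups before index n, unfreeze the rest.
    for group in layer_groups[:n]:
        set_requires_grad(group, False)
    for group in layer_groups[n:]:
        set_requires_grad(group, True)

def trainable_flags(layer_groups):
    # One boolean per group: True if all of its parameters are trainable.
    return [all(p.requires_grad for p in g.parameters()) for g in layer_groups]

# Toy layer groups standing in for "early body", "late body" and "head".
layer_groups = [
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()),
    nn.Sequential(nn.Conv2d(8, 16, 3), nn.ReLU()),
    nn.Sequential(nn.Flatten(), nn.Linear(16, 2)),
]

freeze_to(layer_groups, -1)               # like freeze(): only the last group trains
after_freeze = trainable_flags(layer_groups)

freeze_to(layer_groups, 0)                # like unfreeze(): everything trains
after_unfreeze = trainable_flags(layer_groups)

freeze_to(layer_groups, 1)                # keep only the first group frozen
after_partial = trainable_flags(layer_groups)
```

With a negative index the slice `layer_groups[:-1]` covers everything except the last group, which is why freeze_to(-1) leaves only the head trainable.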


(Vinay Kumar) #3

So, when we learn.save() a model, does it freeze it by default? (I couldn’t see that in the source code.)
I think it just stores the parameters as a dict.
When we load the saved model, does it freeze or unfreeze the model by default? :thinking:


(Poonam Ligade) #5

Following are the steps to train a good model.

1. Load the model.

By default, when you load a pre-trained model with the fastai library, its pre-trained layers are frozen, i.e. the pre-trained weights of, say, resnet (or any other pre-trained model) won’t get modified.

If you print


you will find that most of the early layers are set to trainable=False (i.e. requires_grad=False).
So you don’t need to call learn.freeze() explicitly; you can start training directly with learn.fit_one_cycle().

:warning: Note: This is the case with the fastai library only.
If you use plain PyTorch, you need to freeze the initial layers yourself before training.

2. Freeze the initial layers

What does freezing do?

Calling learn.freeze() explicitly is not required when using the fastai library.
Freezing prevents well-trained weights from being modified (i.e. requires_grad=False); reusing pre-trained weights this way is what transfer learning relies on.
Gradients are not calculated for those layers.
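
A small plain-PyTorch sketch of that last point, assuming two toy linear layers (the names `backbone` and `head` are only illustrative stand-ins for the pre-trained body and the new classifier): after a backward pass, the frozen parameters have no gradient at all.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(4, 3)   # stand-in for the pre-trained layers
head = nn.Linear(3, 2)       # stand-in for the new classifier head

# Freeze the backbone by hand, as you would have to in plain PyTorch.
for p in backbone.parameters():
    p.requires_grad = False

x = torch.randn(5, 4)
loss = head(backbone(x)).sum()
loss.backward()

# Frozen parameters never receive a gradient; trainable ones do.
backbone_has_grad = any(p.grad is not None for p in backbone.parameters())
head_has_grad = all(p.grad is not None for p in head.parameters())
print("backbone got gradients:", backbone_has_grad)
print("head got gradients:", head_has_grad)
```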

Every model architecture is split into layer groups; you can see them with learn.layer_groups.

The initial layers mostly capture low-level features such as curves, lines, shapes, and patterns. Pre-trained models have already learned to identify these features on a large image dataset such as ImageNet (1000 categories).

The later layers mainly capture high-level features of the current dataset, such as pets.
These are fully connected layers, which identify features like the shape of a dog or cat in its entirety.
These layers hold composite or aggregated information from the previous layers that relates to our current data.
We improve the information captured by these layers by training the model and optimizing the loss based on the target labels (i.e. requires_grad=True).
Read this paper.
Check here for more info

3. Train the model

Train only the last layer group, i.e. the fully connected layers. Don’t train for too long, because you might overfit.

a) Use lr_find() before fit_one_cycle() to find a suitable learning rate for your data.

4. Unfreeze the layers


All of the layers are now trainable: unfreeze() sets every layer group to requires_grad=True.
The model is not retrained from scratch. All weights from the previously frozen layers can now be updated from their pre-trained state according to the loss function. (Thanks for suggesting the better edit, @Daniel.)
You can change this behavior by using the freeze_to() method instead, which lets you keep some layers frozen.

5. Train the model again

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))

Make sure to use discriminative learning rates here (the max_lr parameter): they keep the learning rate low for the initial layers, which need less tuning, and gradually increase it for the later layers, which need more tuning, especially the fully connected ones.
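
As a rough illustration of what a slice passed to max_lr means, the sketch below spreads the two endpoints geometrically over the layer groups. This is my understanding of how fastai v1 expands the slice, not a copy of its code: `spread_lrs` is a hypothetical helper, and the fallback for a slice with only an upper bound is an assumption.

```python
def spread_lrs(lr_range: slice, n_groups: int) -> list:
    """Hypothetical helper: one learning rate per layer group."""
    if n_groups == 1:
        return [lr_range.stop]
    if lr_range.start is None:
        # Assumed fallback for slice(stop): earlier groups get a tenth of the final rate.
        return [lr_range.stop / 10] * (n_groups - 1) + [lr_range.stop]
    # Constant multiplicative step between neighbouring groups.
    step = (lr_range.stop / lr_range.start) ** (1 / (n_groups - 1))
    return [lr_range.start * step ** i for i in range(n_groups)]

# Three layer groups, same slice as in the call above.
lrs = spread_lrs(slice(1e-6, 1e-4), n_groups=3)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.1e}")
```

With three groups the middle group lands between the two endpoints on a log scale, so the earliest layers move the least and the head moves the most.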

a) Use lr_find() before fit_one_cycle() to find a suitable learning rate for your data.

6. Saving model parameters


All the weights are saved, from layer 1 to layer n.
The architecture is not saved, so you have to define the same architecture in order to use these weights again. (Freeze/unfreeze details are not saved.)
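
The same behaviour can be seen in plain PyTorch (this sketch uses torch.save on a state_dict, not fastai’s learn.save(), and the tiny model is made up for illustration): the weights survive the round trip, but the frozen/unfrozen flags come from the freshly built model, not from the file.

```python
import io
import torch
import torch.nn as nn

def make_model() -> nn.Sequential:
    # The architecture has to be rebuilt in code; it is not stored with the weights.
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

trained = make_model()
for p in trained[0].parameters():
    p.requires_grad = False          # freeze the first layer before saving

buffer = io.BytesIO()                # in-memory stand-in for a file on disk
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

restored = make_model()
restored.load_state_dict(torch.load(buffer))

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
restored_flags = [p.requires_grad for p in restored.parameters()]
print("weights restored:", weights_match)
print("requires_grad after loading:", restored_flags)
```

So after loading, you decide again which layers to freeze before you continue training.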

Read more here -

  1. Understanding freeze()
  2. Recently @Daniel ran a kernel with experiments for freeze() function

Hope this helps :slight_smile: .


Fast.ai v3 2019 course notes (Chinese version)
(深度碎片) #6

@PoonamV Thanks for this nice summary post.

This quotation is a nice explanation for questions like the following.

Why and when to use max_lr?


(深度碎片) #7

Hi @PoonamV

I am not sure the following statement is correct.

Model is getting retrained from scratch.

Do you mean the following? (I think it is more accurate; correct me if I am wrong. Thanks!)

Every layer with weights in the model now can get updated from the pretrained state.


(深度碎片) #8

Nice description of what the model does behind the scenes!

@PoonamV do you know whether anyone has tried to visualize the last few layers of a transfer learning model trained on the Pets dataset? I can imagine what those layer kernels would look like (lots of cat and dog features), but I would still like to see it done right in front of our eyes.


(Poonam Ligade) #9

Yes, you are right. “From scratch” is the wrong wording. I was doubting the statement myself. Thanks for rephrasing it more accurately.


(Charm) #10

To answer this question, we only need to look at the source code.
First, when we create a CNN model using cnn_learner:

def cnn_learner(data:DataBunch, base_arch:Callable, cut:Union[int,Callable]=None, pretrained:bool=True,
                lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5, custom_head:Optional[nn.Module]=None,
                split_on:Optional[SplitFuncOrIdxList]=None, bn_final:bool=False, init=nn.init.kaiming_normal_,
                concat_pool:bool=True, **kwargs:Any)->Learner:
    "Build convnet style learner."
    meta = cnn_config(base_arch)
    model = create_cnn_model(base_arch, data.c, cut, pretrained, lin_ftrs, ps=ps, custom_head=custom_head,
        split_on=split_on, bn_final=bn_final, concat_pool=concat_pool)
    learn = Learner(data, model, **kwargs)
    learn.split(split_on or meta['split'])
    # Watch out: our pre-trained model is frozen here.
    if pretrained: learn.freeze()
    if init: apply_init(model[1], nn.init.kaiming_normal_)
    return learn

Now we move to learn.freeze():

    def freeze(self)->None:
        "Freeze up to last layer group."
        # Here we can see that only the last layer group (such as the fully connected layers) is trained.
        assert(len(self.layer_groups)>1)
        self.freeze_to(-1)

So, if you don’t run learn.unfreeze() before learn.fit_one_cycle(1), only the fully connected layers are trained; the convolutional backbone is pre-trained and frozen, and everything works fine.

This is equivalent to using the pre-trained convolutional backbone as a feature extractor and training only the fully connected layers to learn the classification.


(Antonio de Perio) #11

Thank you, this explanation clarified things for me. What I wasn’t clear on was why we had to unfreeze and retrain all the layers again. I assumed the ImageNet layers were retrained from scratch, and that didn’t make sense to me. So it sounds like, after training while frozen, unfreezing and then training again just updates the weights in the pre-trained layers rather than retraining them from scratch.
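
Here is a toy two-stage run in plain PyTorch that shows the same thing, assuming a tiny made-up model and random data (the names `backbone`, `head` and `train` are illustrative, and this is not how fastai schedules training): while frozen, the “pre-trained” weights do not move; after unfreezing, they are updated starting from their existing values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(4, 3)   # stand-in for the pre-trained layers
head = nn.Linear(3, 1)       # stand-in for the new head
model = nn.Sequential(backbone, nn.ReLU(), head)
x, y = torch.randn(32, 4), torch.randn(32, 1)

def train(model: nn.Module, steps: int, lr: float) -> None:
    # Only parameters that are currently trainable go to the optimizer.
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

pretrained = backbone.weight.detach().clone()

# Stage 1: frozen backbone, train the head only.
for p in backbone.parameters():
    p.requires_grad = False
train(model, steps=5, lr=0.1)
unchanged_while_frozen = torch.equal(backbone.weight, pretrained)

# Stage 2: unfreeze and fine-tune everything with a smaller learning rate.
for p in backbone.parameters():
    p.requires_grad = True
train(model, steps=5, lr=0.01)
changed_after_unfreeze = not torch.equal(backbone.weight, pretrained)

print("backbone unchanged while frozen:", unchanged_while_frozen)
print("backbone updated after unfreezing:", changed_after_unfreeze)
```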


(Antonio de Perio) #12

Hi all, I’ve been reading the paper https://arxiv.org/abs/1411.1792 “How transferable are features in deep neural networks?” and thought I’d share some of my understanding from it (which I hope is in the right direction).

It sounds like when we connect a pre-trained network and then train on a frozen network, we are prone to optimisation issues from connecting “fragile co-adapted” layers.

So when we unfreeze the network and open up the pre-trained weights for updating, we smooth out the effect of the co-adapted layers and actually get better results and improved generalisation.

So I think

  1. Transfer learning alone gets good results
  2. Transfer learning + fine tuning after unfreezing gets even better results

I thought this graph in the paper, highlighting their experimental results, illustrated it pretty well for me. (Best read in the paper and interpreted in the context of the particular experiments they were running. It’s also a good read :slight_smile:)


(Alex Arlanov) #13

Thank you, good explanation :+1:


(Dien Hoa TRUONG) #14

Thank you for this great post!

I have a question about my case. I have trained a reasonably good model for an image classification problem. However, I have just acquired extra data. Should I retrain the model from the beginning with the old data plus the extra data, or is it better to continue training from the last checkpoint?

Because I think that if I train from the checkpoint (which is already at a local minimum), it will be very hard to get out of it and reach a better model. This is similar to the case of training for a long time while frozen, which can overfit, as you mentioned in your post.

In addition, if I just add a small amount of extra data (like just 1%) and lr_find suggests a very small value, then I think the model won’t change much after training. Thank you :smiley: