Why do we need to unfreeze the learner every time before retraining, even if learn.fit_one_cycle() works fine without learn.unfreeze()?

As illustrated in the lecture notebook, we need to unfreeze the learner before we can retrain it. I understand the general reason why we need to unfreeze a model before training.

Usually, if the model weights are frozen, we should not be able to train it; but even if I don’t run learn.unfreeze() before learn.fit_one_cycle(1), everything works fine and the model gets trained. How?

Does fit_one_cycle() check whether the model is frozen before retraining?




I think the default is freeze(), so only the last layer group is unfrozen.
All other layers are frozen.

def freeze(self)->None:
    "Freeze up to last layer."
    assert(len(self.layer_groups) > 1)
    self.freeze_to(-1)

You can call unfreeze(); this calls freeze_to(0).

def unfreeze(self):
    "Unfreeze entire model."
    self.freeze_to(0)

In general, you can call freeze_to() with any layer group index you like.
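
To make the layer-group idea concrete, here is a minimal sketch in plain PyTorch. It is not fastai's actual code: the toy `layer_groups` list and the `freeze_to` / `trainable_flags` helpers are hypothetical stand-ins that only mimic the behaviour described above. As far as I know, the real fastai method also treats batch-norm layers specially and recreates the optimizer.

```python
import torch.nn as nn

# Three toy "layer groups": two backbone groups and a head (hypothetical stand-in model).
layer_groups = [
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    nn.Sequential(nn.Linear(8, 2)),
]

def freeze_to(groups, n):
    """Freeze groups[:n] and make groups[n:] trainable."""
    for g in groups[:n]:
        for p in g.parameters():
            p.requires_grad = False
    for g in groups[n:]:
        for p in g.parameters():
            p.requires_grad = True

def trainable_flags(groups):
    """Return one bool per group: True if every parameter in it is trainable."""
    return [all(p.requires_grad for p in g.parameters()) for g in groups]

freeze_to(layer_groups, -1)            # like freeze(): only the last group trains
frozen_state = trainable_flags(layer_groups)

freeze_to(layer_groups, 0)             # like unfreeze(): every group trains
unfrozen_state = trainable_flags(layer_groups)

print(frozen_state, unfrozen_state)
```

Passing -1 leaves only the last group trainable and passing 0 makes everything trainable, which is why unfreeze() can be expressed as freeze_to(0).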

So, when we learn.save() a model, does it freeze it by default (I couldn’t see that in the source code)?
I think it just stores the parameters as a dict.
When we load the saved model, does it freeze or unfreeze the model by default? :thinking:

The following are the steps to train a good model.

1. Load the model.

By default, when you load a pre-trained model with the fastai library, all the pre-trained layers are frozen, i.e. the pre-trained weights of, say, resnet (or any other pre-trained model) won’t get modified.

If you print learn.summary(), you will find that most of the starting layers are set to trainable=False (i.e. requires_grad=False).
So you don’t need to explicitly call learn.freeze(); you can directly start training using learn.fit_one_cycle().
:warning: Note: This is the case with the fastai library only.
If you use plain PyTorch, you need to freeze the initial layers yourself before training.

2. Freeze the initial layers

This step is not required if you are using the fastai library.

What does freezing do?

Freezing prevents well-trained weights from being modified (i.e. requires_grad=False). Reusing pre-trained weights in this way is the basis of transfer learning.
Gradients are not calculated for those layers.
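
A small plain-PyTorch sketch of that last point, under the assumption that a tiny `backbone` / `head` pair (hypothetical names, random data) can stand in for a real pre-trained network. It runs one training step and checks that the frozen part receives no gradient and does not change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(4, 4)   # stands in for the pre-trained layers
head = nn.Linear(4, 2)       # stands in for the new classifier head

# Freeze the backbone.
for p in backbone.parameters():
    p.requires_grad = False

backbone_before = backbone.weight.detach().clone()
head_before = head.weight.detach().clone()

# Give the optimizer only the trainable parameters.
opt = torch.optim.SGD([p for p in head.parameters() if p.requires_grad], lr=0.1)

x = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(head(torch.relu(backbone(x))), y)
loss.backward()
opt.step()

backbone_has_grad = backbone.weight.grad is not None        # no gradient is stored for frozen weights
head_changed = not torch.equal(head.weight.detach(), head_before)
backbone_changed = not torch.equal(backbone.weight.detach(), backbone_before)
print(backbone_has_grad, head_changed, backbone_changed)
```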

Any model architecture in fastai is split into layer groups; you can see them with learn.layer_groups.

The initial layers mostly capture low-level features like curves, lines, shapes and patterns. Pre-trained models have learned to identify these features on a large image dataset like ImageNet (1000 categories).

The later layers mainly capture high-level features of the current dataset, like pets.
These are fully connected layers which identify features like the shape of a dog or cat in its entirety.
These layers hold composite or aggregated information from previous layers related to our current data.
We improve the information captured by these layers by training the model and optimizing the loss based on the target labels (i.e. requires_grad=True).
Read this paper.
Check here for more info

3. Train the model

Train only the last layer group, i.e. the fully connected layers. Don’t train for too long, because you might overfit.

a) Use lr_find() before fit_one_cycle() to get the best-suited learning rate for the underlying data.

4. Unfreeze the layers

learn.unfreeze()

All of the layers are now trainable: this sets every layer group to trainable (i.e. requires_grad=True).
All weights from the previously frozen layers of the model can now be updated from their pre-trained state according to the loss function. (Thanks for suggesting a better edit, @Daniel.)
You can change this behavior by using the freeze_to() method instead, which allows you to keep some layers frozen.

5. Train the model

learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))

Make sure to use discriminative learning rates here (the max_lr parameter): keep a low learning rate for the initial layers, as they need less tuning, and gradually increase the learning rate for the later layers, which need more tuning, especially the fully connected ones.

a) Use lr_find() before fit_one_cycle() to get the best-suited learning rate for the underlying data.
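
As a rough illustration of what max_lr=slice(1e-6,1e-4) means: as far as I understand fastai v1, the two ends of the slice are spread across the layer groups in evenly spaced multiplicative steps. The helper `spread_lrs` below is a hypothetical re-implementation in plain Python, not the library's own function, and the choice of three layer groups is an assumption for the example.

```python
def spread_lrs(start, stop, n_groups):
    """Return n_groups learning rates from start to stop, evenly spaced on a log scale."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Assumed: a model split into three layer groups (earliest layers first).
lrs = spread_lrs(1e-6, 1e-4, 3)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.1e}")
```

With three layer groups this gives roughly 1e-6 for the earliest group, 1e-5 for the middle one and 1e-4 for the head, so the pre-trained early layers move the least.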

6. Saving model parameters

learn.save() stores all the weights, from layer 1 to layer n.
The architecture is not saved, so you have to define the same architecture in order to use these weights again. (Freeze/unfreeze details are not saved either.)
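
The same point can be checked in plain PyTorch. This is only a sketch with a toy model, not what learn.save() does internally; `make_model` is a hypothetical helper and the in-memory buffer stands in for a file on disk. The saved state dict carries the weights, but the freeze state of the restored model comes from how the new model was built.

```python
import io
import torch
import torch.nn as nn

def make_model():
    # The architecture has to be re-created in code; it is not part of the saved weights.
    return nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

model = make_model()
for p in model[0].parameters():      # freeze the first layer
    p.requires_grad = False

buffer = io.BytesIO()                # in-memory stand-in for a file on disk
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = make_model()
restored.load_state_dict(torch.load(buffer))

weights_match = torch.equal(model[0].weight, restored[0].weight)
freeze_state_kept = not restored[0].weight.requires_grad
print(weights_match, freeze_state_kept)
```

The weights come back identical, but the first layer of the restored model is trainable again, because the freshly built model starts with every parameter trainable.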

Read more here -

  1. Understanding freeze()
  2. Recently @Daniel ran a kernel with experiments for freeze() function

Hope this helps :slight_smile: .


@PoonamV Thanks for this nice summary post.

This quotation is a nice explanation for questions like the following.

Why and when to use max_lr?


Hi @PoonamV

I am not sure the following statement is correct.

Model is getting retrained from scratch.

Do you mean the following? (I think it is more accurate; correct me if I am wrong. Thanks!)

Every layer with weights in the model now can get updated from the pretrained state.


Nice description of what the model does behind the scenes!

@PoonamV do you know if anyone has tried to visualize the last few layers of a transfer learning model trained on the Pets dataset? I can imagine what those layer kernels would look like (lots of cat and dog features), but I would still like to see it done right in front of our eyes.

Yes, you are right. “From scratch” is the wrong wording. I was doubting that statement myself. Thanks for rephrasing it more accurately.

To answer this question, we only need to look at the source code.
---------------- First, we create a CNN model using cnn_learner ----------------

def cnn_learner(data:DataBunch, base_arch:Callable, cut:Union[int,Callable]=None, pretrained:bool=True,
                lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5, custom_head:Optional[nn.Module]=None,
                split_on:Optional[SplitFuncOrIdxList]=None, bn_final:bool=False, init=nn.init.kaiming_normal_,
                concat_pool:bool=True, **kwargs:Any)->Learner:
    "Build convnet style learner."
    meta = cnn_config(base_arch)
    model = create_cnn_model(base_arch, data.c, cut, pretrained, lin_ftrs, ps=ps, custom_head=custom_head,
        split_on=split_on, bn_final=bn_final, concat_pool=concat_pool)
    learn = Learner(data, model, **kwargs)
    learn.split(split_on or meta['split'])
    # Watch out: our pre-trained model gets frozen here.
    if pretrained: learn.freeze()     # there
    if init: apply_init(model[1], nn.init.kaiming_normal_)
    return learn

---------------- Now we move on to learn.freeze() ----------------

    def freeze(self)->None:
        "Freeze up to last layer group."
        # Here we can see that only the last layer group (e.g. the fully connected layers) is trained.
        assert(len(self.layer_groups)>1)
        self.freeze_to(-1)

So, if you don’t run learn.unfreeze() before learn.fit_one_cycle(1), you just train the fully connected layers; the convolutional backbone is pre-trained and frozen, and everything works fine.

This is equivalent to using the pre-trained convolutional backbone as a feature extractor and training only the fully connected layers to learn the classification.


Thank you, this explanation clarified things for me. What I wasn’t clear on was why we had to unfreeze and retrain all the layers again. I assumed the ImageNet layers were retrained from scratch, and that didn’t make sense to me. So it sounds like, after training while frozen, unfreezing and then training again just updates the weights in the pre-trained layers rather than retraining them from scratch.

Hi all, I’ve been reading the paper https://arxiv.org/abs/1411.1792 “How transferable are features in deep neural networks?” And thought I’d share some of my understanding from it (which I hope is in the right direction)

But it sounds like when we connect a pre-trained network and then train with it frozen, we are prone to optimisation issues from connecting “fragile co-adapted” layers.

So when we unfreeze the network, and open up the pre-trained weights for updating, we smooth out the effect of the co-adapted layers and actually get better results and improved generalisation.

So I think

  1. Transfer learning alone gets good results
  2. Transfer learning + fine tuning after unfreezing gets even better results

I thought this graph in the paper, highlighting their experimental results, illustrated it pretty well for me. (It is best read in the paper and interpreted in the context of the particular experiments they were running. It’s also a good read :slight_smile:)


Thank you, good explanation :+1:

Thank you for this great post !

I have a question about my case. I have trained a fairly good model for an image classification problem, but I have just acquired some extra data. Should I retrain the model from the beginning with the old data + the extra data, or is it better to train from the last checkpoint?

Because I think if I train from the checkpoint (which is already at a local minimum), it will be very hard to get out of it and reach a better model. This is similar to the case of training for a long time while frozen, where we can overfit, as you mentioned in your post.

In addition, if I add only a small amount of extra data (like just 1%) and lr_find suggests a very small value, then I think the model won’t change much after training. Thank you :smiley:


Thanks for this summary!
Regarding point 3.a:

use lr_find() before fit_one_cycle() to get best suited learning rate for underlying data.

I thought the LR finding occurs after the first fit_one_cycle:

Source: Universal Language Model Fine-tuning for Text Classification

Can someone confirm the following:

Let’s say you trained the model using fit_one_cycle and saved the weights (w1, w2,…wn)@time=t. Now, when you unfreeze and run another training cycle, it will start from the weights saved at time t. Instead of updating only the weights in the last layer, all weights will be updated because of the unfreeze. Is that a correct understanding?

Second, how do you find out which layers are frozen and which were updated? Is that dependent on the architecture? E.g. for resnet34, will the 34 layers stay frozen until you call unfreeze, in which case weights will be updated for all layers?

That’s correct.

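A learner created with cnn_learner has its pre-trained layers frozen by default, and saving does not store the freeze state. So, after loading a saved model into a new learner, you need to unfreeze the layers yourself if you want to train all of them. Use unfreeze().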

In fastai2 we can call a learn.summary() to see what’s trainable (not frozen) and not trainable (frozen) along with how many total parameters that is too :slight_smile:
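
If you are outside fastai, a rough equivalent is easy to write by hand. The sketch below is plain PyTorch with a toy model; `count_params` is a hypothetical helper, not a fastai or PyTorch function, and it only reports totals rather than the per-layer table that a summary gives.

```python
import torch.nn as nn

def count_params(model):
    """Return (trainable, frozen) parameter counts for any nn.Module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
for p in model[0].parameters():   # freeze the first layer: 10*5 weights + 5 biases
    p.requires_grad = False

trainable, frozen = count_params(model)
print(f"trainable: {trainable}, frozen: {frozen}")
```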


Thank you @vinaykumar2491 and @muellerzr.

Let me use learn.summary(), since that can answer my questions.