Training longer for the frozen vs unfrozen model

Let’s say you were given the choice to train the frozen pretrained model for 2 extra epochs vs training the unfrozen model for 2 extra epochs (or some other constant number of epochs). Which would generally be the better choice to make?

I would think it would be better to train the unfrozen model, because then the entire network gets adjusted for the particular task at hand. This would be especially helpful for tasks that are very different from those the model was originally pretrained on.

If you look at what happens under the hood of backpropagation, you will see that in one epoch the unfrozen model adjusts the weights of the earlier layers IN ADDITION to making the same updates to the last layer. After all, freezing the model means stopping backpropagation after the last layer and leaving the earlier ones unchanged. So the frozen model adjusts only the LAST layer to the specific task at hand, while the unfrozen model adjusts ALL the layers.
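To make this concrete, here is a minimal PyTorch sketch (a toy two-layer model, not the actual architecture discussed in this thread) showing that "freezing" amounts to setting `requires_grad = False` on the earlier layers, so backpropagation leaves them without gradients and only the head gets updated:

```python
import torch
import torch.nn as nn

# Toy "pretrained" model: an early body and a task-specific head.
model = nn.Sequential(
    nn.Linear(10, 8),   # earlier layer (pretrained body)
    nn.ReLU(),
    nn.Linear(8, 2),    # last layer (head)
)

# Freeze everything except the last layer: gradients stop flowing here.
for param in model[0].parameters():
    param.requires_grad = False

x = torch.randn(4, 10)
loss = model(x).sum()
loss.backward()

print(model[0].weight.grad)              # None -> the frozen body gets no update
print(model[2].weight.grad is not None)  # True -> only the head will be updated
```

An optimizer stepping over this model would therefore only ever change the head's weights, which is exactly the "frozen" behavior described above.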

What does this imply? The unfrozen model has more fitting power and can fit the training set better. Of course, it can also overfit more easily than the frozen model, but if you really want to fit the model to the task at hand, by all means unfreeze it. It will take longer to train per epoch because you need to update ALL the layers, instead of just the LAST layer as in the frozen case.


Thanks for your insight… Interestingly, most of the time training the unfrozen model is not significantly slower than training the frozen model (maybe a difference of a couple of minutes). Have you noticed similar behavior in your experience?
It therefore seems that for tasks that are different from the dataset the model was pretrained on, it is better to train longer with the unfrozen model. Otherwise, we would risk overfitting.
Is this right?

I would be surprised if training my unfrozen model is “not” significantly slower than training my frozen model, even more so if my network is deep like ResNet-50.

Yes.

But the problem is that if you train the unfrozen model longer, you will overfit to your dataset. The best way to handle this is to use EarlyStopping with a validation dataset.

My advice is to:

  1. Train the frozen model for as long as possible, early stopping on the validation dataset.
  2. Unfreeze the whole model and train the unfrozen model for as long as possible, again early stopping on the validation dataset.
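The two-stage recipe above can be sketched in plain PyTorch (fastai wraps this up for you; here the data is synthetic, the model is a toy, and the early-stopping logic is simplified to a plain patience counter, all as illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data standing in for a real dataset.
x_train, y_train = torch.randn(64, 10), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(32, 10), torch.randint(0, 2, (32,))

model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()

def train_until_plateau(model, params, patience=3, max_epochs=50):
    """Train the given parameters until the validation loss stops improving."""
    opt = torch.optim.Adam(params, lr=1e-2)
    best, waited = float('inf'), 0
    for _ in range(max_epochs):
        opt.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        opt.step()
        with torch.no_grad():
            val = loss_fn(model(x_val), y_val).item()
        if val < best:
            best, waited = val, 0
        else:
            waited += 1
            if waited >= patience:
                break  # early stop
    return best

# Stage 1: frozen body -- optimize only the head.
for p in model[0].parameters():
    p.requires_grad = False
head_best = train_until_plateau(model, model[2].parameters())

# Stage 2: unfreeze and fine-tune the whole network.
for p in model[0].parameters():
    p.requires_grad = True
full_best = train_until_plateau(model, model.parameters())
```

In fastai this whole loop is handled by the learner plus the early-stopping callback shown later in this thread; the sketch just makes the freeze/unfreeze order explicit.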

Well that’s weird because that’s what has normally been happening with my experiments. Please see this kernel and this kernel. I have even done some experiments with relatively deep densenet models and only see maybe 3-4 min difference.

Have I been doing something wrong all along??!!

Also do you have any details on how to use EarlyStopping in fastai?

Assuming that your DataBunch is data and your model architecture is defined as arch, here is how you would minimally create a callback for EarlyStoppingCallback, plus a callback for saving the best model (the latter is optional, just an extra).

from functools import partial
from fastai.vision import *  # create_cnn, EarlyStoppingCallback, SaveModelCallback

learner = create_cnn(data, arch,
                     callback_fns=[partial(EarlyStoppingCallback, monitor='val_loss',
                                           min_delta=0.01, patience=4),
                                   partial(SaveModelCallback, monitor='val_loss', every='improvement',
                                           name='best_model')])

docs: https://docs.fast.ai/callbacks.html#EarlyStoppingCallback

Please check the source code for parameters such as min_delta and patience. In short, min_delta is the minimum change in validation loss that counts as an improvement, and patience is how many more epochs the model will keep training even if it is not doing any better.
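As a rough illustration of those two parameters (this is a simplified approximation of the callback's logic, not fastai's actual implementation), the bookkeeping looks like this:

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved by at least
    `min_delta` for `patience` consecutive epochs."""

    def __init__(self, min_delta=0.01, patience=4):
        self.min_delta = min_delta
        self.patience = patience
        self.best = float('inf')
        self.wait = 0

    def should_stop(self, val_loss):
        if self.best - val_loss > self.min_delta:  # a real improvement
            self.best = val_loss
            self.wait = 0
        else:                                      # no meaningful improvement
            self.wait += 1
        return self.wait >= self.patience

stopper = EarlyStopper(min_delta=0.01, patience=2)
losses = [1.0, 0.8, 0.795, 0.794, 0.793]  # later "gains" are below min_delta
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)  # -> 3: two epochs in a row without a >0.01 improvement
```

Note how the tiny improvements from 0.8 to 0.795 and 0.794 do not reset the patience counter, because they are smaller than min_delta.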


I don’t see any problems in your code. Perhaps it is simply reasonable to see that small a difference on that dataset.

Well, so far I have played around with 3-4 other classification datasets and see the same kind of pattern. I guess this might have to do with which datasets are being used…

That is actually not what I meant; you have it backwards. What I meant is that the unfrozen model has more fitting power than the frozen one, which actually makes it easier to overfit than a frozen model. That is the first takeaway.

However, that by itself makes it neither desirable nor undesirable. You have to look at the problem at hand. Stronger fitting power is helpful when the problem is complicated; if you apply it to simple problems, you overfit.

So how do you know when to jack up your fitting power? If you cannot get the training loss any lower, jack it up. It really is that simple. If the model overfits, meaning that the validation loss is significantly larger than the training loss, then turn it down a bit.
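That rule of thumb can be written down as a tiny decision helper (the thresholds here are arbitrary placeholders I picked for illustration, not anything from fastai):

```python
def diagnose(train_loss, val_loss, gap_tol=0.1, target=0.3):
    """Rough rule of thumb: compare losses to decide whether to
    change the model's fitting power. Thresholds are hypothetical."""
    if val_loss - train_loss > gap_tol:
        return "overfitting: turn fitting power down (freeze / regularize)"
    if train_loss > target:
        return "underfitting: jack fitting power up (unfreeze)"
    return "looks fine"

# Training loss stuck high, small train/val gap -> underfitting.
print(diagnose(train_loss=0.8, val_loss=0.85))
```

In practice you would eyeball the loss curves rather than hard-code thresholds, but the logic is the same.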


This is good information, thank you. So if you are strongly underfitting on your dataset (train loss much higher than val loss for many epochs), would it make sense to train frozen for only a few epochs and then switch to unfrozen sooner? Or maybe just start training with the model unfrozen to begin with?

Well, the thing is, you never know whether you are underfitting if you haven’t even tried training the frozen model. It is only when you find that training the frozen model cannot get the training loss low enough that you realize you are underfitting.

Starting with a frozen model is a good strategy because it is faster, and if it is good enough, then you are done; if not, you can go on to unfreeze the model and train more.
