Why can't we just skip to unfreezing layers and train all layers of the model?

These were the consolidated steps outlined by Jeremy in lesson 2:

  1. Enable data augmentation, and precompute=True
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer from precomputed activations for 1-2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to 3x-10x lower learning rate than next higher layer
  7. Use lr_find() again
  8. Train full network with cycle_mult=2 until over-fitting
  9. TTA on validation set
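For concreteness, steps 6 and 8 can be sketched in plain Python (no fastai; the base learning rate, group count, and 3x factor are just the values Jeremy typically uses, picked here for illustration):

```python
# Sketch of the numeric choices in steps 6 and 8: differential learning
# rates across layer groups, and SGDR cycle lengths with cycle_mult=2.

def differential_lrs(base_lr, n_groups=3, factor=3.0):
    """Step 6: give each earlier layer group a `factor`-times lower
    learning rate than the next higher group. Lowest-to-highest order."""
    return [base_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    """Step 8: with cycle_mult=2, each restart cycle runs for twice as
    many epochs as the previous one."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

lrs = differential_lrs(0.01)   # -> [0.01/9, 0.01/3, 0.01]
epochs = cycle_lengths(3)      # -> [1, 2, 4], i.e. 7 epochs total
```

With fastai v0.7 this corresponds to passing an array of learning rates, e.g. `learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)`.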

I was wondering why we can’t just bypass steps 3 and 4 and go straight to step 5, where we unfreeze all the layers and train the full model with differential learning rates. That would save us the time spent training a preliminary model on precomputed activations, which we won’t use in the final model anyway.

Appreciate the help guys!


It would be a cool experiment to see whether these steps improve accuracy or reduce the time needed to train, given a particular dataset and architecture. Since your mind is primed with all this, I invite you to give it a go and let us know what happens.

I think the theory motivating the first few steps is that you know you’ll need to train the fully connected (FC) layers the most, relative to the other layers, since you initialized them somewhat randomly. And if we freeze the other layers, we can train the FC layers very quickly by precomputing the activations. Also, since we know the FC layers will be trained a lot and the others only a little, this thinking says that early training of the other layers is somewhat wasteful: how we train them depends on the state of the FC layers, and that state is changing rapidly in the beginning. From this, I predict that skipping steps 3 and 4 will make training to convergence take longer, but won’t affect accuracy significantly. The results of experiments have the final say, though.
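As a toy illustration of the precomputing point, here is a pure-Python sketch (the "backbone" and "head" are stand-in functions, not real layers): because the frozen layers' activations for a given input never change, one pass through the backbone suffices no matter how many epochs we train the head.

```python
# Toy demonstration that a frozen backbone's activations can be cached:
# the expensive part runs once per input, not once per input per epoch.

backbone_calls = 0

def frozen_backbone(x):
    """Stand-in for the expensive, frozen pretrained conv layers."""
    global backbone_calls
    backbone_calls += 1
    return [xi * 2 for xi in x]

def head(features, w):
    """Stand-in for the cheap, randomly initialized FC head we train."""
    return sum(f * w for f in features)

inputs = [[1.0, 2.0], [3.0, 4.0]]

# Precompute: run the frozen backbone exactly once per input.
cached = [frozen_backbone(x) for x in inputs]

w = 0.5
for epoch in range(3):                       # only the head's work repeats
    preds = [head(f, w) for f in cached]     # no backbone calls here

# backbone_calls == 2 (one per input), not 6 (inputs x epochs)
```

Without caching, the backbone would run `len(inputs) * epochs` times while producing identical outputs each epoch.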


Take the thought experiment out to the extreme. Why do we use pretrained weights at all? Why do we even bother with transfer learning? Why not just initialize the known model with random weights?


I think it’s an empirical observation that gradual unfreezing works better than unfreezing everything at once and retraining all layers of the network. The same is also mentioned in the ULMFiT paper (originally titled FitLaM): https://arxiv.org/pdf/1801.06146.pdf
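To make "gradual unfreezing" concrete, here is a minimal plain-Python sketch (the layer-group names are hypothetical) of the schedule that paper describes: start with only the head trainable, then unfreeze one more layer group per stage, working backwards from the output.

```python
# Sketch of a gradual-unfreezing schedule: at each stage, one more
# layer group (counting back from the head) becomes trainable.

layer_groups = ["early_convs", "late_convs", "head"]

def unfreeze_schedule(groups):
    """Yield, per stage, the list of groups that are trainable."""
    for stage in range(1, len(groups) + 1):
        yield groups[-stage:]

stages = list(unfreeze_schedule(layer_groups))
# stage 1: ['head']
# stage 2: ['late_convs', 'head']
# stage 3: ['early_convs', 'late_convs', 'head']
```

In a real training loop, each stage would be a round of fine-tuning with only the listed groups' parameters receiving gradient updates.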


Seconded! The underlying theory is described in this paper: https://arxiv.org/abs/1411.1792. But we’ve done too few experiments to fully establish best practices here, so any experiments you run will be of great interest to all. 🙂