Differential learning rate annealing

Arindam · December 23, 2018, 9:31am

In the second lecture of the Deep Learning Part 1 course I came across the concepts of fine-tuning and differential learning rate annealing. There, Jeremy builds an array of learning rates for training the unfrozen model, but I am confused as to why the learning rates of the initial layers are smaller than those of the final layers. We know that the initial layers learn to detect edges and corners, which the pre-trained model already detects, so why are we giving them a small learning rate? Moreover, we know that the earlier layers of a network take longer to train, so by using a small learning rate aren't we increasing the computational cost?

It would be very helpful if someone could shed some light on this matter.

orange_runner · December 25, 2018, 3:38am

I think Jeremy mentions why. Off the top of my head: the first layers are very low-level. They detect simple but very generalizable features like edges, gradients, etc. These layers are already trained well enough to be useful without much modification, so very little additional training is needed to fine-tune them (edges in cat pictures are much like edges in any other pictures). The deeper we go, the larger the context of the features, but the less generalizable they become, so more learning is required (cat feature detectors will not be very good at classifying X-ray pictures).
Hope this gives you some context.
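
To make this concrete, here is a minimal sketch in plain Python rather than the fastai API. It assumes three hypothetical layer groups (early, middle, final), made-up weights and gradients, and illustrative learning rates; none of these numbers come from the lecture. It only shows that with one learning rate per group, the same gradient moves the early layers far less than the final ones.

```python
# Minimal sketch of per-group learning rates in a single SGD step.
# All names and numbers are illustrative assumptions, not the course's code.

def sgd_step(groups, grads, lrs):
    """Apply w <- w - lr * g, using one learning rate per layer group."""
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(groups, grads, lrs)
    ]

# Three hypothetical layer groups, starting from "pre-trained" values.
pretrained = [[0.50, -0.20], [0.10, 0.30], [-0.40, 0.25]]
# Give every weight the same gradient so only the learning rate differs.
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
# Smaller rates for the early groups, larger rates for the later ones.
lrs = [1e-4, 1e-3, 1e-2]

updated = sgd_step(pretrained, grads, lrs)

# Largest change per group: early layers barely move, final layers move most.
shifts = [
    max(abs(new - old) for new, old in zip(new_group, old_group))
    for new_group, old_group in zip(updated, pretrained)
]
print(shifts)
```

The early group stays close to its pre-trained weights, which is the point: those features are already useful, so they only need a small nudge.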

Arindam · December 25, 2018, 10:12pm

Thanks for your insight, but aren’t we re-training the model from scratch by choosing different learning rates?

orange_runner · December 26, 2018, 5:27am

I think the concept of fine-tuning only makes sense with pre-trained models.