Lesson 2: differential learning rates

When unfreezing and fine-tuning all layers, why are the earlier groups of layers (closer to the pixels, e.g. edge detection) trained with a very low learning rate? The explanation in the video is that the earlier layers probably don’t need to be tuned as much, which I agree with. What is the relationship between learning rate and fine-tuning? Are we simply assuming that the earlier layers are already in a “resilient” weight space, and as such we don’t need the resets/jumps to be that large?

Additionally, does a lower learning rate mean a longer runtime? If so, does that mean we are spending a long time retraining the earlier layers even though they are already stable?

Thanks.

Lower layers generally learn to detect more basic concepts - edges, shapes, etc. The higher up in the stack you go, the more complex the learned features become (you might have a filter for recognizing faces, cars, etc.).

There is a very neat visualization of this that will be shown in one of the lectures.

There is also this video and a paper by Matt Zeiler, who did groundbreaking work on visualizing what a CNN learns.

This is what the concept you mention is associated with. The idea is that, depending on how close our dataset is to what the CNN was originally trained on, we might not need it to relearn how to identify shapes, colors, and textures. Higher up in the stack, though, maybe for the dataset we are working on we do not care about being able to tell dogs from cats, so we might trade those feature maps for something more aligned with what we are trying to do. Since that is where we would like the bulk of the training to happen, that is where we set the learning rates highest.

I would also add this: by training all of your layers with the same learning rate, you might break the base of your network and actually get worse performance (this happens to me all the time when I forget to pass an array of learning rates instead of just one).
Since the model already knows so much about recognizing pictures, we want to be very subtle when we fine-tune it. The last layers use all the features from before to recognize dogs, cats, frogs, or buildings, vehicles and other things; those can change a bit more, since we don’t really need all those categories. The earlier ones that form the base of the network don’t need to change very much, though (and in fact, if you leave them frozen with freeze_to(-2), you will get almost the same results).
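For reference, this is roughly what passing per-layer-group learning rates looks like with the library used in the course (exact call signatures vary between fastai versions, and `learn` is assumed to be the learner built earlier in the lesson, so treat this as a sketch):

```python
import numpy as np

# One learning rate per layer group: tiny for the early, generic layers,
# larger for the later, task-specific layers and the new head.
lr = 1e-2
lrs = np.array([lr / 100, lr / 10, lr])

learn.unfreeze()   # make all layer groups trainable
learn.fit(lrs, 3)  # pass the array instead of a single learning rate
```

If you accidentally pass just `lr` here, every group trains at the highest rate, which is exactly the “breaking the base of the network” failure mode mentioned above.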

A lower learning rate doesn’t impact your training time: each epoch takes just as long, the updates are simply smaller. And since you are not training a model from scratch, just adapting one that has already learnt something more general to your particular problem, you don’t need many epochs anyway.

What I have understood is that the earlier layers need much less fine-tuning and the subsequent layers need more. This would mean that less time should be invested in tuning the first few layers compared to the last or subsequent layers, which, in turn, should imply that the learning rate for the first few layers must be greater than that of the subsequent layers.
But this is the opposite of what is taught in the lecture.
It would be really helpful if someone could correct me wherever I’m wrong.

If I am implementing a ResNet-152, what exactly would the first few and middle layers be? Also, if we were to implement this in PyTorch, how would we do it?
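Something like the sketch below is what I have in mind in plain PyTorch, using optimizer parameter groups, but the way I split the layers into groups is just a guess and the boundaries are arbitrary:

```python
import torch
from torchvision import models

# Pretrained ResNet-152 with a new head for our own task (e.g. 2 classes).
model = models.resnet152(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# Illustrative split into three groups (the boundaries are a guess):
#   early  -> stem + first blocks (edges, textures)
#   middle -> deeper blocks (more complex shapes/parts)
#   head   -> the new fully connected layer
early = list(model.conv1.parameters()) + list(model.bn1.parameters()) \
      + list(model.layer1.parameters()) + list(model.layer2.parameters())
middle = list(model.layer3.parameters()) + list(model.layer4.parameters())
head = list(model.fc.parameters())

# Differential learning rates via one parameter group per learning rate.
optimizer = torch.optim.SGD([
    {"params": early,  "lr": 1e-4},  # barely touch the generic features
    {"params": middle, "lr": 1e-3},  # adjust a bit more
    {"params": head,   "lr": 1e-2},  # train the new head the most
], momentum=0.9)
```

Is this roughly the intended approach?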

The earlier, more basic layers need less fine-tuning.

However, in the video, their learning rate is smaller than the learning rate of the later layers (the ones that are specific to our current task).

I would think that setting a small learning rate means better fine-tuning; however, the opposite seems to be true. Why is this the case?
