Question on how differential learning rates are applied to the model

In the notebook we have this code:

lr = np.array([1e-4, 1e-3, 1e-2])

  1. How does the learner determine what layers to apply each learning rate to?

  2. How does learn.sched.plot_lr() know to plot only the learning rate for the final layers? Is that just the default when multiple learning rates are assigned?

Note that what’s being plotted above is the learning rate of the final layers. The learning rates of the earlier layers are fixed at the same multiples of the final-layer rates as we initially requested.
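
A tiny numeric illustration of that note (just my own sketch, not library code): if a schedule scales the final-layer LR, each earlier group's LR follows from the fixed ratios we requested, so every group scales by the same factor.

    import numpy as np

    # Illustration only: the earlier groups keep a fixed ratio to the
    # final-layer LR, so scaling the final-layer LR scales every group.
    lrs = np.array([1e-4, 1e-3, 1e-2])   # requested per-group rates
    ratios = lrs / lrs[-1]               # [0.01, 0.1, 1.0]

    for scale in [1.0, 0.5, 0.1]:        # e.g. points along an LR schedule
        final_lr = lrs[-1] * scale
        print(ratios * final_lr)         # earlier groups stay at fixed multiples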

1 Like

The proof is in the pudding AKA the code :slight_smile:

I do not know how this works, but in Learner#fit we call Learner#get_layer_opt. This is on line #95 in learner.py.

From there, we instantiate LayerOptimizer, and this whole class deals with layer_groups. I don't have much of a clue how this works, but it seems that if the learning rate passed in is just a single number, it gets turned into an array, I think with one entry per layer group.

Without knowing more, I infer that for this model there probably exist 3 layer_groups, and the layers in each group get assigned a particular learning rate - roughly as in the sketch below.
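
Here is a rough sketch of what I imagine that broadcast looks like (the function name and group names are mine, not the actual fastai code):

    import numpy as np

    # Guesswork sketch, not the real LayerOptimizer implementation.
    def broadcast_lrs(lrs, layer_groups):
        """If a single LR is given, repeat it once per layer group;
        otherwise expect one LR per group already."""
        if np.isscalar(lrs):
            lrs = [lrs] * len(layer_groups)
        assert len(lrs) == len(layer_groups)
        return list(zip(layer_groups, lrs))

    # Hypothetical example: 3 layer groups, 3 learning rates
    layer_groups = ['early_convs', 'later_convs', 'fc_head']
    print(broadcast_lrs(1e-2, layer_groups))                # scalar copied to each group
    print(broadcast_lrs([1e-4, 1e-3, 1e-2], layer_groups))  # one LR per group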

But I am not sure. Those are, however, the steps that would need to be taken to start figuring this out. To go further, one would need to read more code, and there might be Python features one would need to familiarize oneself with.

I think if we want to start tackling such questions, we really have to get into the habit of reading the source code. There are so many questions like this that can be asked that we cannot possibly address them all on these forums.

So there we go :slight_smile: We all have a chance to become better Python programmers :slight_smile: And if someone figures this out - and I am tempted to start digging into such things starting next week - then maybe we can start writing docstrings so that our colleagues can have an easier time figuring such things out.

2 Likes

I figured that many architectures have a whole lot of layers, so we probably don’t want to have to specify an LR for every single layer. So I created this idea of “layer groups”, which are layers in the same general part of a net. For convnets, the FC layers we add are one layer group, and the conv layers are split into two layer groups around the middle of the convnet. That gives us three groups. The LRs we pass are used for each group (that’s why we pass 3 LRs). If we pass just one LR, it’s simply copied to each group.
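
To make that concrete, here is roughly what it corresponds to in plain PyTorch (an illustration, not the fastai internals): each layer group becomes an optimizer parameter group with its own learning rate.

    import torch
    import torch.nn as nn

    # Illustration only (plain PyTorch, not the fastai code): each layer group
    # becomes a parameter group with its own learning rate. Layer sizes are
    # placeholders.
    early_convs = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())
    later_convs = nn.Sequential(nn.Conv2d(16, 32, 3), nn.ReLU())
    fc_head     = nn.Sequential(nn.Flatten(), nn.Linear(32, 10))

    optimizer = torch.optim.SGD([
        {'params': early_convs.parameters(), 'lr': 1e-4},  # earliest conv layers
        {'params': later_convs.parameters(), 'lr': 1e-3},  # later conv layers
        {'params': fc_head.parameters(),     'lr': 1e-2},  # newly added FC head
    ], lr=1e-2, momentum=0.9)  # default lr, overridden by every group above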

Each type of learner can define how to split layers into groups. It does this by overriding get_layer_groups().
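
As a minimal sketch of that pattern (hypothetical class and layout, not the actual fastai source), a learner just has to return its model's layers split into groups:

    import torch.nn as nn

    # Hypothetical sketch of the idea described above, not the real fastai classes.
    class MyConvLearner:
        def __init__(self, model):
            self.model = model

        def get_layer_groups(self):
            # Split the model's children into three groups: early convs,
            # later convs, and the fully connected head.
            children = list(self.model.children())
            mid = len(children) // 2
            return [nn.Sequential(*children[:mid]),    # group 0: earliest layers
                    nn.Sequential(*children[mid:-1]),  # group 1: later conv layers
                    children[-1]]                      # group 2: FC head

    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                          nn.Conv2d(16, 32, 3), nn.ReLU(),
                          nn.Linear(32, 10))
    groups = MyConvLearner(model).get_layer_groups()  # 3 groups, to pair with 3 LRs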

11 Likes

Thanks for clearing this up. It definitely makes sense to think of layers as belonging to different categories, and to assign a learning rate per category rather than going layer by layer.

This is really one of the most interesting intuitions in the first lessons so far, along with the LR finder and the concept of cycles. Very, very nice - the boundary has been pushed a bit further!

That’s great @DavideBoschetto!

Finally I got the explanation from the author. Thanks.
For a long time I was looking for this answer - how the 3 learning rates are divided across the different layers - and finally got here. I have one request: these are some of the fundamental concepts that make fast.ai so unique. Could we have these explanations available in the official documentation of fastai v1? That would be very helpful. Otherwise, while extending the existing course notebooks to different problems, students sometimes face this kind of concept gap and have to look for the exact answer here and there.

When using differential learning rates, should the learning rate at the minimum gradient obtained from lrfinder be used as the low or the high end of the range?

Has anyone tested what happens if each learning rate assigned to a group is further split by picking an arbitrary minimum (let's say lr * 1e-1), so that the layers within a group are not all trained with the same learning rate? The earliest layers in the group would train with the smallest value (lr * 1e-1), increasing up to the maximum (the assigned lr). Maybe this will not change anything, but my crazy idea is to preserve more in each lower layer than in its immediate upper layer.
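
I haven't tried it, but here is a sketch of the kind of per-layer split I mean, assuming we interpolate geometrically from lr * 1e-1 up to the group's assigned lr (the helper name is made up):

    import numpy as np

    # Untested sketch of the idea above: within one layer group, interpolate
    # from an arbitrary minimum (lr * 1e-1) up to the assigned lr, so the
    # earlier layers in the group get the smaller learning rates.
    def per_layer_lrs(group_lr, n_layers, min_factor=1e-1):
        return np.geomspace(group_lr * min_factor, group_lr, num=n_layers)

    print(per_layer_lrs(1e-2, 4))  # -> approximately [0.001, 0.00215, 0.00464, 0.01]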