ULMFit Paper: Question on learning rates (section 3.2)

Regarding …

“first choose the learning rate η^L of the last layer by fine-tuning only the last layer and using η^(l-1) = η^l / 2.6 as the learning rate for lower layers”

How is this implemented in fast.ai terms, where the LM learner lets us apply different learning rates by layer group (of which there are 4)? Would it be something like:

lr = 0.004
learner.fit([lr/2.6*3, lr/2.6*2, lr/2.6, lr], use_clr_beta=(10,10,0.95,0.85), cycle_len=15)

For LM, there’s an example in the imdb.ipynb from Part 2:

In [16]:

lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

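So each lower layer group should divide lr by a higher power of 2.6 (lr/2.6**2, lr/2.6**3, ...), rather than dividing by 2.6 and then multiplying by 2 or 3.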
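To make the difference concrete, here is a minimal sketch in plain Python (no fastai needed) of the rule from the paper quote, next to the expression from the original question. The helper name `discriminative_lrs` is made up for illustration and is not a fastai function, and the ordering (lowest layer group first, last group at the end) is assumed from the imdb.ipynb snippet above.

```python
def discriminative_lrs(last_lr, n_groups, factor=2.6):
    """Return learning rates ordered from the lowest layer group to the last.

    Hypothetical helper (not part of fastai): each group gets the rate of
    the group above it divided by `factor`, i.e. last_lr / factor**k.
    """
    return [last_lr / factor ** k for k in range(n_groups - 1, -1, -1)]


lr = 0.004

# Rule from the paper quote, applied to 4 layer groups.
paper_lrs = discriminative_lrs(lr, 4)

# Expression from the question. Python evaluates lr/2.6*3 as (lr/2.6)*3,
# so the lowest group ends up with a larger rate than the last group.
question_lrs = [lr / 2.6 * 3, lr / 2.6 * 2, lr / 2.6, lr]

print(paper_lrs)
print(question_lrs)
```

With the power version the rates shrink by a constant factor of 2.6 per group going down; with the multiply version they actually grow for the lower groups, which is the opposite of what the paper describes.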

By the way, has anyone tried this new discriminative learning rate setup on a vision task?

Good catch!

But this is done for the classifier whereas the paper seems to indicate that this should be how we define the learning rates for fine-tuning the language model. Perhaps I’m reading the paper wrong??? Or perhaps the notebooks aren’t updated to reflect the paper???

There is LM fine-tuning and Classifier fine-tuning … so I guess I’m confused about the discriminative learning rates that should be applied to each insofar as my reading of the paper goes.

Correct me if I’m wrong, but I thought we were using the LM as a backbone for classification with a custom head.

You can see the actual code from the paper in the dl2/imdb_scripts folder.


Thanks for the link! Nice to look at the code.

I’m looking at train_tri_lm.py but I still don’t see where you “first choose the learning rate of the last layer by fine-tuning only the last layer”.

I see where if you are training with discriminative lrs you set lrs = np.array([lr/6,lr/3,lr,lr/2]), but insofar as training only the last layer first, I don’t see it. If I’m reading the code right, it looks like you guys train all the layers initially (see line 130: learner.unfreeze()).
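For comparison, here is a small sketch of the step ratio between neighbouring layer groups in that array versus the 2.6 rule from the paper. It assumes the array is ordered from the lowest layer group to the last, as in the imdb.ipynb example; the value of `lr` is arbitrary (reused from the first post) and `step_ratios` is a made-up helper name.

```python
lr = 0.004  # arbitrary value, reused from the first post

# Array as quoted from train_tri_lm.py (ordering assumed: lowest group first).
script_lrs = [lr / 6, lr / 3, lr, lr / 2]

# Rule from the paper quote: each lower group gets the rate above it / 2.6.
paper_lrs = [lr / 2.6 ** k for k in (3, 2, 1, 0)]


def step_ratios(lrs):
    """Ratio of each group's rate to the rate of the group just below it."""
    return [upper / lower for lower, upper in zip(lrs, lrs[1:])]


print(step_ratios(script_lrs))
print(step_ratios(paper_lrs))
```

Under that ordering assumption, the script's ratios come out to roughly 2, 3 and 0.5, so the last group trains at a lower rate than the group before it, whereas the paper's rule gives a constant ratio of 2.6 throughout.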

There is code that looks like it does what the paper describes, but it’s commented out (see line 123):

#learner.fit(lrs, 1, wds=wd, use_clr=(6,4), cycle_len=1)

So I’m not sure what to make of this.

Now I understand your question. It seems train_tri_lm.py uses a different scheme for discriminative learning rates, and I don’t understand the logic behind it either.

Also, is there a list of the argument values you passed to train_lm()?