“first choose the learning rate η^L of the last layer by fine-tuning only the last layer and using η^(l-1) = η^(l)/2.6 as the learning rate for lower layers”
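The rule in the quote is geometric: each lower layer group's rate is the one above it divided by 2.6. As a minimal plain-Python sketch of that rule (assuming 4 layer groups and an illustrative base rate of 0.004; the helper name is mine, not fast.ai's):

```python
def discriminative_lrs(base_lr, n_groups, factor=2.6):
    """Per-group learning rates, lowest layer first, per eta^(l-1) = eta^l / factor.

    The last element is the base (last-layer) rate; each step down the
    stack divides by `factor` once more.
    """
    lrs = [base_lr / factor ** k for k in range(n_groups)]
    return list(reversed(lrs))

# Lowest group gets base_lr / 2.6**3, the last group gets base_lr itself.
print(discriminative_lrs(0.004, 4))
```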
How is this implemented in fast.ai terms with the LM learner, where we can apply different learning rates by layer group (of which there are 4)? Would it be something like:
lr = 0.004
learner.fit([lr/2.6**3, lr/2.6**2, lr/2.6, lr], use_clr_beta=(10,10,0.95,0.85), cycle_len=15)
But this is done for the classifier, whereas the paper seems to indicate that this is how we should define the learning rates for fine-tuning the language model. Perhaps I'm reading the paper wrong? Or perhaps the notebooks haven't been updated to reflect the paper?
There is LM fine-tuning and classifier fine-tuning, so I guess I'm confused about which discriminative learning rates should be applied to each, based on my reading of the paper.
I'm looking at train_tri_lm.py, but I still don't see where you "first choose the learning rate of the last layer by fine-tuning only the last layer".
I see where, if you are training with discriminative lrs, you set lrs = np.array([lr/6, lr/3, lr, lr/2]), but as for training only the last layer first, I don't see it. If I'm reading the code right, it looks like you guys train all the layers from the start (see line 130: learner.unfreeze()).
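For what it's worth, the two schemes give noticeably different per-group rates. A small plain-Python comparison (the first array is the one from train_tri_lm.py quoted above; the second is my reading of the paper's /2.6 rule, lowest group first; 0.004 is just an illustrative base rate):

```python
lr = 0.004

# Scheme in train_tri_lm.py: note the *last* group gets lr/2, not lr,
# and the ratios between groups are not a constant factor.
script_lrs = [lr / 6, lr / 3, lr, lr / 2]

# Paper's rule eta^(l-1) = eta^(l) / 2.6: constant factor of 2.6
# between groups, with the last group at the full base rate.
paper_lrs = [lr / 2.6 ** k for k in (3, 2, 1, 0)]

for name, lrs in [("script", script_lrs), ("paper", paper_lrs)]:
    print(name, ["%.6f" % x for x in lrs])
```

Note the script's array peaks at the third group rather than increasing monotonically toward the last layer, which is part of what makes its logic hard to map back to the paper.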
There is code that looks like it does what the paper describes, but it's commented out (see line 123).
Now I understand what your problem is. It seems train_tri_lm.py uses a different way to do discriminative learning rates, and I don't understand the logic behind it either.