ULMFiT Paper: Question on learning rates (section 3.2)

wgpubs · May 22, 2018, 3:22am

Regarding …

“first choose the learning rate η^L of the last layer by fine-tuning only the last layer and using η^(l-1) = η^l / 2.6 as the learning rate for lower layers”

How is this implemented in fast.ai terms, where the LM learner lets us apply a different learning rate to each layer group (of which there are 4)? Would it be something like:

lr = 0.004
learner.fit([lr/2.6*3, lr/2.6*2, lr/2.6, lr], use_clr_beta=(10,10,0.95,0.85), cycle_len=15)

alwc · May 22, 2018, 4:17am

For LM, there’s an example in the imdb.ipynb from Part 2:

In [16]:

lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

So the divisor should be 2.6 raised to a power (2.6**2, 2.6**3, ...) rather than 2.6 combined with a multiplier of 2 or 3.
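To make the pattern concrete, here is a minimal sketch that builds the same array for any number of layer groups. The helper name discriminative_lrs and its arguments are made up for illustration and are not part of fast.ai; it only assumes the "divide by 2.6 per group" rule quoted from the paper.

```python
import numpy as np

def discriminative_lrs(last_lr, n_groups, factor=2.6):
    """Hypothetical helper: per-group learning rates, lowest group first.

    The last group gets last_lr; each group below it gets the rate of
    the group above divided by `factor`.
    """
    # Exponents count down to 0, e.g. [4, 3, 2, 1, 0] for 5 groups.
    exponents = np.arange(n_groups - 1, -1, -1)
    return last_lr / factor ** exponents

# Same values as the notebook cell above (lr=3e-3, 5 groups).
lrs = discriminative_lrs(3e-3, 5)
print(lrs)
```

Note that in Python an expression like lr/2.6*3 is evaluated as (lr/2.6)*3, which is larger than lr itself, so the multiplied version would give the lowest group the highest rate.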

By the way, has anyone tried this new discriminative learning rate setup on a vision task?

wgpubs · May 22, 2018, 5:00am

Good catch!

But this is done for the classifier whereas the paper seems to indicate that this should be how we define the learning rates for fine-tuning the language model. Perhaps I’m reading the paper wrong??? Or perhaps the notebooks aren’t updated to reflect the paper???

There is LM fine-tuning and Classifier fine-tuning … so I guess I’m confused about the discriminative learning rates that should be applied to each insofar as my reading of the paper goes.

alwc · May 22, 2018, 5:55am

Correct me if I’m wrong, but I thought we were using the LM as a backbone to do classification with a custom head.

jeremy · May 22, 2018, 10:51pm

You can see the actual code from the paper in the dl2/imdb_scripts folder.

wgpubs · May 22, 2018, 11:04pm

Thanks for the link! Nice to look at the code.

I’m looking at train_tri_lm.py but I still don’t see where you “first choose the learning rate of the last layer by fine-tuning only the last layer”.

I see where if you are training with discriminative lrs you set lrs = np.array([lr/6,lr/3,lr,lr/2]), but insofar as training only the last layer first, I don’t see it. If I’m reading the code right, it looks like you guys train all the layers initially (see line 130: learner.unfreeze()).

There is code that looks like it does what the paper describes, but it’s commented out (see line 123):

#learner.freeze_to(-1)
#learner.fit(lrs, 1, wds=wd, use_clr=(6,4), cycle_len=1)

So I’m not sure what to make of this.
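For what it’s worth, here is a quick sketch comparing the array quoted above from train_tri_lm.py with what the 2.6 rule would give for 4 layer groups. The value of lr is a placeholder (only the ratios between neighbouring groups matter), and the 2.6 schedule is my reading of the paper, not something taken from the script.

```python
import numpy as np

lr = 1.0  # placeholder value; only the ratios are of interest

# Array as quoted from train_tri_lm.py, lowest layer group first.
script_lrs = np.array([lr/6, lr/3, lr, lr/2])

# What the paper's rule (divide by 2.6 per group) would give for 4 groups.
paper_lrs = lr / 2.6 ** np.arange(3, -1, -1)

# Ratio of each group's rate to the rate of the group below it.
script_ratios = script_lrs[1:] / script_lrs[:-1]
paper_ratios = paper_lrs[1:] / paper_lrs[:-1]

print("script ratios:", script_ratios)
print("paper ratios: ", paper_ratios)
```

In the script’s array the steps between groups are not a constant factor, and the last group ends up with half the rate of the group below it, so it does not look like the 2.6 schedule at all.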

alwc · May 23, 2018, 1:51am

Now I understand what your problem is. It seems train_tri_lm.py uses a different way of setting discriminative learning rates, and I don’t understand the logic behind it either.

wgpubs · May 23, 2018, 5:11pm

Also, is there a list of the different argument values you passed to train_lm()?