In the last run of fast.ai, @jeremy talked about using differential learning rates, since the early layers need less change than the later layers. He even mentioned this in lesson 1 while discussing the Zeiler and Fergus paper. Has fastai v1 started taking care of this internally? Does it mean we can send any PyTorch model (even custom ones we build ourselves) into create_cnn and expect it to work?
Another question really bugs me: how does 1cycle scheduling do the warm-up/restarts that cosine annealing did, as shown in the previous course? Or are we eliminating warm restarts once and for all? If so, how do we prevent gradient descent from getting stuck in local minima?
A bit of context: in part 2 v2 lesson 10, Jeremy and Sebastian Ruder turned part 1 v2 lesson 4 (the previous run) into a paper, ULMFiT. They renamed differential learning rates to discriminative learning rates:
He (Sebastian) said “well, this thing that you called in the lessons differential learning rates, differential kind of means something else. Maybe we should rename it” so we renamed it. It’s now called discriminative learning rate. So this idea that we had from part one where we use different learning rates for different layers, after doing some literature research, it does seem like that hasn’t been done before so it’s now officially a thing — discriminative learning rates. This is something we learnt in lesson 1 but it now has an equation with Greek and everything. [reference: lesson 10 video]
I am not so sure, as it’s early days for me and I’m actively reading the fastai v1 docs and deconstructing the source code to understand its internals. I don’t want to assume things are the same as v0.7.
Maybe you should take a look at lessons 10 and 13 (and part 2 v2 if you haven’t)?
My understanding is that I’ll get the most out of these concepts by studying part 2 v2: discriminative LR, the 1cycle policy, LR annealing, training phases, SGDR (SGD with warm restarts), and customizing the LR finder.
I know this uses the old fastai v0.7, but the concepts are the same.
Jeremy did cover differential learning rates when he passed slice(lr1, lr2) to the model: lr1 is for the earlier layers while lr2 is for the later ones. He didn’t explain it as much as last time, though. Cosine annealing, TTA, and training with different image sizes are yet to be covered.
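For anyone else looking for it, here is a minimal sketch of how the slice is used in fastai v1 (MNIST_SAMPLE and resnet34 are just placeholders, and exact calls may differ slightly between releases):

```python
from fastai.vision import *  # fastai v1

# Placeholder data and model, just to illustrate the API.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = create_cnn(data, models.resnet34, metrics=accuracy)

learn.fit_one_cycle(1)   # train the head while the backbone is frozen
learn.unfreeze()         # make all layer groups trainable
# slice(1e-5, 1e-3): the earliest layer group gets 1e-5, the last group gets
# 1e-3, and the groups in between get learning rates spread between the two
# (discriminative learning rates).
learn.fit_one_cycle(1, max_lr=slice(1e-5, 1e-3))
```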
I’m very, very sorry. I didn’t notice this post earlier. I’ve made a note of this, and I can promise the community it won’t happen again.
No worries. Thanks for starting this discussion. I am looking forward to diving deeper into 1cycle and the answers to your question on discriminative LR. It’s one of many things that help build state-of-the-art models with fastai.
I’m digging into this myself and haven’t figured everything out so far, but I just wanted to share what I’ve found.
This article explains what the 1cycle policy is and how/why it works. It starts with a lower LR to warm up, goes up linearly, and then comes back down linearly as well. (I still haven’t seen the cycle multiplier, which gave later cycles more time; I’m not sure if it’s simply no longer needed with the 1cycle policy, compared with SGDR. I can’t find this option in either learner.fit() or learner.fit_one_cycle().)
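For my own understanding, here is a rough sketch of the linear up/down schedule the article describes (the function name and default values are mine, not fastai’s):

```python
import numpy as np

def one_cycle_lr(n_iter, max_lr=1e-3, div=25.0, pct_start=0.3):
    """Illustrative 1cycle LR schedule: linear ramp up, then linear ramp down."""
    n_up = int(n_iter * pct_start)   # warm-up iterations
    n_down = n_iter - n_up           # annealing iterations
    low = max_lr / div               # starting (and ending) learning rate
    lrs = []
    for i in range(n_up):            # ramp up: low -> max_lr
        lrs.append(low + (max_lr - low) * i / max(1, n_up - 1))
    for i in range(n_down):          # ramp down: max_lr -> low
        lrs.append(max_lr - (max_lr - low) * i / max(1, n_down - 1))
    return np.array(lrs)

schedule = one_cycle_lr(100)
print(schedule[:5], schedule[-5:])   # starts and ends near max_lr / div
```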
I remember back in part 1 v2, Jeremy was adjusting the LR by 3-10x each time, but he said that adjusting it linearly might be even better and that he was looking into this and might change how the LR is adjusted.
The “differential learning rates” concept still exists but seems to have been updated to take a slice of LRs for the different layer groups.
I was thinking about this yesterday while training an LM with fastai_v1.
I can’t seem to get as good a perplexity as I could with the TrainingPhase API that @sgugger developed for the earlier version. I can see most of that API’s options in the learner (e.g. div, pct), but I am not sure how to set the shape of the learning rate schedule (I was using a combination of linear and cosine/polynomial annealing).
I am probably missing something, as I haven’t had the chance to dig into the code, but has anyone else had experience with this? Also, has anyone compared v0.7 with v1.0 training accuracy/perplexity? I am sure I should be getting results that are as good or even better, so I’m hoping to find out what I am missing.
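For reference, this is the kind of call I’m experimenting with. I’m assuming div_factor and pct_start are the v1 counterparts of the old div/pct options, and `learn` stands in for an already-built language-model learner:

```python
# Assumes `learn` is an existing fastai v1 Learner for the language model.
learn.fit_one_cycle(1, max_lr=1e-2,
                    moms=(0.95, 0.85),  # cyclical momentum: 0.95 -> 0.85 -> 0.95
                    div_factor=25.0,    # start the cycle at max_lr / 25
                    pct_start=0.3)      # spend 30% of iterations ramping the LR up
```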