@LessW2020 sorry I didn’t respond earlier. I was busy with a competition and completely forgot about your post.
Actually, I wanted to try freezing and unfreezing with Ranger for the competition, but as I mentioned earlier, I was unsuccessful.
Here is the code. Feel free to play around with it to figure out where the bug is. I will probably only revisit it in about a week, to try it out for a different competition.
I read the Lookahead paper and found some difficulties and thought this would be the best place to ask.
Problem related to Figure 1 of the paper: how do I generate such a plot using matplotlib or some other library? They project the weights onto a plane defined by the first, middle and last fast weights. How is this done?
Problem related to Proposition 1 of the paper: I understand that there is a quadratic loss model that can act as a proxy for neural network optimization, but how do I use this information to set the value of alpha from Proposition 1 when I am using cross-entropy loss?
Or is there no need to worry about alpha that much? Should I just test some values to see which works best, or use the default value?
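On the Figure 1 question, my best guess at the projection (a numpy sketch assuming the standard Gram-Schmidt construction used in loss-landscape visualizations; the function names are mine, not from the paper): take the first, middle and last fast weights, build two basis vectors from their differences, orthonormalize, and express every iterate by its two coordinates in that plane.

```python
import numpy as np

def plane_basis(w0, wm, wT):
    """Orthonormal 2D basis for the plane through w0, wm, wT
    (first, middle, last fast weights), via Gram-Schmidt."""
    u = wm - w0
    u_hat = u / np.linalg.norm(u)
    v = wT - w0
    v = v - (v @ u_hat) * u_hat      # strip the component along u
    v_hat = v / np.linalg.norm(v)
    return u_hat, v_hat

def project(w, w0, u_hat, v_hat):
    """2D coordinates of a flattened weight vector w in that plane."""
    d = w - w0
    return float(d @ u_hat), float(d @ v_hat)

# Toy example with 3-dimensional "weights"
w0 = np.array([0., 0., 0.])
wm = np.array([1., 0., 0.])
wT = np.array([1., 1., 0.])
u_hat, v_hat = plane_basis(w0, wm, wT)
xm, ym = project(wm, w0, u_hat, v_hat)   # (1.0, 0.0)
xT, yT = project(wT, w0, u_hat, v_hat)   # (1.0, 1.0)
```

For the loss contours in the background, you would evaluate the loss on a grid of points `w0 + x*u_hat + y*v_hat` and hand the grid to matplotlib's `contourf`.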
Great stuff, thanks for posting this! If I have time I’ll build in their exponential warmup and put a toggle so people can test both.
Also, their comments at the end about interpolating between Adam/SGD for late stage convergence is basically…RangerQH
I’ll have to write an article about RangerQH but that’s what I’m using for my production models now (i.e. anytime I’m running over 100+ training iterations).
Hi @kushaj,
Good questions - I have just used their default alpha and seems to work well.
You could certainly experiment with some alpha settings to test, that would be a very interesting avenue to delve into and see if any trends emerge from it.
Best regards,
Less
@sgugger how do we pass a custom eps into the opt function (as we saw that 0.95 and 0.99 worked the best)? I’m wondering if that may be what’s missing, as I was trying to recreate the ImageWoof results and could not. The highest I achieved was 68% with your above code, whereas we would consistently get 74-78%. Thoughts? If the eps does not do the trick, I’ll post the problem notebook for others to get ideas.
You can pass the value for eps in the call to RAdam, but it looks like you want to pass the betas from your values. They are called mom and sqr_mom. Tab completion is your friend.
Still no luck. I tried recreating all the hyperparameters and steps for it and could only get up to 69.2%. @LessW2020 or @morgan, could either of you double-check me here? I ran SSA with Mish:
Having issues with my first hacky version on proper models/data. Am going to do a rewrite for QHAdam based on the fastai2 RAdam version as above. Maybe have a v0 later today; I think I have a better idea of what I’m doing now, so it looks like a fairly straightforward port to fastai. Will see how that interacts with the fastai2 Lookahead then.
Awesome! Any chance you could skip the QH and have Ranger? (Just so I can compare.) Or tell me where to comment out (let me see if I can figure it out too). I think I may have found something. I believe that on line 11, v should be `v = math.sqrt((1-sqr_mom**step) * (r - 4) / (r_inf - 4) * (r-2) / r*r_inf / (r_inf -2))` for RAdam’s step.
Let me see if that helps…
Still don’t think it’s quite working right; the initial loss is still 2.74. I think I need to figure out where this step goes: `v/(1-sqr_avg**step)`
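For reference, here is how I understand the rectified step from the RAdam paper, as a standalone pure-Python sketch (my own minimal scalar version, not the fastai2 code, so treat the names as assumptions). The `v/(1-sqr_mom**step)` debiasing goes under the square root, and the rectification term multiplies the step only once rho_t exceeds 4:

```python
import math

def radam_rect(beta2, step):
    """RAdam variance rectification term r_t (Liu et al. 2019).
    Returns None while the variance is intractable (rho_t <= 4)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

def radam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-5):
    """One RAdam update for a single scalar parameter p with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # exponential avg of grads
    v = beta2 * v + (1 - beta2) * g * g      # exponential avg of squared grads
    m_hat = m / (1 - beta1 ** step)          # debias the first moment
    r = radam_rect(beta2, step)
    if r is None:
        p = p - lr * m_hat                   # early steps: SGD with momentum
    else:
        denom = math.sqrt(v / (1 - beta2 ** step)) + eps   # debias, sqrt, eps
        p = p - lr * r * m_hat / denom
    return p, m, v
```

With beta2=0.99, the first few steps fall into the un-rectified branch (rho_t <= 4), which is one easy thing to check the fastai2 version against.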
Couldn’t seem to find what’s going on with radam…
@morgan does it look like exponential average is being applied to you?
Just checked with RAdam on its own and it performs worse than Ranger, so Lookahead is doing something right…just don’t know if the problem is with the fastai2 RAdam implementation or Lookahead, or their interaction…
Looks that way to me. I’ve been focusing on understanding how to migrate it all over, so let me make sure I understand how to go about this. Should we define it pre-step, grab the gradients, and apply them during the step, passing them in as an argument? Or should we generate them during the step?
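For what it’s worth, my current mental model of the paper’s Algorithm 1 as a standalone sketch (pure Python around a generic inner step function, not the fastai2 code): the slow weights are a copy kept outside the inner optimizer, the inner optimizer runs k fast steps on its own, and only at the end of a cycle do we interpolate `slow += alpha * (fast - slow)` and reset the fast weights onto the new slow ones.

```python
class Lookahead:
    """Minimal Lookahead (Zhang et al. 2019) around any inner step function.
    step_fn(params, grads) must return the updated fast weights."""
    def __init__(self, params, step_fn, k=6, alpha=0.5):
        self.fast = list(params)
        self.slow = list(params)          # copy of the initial weights
        self.step_fn, self.k, self.alpha = step_fn, k, alpha
        self.count = 0

    def step(self, grads):
        self.fast = self.step_fn(self.fast, grads)
        self.count += 1
        if self.count % self.k == 0:      # end of a fast-weight cycle
            self.slow = [s + self.alpha * (f - s)
                         for s, f in zip(self.slow, self.fast)]
            self.fast = list(self.slow)   # reset fast weights onto slow ones
        return self.fast

# Toy usage: inner optimizer is plain SGD with lr=0.1, constant gradient 1.0
sgd = lambda ps, gs: [p - 0.1 * g for p, g in zip(ps, gs)]
opt = Lookahead([1.0], sgd, k=2, alpha=0.5)
opt.step([1.0])       # fast weight: 0.9
w = opt.step([1.0])   # fast: 0.8, cycle ends: slow = 1.0 + 0.5*(0.8-1.0) = 0.9
```

So the gradients only ever feed the inner optimizer; Lookahead itself never sees them, which is why I’d expect it to wrap the step rather than hook into the gradient computation.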
I don’t understand your remarks: hat vt and hat mt are computed, just not separately. debias1 and debias2 are the two terms you need to divide them by, and you can see them used in the line that applies the update on p.
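Concretely, the two forms give the same update; a quick check in generic Adam notation (the numbers below are made up, and the variable names are mine rather than the fastai2 source):

```python
import math

# Adam hyperparameters and fake running stats after `step` updates
mom, sqr_mom, lr, eps, step = 0.9, 0.99, 1e-3, 1e-8, 10
m, v = 0.02, 0.004

debias1 = 1 - mom ** step        # bias correction for the first moment
debias2 = 1 - sqr_mom ** step    # bias correction for the second moment

# Textbook form: compute hat(m) and hat(v) as separate quantities
m_hat, v_hat = m / debias1, v / debias2
upd_textbook = lr * m_hat / (math.sqrt(v_hat) + eps)

# Folded form: the debias terms appear only inline in the update on p
upd_folded = lr * (m / debias1) / (math.sqrt(v / debias2) + eps)

assert math.isclose(upd_textbook, upd_folded)
```

Same arithmetic either way; the hats just never get their own variables in the folded version.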