Meet Ranger - RAdam + Lookahead optimizer

@LessW2020 sorry I didn’t respond earlier. I was busy with a competition and completely forgot about your post.

Actually, I wanted to try freezing and unfreezing with Ranger for the competition, but as I mentioned earlier, I was unsuccessful.

Here is the code. Feel free to play around with it and figure out where the bug is. I will probably only revisit it in about a week, to try it out for a different competition.


@LessW2020 Please check this out


I read the Lookahead paper and ran into some difficulties, and thought this would be the best place to ask.

  1. A question related to Figure 1 of the paper: how do I generate such a plot using matplotlib or some other library? They project the weights onto a plane defined by the first, middle, and last fast weights. How is this done? (See the sketch after this list.)

  2. A question related to Proposition 1 of the paper: I understand that there is a quadratic loss model that can act as a proxy for neural network optimization, but how do I use this to set the value of alpha found in Proposition 1 when I am using cross-entropy loss?
    Or is there no need to worry about alpha that much? Test some values to see which works best, or just use the default.
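On question 1, here is a hedged sketch of how such a projection is usually done (not taken from the paper's code): build an orthonormal basis for the plane through the first, middle, and last weight vectors with Gram-Schmidt, then plot each saved checkpoint's coordinates in that basis. `weights` is assumed to be a list of flattened 1-D weight tensors saved during training.

import torch
import matplotlib.pyplot as plt

def plane_coords(weights):
    w0, wm, wT = weights[0], weights[len(weights) // 2], weights[-1]
    e1 = (wm - w0) / (wm - w0).norm()      # first basis vector, toward the middle weights
    v = (wT - w0) - ((wT - w0) @ e1) * e1  # remove the e1 component (Gram-Schmidt)
    e2 = v / v.norm()                      # second basis vector
    xs = [((w - w0) @ e1).item() for w in weights]
    ys = [((w - w0) @ e2).item() for w in weights]
    return xs, ys

weights = [torch.randn(1000) for _ in range(20)]  # toy trajectory just for the demo
xs, ys = plane_coords(weights)
plt.plot(xs, ys, '-o')
plt.show()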

Great stuff, thanks for posting this! If I have time I’ll build in their exponential warmup and add a toggle so people can test both.
Also, their comments at the end about interpolating between Adam/SGD for late-stage convergence are basically…RangerQH :slight_smile:

I’ll have to write an article about RangerQH, but that’s what I’m using for my production models now (i.e. anytime I’m running 100+ training iterations).

Thanks again for the link!
Less


Hi @kushaj,
Good questions - I have just used their default alpha and it seems to work well.
You could certainly experiment with some alpha settings; that would be a very interesting avenue to delve into, to see if any trends emerge.
Best regards,
Less

Note that RAdam and Lookahead have both been added to fastai v2. I didn’t create a Ranger shortcut, but you can use

def opt_func(ps, lr=defaults.lr): return Lookahead(RAdam(ps, lr=lr))

for it (and potentially change any default hyper-param). Lookahead can be wrapped around any fastai optimizer for those wanting to experiment.
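Building on the snippet above, here is a hedged example that also exposes Lookahead’s own knobs; I’m assuming the fastai2 Lookahead accepts k (steps between slow-weight syncs) and alpha (the slow/fast interpolation factor), and the hyper-parameter values are illustrative, not recommendations.

def opt_func(ps, lr=defaults.lr):
    # Wrap RAdam in Lookahead, overriding the defaults explicitly
    return Lookahead(RAdam(ps, lr=lr, mom=0.9, sqr_mom=0.99, eps=1e-5, wd=1e-2),
                     k=6, alpha=0.5)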


@sgugger how do we pass a custom eps into the opt function (as we saw that 0.95 and 0.99 worked the best)? I’m wondering if that may be what’s missing, as I was trying to recreate the ImageWoof results and could not. The highest I achieved was 68% with your above code, whereas we would consistently get 74-78%. Thoughts? If the eps does not do the trick then I’ll post the problem notebook for others to get ideas :slight_smile:

You can pass the value for eps in the call to RAdam, but it looks like you want to pass the betas from your values. They are called mom and sqr_mom. Tab completion is your friend :wink:
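To spell that out with a hedged example (the 0.95/0.99 values echo the ones mentioned above and are illustrative): the Adam betas map to mom and sqr_mom in the fastai2 RAdam signature, passed alongside eps.

def opt_func(ps, lr=defaults.lr):
    # betas -> mom / sqr_mom in fastai2; eps goes in the same call
    return Lookahead(RAdam(ps, lr=lr, mom=0.95, sqr_mom=0.99, eps=1e-6))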


Still no luck :frowning: I tried recreating all the hyperparameters and steps and could only get up to 69.2%. @LessW2020 or @morgan, could either of you double-check me here? I ran SSA with Mish:

def opt_func(ps, lr=defaults.lr): return Lookahead(RAdam(ps, wd=1e-2, mom=0.95, eps=1e-6, lr=lr))

learn = Learner(dbunch, xresnet50(sa=True), opt_func=opt_func, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)

learn.fit_flat_cos(5, 4e-3, pct_start=0.72)

The code for SSA is here: https://github.com/sdoria/SimpleSelfAttention/blob/master/xresnet.py

I replaced act_fn with Mish()

Just in case, I am trying without bn_wd=False and true_wd=True, with regular wd (1e-2), to see if it lines up. v1 still did better (75.2%).

Also, my losses are much higher (2.7 at the start and 1.8 at the end, vs 1.92 and 1.19 on v1).

My notebook is here

@muellerzr just ran your notebook there and got the same result, 68%


Also, using MishJit code (from @TomB) I got a ~12% speedup; I used the version from @rwightman’s geffnet library.



Odd. @morgan how far did you get on your version of Ranger?

Having issues with my first hacky version on proper models/data. I’m going to do a rewrite for QHAdam based on the fastai2 RAdam version above. Maybe I’ll have a v0 later today; I think I have a better idea of what I’m doing now, so it looks like a fairly straightforward port to fastai. I’ll then see how that interacts with the fastai2 Lookahead.


Awesome! Any chance you could skip the QH and have plain Ranger? (Just so I can compare.) Or tell me where to comment out :slight_smile: (let me see if I can figure it out too). I think I may have found something. I believe that on line 11, v should be:

v = math.sqrt((1-sqr_mom**step) * (r - 4) / (r_inf - 4) * (r-2) / r*r_inf / (r_inf -2))

for RAdam’s step.
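For reference, a plain-Python rendering of the rectification term as it appears in the RAdam paper (Liu et al., 2019), with sqr_mom standing in for beta2 and step for t; this is a sketch for comparison, not the fastai2 code.

import math

def radam_rect(sqr_mom, step):
    r_inf = 2 / (1 - sqr_mom) - 1                               # rho_infinity
    r = r_inf - 2 * step * sqr_mom**step / (1 - sqr_mom**step)  # rho_t
    if r <= 4: return None  # rectification undefined; the paper falls back to an un-adapted step
    return math.sqrt(((r - 4) * (r - 2) * r_inf) /
                     ((r_inf - 4) * (r_inf - 2) * r))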

Let me see if that helps…

Still don’t think it’s quite working right; the initial loss is still 2.74. I think I need to figure out where this step goes: v/(1-sqr_avg**step)

Couldn’t seem to find what’s going on with radam…

@morgan does it look to you like the exponential moving average is being applied?


Just checked with RAdam on its own and it performs worse than Ranger, so Lookahead is doing something right…just don’t know if the problem is with the fastai2 RAdam implementation or Lookahead, or their interaction…



Yep, that’s exactly what I was puzzling over, but I wanted to check that I wasn’t missing any fastai magic :smiley:

So the bias-corrected moments hat m_t = m_t / (1 - beta1^t) and hat v_t = v_t / (1 - beta2^t) don’t look like they’re being calculated in the current RAdam implementation, correct?


Looks that way to me. I’ve been focusing on understanding how to migrate it all over, so let me make sure I understand how to go about this. Should we define it pre-step, grab the gradients there, and apply them during the step, passing them in as an argument? Or should we generate them during the step?


I would say pre-step, similar to how the state for average_grad and average_sqr_grad is calculated

Sounds good. I’ll try that out shortly and see what happens.


Something like this for the update to the first moment, I guess? I just need the first moment for QHAdam, so I’ll test this in parallel to your work :slight_smile:

def exp_average(state, p, mom, **kwargs):
    # Initialize the running first moment on the first step
    if 'exp_avg' not in state: state['exp_avg'] = torch.zeros_like(p.data)
    # Exponential moving average of the gradient
    state['exp_avg'] = (mom * state['exp_avg']) + ((1 - mom) * p.grad.data)
    return state

I don’t understand your remarks: hat v_t and hat m_t are computed, just not separately. debias1 and debias2 are the two terms you need to divide them by, and you can see them used in the line that applies the update to p.
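A minimal sketch of that equivalence (all names here are illustrative, not fastai’s): dividing the running averages by debias1/debias2 inside the step gives the same update as materializing hat m_t and hat v_t first.

import torch

mom, sqr_mom, step, eps = 0.9, 0.99, 10, 1e-5
grad_avg, sqr_avg = torch.rand(5), torch.rand(5)  # stand-ins for the running stats

debias1 = 1 - mom**step      # bias correction for the first moment
debias2 = 1 - sqr_mom**step  # bias correction for the second moment

m_hat, v_hat = grad_avg / debias1, sqr_avg / debias2               # explicit form
explicit = m_hat / (v_hat.sqrt() + eps)
fused = (grad_avg / debias1) / ((sqr_avg / debias2).sqrt() + eps)  # fused form
assert torch.allclose(explicit, fused)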
