Well, surprisingly, right after my article on RAdam, I found a new paper on the LookAhead optimizer, co-authored by Geoffrey Hinton.
RAdam stabilizes training at the start; LookAhead stabilizes training and convergence during the rest of training… so it was immediately clear that putting the two together might make a dream-team optimizer.
I was not disappointed, as the first run with Ranger (the integration of both) jumped to 93% on the 20-epoch ImageNette test.
I’ve written yet another optimizer article to go into more detail:
I’ve also put the Ranger source out for anyone to test quickly. I merged the two into one codebase to make it easier to integrate into FastAI and for general use, but you can also plug AdamW or SGD into LookAhead directly.
Ranger code is here:
I’m readying a notebook as well based on previous feedback, but I kept getting kicked off Salamander over and over today… so I was not able to finish my testing and the notebook.
But here’s the basic process:
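Roughly like this (a minimal sketch, assuming fastai v1 and that the Ranger class from the repo above is importable; the import path and hyperparameters here are just illustrative):

```python
from functools import partial
from fastai.vision import *                          # fastai v1
from fastai.vision.models.xresnet import xresnet50
from ranger import Ranger                             # assumed import path; adjust to wherever ranger.py lives

# Imagenette at 128px, roughly the setup used for the 20 epoch test
path = untar_data(URLs.IMAGENETTE_160)
data = ImageDataBunch.from_folder(path, valid='val', size=128, bs=64).normalize(imagenet_stats)

# the only change vs. a normal run: pass Ranger in as opt_func
learn = Learner(data, xresnet50(c_out=10), opt_func=partial(Ranger), metrics=[accuracy])
learn.fit_one_cycle(20, max_lr=1e-2)
```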
Great work @LessW2020! Very exciting to see all the effort you’ve put into getting this working and discovering how it all works together even better!!!
Conclusion:
No detected difference in accuracy. Maybe a small signal in valid loss (p=0.25, would still need more data). I’d try with a harder dataset like Imagewoof.
If results are equivalent to Adam, maybe there are advantages to running Ranger:
I’m coming to the same conclusion that we’ll need to move to a harder dataset. I’m wondering if ImageNette is effectively maxed out for XRes50 in general, so we’re not able to readily see improvements from things like activation functions and optimizers.
I’m all for faster testing, especially when I keep getting pre-empted on my servers, lol.
That said, we’d need a leaderboard for X18 in order to have a baseline to test with.
Hi @cooli46,
Absolutely none - you can definitely just use the wrapper like you show. I show the same thing in the article.
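For reference, the wrapper route is just something like this (a sketch; it assumes standalone RAdam and Lookahead implementations, and the exact constructor arguments depend on which version you grab):

```python
import torch
from radam import RAdam          # standalone RAdam (assumed import path)
from lookahead import Lookahead  # standalone Lookahead wrapper (assumed import path)

model = torch.nn.Linear(10, 2)   # any model

# inner optimizer first, then wrap it: Lookahead keeps "slow" weights and
# interpolates toward them every k steps with weight alpha
base_opt = RAdam(model.parameters(), lr=1e-3)
opt = Lookahead(base_opt, alpha=0.5, k=6)

# from here on, use `opt` exactly like any other torch optimizer
# (or use AdamW / SGD as the inner optimizer instead of RAdam)
```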
The reasons I integrated them were modularity (a lot of people on Medium seemed to just want one-line plugins) and ease of making future code enhancements.
I have plans to test out some auto-LR ideas to integrate with these, to see if that can further improve things and remove the LR selection issues.
Hope that helps!
For the time being, I’m using Lookahead as a wrapper, but if you do, you’ll have to define the state_dict method (basically return the state_dict of base_optimizer). Otherwise, you might get a surprise when you try to save your optimizer’s checkpoint.
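Something along these lines, i.e. delegate checkpointing to the inner optimizer (a sketch; it assumes the wrapper stores it as self.base_optimizer, and you’ll likely want the matching load_state_dict too):

```python
# added inside the Lookahead wrapper class
def state_dict(self):
    # save the wrapped optimizer's state, not the (empty) state of the wrapper itself
    return self.base_optimizer.state_dict()

def load_state_dict(self, state_dict):
    # symmetric override so restoring a checkpoint goes to the wrapped optimizer too
    self.base_optimizer.load_state_dict(state_dict)
```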
By the way @LessW2020,
with which type of scheduler have you performed your experiments so far?
I’ve been using a custom version of OneCycleScheduler, but I’ve seen posts suggesting the warmup phase should be swapped for a flat LR (= max_lr).
Considering the reduced need for warmup with RAdam and Lookahead, we might need to consider new schedulers to make the most out of those two!
Yes, thanks @Seb, I’ve been playing around with his implementation since I found the other post.
The best part is that the flat + cosine phases can be obtained from a OneCycle scheduler where div_factor=1 and final_div << 1.
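With the stock fastai v1 fit_one_cycle that would look roughly like this (a sketch; note the stock scheduler ends at max_lr/final_div, so there a large final_div is what gives a near-zero ending LR; the LR and pct_start values are just illustrative):

```python
# flat-then-anneal via OneCycle: div_factor=1 removes the warmup ramp,
# pct_start controls how long the LR stays flat before the cosine anneal
learn.fit_one_cycle(5,
                    max_lr=4e-3,
                    div_factor=1.,        # start LR == max_lr, so the "warmup" phase is flat
                    pct_start=0.72,       # fraction of training spent at the flat LR
                    final_div=1e4,        # stock semantics: ending LR = max_lr / final_div
                    moms=(0.95, 0.95))    # keep momentum constant instead of cycling it
```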
Now that we’ve played around with adding a bunch of ideas together in the mish thread, I think it’s a good idea to go back to individual ideas.
I think Ranger needs a second look. Using --sched_type flat_and_anneal --ann_start 0.72 --mom .95,
I get an improvement of 2% (p<0.05, 5 runs) over Adam + OneCycle on 5 epochs, Imagewoof 128px.
(I used the full-sized dataset and the increased channel count on my xresnet, so my baseline is 66%)
On @grankin 's GitHub, Adam and Ranger got 61.2% and 59.4%, but those runs used ann_start=0.50. So it seems we missed some of Ranger's potential by annealing too early.
It could also be that Lookahead is bringing all the improvement (or RAdam?).
And we’ll need to try with more epochs.
One thing I like with Ranger is that it seems to run epochs as fast as Adam. Worth testing more!
I’m trying to make Lookahead work with pickle at the moment.
I’ve added the code def __getstate__(self): return self.__dict__ to Lookahead to override the parent method. That should return base_optimizer in the dict, and then base_optimizer should correctly save its state. I can’t figure out what is missing here.
So the thing is that your class inherits from torch.optim.optimizer.Optimizer. You do inherit its methods, but you’ll want to be careful about what they’re applied to.
In your case, my best guess is:
What you wrote above doesn’t change anything (it behaves exactly like the inherited method of the Optimizer class). What you actually want is to get the state of base_optimizer; otherwise the method is applied to self, which is quite different from self.base_optimizer. So overriding the method like below should work:
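Something like this, for instance (a sketch, assuming the wrapper keeps the inner optimizer as self.base_optimizer):

```python
# inside the Lookahead wrapper: pickle the wrapped optimizer's state,
# not the wrapper's own (unpopulated) Optimizer state
def __getstate__(self):
    return self.base_optimizer.__getstate__()
```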