Well, surprisingly, right after my article on RAdam, I found a new paper on the LookAhead optimizer, co-authored by Geoffrey Hinton.
RAdam stabilizes training at the start, and LookAhead stabilizes training and convergence during the rest of training…so it was immediately clear that putting the two together might build a dream-team optimizer.
I was not disappointed, as the first run with Ranger (an integration of both) jumped to 93% on the 20-epoch ImageNette test.

I’ve written yet another optimizer article to get into more details:

And put the Ranger source out for anyone to quickly test. I merged the two into one codebase to make it easier to integrate into FastAI and general use, but you can also plug AdamW or SGD into LookAhead directly.
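For anyone curious how that wrapping works under the hood, the core Lookahead loop is simple enough to sketch framework-free. This is just a toy illustration of the paper's k-step/interpolate idea, not the Ranger code; `ToySGD` and `ToyLookahead` are made-up names standing in for a real inner optimizer and the real wrapper:

```python
class ToySGD:
    """Stand-in for a fast inner optimizer (e.g. RAdam or AdamW)."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def step(self, w, grad):
        return w - self.lr * grad


class ToyLookahead:
    """Every k fast steps, pull the slow weights toward the fast ones
    by a factor alpha, then restart the fast weights from there."""
    def __init__(self, inner, k=5, alpha=0.5):
        self.inner, self.k, self.alpha = inner, k, alpha
        self.counter = 0
        self.slow = None

    def step(self, w, grad):
        if self.slow is None:
            self.slow = w                 # initialize slow weights
        w = self.inner.step(w, grad)      # fast weight update
        self.counter += 1
        if self.counter % self.k == 0:    # synchronization point
            self.slow += self.alpha * (w - self.slow)
            w = self.slow                 # fast weights restart from slow
        return w


# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.0.
opt = ToyLookahead(ToySGD(lr=0.1), k=5, alpha=0.5)
w = 1.0
for _ in range(50):
    w = opt.step(w, 2 * w)
```

The interpolation step is what smooths out the variance of the fast optimizer, which is why it pairs naturally with RAdam's stabilized start.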
Ranger code is here:

I’m readying a notebook as well based on previous feedback, but I kept getting kicked off Salamander over and over today…so I was not able to finish my testing and the notebook.
But here’s the basic process:

see the new high accuracy percentages start appearing:

Enjoy!

33 Likes

Great work @LessW2020! Very very exciting to see all this effort you’ve put into getting this operating and discovering how it all works together even better!!!

1 Like

Thanks for the kind words @muellerzr!

As requested:

A) Ranger
%run train_imagenette.py --epochs 20 --bs 64 --lr 20e-2 --mixup 0 --opt 'ranger'
effective lr = 0.05

5 runs

Accuracy:
np.mean([93.2,93.6,93.4,93.4,92.8]) = 93.28%
Valid Loss:
np.mean([0.704754,0.708478,0.691563,0.691473,0.686495]) = 0.69655

B) Baseline

%run train_imagenette.py --epochs 20 --bs 64 --lr 12e-3 --mixup 0
eff lr = 3e-3

np.mean([93.4,93,92.6,94,92.8]) = 93.16%
np.mean([0.696437,0.714027,0.708068,0.690530,0.706667]) = 0.7031458

C) Baseline (eff lr= 1e-2)

np.mean([92.8,91.8,93.2,93.4,92.8]) = 92.8%

Conclusion:
No detected difference in accuracy. Maybe a small signal in valid loss (p=0.25, would still need more data). I’d try with a harder dataset like Imagewoof.

If results are equivalent to Adam, maybe there are advantages to running Ranger:

• is it less sensitive to lr?
• can we skip warmup?
1 Like

Thanks for the great testing @Seb!

I’m coming to the same conclusion that we’ll need to move to a harder dataset. I’m wondering if ImageNette is effectively maxed out for XRes50 in general, so we’re not able to readily see improvements from supporting changes like activation functions and optimizers.

1 Like

I think it’s a good idea to try things out with xresnet18; it takes less time and fewer resources.

1 Like

I’m all for faster testing, especially when I keep getting pre-empted on my servers, lol.
That said, we’d need a leaderboard for X18 in order to have a baseline to test with.

1 Like

Man, your articles are so good, I got a paid Medium membership just for this.

2 Likes

Hello,

Thanks for the efforts.

By combining RAdam and Lookahead, the simple way should be like the code below:

What is the difference between your implementation and this one?

Thanks.

3 Likes

Hi @cooli46,
Absolutely none - you can definitely just use the wrapper as you show. I show the same thing in the article.
The reasons I integrated them were modularity (a lot of people on Medium just wanted one-line plugins, it seemed) and ease of making future code enhancements.
I have plans to test some auto-lr ideas to integrate with these, to see if that can further improve things and remove the lr-selection issues.
Hope that helps!

1 Like

Hi @Johnyquest,
wow, thanks a ton for the feedback - that definitely is motivating! I will keep working on future articles!
Thanks again,
Less

Hi!

Good job!

For the time being, I’m using Lookahead as a wrapper, but if you do, you’ll have to define the `state_dict` method (basically return the state_dict of `base_optimizer`). Otherwise, you might get a surprise when you try to save your optimizer’s checkpoint.
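The delegation is a one-liner. Here is a minimal sketch with toy classes (`ToyInner` and `ToyLookahead` are made-up stand-ins, not the real PyTorch optimizer API) of how the wrapper can forward checkpointing to the inner optimizer:

```python
class ToyInner:
    """Stand-in for the wrapped base optimizer (e.g. RAdam)."""
    def __init__(self):
        self.state = {"step": 0}

    def state_dict(self):
        return {"state": dict(self.state)}

    def load_state_dict(self, sd):
        self.state = dict(sd["state"])


class ToyLookahead:
    """Wrapper that would otherwise hide the inner optimizer's state."""
    def __init__(self, base_optimizer):
        self.base_optimizer = base_optimizer

    # Forward checkpointing to the wrapped optimizer.
    def state_dict(self):
        return self.base_optimizer.state_dict()

    def load_state_dict(self, sd):
        self.base_optimizer.load_state_dict(sd)


opt = ToyLookahead(ToyInner())
opt.base_optimizer.state["step"] = 7
ckpt = opt.state_dict()          # captures the inner optimizer's state

fresh = ToyLookahead(ToyInner())
fresh.load_state_dict(ckpt)      # restores it on a new wrapper
```

Without the forwarding methods, saving the wrapper would checkpoint the (empty) wrapper state instead of the inner optimizer's.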

1 Like

By the way @LessW2020,
with which type of scheduler have you performed your experiments so far?
I’ve been using a custom version of OneCycleScheduler, but I’ve seen posts suggesting the warmup phase should be swapped for a flat LR (=max_lr).

Considering the reduced need for warmup with RAdam and Lookahead, we might need to consider new schedulers to make the most out of those two!

1 Like

Is there a keras/tensorflow implementation of this?
This looks so interesting.

1 Like

mgrankin has tried another scheduler here: https://github.com/mgrankin/over9000

1 Like

Yes, thanks @Seb, I’ve been playing around with his implementation since I found the other post.
The best part is that the flat + cosine phases can be obtained from a OneCycle scheduler where div_factor=1 and final_div << 1.
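That equivalence is easy to see in the schedule math. A sketch of a generic one-cycle curve (illustrative parameter names, not the exact fastai signature): with div_factor=1 the warmup phase collapses into a flat phase at max_lr, leaving flat + cosine anneal:

```python
import math

def cosine_interp(start, end, pct):
    """Cosine interpolation from start to end as pct goes 0 -> 1."""
    return end + (start - end) * (1 + math.cos(math.pi * pct)) / 2

def one_cycle_lr(step, total, max_lr, div_factor=25.0, final_lr=0.0,
                 pct_start=0.72):
    """Phase 1: max_lr/div_factor -> max_lr; phase 2: max_lr -> final_lr.
    With div_factor=1, phase 1 is constant at max_lr (flat + cosine)."""
    pivot = int(total * pct_start)
    if step < pivot:
        return cosine_interp(max_lr / div_factor, max_lr, step / max(1, pivot))
    return cosine_interp(max_lr, final_lr,
                         (step - pivot) / max(1, total - pivot))

# With div_factor=1, the first 72% of steps sit at max_lr, then anneal down.
flat = [one_cycle_lr(s, 100, 1e-3, div_factor=1.0) for s in range(101)]
```

The same function with div_factor=25 gives back the usual warmup-then-anneal one-cycle shape, so one scheduler covers both regimes.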

2 Likes

Now that we’ve played around with adding a bunch of ideas together in the mish thread, I think it’s a good idea to go back to individual ideas.

I think Ranger needs a second look. Using --sched_type flat_and_anneal --ann_start 0.72 --mom .95,
I get an improvement of 2% (p<0.05, 5 runs) over Adam + OneCycle on 5 epochs, Imagewoof 128px.

(I used the full-sized dataset and the increased channel count on my xresnet, so my baseline is 66%)

On @grankin 's github, Adam and Ranger did 61.2% and 59.4%, but using ann_start=0.50. So it seems we missed some potential on Ranger by annealing too early.

It could also be that Lookahead is bringing all the improvement (or RAdam?).
And we’ll need to try with more epochs.

One thing I like with Ranger is that it seems to run epochs as fast as Adam. Worth testing more!

3 Likes

I’m trying to make Lookahead work with Pickle atm.
I’ve added the code `def __getstate__(self): return self.__dict__` to Lookahead to override the parent method. That should return base_optimizer in the dict, and then base_optimizer should correctly save its state. I can’t figure out what is missing here.

Hey @grankin!

So the thing is that your class inherits from `torch.optim.optimizer.Optimizer`. You do inherit the methods, but you’ll want to be careful about what they’re applied to.

In your case, my best guess is:

• what you wrote above does not change (is exactly like) the inherited method of the Optimizer class. What you actually want is to get the state of `base_optimizer`; otherwise the method is applied to `self`, which is quite different from `self.base_optimizer`. So overriding the method like below should work:
```python
def __getstate__(self): return self.base_optimizer.__getstate__()
```
• you’ll probably need to double-check your other methods for this
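To make the problem concrete, here is a toy pickle round-trip using that delegated `__getstate__` (made-up classes, just to show the mechanism, not the real optimizers):

```python
import pickle

class ToyBase:
    """Stand-in for the wrapped base optimizer."""
    def __init__(self):
        self.state = {"step": 3}

    def __getstate__(self):
        return self.__dict__

class ToyLookahead:
    def __init__(self, base_optimizer):
        self.base_optimizer = base_optimizer

    def __getstate__(self):
        # Delegate to the wrapped optimizer, per the fix above.
        return self.base_optimizer.__getstate__()

    def __setstate__(self, state):
        # Rebuild the wrapper around a base optimizer carrying that state.
        self.base_optimizer = ToyBase()
        self.base_optimizer.__dict__.update(state)

opt = ToyLookahead(ToyBase())
opt.base_optimizer.state["step"] = 9
restored = pickle.loads(pickle.dumps(opt))  # round-trip keeps inner state
```

The matching `__setstate__` is what turns the delegated state back into a working wrapper on unpickling.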

For further details, you can check my implementation of the wrapper if that can help: https://github.com/frgfm/Holocron/blob/master/holocron/optim/lookahead.py

Cheers