Meet RAdam - imo the new state-of-the-art AI optimizer

Yes, I tried different learning rates, several times; the results were close.
It should be in the logs - I'll look at it again…
At my home box I train without fp16.
I also tried it on Colab with fp16, with the same results.
I'll check it…

Hi @LessW2020,

Thank you for reporting back the results of your experiment! Glad to see that this summer has been such a productive one for optimizer design :slight_smile:

On my side, I haven’t yet been able to get improvements using RAdam alone compared to Adam (I used a personal reimplementation of lr_find for both, so I don’t think I’m simply seeing sensitivity to the LR). But I’m going to give Lookahead coupled with RAdam a try.

Besides, I was wondering if you had a specific idea about combining RAdam and Lookahead?
From my understanding, since Lookahead is just a wrapper class around another optimizer, I don’t see the need (at least in this implementation) for combining both in a single class. Something like:

optimizer = RAdam(model_params)
if lookahead_wrapper:
    optimizer = Lookahead(optimizer, alpha=0.5, k=6)

would do just fine. But perhaps you have an idea for future improvements when combining both?

Thanks in advance!

Hi @fgfm,
You are correct - Lookahead is a wrapper and will wrap any optimizer underneath (I actually show this in my article :slight_smile:).

The reasons I integrated Lookahead and RAdam into a single class (Ranger) were:
1 - Ease of use/FastAI integration - to make it easy to plug into FastAI. I didn’t want an extra layer of lr pass-through (since FastAI changes the lr extensively via fit_one_cycle) and wasn’t sure how well that would be handled otherwise. Similarly, based on comments on my RAdam article, it was clear that people just wanted a simple plugin to test with.
2 - Future work - I’m planning to test automated lr-seeking code that adjusts based on the loss surface, and that wouldn’t be easy with the wrapper setup. One class makes it a lot easier to code and test.
3 - Minimal downside - From a theory standpoint, I’m hard pressed to explain why one would not want to use RAdam and Lookahead given their advances (i.e. it’s very hard to see how they would perform worse than, say, plain Adam, and in most cases they should perform better), so why not put them into one class together.

Anyway, that’s why I bundled them together into Ranger. As I get time/money to dev/test, I’ll integrate the lr seeker as well and see how that performs.
You can definitely use it as a wrapper and test out other underlying optimizers - if you find a better combo than RAdam+Lookahead, please post about it!
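For anyone curious, here’s a rough sketch of what the single-class idea looks like. This is not the actual Ranger source - the RAdam variance rectification and bias correction are omitted for brevity, and the names are illustrative:

import torch
from torch.optim.optimizer import Optimizer

class RangerSketch(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.95, 0.999), eps=1e-8,
                 alpha=0.5, k=6):
        defaults = dict(lr=lr, betas=betas, eps=eps, alpha=alpha, k=k)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)
                    state['slow_buffer'] = p.detach().clone()   # Lookahead slow weights
                state['step'] += 1
                # Adam-style inner step (stand-in for the full RAdam update)
                state['exp_avg'].mul_(beta1).add_(p.grad, alpha=1 - beta1)
                state['exp_avg_sq'].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                denom = state['exp_avg_sq'].sqrt().add_(group['eps'])
                p.addcdiv_(state['exp_avg'], denom, value=-group['lr'])
                # Lookahead folded into the same step(): every k steps, pull the
                # slow weights toward the fast weights and copy them back
                if state['step'] % group['k'] == 0:
                    slow = state['slow_buffer']
                    slow.add_(p - slow, alpha=group['alpha'])
                    p.copy_(slow)
        return loss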
Best regards,
Less

It would be very interesting if you folks could also try Novograd:

Wow, thanks for the pointer to this paper. It looks very compelling - I like what they are doing with the gradient normalization.
At the risk of turning into a walking optimizer (since that’s all I’ve been working on lately), I’ve got it running now; fingers crossed I don’t get pre-empted, and I’ll have some results in the next hour.
edit - 93.4% on the first 20 epochs… so that’s quite competitive. Trying a range of lrs now to get a better feel for it.

Thanks to everyone’s work testing these new optimizers. I have been switching among Adam, RAdam, and Novograd, with and without the Lookahead wrapper. It’s all anecdotal, but I can’t see any significant difference in their rates of convergence or asymptotic losses. The only standout conclusion is that RAdam’s loss graph is smoother.

One observation: I see a large variation in loss over the same training period, about ±15%, depending on the random initialization of the model. So it will be important to average many runs to assess the relative performance of these optimizers.

My (hobby) test case is a time series of 5000 stock prices x 16 symbols, trained with 20000 passes at lr=.001 followed by 20000 at lr=.0001. No RNN, all “parallel” operations.
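For concreteness, that two-stage schedule amounts to something like the following in generic PyTorch (model, loss_fn, and next_batch are placeholders, not my actual code):

import torch

# model, loss_fn, and next_batch() stand in for the real setup
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(40000):
    if step == 20000:                  # second half: drop lr to 1e-4
        for g in opt.param_groups:
            g['lr'] = 1e-4
    x, y = next_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()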

I’ll keep testing, and hope we can figure out the useful domains of these new optimizers.

I got about 2.5 hours of testing in before getting pre-empted.
Summary so far is Novograd produces nice curves and ends up about the same place as Ranger (RAdam+Lookahead).
I tested Novograd + Lookahead and the training got more erratic.

One takeaway so far is that both Ranger and Novograd seem to do better with higher learning rates than Adam does. 5e-2 or so seems to be a good spot, at least on ImageNette.

I’ll try tomorrow on ImageWoof and see if a harder dataset pulls out more differences.

I’ll have to start a new thread but Novograd is doing great on ImageWoof. It seems a harder dataset is what is needed to really show the differences for the optimizers.
I just got pre-empted again, but here are the initial results. I’m testing a few other lrs if I can get back on, and will then do the 5 runs and hopefully submit a new leaderboard score.

I ran it for 2 extra epochs (which don’t count), mostly to see if it would keep improving, and it appears it does.

When you are doing a direct comparison, how do you determine the learning rate for each of Adam, RAdam, and Ranger?

I’m testing them now with the same hyperparameters.
I’m trying to fix everything except the optimizer, to avoid the possibility that I just picked a bad lr for a given optimizer, but I’m now wondering whether that’s the wrong approach.

Interesting result so far!

Before concluding that Novograd is better than Adam in this case, you might want to rerun the baseline with Jeremy’s intended effective learning rate (as I described elsewhere [1], the baseline was run at lr = 0.75e-3 rather than the intended lr = 3e-3 because of an oversight in the code).

I don’t have data for 128 px, but on ImageWoof 256 px I reworked the baseline to 83.9% rather than 81.8%
(the 85.7% being my leaderboard entry, which I don’t like too much because it runs slower).

Also, I suggest running on vast.ai when salamander is not happening. Cheaper, faster, no pre-emption; but you usually “lose” your machine and files if you stop, and machines are not always available.

[1] ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Hi @Johnny,
Unfortunately you can’t run all the different optimizers under the same lr or you will hold most of them back.
So far all of the new ones work better with a higher lr than Adam - often an order of magnitude higher (5e-2, for example, vs 3e-3 for Adam).
Unfortunately, finding a good lr is a bit of trial and error - I just run the lr finder, pick a spot from that, and test. Then I’ll run +/- 10x from there (e.g. 5e-1, 5e-2, 5e-3) as a bounds test, and from that I have a pretty good idea of what works.
With Ranger, the other wildcard is the k parameter - 5 or 6 is the default, but I believe they also used 20 at one point in the paper, so running with 10 or 20 for k would be a good thing to try.
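For example, a rough sweep along those lines might look like this - train() here is just a stand-in for whatever script builds the learner and calls fit_one_cycle, and the Ranger import is assumed:

from functools import partial
from ranger import Ranger              # assumed import; adjust to your copy

# train() is a hypothetical helper that returns the final accuracy
for lr in (5e-1, 5e-2, 5e-3):          # +/- 10x bounds test around the lr_find pick
    for k in (5, 10, 20):              # Lookahead sync period
        acc = train(opt_func=partial(Ranger, k=k), lr=lr, epochs=20)
        print(f"lr={lr:g}  k={k}: {acc:.3f}")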
Please post any results you get!
Best regards,
Less

Thanks @LessW2020!

Unfortunately you can’t run all the different optimizers under the same lr or you will hold most of them back.

That’s also my suspicion, as my results right now are such that Adam > RAdam > Ranger,
using the default k=5 for Ranger.

I’m training the Yelp Review Polarity binary classification task.

Adam: accuracy 0.973587
RAdam: accuracy 0.972625
RAdam + Lookahead: accuracy 0.971488

I will try to increase the learning rate for RAdam and Ranger to see if it changes anything.

Well, hours and dollars later, nothing really promising - here are the results:

Novograd (lr 5e-2), 20 epochs:
80.6
80.6
80.2
81.2
82.4
Average of 5: 81.0%

RangerNovo (Novograd + Lookahead):
81.8

Ranger (RAdam + Lookahead):
80.0

Adam:
80.8
81.0
Average: 80.9%

So Adam and Novograd end up in a virtual tie (and @Seb was right - the leaderboard seems to be off, as it shows 78.4).
RangerNovo may have some promise, but I got pre-empted before I could run it more than once.

Thanks for all the runs!

Currently working on this.
Trying it with the Yelp Polarity dataset, fine-tuning with ULMFiT.

Will share results soon.

I reread the Novograd paper and noticed that they used label smoothing… I’m re-running with the following changes and getting much better results:
1 - LabelSmoothing + Mixup
2 - no weight decay

I’m getting around 82% on multiple runs - about 1% better than Adam and about 1% better than without label smoothing. By comparison, I got worse results when I used label smoothing with regular Adam.
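In fastai v1 terms, the change is roughly the following (a sketch only - data, model, and the Novograd implementation are assumed to come from the existing training script):

from fastai.vision import *            # fastai v1 style imports

# data (the ImageWoof DataBunch), model, and Novograd come from the existing script
learn = Learner(data, model,
                loss_func=LabelSmoothingCrossEntropy(),   # 1 - label smoothing
                opt_func=Novograd,
                wd=0,                                      # 2 - no weight decay
                metrics=[accuracy])
learn = learn.mixup()                                      # 1 - mixup on top
learn.fit_one_cycle(20, 5e-2)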

So Novograd still has some promise. I’m also testing one more optimizer (Stochastic Adam) that doesn’t win at 20 epochs, but its training curve is so smooth that I want to run all of these up to 40 epochs and see whether Stochastic Adam outperforms or not.
In other news - the code for AutoOpt is due to be released tomorrow, and that is going to be very, very exciting: 100% fully automated learning rate and momentum handling, and they showed it finds params equivalent to the best found by an intensive manual grid search for each dataset.
Who knew the optimizer space would be such a hotbed of activity this summer lol.

This was posted in another thread:

RAdam + LARS + Lookahead beats Adam on both Imagenette and Imagewoof (5 epochs, 128px) if we change the learning rate schedule.

Just saw that - this is great news. I’m going to try to test the same setup on 20 epochs shortly to see if it can win on both 5 and 20!

Hi @LessW2020, I was wondering whether the Lookahead algorithm ruins the built-up momentum of RAdam / Adam. Since RAdam does well in the early training steps, I think it’s worth trying to zero out the moving averages of the gradient and squared gradient (m, v) at every Lookahead update.

The current approach forces the optimizer to keep moving in its previous direction (the one from before the Lookahead update, due to the momentum-like nature of the update), whereas a non-informative prior might be a better estimate of the optimal direction at the new weight position (no favored direction to move in). And RAdam can handle cold restarts.
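Concretely, the idea would be something like this - a rough sketch of a Lookahead-style wrapper, assuming an Adam/RAdam inner optimizer that keeps 'exp_avg' / 'exp_avg_sq' / 'step' in its state (names are illustrative):

import torch

class LookaheadWithReset:
    def __init__(self, base_optimizer, alpha=0.5, k=6):
        self.opt, self.alpha, self.k = base_optimizer, alpha, k
        self.counter = 0
        # Lookahead slow weights
        self.slow = [[p.detach().clone() for p in g['params']]
                     for g in self.opt.param_groups]

    def zero_grad(self):
        self.opt.zero_grad()

    @torch.no_grad()
    def step(self):
        self.opt.step()
        self.counter += 1
        if self.counter % self.k != 0:
            return
        for group, slow_group in zip(self.opt.param_groups, self.slow):
            for p, slow_p in zip(group['params'], slow_group):
                # usual Lookahead sync: move slow weights toward fast weights
                slow_p.add_(p - slow_p, alpha=self.alpha)
                p.copy_(slow_p)
                # proposed change: reset the moment estimates so the next inner
                # steps start from a non-informative prior at the new position
                state = self.opt.state.get(p, {})
                if 'exp_avg' in state:
                    state['exp_avg'].zero_()
                    state['exp_avg_sq'].zero_()
                    state['step'] = 0   # lets RAdam's rectification/warmup re-engage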

What do you think?

Hi @guyko,
It’s definitely worth testing - let me try to set up a quick run with that and I’ll let you know! I think I’ll also test increasing k for this setup (i.e. instead of 5, maybe test with 10 and 20…).
This is how we find improvements so thanks for the idea!
