Meet RAdam - IMO the new state-of-the-art AI optimizer

Thanks for the kind words @Ajaxon6255!
Please give Ranger a try and see if the addition of LookAhead improves your results!

Just a tip: you can read Medium articles without the limit if you don't log in :slight_smile:


I wrote a post where I explain why it looks like the current baseline on the leaderboard is lower than it should be. It wasn't quite about the number of GPUs after all.


Yes, I found that too.
I had forgotten about it. When I tried the default baseline, I used the real lr - I changed that line. :grinning:

So did you get the same results as on the leaderboard even after correcting the learning rate? I didn't. Did you run it more than once?

I've run some initial tests with RAdam vs Adam on Imagenette.

  1. I first reran the Adam baseline because of the learning-rate glitch that I detailed elsewhere [1]

%run train_imagenette.py --epochs 5 --bs 64 --lr 12e-3 --mixup 0

(note that the effective learning rate is 3e-3; see the sketch after this list)

Results (10 runs): [86,85.4,84.4,85.2,84.8,85,85.6,85.4,85.4,86.4]
Mean: 85.3%

  2. RAdam, learning rate 3e-2 (as in the OP's article)

%run train_imagenette.py --epochs 5 --bs 64 --lr 12e-2 --mixup 0 --opt 'radam'
(effective lr = 3e-2)

Results: [85.2,85.2,83.6,85.4,85.6, 86, 84.6, 84.6,82.8,85.4]
Mean: 84.8%

  3. I tried RAdam with wd = 1e-3 instead of 1e-2 and cout=1000, to be closer to the OP's parameters (based on the Medium article).
    Note that I normally specify cout=10, but the OP didn't. I don't think it makes a difference (though the losses are higher), but you never know.

Results: [84.2, 86.2, 83.2, 83.8,85.2, 85.2, 84.4,82.6,84.8,84.4]
Mean: 84.4%
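
For reference on the effective-lr arithmetic above, here is a minimal sketch of the scaling I'm assuming train_imagenette.py applies - linear scaling of the command-line lr by bs/256, inferred from the numbers in this thread rather than quoted from the script:

def effective_lr(cli_lr, bs, base_bs=256):
    # assumed scaling: the lr passed on the command line is multiplied by bs/base_bs
    return cli_lr * bs / base_bs

print(effective_lr(12e-3, 64))   # 0.003   -> run 1 (Adam baseline)
print(effective_lr(12e-2, 64))   # 0.03    -> run 2 (RAdam)
print(effective_lr(3e-3, 64))    # 0.00075 -> the unintended baseline lr discussed in [1]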

Conclusion: I haven't been able to get better results with RAdam than with Adam on Imagenette (128px) so far. I can hit 86% on some of my best runs, but on average RAdam does worse than Adam when running for 5 epochs.

[1] ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Thanks for the testing Seb!

If you have time and aren’t paying for GPU, can you test with Ranger (RAdam + LookAhead) and 20 epochs?
Ranger/RAdam usually start a bit slower, so I think the 20-epoch run is the more interesting/definitive test overall. I got higher than the leaderboard with both, but now it's unclear where the leaderboard really stands with the whole GPU mix-up.

I've been trying to do my testing on Salamander since Sunday, but every day for the past 3 days I keep getting pre-empted 10-30 minutes in, just as I'm getting the first runs going, and I'm tired of wasting money like that.


I wrote my Ranger results in your other thread.

(I switch to vastai when I want a break from Salamander and vice versa)

Yes, I tried different learning rates, several times; the results were close.
There should be logs - I will look at them again…
On my home box I train without fp16.
I also tried it on Colab with fp16 - same results.
Will check it…

Hi @LessW2020,

Thank you for reporting back the results of your experiment! Glad to see that this summer has brought so much progress in optimizer design :slight_smile:

On my side, I haven't yet been able to get improvements using RAdam alone compared to Adam (I used a personal reimplementation of lr_find for both, so I don't think what I'm seeing is just sensitivity to the learning rate). But I'm going to give Lookahead coupled with RAdam a try.

Besides, I was wondering whether you had a specific idea in mind for combining RAdam and Lookahead?
From my understanding, since Lookahead is just a wrapper class around another optimizer, I don't see the need (at least in this implementation) to combine both in a single class. Something like:

# assuming the RAdam and Lookahead classes are importable from their implementations
optimizer = RAdam(model_params)
if lookahead_wrapper:
    # Lookahead wraps the inner optimizer and syncs its slow weights every k steps
    optimizer = Lookahead(optimizer, alpha=0.5, k=6)

would do just fine. But perhaps you have an idea for future improvements when combining both?

Thanks in advance!

Hi @fgfm,
You are correct - Lookahead is a wrapper and will wrap any optimizer underneath (I actually show this in my article :slight_smile:)

The reasons I integrated Lookahead and RAdam into a single class (Ranger) were:
1 - Ease of use / FastAI integration - to make it easy to plug into FastAI (I didn't want the indirection of an lr pass-through, since FastAI changes the lr extensively via fit_one_cycle) and I wasn't sure how well that would be handled otherwise. Similarly, based on comments on my RAdam article, it was clear that people just wanted a simple plugin to test with (see the sketch after this list).
2 - Future work - I'm planning to test automated lr-seeking code that adjusts based on the loss surface, and that wouldn't be easy with the wrapper setup. One class makes it a lot easier to code and test.
3 - Minimal downside - from a theory standpoint, I'm hard pressed to explain why one would not want to use RAdam and Lookahead given their advances (i.e. it's very hard to see how they would perform worse than, say, plain Adam, and in most cases they should perform better), so why not put them together in one class.
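
For example, here's roughly how Ranger drops into a fastai v1 Learner as a single opt_func (just a sketch - the ranger import is whatever implementation you're using, and I'm assuming fastai's built-in Imagenette download; your data setup will differ):

from functools import partial
from fastai.vision import *           # fastai v1, as used by the imagenette scripts

from ranger import Ranger             # assumed module/class name for the bundled optimizer

path = untar_data(URLs.IMAGENETTE_160)
data = ImageDataBunch.from_folder(path, valid='val', size=128, bs=64).normalize(imagenet_stats)

# a single optimizer object underneath, so fit_one_cycle's lr schedule hits it directly
learn = cnn_learner(data, models.resnet50, opt_func=partial(Ranger))
learn.fit_one_cycle(5, max_lr=5e-2)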

Anyway, that's why I bundled them together into Ranger. As I get time/money to develop and test, I'll integrate the lr seeker as well and see how that performs.
You can definitely use Lookahead as a wrapper and test out other underlying optimizers - if you find a better combo than RAdam + Lookahead, please post about it!
Best regards,
Less

It would be very interesting if you folks could also try Novograd:


Wow, thanks for the pointer to this paper. This looks very compelling as I like what they are doing with the gradient normalization.
At the risk of turning into a walking optimizer since that's all I've been working on lately, I've got it running now - fingers crossed I don't get pre-empted, I should have some results in the next hour:
edit - 93.4% on the first 20-epoch run… so that's quite competitive. Trying a range of learning rates now to get a better feel for it.
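
For anyone curious, here is a rough sketch of the per-layer update as I read the paper - my own paraphrase in PyTorch-style Python, not the authors' reference code, and the hyperparameter values are just placeholders:

import torch

def novograd_step(w, g, state, lr=1e-2, beta1=0.95, beta2=0.98, eps=1e-8, wd=1e-3):
    # Novograd-style update for a single weight tensor (one "layer"):
    # the second moment v is a scalar per layer (the squared gradient norm),
    # unlike Adam's per-element v, and weight decay is added to the
    # normalized gradient before it enters the first moment.
    g_norm_sq = g.pow(2).sum()
    if "v" not in state:                                   # first step: initialize moments
        state["v"] = g_norm_sq
        state["m"] = g / (g_norm_sq.sqrt() + eps) + wd * w
    else:
        state["v"] = beta2 * state["v"] + (1 - beta2) * g_norm_sq
        state["m"] = beta1 * state["m"] + (g / (state["v"].sqrt() + eps) + wd * w)
    return w - lr * state["m"]

# toy usage: one tensor, one step
w, g, state = torch.randn(4, 3), torch.randn(4, 3), {}
w = novograd_step(w, g, state)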


Thanks to everyone for the work testing these new optimizers. I have been switching among Adam, RAdam, and Novograd, with and without the Lookahead wrapper. It's all anecdotal, but I can't see any significant difference in their rates of convergence or asymptotic losses. The only standout conclusion is that RAdam's loss graph is smoother.

One observation: I see a large variation in loss over the same training period, about ±15%, depending on the random initialization of the model. So it will be important to average many runs to assess the relative performance of these optimizers.
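
To make that concrete, here's a quick sketch using the two sets of Imagenette accuracies posted above - just the mean and sample standard deviation per optimizer:

import statistics

runs = {
    "adam":  [86, 85.4, 84.4, 85.2, 84.8, 85, 85.6, 85.4, 85.4, 86.4],    # baseline runs above
    "radam": [85.2, 85.2, 83.6, 85.4, 85.6, 86, 84.6, 84.6, 82.8, 85.4],  # RAdam runs above
}
for name, accs in runs.items():
    print(f"{name}: mean={statistics.mean(accs):.2f}  std={statistics.stdev(accs):.2f}")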

My (hobby) test case is time series of 5000 stock prices x 16 symbols, trained with 20000 passes at lr=.001 followed by 20000 at lr=.0001. No RNN, all “parallel” operations.

I’ll keep testing, and hope we can figure out the useful domains of these new optimizers.


I got about 2.5 hours of testing in before getting pre-empted.
Summary so far is Novograd produces nice curves and ends up about the same place as Ranger (RAdam+Lookahead).
I tested Novograd + Lookahead and the training got more erratic.

One takeaway so far is that both Ranger and Novograd seem to do better with higher learning rates than Adam. 5e-2 or so seems to be a good spot for them, at least on ImageNette.

I’ll try tomorrow on ImageWoof and see if a harder dataset pulls out more differences.


I'll have to start a new thread, but Novograd is doing great on ImageWoof. It seems a harder dataset is what's needed to really show the differences between the optimizers.
I just got pre-empted again, but here are the initial results. I'm testing a few other learning rates if I can get back on, and will then run it 5 times and hopefully submit a new leaderboard score:

I ran it for 2 extra epochs (which don't count), mostly to see if it would keep going, and it appears to still be improving.

When you are doing a direct comparison, how do you determine the learning rate for each of Adam, RAdam, and Ranger?

I'm testing them now with the same hyperparameters.
I'm trying to fix everything except the optimizer, to remove the possibility that I just picked a bad lr for a particular optimizer, but I'm now wondering whether that's the wrong approach.

Interesting result so far!

Before concluding that Novograd is better than Adam in this case, you might want to rerun the baseline with Jeremy's intended effective learning rate (as I described elsewhere [1], the baseline was run at lr = 0.75e-3 rather than the intended lr = 3e-3 because of an oversight in the code).

I don't have data for 128px, but on ImageWoof 256px I reworked the baseline to 83.9% rather than 81.8%
(the 85.7% is my own leaderboard entry, but I don't like it too much because it runs slower).

Also, I suggest running on vast.ai when Salamander is not cooperating. It's cheaper and faster with no pre-emption, but you usually "lose" your machine and files if you stop, and machines are not always available.

[1] ImageNette/Woof Leaderboards - guidelines for proving new high scores?

Hi @Johnny,
Unfortunately you can’t run all the different optimizers under the same lr or you will hold most of them back.
So far all of the new ones work better with a higher lr than Adam - in general roughly an order of magnitude higher (5e-2, for example, vs 3e-3 for Adam).
Unfortunately, finding a good lr is a bit of trial and error - I just run the lr finder, pick a spot from that, and test. Then I'll run ±10x from there (e.g. 5e-1, 5e-2, 5e-3) as a bounds test, after which I have a pretty good idea of what works.
With Ranger, the other wildcard is the k parameter - 5 or 6 is the default, but I believe they also used 20 at one point in the paper, so running with k = 10 or 20 as a test would be a good thing to do.
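
In fastai v1 terms the procedure looks roughly like this (a sketch only - make_learner is a hypothetical helper that rebuilds the data/model/Learner from scratch, and the k keyword is assumed from the Ranger implementation):

from functools import partial
from ranger import Ranger              # assumed import, as earlier in the thread

learn = make_learner(opt_func=partial(Ranger))   # make_learner: hypothetical helper
learn.lr_find()
learn.recorder.plot()                  # pick a value from the plot, e.g. ~5e-2

# bounds test: +/- 10x around the picked value, with a fresh learner each run
for lr in (5e-1, 5e-2, 5e-3):
    learn = make_learner(opt_func=partial(Ranger))
    learn.fit_one_cycle(5, max_lr=lr)

# the Lookahead k is the other knob: default 5/6, also worth trying 10 or 20
for k in (10, 20):
    learn = make_learner(opt_func=partial(Ranger, k=k))   # k keyword assumed
    learn.fit_one_cycle(5, max_lr=5e-2)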
Please post any results you get!
Best regards,
Less


Thanks @LessW2020!

Unfortunately you can’t run all the different optimizers under the same lr or you will hold most of them back.

That's also my suspicion, as my results right now are such that Adam > RAdam > Ranger,
using the default k=5 for Ranger.

I’m training the Yelp Review Polarity binary classification task.

Adam: accuracy 0.973587
RAdam: accuracy 0.972625
RAdam + Lookahead: accuracy 0.971488

I will try to increase the learning rate for RAdam and Ranger to see if it changes anything.
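
For reference, the three configurations I'm comparing look roughly like this (a sketch - the radam/lookahead module names are assumptions about whichever implementation you use, and the lr values are placeholders):

import torch
from torch import nn

from radam import RAdam             # assumed module name
from lookahead import Lookahead     # assumed module name

model = nn.Linear(768, 2)           # stand-in for the Yelp polarity classifier head

optimizers = {
    "adam":   torch.optim.Adam(model.parameters(), lr=1e-3),
    "radam":  RAdam(model.parameters(), lr=1e-3),
    # "Ranger" here = RAdam wrapped in Lookahead with the default k=5
    "ranger": Lookahead(RAdam(model.parameters(), lr=1e-3), alpha=0.5, k=5),
}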
