Meet RAdam - imo the new state of the art AI optimizer

Hi @LessW2020. Thank you very much for sharing this. Could you share the ImageNette notebook where you ran the experiments? It would be very helpful for starting to hack around with it.

1 Like

If you want to get started now, you can use this script, which is the one used for the baseline:

Then in a notebook:
%run train_imagenette.py --epochs 5 --bs 64 --lr 3e-3 --mixup 0 --size 128

Should give you the baseline result.

Then you can add radam.py from the GitHub repo, import it in train_imagenette.py, and add it as an option for the opt_func.

I haven’t tried it yet, but it should work - something like the sketch below.
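Roughly along these lines - a sketch only (get_opt_func is a made-up helper name here, and the actual script may wire up opt_func differently):

from functools import partial
import torch.optim as optim
from radam import RAdam  # radam.py copied from the RAdam github next to train_imagenette.py

# hypothetical helper: pick the optimizer factory to pass as opt_func
def get_opt_func(opt='adam'):
    if opt == 'radam':
        return partial(RAdam)
    return partial(optim.Adam, betas=(0.9, 0.99))  # fastai's usual default

# then, wherever the script builds its Learner:
# learn = Learner(data, model, opt_func=get_opt_func('radam'), ...)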

2 Likes

Thanks @Seb for your help :slight_smile:

Hi Seb,
Thanks for the feedback! For reference, I ran the 5 epoch test multiple times, beating the leaderboard score multiple times, and also beat it on the 20 epoch.
Now, that said, the bigger question I think is that we really need to formulate a standard setup for the leaderboards.
i.e. what is the criterion to finalize a new high score entry, b/c it’s very vague right now…obviously you need multiple runs, but how many? 5, 10, 20? And are people really going to run 20 * 20 epochs (or whatever the required amount is) if they are paying for server time (like me)?
Also, 4 GPUs really shouldn’t be the benchmark if GPUs matter at all - in my mind it shouldn’t matter how many GPUs, but if it does…nearly everyone is running on one GPU, so I’m not sure why that’s even listed given how impractical it is for most.
Anyway, let’s start a new thread on this b/c it keeps popping up that there is no real set of rules for how to verify things.

3 Likes

Awesome, I’m really happy to hear this @sammy500 - thanks for posting your feedback!

We can use this thread (or starting a new one is fine, I guess).

In the examples the authors train neural networks from scratch; I wonder how it would behave in fine-tuning scenarios, e.g. BERT fine-tuning.

1 Like

Hi Chris,
I thought they allowed free viewings (a certain number per month) - maybe that changed for articles they recommend. Anyway, here’s a link that will let you in!

4 Likes

Thanks very much @LessW2020 - yes, they do allow a small number of free viewings, and I typically use them up pretty quickly; I’m already long past my limit this month :smile:

1 Like

Tried RAdam.
So far only on 5 epochs.
On Imagenette at size 128 there are small improvements.
On Woof, at the same size 128, the results are better.
What I found - with RAdam you can use a higher lr.
I also checked it with my custom model - it shows improvements too.
Will try it on 20 epochs later.
And, by the way - I tried the script from the examples on 1 GPU and got the same results as on the leaderboard.

1 Like

Hello, I am new to deep learning and just finished lesson 2 of the deep learning course.
I want to ask how I can test RAdam like you guys did. You provided the GitHub page, but I don’t know what to do with it. I’m sorry if my question is too basic, but I greatly appreciate your help. I really want to learn.

1 Like

Hi @minh - here’s a quick summary of how to use RAdam:
1 - Copy radam.py from the GitHub repo to your local directory, e.g. nbs/dl1 where your notebooks are.
*(Better but more complicated: git clone it to your local drive and reference that path… that way you stay in sync. But if you are working on a paid server such as Salamander, I just copy the file over.)

2 - In your notebook use:
from functools import partial  # needed for the partial below
from radam import RAdam  # make RAdam available
optar = partial(RAdam)  # create a partial function pointing to RAdam
learn = cnn_learner(data, models.xresnet50, opt_func=optar, metrics=error_rate)  # specify the new optimizer when you create your learner
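Then train as usual - e.g. (epoch count and lr here are just placeholders):
learn.fit_one_cycle(5, 3e-3)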

Now you have a learner running RAdam :slight_smile:
Hope that helps!

11 Likes

Hi @a_yasyrev,
Thanks for the testing feedback :slight_smile:
Glad to hear RAdam is helping, and thanks also for confirming that the 1 GPU results match the leaderboard’s 4 GPU results.
Good luck with the 20 epoch testing, and please post results when you can.

Tried it on a tabular model for a running Kaggle competition.
It didn’t work out for me, yet. I’m using slice(lr), pct_start=0.2 on fit_one_cycle with the default optimizer. My results got worse when I tried RAdam with those settings. When I used the default params on fit_one_cycle with RAdam, I got much better results, but they were still slightly below my original settings.
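i.e. something along these lines (epoch count and lr are placeholders for my actual values):
learn.fit_one_cycle(10, slice(1e-3), pct_start=0.2)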

If we don’t need a warmup, I wonder if it means we need to adapt the one-cycle learning rate scheduler.
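For instance, maybe something like this - just a guess at settings, on the assumption that RAdam’s rectification makes most of the scheduler’s warmup redundant:

learn.fit_one_cycle(5, max_lr=3e-3, pct_start=0.05)  # shrink the warmup phase; fastai's default pct_start is 0.3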

Also, I had one case where my result was worse after upping the learning rate. It was some transfer learning on an image dataset. A bit anecdotal, but I wonder what people have experienced in terms of results being independent of lr. It would be great not to have to worry about lr as much!

One other anecdotal result was that the learning rate finder curve was shifted to the right with RAdam.

I still want to run some tests on Imagenette with the modifications I suggested above, but that might have to wait until next week. Or maybe someone will run them before me.

1 Like

I think so - I also think the lr finder itself might need to be modified. I’m actually going to see if I can blend an autolr optimizer I’ve been working on with RAdam’s rectifier.

First results on 20 epochs.
On Woof:
lr = 1e-1 (!!!) - 0.798
lr = 1e-1 - 0.790
lr = 1e-2 - 0.806
lr = 3e-2 - 0.800
lr = 3e-2 - 0.802

Now started training on 80 epochs - it takes time )

[lr finder plot - with Adam]

[lr finder plot - with RAdam]

Had a very similar shift in lr finder on a different dataset.

I spent this morning merging Lookahead + RAdam into a single optimizer (Ranger). The merge may not have been strictly needed, since you can chain the two by passing one into the other, but I felt a single optimizer would be easier to integrate with FastAI.
Lookahead is from Hinton’s group’s paper last month, where they showed it outperforming SGD - it basically maintains a set of slow weights that periodically merges with the regular optimizer’s weights (they used Adam, but I’m using RAdam)…the analogy is a buddy system, where one climber explores while the other holds a rope to pull them back if it turns out to be a bad path.
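For anyone curious, the core slow-weights update is roughly this - my own minimal sketch of the idea, not the actual Ranger code:

import torch

def lookahead_update(fast_params, slow_params, alpha=0.5):
    # called after every k steps of the inner optimizer (RAdam in Ranger):
    # pull the slow weights toward the fast ones, then restart the fast
    # weights from the updated slow weights
    with torch.no_grad():
        for fast, slow in zip(fast_params, slow_params):
            slow += alpha * (fast - slow)  # slow weights follow the rope
            fast.copy_(slow)               # explorer restarts from the slow point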

Anyway, the first results are impressive - I got 93% for 20 epochs vs. the current 92% on the leaderboard, and I had 92.4-92.5% with RAdam alone. That’s a first run only, with a guesstimated LR, but it makes me feel confident I’m on the right track.
Training looks even more stable than with RAdam alone.
I was doing the next run and got kicked off via pre-emption yet again on Salamander, so I’ll continue with more runs later as I have other real-life things to get done…but I wanted to post this for anyone who didn’t see improvements with RAdam: hold out for Ranger, as it brings another arrow to the quiver.

11 Likes