Meet RAdam - imo the new state of the art AI optimizer

LessW2020 · August 15, 2019, 9:39pm

Hi all,
I’m very excited to introduce you to a paper by Liu, Jiang, He et al. that I think will instantly improve your AI results (seriously) vs Adam.
I tested it on ImageNette and quickly got new high accuracy scores on the 5- and 20-epoch 128px leaderboards, so I know it works.
The simple summary - the authors investigated why adaptive optimizers need a warmup and found it was a result of excessive variance of the adaptive learning rate in the early stages of training. They then developed a dynamic rectification term that adjusts the adaptive learning rate based on that variance, and show that with this, RAdam readily outperforms vanilla Adam with or without a warmup while providing much greater robustness to varying learning rates.
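To make the summary above a bit more concrete, here is a minimal sketch of the rectification term as I read it in the paper. The function name `radam_rectification` is my own, and the threshold of 4 follows the paper's algorithm box; a given implementation may use a slightly different cutoff. It only computes the scaling factor, not a full optimizer step.

```python
import math

def radam_rectification(step, beta2=0.999):
    """Return the rectification factor r_t for a 1-based step, or None
    while the variance estimate is considered intractable (early steps)."""
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Too few samples: fall back to an un-adapted (momentum-only) update.
        return None
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )

if __name__ == "__main__":
    for step in (1, 4, 5, 100, 10000):
        print(step, radam_rectification(step))
```

With the default `beta2`, the factor is undefined for the first few steps, starts very small, and approaches 1 as training goes on, which is what lets it act like a built-in warmup.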
This quick image from their paper should get your interest:

I’ve written a complete summary and overview, along with an example of using it in FastAI, here:

Here’s the link to the paper directly and, more importantly, to their github with the unassuming one-word readme (“RAdam”). I would recommend jumping directly to their github and trying it out, as it really appears to be the new state of the art for optimizers:

Paper:

I have to say, after having tested a lot of papers this year, it seems that most over-promise and under-deliver on unseen datasets.
However, it appears RAdam delivers: I jumped to a new high accuracy relative to the ImageNette leaderboard with minimal work other than plugging it in, and I’ve run a lot of tests trying to beat that leaderboard with various papers before.
Enjoy and please post any results if you test it out!

Seb · August 16, 2019, 12:15am

Thanks for sharing this with us.

I would be careful when testing on Imagenette/Imagewoof.

There is a lot of variance from run to run (especially after 5 epochs only), so I would run things more than once.
Ideally, I’d like to see mean accuracy, variance, sample size, and a good p value. I’m a bit rusty in stats, but I think you can use this calculator: https://www.medcalc.org/calc/comparison_of_means.php
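As a rough sketch of that comparison, the snippet below computes the mean, sample variance, and Welch's t statistic for two sets of runs using only the standard library. Welch's unequal-variance test is my choice here and may differ from what the calculator uses. The accuracy lists are hypothetical placeholders, not real measurements; substitute your own per-run results.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df

if __name__ == "__main__":
    # Hypothetical placeholder accuracies, NOT real results.
    baseline_runs = [0.842, 0.851, 0.838, 0.847, 0.845]
    candidate_runs = [0.853, 0.849, 0.858, 0.851, 0.856]
    t, df = welch_t(candidate_runs, baseline_runs)
    print("t = %.3f, df = %.1f" % (t, df))
```

The t value and degrees of freedom can then be looked up in a t table to get a p value. If SciPy is installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` does the same test and returns the p value directly.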

If differences between accuracy means are too small, you could look at validation loss instead (less variance) if increasing sample size is too costly.

I would rerun the baseline (again, multiple runs) yourself, on your own machine, the same way you are running your new optimizer or model.
Jeremy’s baseline, for example, is run on 4 GPUs, which affects things (e.g. learning rate is modified IIRC). I don’t remember the details, but I remember that I was beating the baselines on Imagewoof just by rerunning them on 1 GPU.

Edit to add: Also, you are running with lr = 3e-2 when the original lr was 3e-3.

sammy500 · August 16, 2019, 12:34am

Wow, thanks for sharing!! I implemented it in a tabular model and an NLP model and saw immediate results.

Chris_Palmer · August 16, 2019, 5:14am

Thanks for sharing @LessW2020 - I would love to read your Medium article but as I don’t have a subscription it’s locked away from me. Is it possible to make it available as a PDF?

minh · August 16, 2019, 7:22am

Interesting article @LessW2020. Thanks for sharing.

Sayak · August 16, 2019, 1:27pm

Hi @LessW2020. Thank you very much for sharing this. Could you share your ImageNette notebook where you conducted the experiments? It would be very helpful to start hacking around it.

Seb · August 16, 2019, 1:49pm

If you want to get started now, you can use this script which is the one used for the baseline:

https://github.com/fastai/fastai/blob/master/examples/train_imagenette.py

from fastai.script import *
from fastai.vision import *
from fastai.callbacks import *
from fastai.distributed import *
from fastprogress import fastprogress
from torchvision.models import *
from fastai.vision.models.xresnet import *
from fastai.vision.models.xresnet2 import *
from fastai.vision.models.presnet import *

torch.backends.cudnn.benchmark = True
fastprogress.MAX_COLS = 80

def get_data(size, woof, bs, workers=None):
    if   size<=128: path = URLs.IMAGEWOOF_160 if woof else URLs.IMAGENETTE_160
    elif size<=224: path = URLs.IMAGEWOOF_320 if woof else URLs.IMAGENETTE_320
    else          : path = URLs.IMAGEWOOF     if woof else URLs.IMAGENETTE
    path = untar_data(path)

    n_gpus = num_distrib() or 1

This file has been truncated.

Then in a notebook:
%run train_imagenette.py --epochs 5 --bs 64 --lr 3e-3 --mixup 0 --size 128

Should give you the baseline result.

Then you can add radam.py from the github, import it in train_imagenette and add it as an option for the opt_func.

I haven’t tried it yet, but it should work.

Sayak · August 16, 2019, 2:01pm

Thanks @Seb for your help

LessW2020 · August 16, 2019, 3:06pm

Hi Seb,
Thanks for the feedback! For reference, I ran the 5-epoch test multiple times and beat the leaderboard score on multiple runs, and also beat it on the 20-epoch test.
Now, that said, the bigger question I think is that we really need to formulate a standard setup for the leaderboards.
i.e. what is the criterion to finalize a new high score entry, b/c it’s very vague right now… obviously you need multiple runs, but how many? 5, 10, 20? And are people really going to run 20 * 20 epochs (or whatever the required amount is) if they are paying for server time (like me)?
Also, 4 GPUs really shouldn’t be the benchmark if the number of GPUs matters at all - in my mind it should not matter how many GPUs you use, but if it does… nearly everyone is running on one GPU, so I’m not sure why that’s even listed given how impractical it is for most.
Anyway, let’s start a new thread on this b/c it keeps popping up that there is no real set of rules for how to verify things.
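On the "how many runs" question, one rough way to think about it is how the uncertainty on the mean accuracy shrinks as runs are added. The sketch below uses the usual standard-error formula with a normal approximation (z = 1.96), which is a simplification for small run counts. The run-to-run standard deviation is a hypothetical placeholder, not a measured value.

```python
import math

def ci_half_width(run_std, n_runs, z=1.96):
    """Approximate half-width of a 95% confidence interval on the mean."""
    return z * run_std / math.sqrt(n_runs)

if __name__ == "__main__":
    assumed_std = 0.005  # hypothetical run-to-run std of accuracy, NOT measured
    for n_runs in (5, 10, 20):
        print(n_runs, "runs ->", ci_half_width(assumed_std, n_runs))
```

Because the width scales with 1/sqrt(n), going from 5 to 20 runs only halves the uncertainty, so the required number of runs depends on how small a difference you are trying to claim.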

LessW2020 · August 16, 2019, 3:08pm

Awesome, I’m really happy to hear this @sammy500 - thanks for posting your feedback!

Seb · August 16, 2019, 3:11pm

We can use this thread (or starting a new one is fine, I guess).

noisefield · August 16, 2019, 7:45pm

In the examples the authors train neural networks from scratch; I wonder how it will behave in fine-tuning scenarios, e.g. with BERT fine-tuning.

LessW2020 · August 17, 2019, 2:51am

Hi Chris,
I thought they allowed free viewings (certain amount per month), maybe that changed for articles they recommend - anyway, here’s a link that will let you in!

Chris_Palmer · August 17, 2019, 3:37am

Thanks very much @LessW2020 - yes they do allow a small number of free viewings, and typically I use them up pretty quickly, already long past my limit this month

a_yasyrev · August 17, 2019, 12:45pm

Tried RAdam.
So far only on 5 epochs.
On Imagenette at size 128 there are small improvements.
On Imagewoof, also at size 128, the results are better.
What I found: with RAdam you can use a higher lr.
I also checked it with my custom model, and it improved too.
Will try it on 20 epochs later.
And, by the way, I tried the script from examples on 1 GPU and got the same results as on the leaderboard.

minh · August 17, 2019, 2:11pm

Hello I am new to deep learning and I just finished lesson 2 of the deep learning course.
I want to ask how can I test RAdam like you guys did? You provided their GitHub page but I don’t know what to do with that. I’m sorry if my question is too basic, but I greatly appreciate your help. I really want to learn.

LessW2020 · August 17, 2019, 4:48pm

Hi @minh - here’s a quick summary of how to use RAdam:
1 - copy radam.py from the github to your local directory, e.g. nbs/dl1 where your notebooks are.
(Better but more complicated is to git clone it to your local drive and then reference the path… this way you stay in sync, but if you are working on a paid server such as Salamander, I just copy the file over.)

2 - In your notebook use:
from functools import partial  # standard-library helper for pre-binding arguments
from radam import RAdam  # make RAdam available
optar = partial(RAdam)  # create a partial function that points to RAdam
# when you create your learner, specify the new optimizer:
learn = cnn_learner(data, models.xresnet50, opt_func=optar, metrics=error_rate)

Now you have a learner running RAdam.
Hope that helps!
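On why a partial function is used in step 2: as I understand it, the learner builds the optimizer itself later on, so it needs a callable rather than a ready-made optimizer instance. Here is a minimal sketch of that pattern using `functools.partial`. `StandInOptimizer` is a hypothetical stand-in so the snippet runs without fastai or radam.py, and the `betas` value is only an illustration.

```python
from functools import partial

class StandInOptimizer:
    """Hypothetical stand-in for an optimizer class such as RAdam."""
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999)):
        self.params = list(params)
        self.lr = lr
        self.betas = betas

# Pre-bind the hyperparameters you want to fix up front.
opt_func = partial(StandInOptimizer, betas=(0.9, 0.99))

# Later, the training code supplies the parameters and the learning rate.
opt = opt_func([0.0, 1.0], lr=3e-3)
print(opt.lr, opt.betas)
```

Calling `partial(RAdam)` with no extra arguments just wraps the class unchanged, but the same mechanism lets you pass optimizer settings through `opt_func`.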

LessW2020 · August 17, 2019, 4:50pm

Hi @a_yasyrev,
Thanks for the testing feedback.
Glad to hear RAdam is helping, and thanks also for confirming that 1 GPU results are the same as the leaderboard’s 4 GPU results.
Good luck with 20 epoch testing and please post results when you can.

joatom · August 17, 2019, 8:48pm

Tried it on a tabular model for a running kaggle competition.
Didn’t work out for me, yet. I’m using slice(lr), pct_start=0.2 on fit_one_cycle with the default optimizer. My results got worse when I tried RAdam with those settings. When I used the default params on fit_one_cycle with RAdam I got much better results, but they were still slightly below what I get with my original settings.

Seb · August 17, 2019, 10:40pm

If we don’t need a warmup, I wonder if it means we need to adapt the one cycle learning rate scheduler.
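To make the warmup question concrete, here is a minimal sketch of a one-cycle style schedule with cosine annealing, where `pct_start` sets the fraction of training spent warming up. The function names are my own, and the defaults (`pct_start=0.3`, `div_factor=25`, final divisor of `div_factor * 1e4`) are my recollection of fastai v1; treat them as assumptions rather than the library's exact behaviour.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.3,
                 div_factor=25.0, final_div=None):
    """Learning rate at a 0-based step of a one-cycle style schedule."""
    if final_div is None:
        final_div = div_factor * 1e4
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # Warmup phase: rise from lr_max / div_factor up to lr_max.
        return cos_anneal(lr_max / div_factor, lr_max, step / warmup_steps)
    # Annealing phase: decay from lr_max down to lr_max / final_div.
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return cos_anneal(lr_max, lr_max / final_div, pct)

if __name__ == "__main__":
    for pct_start in (0.3, 0.0):
        lrs = [one_cycle_lr(s, 100, 1e-2, pct_start=pct_start) for s in (0, 30, 99)]
        print(pct_start, lrs)
```

With `pct_start=0.0` the schedule starts at the maximum learning rate and only anneals, which is one way to test whether the warmup phase still helps when RAdam is used.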

Also, I had one case where my result was worse after upping the learning rate. It was some transfer learning on an image dataset. A bit anecdotal, but I wonder what people experienced in terms of results being independent from lr. It would be great to not have to worry about lr as much!

One other anecdotal result was that the learning rate finder curve was shifted to the right with RAdam.

I still want to run some tests on Imagenette with the modifications I suggested above, but that might have to wait until next week. Or maybe someone will run it before me.