Hi all,
A new paper came out last week on Gradient Centralization. I integrated and tested it inside of Ranger today and set new accuracy records with it, so IMO GC delivers and is worth checking out.
What is Gradient Centralization? From the paper: “GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable.”
It basically re-centers the gradient to zero mean, which stabilizes optimization much like BN re-centers the mean of activations.
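For concreteness, here is a minimal NumPy sketch of that re-centering step (the function name `centralize_gradient` is mine, not from the paper's code; for a conv weight of shape `(out, in, kH, kW)` the mean is taken per output filter, i.e. over all axes but the first):

```python
import numpy as np

def centralize_gradient(grad):
    """Subtract the per-filter mean from a weight gradient.

    Applied only to tensors with more than one dimension (conv/FC
    weights); bias vectors are left untouched, as in the GC paper.
    """
    if grad.ndim > 1:
        axes = tuple(range(1, grad.ndim))
        grad = grad - grad.mean(axis=axes, keepdims=True)
    return grad

# Each filter's centralized gradient now has (numerically) zero mean.
g = np.random.randn(8, 3, 3, 3)          # fake conv-weight gradient
print(np.allclose(centralize_gradient(g).mean(axis=(1, 2, 3)), 0.0))  # True
```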
Source and paper are here:
I’ve updated Ranger to use it by default and exposed options so you can turn it on or off (to compare), as well as control whether GC is applied only to conv layers or to both conv and FC layers (recommended).
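As a sketch of what those switches do (the argument names `use_gc` and `gc_conv_only` here are illustrative; check the Ranger source for the real ones), a plain SGD step with optional GC might look like:

```python
import numpy as np

def sgd_step_with_gc(param, grad, lr=0.1, use_gc=True, gc_conv_only=False):
    """One plain-SGD update with optional Gradient Centralization.

    `use_gc` toggles GC entirely; `gc_conv_only` restricts it to conv
    weights (4-D tensors), skipping 2-D FC weights.
    """
    is_conv = grad.ndim > 3                     # (out, in, kH, kW)
    apply_gc = use_gc and (is_conv or not gc_conv_only)
    if apply_gc and grad.ndim > 1:
        grad = grad - grad.mean(axis=tuple(range(1, grad.ndim)), keepdims=True)
    return param - lr * grad
```

With an all-ones gradient on an FC weight, GC zeroes the (fully mean-valued) gradient, so the parameter is unchanged; with `gc_conv_only=True` the FC weight is updated as usual.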
The GC researchers’ GitHub is here, and it also has SGD-GC and Adam-GC:
Per the GC paper, GC not only improves training results but also appears to help with generalization. Anyway, I was really happy to see a paper deliver results when applied to larger datasets. The authors state that it helps reduce NaN errors while training as well, which is an additional benefit.
The intuition that I’m forming (and it may be completely wrong) is that this normalization ensures no single batch has the ability to change the weights by much…
Did some tests.
Added the new Ranger to my Imagewoof pipeline.
On my best version (with SA, Mish, and MaxBlurPool), results were worse than the old version (no GC).
I decided to check without MaxBlurPool. The results are different.
At size 192, 5-epoch results are the same; at 20 epochs, 86.1% vs 86.3% with GC True.
My personal intuition on GC (taken from the paper) is that it can be reinterpreted as a constraint on the weight space, making the optimization problem simpler and adding regularization (the proofs are in the paper).
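That constrained view is easy to check numerically: subtracting the mean projects the gradient onto the hyperplane orthogonal to the all-ones vector, so each SGD step preserves the mean of the weights (a small demo, not the paper's code):

```python
import numpy as np

# GC as projected gradient descent: subtracting the mean projects the
# gradient onto the hyperplane orthogonal to the all-ones vector e,
# so an SGD step leaves the mean of each filter's weights unchanged.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)        # one "filter", flattened
g = rng.standard_normal(10)
g_c = g - g.mean()                 # centralized gradient
e = np.ones(10)

print(np.isclose(g_c @ e, 0.0))            # True: orthogonal to e
w_new = w - 0.1 * g_c
print(np.isclose(w_new.mean(), w.mean()))  # True: weight mean preserved
```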
After a first short test (only 5 epochs and a couple of runs on each variant), I decided to do more tests.
Here are the first results, only 5 epochs. After the 20-epoch runs end, I will upload the notebooks.
Used size 192, 5 runs. Comparing Ranger with GC vs. without GC vs. the OLD version of Ranger that I used in my previous experiments.
I can’t run the new version on fastai2, so I used v1.
Xresnet50 from fastai v1:
OLD - 61.67% std 0.0112.
GC True - 63.34% std 0.0025.
GC False - 63.69% std 0.0172.
Xresnet + Mish + SA:
OLD - 69.49% std 0.0060.
GC True - 70.56% std 0.0044.
GC False - 70.11% std 0.0094.
Xresnet + Mish + SA + my resnet trick (still no name):
OLD - 70.98% std 0.0080.
GC True - 70.28% std 0.0073.
GC False - 71.18% std 0.0157.
Did a short test on woof.
Changed Ranger to RangerAdaBelief.
One epoch became slower (0:58 to 1:10) on the same machine.
Tried only 5-epoch runs.
Got 69.50% vs 68.00% on the same model (size 192).
Best result with eps = 1e-12, not 1e-8 as mentioned in their GitHub.
Will run longer tests…
Did long runs (20 and 80 epochs).
Correction – speed is the same with and without weight_decouple.
And it’s close to “default” Ranger: in my case 1:07 (Ranger) vs 1:10 (the new one).
On 20 epochs (3 runs) – Ranger vs RangerAdaBelief vs RangerAdaBelief with weight_decouple=True:
87.03% std 0.0024 – 87.00% std 0.0023 – 86.54% std 0.0048
On 80-epoch runs (2 runs):
89.30% std 0.0009 – 89.58% std 0.0001 – 89.59% (both runs gave exactly the SAME result!)
Of course, this is just a short test. There are a lot of hyperparameters to tweak…
I used eps = 1e-12 and betas = (0.9, 0.999).
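For reference, a bias-corrected AdaBelief step looks roughly like this (a sketch from the AdaBelief paper, not the RangerAdaBelief source; note that eps enters both inside the variance estimate and in the denominator, which is part of why a tiny value like 1e-12 behaves differently than in Adam):

```python
import numpy as np

def adabelief_step(param, grad, m, s, t, lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-12):
    """One bias-corrected AdaBelief update.

    Unlike Adam, the second moment tracks (grad - m)^2 -- the "belief"
    in the current gradient direction. eps is added inside s and to the
    denominator, per the paper's algorithm.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
    m_hat = m / (1 - b1 ** t)                  # bias correction
    s_hat = s / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(s_hat) + eps)
    return param, m, s
```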
I have a question about switching from SGD to Ranger: do we need to re-adjust the learning rate if we swap SGD directly for Ranger? And since Adam does not work well with weight decay (which is why AdamW was proposed), does Ranger handle weight decay well? I noticed that Ranger beats SGD on the 20-epoch training scheme; is there any benchmark on more epochs and larger datasets such as ImageNet?