Gradient Centralization, Ranger optimizer updated with it

Hi all,
A new paper came out last week on Gradient Centralization (GC). I integrated and tested it inside Ranger today and got new accuracy records with it, so imo GC delivers and is worth checking out.

What is Gradient Centralization? Quoting the paper: “GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable.”
It basically sets the gradient mean to zero and thus stabilizes optimization, much like how batch norm (BN) resets the mean for activations.
Source and paper are here:

I’ve updated Ranger to use it by default and have exposed options so you can turn it on or off (to compare), as well as choose between applying GC to conv layers only or to all conv and fc layers (recommended).
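
For anyone who wants to see the operation itself, here is a minimal PyTorch sketch of the centralization step with a conv-only switch. The names `centralize_gradient` and `conv_only` are made up for this illustration and are not Ranger's API; the sketch also assumes conv weights are 4-D and fc weights are 2-D, as in standard PyTorch layers.

```python
import torch


def centralize_gradient(grad: torch.Tensor, conv_only: bool = False) -> torch.Tensor:
    """Subtract the mean over all dims except the first (hypothetical helper)."""
    # ASSUMPTION: conv weights are (out, in, kh, kw), fc weights are (out, in).
    min_dims = 4 if conv_only else 2
    if grad.dim() < min_dims:
        return grad  # biases and other low-rank params are left untouched
    reduce_dims = tuple(range(1, grad.dim()))
    return grad - grad.mean(dim=reduce_dims, keepdim=True)


torch.manual_seed(0)
conv_grad = torch.randn(8, 3, 3, 3)  # stand-in for a conv weight gradient
fc_grad = torch.randn(10, 16)        # stand-in for an fc weight gradient
bias_grad = torch.randn(10)          # stand-in for a bias gradient

conv_gc = centralize_gradient(conv_grad)
fc_gc = centralize_gradient(fc_grad)
fc_skipped = centralize_gradient(fc_grad, conv_only=True)
bias_gc = centralize_gradient(bias_grad)
```

After this, each output filter (or output unit) has a gradient whose entries average to zero, while 1-D parameters pass through unchanged.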

The new Ranger version is here:

The GC researchers' GitHub is here and also has SGD-gc and Adam-gc:

Per the GC paper, GC not only improves training results but also appears to help with generalization. Anyway, I was really happy to see a paper deliver results when applied to larger datasets. The authors state that it helps reduce NaN errors while training as well, so that is an additional benefit.

Did you test this on ImageWoof or anything? I don’t see any baselines (besides the papers you mentioned) :slight_smile: otherwise great work!

As the paper suggests, it requires only one line of code to add GC to any optimizer, so it should be extremely easy to add it to fastai.

Is this line the only thing necessary?

d_p.add_(-d_p.mean(dim = tuple(range(1,len(list(d_p.size())))), keepdim = True))

I’m currently reading the paper but I’m failing to form an intuition on why this works. @LessW2020 can you give an “Explain like I’m five” explanation? :sweat_smile:

In the paper they say: we only need to compute the mean of the column vectors of the weight matrix, and then remove the mean from each column vector.

Does this mean we’re normalizing the weights that have the same color in the following pic? (for a FC layer)
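
One way to check which weights are grouped together is to run the one-liner above on a real `nn.Linear` gradient. This is only a sketch under my reading of the paper, where each "column vector" holds the incoming weights of one output unit; PyTorch stores `weight` as `(out_features, in_features)`, so that would correspond to a row of `layer.weight`. The layer sizes and the dummy loss are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=4, out_features=3)
x = torch.randn(5, 4)
layer(x).pow(2).sum().backward()  # dummy loss, only to obtain a gradient

d_p = layer.weight.grad.clone()   # shape (out_features, in_features) = (3, 4)
# The one-liner from the thread, applied in place:
d_p.add_(-d_p.mean(dim=tuple(range(1, len(list(d_p.size())))), keepdim=True))

per_output_mean = d_p.mean(dim=1)  # one value per output unit
per_input_mean = d_p.mean(dim=0)   # one value per input feature
print(per_output_mean, per_input_mean)
```

The mean is removed along dim 1, so it is the weights feeding a single output unit that get centralized together, not the weights leaving a single input.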

The intuition that I’m forming (and it can be completely wrong), is that this normalization makes sure no batch has the ability to change the weights by much…

Did some tests.
Added the new Ranger to my woof pipeline.
On my best version (with SA, Mish, MaxBlurPool) the results are worse than with the old version (no GC).
I decided to check without MaxBlurPool. It's different.
At size 192, the 5-epoch results are the same; at 20 epochs it is 86.1 vs 86.3 with GC True.

My personal intuition on GC (taken from the paper) is that it can be reinterpreted as a constraint on the weight space, making the optimization problem simpler and adding a regularization (the proofs supporting that view are in the paper).
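
The constraint view can be seen in a toy example: if every gradient has zero mean per output unit, a plain SGD step cannot change the sum of that unit's weights. The sketch below uses random tensors as stand-ins for real minibatch gradients and plain SGD without momentum or weight decay; it is an illustration of the idea, not the paper's proof.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, 4)              # toy fc weight, (out_features, in_features)
row_sums_before = w.sum(dim=1)

lr = 0.1
for _ in range(20):
    grad = torch.randn_like(w)                     # stand-in for a minibatch gradient
    grad = grad - grad.mean(dim=1, keepdim=True)   # gradient centralization
    w = w - lr * grad                              # plain SGD step

row_sums_after = w.sum(dim=1)
print(row_sums_before, row_sums_after)
```

Each row sum stays where it started (up to float error), so the weights only move inside a hyperplane fixed by the initialization.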

After a first short test (only 5 epochs and a couple of runs per variant) I decided to do more tests.
Here are the first results, 5 epochs only. Once the 20-epoch runs finish, I will upload the notebooks.
Used size 192, 5 runs. Comparing Ranger with GC vs without GC vs the OLD version of Ranger that I used in my previous experiments.
I can't run the new version on fastai2, so I used v1.
Xresnet50 from fastai v1:
OLD - 61.67% std 0.0112.
GC True - 63.34% std 0.0025.
GC False - 63.69% std 0.0172.
Xresnet + Mish + Sa:
OLD - 69.49% std 0.0060.
GC True - 70.56% std 0.0044.
GC False - 70.11% std 0.0094.
Xresnet + Mish + Sa + my resnet trick (still no name :grinning:):
OLD - 70.98% std 0.0080.
GC True - 70.28% std 0.0073.
GC False - 71.18% std 0.0157.
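
A quick way to read numbers like these is to compare the gap between two variants with the run-to-run spread. The sketch below uses the Xresnet50 figures from this post and assumes the reported std is a fraction (so 0.0172 means 1.72 percentage points). `gap_vs_noise` is a made-up helper and only a crude heuristic, not a significance test.

```python
# (mean accuracy in %, std as reported) for Xresnet50, 5 epochs, 5 runs
results = {
    "OLD": (61.67, 0.0112),
    "GC True": (63.34, 0.0025),
    "GC False": (63.69, 0.0172),
}


def gap_vs_noise(a: str, b: str):
    """Return (gap, noise, gap > noise), all in percentage points."""
    (mean_a, std_a), (mean_b, std_b) = results[a], results[b]
    gap = abs(mean_a - mean_b)
    # ASSUMPTION: std is a fraction of 1, so scale it to percentage points.
    noise = max(std_a, std_b) * 100
    return gap, noise, gap > noise


for pair in [("GC True", "GC False"), ("OLD", "GC True")]:
    gap, noise, clear = gap_vs_noise(*pair)
    print(pair, f"gap={gap:.2f} noise={noise:.2f} clear={clear}")
```

Under that assumption the GC True vs GC False gap is smaller than the larger of the two stds, while the OLD vs GC True gap is not.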

One more at 5 epochs.
Xresnet + Mish + Sa + my resnet trick + MaxBlurPool:
Old - 76.75% std 0.0066.
GC True - 74.65% std 0.0068.
GC False - 75.03% std 0.0081.

@LessW2020: the old one is sometimes better than the new :grinning:
Thanks! Good job!!!

It is interesting.
Will run 20 epochs.

I am hijacking this thread, but is this a new optimizer worth looking into?:

Another new optimizer at NeurIPS2020:

I’ll add that they have a hybrid optimizer in there too, combining Ranger + GC + theirs :slight_smile:

Did a short test on woof.
Changed Ranger to RangerAdaBelief.
One epoch became slower (0:58 to 1:10) on the same machine.
Tried only 5-epoch runs.
Got 69.50% vs 68.00% on the same model (size 192).
Best result was with eps = 1e-12, not with 1e-8 as mentioned in their GitHub.
Will run longer tests…
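
A possible reason eps matters so much here: as I read the AdaBelief paper, the second moment tracks (g - m)^2 instead of g^2 and adds eps at every step, so when gradients are steady the denominator is set almost entirely by eps. The scalar sketch below follows that formulation; it is an assumption about the algorithm as published, not a copy of the ranger-adabelief code, and the constant gradient is a toy case.

```python
import math


def adabelief_denominator(grads, eps, beta1=0.9, beta2=0.999):
    """sqrt(s_hat) + eps, with s tracking (g - m)^2 plus eps at each step."""
    m = s = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    s_hat = s / (1 - beta2 ** t)
    return math.sqrt(s_hat) + eps


def adam_denominator(grads, eps, beta2=0.999):
    """sqrt(v_hat) + eps, with v tracking g^2 as in Adam."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** t)
    return math.sqrt(v_hat) + eps


steady_grads = [1.0] * 20000  # toy case: the gradient never changes
for eps in (1e-8, 1e-12):
    print(eps, adam_denominator(steady_grads, eps), adabelief_denominator(steady_grads, eps))
```

In this toy case Adam's denominator is the same for both eps values, while AdaBelief's differs by roughly 100 times, which means a 100 times larger step for the smaller eps.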

Did you run with decoupled weight decay? Seems to help with it (I was trying 1e-2). Also an increase in time is expected too.
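
For anyone unsure what the decoupled switch changes, here is a scalar sketch of the two variants on a simplified adaptive step (no momentum, no bias correction). `adam_like_step` is a made-up function for illustration, not the RangerAdaBelief code; the 1e-2 weight decay is the value mentioned above, and the other defaults are arbitrary.

```python
import math


def adam_like_step(w, g, v, lr=1e-3, wd=1e-2, beta2=0.999, eps=1e-8, decoupled=True):
    """One simplified adaptive step on a scalar weight (hypothetical helper)."""
    if decoupled:
        w = w * (1 - lr * wd)  # decay applied directly to the weight
    else:
        g = g + wd * w         # L2 penalty folded into the gradient
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * g / (math.sqrt(v) + eps)
    return w, v


# Zero loss gradient isolates the effect of the decay term alone.
w_decoupled, _ = adam_like_step(w=1.0, g=0.0, v=0.0, decoupled=True)
w_coupled, _ = adam_like_step(w=1.0, g=0.0, v=0.0, decoupled=False)
print(w_decoupled, w_coupled)
```

With the coupled form the penalty goes through the adaptive denominator and gets rescaled, while the decoupled form shrinks the weight by exactly `1 - lr * wd`.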

Checked the code: it is True by default (but the readme said the opposite).
So I used it.
Will check without it.

Without decoupled weight decay one epoch is about 1 sec faster, and the results are very close.
Need long-run tests.

Did long (20 and 80 epoch) runs.
Correction: speed is the same with and without weight_decouple.
And it is close to “default” Ranger, in my case 1:07 (Ranger) vs 1:10 (new one).
On 20 epochs (3 times) - Ranger vs RangerAdaBelief vs RangerAdaBelief weight_decouple True:
87.03% 0.0024 – 87.00% 0.0023 – 86.54% 0.0048
On 80 epochs (2 times):
89.30% 0.0009 – 89.58% 0.0001 – 89.59% (both runs gave exactly the SAME result!).
Of course this is just a short test. There are a lot of hyperparameters to tweak…
I used eps = 1e-12
betas=(0.9, 0.999)

AdaBelief seemed to do well after many epochs, so maybe do some hyperparameter searching on 5-20 epochs and then compare on 80-100 epochs?

@a_yasyrev if possible, can you please share the integrated RangerAdaBelief / your experiments?

I may not be him, but this is what I’m doing (fastai v1):

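!pip install ranger-adabelief==0.0.9
from functools import partial
from fastai.basic_train import Learner  # fastai v1
from ranger_adabelief import RangerAdaBelief
# `data` (a DataBunch) and `model` are assumed to be defined earlier in the notebook
opt_func = partial(RangerAdaBelief, betas=(0.9,0.999), eps=1e-8)
learn = Learner(data, model, opt_func=opt_func)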