Hi all,
A new paper came out last week on Gradient Centralization. I integrated and tested it inside of Ranger today and set new accuracy records with it, so IMO GC delivers and is worth checking out.
What is Gradient Centralization? From the paper: “GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable.”
It basically re-centers the gradient to zero mean, which stabilizes optimization much like BN re-centers the mean of activations.
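For concreteness, here is a minimal NumPy sketch of that re-centering step (the function name `centralize_gradient` is mine, not from the paper's code; for a conv weight of shape `(out, in, kH, kW)` the mean is taken per output filter, i.e. over all axes but the first):

```python
import numpy as np

def centralize_gradient(grad):
    """Subtract the per-filter mean from a weight gradient.

    Applied only to tensors with more than one dimension (conv/FC
    weights); bias vectors are left untouched, as in the GC paper.
    """
    if grad.ndim > 1:
        axes = tuple(range(1, grad.ndim))
        grad = grad - grad.mean(axis=axes, keepdims=True)
    return grad

# Each filter's centralized gradient now has (numerically) zero mean.
g = np.random.randn(8, 3, 3, 3)          # fake conv-weight gradient
print(np.allclose(centralize_gradient(g).mean(axis=(1, 2, 3)), 0.0))  # True
```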
Source and paper are here:
I’ve updated Ranger to use it by default and exposed options so you can turn it on or off (to compare), as well as control whether GC is applied only to conv layers or to both conv and FC layers (recommended).
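As a sketch of what those switches do (the argument names `use_gc` and `gc_conv_only` here are illustrative; check the Ranger source for the real ones), a plain SGD step with optional GC might look like:

```python
import numpy as np

def sgd_step_with_gc(param, grad, lr=0.1, use_gc=True, gc_conv_only=False):
    """One plain-SGD update with optional Gradient Centralization.

    `use_gc` toggles GC entirely; `gc_conv_only` restricts it to conv
    weights (4-D tensors), skipping 2-D FC weights.
    """
    is_conv = grad.ndim > 3                     # (out, in, kH, kW)
    apply_gc = use_gc and (is_conv or not gc_conv_only)
    if apply_gc and grad.ndim > 1:
        grad = grad - grad.mean(axis=tuple(range(1, grad.ndim)), keepdims=True)
    return param - lr * grad
```

With an all-ones gradient on an FC weight, GC zeroes the (fully mean-valued) gradient, so the parameter is unchanged; with `gc_conv_only=True` the FC weight is updated as usual.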
The GC researchers’ GitHub is here, and it also has SGD-GC and Adam-GC:
Per the GC paper, GC not only improves training results but also appears to help with generalization. Anyway, I was really happy to see a paper deliver results when applied to larger datasets. The authors state that it helps reduce NaN errors while training as well, which is an additional benefit.
The intuition that I’m forming (and it may be completely wrong) is that this normalization ensures no single batch has the ability to change the weights by much…
Did some tests.
Added the new Ranger to my Imagewoof pipeline.
On my best version (with SA, Mish, and MaxBlurPool), results were worse than the old version (no GC).
I decided to check without MaxBlurPool. The results are different.
At size 192, 5-epoch results are the same; at 20 epochs, 86.1% vs 86.3% with GC True.
My personal intuition on GC (taken from the paper) is that it can be reinterpreted as a constraint on the weight space, making the optimization problem simpler and adding regularization (the proofs are in the paper).
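That constrained view is easy to check numerically: subtracting the mean projects the gradient onto the hyperplane orthogonal to the all-ones vector, so each SGD step preserves the mean of the weights (a small demo, not the paper's code):

```python
import numpy as np

# GC as projected gradient descent: subtracting the mean projects the
# gradient onto the hyperplane orthogonal to the all-ones vector e,
# so an SGD step leaves the mean of each filter's weights unchanged.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)        # one "filter", flattened
g = rng.standard_normal(10)
g_c = g - g.mean()                 # centralized gradient
e = np.ones(10)

print(np.isclose(g_c @ e, 0.0))            # True: orthogonal to e
w_new = w - 0.1 * g_c
print(np.isclose(w_new.mean(), w.mean()))  # True: weight mean preserved
```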
After a first short test (only 5 epochs and a couple of runs on each variant), I decided to do more tests.
Here are the first results, only 5 epochs. After the 20-epoch runs end, I will upload the notebooks.
Used size 192, 5 runs. Comparing Ranger with GC vs. without GC vs. the OLD version of Ranger that I used in my previous experiments.
I can’t run the new version on fastai2, so I used v1.
Xresnet50 from fastai v1:
OLD - 61.67% std 0.0112.
GC True - 63.34% std 0.0025.
GC False - 63.69% std 0.0172.
Xresnet + Mish + SA:
OLD - 69.49% std 0.0060.
GC True - 70.56% std 0.0044.
GC False - 70.11% std 0.0094.
Xresnet + Mish + SA + my resnet trick (still no name):
OLD - 70.98% std 0.0080.
GC True - 70.28% std 0.0073.
GC False - 71.18% std 0.0157.
Did a short test on woof.
Changed Ranger to RangerAdaBelief.
One epoch became slower (0:58 to 1:10) on the same machine.
Tried only 5-epoch runs.
Got 69.50% vs 68.00% on the same model (size 192).
Best result with eps = 1e-12, not 1e-8 as mentioned in their GitHub.
Will run longer tests…
Did long runs (20 and 80 epochs).
Correction – speed is the same with and without weight_decouple.
And it’s close to “default” Ranger: in my case 1:07 (Ranger) vs 1:10 (the new one).
On 20 epochs (3 runs) – Ranger vs RangerAdaBelief vs RangerAdaBelief with weight_decouple=True:
87.03% std 0.0024 – 87.00% std 0.0023 – 86.54% std 0.0048
On 80-epoch runs (2 runs):
89.30% std 0.0009 – 89.58% std 0.0001 – 89.59% (both runs gave exactly the SAME result!)
Of course, this is just a short test. There are a lot of hyperparameters to tweak…
I used eps = 1e-12 and betas = (0.9, 0.999).
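For reference, a bias-corrected AdaBelief step looks roughly like this (a sketch from the AdaBelief paper, not the RangerAdaBelief source; note that eps enters both inside the variance estimate and in the denominator, which is part of why a tiny value like 1e-12 behaves differently than in Adam):

```python
import numpy as np

def adabelief_step(param, grad, m, s, t, lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-12):
    """One bias-corrected AdaBelief update.

    Unlike Adam, the second moment tracks (grad - m)^2 -- the "belief"
    in the current gradient direction. eps is added inside s and to the
    denominator, per the paper's algorithm.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
    m_hat = m / (1 - b1 ** t)                  # bias correction
    s_hat = s / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(s_hat) + eps)
    return param, m, s
```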
I have a question about switching from SGD to Ranger: do we need to re-adjust the learning rate if we swap SGD directly for Ranger? And since Adam does not work well with weight decay (which is why AdamW was proposed), does Ranger handle weight decay well? I noticed that Ranger beats SGD on the 20-epoch training scheme; is there any benchmark on more epochs and larger datasets such as ImageNet?