A new paper came out last week on Gradient Centralization. I integrated and tested it inside of Ranger today and got new accuracy records with it, so imo GC delivers and I think it's worth checking out.
What is Gradient Centralization? From the paper: “GC can be viewed as a projected gradient descent method with a constrained loss function. The Lipschitzness of the constrained loss function and its gradient is better so that the training process becomes more efficient and stable.”
It basically subtracts the mean from each weight's gradient so the gradient has zero mean, which stabilizes optimization much like how BN re-centers the mean of activations.
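To make that concrete, here is a minimal sketch of the centralization step (my own simplified version of the operation described in the paper, not the exact code from the Ranger repo): for any weight tensor with more than one dimension, the mean of the gradient over all dims except the output dim is subtracted out.

```python
import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Subtract the per-slice mean from a gradient tensor (a minimal GC sketch)."""
    # Only applies to weights with more than one dimension (conv filters,
    # fc weight matrices); biases and BN parameters are left untouched.
    if grad.dim() > 1:
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad

# Tiny demo: centralize the gradient of a conv-style weight tensor.
w = torch.randn(16, 3, 3, 3, requires_grad=True)
w.sum().backward()
w.grad = centralize_gradient(w.grad)  # each filter's gradient now has ~zero mean
```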
The source and paper are here:
I’ve updated Ranger to use it by default and have exposed options so you can turn it on or off (to compare), as well as control whether GC is applied to conv layers only or to both conv and fc layers (recommended). A rough usage sketch is below.
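Here’s a rough usage sketch of those options. The argument names `use_gc` and `gc_conv_only` (and the import path) are illustrative assumptions on my part; check the repo for the exact names.

```python
import torch.nn as nn
from ranger import Ranger  # import path may differ depending on how you install the repo

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Flatten(), nn.Linear(16 * 30 * 30, 10))

# GC on for both conv and fc layers (the recommended default):
opt = Ranger(model.parameters(), lr=1e-3, use_gc=True, gc_conv_only=False)

# GC restricted to conv layers only:
# opt = Ranger(model.parameters(), lr=1e-3, use_gc=True, gc_conv_only=True)

# GC off entirely, to compare against a non-GC baseline:
# opt = Ranger(model.parameters(), lr=1e-3, use_gc=False)
```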
The new Ranger version is here:
The GC researchers' GitHub is here and also has SGD-GC and Adam-GC:
Per the GC paper, not only does GC improve training results, but it also appears to help with generalization. Anyway, I was really happy to see a paper deliver results when applied to larger datasets. The authors state that it helps reduce NaN errors during training as well, so that's an additional benefit.