I’m currently completing a rewrite of Ranger (aptly named Ranger21) to use some of the newest innovations that have arrived since Ranger was first released 1.5 years ago.
Specifically, I have added the following:
1 - Choice of engine - madgrad (dual averaging) or AdamW style for the primary moment calculations
2 - Stable weight decay. Most of the Adam variants are arguably patches working around a core issue: without normalizing the decay relative to the gradient variance, you create a ‘moving target’ for the optimizer. In my testing this has been a nice improvement over both standard Adam-style weight decay and AdamW-style decoupled decay.
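To make that concrete, here’s a minimal sketch of the stable-decay idea as I understand it (illustrative only, not the actual Ranger21 code; function and variable names are mine): the decay term is divided by the mean of the per-element denominator (the square root of the second-moment estimate), so its effective strength stays roughly constant as the variance estimates move.

```python
import torch

def stable_weight_decay_(params, exp_avg_sqs, lr, weight_decay, eps=1e-8):
    """Sketch of stable weight decay (illustrative, not the Ranger21 source).

    The decoupled decay is normalized by the mean of sqrt(v) taken over
    every element of every parameter, so the decay's effective strength
    doesn't drift as the second-moment estimates change.
    """
    total, count = 0.0, 0
    for v in exp_avg_sqs:            # one second-moment buffer per parameter
        total += v.sqrt().sum().item()
        count += v.numel()
    mean_denom = max(total / max(count, 1), eps)

    for p in params:
        # AdamW-style decoupled decay, scaled by the global mean denominator
        p.data.mul_(1.0 - lr * weight_decay / mean_denom)
```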
3 - Positive-negative momentum - injecting noise into the optimization helps it settle into wider minima and thus generalize better. The problem is that randomly added noise will almost certainly make results worse…the trick is that the noise has to be anisotropic and parameter-dependent.
Positive-negative momentum generates exactly that kind of noise, and it has given better results in my testing.
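As a rough illustration (a simplified sketch of the published method, not the Ranger21 implementation; `beta0` is my name for the negative-momentum coefficient): two momentum buffers are updated on alternating steps and then combined with a positive and a negative weight, which produces noise shaped by the gradients themselves rather than isotropic random noise.

```python
import math

def pnm_direction(grad, m_even, m_odd, step, beta1=0.9, beta0=1.0):
    """One step of simplified positive-negative momentum (tensors in, tensor out).

    The buffer matching this step's parity is refreshed with the gradient;
    the other buffer (last updated one step ago) gets a negative weight.
    Returns the update direction before any lr / adaptive scaling.
    """
    if step % 2 == 0:
        m_even.mul_(beta1 ** 2).add_(grad, alpha=1 - beta1 ** 2)
        pos, neg = m_even, m_odd
    else:
        m_odd.mul_(beta1 ** 2).add_(grad, alpha=1 - beta1 ** 2)
        pos, neg = m_odd, m_even

    # the normalizer keeps the combined noise scale comparable to plain momentum
    norm = math.sqrt((1 + beta0) ** 2 + beta0 ** 2)
    return ((1 + beta0) * pos - beta0 * neg) / norm
```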
4 - Gradient centralization - carried over from the previous Ranger because it continues to work well with either engine.
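For reference, gradient centralization is just subtracting the mean of each gradient slice over all dimensions except the first, and it only applies to multi-dimensional weights (conv / linear). A minimal sketch, not the exact Ranger21 code:

```python
import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Gradient centralization: remove the per-output-slice mean of the gradient.

    Only multi-dimensional weights (conv / linear) are centralized;
    biases and other 1-d parameters pass through untouched.
    """
    if grad.dim() > 1:
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad
```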
5 - RAdam replaced with linear warmup - based on an earlier paper showing that RAdam’s rectification effectively reduces to a linear warmup, you get the same improvement with much simpler calculations.
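The replacement is just a linear ramp on the learning rate. A sketch, with the warmup length set by the 2 / (1 - beta2) rule of thumb from that paper (treat this default as an assumption rather than Ranger21’s exact setting):

```python
def warmup_lr(base_lr: float, step: int, beta2: float = 0.999) -> float:
    """Linear warmup standing in for RAdam's rectification term.

    A warmup length of 2 / (1 - beta2) steps (2000 for beta2 = 0.999) is the
    rule of thumb from the untuned-warmup paper; assumed here, not verified
    against Ranger21's defaults.
    """
    warmup_steps = 2.0 / (1.0 - beta2)
    return base_lr * min(1.0, step / warmup_steps)
```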
6 - Cosine decay at the end of the run - basically the fit-flat-cosine schedule, but built into the optimizer directly. Just specify at what % of total iterations to begin the decay (0.65 is generally a solid starting point); there’s a quick sketch of the schedule below.
Of interest, MSFT Research did a paper on lr scheduling and found that a flat-then-decay schedule like fit-flat-cosine (technically they used a linear decay, as they didn’t see much difference between decay styles) worked the best of the lr schedules they compared. They called it hyperknee, but the schedule is the same regardless of the name.
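Here’s what that built-in schedule looks like as a sketch (parameter names are mine and the defaults are illustrative): the lr stays flat until the `decay_start` fraction of total iterations, then follows a cosine down to zero.

```python
import math

def flat_then_cosine_lr(base_lr: float, step: int, total_steps: int,
                        decay_start: float = 0.65) -> float:
    """Flat lr for the first `decay_start` fraction of the run, then a
    cosine decay to zero over the remaining steps (fit-flat-cosine style)."""
    start = int(total_steps * decay_start)
    if step < start:
        return base_lr
    t = (step - start) / max(total_steps - start, 1)   # progress through decay, in [0, 1]
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```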
In general, madgrad is usually the better option for transformers, while for CNNs the AdamW-style engine is on average better. But that’s only based on my initial testing…
It’s still in ‘alpha’ but if you want to give it a whirl it’s here: