It looks like the idea is really simple: just add a lower bound and an upper bound to clamp the learning rates. If the lower bound is zero and the upper bound is infinity, it becomes Adam; if the lower bound equals the upper bound, it becomes SGD. In AdaBound, the lower and upper bounds change over training, so it gradually transforms from Adam into SGD.
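To make the idea concrete, here is a rough sketch of what the dynamic bounds might look like. The function names, the `gamma` schedule parameter, and the exact bound formulas are my own assumptions for illustration, not the paper's exact implementation:

```python
def adabound_bounds(step, final_lr=0.1, gamma=1e-3):
    """Hypothetical sketch of AdaBound-style dynamic clamping bounds.

    The lower bound starts at 0 and the upper bound starts near infinity
    (pure Adam); both converge toward final_lr as step grows, squeezing the
    per-parameter step size toward a single SGD-like rate.
    """
    lower = final_lr * (1 - 1 / (gamma * step + 1))  # rises from 0
    upper = final_lr * (1 + 1 / (gamma * step))      # falls from very large
    return lower, upper

def clamped_step_size(adam_step_size, step):
    # Clamp whatever step size Adam computed into [lower, upper].
    lower, upper = adabound_bounds(step)
    return min(max(adam_step_size, lower), upper)
```

Early in training the interval is so wide that the clamp is a no-op (Adam); late in training it pins every parameter to roughly `final_lr` (SGD-like).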

The performance and effectiveness of this method are still to be tested, as it’s brand new, but it doesn’t seem hard to implement in fastai. Do you think it’s worth adding?

Has anyone else experimented with either of these?

I’ve tried running both of these today (from the PyTorch code here and here).
So far, running with their default configurations, AdaBound seems to train significantly slower than Adam on a couple of different problems I’ve tried. Maybe the learning-rate clamping is too aggressive by default?

AdaFactor, for me so far, turns out to be either a bit slower than Adam (though not as slow as AdaBound) or about the same as Adam.

This is with no learning rate schedule, by the way – for Adam it’s easy to overwrite the LR via a lookup table of scheduled values, but for these other optimizers the LR is computed by its own function.
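For reference, the lookup-table override for Adam can be done by writing directly into the optimizer’s `param_groups` each step. The model, loss, and the particular LR values below are placeholders; only the `param_groups` mechanism is the point:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical lookup table of scheduled LR values, one entry per step.
lr_table = [1e-3, 8e-4, 6e-4, 4e-4, 2e-4]

for step, lr in enumerate(lr_table):
    # Overwrite Adam's LR directly before each step.
    for group in opt.param_groups:
        group["lr"] = lr
    loss = model(torch.randn(4, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

This trick works for any `torch.optim` optimizer whose step size is controlled by `lr`; the complaint above is that AdaBound/AdaFactor compute their effective step size internally, so a simple table override doesn’t map onto them as cleanly.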

PS- It would make me happy if the next such variant were named “Adaboy.”

@drscotthawley you should try gradually annealing eps. Start at 1e-8, and cosine-anneal it to 1.0. If it works well, we can write a paper and call it Adaboy. OK?

I find that gradually cosine-annealing eps from 1e-8 to 1 seems to sometimes “hurt” Adam at low loss values, but doesn’t help anything else – maybe I did it wrong? As for which method performs better, on the two problems I tried (XOR & MNIST), results are…mixed. Change the learning rate, and the picture of which method outperforms which other method reverses.

If I understand your idea of annealing epsilon, you’re trying to keep the step size from getting too big when other quantities are small? And yet I’d imagine that such “flat regions” are where you’d want to try to have a larger coefficient to multiply by the (small) gradient. …So, yea, it’s not clear to me why this ought to work: Please enlighten me!

…Wait, is your idea that, by cosine-annealing eps or not, that one gets results which are closer to AdaBound or AdaFactor, respectively? I see that in the XOR case, but not for MNIST.

One other thing worth noting: AdaFactor gives NaNs if the batch size isn’t “fairly large”. On MNIST, for example, a batch size of 32 produces NaNs right away, whereas with a batch size of 200 it seems to do OK.

Ah, the problem is that your dataset is too small. You really need to use something like ImageNet to see this issue – or at least, I don’t know how to see it otherwise. On ImageNet, Adam works great for the first 2/3 of the epochs, but the last few epochs don’t quite get to the same place as SGD. Same problem if your momentum is too high, BTW.

The point of annealing eps is that with low eps you have Adam, and with high eps you have SGD. So by annealing you get the same benefit as AdaBound, a little more simply.
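A minimal sketch of that annealing schedule, assuming the standard cosine shape (the helper name and signature are mine). The intuition: Adam’s update is roughly `m / (sqrt(v) + eps)`, so when eps is tiny the adaptive `sqrt(v)` term dominates (Adam), and when eps is near 1 it swamps `sqrt(v)` and the update approaches plain momentum SGD:

```python
import math

def annealed_eps(step, total_steps, eps_start=1e-8, eps_end=1.0):
    """Cosine-anneal Adam's eps from eps_start to eps_end over training."""
    t = min(step / total_steps, 1.0)  # fraction of training completed
    return eps_end + 0.5 * (eps_start - eps_end) * (1 + math.cos(math.pi * t))
```

To use it with a PyTorch Adam optimizer, you could overwrite `group["eps"]` in each of `opt.param_groups` at every step, the same way a scheduled LR is overwritten.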

Oh, OK. I was using small datasets in order to experiment quickly. (I’m not yet set up to train ImageNet in 18 minutes like you are!) Is Imagenette a sufficient size for this? I just downloaded it and have started a run on my GPU.

I’ll also try enabling the eps annealing on my larger [audio] project for my next run. (Trying to finish a paper on that!) That was the project I mentioned for which AdaBound and AdaFactor were really slow early in training compared to Adam. …If the eps annealing helps Adam there, that would be great.

Update: I got sick and haven’t worked on this much since. So far all the eps annealing has done for me is slow down Adam.

I’ve found instructions on getting reduced versions of ImageNet, but not the full thing yet. I’ve asked for help in a different thread (How does one "Download ImageNet"?)