Ah, the problem is that your dataset is too small. You really need to use something like ImageNet to see this issue - or at least, I don't know how to see it otherwise. On ImageNet, Adam works great for the first 2/3 of the epochs, but then in the last few epochs it doesn't quite get to the same place as SGD. You get the same problem if your momentum is too high, BTW.
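The point of annealing eps is that with low eps you have Adam, and with high eps you have SGD: once eps swamps the square root of the squared-gradient average, every parameter is divided by roughly the same constant, so the step is just the momentum-averaged gradient times lr/eps. So by annealing eps you get the same benefit as AdaBound, in a slightly simpler way.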
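To make that concrete, here's a rough sketch of the idea in plain Python for a single scalar parameter. It's only an illustration, not anyone's actual implementation: `adam_step` is the textbook Adam update, while `annealed_eps`, its log-linear schedule, and the `eps_start`/`eps_end` values are assumptions I've made up for the example.

```python
import math

def adam_step(m, v, grad, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a scalar parameter.

    Returns (delta, m, v), where delta is the change to apply to the parameter.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

def annealed_eps(progress, eps_start=1e-8, eps_end=1.0):
    """Hypothetical schedule: interpolate eps on a log scale.

    progress is the fraction of training completed, in [0, 1].
    """
    log_eps = (1 - progress) * math.log(eps_start) + progress * math.log(eps_end)
    return math.exp(log_eps)

if __name__ == "__main__":
    # Compare the first step for a small and a large gradient.
    for eps in (1e-8, 1e3):
        small, _, _ = adam_step(0.0, 0.0, 0.01, step=1, eps=eps)
        big, _, _ = adam_step(0.0, 0.0, 1.0, step=1, eps=eps)
        print(f"eps={eps:g}: step ratio (big grad / small grad) = {big / small:.4f}")
```

With a tiny eps both gradients produce nearly the same step size (Adam's per-parameter normalization), whereas with a large eps the step scales with the gradient, like SGD. One thing to watch: since the effective learning rate becomes roughly lr/eps, you'd presumably need to raise lr along with eps to keep the step size comparable.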