I had some questions about the recent fast.ai blog post, “AdamW and Super-convergence is now the fastest way to train neural nets”, and, in the absence of a dedicated comments topic, this Deep Learning one seemed most appropriate.
Mostly I am curious about the conclusion that amsgrad is just noise. @sgugger, it appears that this holds for the image classification tasks, but for the NLP tasks there seems to have been a substantial improvement.
Would you mind elaborating? I am also curious if you did any comparisons with vanilla SGD.
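For context on the kind of comparison I have in mind, here is a minimal sketch using PyTorch's stock optimizers (AdamW with and without amsgrad, plus plain SGD). The model, data, and hyperparameters below are placeholders I made up for illustration, not the settings from the blog post:

```python
import torch
from torch import nn, optim

def make_model():
    # Placeholder model -- purely illustrative, not the architecture from the post.
    return nn.Linear(10, 2)

def make_configs():
    # Each configuration trains its own model instance.
    configs = {}
    for name in ("adamw", "adamw_amsgrad", "vanilla_sgd"):
        model = make_model()
        if name == "adamw":
            opt = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
        elif name == "adamw_amsgrad":
            opt = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2, amsgrad=True)
        else:
            # Vanilla SGD: no momentum, no weight decay.
            opt = optim.SGD(model.parameters(), lr=1e-2)
        configs[name] = (model, opt)
    return configs

# Dummy batch just to show the training step; a real comparison would
# run full training on the actual datasets and track validation metrics.
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss_fn = nn.CrossEntropyLoss()

for name, (model, opt) in make_configs().items():
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    print(f"{name}: loss = {loss.item():.4f}")
```

Something along these lines run over the same tasks as in the post is roughly what I was wondering about for the vanilla SGD baseline.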