SGD optimizations

Here is my mindmap of SGD optimizations.

It seems like Nadam is the best, although we use Adam in most of the examples. Curious what you all think about Nadam?

Also, I am a bit confused about why learning rate annealing helps with adaptive learning rate optimizers. Isn't the algorithm supposed to handle that on its own if it is adaptive?
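One way to see why the two are not redundant: in Adam-style optimizers, the adaptive part only rescales each parameter's gradient by its running statistics, while the global step size is still multiplied in on top. Annealing shrinks that global multiplier, which the adaptive machinery never does by itself. Here is a minimal pure-Python sketch of one Adam update with a simple 1/t decay on the global learning rate (my own toy example, not any library's implementation):

```python
import math

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter.
    `lr` is the global step size; annealing it shrinks every step,
    independent of the per-parameter adaptive scaling below."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Adaptive part: normalize the gradient per parameter.
    # The overall step is still proportional to lr.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) with an annealed global lr.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    lr = 0.1 / (1 + 0.01 * t)   # simple 1/t annealing of the global lr
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr)
```

Without the decay, the late-training steps stay on the order of the fixed lr, so the iterate keeps bouncing around the minimum; decaying lr lets it settle closer.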



Hi Sravya,

Great visualization.

Regarding your question on the learning rate: I wondered the same thing about the necessity of annealing with an adaptive learning rate. But I do know from personal experience that even with Nadam, using the ReduceLROnPlateau callback significantly reduced the error rate.

I haven't found a good explanation for this yet.
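For anyone curious what ReduceLROnPlateau actually does on top of the optimizer: the idea is just to cut the global learning rate when the monitored metric stops improving for a few epochs. This is a simplified sketch of that idea (class name and defaults are mine, not Keras's actual implementation):

```python
class PlateauScheduler:
    """Toy sketch of the ReduceLROnPlateau idea: multiply the lr by
    `factor` after `patience` epochs with no improvement in val loss."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since last improvement

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0  # reset the counter after reducing
        return self.lr

# Usage: a loss that plateaus after two epochs triggers one reduction.
sched = PlateauScheduler(lr=0.001, patience=2)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    lr = sched.step(loss)
```

So even with Nadam adapting per-parameter scales, this callback shrinks the global multiplier once progress stalls, which may be why it still helps.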