@jeremy, I’m pretty sure there’s an error in the graddesc excel notebook for the adagrad example.
A known drawback of Adagrad is that the effective learning rates continuously decrease because the denominator continuously grows. From Sebastian Ruder (http://ruder.io/optimizing-gradient-descent/index.html#adagrad):
Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
…In its update rule, Adagrad modifies the general learning rate η at each time step t for every parameter θ_i based on the past gradients that have been computed for θ_i
So, we expect the denominators in cells F1 and G1 to keep increasing each time we run the macro. However, they decrease instead, which causes the effective learning rates in cells F2 and G2 to grow instead of shrink.
The problem seems to stem from the spreadsheet only considering the L2 norm from the most recent mini-batch, instead of considering the L2 norm over its entire history.
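To illustrate the difference I mean, here is a rough sketch in Python rather than Excel. The gradient values, the learning rate, and the function name `adagrad_denominators` are made up for illustration, and the `accumulate=False` branch is only my guess at what the spreadsheet is doing, not something I have confirmed from the macro.

```python
import math

def adagrad_denominators(grads, accumulate=True, eps=1e-8):
    """Return the denominator sqrt(G + eps) after each step for one parameter."""
    g_sum = 0.0
    denoms = []
    for g in grads:
        if accumulate:
            g_sum += g * g   # Adagrad: running sum of squared gradients over the whole history
        else:
            g_sum = g * g    # assumed spreadsheet behavior: most recent mini-batch only
        denoms.append(math.sqrt(g_sum + eps))
    return denoms

# Hypothetical gradients for one parameter, shrinking as training converges
grads = [4.0, 2.0, 1.0, 0.5]
lr = 0.1  # hypothetical base learning rate

full = adagrad_denominators(grads, accumulate=True)
recent = adagrad_denominators(grads, accumulate=False)

# Columns: step, effective rate with full history, effective rate with latest batch only
for step, (d_full, d_recent) in enumerate(zip(full, recent), start=1):
    print(step, round(lr / d_full, 4), round(lr / d_recent, 4))
```

With the full history, the denominator can never decrease, so the effective learning rate can only shrink. With only the latest mini-batch, the denominator follows the current gradient, so as the gradients get smaller the effective rate grows, which matches what I see in F2 and G2.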
Please tell me if I’m wrong… Thanks!