In lesson 4 we talked about a bunch of different approaches to dynamic learning rates. I’m trying to make sure I understand them and have a few questions.
Looking at the implementation of Adagrad here, it seems to me that the learning rate for a parameter gets updated based on the magnitude of the change to that parameter over an epoch. Sebastian’s comments suggest something different:
“we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance… Adagrad is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters”
Based on the implementation, it seems like how often a parameter gets updated isn’t as important as the error gradient for any given update. If we’re at a point with a high gradient, the parameter may be adjusted by a larger amount in one update than a low-gradient parameter is over multiple updates, so the frequency with which a parameter is updated seems like a misleading way to describe it. I might be misunderstanding Sebastian’s terminology here.
Additionally, it seems like a parameter that has been adjusted a lot will have its learning rate shrunk down the most, while a parameter that hasn’t been adjusted much will keep using something close to the default learning rate. I can see how this would prevent overshooting the optimum and also let the areas with the lowest gradient continue to train quickly. However, in my head, if the error gradient is high, then we’re far from the optimum, so reducing the learning rate will make it take longer to get there.
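For what it’s worth, here’s my reading of the update rule as a minimal NumPy sketch (the variable names are mine, not from the course code): the denominator grows with every squared gradient a parameter has ever received, so a parameter that gets a gradient on every step (“frequent”) sees its effective step shrink faster than one that only occasionally receives a gradient (“infrequent”).

```python
import numpy as np

def adagrad_update(param, grad, cache, lr=0.01, eps=1e-8):
    # Accumulate each parameter's squared gradients, then divide the
    # step by the square root of that per-parameter running sum.
    cache = cache + grad ** 2
    step = lr * grad / (np.sqrt(cache) + eps)
    return param - step, cache, step

# Two parameters: the first gets a gradient on every step ("frequent"),
# the second only on every fifth step ("infrequent").
p, c = np.zeros(2), np.zeros(2)
steps = []
for t in range(10):
    g = np.array([1.0, 1.0 if t % 5 == 0 else 0.0])
    p, c, s = adagrad_update(p, g, c)
    steps.append(s)

# At t = 5 both coordinates receive a gradient of 1.0, but the frequent
# one divides by sqrt(6) (6 accumulated squared gradients) while the
# infrequent one divides by sqrt(2), so the infrequent parameter takes
# the larger step even though the raw gradients are identical.
```

If that reading is right, “larger updates for infrequent parameters” is about the accumulated history in the denominator, not the raw gradient at any one step.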
I’m reading over the paper - do you know of any simple explanations of the technique?
Finally, are there techniques that try to search for a global minimum, or at least measure how local a local minimum is and pick the best out of any others that get found?
p.s. I’ve put my notes up on the wiki here, feel free to add/remove/edit/fix!