Here are my thoughts, and I would really appreciate it if you could chime in on whether they are correct.
- AdaGrad helps us navigate the error surface. When one parameter's derivative is orders of magnitude larger than another's, a single global learning rate cannot serve both: we risk overshooting along the steep dimension (jumping from one side of the ravine to the other) while making almost no progress along the shallow one. By dividing each parameter's step by the square root of its accumulated squared gradients, AdaGrad evens out the effective step sizes across dimensions.
- I came across a super interesting Quora answer that looks at the situation from another direction: AdaGrad allows us to make better use of sparse features. If a feature in the training data shows up only rarely, it will have minimal cumulative impact on the weights compared to a feature that shows up regularly. Such a rare feature will update a set of weights that are normally not touched, and when that happens, we want the update's magnitude to be amplified: because we encounter the feature so rarely, each occurrence should still contribute meaningfully to the knowledge we are accumulating in the weights, provided the feature is informative albeit rare.
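To make the sparse-feature intuition concrete, here is a minimal NumPy sketch of the AdaGrad update rule applied to two parameters, one driven by a dense feature and one by a feature that fires only every tenth step. The simulation setup (gradient values, firing schedule, hyperparameters) is illustrative, not taken from any particular source:

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.1, eps=1e-8):
    # Accumulate the squared gradient per parameter.
    cache += grad ** 2
    # Each parameter's step is scaled by 1/sqrt(accumulated squared grads),
    # so frequently updated parameters get progressively smaller steps.
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.zeros(2)      # [dense-feature weight, sparse-feature weight]
cache = np.zeros(2)  # per-parameter accumulated squared gradients

for step in range(100):
    # The dense feature produces a gradient every step;
    # the sparse feature fires only once every ten steps.
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    w, cache = adagrad_update(w, grad, cache)
```

After this loop, the sparse parameter's accumulator is much smaller (10 vs. 100), so on the steps where its feature does fire, its effective learning rate is larger than the dense parameter's, which is exactly the "amplify rare but meaningful updates" behavior described above.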
Would really welcome your thoughts on this. Are my arguments sound, or can the reasoning be refined?