" Now what do you want to do if you have a number that’s first positive then negative then small then high? You probably want to be more careful… you probably don’t want to take a big step because you can’t really trust it. So when the variance of the gradient is high, you’re going to divide the learning rate by a big number. Whereas if the learning rate is of the same size all the time, then we probably feel very good about the step, so we’re dividing it by a smaller amount."
In the workbook (graddesc.xlsm), however, it looks like Adam penalizes (by reducing the learning rate of) weights whose gradients have high magnitude, rather than those whose gradients have high variance.
For example, consider two weights: the first with a perfectly constant history of derivatives, the second with a history of derivatives that has very high variance:
W1: [500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500]
W2: [500, -500, 500, -500, 500, -500, 500, -500, 500, -500, 500, -500]
W1's derivatives have zero variance, and W2's have very high variance. Yet, since the squared magnitudes of the derivatives are identical at every step, I think Adam's second-moment estimate is the same for both weights, so both are penalized identically, right?
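To check this, here is a minimal sketch of the standard Adam moment updates (the function name and default hyperparameters are the usual ones from the Adam paper, not anything specific to the workbook), run over both gradient histories:

```python
import math

def adam_final_step(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates over a gradient history.

    Returns the final bias-corrected second moment v_hat and the
    final update direction m_hat / (sqrt(v_hat) + eps).
    """
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g    # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return v_hat, m_hat / (math.sqrt(v_hat) + eps)

w1_grads = [500.0] * 12                                        # constant
w2_grads = [500.0 if i % 2 == 0 else -500.0 for i in range(12)]  # alternating

v1, step1 = adam_final_step(w1_grads)
v2, step2 = adam_final_step(w2_grads)

# v is identical for both, since g*g is 250000 at every step
# for both histories; only the first moment m distinguishes them.
print(v1 == v2)          # the second moment cannot tell them apart
print(step1, step2)      # but the update sizes differ via m
```

So the suspicion holds for v: Adam's second moment is the uncentered E[g²] = Var(g) + E[g]², not the variance, and it is identical for W1 and W2. Adam does still take a much smaller step for W2, but through the first moment m (which averages the alternating ±500s toward zero), not through the v-based learning-rate scaling.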