Does Adam penalize variance of derivatives? Or magnitude of derivatives?

I’m somewhat confused by @jeremy 's explanation of Adam in the Lesson 5 video (https://youtu.be/J99NV9Cr75I?t=2h6m49s). He says:

" Now what do you want to do if you have a number that’s first positive then negative then small then high? You probably want to be more careful… you probably don’t want to take a big step because you can’t really trust it. So when the variance of the gradient is high, you’re going to divide the learning rate by a big number. Whereas if the learning rate is of the same size all the time, then we probably feel very good about the step, so we’re dividing it by a smaller amount."

In the workbook (graddesc.xlsm), however, it looks like Adam penalizes (by reducing the learning rate of) weights whose derivatives have a high magnitude rather than weights whose derivatives have a high variance.

For example, consider two weights: the first with a history of derivatives that is perfectly constant, the second with a history of derivatives that has a very high variance:
W1: [500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500]
W2: [500, -500, 500, -500, 500, -500, 500, -500, 500, -500, 500, -500]

W1 has a variance of zero, and W2 has a very high variance. Yet, since the magnitudes of the derivatives are the same, I think Adam penalizes both weights identically, right?
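To make that concrete, here is a quick Python sketch (my own toy code, not the spreadsheet) of Adam’s exponential moving average of squared gradients, assuming the usual beta2 = 0.999:

```python
# Adam's second-moment estimate (EMA of squared gradients) for the two
# derivative histories above. beta2 = 0.999 is the usual default.
def second_moment(grads, beta2=0.999):
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2  # squaring throws away the sign
    return v

w1 = [500] * 12
w2 = [500 if i % 2 == 0 else -500 for i in range(12)]

print(second_moment(w1))  # identical to the line below...
print(second_moment(w2))  # ...because (+500)**2 == (-500)**2
```

Both calls print exactly the same number, which is why I think Adam’s denominator cannot tell the two histories apart.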

I was having the same doubt as you, and after some reading I think I understand the source of the misunderstanding.

First of all, here is an extract from the original Adam paper:

Note the authors define the moving average of the squared gradient as an estimate of the 2nd raw moment (the uncentered variance). Contrary to the centered version, which most people are more used to, the uncentered version does not subtract the mean from each observation before squaring it. In practical terms it is the same as using the centered version but assuming the mean to be equal to zero.
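For reference, the estimates in question are (written out as I recall them from the paper, using its notation, with $g_t$ the gradient at step $t$):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad \text{(1st moment, the mean)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad \text{(2nd raw moment, the uncentered variance)}$$

$$\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected versions of $m_t$ and $v_t$.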

So when variance is mentioned in the Adam context, it is understood that we are taking the gradient mean to be equal to zero. Which makes sense, since there is no reason to think positive or negative gradients would be more probable than one another.

The problem is that, according to my interpretation of Jeremy’s explanation, it seems the gradient’s mean is being calculated from previous observations, since he says that if the gradient is jumping around we are going to get a higher variance. However, we don’t: the (uncentered) variance is the same whether we have jumping gradients or a fixed one; only the absolute value matters.

However, the first moment term is in fact sensitive to this jumping around, and that is what makes the overall step size smaller. As an example, I made a spreadsheet using your W1 and W2 gradient histories.
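Here is the same calculation as a small Python sketch instead of the spreadsheet (again my own code, not the workbook; I’m assuming the standard defaults lr = 0.01, beta1 = 0.9, beta2 = 0.999, eps = 1e-8):

```python
# Run full Adam updates over the two gradient histories and look at the
# size of the step each one produces at the end.
def adam_steps(grads, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # 1st moment (EMA of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2     # 2nd raw moment (EMA of squared gradients)
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (v_hat ** 0.5 + eps))
    return steps

w1 = [500] * 12
w2 = [500 if i % 2 == 0 else -500 for i in range(12)]

print(adam_steps(w1)[-1])  # about 0.01: steady gradients keep the full step
print(adam_steps(w2)[-1])  # roughly 20x smaller in magnitude: the sign flips cancel out in m
```

The denominator (the square root of the bias-corrected second moment) is 500 in both cases; what changes is the numerator, which shrinks to around -26 for W2 because the alternating signs keep cancelling each other out.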

So, in other words: no, Adam does not treat W1 and W2 equally. But not because of its variance (as was wrongly implied in Lesson 5) but because of its first Momentum.

Sorry about the long text and I hope that you understand the concept better now :slight_smile:

Note that this is just my interpretation and could very well be wrong. So I would be very grateful if @jeremy, @rachel, or anyone really, could weigh in on whether what I wrote is right or not.

@Dreyer, this should be “but because of its first moment”, correct?

Thanks for the great explanation; that helps a lot!