# Does Adam penalize variance of derivatives? Or magnitude of derivatives?

I’m somewhat confused by @jeremy 's explanation of Adam in the Lesson 5 video (https://youtu.be/J99NV9Cr75I?t=2h6m49s). He says:

" Now what do you want to do if you have a number that’s first positive then negative then small then high? You probably want to be more careful… you probably don’t want to take a big step because you can’t really trust it. So when the variance of the gradient is high, you’re going to divide the learning rate by a big number. Whereas if the learning rate is of the same size all the time, then we probably feel very good about the step, so we’re dividing it by a smaller amount."

In the workbook (graddesc.xlsm), however, it looks like Adam penalizes (by reducing the learning rate of) weights whose derivatives have a high magnitude, rather than those whose derivatives have a high variance.

For example, consider two weights: the first with a history of derivatives that is perfectly constant, the second with a history of derivatives that has a very high variance:
W1: [500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500]
W2: [500, -500, 500, -500, 500, -500, 500, -500, 500, -500, 500, -500]

W1's derivatives have a variance of zero, while W2's have a very high variance. Yet since the magnitudes of the derivatives are identical, I think Adam penalizes both weights identically, right?
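To check this, here is a minimal sketch of Adam's second-moment accumulator (the exponentially weighted moving average of squared gradients, `v` in the paper, with the usual default `beta2 = 0.999`) applied to both histories; the function name is my own:

```python
def second_moment(grads, beta2=0.999):
    """Exponentially weighted moving average of squared gradients (Adam's v)."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return v  # bias correction omitted; it would be identical for both histories

w1 = [500] * 12                                  # constant derivatives
w2 = [500 if i % 2 == 0 else -500 for i in range(12)]  # sign-flipping derivatives

# Squaring erases the sign, so both histories yield exactly the same v,
# and Adam scales both learning rates down by the same factor.
print(second_moment(w1) == second_moment(w2))  # True
```

Since `(-500) ** 2 == 500 ** 2` at every step, the accumulators are equal term by term.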

I was having the same doubt, and after some reading I think I understand the source of the misunderstanding.

First of all, here is an extract from the original Adam paper:

Note that the authors define the moving average of the squared gradient as an estimate of the 2nd raw moment (the uncentered variance). Contrary to the centered version, which most people are more used to, the uncentered version does not subtract the mean from each observation before squaring it. In practical terms, it is the same as using the centered version while assuming the mean to be equal to zero.
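A toy comparison makes the distinction concrete. Using the two gradient histories from the question, the centered variance tells the two apart, but the raw (uncentered) second moment, which is what Adam tracks, does not:

```python
def mean(xs):
    return sum(xs) / len(xs)

def centered_variance(xs):
    """Average squared deviation from the mean (the 'usual' variance)."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def raw_second_moment(xs):
    """Average of the squares, i.e. variance under an assumed mean of zero."""
    return mean([x ** 2 for x in xs])

w1 = [500] * 12
w2 = [500 if i % 2 == 0 else -500 for i in range(12)]

print(centered_variance(w1))   # 0.0      -> W1 looks perfectly stable
print(centered_variance(w2))   # 250000.0 -> W2 looks very noisy
print(raw_second_moment(w1))   # 250000.0 -> but the uncentered estimate
print(raw_second_moment(w2))   # 250000.0 -> is identical for both
```

(Adam uses an exponentially weighted moving average rather than a plain mean, but that does not change the comparison.)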

So when variance is mentioned in the Adam context, it is understood that we are taking the gradient's mean to be zero. This makes sense, since a priori there is no reason to think positive gradients are any more probable than negative ones.

The problem is that, according to my interpretation of Jeremy's explanation, it sounds as if the gradient's mean were being estimated from previous observations, since he says that a gradient that jumps around will produce a higher variance. However, it won't: the uncentered variance is the same whether the gradient keeps flipping sign or stays fixed; only its absolute value matters.