I was having the same doubt as you, and after some reading I think I understand the source of the misunderstanding.
First of all, here is an extract from the original Adam paper:
Note that the authors define the moving average of the squared gradient as an estimate of the 2nd raw moment (uncentered variance). Contrary to the centered version, which most people are more used to, the uncentered version does not subtract the mean from each observation before squaring it. In practical terms it is the same as using the centered version but assuming the mean is equal to zero.
So when variance is mentioned in the Adam context, it is understood that we are considering the gradient mean to be equal to zero. Which makes sense, since there is no reason to think positive or negative gradients would be more probable than one another.
The problem is that, according to my interpretation of Jeremy’s explanation, it sounds as if the gradient’s mean is being estimated from previous observations, since he says that if the gradient is jumping around we are going to get a higher variance. However, we don’t. The second-moment estimate is going to be the same whether we have a jumping gradient or a fixed one; only the absolute value of the gradient matters.
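To make this concrete, here is a minimal sketch of Adam’s second-moment update, v = beta2 * v + (1 - beta2) * g², applied to a gradient that flips sign every step versus one that stays constant. The gradient values are made up for illustration:

```python
def second_moment(grads, beta2=0.999):
    """Adam's (unbias-corrected) moving average of the squared gradient."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return v

jumping = [0.5, -0.5, 0.5, -0.5, 0.5]  # hypothetical gradient flipping sign
steady  = [0.5, 0.5, 0.5, 0.5, 0.5]    # hypothetical constant gradient

# Squaring erases the sign, so both histories produce the same estimate.
print(second_moment(jumping) == second_moment(steady))  # True
```

Since the sign is squared away, the "variance" term simply cannot tell a jumping gradient from a steady one of the same magnitude.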
However, the first moment term is in fact sensitive to this jumping around, and it is what makes the overall step size smaller. As an example, I made a spreadsheet using your W1 and W2 gradient histories.
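The spreadsheet calculation can also be sketched in code (again with made-up gradient values rather than the actual W1/W2 history): the first-moment update m = beta1 * m + (1 - beta1) * g shrinks toward zero when the gradient keeps flipping sign, so the effective Adam step, proportional to m / sqrt(v), ends up smaller for the jumping weight:

```python
import math

def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return the (unbias-corrected) first and second moment estimates."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return m, v

w1_grads = [0.5, -0.5, 0.5, -0.5, 0.5]  # hypothetical jumping gradient
w2_grads = [0.5, 0.5, 0.5, 0.5, 0.5]    # hypothetical steady gradient

m1, v1 = adam_moments(w1_grads)
m2, v2 = adam_moments(w2_grads)

# Second moments match, but the jumping gradient's first moment has
# largely cancelled itself out, so its step m / sqrt(v) is much smaller.
print(v1 == v2)            # True
print(abs(m1) < abs(m2))   # True
print(abs(m1) / math.sqrt(v1), abs(m2) / math.sqrt(v2))
```

So the smaller step for the jumping weight comes entirely from the numerator (the momentum term), not from the denominator.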
So in other words: no, Adam does not treat W1 and W2 equally. But that is not because of its variance term (as was wrongly implied in lesson 5) — it is because of its first moment.
Sorry about the long text, and I hope you understand the concept better now.
Note that this is just my interpretation and could very well be wrong. So I would be very grateful if @jeremy, @rachel, or anyone really, could weigh in on whether what I wrote is right or not.