Question about ADAM: Lesson 5

Hey guys,

Just finished Lesson 5 and wanted to ask two (probably silly) questions about the last bit - the AdamW portion:

  1. I’ve always thought of a moving average as an average computed across a window of time. In the lecture, the term “exponentially weighted moving average” was used to describe the term that the learning rate is divided by (i.e. the second moment). Why is that an average? Isn’t it just adding a weighted version of the latest gradient squared onto the previous value?

i.e. I thought a moving average is (x1 + x2 + x3)/3, then (x2 + x3 + x4)/3 as we move down a list of x’s. The exponentially weighted version seems to me to be just a mix of the previous value and the newest x, rather than an average over a window? (Quick sketch of what I mean below.)
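
Here’s the toy script I wrote to check this (my own code, not from the lesson - the values in `xs` are made up). It unrolls the recursive EWMA formula into an explicit weighted sum:

```python
# Unrolling the EWMA v_t = beta * v_{t-1} + (1 - beta) * x_t by hand,
# to see whether it really behaves like an average over the whole history.

beta = 0.9
xs = [4.0, 1.0, 3.0, 2.0, 5.0]   # pretend these are grad**2 values at each step

# Recursive form (how I understand the second moment is computed)
v = 0.0
for x in xs:
    v = beta * v + (1 - beta) * x

# Explicit form: a weighted sum over *all* past x's, newest first
weights = [(1 - beta) * beta ** k for k in range(len(xs))]   # k = 0 is the newest x
unrolled = sum(w * x for w, x in zip(weights, reversed(xs)))

print(v, unrolled)                        # same value both ways
print(sum(weights), 1 - beta ** len(xs))  # weights sum to 1 - beta**t, which tends to 1
```

So every past x does get a (shrinking) weight, not just the last one or two - and I’m guessing the fact that the weights sum to 1 - beta**t rather than exactly 1 is what the bias correction term divides out? Is that the right way to read it?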

  2. Similarly, I was confused at first by the intuition about variance and the gradient squared. I think I’ve figured it out, but wanted to check my train of thought: if the gradients squared at the steps before the current one are much larger, the resulting step will be smaller, which captures the intuition of high variance.

The reverse being that if the previous gradients squared were much smaller, the step would be bigger. And if the gradients squared still aren’t consistent after the current step, the step size gets lowered again? (Toy sketch of what I mean below.)
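
For this one, here’s the toy version of the update I have in my head (again my own simplified code with made-up gradient values - just the second-moment scaling, no momentum and no bias correction):

```python
import math

# Simplified Adam-style step for a single parameter: keep an EWMA of grad**2
# and divide the step by its square root (no momentum, no bias correction).
def effective_step(grads, lr=0.1, beta2=0.9, eps=1e-8):
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2    # second moment estimate
    return lr * grads[-1] / (math.sqrt(v) + eps)  # size of the final update

# Same final gradient (1.0), different history:
steady = [1.0, 1.0, 1.0, 1.0]    # consistent gradients
noisy  = [9.0, -0.5, 7.0, 1.0]   # large swings before the final gradient

print(effective_step(steady))    # larger step: v stays small
print(effective_step(noisy))     # smaller step: the earlier big grad**2 keep v large
```

Which, if I’ve understood it, is the behaviour I was describing: a history of big gradients squared keeps v large and shrinks the step, and the step only grows again once the gradients have been consistently small for a while.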

Just wanted to clarify, appreciate any help :slight_smile: