I have some doubts about how RMSprop is calculated in the spreadsheet graddesc.xlsm compared with the published formula. In the spreadsheet I see calculations like this:
But in some articles, the value under the square root is not the previous step’s “exponentially weighted average of the squared gradients” but the one just calculated for the current step. I recalculated with the second option and got a worse result than what Jeremy shows in the video.
Does it matter? Maybe I just don’t understand the calculations.
Interesting observation; I hadn’t noticed this. If we use the current gradient, as described here http://ruder.io/optimizing-gradient-descent/index.html#rmsprop, we should be dividing by sqrt(J7) to compute H7. I don’t think it should matter much.
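To make the two options concrete, here is a minimal Python sketch of one RMSprop step for a single scalar weight. The function name, the `use_current` flag, the default hyperparameters and the small `eps` term are my own assumptions for illustration; they are not taken from the spreadsheet.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8, use_current=True):
    """One RMSprop step for a scalar weight (illustrative sketch only).

    use_current=True  divides by the freshly updated average of squared
                      gradients, as in the linked article.
    use_current=False divides by the average carried over from the previous
                      step, which is how the question describes the sheet.
    """
    # Exponentially weighted average of the squared gradient.
    new_avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    # Pick which average goes under the square root.
    denom_avg = new_avg_sq if use_current else avg_sq
    # eps is an assumed guard against division by zero.
    new_w = w - lr * grad / (math.sqrt(denom_avg) + eps)
    return new_w, new_avg_sq

# Same inputs, two variants: only the size of the step differs.
w_cur, _ = rmsprop_step(1.0, 2.0, 1.0, lr=0.1, use_current=True)
w_prev, _ = rmsprop_step(1.0, 2.0, 1.0, lr=0.1, use_current=False)
print(w_cur, w_prev)
```

Once the running average has settled, the previous and current averages are close to each other, so the two variants should give similar steps.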
Edit: In the Adam sheet, you’ll see that Jeremy uses the current square of the gradient.
I am fairly sure you are correct in your observation. The difference is only noticeable when you clear out the values and initialize J3 and K3 to 0. If you do this, it matters. Thanks for pointing it out. I came to ask the same question, as I am finally starting to get to this level of comprehension.
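A rough sketch of why zero initialization exposes the difference: on the very first step, the previous average is 0, so dividing by its square root leaves only the guard term in the denominator. The helper name, the hyperparameters and the `eps` guard below are assumptions for illustration; without such a guard the division would simply be by zero.

```python
import math

def first_step_size(grad, avg_sq0, lr=0.01, beta=0.9, eps=1e-8, use_current=True):
    """Size of the first RMSprop update for a scalar weight (illustrative only)."""
    # Average of squared gradients after seeing the first gradient.
    new_avg = beta * avg_sq0 + (1 - beta) * grad ** 2
    # Either the updated average or the initial one goes under the root.
    denom = new_avg if use_current else avg_sq0
    return lr * abs(grad) / (math.sqrt(denom) + eps)

# Starting from an average of 0:
print(first_step_size(2.0, 0.0, use_current=True))   # modest step
print(first_step_size(2.0, 0.0, use_current=False))  # enormous step, limited only by eps
```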