Weight decay effect on Adam/momentum


I don’t fully understand what the problem is with weight decay and momentum and/or Adam. I’m thinking of the end of Lesson 5, specifically at this point

I understand that the L2 term ends up in the (exponentially weighted) moving averages of the gradients and of the squared gradients. However, I don’t get the “when there is a lot of variation we end up decreasing the amount of weight decay, and if there is little variation we end up increasing the amount of weight decay” part. I was assuming that the “amount of weight decay” is fixed and set by the parameter that multiplies the sum of the squares of the parameters (the 0.0005 factor) but, seemingly, that’s not what Jeremy calls the amount of weight decay here.
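To make the quoted statement concrete, here is a rough sketch under my own assumptions (it is not code from the lesson). It folds the L2 term into the gradient the way Adam with L2 regularization would, keeps a single weight fixed at 1.0, and uses zero-mean Gaussian noise as a stand-in for the loss gradient. The helper name `effective_decay` and the noise levels are made up for illustration. The quantity it returns, `wd / (sqrt(v_hat) + eps)`, is the factor that multiplies the weight in the decay part of the Adam step (before the learning rate), which may be what “amount of weight decay” refers to.

```python
import math
import random

def effective_decay(grad_std, wd=0.0005, beta2=0.999, eps=1e-8, steps=5000, seed=0):
    """Hypothetical helper: per-weight decay factor wd / (sqrt(v_hat) + eps)
    that Adam applies when the L2 term is added to the gradient."""
    rng = random.Random(seed)
    w = 1.0   # weight held fixed so only the decay scaling is observed
    v = 0.0   # moving average of squared gradients
    for _ in range(steps):
        # Assumption: the loss gradient is zero-mean noise with std grad_std.
        g = rng.gauss(0.0, grad_std) + wd * w   # L2 term folded into the gradient
        v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** steps)            # Adam's bias correction
    return wd / (math.sqrt(v_hat) + eps)

noisy = effective_decay(grad_std=1.0)    # gradients vary a lot
quiet = effective_decay(grad_std=0.01)   # gradients vary little
print(f"noisy gradients: {noisy:.6f}")
print(f"quiet gradients: {quiet:.6f}")
```

With the same `wd`, the weight whose gradients vary a lot ends up with a much smaller factor than the weight whose gradients vary little, because the decay term is divided by the same `sqrt(v_hat)` as everything else in the step.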

I kind of get the final solution to the “problem”, but I don’t see the connection between the variation of the gradient and the “amount of weight decay”.

Any clue?




Also interested in a math proof of this :slight_smile:
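
Not a rigorous proof, but a sketch of the argument as I understand it, using the standard Adam update. The symbols are my own notation, not the lesson's: $\eta$ is the learning rate, $\lambda$ the weight decay factor (the 0.0005), $L$ the loss without the penalty, and $m_t^{L}$, $\bar{w}_t$ the moving averages of $\nabla L$ and of the weights.

With the L2 term folded into the gradient:

$$
g_t = \nabla L(w_t) + \lambda w_t, \qquad
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
$$

$$
w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Since $m_t$ is linear in $g_t$, it splits as $m_t = m_t^{L} + \lambda \bar{w}_t$, so the step becomes

$$
w_{t+1} = w_t
- \eta\, \frac{\hat{m}_t^{L}}{\sqrt{\hat{v}_t} + \epsilon}
- \frac{\eta \lambda}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{\bar{w}}_t
$$

The last term is the decay, and its coefficient is $\eta\lambda / (\sqrt{\hat{v}_t} + \epsilon)$ per weight rather than a fixed $\eta\lambda$: a large $\hat{v}_t$ (gradients that vary a lot) shrinks it and a small $\hat{v}_t$ inflates it. For comparison, applying the decay outside the adaptive step (with $v_t$ built from $\nabla L$ only) would give

$$
w_{t+1} = w_t - \eta\, \frac{\hat{m}_t^{L}}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda\, w_t
$$

where the decay coefficient no longer depends on $\hat{v}_t$.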