Hello, I was recently reviewing optimizers, and while I mostly (intuitively) understand SGD, RMSProp, and Adam, I am having a bit of an issue understanding the dampening term. I understand that it decreases the contribution of the most recent gradients, but I am not sure about the "why". I believe I found the paper that introduced the term, though the experiments only seem to apply to RNNs: http://www.icml-2011.org/papers/532_icmlpaper.pdf (still reading it myself). The authors seem mostly concerned with maintaining the RNN's hidden state, which strikes me as a fairly RNN-specific concern.
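For reference, PyTorch's SGD applies dampening to the momentum buffer as buf = mom * buf + (1 - dampening) * grad, so setting dampening = mom turns the buffer into an exponential moving average of the gradients, whereas with dampening = 0 it is a discounted sum that can grow to roughly 1 / (1 - mom) times the typical gradient magnitude.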
Is there somewhere else where dampening was shown to help CNNs, or should I just run some experiments myself?
Example of dampening:
import torch

def average_grad(p, mom, dampening=False, grad_avg=None, **kwargs):
    "Keeps track of the avg grads of `p` in `state` with `mom`."
    if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
    damp = 1 - mom if dampening else 1.
    # In place: grad_avg = mom * grad_avg + damp * grad
    # (add_ takes the tensor first and the scale via alpha=; the old
    # add_(scalar, tensor) form is deprecated)
    grad_avg.mul_(mom).add_(p.grad.data, alpha=damp)
    return {'grad_avg': grad_avg}
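A minimal sketch of how the two settings differ (assuming a toy parameter with a constant unit gradient; the loop and variable names here are mine, not from any library):

import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.ones_like(p)  # pretend we always see a gradient of 1

state = {}
for _ in range(50):
    state = average_grad(p, mom=0.9, dampening=False, **state)
print(state['grad_avg'])  # no dampening: grows toward 1 / (1 - 0.9) = 10

state = {}
for _ in range(50):
    state = average_grad(p, mom=0.9, dampening=True, **state)
print(state['grad_avg'])  # dampening: a true EMA, converges to 1

So without dampening the buffer's scale depends on mom, while with dampening it stays on the same scale as the raw gradients.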