Hello, I was recently reviewing optimizers, and while I mostly (intuitively) understand SGD, RMSProp, and Adam, I am having a bit of an issue understanding the dampening term. I understand that it decreases the contribution of the most recent gradients, but I am not sure about the "why". I believe I found the paper that introduced the term, though the experiments only seem to apply to RNNs: http://www.icml-2011.org/papers/532_icmlpaper.pdf (still reading it myself). The authors seem mostly concerned with maintaining the RNN's hidden state, which strikes me as a fairly RNN-specific concern.
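For reference, PyTorch's SGD applies dampening to the momentum buffer as buf = mom * buf + (1 - dampening) * grad, so setting dampening = mom turns the buffer into an exponential moving average of the gradients, whereas with dampening = 0 it is a discounted sum that can grow to roughly 1 / (1 - mom) times the typical gradient magnitude.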
Is there somewhere else where dampening was shown to help CNNs, or should I just run some experiments myself?
Example of dampening:
import torch

def average_grad(p, mom, dampening=False, grad_avg=None, **kwargs):
    "Keeps track of the avg grads of `p` in `state` with `mom`."
    if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
    damp = 1 - mom if dampening else 1.
    # In place: grad_avg = mom * grad_avg + damp * grad
    # (add_ takes the tensor first and the scale via alpha=; the old
    # add_(scalar, tensor) form is deprecated)
    grad_avg.mul_(mom).add_(p.grad.data, alpha=damp)
    return {'grad_avg': grad_avg}
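A minimal sketch of how the two settings differ (assuming a toy parameter with a constant unit gradient; the loop and variable names here are mine, not from any library):

import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.ones_like(p)  # pretend we always see a gradient of 1

state = {}
for _ in range(50):
    state = average_grad(p, mom=0.9, dampening=False, **state)
print(state['grad_avg'])  # no dampening: grows toward 1 / (1 - 0.9) = 10

state = {}
for _ in range(50):
    state = average_grad(p, mom=0.9, dampening=True, **state)
print(state['grad_avg'])  # dampening: a true EMA, converges to 1

So without dampening the buffer's scale depends on mom, while with dampening it stays on the same scale as the raw gradients.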