I am reading chapter 16, and I have a question about momentum. The book explains the step as follows:
weight.avg = beta * weight.avg + (1-beta) * weight.grad
However, the code given for average_grad is:
def average_grad(p, mom, grad_avg=None, **kwargs):
if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
return {'grad_avg': grad_avg*mom + p.grad.data}
which does not include the (1-beta) factor. Is this intentional? I tried training for 10 epochs with and without (1-beta), and I found that the version with (1-beta) worked better. I also looked at the fastai source code, and there (1-beta) is only applied when dampening=True, which defaults to False:
def average_grad(p, mom, dampening=False, grad_avg=None, **kwargs):
"Keeps track of the avg grads of `p` in `state` with `mom`."
if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
damp = 1-mom if dampening else 1.
grad_avg.mul_(mom).add_(p.grad.data, alpha=damp)
return {'grad_avg': grad_avg}
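For context on what the missing factor changes, here is a standalone toy re-implementation (not the fastai API, just the same recurrence written out) run with a constant gradient. Without (1-beta), the running quantity converges to grad/(1-mom) rather than to the average gradient, i.e. the result is the same up to a constant rescaling by 1/(1-mom):

```python
import torch

def ema_step(grad, mom, grad_avg, dampening=False):
    # One step of the running average; `dampening` toggles the (1-beta) factor.
    damp = 1 - mom if dampening else 1.
    return grad_avg * mom + grad * damp

g = torch.tensor(1.0)  # constant gradient
avg_damp = torch.tensor(0.0)
avg_nodamp = torch.tensor(0.0)
for _ in range(200):
    avg_damp = ema_step(g, 0.9, avg_damp, dampening=True)
    avg_nodamp = ema_step(g, 0.9, avg_nodamp, dampening=False)

print(avg_damp.item())    # ≈ 1.0, the true average of the gradients
print(avg_nodamp.item())  # ≈ 10.0, i.e. grad / (1 - mom)
```

So with a roughly constant gradient the two variants differ only by a factor of 1/(1-mom), which the learning rate could absorb, but I am not sure if that is the reasoning behind the default.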
Does anybody know whether this is intentional or a mistake? And if it is intentional, why?