I think there may be a bug in the computation of SGD with momentum.
Specifically, I think the smoothed version of the gradient should be initialized to the first gradient value, not to zero as prescribed in the course:
```python
def average_grad(p, mom, grad_avg=None, **kwargs):
    # On the first call, seed the moving average with the raw gradient
    if grad_avg is None:
        grad_avg = p.grad.data.clone()
    else:
        grad_avg = grad_avg*mom + p.grad.data*(1 - mom)
    return {'grad_avg': grad_avg}
```
Also, the `1 - mom` factor is missing (at least in the course notebook).
By doing so, you do not need to divide by `1 - beta**(i+1)` (which seems to have been done in the chapter 16 examples to ensure that the first value of the averaged gradient matches the first gradient sample).
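To illustrate the point, here is a small scalar sketch (with made-up gradient values) comparing the two schemes: initializing the average to the first gradient versus zero-initializing and dividing by `1 - beta**(i+1)`. Both return the raw first gradient at the first step:

```python
# Hypothetical scalar illustration (gradient values made up) of why seeding
# the moving average with the first gradient removes the need for the
# 1 / (1 - beta**(i+1)) bias correction at the first step.
beta = 0.9
grads = [4.0, 3.0, 2.0, 1.0]  # fake gradient stream

# Scheme A: initialize the average to the first gradient (the proposed fix).
avg_a = None
history_a = []
for g in grads:
    avg_a = g if avg_a is None else avg_a*beta + g*(1 - beta)
    history_a.append(avg_a)

# Scheme B: initialize to zero, then debias by 1 - beta**(i+1) (chapter 16).
avg_b = 0.0
history_b = []
for i, g in enumerate(grads):
    avg_b = avg_b*beta + g*(1 - beta)
    history_b.append(avg_b / (1 - beta**(i + 1)))

# At the first step both schemes recover the raw first gradient.
print(history_a[0], history_b[0])
```

Note that the two schemes only coincide at the first step; afterwards the bias-corrected average down-weights the first gradient faster than the seeded one does, so they are not equivalent in general.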