Momentum (16_accel_sgd)

I think there may be a bug in the SGD with momentum implementation.
Specifically, I think the smoothed gradient should be initialized to the first gradient value (and not to zero, as prescribed in the course):

    def average_grad(p, mom, grad_avg=None, **kwargs):
        if grad_avg is None:
            grad_avg = p.grad.data
        else:
            grad_avg = grad_avg*mom + p.grad.data * (1 - mom)
        return {'grad_avg': grad_avg}

Also, the (1 - mom) factor is missing (at least in the course notebook).
With this initialization, you do not need to divide by 1 - beta**(i+1) (which seems to have been done in the chapter 16 examples in order to ensure that the first value of the averaged gradient matches the first gradient sample).
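
For comparison, here is a minimal sketch of the bias-corrected alternative referred to above. The step argument and the debiased key are my own hypothetical additions for illustration, not part of the fastai stepper interface:

    import torch

    # Hypothetical sketch, not the course code: keep the zero-initialized
    # running average, but also compute a debiased copy by dividing by
    # 1 - mom**(step+1), as in Adam-style bias correction.
    def average_grad_debiased(p, mom, step, grad_avg=None, **kwargs):
        if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
        grad_avg = grad_avg*mom + p.grad.data*(1 - mom)   # biased toward zero early on
        debiased = grad_avg / (1 - mom**(step + 1))       # equals p.grad.data at step 0
        return {'grad_avg': grad_avg, 'debiased': debiased}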

Concerning the first comment, the implementation in the notebook does the same as what you suggest. In the notebook, if grad_avg is None, it is set to zeros of the same shape as p.grad.data. The function then computes [0,0,0,...,0]*mom + p.grad.data; the first term vanishes, so the function returns just p.grad.data.
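
Based on that description, the notebook's version looks roughly like this (a sketch from memory, not copied verbatim from the notebook):

    import torch

    # Sketch of the notebook behaviour described above: grad_avg starts at
    # zeros, so the first call returns 0*mom + p.grad.data == p.grad.data.
    def average_grad(p, mom, grad_avg=None, **kwargs):
        if grad_avg is None: grad_avg = torch.zeros_like(p.grad.data)
        return {'grad_avg': grad_avg*mom + p.grad.data}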

On the second comment, I just saw that this was discussed on GitHub, and it turns out it was intentional. See #174.

Cheers