Trying to create an update function for SGD with momentum

I seem to be getting stuck with this implementation. From what I am seeing, my parameter values don't seem to be updating, and the loss decreases very slowly. What could I be doing wrong? Does p.sub_ actually change the weights, since it is inside a for loop?

# we now want to do the update with momentum
# momentum takes the derivative, multiplies it by 0.1, then takes the previous
# update, multiplies it by 0.9, and adds the two together
# alpha = 0.1, beta = 0.9;  p -= grad*alpha + prev_update*beta
p_delta = {}  # momentum buffer: previous update for each parameter, keyed by index

def update(x, y, lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay: sum of squared weights, added to the regular loss below
    w2 = 0.
    for p in model.parameters():
        w2 += (p**2).sum()
    loss = loss_func(y_hat, y) + w2 * wd
    loss.backward()
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            # p.grad is the slope of the loss with respect to that parameter
            if i not in p_delta:  # first step: start from a zero momentum buffer
                p_delta[i] = torch.zeros_like(p)
            p_update = (lr * p.grad) + (p_delta[i] * 0.9)
            p_delta[i] = p_update.clone()
            p.sub_(p_update)  # in-place: this does modify the actual weights
            p.grad.zero_()
            print(p_delta[i])  # debug: inspect the running update
    return loss.item()
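(For reference on the p.sub_ part: it does change the weights in place, since model.parameters() yields the actual parameter tensors, not copies, and sub_ is an in-place op. A quick standalone check, using a throwaway nn.Linear just for illustration:)

import torch
import torch.nn as nn

m = nn.Linear(2, 1)
before = m.weight.clone()
with torch.no_grad():
    for p in m.parameters():
        p.sub_(0.1)  # in-place subtract on the real parameter tensor
print(torch.equal(before, m.weight))  # False -> the weights did change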

EDIT: I have updated my code. I think the code in the Excel spreadsheet is incorrect. Jeremy seems to show: lr * ((p.grad*0.1) + (p_delta[i]*0.9)), but many tutorials show: (lr * p.grad) + (p_delta[i]*0.9). If we implement Jeremy's version, the loss actually decreases more slowly than with vanilla GD. The relevant part of the video: https://youtu.be/CJKnDu2dxOE?t=6581

Can anyone clarify, or tell me if I am on the right (or wrong) track?

EDIT2: Welp, looking at the loss graph of SGD with momentum, it looks very similar to: lr * ((p.grad) + (p_delta[i]*0.9)). So what gives? Why do some tutorials multiply lr*grad and then add it to the previous update times beta, versus the way Jeremy showed?
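To make the comparison concrete, here is a sketch of the two rules written as standalone step functions (the names are mine, just for illustration; prev_update plays the role of p_delta[i] above):

def step_spreadsheet(grad, prev_update, lr):
    # the form from the video: lr outside, current gradient dampened by 0.1
    return lr * (0.1 * grad + 0.9 * prev_update)

def step_tutorial(grad, prev_update, lr):
    # the form most tutorials show: lr scales only the current gradient
    return lr * grad + 0.9 * prev_update

# both are applied the same way: p.sub_(step), and the step becomes the next prev_update
g, v, lr = 1.0, 0.0, 0.1
print(step_spreadsheet(g, v, lr))  # ~0.01 -> first step is 10x smaller
print(step_tutorial(g, v, lr))     # 0.1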

Hi there,
I also could not implement SGD with momentum, although I used the PyTorch docs as a reference.

As far as I understood, the idea is to detach each parameter in order to somehow store the old gradients, which should be accounted for when updating the parameters.

Below is my (not working) code:

from collections import defaultdict  # needed for the state dict below

def update(x, y, lr):
    state = defaultdict(dict)  # note: re-created every call and never read below
    wd = 1e-5
    momentum = 0.9             # note: never applied in the update below
    y_hat = model(x)
    # weight decay
    w2 = 0.
    bufs = []
    for p in model.parameters():
        w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2 * wd
    loss.backward()
    # detached copies of the parameters (not their gradients) as "buffers"
    for p in model.parameters():
        buf = torch.clone(p).detach()
        bufs.append(buf)

    with torch.no_grad():
        for p in model.parameters():
            # the buffers are never used here, so this step is plain SGD
            p.sub_(lr * p.grad)
            p.grad.zero_()
    return loss.item(), bufs
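For comparison, here is a minimal sketch of how the buffer could actually drive the update, following the rule in the PyTorch SGD docs (buf = momentum * buf + grad, then p -= lr * buf). The function name is just for illustration; model and loss_func are assumed from the surrounding code, and the buffers live outside the function so they persist across calls:

momentum_bufs = {}  # one persistent buffer per parameter, keyed by index

def update_with_momentum(x, y, lr, momentum=0.9, wd=1e-5):
    y_hat = model(x)
    # weight decay, added to the regular loss as above
    w2 = 0.
    for p in model.parameters():
        w2 += (p**2).sum()
    loss = loss_func(y_hat, y) + w2 * wd
    loss.backward()
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            if i not in momentum_bufs:
                momentum_bufs[i] = torch.zeros_like(p)
            # buf = momentum * buf + grad  (PyTorch-style, dampening = 0)
            momentum_bufs[i].mul_(momentum).add_(p.grad)
            p.sub_(lr * momentum_bufs[i])
            p.grad.zero_()
    return loss.item()

With this, the buffer carries information across batches, which is what gives momentum its effect; creating it fresh inside the function (as above) restarts it on every call.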

Maybe a late reply, but I am wondering if you ever got it working. I am trying the same thing.

I think the problem of which formula to use is explained here: Explanation of Momentum

The way I understand it is that when the current gradient is not multiplied by 0.1, you have to multiply the lr by it.
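In other words, the two rules only differ by a constant factor on the learning rate. A quick numeric sketch with made-up gradient values (assuming a constant lr): the dampened rule step = lr * (0.1*g + 0.9*v) produces exactly the same steps as the tutorial rule once lr is scaled down by the 0.1 factor, so without that rescaling the dampened version takes steps ten times smaller, which would explain the loss falling more slowly than vanilla GD.

lr, grads = 0.5, [1.0, 0.8, 0.6, 0.4]

# dampened rule: keep a running average of gradients, scale by lr when stepping
v, steps_a = 0.0, []
for g in grads:
    v = 0.1 * g + 0.9 * v
    steps_a.append(lr * v)

# tutorial rule, with lr rescaled by the 0.1 dampening factor
s, steps_b = 0.0, []
for g in grads:
    s = (0.1 * lr) * g + 0.9 * s
    steps_b.append(s)

print(steps_a)  # identical to steps_b ...
print(steps_b)  # ... so the rules match once lr is multiplied by 0.1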