Autograd does not appear to be setting gradients: Lesson 4 / 04_mnist_basics.ipynb

Hey there,

I’m trying to reimplement the SGD MNIST digit classifier from lesson 4. I’m trying to do so without copying the lesson code exactly, and using PyTorch directly rather than via fastai. However, I’ve got stuck on a bug where grad seems to be None after I call loss.backward().

Here’s the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-22a0da261727> in <module>
     23 
     24         with torch.no_grad():
---> 25             weights -= weights.grad * LR
     26             bias -= bias * LR
     27 

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

If I run the code again, I get a different error, namely:

RuntimeError                              Traceback (most recent call last)
<ipython-input-25-455a55143419> in <module>
      7         predictions = xb@weights + bias
      8         loss = get_loss(predictions, yb)
----> 9         loss.backward()
     10 
     11         with torch.no_grad():

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    116                 products. Defaults to ``False``.
    117         """
--> 118         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    119 
    120     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     91     Variable._execution_engine.run_backward(
     92         tensors, grad_tensors, retain_graph, create_graph,
---> 93         allow_unreachable=True)  # allow_unreachable flag
     94 
     95 

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

And here’s what I think are the relevant parts of my code:

# Skipped DataLoader setup for brevity

def get_accuracy(predictions, actual):
    return (predictions >= 0.5).float() == actual

def get_loss(predictions, actual):
    normalised = predictions.sigmoid()
    return torch.where(actual == IS_7, 1 - normalised, normalised).mean()

def init_params(size, variance=1.0):
    return torch.randn(size, dtype=torch.float, requires_grad=True) * variance

weights = init_params((IMG_SIZE, 1))
bias = init_params(1)

for epoch in range(1):
    #  Iterate over dataset batches
    # xb is a tensor with the independent variables for the batch (tensor of pixel values)
    # yb         ""           dependent             ""            (which digit it is)
    for xb, yb in dl:
        print(xb.shape)
        predictions = xb@weights + bias
        loss = get_loss(predictions, yb)
        loss.backward()

        with torch.no_grad():
            weights -= weights.grad * LR # <-- Error here: unsupported operand type(s) for *: 'NoneType' and 'float'
            bias -= bias * LR
        
            weights.grad.zero_()
            bias.grad.zero_()

Some useful notes:

  • I also tried using .data instead of with torch.no_grad(), but that didn’t help. The with block seems to be the preferred method in the PyTorch tutorials (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html). A sketch of both styles is below these notes.
  • Using @ for the matrix multiplication in the predictions vs. torch.mm makes no difference.
  • I previously made a mistake with my tensor setup, but I think that’s all fixed now. weights.shape, bias.shape outputs (torch.Size([784, 1]), torch.Size([1])).
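
For reference, here is a minimal sketch of the two update styles I mean (not the lesson’s code; it just assumes weights and bias are leaf tensors whose .grad has already been populated by loss.backward(), as in the loop above):

# Style 1: update inside torch.no_grad() so the updates themselves are not tracked
with torch.no_grad():
    weights -= weights.grad * LR
    bias -= bias.grad * LR
    weights.grad.zero_()
    bias.grad.zero_()

# Style 2: the older .data idiom, modifying the underlying data directly
weights.data -= weights.grad.data * LR
bias.data -= bias.grad.data * LR
weights.grad.zero_()
bias.grad.zero_()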

Hi James. Your question stumped me for quite a while. The problem is in the init_params() function. The multiplication by variance changes the tensor that becomes weights from a leaf tensor into a non-leaf tensor. (PyTorch tracks any operations done after the requires_grad tensor is created, so the product is an intermediate node in the graph.) loss.backward() then does not populate .grad for the non-leaf weights.
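
Here is a minimal, self-contained sketch (not from the notebook) showing why the non-leaf tensor never gets a .grad:

import torch

leaf = torch.randn(3, requires_grad=True)   # created directly -> leaf tensor
non_leaf = leaf * 2.0                       # result of an operation -> non-leaf tensor

print(leaf.is_leaf, non_leaf.is_leaf)       # True False

non_leaf.sum().backward()

print(leaf.grad)       # tensor([2., 2., 2.]) -- populated
print(non_leaf.grad)   # None (recent PyTorch also prints a warning here)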

Try this:

def init_params(size, variance=1.0):
    return (torch.randn(size, dtype=torch.float)*variance).requires_grad_()
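
Because requires_grad_() is called after the multiplication, the tensor you actually keep is the one autograd treats as a leaf, so loss.backward() will populate its .grad.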

I was only able to figure this out because the latest PyTorch throws a warning when you access .grad for a non-leaf tensor. Previous versions would just silently give None for the gradient.

Thanks for the mental workout. :upside_down_face:


That’s it! More bugs to fix now, but that looks like it did the trick.

Hi guys, it appears I have the same situation. I tried to fix it the same way you explained, but it somehow doesn’t work for me.

So my code looks like this:

def linear1(xb): return xb@weights + bias

def loss_mnist(preds, train_y):
    return torch.where(train_y==1, 1-preds,preds).mean()

def apply_step(params, prn=True):
    preds=linear_model(xb,params)
    loss = loss_mnist(preds,yb)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = none
    if prn: print(loss.item())
    return preds

def calc_grad(xb,yb,model):
    preds = model(xb)
    loss = loss_mnist(preds,yb)
    loss.backward()

def train_epoch(model,lr,params):
    for xb,yb in dl:
        calc_grad(xb,yb,model)
        for p in params:
            p.requires_grad_()
            print(p.data)
            print(p)
            print(p.data.grad)
            print(lr)
            p.data -= p.grad*lr
            p.grad.zero_()
    
def init_params(size, variance=1.0):
    return (torch.randn(size, dtype=torch.float)*variance).requires_grad_()

def batch_accuracy(xb,yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    print(correct.float().mean())
    return correct.float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb),yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(),4)

I initialise the params here:

linear_model = nn.Linear(300*300,1)

lr = 1.0

weights = init_params((300*300,1))
bias = init_params(1)
params = weights,bias
params

and call train_epoch() here:

train_epoch(linear_model,lr,params)
validate_epoch(linear_model)

And I get these results:

tensor([[-0.0119],
        [-0.8714],
        [ 0.8034],
        ...,
        [ 1.0555],
        [ 1.2800],
        [-0.5259]])
tensor([[-0.0119],
        [-0.8714],
        [ 0.8034],
        ...,
        [ 1.0555],
        [ 1.2800],
        [-0.5259]], requires_grad=True)
None
1.0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-ef4915de697b> in <module>
----> 1 train_epoch(linear_model,lr,params)
      2 validate_epoch(linear_model)

<ipython-input-61-53d3568e96ab> in train_epoch(model, lr, params)
     27             print(p.data.grad)
     28             print(lr)
---> 29             p.data -= p.grad*lr
     30             p.grad.zero_()
     31 

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

Maybe I just don’t see it, but if anyone could give me a hint I would be very glad and thankful :slight_smile: