Subtle coding pitfall to avoid - lesson4/notebook 4

Symptom:

Accuracy between epochs while training remains almost the same (instead of steadily increasing before stabilizing to > 96%). Accuracy never reaches beyond 51% even after training for 20 epochs.

TL;DR: If you don’t use double parentheses in the init_params(()) call, given the way it is defined, the parameters don’t get tracked correctly by requires_grad_().

Debugging:

  1. Since accuracy was not increasing, the model wasn’t learning. Learning doesn’t happen if calculation of loss and gradients doesn’t happen correctly.

  2. I checked the loss after each training batch within a single epoch and it wasn’t steadily decreasing. This meant either the mnist_loss() function wasn’t getting calculated right or the gradients weren’t being updated correctly.

  3. I verified that the predictions and loss value were being computed correctly. This must mean the gradients weren’t right. Upon checking the calc_grad() function, sure enough, the gradients were off.

  4. Used the trick shown in Jeremy’s notebook: I called calc_grad() twice in succession, and the gradients weren’t added up as the notebook said they would be. This meant either the gradient calculation was incorrect (unlikely, since it’s a fairly stable library implementation) or I had messed up in marking the right variables for gradient calculation.

  5. I checked requires_grad_() and it was appropriately called in init_params().

  6. I next checked the call to init_params() (after a ton of other debugging and several hours going through everything again and again). Found the mistake: init_params(( )) needs to be called with double parentheses, but I was using single parentheses. This was messing up how the parameters need to be set up for requires_grad_() (check details below).
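
The check in step 4 can be sketched roughly as follows. This is a stand-alone illustration, not the notebook’s calc_grad(): the tensor w and the function calc_grad_demo are made up for the example. It relies on PyTorch adding to .grad on every backward() call until the gradient is zeroed.

```python
import torch

# Hypothetical stand-in for a model parameter (not from the notebook).
w = torch.ones(3, requires_grad=True)

def calc_grad_demo():
    # Toy loss whose gradient with respect to w is 2 for every element.
    loss = (w * 2).sum()
    loss.backward()

calc_grad_demo()
first = w.grad.clone()

calc_grad_demo()  # gradients are NOT zeroed in between
second = w.grad.clone()

print(first)   # gradient after one backward pass
print(second)  # gradient after two passes: twice the first
print(w.is_leaf, w.requires_grad)
```

If the second gradient is not double the first in a check like this, it is worth looking at the shapes of the tensors involved as well as at requires_grad.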

Details:

Below is a fairly innocuous-looking function definition:

def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()

And the call happens like so,

weights = init_params((28*28, 1)) -> double parentheses (a Python feature)

The thing to notice here is that in the call above, double parentheses need to be used. Read the below link for the difference between double and single parentheses in a function call in Python:

Without the right way of calling init_params(), the requires_grad_() function call within init_params doesn’t keep track of the parameters correctly (I don’t know why or how). But calling it with double parentheses fixes the issue.

If you use “double parentheses” then you are creating a tuple, so you’re passing one argument to the function: size, with a value of (784, 1).

Otherwise it’s passing 784 to size and 1 to std.
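
A quick way to see this is to run both calls and compare the results. The sketch below reuses the init_params definition from the post; the variable names a and b are just for illustration.

```python
import torch

def init_params(size, std=1.0):
    return (torch.randn(size) * std).requires_grad_()

# One argument: the tuple (784, 1) goes to size, std keeps its default.
a = init_params((28 * 28, 1))

# Two arguments: 784 goes to size, 1 goes to std.
b = init_params(28 * 28, 1)

print(a.shape)  # torch.Size([784, 1])
print(b.shape)  # torch.Size([784])
print(a.requires_grad, b.requires_grad)  # True True
```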

Do you happen to know what this semantic difference means to the requires_grad_() call?

You’re calling it with different arguments and getting back different shaped tensors.
Call it both ways and look at the results.

init_params((5, 1)) returns

tensor([[-0.5261],
        [ 0.1705],
        [-0.3321],
        [ 1.7996],
        [-1.1580]], requires_grad=True)

init_params(5, 1) returns

tensor([-1.8087,  0.4371,  0.5020, -2.4613,  1.0316], requires_grad=True)

It makes no difference to requires_grad; you’ve got autograd for both tensors, but the subsequent code using that tensor needs it to be the correct shape.
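
Here is a rough sketch of how the wrong shape can go unnoticed downstream. It assumes the targets are stored with shape (N, 1) and that accuracy is computed by thresholding the predictions and comparing them to the targets; the names xb, yb, w_col and w_flat are made up for the example, and the data is random.

```python
import torch

torch.manual_seed(0)

xb = torch.randn(4, 784)                     # fake batch of 4 flattened images
yb = torch.tensor([[1.], [0.], [1.], [0.]])  # assumed target shape: (4, 1)

w_col = torch.randn(784, 1)   # shape you get from init_params((784, 1))
w_flat = torch.randn(784)     # shape you get from init_params(784, 1)

preds_col = xb @ w_col        # shape (4, 1), lines up with yb
preds_flat = xb @ w_flat      # shape (4,), does not line up with yb

correct_col = (preds_col > 0).float() == yb
correct_flat = (preds_flat > 0).float() == yb  # silently broadcasts

print(correct_col.shape)   # torch.Size([4, 1])
print(correct_flat.shape)  # torch.Size([4, 4])
```

With the flat weights, every prediction is compared against every target and no error is raised. If about half the targets are 1, an accuracy computed this way stays close to 50% whatever the model predicts, which may be what the 51% in the original post reflects.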

Thank you @joedockrill. It all makes so much sense now.
