# Could someone help me verify/troubleshoot my implementation of backpropagation

I have a neural network with 2 layers.

• The first layer is a linear combination/dot product plus a ReLU: l_1 = \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1)
• The second layer is simply the linear combination/dot product: l_2 = \vec{w}_2 \cdot l_1 + b_2

All in all, it comes out to look like this.

\vec{w}_2 \cdot \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1) + b_2
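In code, that forward pass can be sketched like this (tiny made-up shapes, illustrative names; I'm writing the matmuls as x @ w, which is the order the actual matrix implementation uses):

```python
import torch

def forward(x, w1, b1, w2, b2):
    l1 = torch.relu(x @ w1 + b1)  # first layer: linear + ReLU
    return l1 @ w2 + b2           # second layer: linear only

# made-up shapes: 8 samples, 4 inputs, 3 hidden units, 1 output
x = torch.randn(8, 4)
w1, b1 = torch.randn(4, 3), torch.zeros(3)
w2, b2 = torch.randn(3, 1), torch.zeros(1)
print(forward(x, w1, b1, w2, b2).shape)  # torch.Size([8, 1])
```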

My loss function is MSE and hence comes out looking like this, where N is the number of samples, and y_i is the corresponding target.

\frac{1}{N} \sum^N_{i=1} \left(y_i - (\vec{w}_2 \cdot \text{max}(0, \vec{w}_1 \cdot \vec{x}_i + b_1) + b_2)\right)^2

After doing a bunch of maths, I'm quite sure that the gradient for w_1 is given by the following simplified formula, where l_2 is the output of the second layer.

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = -\frac{2}{N} \sum^N_{i=1} (y_i - l_2) \vec{w}_2^T \vec{x}^T_i

Now I'm having a bit of trouble implementing this backpropagation formula in Python. Below is my implementation.

w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)
w1_gs.max(), w1_gs.min()


The maximum and minimum gradients that are output are 0.01 and -0.01.

When I try to verify my gradients using PyTorch's backpropagation algorithm, I get different maximum and minimum values for the gradients: ~43 and ~31 respectively.

w1_ = w1.clone().requires_grad_(True)
l1 = relu(lin(trn_x, w1_, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y)
loss.backward()


I don't know whether there is a problem in my implementation of the backpropagation algorithm (and if there is, exactly where), or if there is a problem in how I'm calculating the gradients using PyTorch.

I would really appreciate it if somebody could help me verify it and let me know where I'm going wrong. If you need any more information, want to see my working for deriving the backpropagation formula, or need access to a notebook for experimenting, do let me know.

I'm also new to this, so some of these questions may not make sense, but hopefully they help in some way. I'm not verifying the algorithm, just the code. Some notes and questions:

• I believe it should start with w1_gs = (-2...; the - is missing at the moment.
• Do you need to unsqueeze? I'm not super familiar, but from looking online I believe something may be wrong there: are you trying to fix an issue with the dimensions from the previous element not fitting together well?
• The trn_y and l2 also seem to be swapped

With some help from AI, something like this?

w1_gs = (-2 / trn_x.shape[0]) * ((trn_y - l2[:, 0])[:, None] * w2.T).T @ trn_x


untested…

Do you have a collab or notebook you could share?

I appreciate the response!

For points 1 and 3, it seems I mistakenly wrote the wrong backprop formula above. While that formula is correct…

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = -\frac{2}{N} \sum^N_{i=1} (y_i - l_2) \vec{w}_2^T \vec{x}^T_i

…it can be further simplified by removing the minus sign. This simplified formula is the one I'm trying to implement.

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = \frac{2}{N} \sum^N_{i=1} (l_2 - y_i) \vec{w}_2^T \vec{x}^T_i

As for the second point, I do think I need to unsqueeze, or else the dimensions will not be compatible for broadcasting. trn_x has shape torch.Size([50000, 784]) while w2 has shape torch.Size([50, 1]). Unsqueezing them gives shapes of torch.Size([1, 50000, 784]) and torch.Size([50, 1, 1]) respectively.
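As a tiny illustration of that broadcasting (with small made-up shapes standing in for 50000/784/50):

```python
import torch

x = torch.randn(6, 4)   # stands in for trn_x: (n_samples, n_input)
w2 = torch.randn(3, 1)  # stands in for w2: (n_hidden, 1)

a = x.unsqueeze(0)      # (1, 6, 4)
b = w2.unsqueeze(-1)    # (3, 1, 1)
# size-1 dims broadcast against each other, giving (3, 6, 4)
print((a * b).shape)    # torch.Size([3, 6, 4])
```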

I tested the code snippet you generated, and it does seem to output something more reasonable (~-81 and ~71). Perhaps I shouldn't add torch.sum in my implementation, though I'm still trying to wrap my head around why. However, it still differs from the max/min gradients produced by PyTorch, though I'm not sure if I'm calculating them with PyTorch correctly.

You can access the notebook here. "Gradients and backward pass" is the relevant section (ignore the initial two SymPy cells).

I may not be able to check your reply tomorrow, so if I don't, I'll get back to you on Friday.

Hey! You seem to have swapped the matmul operands. It should be x@w, not w@x.

E.g., in your first layer, w has shape (n_input, n_hidden) and x has shape (n_datapoints, n_input). So w@x shouldn't even compute, because the shapes don't match (n_hidden ≠ n_datapoints).
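A quick sketch of why the order matters (made-up small shapes):

```python
import torch

x = torch.randn(10, 4)  # (n_datapoints, n_input)
w = torch.randn(4, 3)   # (n_input, n_hidden)

print((x @ w).shape)    # torch.Size([10, 3]) — inner dims 4 == 4, fine
try:
    w @ x               # (4, 3) @ (10, 4): inner dims 3 != 10
except RuntimeError as e:
    print("w @ x fails:", e)
```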

You're right! I should have written it the other way round.

\text{max}(0, \vec{x} \cdot \vec{w}_1 + b_1) \cdot \vec{w}_2 + b_2

My implementation, however, does indeed use x@w.

Dabbled around with your generated snippet, and now I realize why there is no torch.sum: it's because the @ operator handles the dot product.
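That equivalence is easy to demonstrate on random data: an elementwise multiply followed by a sum over the shared dimension is exactly what @ computes (shapes here are illustrative):

```python
import torch

a = torch.randn(5, 3)
b = torch.randn(3, 2)

# broadcast to (5, 3, 2), then sum over the shared dim of size 3
explicit = (a.unsqueeze(-1) * b.unsqueeze(0)).sum(1)
via_matmul = a @ b

print(torch.allclose(explicit, via_matmul, atol=1e-6))  # → True
```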

I've found that the shape of the matrix returned by the snippet is 50 by 784, which is incorrect, as the shape of w1 is 784 by 50. That's why the matrices need to be unsqueezed.

I simplified my original snippet from…

w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)


…to…

w1_gs = (2/trn_x.shape[0]) * ((l2 - trn_y[:, None] * w2.T)[:, None, :] * trn_x[..., None]).sum(0)


But it still differs from the gradients calculated by PyTorch, and I don't know why.

For context, this is how Jeremy does it in the lesson:

# forward pass:
l1 = lin(inp, w1, b1)
l2 = relu(l1)
out = lin(l2, w2, b2)
diff = out[:,0] - targ
loss = diff.pow(2).mean()

# backward pass:
out.g = 2.*diff[:,None] / inp.shape[0]
l1.g = (l1>0).float() * l2.g

def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)

His implementation calculates gradients that are close to PyTorch's.

I decided to break my massive formula down and implement it bit by bit, and I finally obtained the same gradients as PyTorch. It also helped me better understand Jeremy's implementation, as mine ended up being similar to his.

My step-by-step implementation of the algorithm to obtain the gradients of w1/w_1:

diff = (trn_y[:, None] - l2)
loss.g = (2 / trn_x.shape[0]) * diff
diff.g = loss.g * -1
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
w1.g = (l1.g[:, None, :] * trn_x[..., None]).sum(0)


Note for next time: donâ€™t implement an algorithm all in one line; break it down and implement it bit by bit.
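A compact way to double-check a step-by-step chain like the one above is to run it on tiny random data alongside autograd (everything here, names and shapes, is made up for illustration):

```python
import torch

torch.manual_seed(0)

# tiny stand-ins for trn_x / trn_y and the weights
x = torch.randn(8, 4)                        # (n_samples, n_input)
y = torch.randn(8)                           # targets
w1 = torch.randn(4, 3, requires_grad=True)   # (n_input, n_hidden)
b1 = torch.zeros(3, requires_grad=True)
w2 = torch.randn(3, 1, requires_grad=True)   # (n_hidden, 1)
b2 = torch.zeros(1, requires_grad=True)

# forward pass + autograd reference
l1 = torch.relu(x @ w1 + b1)
l2 = l1 @ w2 + b2
loss = ((l2[:, 0] - y) ** 2).mean()
loss.backward()

# manual chain rule, step by step
l2_g = (2 / x.shape[0]) * (l2 - y[:, None])      # dMSE/dl2, shape (8, 1)
l1_g = (l2_g @ w2.T) * (l1 > 0).float()          # through w2, then the ReLU mask
w1_g = (l1_g[:, None, :] * x[..., None]).sum(0)  # dMSE/dw1, shape (4, 3)

print(torch.allclose(w1_g, w1.grad, atol=1e-5))  # → True
```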

Turns out that I was multiplying by w2.T in the wrong place, that I was multiplying elementwise by w2.T when I should rather have been taking the dot product, and that I left out the crucial derivative of the ReLU (\begin{cases} 0 & l_1 \le 0 \\ 1 & l_1 > 0 \end{cases}, i.e. (l1 > 0).float()).
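That ReLU-derivative piece can be checked directly against autograd:

```python
import torch

z = torch.randn(5, requires_grad=True)
torch.relu(z).sum().backward()

# autograd's ReLU gradient is exactly the 0/1 mask (z > 0).float()
print(torch.equal(z.grad, (z > 0).float()))  # → True
```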
