# Could someone help me verify/troubleshoot my implementation of backpropagation

I have a neural network with 2 layers.

• The first layer is a linear combination/dot product plus a ReLU: l_1 = \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1)
• The second layer is simply the linear combination/dot product: l_2 = \vec{w}_2 \cdot l_1 + b_2

All in all, it comes out to look like this.

\vec{w}_2 \cdot \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1) + b_2
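In code, that forward pass can be sketched like this (tiny made-up shapes, illustrative names; I'm writing the matmuls as x @ w, which is the order the actual matrix implementation uses):

```python
import torch

def forward(x, w1, b1, w2, b2):
    l1 = torch.relu(x @ w1 + b1)  # first layer: linear + ReLU
    return l1 @ w2 + b2           # second layer: linear only

# made-up shapes: 8 samples, 4 inputs, 3 hidden units, 1 output
x = torch.randn(8, 4)
w1, b1 = torch.randn(4, 3), torch.zeros(3)
w2, b2 = torch.randn(3, 1), torch.zeros(1)
print(forward(x, w1, b1, w2, b2).shape)  # torch.Size([8, 1])
```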

My loss function is MSE and hence comes out looking like this, where N is the number of samples, and y_i is the corresponding target.

\frac{1}{N} \sum^N_{i=1} \left(y_i - (\vec{w}_2 \cdot \text{max}(0, \vec{w}_1 \cdot \vec{x}_i + b_1) + b_2)\right)^2

After doing a bunch of maths, I'm quite sure that the gradient for w_1 is given by the following simplified formula, where l_2 is the output of the second layer.

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = -\frac{2}{N} \sum^N_{i=1} (y_i - l_2) \vec{w}_2^T \vec{x}^T_i

Now I'm having a bit of trouble implementing this backpropagation formula in Python. Below is my implementation.

w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)
w1_gs.max(), w1_gs.min()


The maximum and minimum gradients that are output are 0.01 and -0.01.

When I try to verify my gradients using PyTorch's backpropagation algorithm, I get different maximum and minimum values for the gradients: ~43 and ~31 respectively.

w1_ = w1.clone().requires_grad_(True)
l1 = relu(lin(trn_x, w1_, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y)
loss.backward()


I don't know whether there is a problem in my implementation of the backpropagation algorithm (and if there is, exactly where), or if there is a problem in how I'm calculating the gradients using PyTorch.

I would really appreciate it if somebody could help me verify it and let me know where I'm going wrong. If you need any more information, want to see my working for deriving the backpropagation formula, or need access to a notebook for experimenting, do let me know.

I'm also new to this, so some of these questions may not make sense, but hopefully they help in some way. I'm not verifying the algorithm, just the code. Some notes and questions:

• I believe it should start with w1_gs = (-2...; the - is missing at the moment.
• Do you need to unsqueeze? I'm not super familiar, but from looking online I believe something may be wrong there: are you trying to fix an issue with the dimensions from the previous element not fitting together well?
• The trn_y and l2 also seem to be swapped

With some help from AI, something like this?

w1_gs = (-2 / trn_x.shape[0]) * ((trn_y - l2[:, 0])[:, None] * w2.T).T @ trn_x


untested…

Do you have a collab or notebook you could share?

I appreciate the response!

For points 1 and 3, it seems I mistakenly wrote the wrong backprop formula above. While that formula is correct…

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = -\frac{2}{N} \sum^N_{i=1} (y_i - l_2) \vec{w}_2^T \vec{x}^T_i

…it can be further simplified by removing the minus sign. This simplified formula is the one I'm trying to implement.

\frac{\partial \text{MSE}}{\partial \vec{w}_1} = \frac{2}{N} \sum^N_{i=1} (l_2 - y_i) \vec{w}_2^T \vec{x}^T_i

As for the second point, I do think I need to unsqueeze, or else the dimensions will not be compatible for broadcasting. trn_x has shape torch.Size([50000, 784]) while w2 has shape torch.Size([50, 1]). Unsqueezing them gives shapes of torch.Size([1, 50000, 784]) and torch.Size([50, 1, 1]) respectively.
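As a tiny illustration of that broadcasting (with small made-up shapes standing in for 50000/784/50):

```python
import torch

x = torch.randn(6, 4)   # stands in for trn_x: (n_samples, n_input)
w2 = torch.randn(3, 1)  # stands in for w2: (n_hidden, 1)

a = x.unsqueeze(0)      # (1, 6, 4)
b = w2.unsqueeze(-1)    # (3, 1, 1)
# size-1 dims broadcast against each other, giving (3, 6, 4)
print((a * b).shape)    # torch.Size([3, 6, 4])
```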

I tested the code snippet you generated, and it does seem to output something more reasonable (~-81 and ~71). Perhaps I shouldn't add torch.sum in my implementation, though I'm still trying to wrap my head around why. However, it still differs from the max/min gradients produced by PyTorch, though I'm not sure if I'm calculating them with PyTorch correctly.

You can access the notebook here. "Gradients and backward pass" is the relevant section (ignore the initial two SymPy cells).

I may not be able to check your reply tomorrow, so if I don't, I'll get back to you on Friday.

Hey! You seem to have swapped the matmul operands. It should be x@w, not w@x.

E.g., in your first layer, w has shape (n_input, n_hidden) and x has shape (n_datapoints, n_input). So w@x shouldn't even compute, because the shapes don't match (n_hidden ≠ n_datapoints).
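A quick sketch of why the order matters (made-up small shapes):

```python
import torch

x = torch.randn(10, 4)  # (n_datapoints, n_input)
w = torch.randn(4, 3)   # (n_input, n_hidden)

print((x @ w).shape)    # torch.Size([10, 3]) — inner dims 4 == 4, fine
try:
    w @ x               # (4, 3) @ (10, 4): inner dims 3 != 10
except RuntimeError as e:
    print("w @ x fails:", e)
```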

You're right! I should have written it the other way round.

\text{max}(0, \vec{x} \cdot \vec{w}_1 + b_1) \cdot \vec{w}_2 + b_2

My implementation, however, does indeed use x@w.

Dabbled around with your generated snippet, and now I realize why there is no torch.sum: it's because the @ operator handles the dot product.
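That equivalence is easy to demonstrate on random data: an elementwise multiply followed by a sum over the shared dimension is exactly what @ computes (shapes here are illustrative):

```python
import torch

a = torch.randn(5, 3)
b = torch.randn(3, 2)

# broadcast to (5, 3, 2), then sum over the shared dim of size 3
explicit = (a.unsqueeze(-1) * b.unsqueeze(0)).sum(1)
via_matmul = a @ b

print(torch.allclose(explicit, via_matmul, atol=1e-6))  # → True
```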

I've found that the shape of the matrix returned by the snippet is 50 by 784, which is incorrect, as the shape of w1 is 784 by 50. That's why the matrices need to be unsqueezed.

I simplified my original snippet from…

w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)


…to…

w1_gs = (2/trn_x.shape[0]) * ((l2 - trn_y[:, None] * w2.T)[:, None, :] * trn_x[..., None]).sum(0)


But it still differs from the gradients calculated by PyTorch, and I don't know why.

For context, this is how Jeremy does it in the lesson:

# forward pass:
l1 = lin(inp, w1, b1)
l2 = relu(l1)
out = lin(l2, w2, b2)
diff = out[:,0] - targ
loss = diff.pow(2).mean()

# backward pass:
out.g = 2.*diff[:,None] / inp.shape[0]
l1.g = (l1>0).float() * l2.g

def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)

His implementation calculates gradients that are close to PyTorch's.

I decided to break my massive formula down and implement it bit by bit, and I finally obtained the same gradients as PyTorch. It also helped me better understand Jeremy's implementation, as mine ended up being similar to his.

My step-by-step implementation of the algorithm to obtain the gradients of w1/w_1:

diff = (trn_y[:, None] - l2)
loss.g = (2 / trn_x.shape[0]) * diff
diff.g = loss.g * -1
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
w1.g = (l1.g[:, None, :] * trn_x[..., None]).sum(0)


Note for next time: donâ€™t implement an algorithm all in one line; break it down and implement it bit by bit.
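A compact way to double-check a step-by-step chain like the one above is to run it on tiny random data alongside autograd (everything here, names and shapes, is made up for illustration):

```python
import torch

torch.manual_seed(0)

# tiny stand-ins for trn_x / trn_y and the weights
x = torch.randn(8, 4)                        # (n_samples, n_input)
y = torch.randn(8)                           # targets
w1 = torch.randn(4, 3, requires_grad=True)   # (n_input, n_hidden)
b1 = torch.zeros(3, requires_grad=True)
w2 = torch.randn(3, 1, requires_grad=True)   # (n_hidden, 1)
b2 = torch.zeros(1, requires_grad=True)

# forward pass + autograd reference
l1 = torch.relu(x @ w1 + b1)
l2 = l1 @ w2 + b2
loss = ((l2[:, 0] - y) ** 2).mean()
loss.backward()

# manual chain rule, step by step
l2_g = (2 / x.shape[0]) * (l2 - y[:, None])      # dMSE/dl2, shape (8, 1)
l1_g = (l2_g @ w2.T) * (l1 > 0).float()          # through w2, then the ReLU mask
w1_g = (l1_g[:, None, :] * x[..., None]).sum(0)  # dMSE/dw1, shape (4, 3)

print(torch.allclose(w1_g, w1.grad, atol=1e-5))  # → True
```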

Turns out that I was multiplying by w2.T in the wrong place, that I was multiplying elementwise by w2.T when I should rather have been taking the dot product, and that I left out the crucial derivative of the ReLU (\begin{cases} 0 & l_1 \le 0 \\ 1 & l_1 > 0 \end{cases}, i.e. (l1 > 0).float()).
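That ReLU-derivative piece can be checked directly against autograd:

```python
import torch

z = torch.randn(5, requires_grad=True)
torch.relu(z).sum().backward()

# autograd's ReLU gradient is exactly the 0/1 mask (z > 0).float()
print(torch.equal(z.grad, (z > 0).float()))  # → True
```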
