I have a neural network with 2 layers.

- The first layer is a linear combination/dot product plus a ReLU: l_1 = \max(0, \vec{w}_1 \cdot \vec{x} + b_1)
- The second layer is simply a linear combination/dot product: l_2 = \vec{w}_2 \cdot l_1 + b_2

All in all, the full network comes out looking like this:
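l_2 = \vec{w}_2 \cdot \max(0, \vec{w}_1 \cdot \vec{x} + b_1) + b_2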

My loss function is MSE and hence comes out looking like this, where N is the number of samples and y_i is the corresponding target:
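L = \frac{1}{N} \sum_{i=1}^{N} (l_{2,i} - y_i)^2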

After doing a bunch of maths, I'm *fairly* sure that the gradient for w_1 is given by the following simplified formula, where l_{2,i} is the output of the second layer for the i-th sample:
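\frac{\partial L}{\partial \vec{w}_1} = \frac{2}{N} \sum_{i=1}^{N} (l_{2,i} - y_i) \, \vec{w}_2 \, \vec{x}_i^\top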

Now I'm having a bit of trouble implementing this backpropagation formula in Python. Below is my implementation.

```python
# My attempt at dL/dw1: (2/N) * sum_i (l2_i - y_i) * (w2 * x_i)
w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)
w1_gs.max(), w1_gs.min()
```
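To make the intended computation explicit, here is the same formula written with `torch.einsum`. The shapes are my assumptions about the notebook rather than anything shown above: `trn_x` is `(N, D)`, `w2` is `(H,)` or `(H, 1)`, `l2` is `(N, 1)`, and `trn_y` is `(N,)`.

```python
import torch

# Index form of the formula: dL/dw1[j, k] = (2/N) * sum_i (l2[i] - y[i]) * w2[j] * x[i, k]
# Assumed shapes: trn_x (N, D), w2 (H,) or (H, 1), l2 (N, 1), trn_y (N,)
w1_gs_einsum = (2 / trn_x.shape[0]) * torch.einsum(
    "i,j,ik->jk", l2[:, 0] - trn_y, w2.reshape(-1), trn_x
)
```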

The maximum and minimum gradients that are output are `0.01` and `-0.01`.

When I try to verify my gradients by using PyTorch's backpropagation algorithm, I get very different maximum and minimum values for the gradients: ~`43` and ~`31` respectively.

```python
# relu, lin, and mse are small helper functions defined earlier in my notebook
w1_ = w1.clone().requires_grad_(True)
l1 = relu(lin(trn_x, w1_, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y)
loss.backward()
w1_.grad.max(), w1_.grad.min()
```
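One extra check I could run is a finite-difference estimate of a single entry of the gradient, to see which of the two results it agrees with. A minimal sketch, assuming `w1` is a 2-D weight matrix and that `trn_x`, `trn_y`, `b1`, `w2`, `b2`, and the helpers above are in scope (the `eps` value and the choice of entry `[0, 0]` are arbitrary):

```python
def loss_at(w1_val):
    # Recompute the forward pass and MSE loss for a given w1
    l1 = relu(lin(trn_x, w1_val, b1))
    l2 = lin(l1, w2, b2)
    return mse(l2, trn_y)

eps = 1e-4
w1_plus = w1.clone()
w1_plus[0, 0] += eps
w1_minus = w1.clone()
w1_minus[0, 0] -= eps

# Central-difference approximation of dL/dw1[0, 0]
fd_grad = (loss_at(w1_plus) - loss_at(w1_minus)) / (2 * eps)
print(fd_grad)
```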

I don't know whether there is a problem in my implementation of the backpropagation formula (and if so, exactly where), or whether there is a problem in how I'm calculating the gradients using PyTorch.

I would really appreciate it if somebody could help me verify my derivation and let me know where I'm going wrong. If you need any more information, want to see my working for deriving the backpropagation formula, or need access to a notebook for experimenting, do let me know.