I have a neural network with 2 layers.
- The first layer is a linear combination/dot product followed by a ReLU: l_1 = \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1)
- The second layer is simply the linear combination/dot product: l_2 = \vec{w}_2 \cdot l_1 + b_2
All in all, the network comes out looking like this:
l_2 = \vec{w}_2 \cdot \text{max}(0, \vec{w}_1 \cdot \vec{x} + b_1) + b_2
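In code, the forward pass looks roughly like this (a minimal sketch: lin and relu stand in for the small helpers I use throughout, and the shapes/values are just illustrative):

import torch

def lin(x, w, b):
    # plain affine transform: x @ w + b
    return x @ w + b

def relu(x):
    # elementwise max(0, x)
    return x.clamp_min(0.)

# illustrative shapes: 1000 samples, 10 inputs, 50 hidden units, 1 output
trn_x = torch.randn(1000, 10)
w1, b1 = torch.randn(10, 50), torch.zeros(50)
w2, b2 = torch.randn(50, 1), torch.zeros(1)

l1 = relu(lin(trn_x, w1, b1))  # first layer: linear + ReLU
l2 = lin(l1, w2, b2)           # second layer: linear only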
My loss function is MSE, so it looks like this, where N is the number of samples and y_i is the corresponding target:
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(l_{2,i} - y_i\right)^2
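And the loss in code, continuing the sketch above (trn_y holds the targets as a vector of length N):

def mse(pred, targ):
    # mean squared error over the N samples
    return ((pred.squeeze(-1) - targ) ** 2).mean()

trn_y = torch.randn(1000)  # illustrative targets
loss = mse(l2, trn_y)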
After doing a bunch of maths, I’m fairly sure that the gradient for w_1 is given by the following simplified formula, where l_2 is the output of the second layer.
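For reference, the full chain-rule expansion for this setup (before any simplification, with \mathbf{1}[\cdot] denoting the ReLU derivative) would look something like:

\frac{\partial \text{MSE}}{\partial \vec{w}_1}
= \frac{2}{N} \sum_{i=1}^{N} \left(l_{2,i} - y_i\right) \frac{\partial l_{2,i}}{\partial \vec{w}_1}
= \frac{2}{N} \sum_{i=1}^{N} \left(l_{2,i} - y_i\right) \left(\vec{w}_2 \odot \mathbf{1}\!\left[\vec{w}_1 \cdot \vec{x}_i + b_1 > 0\right]\right) \vec{x}_i^{\top}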
Now I’m having a bit of trouble implementing this backpropagation formula in Python. Below is my implementation.
# my attempt at implementing the gradient formula above for w1
w1_gs = (2/trn_x.shape[0]) * (l2[:, 0] - trn_y[None, ...]).T * (trn_x.unsqueeze(0) * w2.unsqueeze(-1)).sum(0)
w1_gs.max(), w1_gs.min()
The maximum and minimum gradients that come out are 0.01 and -0.01 respectively.
When I try to verify my gradients using PyTorch’s backpropagation algorithm, I get very different maximum and minimum values for the gradients: ~43 and ~31 respectively.
w1_ = w1.clone().requires_grad_(True)  # fresh copy of w1 that tracks gradients
l1 = relu(lin(trn_x, w1_, b1))         # forward pass, first layer
l2 = lin(l1, w2, b2)                   # forward pass, second layer
loss = mse(l2, trn_y)
loss.backward()                        # let autograd compute the gradients
w1_.grad.max(), w1_.grad.min()
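In case it helps, this is roughly how I’d compare the two gradients elementwise rather than only by their max/min (assuming w1_gs comes out with the same shape as w1_.grad):

# compare my manual gradient against autograd's, element by element
print(w1_gs.shape, w1_.grad.shape)                 # the shapes should match first of all
print(torch.allclose(w1_gs, w1_.grad, atol=1e-6))  # True only if every element agrees
print((w1_gs - w1_.grad).abs().max())              # size of the worst discrepancy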
I don’t know whether the problem is in my implementation of the backpropagation formula (and if so, exactly where), or in how I’m calculating the reference gradients with PyTorch.
I would really appreciate it if somebody could help me verify this and let me know where I’m going wrong. If you need any more information, want to see my working for deriving the backpropagation formula, or need access to a notebook for experimenting, do let me know.