Understanding linear layer gradient

I was watching lesson 8 and then read https://explained.ai/matrix-calculus/index.html, but one thing is still not clear to me: the lin_grad function.

My understanding of backprop is that it is an iterative approach to calculating partial derivatives via the chain rule. Let’s say we have f = mse(relu(x@w + b)). We can name the intermediate functions and say
u = x@w + b
v = relu(u)
z = mse(v)

Then, by the chain rule, df/dx = dz/dv * dv/du * du/dx (with f = z). As we run backprop, after mse_grad we will have inp.g = dz/dv, after relu_grad inp.g = dz/dv * dv/du, and after lin_grad inp.g = dz/dv * dv/du * du/dx. Is anything wrong so far?
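To make the setup concrete, here is a minimal sketch of the forward pass and the order in which the backward functions are called (my own paraphrase with made-up shapes, not the exact notebook code):

```python
import torch

# Toy shapes, just for illustration
x = torch.randn(64, 10)     # batch of 64 samples, 10 features
w = torch.randn(10, 1)
b = torch.zeros(1)
targ = torch.randn(64)

# Forward pass, naming the intermediates as above
u = x @ w + b                              # linear layer
v = u.clamp_min(0.)                        # relu
z = (v.squeeze(-1) - targ).pow(2).mean()   # mse

# Backward pass runs in reverse order; each *_grad function multiplies in
# one more factor of the chain rule and stores it in .g on its input:
#   mse_grad  -> v.g = dz/dv
#   relu_grad -> u.g = dz/dv * dv/du
#   lin_grad  -> x.g = dz/dv * dv/du * du/dx   (plus w.g and b.g)
```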

d/db
du/db = 1 (a vector of ones). I get the part where we multiply that 1 with out.g, but why do we have to do .sum(0)? Is that because the bias vector is broadcast along dimension 0 during the forward pass?
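To make the question concrete, here is roughly what the bias part looks like with stand-in tensors (names and shapes are my own, not the notebook’s):

```python
import torch

bs, n_out = 64, 5
out_g = torch.randn(bs, n_out)   # stand-in for out.g
b = torch.zeros(n_out)

# Forward: b (shape [n_out]) is broadcast over dimension 0 of x @ w (shape [bs, n_out]).
# Backward, as written in the notebook: sum the incoming gradient over that same dimension.
b_g = out_g.sum(0)               # shape [n_out], same as b
```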

d/dw
I do understand that we have to perform out.g (some operation) x, but I don’t fully understand whether that operation should be * or @. I know that * doesn’t work, but I’d like some theoretical understanding instead of just trying to make the dimensions match.
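For reference, this is the line I mean, with stand-in tensors (my paraphrase of the notebook):

```python
import torch

bs, n_in, n_out = 64, 10, 5
inp   = torch.randn(bs, n_in)
out_g = torch.randn(bs, n_out)    # stand-in for out.g

# The notebook uses a matmul here, not an elementwise product:
w_g = inp.t() @ out_g             # shape [n_in, n_out], same as w
# inp * out_g                     # the elementwise version I'd naively try; the shapes don't line up
```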

d/dx
This component is completely unclear to me. My expectation is that it should be out.g (some operation) w.t(). Again, I get that the dimensions don’t allow an elementwise *, but I’d like some theoretical understanding of why that is.
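And for completeness, the input-gradient line I’m referring to (again my paraphrase, stand-in tensors):

```python
import torch

bs, n_in, n_out = 64, 10, 5
w     = torch.randn(n_in, n_out)
out_g = torch.randn(bs, n_out)    # stand-in for out.g

# What the notebook does: a matmul with the transposed weights
inp_g = out_g @ w.t()             # shape [bs, n_in], same as inp
```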

It is entirely possible that all answers are in https://explained.ai/matrix-calculus/index.html#sec4.3, but I am still unable to connect the dots.

In case somebody else bumps into the same problem, reading http://cs231n.stanford.edu/handouts/linear-backprop.pdf helped me.

The reality is that the code in the notebook is an “optimised” version. It makes sense to write it that way because it uses less memory and is faster, but it definitely does not show how you’d calculate the gradients step by step.
The linked paper explains how to calculate the result element by element and then shows how that computation can be collapsed into a matrix multiplication.
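To illustrate (this is my own sketch of that derivation, not the notebook’s code): a slow, element-by-element version of lin_grad and the optimised one compute exactly the same tensors, and comparing them on random inputs with torch.allclose is a quick way to convince yourself of that.

```python
import torch

def lin_grad_loops(inp, out, w, b):
    """Naive version: accumulate every gradient entry from u[i,j] = sum_k inp[i,k]*w[k,j] + b[j]."""
    bs, n_in = inp.shape
    n_out = w.shape[1]
    inp.g = torch.zeros_like(inp)
    w.g   = torch.zeros_like(w)
    b.g   = torch.zeros_like(b)
    for i in range(bs):
        for j in range(n_out):
            b.g[j] += out.g[i, j]                       # du[i,j]/db[j]     = 1
            for k in range(n_in):
                w.g[k, j]   += inp[i, k] * out.g[i, j]  # du[i,j]/dw[k,j]   = inp[i,k]
                inp.g[i, k] += w[k, j]   * out.g[i, j]  # du[i,j]/dinp[i,k] = w[k,j]

def lin_grad(inp, out, w, b):
    """Optimised version: the same sums, written as matmuls and a sum over dim 0."""
    inp.g = out.g @ w.t()
    w.g   = inp.t() @ out.g
    b.g   = out.g.sum(0)
```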


I’ve spent more time with this and figured I’d make it easier for others on the same path. If you want to fully understand how Jeremy’s code connects to the math linked, here’s the guide https://biasandvariance.com/batched-backpropagation-connecting-math-and-code/.

It goes step by step and links to the relevant pieces of math along the way. I hope it helps!


Very useful. Thanks, Mario!