I was watching lesson 8 and then read https://explained.ai/matrix-calculus/index.html, but one thing is still not clear to me: the lin_grad function.
My understanding of backprop is that it's an iterative approach to calculating partial derivatives. Let's say we have
f = mse(relu(x@w + b)). We can name the intermediate results and write
u = x@w + b
v = relu(u)
z = mse(v)
Since f = z, the chain rule gives df/dx = dz/dv * dz/du... sorry, df/dx = dz/dv * dv/du * du/dx. As we run backprop, after mse_grad we will have inp.g = dz/dv (there, inp is v), after relu_grad inp.g = dz/dv * dv/du (inp is u), and after lin_grad inp.g = dz/dv * dv/du * du/dx (inp is x). Is anything wrong so far?
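For reference, here is roughly what those three backward functions look like in the lesson notebook (reproduced from memory, so the details may differ slightly):

```python
def mse_grad(inp, targ):
    # gradient of the mean squared error w.r.t. its input: dz/dv
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # gradient of relu w.r.t. its input, chained with the upstream gradient
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    # gradients of u = inp @ w + b, chained with out.g
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)
```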
For the bias, du/db = 1 (a vector of ones). I get the part where we multiply that 1 with out.g, but why do we have to do .sum(0)? Is that because b is broadcast along dimension 0 (the batch dimension) during the forward pass?
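To test my broadcasting guess, I checked the manual bias gradient against autograd on a tiny example (my own experiment, not from the lesson):

```python
import torch

x = torch.randn(4, 3)                     # batch of 4, 3 features
w = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
g = torch.randn(4, 2)                     # stand-in for out.g

out = x @ w + b                           # b is broadcast over dim 0 (the batch)
out.backward(g)

# the bias gradient is the upstream gradient summed over the broadcast dim
print(torch.allclose(b.grad, g.sum(0)))   # True
```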
For w.g, I do understand that we have to combine out.g with x via some operation, but I don't fully understand whether it should be * or @. I know that * doesn't work here, but I'd like some theoretical understanding instead of just making the dimensions match.
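Here is the shape bookkeeping as I currently understand it, with n, m, k being my own labels for batch size, in-features, and out-features:

```python
import torch

n, m, k = 4, 3, 2
x = torch.randn(n, m)              # the layer input
g = torch.randn(n, k)              # stand-in for out.g

# elementwise * would need (n, m) and (n, k) to broadcast, which they don't:
# x * g                            # RuntimeError when m != k

# @ lines the shapes up and also sums over the batch dimension:
w_g = x.t() @ g                    # (m, n) @ (n, k) -> (m, k), same shape as w
print(w_g.shape)                   # torch.Size([3, 2])
```

The way I currently read it: each example i contributes an outer product of x[i] and g[i] to w.g, and the matmul sums those contributions over the batch. Is that the right intuition?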
The inp.g line is completely unclear to me. My expectation was something like out.g * w.t(), elementwise. Again, I get that the dimensions don't allow that, but I'd like some theoretical understanding of why that is.
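And the same kind of autograd check for the inp.g line (again my own experiment):

```python
import torch

n, m, k = 4, 3, 2
x = torch.randn(n, m, requires_grad=True)
w = torch.randn(m, k)
g = torch.randn(n, k)                      # stand-in for out.g

(x @ w).backward(g)

# out.g @ w.t(): (n, k) @ (k, m) -> (n, m), matching x
print(torch.allclose(x.grad, g @ w.t()))   # True
```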
It is entirely possible that all answers are in https://explained.ai/matrix-calculus/index.html#sec4.3, but I am still unable to connect the dots.