Hi @sgugger or other mathematicians,
In the attached PDF, I’ve tried to derive the gradient update rules as coded in the notebooks covering fully-connected layers. I struggle to connect the trace of a matrix product to the chain-rule derivative of a matrix product. I also hedge: I only take rough derivatives of the loss with respect to the other parameters, then add transposes wherever needed to make the dimensions match.
Please critique the derivation to help me understand this. I’m trying to keep it simple enough to re-derive if I ever forget the transposes and operations involved in the gradient updates.
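To make the question concrete, here is a minimal NumPy sketch of the kind of check I have in mind. It is not the notebook code: the layer `out = x @ w + b`, the toy loss `0.5 * sum(out**2)`, and every name below (`analytic_grads`, `numeric_grad`, the shapes) are my own assumptions for illustration. It compares the "transpose so the dimensions work" rules against central finite differences.

```python
import numpy as np

# Hypothetical toy setup: batch of n rows, d_in inputs, d_out outputs.
rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 2
x = rng.normal(size=(n, d_in))
w = rng.normal(size=(d_in, d_out))
b = rng.normal(size=(d_out,))

def loss(x, w, b):
    # Toy scalar loss, chosen only so that dL/d(out) is simply out.
    out = x @ w + b
    return 0.5 * np.sum(out ** 2)

def analytic_grads(x, w, b):
    out = x @ w + b
    grad_out = out                  # dL/d(out), shape (n, d_out)
    grad_x = grad_out @ w.T         # shape (n, d_in), same as x
    grad_w = x.T @ grad_out         # shape (d_in, d_out), same as w
    grad_b = grad_out.sum(axis=0)   # shape (d_out,), summed over the batch
    return grad_x, grad_w, grad_b

def numeric_grad(f, a, eps=1e-6):
    # Central finite differences, perturbing one entry of `a` at a time.
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        plus = f()
        a[idx] = old - eps
        minus = f()
        a[idx] = old                # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grad_x, grad_w, grad_b = analytic_grads(x, w, b)
f = lambda: loss(x, w, b)
num_x = numeric_grad(f, x)
num_w = numeric_grad(f, w)
num_b = numeric_grad(f, b)

# Largest absolute disagreement for each parameter.
print(np.abs(grad_x - num_x).max())
print(np.abs(grad_w - num_w).max())
print(np.abs(grad_b - num_b).max())
```

Each gradient has the same shape as the quantity it differentiates, which is the dimension rule I lean on when placing the transposes. A numerical check like this only tells me the rules are right, not why, so the derivation itself is what I would like critiqued.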
DerivingGradients4FClayers.pdf (309.6 KB)