During Lesson 8, my biggest misconception about how deep learning really works was shattered.
When Jeremy showed us this line:

w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)

“Wait, what?! That’s a vector!”
“That means we’re scaling complete rows of the weight matrix by the same value?!”
(Well, I didn’t phrase it like this at the time, I just paused the video and froze for a moment.)

Since the beginning I thought gradients were computed for each and every value in the weight matrix, and that they were all corrected independently.

Would this even make sense?
Would it be far too much computation?
It would certainly be far more difficult to train… wouldn’t it?

One thing is certain:
I really should take Rachel’s Linear Algebra course.

Don’t worry, you weren’t going mad!

Not only `w` but also all the other quantities here (`inp`, `out` and `out.g`) are matrices!
Their first dimension is the `batch_size` (or in this case the whole dataset size), and that’s actually the dimension you’re summing over!

So when you do `.unsqueeze(-1)` on `inp` you get a 3D tensor, and when you `.sum(0)` you get back to a matrix with one element for each element of `w`: you’re simply summing the gradients coming from all samples!

TL;DR: yes, they are matrices, so you’re indeed computing gradients for each and every value of `w`!
(You could add a few print statements in the `lin_grad` function to check what’s going on in case of doubt!)
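To convince yourself, here’s a minimal sketch of the same computation (NumPy’s `expand_dims`/`sum` standing in for PyTorch’s `unsqueeze`/`sum`, with made-up tiny shapes), compared against an explicit per-sample outer-product loop:

```python
import numpy as np

batch_size, n_in, n_out = 4, 3, 2
rng = np.random.default_rng(0)
inp = rng.normal(size=(batch_size, n_in))     # activations entering the linear layer
out_g = rng.normal(size=(batch_size, n_out))  # gradient flowing back from the layer output

# Broadcast version: (bs, n_in, 1) * (bs, 1, n_out) -> (bs, n_in, n_out), then sum over the batch
w_g = (np.expand_dims(inp, -1) * np.expand_dims(out_g, 1)).sum(0)

# Explicit version: accumulate one outer product per sample
w_g_loop = np.zeros((n_in, n_out))
for s in range(batch_size):
    w_g_loop += np.outer(inp[s], out_g[s])

print(np.allclose(w_g, w_g_loop))  # True: every element of w gets its own gradient
```

Both versions produce the same `n_in × n_out` matrix, so nothing is being “scaled by a single value”: each weight really does get its own summed gradient.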

Damn, I forgot we were dealing with batches here!
Okay, now I understand the matrix operation that produces the weight-gradient matrix.

What messed with my mind was seeing this:

Values of `w1.g` being all the same along the columns.

But maybe this makes sense, because the MNIST images here are flattened, so values along the columns always represent the same pixel location in the image…

I think that’s correct; in particular, the images are flattened in that example, and some of the initial / final pixels of every sample will always be black, as they correspond to the corners…

The center part of the image should have different elements though, if everything is being processed correctly!
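A quick sketch of that intuition (again NumPy with made-up shapes, pretending the first and last flattened pixels are always-black corners): a pixel that is zero in every sample contributes nothing to the sum over the batch, so its entire row in the weight-gradient matrix stays at zero, which is exactly “the same value along the columns” for those positions:

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, n_pixels, n_hidden = 8, 6, 3
inp = rng.random(size=(batch_size, n_pixels))
inp[:, [0, -1]] = 0.0  # pretend the first/last flattened pixels are corners: always black
out_g = rng.normal(size=(batch_size, n_hidden))

# Same gradient computation as before, summed over the batch dimension
w_g = (np.expand_dims(inp, -1) * np.expand_dims(out_g, 1)).sum(0)

print(w_g[0])   # row for an always-black pixel: all zeros
print(w_g[-1])  # likewise all zeros
print(w_g[1])   # a "center" pixel: generally non-zero, element-wise distinct
```

So identical entries for corner pixels are expected, while rows for center pixels should differ if everything is wired up correctly.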