Lesson 5: DotProduct

My understanding is that matrix dot product between two 2x2 matrices will result in another 2x2 matrix.

However, the DotProduct example in Lesson 5 results in a 2-element vector.
a = T([[1.,2],[3,4]])
b = T([[2.,2],[10,10]])
(a*b).sum(1)
Result:
6
70
[torch.FloatTensor of size 2]

The code example uses (a*b).sum(1) to compute the dot product:

class DotProduct(nn.Module):
    def forward(self, u, m): return (u*m).sum(1)

Shouldn’t it be
[a b] . [w x] = [aw+by ax+bz]
[c d]   [y z]   [cw+dy cx+dz]

or in numpy: np.dot(a, b) resulting in:
array([[22., 22.],
       [46., 46.]])

They are doing an element-wise multiplication, then summing the result along dimension 1.

In other words, a * b will end up looking like [[2, 4], [30, 40]] then .sum(1) will make it [6, 70]
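To make the difference concrete, here is a minimal sketch using NumPy as a stand-in for the PyTorch calls in the lesson (that substitution is my assumption; the numbers are the ones from the question). It computes the element-wise product, the row sums, and the true matrix product side by side.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[2.0, 2.0], [10.0, 10.0]])

# Element-wise product: each entry of a times the matching entry of b.
elementwise = a * b                  # [[2, 4], [30, 40]]

# Summing along dimension 1 collapses each row to a single number.
row_sums = elementwise.sum(axis=1)   # [6, 70]

# A true matrix product, for comparison, stays 2x2.
matmul = a @ b                       # [[22, 22], [46, 46]]

print(elementwise)
print(row_sums)
print(matmul)
```

So (a*b).sum(1) and a matrix multiply are genuinely different operations, with different output shapes.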

I posted my incomplete notes here https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-5-dd904506bee8 if you are interested. It makes things easier for me to search when I’m reviewing.


Thanks - this is what I understood. Calling it DotProduct threw me off. Is the naming of that class incorrect?

A dot product is anything of this nature: a*b + c*d + e*f + ... and in this case that is exactly what happens. It’s not a matrix-matrix multiplication but a dot product between the rows of both matrices.
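A small sketch of that point, again assuming NumPy in place of PyTorch: taking an ordinary vector dot product of each pair of matching rows gives the same numbers as (a*b).sum(1), and those numbers are the diagonal of the full product a @ b.T.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[2.0, 2.0], [10.0, 10.0]])

# One ordinary vector dot product per pair of matching rows.
per_row = np.array([np.dot(a[i], b[i]) for i in range(len(a))])

# The vectorized form from the lesson gives the same numbers.
vectorized = (a * b).sum(axis=1)

# The same values sit on the diagonal of the full product a @ b.T.
diagonal = np.diag(a @ b.T)

print(per_row, vectorized, diagonal)
```

All three print [6, 70], so the class name is defensible: it is a batch of vector dot products, one per row.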


I confess that it’s really puzzling to me how this unusual, non-standard DotProduct can do the job it’s supposed to do here. As others have pointed out, (a*b).sum(1) does a row-by-row dot product of the two matrices a and b, resulting in a 1-D vector, not a matrix multiply, which would yield a 2-D matrix. So this DotProduct only works in NumPy or PyTorch if the input matrices have identical dimensions, which they do when they are batched up in Jeremy’s code for the EmbeddingDot class in the notebook.

But the problem is that in Jeremy’s Excel model (“Why a matrix factorization and not a neural net?” at 00:12:15 in the lesson 5 video), what he does is precisely a matrix multiply of users and movies, not this weird line-by-line DotProduct. The loss is then calculated over the entire 2-d matrix, not over a 1-d vector, and that is as it should be. So that’s completely different from what the PyTorch EmbeddingDot class does, and I can’t figure out why the latter is correct, even though it seems to yield half-way decent results as measured by RMSE. The only explanation I could come up with is that because EmbeddingDot cycles through many batches over 3 epochs, it ends up comparing enough different user-movie pairs in its 1-d vector to get decent results.

Can anyone explain where I’m going wrong here? Or should the EmbeddingDot code be revised to use the actual PyTorch matrix multiply torch.mm(a,b), as others here have proposed?


I spent some time on this tonight and figured out the answer to this. To my knowledge, nobody has sufficiently answered this question, so I’ll answer here:

It is true that we are element-wise multiplying two matrices and then summing along each row to produce a (j x 1) dimensional vector. For clarity, let’s see what we are actually figuring out…

note: for clarity, I’ll assume everything is a matrix rather than a tensor

self.u = a matrix with dimension (n_users x n_factors)

each row of self.u represents a single user and each entry in that row represents the value of some latent factor (i.e., how much does user j like ‘action movies’? etc…)

self.m = a matrix with dimension (n_movies x n_factors)

each row of self.m represents a single movie and each entry in that row represents the value of some latent factor (i.e., how much is movie j an ‘action movie’? etc…)

So what are we calculating when we take the “dot product” of the two rows? We are doing the calculation that Jeremy mentioned in the video… almost. We are indeed multiplying the corresponding elements and then summing them to yield (u*m).sum(1)

Each element of (u * m).sum(1) represents the prediction of a SINGLE movie for a SINGLE user. If each matrix has dimension (j x k), then element i of (u*m).sum(1) is the prediction of how much the user in row i of u will like the movie in row i of m.

So the resultant output is not what we see in the excel worksheet (an entire matrix of predictions of each movie for each user), but instead is a prediction of a single movie for a single user. Furthermore, we aren’t necessarily predicting what two different users will think of the same movie for each mini-batch (we could be… see below).
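Here is a rough sketch of how the two views relate, using NumPy rather than the notebook's PyTorch embeddings. The sizes, the random factor matrices U and M, and the index arrays user_idx and movie_idx are all made up for illustration. The per-pair predictions from a mini-batch are just selected entries of the full users-by-movies matrix from the spreadsheet.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 5, 3   # hypothetical sizes

U = rng.normal(size=(n_users, n_factors))    # one row of factors per user
M = rng.normal(size=(n_movies, n_factors))   # one row of factors per movie

# A hypothetical mini-batch of (user, movie) index pairs.
user_idx = np.array([0, 2, 2, 3])
movie_idx = np.array([1, 0, 4, 4])

# Look up the rows for the batch, then take row-wise dot products.
u = U[user_idx]                      # shape (batch, n_factors)
m = M[movie_idx]                     # shape (batch, n_factors)
batch_preds = (u * m).sum(axis=1)    # one prediction per pair

# The spreadsheet-style full matrix of every user against every movie.
full = U @ M.T                       # shape (n_users, n_movies)

# The batch predictions are exactly the matching entries of that matrix.
same = np.allclose(batch_preds, full[user_idx, movie_idx])
print(batch_preds.shape, same)
```

In other words, the row-wise version computes only the cells of the big matrix that the batch asks for, instead of all of them.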

But the key is in the mini-batches!

Because we shuffle our data for each epoch, over time each mini-batch pairs different movies with different users, so we (implicitly) build out predictions across users and movies. We optimize our weights based on the prediction error for each mini-batch and keep moving forward. Over an epoch we will (theoretically) see every user-movie combination that has a rating.
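As a toy illustration of the shuffling idea (this is not the fastai data loader; the pairs array and the epoch_batches helper are hypothetical), the sketch below shuffles a small list of rated (user, movie) pairs, cuts it into mini-batches, and checks that one epoch touches every rated pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rated (user, movie) pairs, one row per rating.
pairs = np.array([[0, 1], [0, 3], [1, 0], [1, 1], [2, 2], [2, 3], [3, 0]])
batch_size = 3

def epoch_batches(pairs, batch_size, rng):
    """Yield shuffled mini-batches covering every row of pairs once."""
    order = rng.permutation(len(pairs))
    for start in range(0, len(pairs), batch_size):
        yield pairs[order[start:start + batch_size]]

seen = set()
for batch in epoch_batches(pairs, batch_size, rng):
    seen.update(map(tuple, batch.tolist()))

# Every rated pair shows up in some mini-batch during the epoch.
covered_all = seen == set(map(tuple, pairs.tolist()))
print(covered_all)
```

Note that only pairs with a rating ever appear in a batch, which is why the loss is computed on a 1-D vector of predictions rather than on the full matrix.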

The last question is… why is this better than matrix multiplication? For this task… it probably doesn’t matter since the network is so shallow. But on the backward pass when our gradients are calculated and our weights are updated, taking the gradient of element-wise products (and sums) is cheaper than taking the gradients of full matrix products. So if we built a deep neural net, we would see savings on the backward pass through the network and would speed up our performance.
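For a rough sense of scale, here is a back-of-the-envelope count of multiplications only. The user, movie, and factor counts are made-up numbers, not the dataset's; the batch size of 64 matches the bs=64 used in this thread. I am assuming the backward pass scales with roughly the same counts.

```python
# Hypothetical sizes, chosen only for illustration.
n_users, n_movies, n_factors = 1000, 2000, 50
batch_size = 64   # matches bs=64 used in the thread

# Row-wise dot products: one multiply per factor per pair in the batch.
rowwise_mults = batch_size * n_factors

# Full matrix product: one multiply per factor per (user, movie) cell.
full_mults = n_users * n_movies * n_factors

ratio = full_mults // rowwise_mults
print(rowwise_mults, full_mults, ratio)
```

With these made-up sizes the full product does 31,250 times as many multiplications as one mini-batch of row-wise dot products, and most of those cells have no rating to compare against anyway.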

As a side note, I think some of the LSTM logic for RNNs is similarly based on trying to avoid derivatives of full matrix multiplications and instead taking element-wise products when possible.

Please let me know if this helps (or if I’ve made any mistakes); cheers!

Edit: I can confirm that the mini-batches do indeed shuffle users vs. movies. After running the code:
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], bs=64)
we can look at some attributes and compare them to the dataframe ratings.

If you look at the dataframe ratings you will see that every id in the userId column is repeated for every movieId for which that user has provided a rating.

Running the line of code data.trn_ds.cats shows us a similar setup of userid paired with movieid. So when we shuffle our data and draw new mini-batches we are drawing new userid movieid pairs, so we do indeed get different user-movie pairs to produce predictions.


Thanks, @Hanzy, for your useful post. I was confused, and it became really clear after reading your post. Thanks for sharing!