I spent some time on this tonight and think I’ve figured out the answer. To my knowledge, nobody has fully answered this question yet, so I’ll answer here:
It is true that we are element-wise multiplying two matrices and then summing along each row to produce a (j x 1) dimensional vector. For clarity, let’s see what we are actually figuring out…
note: for clarity, I’ll assume everything is a matrix rather than a tensor
self.u = a matrix with dimension (n_users x n_factors)
each row of self.u represents a single user and each entry in that row represents the value of some latent factor (e.g. how much does this user like ‘action movies’? etc…)
self.m = a matrix with dimension (n_movies x n_factors)
each row of self.m represents a single movie and each entry in that row represents the value of some latent factor (e.g. how much is this movie an ‘action movie’? etc…)
So what are we calculating when we take the “dot product” of the two rows? We are doing the calculation that Jeremy mentioned in the video… almost. We are indeed multiplying the corresponding elements and then summing them to yield (u*m).sum(1)
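Each element of (u * m).sum(1) represents the prediction of a SINGLE movie for a SINGLE user. Here u and m are the rows of self.u and self.m looked up for the mini-batch. If each of them has dimension (j x k), i.e. j user-movie pairs with k factors each, then element i of (u*m).sum(1) is the prediction of how much the user in row i of u will like the movie in row i of m.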
So the resultant output is not what we see in the excel worksheet (an entire matrix of predictions of each movie for each user), but instead is a prediction of a single movie for a single user. Furthermore, we aren’t necessarily predicting what two different users will think of the same movie for each mini-batch (we could be… see below).
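To make the shapes concrete, here is a minimal sketch with made-up numbers. It uses NumPy rather than PyTorch only to keep it lightweight (on 2-D arrays, `*` and `.sum(1)` behave the same way), and `u` and `m` are assumed to be the rows already looked up for a mini-batch of 3 user-movie pairs. The row-wise result is just the diagonal of the full matrix product.

```python
import numpy as np

# Hypothetical mini-batch: 3 (user, movie) pairs, 2 latent factors each.
u = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 1.0]])   # row i = factors of the user in pair i
m = np.array([[2.0, 1.0],
              [4.0, 5.0],
              [0.0, 1.0]])   # row i = factors of the movie in pair i

# Row-wise dot product: one prediction per pair, shape (3,)
pair_preds = (u * m).sum(1)

# Full matrix product: every user in the batch against every movie
# in the batch, shape (3, 3). This is the spreadsheet-style grid.
all_preds = u @ m.T

print(pair_preds)          # [4. 2. 1.]
print(np.diag(all_preds))  # same three numbers, read off the diagonal
```

The off-diagonal entries of `all_preds` are predictions for user-movie pairs that are not in this mini-batch, which is why the row-wise version skips them.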
But the key is in the mini-batches!
Because we shuffle our data for each epoch, each mini-batch contains a different mix of user-movie pairs, so over time we pair many different movies with many different users and (implicitly) build out predictions for each user and each movie. We optimize our weights based on the error of the predictions in each mini-batch, and continue moving forward. Eventually we will have trained on every user-movie pair that has a rating; predictions for pairs without a rating come from the learned factors.
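Here is a rough sketch of that mini-batch loop, again in NumPy with made-up data. The names `user_emb`, `movie_emb`, `user_ids`, `movie_ids` and the helper `batches` are hypothetical stand-ins (for self.u, self.m and the ratings table), not the fastai implementation, and the weight update itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_movies, n_factors = 4, 5, 3
# Stand-ins for self.u and self.m (randomly initialised embeddings).
user_emb = rng.normal(size=(n_users, n_factors))
movie_emb = rng.normal(size=(n_movies, n_factors))

# Hypothetical ratings table: one entry per (userId, movieId) pair with a rating.
user_ids = np.array([0, 0, 1, 2, 2, 3])
movie_ids = np.array([1, 4, 0, 2, 3, 1])

def batches(batch_size):
    """Yield shuffled mini-batches of row indices for one epoch."""
    order = rng.permutation(len(user_ids))
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

seen = set()
for idx in batches(batch_size=2):
    u = user_emb[user_ids[idx]]     # (batch, n_factors)
    m = movie_emb[movie_ids[idx]]   # (batch, n_factors)
    preds = (u * m).sum(1)          # one prediction per pair in the batch
    # A real training step would compare preds to the ratings and update weights.
    seen.update(zip(user_ids[idx].tolist(), movie_ids[idx].tolist()))

print(len(seen))  # 6: every rated pair was visited once in this epoch
```

Each epoch visits the same rated pairs in a different order, so which pairs share a mini-batch changes from epoch to epoch.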
The last question is… why is this better than matrix multiplication? For this task… it probably doesn’t matter since the network is so shallow. But a full matrix product would also compute predictions for user-movie pairs that have no rating in the mini-batch, so there is nothing to compare them against. And on the backward pass, when our gradients are calculated and our weights are updated, taking the gradient of element-wise products (and sums) is cheaper than taking the gradients of full matrix products. So if we built a deep neural net, we would see savings on the backward pass through the network and would speed up training.
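As a back-of-the-envelope check on the cost argument, this sketch counts multiply-adds in the forward pass only. The batch size of 64 matches the bs=64 used in this thread; the user, movie, and factor counts are made-up round numbers, not the real dataset sizes. The gradient computations scale with the same sizes.

```python
# Hypothetical sizes, chosen only for illustration.
n_users, n_movies, n_factors, batch_size = 1_000, 10_000, 50, 64

# (u * m).sum(1): one dot product of length n_factors per pair in the batch.
row_wise = batch_size * n_factors

# u @ m.T restricted to the batch: every batch user against every batch movie.
batch_matmul = batch_size * batch_size * n_factors

# Full prediction grid, as in the spreadsheet: every user against every movie.
full_matmul = n_users * n_movies * n_factors

print(row_wise, batch_matmul, full_matmul)  # 3200 204800 500000000
```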
As a side note, I think some of the LSTM logic for RNNs is similarly based on trying to avoid derivatives of full matrix multiplications and instead taking element-wise products when possible.
Please let me know if this helps (or if I’ve made any mistakes); cheers!
Edit: I can confirm that the mini-batches do indeed draw shuffled user-movie pairs. After running the code:
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], bs=64)
we can look at some attributes and compare them to the dataframe ratings.
If you look at the dataframe ratings, you will see that every id in the userId column is repeated for every movieId for which that user has provided a rating.
Running the line of code data.trn_ds.cats shows us a similar setup of userId paired with movieId. So when we shuffle our data and draw new mini-batches, we are drawing new userId-movieId pairs, so we do indeed get different user-movie pairs to produce predictions.