I’m looking at a topic that’s a little different from the IMDB embedding example in the course. In my dataset, we just have a boolean for whether each user “bought” each item, and the end goal is to recommend to existing users other items they’d like to buy, based on their history. So rather than a sparse matrix of varied values, you can think of it as a matrix with a 1 for every purchased user-item pair and 0s everywhere else.
If I apply the default technique as in the course, I train a model that uniformly predicts 1 all the time. (This makes sense, as every example I feed it in training is a 1.) Is there a standard technique to train an embedding model on the 0s as well?
MSELossFlat effectively skips NaNs when computing the loss, so I’m imagining filling all the NaNs with 0s in the big items x users matrix, but it seems like there’s probably a more elegant solution.
I’d imagine that items bought by a single user would be only a small fraction of all the items, so your big matrix would have mostly zeros.
What I tried on the MovieLens dataset is to select a random set of unseen user-movie pairs, the same size as the original ratings, and concatenate it with the original data.
It trained to 83% accuracy, so it does seem to learn something. The code is here. It’s pretty much the same as the course notebook, just with all the ratings changed to 1 and the negative samples added.
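A minimal sketch of that negative-sampling step, assuming the ratings come in as a pandas DataFrame of positive pairs (the column names `userId`/`movieId` and the `label` column are my own, not from the linked notebook):

```python
import numpy as np
import pandas as pd

def add_negative_samples(ratings, user_col="userId", item_col="movieId", seed=42):
    """For each positive pair, add one random unseen user-item pair with label 0."""
    rng = np.random.default_rng(seed)
    users = ratings[user_col].unique()
    items = ratings[item_col].unique()
    seen = set(zip(ratings[user_col], ratings[item_col]))

    negatives = []
    while len(negatives) < len(ratings):
        u, i = rng.choice(users), rng.choice(items)
        if (u, i) not in seen:       # only keep pairs the user never interacted with
            negatives.append((u, i))
            seen.add((u, i))         # avoid sampling the same negative twice

    pos = ratings[[user_col, item_col]].copy()
    pos["label"] = 1
    neg = pd.DataFrame(negatives, columns=[user_col, item_col])
    neg["label"] = 0
    return pd.concat([pos, neg], ignore_index=True)
```

Note the rejection loop assumes the interaction matrix is sparse; with a nearly dense matrix it would run a long time.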
Some other things I would try: add more negatives, maybe rotate them from epoch to epoch, give more popular movies a higher probability of being chosen, and change the loss function to cross entropy, since MSE doesn’t make much sense when we’re only dealing with 0/1 values.
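The cross-entropy change can be sketched in plain PyTorch with a dot-product embedding model and `BCEWithLogitsLoss`; all sizes and hyperparameters here are illustrative, not from the notebook:

```python
import torch
import torch.nn as nn

class DotProductModel(nn.Module):
    """Embedding dot product plus biases, producing one logit per user-item pair."""
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, users, items):
        dot = (self.user_emb(users) * self.item_emb(items)).sum(dim=1)
        return dot + self.user_bias(users).squeeze(1) + self.item_bias(items).squeeze(1)

model = DotProductModel(n_users=100, n_items=200)
loss_fn = nn.BCEWithLogitsLoss()  # binary cross entropy on raw logits

users = torch.tensor([0, 1, 2])
items = torch.tensor([5, 6, 7])
targets = torch.tensor([1.0, 0.0, 1.0])  # 1 = bought, 0 = sampled negative
loss = loss_fn(model(users, items), targets)
```

`BCEWithLogitsLoss` folds the sigmoid into the loss, so no `y_range` squashing is needed on the model output.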
Indeed, the overwhelming majority of user-item pairs have zero values. I’m trying a very similar technique to what you suggest, so that’s comforting. As a first pass, from the training set of N user-item pairs that exist, I randomly generate N user-item pairs and concatenate them with the original data with target = 0. Good callouts for iterating on that; I’ll have to think about whether the negative examples should be evenly distributed across the matrix or weighted by item popularity.
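For the popularity-weighted variant, one simple sketch is to sample negative items in proportion to how often they appear in the positives (the column names and the plain unigram weighting are assumptions, and a real version would still filter out pairs that actually occur):

```python
import numpy as np
import pandas as pd

def sample_popular_negatives(ratings, n_samples, user_col="userId",
                             item_col="movieId", seed=0):
    """Draw candidate negatives with items weighted by their popularity."""
    rng = np.random.default_rng(seed)
    counts = ratings[item_col].value_counts()
    probs = (counts / counts.sum()).to_numpy()   # popular items drawn more often
    items = rng.choice(counts.index.to_numpy(), size=n_samples, p=probs)
    users = rng.choice(ratings[user_col].unique(), size=n_samples)
    neg = pd.DataFrame({user_col: users, item_col: items})
    neg["label"] = 0
    return neg
```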
Another thought: it seems like it might be simple to modify the default loss functions with a pre-processing step like t[isnan(t)] = 0, so that the input data size stays the same but the mean over the rows uses 0s instead of nulls for the missing entries.
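That pre-processing step could look like the following, as a custom loss wrapper (a sketch, not fastai’s actual MSELossFlat implementation; here every missing entry simply contributes (pred − 0)² to the mean):

```python
import torch

def nan_to_zero_mse(pred, target):
    """MSE where NaN targets are treated as 0, keeping the tensor shape fixed."""
    target = torch.where(torch.isnan(target), torch.zeros_like(target), target)
    return torch.mean((pred - target) ** 2)
```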