I’ve been playing around with Collaborative Filtering and the MovieLens dataset and I was wondering…

If I only wanted to get the best embeddings for each of the movies (later to be used for projections and distance calculations) – does generalization or over-fitting matter?

If I never plan on using new data, and I just want the most accurate vector representation for each movie in the dataset, then I feel like it doesn’t matter if I overfit and the lower the error the better.

… or am I misunderstanding / missing something?

What I really want is the best possible vector representation of each movie, so that I can calculate the similarity between movies from the [Euclidean] distance between their embeddings (is there a better distance metric to use in this context?). If there is a more efficient way to do this (DL or ML solutions welcome), please let me know that as well (or instead).

So I posted this question on Reddit as well and got the following answer (paraphrased):

Over-fitting does not matter a ton in this situation, but it still matters: the model could end up effectively just creating an ID for each movie, which would make the embeddings useless for anything other than identifying the movie they represent (so no meaningful comparison or distance).

Cosine similarity would be the best metric for comparing the distance / similarity between embeddings.
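To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the names `movie_a`, `movie_b`, `movie_c` and `most_similar` are made up for illustration; they are not real MovieLens embeddings or part of any library. It assumes no embedding is the all-zero vector.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to the length (norm) of each vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length.

    Assumes neither vector is all zeros (otherwise this divides by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=2):
    """Rank every other movie by cosine similarity to `query`, highest first."""
    scores = [
        (title, cosine_similarity(embeddings[query], vec))
        for title, vec in embeddings.items()
        if title != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical embeddings, purely for illustration.
embeddings = {
    "movie_a": [1.0, 2.0, 0.5],
    "movie_b": [3.0, 6.0, 1.5],   # same direction as movie_a, three times the norm
    "movie_c": [1.0, -1.0, 0.5],  # closer in space, but pointing elsewhere
}

ranking = most_similar("movie_a", embeddings)
```

Here `movie_b` points in exactly the same direction as `movie_a`, so cosine similarity ranks it first, even though `movie_c` is nearer by Euclidean distance. Which behaviour is preferable depends on whether the length of an embedding carries information you care about.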

All of this makes sense intuitively to me, but I’m wondering if anyone has anything else to add (and I wanted to provide an answer for anyone else who was curious).

Hi @zache
That’s very interesting! I’m trying to apply collaborative filtering to biological data; it always seems to overfit a bit, and the rating predictions aren’t the best. However, the embeddings do show some interesting patterns. Could you please post a link to the Reddit thread?