So I work at a place that is trying to crack recsys for a bunch of users who are served a stack of content cards from various categories.
We have two types of users:
dense (for whom we have features like region, time of day active, etc., plus interaction data on the cards, rewarded as either 1 or 0 based on a like)
sparse (same features as above, but very little or no interaction data)
The dense interaction data at the user level also gives me each user's inclination at the category level, e.g. an average reward of 0.5 for sports, 0.3 for news, etc.
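To make that category-level signal concrete, here is a minimal sketch of how it could be computed. The log layout `(user_id, category, reward)` and the name `category_affinity` are assumptions for illustration, not our actual schema.

```python
from collections import defaultdict

# Hypothetical interaction log rows: (user_id, category, reward),
# where reward is 1 if the card was liked and 0 otherwise.
interactions = [
    ("u1", "sports", 1),
    ("u1", "sports", 0),
    ("u1", "news", 0),
    ("u1", "news", 0),
    ("u1", "news", 1),
    ("u2", "sports", 1),
]

def category_affinity(log):
    """Return {user: {category: mean reward}} from (user, category, reward) rows."""
    # For each user and category, keep a running [reward sum, interaction count].
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for user, category, reward in log:
        cell = totals[user][category]
        cell[0] += reward
        cell[1] += 1
    return {
        user: {cat: s / n for cat, (s, n) in cats.items()}
        for user, cats in totals.items()
    }

affinity = category_affinity(interactions)
print(affinity["u1"])
```

Categories a user has never seen are simply absent from their dictionary, which is exactly the gap that sparse users have everywhere.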
We also have content data (text body) + metadata (category like news or sports, time of publishing, etc.).
What's the best way to approach this? Do I treat it as an NCF (neural collaborative filtering) problem or as a tabular data problem? I’m not sure how to incorporate all this data in the former, or how well suited the latter is to the problem space.
The biggest challenge with using collaborative filtering models in practice is the bootstrapping problem. The most extreme version of this problem is when you have no users, and therefore no history to learn from. What products do you recommend to your very first user?
But even if you are a well-established company with a long history of user transactions, you still have the question: what do you do when a new user signs up? And indeed, what do you do when you add a new product to your portfolio? There is no magic solution to this problem, and really the solutions that we suggest are just variations of "use your common sense." You could assign new users the mean of all of the embedding vectors of your other users, but this has the problem that that particular combination of latent factors may not be at all common (for instance, the average for the science-fiction factor may be high, and the average for the action factor may be low, but it is not that common to find people who like science-fiction without action). Better would probably be to pick some particular user to represent average taste.
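The two fallbacks described above can be compared in a few lines. This is only a sketch: the two-factor embeddings, the user names, and the choice of "the real user closest to the mean" as the representative are all assumptions made for illustration.

```python
import math

# Hypothetical learned user embeddings with two latent factors: (sci-fi, action).
user_embeddings = {
    "alice": [0.9, 0.8],
    "bob": [0.8, 0.9],
    "carol": [0.9, 0.9],
    "dave": [0.7, -0.9],
}

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representative_user(embeddings):
    """Pick the real user whose embedding is closest (Euclidean) to the mean."""
    centroid = mean_vector(list(embeddings.values()))
    return min(embeddings, key=lambda u: math.dist(embeddings[u], centroid))

# Option 1: the mean vector, which may describe a taste nobody actually has.
centroid = mean_vector(list(user_embeddings.values()))
# Option 2: an actual user standing in for "average taste".
rep = representative_user(user_embeddings)
print(centroid, rep)
```

In this toy data the mean has a high sci-fi factor but only a middling action factor, a combination none of the four users has, whereas the representative user is a real point in the embedding space.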
Better still is to use a tabular model based on user metadata to construct your initial embedding vector. When a user signs up, think about what questions you could ask them that could help you to understand their tastes. Then you can create a model where the dependent variable is a user’s embedding vector, and the independent variables are the results of the questions that you ask them, along with their signup metadata. We will see in the next section how to create these kinds of tabular models. (You may have noticed that when you sign up for services such as Pandora and Netflix, they tend to ask you a few questions about what genres of movie or music you like; this is how they come up with your initial collaborative filtering recommendations.)
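As a rough sketch of that idea, the snippet below fits a model whose target is the embedding vector of each dense user and whose inputs are their signup features, then predicts an embedding for a brand-new user. It swaps in plain ridge regression with NumPy as a stand-in for a fuller tabular model; the feature columns, the embedding values, and the function names are all hypothetical.

```python
import numpy as np

# Hypothetical signup features for dense users, already one-hot/binary encoded.
# Columns: [likes_sports, likes_news, region_is_north]
X_dense = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
], dtype=float)

# Hypothetical 2-dimensional embeddings learned for those same users
# by a collaborative filtering model.
E_dense = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.7],
    [0.0, 0.6],
    [0.6, 0.5],
])

def fit_embedding_model(X, E, ridge=1e-3):
    """Ridge regression from features to embeddings (dependent variable = embedding)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    # Note: for simplicity the ridge penalty is applied to the bias weight too.
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ E)  # weights, shape (n_features + 1, emb_dim)

def predict_embedding(W, x):
    """Predict an initial embedding for one user from their signup features."""
    xb = np.append(np.asarray(x, dtype=float), 1.0)
    return xb @ W

W = fit_embedding_model(X_dense, E_dense)
# A new (sparse) user who said they like sports and lives outside the north region.
new_embedding = predict_embedding(W, [1, 0, 0])
print(new_embedding)
```

The predicted vector can then be used in place of a learned user embedding until enough interactions accumulate to train one directly.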
Thanks a lot! Will look it up ASAP. It’s late at night where I’m from. But just as a follow-up question: what if one of my features is itself a vector, i.e. not a categorical or numerical variable but an actual n-dimensional array? Can that be used as an input feature as well?