Collaborative Filtering question - Latent Factors

I’ve been looking at ch8 of the book and have been scratching my head about ‘Latent factors’ … they seem like magic… starting off as random values and ending up with all sorts of insight …
Q: how do you know how many latent factors to pick?
Q: how do you know what any particular factor represents?
Q: are they somehow like the hidden units in a NN ?
hopefully the penny will drop for me soon

Hi Aaron,

Here are my thoughts on your Qs.

A: Nobody knows the answer to this, and it varies from problem to problem. fastai has some reasonable defaults, but they will almost assuredly not be optimal for any particular dataset. You could use a hyperparameter-tuning framework to search for an “optimal” dimensionality of the embeddings.
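To make the "treat it as a hyperparameter" idea concrete, here's a minimal numpy sketch (made-up toy ratings, hand-written gradient descent rather than fastai) that trains a tiny dot-product factorization at a few different sizes and compares the fit. In practice you'd compare held-out validation error, since training error alone always improves as you add factors:

```python
import numpy as np

def factorize(R, mask, n_factors, lr=0.01, epochs=2000, seed=0):
    """Tiny matrix factorization trained with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
    V = rng.normal(0, 0.1, (n_items, n_factors))   # movie latent factors
    for _ in range(epochs):
        err = mask * (U @ V.T - R)       # error on observed ratings only
        dU, dV = err @ V, err.T @ U      # gradients of the squared error
        U -= lr * dU
        V -= lr * dV
    return U, V

# toy ratings matrix: 6 users x 5 movies, 0 means "unrated"
R = np.array([[5, 4, 0, 1, 1],
              [4, 5, 1, 0, 1],
              [5, 0, 4, 1, 2],
              [1, 1, 0, 5, 4],
              [0, 2, 1, 4, 5],
              [1, 1, 2, 5, 0]], dtype=float)
mask = (R > 0).astype(float)

mses = {}
for k in (1, 2, 5):
    U, V = factorize(R, mask, k)
    mses[k] = float(((mask * (U @ V.T - R)) ** 2).sum() / mask.sum())
    print(f"n_factors={k}: training MSE = {mses[k]:.3f}")
```

With a validation split you'd pick the smallest `n_factors` where the held-out error stops improving, which is essentially what a hyper-tuning framework automates.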

A: You don’t! You can try to interpret them by plotting where the observations fall in any given dimension, but it’s inherently subjective.
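For what it's worth, "plotting where the observations fall in any given dimension" can be as simple as sorting items along one embedding column and eyeballing what the extremes have in common. A minimal sketch with made-up titles and made-up embedding numbers:

```python
import numpy as np

# hypothetical learned movie embeddings (5 movies x 3 factors) -- invented values
titles = ["Aliens", "Toy Story", "The Shining", "Frozen", "Heat"]
emb = np.array([[ 1.2, -0.3,  0.8],
                [-0.9,  1.1,  0.2],
                [ 1.0, -0.8, -0.4],
                [-1.1,  0.9,  0.1],
                [ 0.7,  0.2, -0.9]])

d = 0                                   # pick one latent factor to inspect
order = np.argsort(emb[:, d])           # movies sorted low -> high on factor d
print("factor", d, ":", [titles[i] for i in order])
```

If the low end is all family films and the high end all dark thrillers, you might label that factor "grittiness" after the fact, but as Patrick says, the label is yours, not the model's.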

A: Not really. Once training has concluded, both are just numbers and can be interpreted as higher-level “features” of your data, but embeddings/latent factors are themselves trainable parameters. Hidden units in a NN are activations: a weighted sum of the input features and trainable parameters, passed through a nonlinearity.
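That distinction can be shown in a few lines of numpy (made-up shapes and random data): an embedding is looked up, a hidden unit is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# An embedding is itself a table of trainable parameters: "using" user 3's
# latent factors is just selecting row 3 -- nothing is computed from inputs.
user_emb = rng.normal(size=(10, 4))     # 10 users x 4 latent factors
latent = user_emb[3]                    # the latent factors ARE the parameters

# A hidden unit is an activation: inputs times trainable weights,
# passed through a nonlinearity.
x = rng.normal(size=8)                  # input features
W = rng.normal(size=(4, 8))             # trainable weights
hidden = np.maximum(0, W @ x)           # ReLU(Wx): computed, not looked up

print(latent.shape, hidden.shape)
```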


Hi Patrick - interesting, now I’m not feeling so dumb :)

So regarding how many latent factors to pick, you say nobody knows, but what I read in the book was:

Step 1 of this approach is to randomly initialize some parameters. These
parameters will be a set of latent factors for each user and movie. We will
have to decide how many to use. We will discuss how to select this shortly,
but for illustrative purposes, let’s use 5 for now.

which sort of got me thinking there was a way to come up with that number, but I didn’t see it discussed again …

So if, as you say, there’s no rule of thumb for choosing n_factors, my third question was just asking if there was, intuitively, some similarity between these ‘latent factors’ (which are hidden, inferred and learned) and the hidden layers/units of a NN, because with those there seems to be no rule of thumb (that I’m aware of) for how many you should have either … I mean, a hidden unit is also a hidden, inferred, learned kinda thing, no?

The book went on to talk about how to interpret the embeddings and biases - apparently the biases are the things to look at (no intuition as to why) - and then there was a brief mention of PCA, which identified the two most important directions in the latent factors (I think?), which in turn let you see some sort of clustering relationship … but even that was pretty subjective - so I see what you are saying there…
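In case it helps demystify the PCA step: it just finds the directions of greatest variance in the embedding matrix and projects each movie onto the top two, which is what produces that 2-D scatter plot in the book. A minimal numpy sketch, using random stand-in embeddings rather than real learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical learned movie embeddings: 50 movies x 10 latent factors
emb = rng.normal(size=(50, 10))

# PCA via SVD: center the data, then the top right-singular vectors are
# the directions of greatest variance across the latent factors
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T   # each movie as a point in the top-2 PCA plane
print(coords.shape)            # ready to scatter-plot and eyeball for clusters
```

The clustering you see in that plane is still subjective to read, as Patrick said - PCA only picks the axes, not their meaning.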

I think Jeremy’s final comment:

No matter how many models I train, I never stop getting moved and surprised by how
these randomly initialized bunches of numbers, trained with such simple mechanics,
manage to discover things about my data all by themselves. It almost seems like
cheating that I can create code that does useful things without ever actually telling it
how to do those things!

sums up how magical (and unintuitive) this is… so maybe I shouldn’t even try to understand it too deeply?



Hey mlabs,

I was similarly amazed at how embeddings can encapsulate so much important information, such as the geographic proximity of towns. It is extremely cool.

My tuppenceworth follows; I would welcome correction so that I can learn more as well. From what I understand of chapter 8, the embeddings are trained using gradient descent to minimize a loss function defined in the Learner object. The magic of PyTorch is that it abstracts away the backward / weight-adjustment step of the process: all gradients are calculated automatically. This process is similar to the training of a traditional feed-forward neural network, in that the weights are adjusted in a backward step in the direction that most reduces the loss/cost.
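To make the "gradients are calculated automatically" part concrete, here is roughly what autograd works out for a single (user, movie, rating) example in a dot-product model, with the gradients written by hand instead (a sketch with made-up numbers):

```python
import numpy as np

# One SGD step for one (user, movie, rating) example in a dot-product model:
# pred = u . v, loss = (pred - rating)^2. PyTorch's autograd would derive
# these same gradients for us; here the chain rule is spelled out.
u = np.array([0.1, -0.2, 0.3])   # user's latent factors
v = np.array([0.4, 0.1, -0.1])   # movie's latent factors
rating, lr = 4.0, 0.1

pred = u @ v
err = pred - rating
grad_u = 2 * err * v             # chain rule: d(pred)/du = v
grad_v = 2 * err * u             # chain rule: d(pred)/dv = u
u, v = u - lr * grad_u, v - lr * grad_v

print(float(pred), "->", float(u @ v))   # prediction moves toward the rating
```

Each step nudges both the user's and the movie's factors in the direction that most reduces the loss, which is exactly the "simple mechanics" Jeremy marvels at.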

So really the reason I wrote this was to say that I believe you are right: both learned embeddings and neural networks are doing something similar here, and that something is learning via gradient descent, which is amazing!
