If I understand it correctly, the embedding layer learns to represent words / movies / whatever in a vector space of arbitrary dimensionality. The vectors we arrive at have some really nice properties, meaning that concepts that are similar on some level should end up grouped together.

How is this achieved? How does the learning happen? In the MovieLens example, are we basically saying: “here are our users and movies, algorithm; learn to represent them somehow through embeddings so that the representations lend themselves well to predicting to what extent a user will like a movie they have not seen”?

I learned about distributed representations from Geoffrey Hinton’s kinship paper. But there we were mapping a one-hot encoding of a person to their representation in a low-dimensional space. And yes, the whole idea was that if a person was similar to another person in some useful way, they would end up grouped together along that axis.

Is the embedding layer just that? Is it just a fancy way of going from ‘objects’ denoted by their id (an integer) to their distributed representation in terms of inferred microfeatures, and the reason it seems magical is that somehow the step of one-hot encoding happens behind the curtain?
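As far as I can tell, yes: multiplying a one-hot vector by a weight matrix just selects one row of that matrix, so an embedding layer can skip the one-hot step and do a plain row lookup. A minimal numpy sketch of that equivalence (the matrix here is random rather than learned, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 5, 3                   # 5 objects, 3-dimensional embeddings
W = rng.normal(size=(num_items, dim))   # embedding matrix (learned in practice)

item_id = 2
one_hot = np.zeros(num_items)
one_hot[item_id] = 1.0

via_matmul = one_hot @ W   # explicit one-hot times matrix
via_lookup = W[item_id]    # what an embedding layer actually does: index a row

assert np.allclose(via_matmul, via_lookup)
```

So the “magic” is really just an efficiency trick: the lookup avoids materializing the one-hot vector and the full matrix multiply.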

What is the difference?! How does the embedding layer correspond to distributed representations / word2vec? To me it seems like they might just be elaborations on the same concept, but I am not sure what difference (if any) exists.

Found a quite nice description of this in the TF docs here. Guess there is no magic to the embedding layer! It just seemed a bit bizarre because it does the one-hot encoding of inputs behind the curtain, and the Keras docs are rather sparse.
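One consequence of the lookup view that helped me: during training, only the rows of the embedding table that were actually looked up in a batch receive a gradient. A toy sketch with a made-up squared-error loss (the table, target, and learning rate are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))       # toy embedding table: 4 ids, 2 dims
target = np.array([1.0, -1.0])    # made-up target vector, just for the demo
lr = 0.1

idx = 3
before = W.copy()

# forward: row lookup; loss: squared distance of that row to the target
v = W[idx]
grad = 2.0 * (v - target)         # d(loss)/dv

# backward: only the looked-up row gets updated
W[idx] -= lr * grad

changed = np.any(W != before, axis=1)
assert changed[idx] and not changed[[0, 1, 2]].any()
```

This sparse-update behaviour is exactly what you'd get from the one-hot-times-matrix formulation, since the gradient of the matmul is zero everywhere except the selected row.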

Here is what I understand.
It could be text, a movie, or a user: everything can be represented in a higher-dimensional space. If you take text, we have word2vec or GloVe representations. In text, those numbers may represent whether that word can be used in an angry conversation, whether it can occur in the middle of a sentence, etc. Think of this as PCA, where we are trying to incorporate as much data as possible into orthogonal vectors.
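The “similar things end up near each other” property is usually measured with cosine similarity between the vectors. A tiny sketch with made-up 3-d “word vectors” (the numbers and the words are invented; real word2vec/GloVe vectors have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: dot product of the vectors divided by their norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical word vectors; the axes have no fixed human-readable meaning
king  = np.array([0.90, 0.10, 0.30])
queen = np.array([0.85, 0.15, 0.35])
apple = np.array([0.10, 0.90, -0.20])

assert cosine(king, queen) > cosine(king, apple)
```

With trained embeddings, the same comparison is what lets you find nearest-neighbour words or movies.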

When it comes to movies and users, it’s the same principle: it’s all about how you can represent a movie or a user in a vector space. Since we do not have any GloVe-style pretrained representations for movies, we pick the number of attributes we want (we do not know what exactly each one represents) and run an optimization algorithm like SGD to minimize the prediction error.
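The steps above can be sketched as a tiny matrix-factorization loop: predict a rating as the dot product of a user embedding and a movie embedding, and nudge both with SGD to shrink the squared error. Everything here (sizes, ratings, learning rate) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_movies, dim = 3, 4, 2
U = rng.normal(scale=0.1, size=(n_users, dim))   # user embeddings
M = rng.normal(scale=0.1, size=(n_movies, dim))  # movie embeddings

# toy (user, movie, rating) observations -- invented data
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (2, 2, 3.0)]
lr = 0.05

for _ in range(2000):
    for u, m, r in ratings:
        err = U[u] @ M[m] - r        # prediction error for this pair
        gu = err * M[m]              # gradient w.r.t. the user embedding
        gm = err * U[u]              # gradient w.r.t. the movie embedding
        U[u] -= lr * gu
        M[m] -= lr * gm

# after training, predictions should sit close to the observed ratings
for u, m, r in ratings:
    assert abs(U[u] @ M[m] - r) < 0.3
```

A real recommender would add bias terms, regularization, and held-out validation, but the core idea is just this: the embeddings are whatever vectors make the dot products match the observed ratings.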

Say you decide to use 50 latent features. Are you assuming that you’ll get 50 ‘orthogonal vectors’? How can you be sure? You could have chosen 100 latent features, or 10…
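As I understand it, no: the PCA analogy breaks down here. PCA components are orthogonal by construction, but nothing in SGD training constrains embedding dimensions to be orthogonal (or even uncorrelated), and the number of latent features is just a hyperparameter you tune. A quick sketch of the contrast, using a random matrix as a stand-in for a trained embedding table:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))

# PCA directions (via SVD of centered data) are orthogonal by construction
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
gram_pca = Vt @ Vt.T
assert np.allclose(gram_pca, np.eye(5), atol=1e-8)

# An unconstrained matrix (stand-in for SGD-learned embeddings) has no such
# guarantee: its normalized columns are generically correlated.
E = rng.normal(size=(100, 5))
cols = E / np.linalg.norm(E, axis=0)
gram_emb = cols.T @ cols
assert not np.allclose(gram_emb, np.eye(5), atol=1e-2)
```

So choosing 50 vs 100 vs 10 features is an empirical trade-off (capacity vs overfitting), typically settled by validation error rather than by any orthogonality guarantee.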