If I understand it correctly, the embedding layer learns to represent words / movies / whatever in a vector space of whatever dimensionality we choose. The vectors we arrive at have some really nice properties; for instance, concepts that are similar on some level should end up grouped together.
How is this achieved? How does the learning happen? In the MovieLens example, are we basically saying: “here are our users and movies, algorithm; learn to represent them somehow through embeddings so that the representations lend themselves well to predicting to what extent a user will like a movie they have not seen”?
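To make concrete what I mean, here is roughly the kind of model I have in mind (a small sketch in PyTorch with made-up sizes, not the actual lesson code): user and movie IDs each go through an embedding layer, and the dot product of the two vectors is trained to match the observed rating.

```python
import torch
import torch.nn as nn

class DotProductModel(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)    # one vector per user id
        self.movie_emb = nn.Embedding(n_movies, n_factors)  # one vector per movie id

    def forward(self, user_ids, movie_ids):
        u = self.user_emb(user_ids)    # (batch, n_factors)
        m = self.movie_emb(movie_ids)  # (batch, n_factors)
        return (u * m).sum(dim=1)      # predicted rating for each (user, movie) pair

# hypothetical sizes and ratings, just to show the training signal
model = DotProductModel(n_users=1000, n_movies=2000)
pred = model(torch.tensor([3, 7]), torch.tensor([42, 99]))
loss = nn.functional.mse_loss(pred, torch.tensor([4.0, 3.5]))
loss.backward()  # gradients flow back into both embedding tables
```

So the only supervision the embeddings ever see is “make the dot product match the rating”, and the grouping of similar users/movies falls out of that.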
I learned about distributed representations from Geoffrey Hinton’s kinship paper. But there we were mapping a one-hot encoding of a person to their representation in a low-dimensional space. And yes, the whole idea was that if a person was similar to another person in some useful way, they would end up grouped together along that axis.
Is the embedding layer just that? Is it just a fancy way of going from ‘objects’ denoted by their id (an integer) to their distributed representation in terms of inferred microfeatures, and is the reason it seems magical simply that the one-hot encoding step happens behind the curtain?
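If that is right, then the check would be something like this (a small sketch, assuming PyTorch’s nn.Embedding; the sizes are made up): looking up row i of the embedding weight matrix should give exactly the same vector as multiplying a one-hot encoding of i by that matrix.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

idx = torch.tensor([3])
via_lookup = emb(idx)                                  # direct row lookup

one_hot = nn.functional.one_hot(idx, num_classes=10).float()
via_matmul = one_hot @ emb.weight                      # one-hot vector times weight matrix

print(torch.allclose(via_lookup, via_matmul))          # True
```

In other words, the lookup is just a more efficient way of doing the matrix multiply, and the one-hot step never has to be materialised.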
What is the difference?! How does the embedding layer correspond to distributed representations / word2vec? To me it seems like they might just be elaborations of the same concept, but I’m not sure whether a difference exists, and if so, what it is.