How do embeddings work (States of Germany example)?

I want to understand how embeddings work and how they are calculated.

In the example about embeddings for the states of Germany, I don’t understand how the state embedding is made.
To clarify my question, let’s make an analogy with the movie rating example in the Collab section of the book. Basically, we gave 5 latent features to the user ID (its embedding) and 5 latent features to the movie ID, and we used the movie rating data to learn those latent features (with SGD as the optimizer).

What I’m missing in the analogy is this: if the user ID corresponds to the states of Germany, what are the equivalents of the movie ID and the ratings?

You can take a look at the Rossmann Store Sales | Kaggle competition here. There, you want to predict the Rossmann sales from some tabular data (number of customers, open date, Competition Distance, Promo, Promo Interval, …). You then build a model using Entity Embedding: for each categorical feature you create a latent vector, then you concatenate all the latent vectors together into one long input feature vector. The embeddings are learned together with the rest of the model by SGD, with the sales as the target, so the sales play the role that the ratings play in the collab example, and the other features play the role of the movie ID.
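To make the "create a latent vector per categorical feature and stack them" step more concrete, here is a minimal NumPy sketch. It is not the fastai implementation: the cardinalities, embedding sizes, and sample rows are made up for illustration, and the embedding tables are just random numbers (in a real model they would be trained by SGD).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cardinalities and embedding sizes (assumptions, not from Rossmann)
n_states, state_dim = 12, 3
n_weekdays, weekday_dim = 7, 2

# One lookup table per categorical feature: row i is the latent vector of category i.
# Random here; in a real model these are the weights that SGD updates.
state_emb = rng.normal(size=(n_states, state_dim))
weekday_emb = rng.normal(size=(n_weekdays, weekday_dim))

# A made-up batch of 4 rows: category indices plus 2 continuous features
state_idx = np.array([0, 5, 5, 11])
weekday_idx = np.array([1, 1, 6, 3])
cont = np.array([[1270.0, 1.0], [570.0, 0.0], [570.0, 1.0], [14130.0, 0.0]])

# Look up each row's latent vectors and concatenate them with the continuous features
x = np.concatenate([state_emb[state_idx], weekday_emb[weekday_idx], cont], axis=1)
print(x.shape)  # (4, 7): 3 state dims + 2 weekday dims + 2 continuous features
```

Rows 1 and 2 have the same state index, so they share the same first 3 columns of `x`. That matrix `x` is what the rest of the network sees as input.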

To get the map of states you mentioned, I think we can apply PCA with 2 components to the state embedding vectors, then do some kind of clustering (or group the states that are closest to each other by calculating their Euclidean distance). What we then see is that states in the same cluster are also states that are close to each other geographically.
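A rough sketch of that idea, using plain NumPy (PCA via SVD) rather than any particular library. The embedding matrix here is random and only stands in for a trained one, and the state codes are just a small sample, so the neighbours it prints are meaningless; the point is only the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["BE", "BY", "HH", "NW", "SN"]   # a few state codes, for illustration
emb = rng.normal(size=(len(states), 6))   # stand-in for a trained state embedding matrix

# PCA with 2 components: center the data, then project onto the top-2 right singular vectors
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T              # shape (n_states, 2), one 2D point per state

# Pairwise Euclidean distances between the 2D points
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)            # ignore the distance of a state to itself

# Nearest neighbour of each state in the 2D projection
nearest = dist.argmin(axis=1)
for s, j in zip(states, nearest):
    print(s, "->", states[j])
```

With real trained embeddings you would plot `coords` and check whether nearby points correspond to neighbouring states.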

I wrote a blog post a long time ago, with an old version of fastai, explaining how the fastai tabular model works. You can take a look at it here: Reverse Tabular module of fast.ai v1 | Kaggle

Hope it helps

Thanks! That was very helpful.
Regarding PCA, it assumes a Gaussian distribution of the samples. I wonder why the Kaggle data would be Gaussian, or how I can know whether my data is Gaussian?
Also, what other methods are there to compute principal components for non-Gaussian data?

Do you mean using a Gaussian Mixture Model for clustering, as in https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/#:~:text=Gaussian%20Mixture%20Models%20(GMMs)%20assume,to%20a%20single%20distribution%20together ? I’m not sure the data is Gaussian; it’s very rare for real data like this to be Gaussian.

Also, I don’t think the data has to be Gaussian in order to calculate PCA.
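As far as I know, computing PCA only needs the mean and the covariance matrix of the data, and nothing in that computation requires a Gaussian distribution. A small check, as a sketch under made-up assumptions: the data below is drawn from uniform distributions (clearly not Gaussian) and stretched along a direction I chose myself, and the first principal component still lines up with that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-Gaussian toy data: uniform samples, wide along `direction`, narrow along `ortho`
direction = np.array([3.0, 4.0]) / 5.0    # unit vector chosen for this example
ortho = np.array([-4.0, 3.0]) / 5.0       # unit vector perpendicular to it
t = rng.uniform(-1.0, 1.0, size=500)
noise = rng.uniform(-0.05, 0.05, size=500)
data = np.outer(t, direction) + np.outer(noise, ortho)

# PCA by hand: eigen-decomposition of the sample covariance matrix
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# |cosine| between the first principal component and the true direction (close to 1)
alignment = abs(pc1 @ direction)
print(alignment)
```

So the projection itself works fine on non-Gaussian data; the Gaussian assumption only matters for some probabilistic interpretations of the result.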

For clustering, you can try other methods; have a look here: The 5 Clustering Algorithms Data Scientists Need to Know | by George Seif | Towards Data Science