Success with categorical entity embeddings?

(Philip Lynch) #1

Has anyone had success using embeddings for categorical variables (such as in the Rossmann notebook, lesson 14)?

I have been testing the approach on a predictive problem at work with 56 continuous and 8 categorical variables (68 categories in total), but haven’t seen any improvement over just using XGBoost with one-hot encoded variables. I’ve tested a ton of different architectures (# of layers, dropout %, batch norm) and the results all tend to be the same.
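
For reference, here’s roughly the kind of setup I’ve been testing (layer sizes, dropout and category cardinalities below are placeholders, not my actual values): each categorical variable gets its own small embedding, and the flattened embeddings are concatenated with the continuous inputs before the dense layers.

```python
# Rough sketch of the entity-embedding setup in Keras; all sizes are placeholders.
from tensorflow.keras.layers import (Input, Embedding, Flatten, Concatenate,
                                     Dense, Dropout, BatchNormalization)
from tensorflow.keras.models import Model

n_continuous = 56
cardinalities = [12, 7, 4, 31, 3, 5, 2, 4]   # made-up sizes for the 8 categoricals

cont_in = Input(shape=(n_continuous,), name="continuous")
cat_inputs, cat_embeddings = [], []
for i, card in enumerate(cardinalities):
    inp = Input(shape=(1,), name=f"cat_{i}")
    emb = Embedding(card, min(50, (card + 1) // 2))(inp)   # small embedding per variable
    cat_inputs.append(inp)
    cat_embeddings.append(Flatten()(emb))

# Concatenate continuous features with the flattened embeddings, then dense layers.
x = Concatenate()([cont_in] + cat_embeddings)
x = BatchNormalization()(x)
x = Dense(256, activation="relu")(x)
x = Dropout(0.3)(x)
x = Dense(128, activation="relu")(x)
out = Dense(1)(x)

model = Model([cont_in] + cat_inputs, out)
model.compile(optimizer="adam", loss="mse")
```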

My theory is that the continuous variables in my problem are providing most of the predictive power, and that the slight reduction in dimensionality for my categorical variables doesn’t do much. I was a little discouraged that this approach didn’t work here, so I was hoping other people might have some success stories to share.


Hey Phil,

I have nothing concrete to offer here other than that Lesson 14 got me excited to try this approach on high-dimensional categorical variables for use in downstream (linear) models. Have you looked at the variable importance rankings from xgboost to test your theory? One thing to note is that the xgboost algorithm tends to favor variables with lots of split points, so I think its importance rankings are somewhat susceptible to that weakness.

Another thing to note is that tree-based algorithms don’t actually need categorical variables to be one-hot encoded, so long as a sufficient number of trees of sufficient depth are grown; of course, the categories still need to be integer-encoded. I’d have to think about it a bit more, but perhaps that gives us a clue as to why entity embeddings aren’t all that useful for downstream tree-based algorithms in your case. Although, I seem to recall the Rossmann 3rd-place winners published some metrics in their paper to the contrary. It might be worth building a few quick and dirty models of another type (lasso, random forest?) to see if the embeddings offer an improvement.
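
To make those two points concrete, here’s a rough sketch with made-up file, column, and layer names (`train.csv`, `store`, `store_emb`, etc. are all hypothetical): integer codes are enough for xgboost, and the learned embedding matrix can be pulled out of a trained Keras model and joined on as features for a downstream model.

```python
import pandas as pd
import xgboost as xgb

df = pd.read_csv("train.csv")                    # hypothetical data
cat_cols = ["store", "day_of_week", "promo"]     # hypothetical categoricals

# (1) Tree models only need integer codes, not one-hot columns.
for c in cat_cols:
    df[c + "_code"] = df[c].astype("category").cat.codes

X = df[[c + "_code" for c in cat_cols] + ["some_continuous_var"]]
booster = xgb.XGBRegressor(n_estimators=500, max_depth=6).fit(X, df["target"])
print(dict(zip(X.columns, booster.feature_importances_)))   # variable importance check

# (2) Pull learned embeddings out of a trained Keras model (assumed here to have
# an Embedding layer named "store_emb") and join them on as features.
store_emb = model.get_layer("store_emb").get_weights()[0]    # (n_stores, emb_dim)
emb_df = pd.DataFrame(store_emb,
                      columns=[f"store_emb_{i}" for i in range(store_emb.shape[1])])
emb_df["store_code"] = range(len(emb_df))
df = df.merge(emb_df, on="store_code", how="left")
```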


(Philip Lynch) #3

I actually tested the xgboost model with only the continuous variables and it performed the same, so I’m thinking it’s just a case where the categorical variables don’t add much value. I should have checked that from the get-go (like you mentioned with the variable importance), but I got caught up in wanting to try the approach.

(keyu nie) #4

Maybe you should try mixing Wide and Deep together.

(Siddharth) #5

How will Wide and Deep help here?

(Rudy Gilman) #6

Same issue with me. I’m doing a regression problem at work: a NN à la Rossmann got RMSE to about 0.40, while xgboost got down to 0.35. Did you ever have any more luck on your end?

(Philip Lynch) #7

I realized that on the problem I was working on, I had already encoded a lot of the categorical information into other variables.

For example, I made a variable that took the value of last year’s target (the continuous variable I was forecasting, a positive real number) on holidays, and zero otherwise. Xgboost was easily able to segment out holiday vs. non-holiday effects from that, as well as bigger vs. smaller holidays based on the value. So when I went to use embeddings for holidays in my model, I was already capturing most of that information, and it didn’t help.

I had similar variables for special events and day of week, so at the end of the day, I think I had already done enough feature engineering to capture what the embeddings were meant to.
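
A rough sketch of that kind of lag feature, with made-up column names and assuming one row per date:

```python
import pandas as pd

# Hypothetical data with "date", "sales", and "holiday" columns.
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Look up last year's sales by shifting dates forward one year and merging back.
last_year = df[["date", "sales"]].copy()
last_year["date"] = last_year["date"] + pd.DateOffset(years=1)
last_year = last_year.rename(columns={"sales": "sales_last_year"})
df = df.merge(last_year, on="date", how="left")

# Last year's value on holidays, zero otherwise -- one column encodes both
# "is it a holiday" and "how big was that holiday".
df["holiday_sales_last_year"] = (df["sales_last_year"].fillna(0)
                                 .where(df["holiday"] == 1, 0.0))
```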

(Rudy Gilman) #8

Got it. Yeah, sounds like the same situation as mine. Thanks for the info!