Categorical Embedding - Grouping Low Counts?

Tchotchke · March 16, 2018, 1:38pm

With respect to Entity Embedding of categorical variables, is anyone aware of a reference that discusses what to do with low-count values of categorical variables? For example, if some values in a category only appear a few times (e.g., < 10), should those all be grouped together in one category?

Intuitively, that grouping makes sense to me, since it is what we do for word embeddings in NLP problems: for example, we only keep words that appear more than 10 times, and the rest are grouped together. However, I have not seen a discussion of this approach with respect to embedding of categorical variables.
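
To make the idea concrete, here is a minimal sketch of the grouping I have in mind, using only the standard library. The function names, the `__OTHER__` token, and the toy `cities` data are made up for illustration, and the threshold of 10 is just the example figure from above, not a recommended value.

```python
from collections import Counter

def group_rare_values(values, min_count=10, other_token="__OTHER__"):
    """Replace every value seen fewer than min_count times with other_token."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_token for v in values]

def build_index(grouped_values):
    """Map each remaining category to an integer id for an embedding lookup."""
    # Sorted so the mapping is deterministic across runs.
    return {v: i for i, v in enumerate(sorted(set(grouped_values)))}

# Hypothetical toy column: two frequent values and two rare ones.
cities = ["london"] * 12 + ["paris"] * 11 + ["oslo"] * 3 + ["riga"] * 1

grouped = group_rare_values(cities, min_count=10)
index = build_index(grouped)
ids = [index[v] for v in grouped]  # these ids would feed the embedding layer
```

With this, "oslo" and "riga" share one embedding row instead of each getting a row trained on a handful of examples. The counts should presumably come from the training set only, so that values unseen at training time can also be mapped to the same shared token.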

What do people think of that approach?

Supersak80 · January 6, 2019, 9:04am

I’m interested in this as well. Have you had any insights as to how to handle this?

Tchotchke · January 6, 2019, 6:29pm

I never found a good citation, but I did try it out on the TalkingData Kaggle competition and found that it led to a noticeable improvement in results. I’ll have to go back and find the exact improvement, but I think it was on the order of 5-10%.

Supersak80 · January 6, 2019, 9:34pm

Okay, good to know, thanks for the response. In a similar vein, have you seen anything regarding the handling of categorical variables with very high cardinality, on the order of tens of thousands or hundreds of thousands of distinct values? This is in conjunction with many of those values having low counts.