Categorical Embedding - Grouping Low Counts?


(Matthew Teschke) #1

With respect to entity embeddings of categorical variables, is anyone aware of a reference that discusses how to handle low-count values? For example, if some values of a categorical variable appear only a few times (e.g., < 10), should they all be grouped together into a single category?

Intuitively, that grouping makes sense to me, since it is what we do when embedding words in NLP problems: for example, we keep only words that appear more than 10 times and group all the others together. However, I have not seen this approach discussed for embeddings of categorical variables.
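
To make the idea concrete, here is a minimal sketch of the grouping I have in mind, using only the Python standard library. The names `RARE_TOKEN`, `build_vocab`, and `encode` are made up for illustration, and the threshold of 10 is just the example figure from above, not a recommended value.

```python
from collections import Counter

# Hypothetical placeholder label, playing the role of an NLP <unk> token.
RARE_TOKEN = "__rare__"


def build_vocab(values, min_count=10):
    """Give each value seen at least `min_count` times its own index.

    Index 0 is reserved for the shared bucket that holds every rare value.
    """
    counts = Counter(values)
    vocab = {RARE_TOKEN: 0}
    # Sort the kept values so index assignment is deterministic.
    for value in sorted(v for v, c in counts.items() if c >= min_count):
        vocab[value] = len(vocab)
    return vocab


def encode(values, vocab):
    """Map values to indices; rare or unseen values go to the shared bucket."""
    return [vocab.get(v, vocab[RARE_TOKEN]) for v in values]


# Toy data: "NYC" and "LA" are frequent, "Boise" and "Tulsa" are rare.
cities = ["NYC"] * 12 + ["LA"] * 15 + ["Boise"] * 2 + ["Tulsa"]
vocab = build_vocab(cities, min_count=10)

# "Reno" never appeared when the vocabulary was built.
codes = encode(cities + ["Reno"], vocab)
```

With this scheme the embedding table would need `len(vocab)` rows, and values that first show up at prediction time fall into the same bucket as the rare training values. Whether sharing one vector among all rare values helps or hurts is exactly the question I am asking.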

What do people think of that approach?