Embeddings of a Multi-Label Categorical Feature

I have a categorical column in my data where each row can hold multiple values.

column_x

  1. a,b
  2. a,c,d,e
  3. b
  4. e,c,d,f,j,k,l
  5. g,k,l,p

The column has more than 1,000 unique values. There are other features as well, and the dependent variable is binary.

How do I create an embedding for this column?

We could get an embedding for each individual value and then aggregate them. But how do I get the embedding for an individual element?

I would say create an indicator matrix with one column per unique value (a-z in a small example), i.e. row 1 will have a 1 in column a and column b and zeros everywhere else. Similarly, row 2 will have a 1 in columns a, c, d, e and zeros everywhere else. Then you can have an embedding layer for each of those columns.
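A minimal sketch of that indicator (multi-hot) matrix, using the five example rows from the question. The names `rows`, `vocab`, `index`, and `multi_hot` are made up for illustration, and it assumes the values in a cell are separated by commas with no surrounding spaces.

```python
# Example cells of column_x, copied from the question.
rows = ["a,b", "a,c,d,e", "b", "e,c,d,f,j,k,l", "g,k,l,p"]

# Split each cell into the set of values it contains.
parsed = [set(cell.split(",")) for cell in rows]

# Vocabulary: every distinct value, sorted so the column order is stable.
vocab = sorted(set().union(*parsed))
index = {value: i for i, value in enumerate(vocab)}


def multi_hot(values):
    """Return a 0/1 vector with a 1 at the position of each value present."""
    vec = [0] * len(vocab)
    for value in values:
        vec[index[value]] = 1
    return vec


# One indicator row per original row, one column per unique value.
indicator = [multi_hot(values) for values in parsed]

for cell, vec in zip(rows, indicator):
    print(cell, vec)
```

With more than 1,000 unique values each indicator row gets long and mostly zero, so a sparse representation is probably worth considering before feeding it to a model.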

Would love to hear other suggestions too.

I have the same question. Did you figure out how to handle it?

I created a copy of the row for every value in the categorical column, so every row has only one value instead of multiple values. Then I trained the embedding vectors on this larger dataset.

When scoring the model on the original data, aggregate the vectors using mean, sum, etc.

This approach results in information loss, but it is the best I have for now.
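Here is a rough sketch of the explode-then-aggregate idea, again on the example rows from the question. The embedding table is filled with random numbers purely as a stand-in for vectors a trained model would produce; `DIM`, `embedding`, and `pool` are hypothetical names, and the training step itself is not shown.

```python
import random

# Example cells of column_x, copied from the question.
rows = ["a,b", "a,c,d,e", "b", "e,c,d,f,j,k,l", "g,k,l,p"]

# Step 1: explode, so each (row_id, value) pair becomes its own row.
exploded = [
    (row_id, value)
    for row_id, cell in enumerate(rows)
    for value in cell.split(",")
]

# Step 2: one vector per unique value.
# ASSUMPTION: random numbers stand in for embeddings learned on the exploded data.
DIM = 4
rng = random.Random(0)
vocab = sorted({value for _, value in exploded})
embedding = {value: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for value in vocab}


def pool(values, mode="mean"):
    """Aggregate the vectors of all values in one original row."""
    vectors = [embedding[value] for value in values]
    summed = [sum(component) for component in zip(*vectors)]
    if mode == "sum":
        return summed
    return [s / len(vectors) for s in summed]


# Step 3: at scoring time, collapse back to one vector per original row.
row_vectors = [pool(cell.split(",")) for cell in rows]
```

Mean and sum pooling both throw away which values co-occurred in a row, which is one source of the information loss mentioned above. If you are using a deep learning framework, a pooled embedding layer (for example `torch.nn.EmbeddingBag` in PyTorch) can learn the table and do the aggregation in one step, so the rows do not have to be duplicated.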

Would love to hear other suggestions too.

Did you figure out a solution for this at some point, or is it still a problem?