Category embeddings across columns

Simon-lawyer · June 15, 2022, 4:45pm

Hi everyone,

I’m a legal researcher dipping my toe into the world of ML. I have a question about using categorical embeddings across different columns: is it possible/easy to implement for a novice?

Take this example. In some courts a matter might be heard by a panel of (for example) one, two, or three judges. When there is only one judge, I can load in a column of the judge and ask Fast.ai to create an embedding for each separate judge identity. But what do I do when the number of judges increases and I want to draw on each different judge’s embedding across different columns? Put differently, if Person A might show up in column 1, 2, or 3, how do I tell Fast.ai that Person A is the same person?

My question is similar to this one asked in 2019.

Thanks!

zonkyo · June 15, 2022, 5:33pm

Hi!

There are several options you could follow.

1. Keep all the judges in one column. If there is only one judge, it can be embedded directly. If there are several, separate them with commas, sort them lexicographically, and embed the resulting string. This handles one, two, three, or any number of judges. Each distinct string (each unique row value) becomes one category, and the embedding size is either a value you specify or, quoting straight from the fastai docs:

Through trial and error, this general rule takes the lower of two values:

A dimension space of 600

A dimension space equal to 1.6 times the cardinality of the variable to 0.56.

("To 0.56" here means raised to the power 0.56.)

Why order lexicographically? Otherwise Judge Santa and Judge Claus could appear once as Judge 1 and Judge 2 and another time as Judge 2 and Judge 1, and the two orderings would get different embeddings. That might be interesting if they have different powers depending on the position they take; I know nothing about law :). If that can happen, the order is worth keeping; otherwise, see below.

2. If you are sure that there is a fixed maximum of three judges, you could also go with three columns and use an embedding for every column. Potential advantage: smaller embedding space. Handling of single judges? Set the others to None/NaN; provided you use FillMissing as one of the processors, there should not be any problem. Person A could then appear in any column, and every column would have its own embedding for Person A. The embeddings would most likely differ between columns, but Person A in column X would always get the same embedding in column X.

3. You could also apply one-hot encoding directly to your DataFrame and skip the embedding. This might cause your DataFrame to explode in width, which may or may not slow down the network.
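For the one-hot route, here is a minimal sketch using plain pandas. The toy data and the column names `judges` and `outcome` are made up for illustration, and I am assuming the judges arrive as one comma-separated string per case. Each judge gets a single indicator column, so a judge is marked the same way regardless of the position they were listed in.

```python
import pandas as pd

# Hypothetical toy data: one comma-separated 'judges' string per case.
df = pd.DataFrame({
    "judges": ["Santa, Claus", "Claus", "Claus, Santa, Rudolph"],
    "outcome": [1, 0, 1],
})

# Drop the whitespace around commas so "Claus" and " Claus" are the same token,
# then build one 0/1 indicator column per judge.
indicators = (
    df["judges"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(sep=",")
    .add_prefix("judge_")
)

# Replace the original string column with the indicator columns.
encoded = pd.concat([df.drop(columns="judges"), indicators], axis=1)
print(encoded)
```

The width of the result grows with the number of distinct judges, which is the "explode" risk mentioned in the list above.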

Honestly, I would most likely prefer the first option, with ordering. So, assuming you have ordered the judges and laid out the columns nicely, you could start with:

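from fastai.tabular.all import *  # TabularPandas, cont_cat_split, FillMissing, Categorify, RandomSplitter, range_of

# Data is in df; the panel members (the judges from the question) are in column 'lawyers'
# as a single string: "string, string, string" OR "string, string" OR "string".
# str.split(",") splits into lists such as [1, 2, 3], [1, 2], [1]; each name is stripped of
# surrounding whitespace, the list is sorted, and ", ".join puts it back together.
df = df.assign(lawyers=df.lawyers.str.split(",").apply(lambda names: ", ".join(sorted(n.strip() for n in names))))

# Next, split the columns into continuous and categorical ones, which allows the categories to be embedded.
continuous_columns, categorical_columns = cont_cat_split(df, dep_var=LIST_OF_DEPENDENT_VARS)

# Assumption: a random 80/20 train/validation split; use whatever split suits your data.
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

to = TabularPandas(df,
  procs=[FillMissing, Categorify], # this is where the magic happens
  cont_names=continuous_columns,
  cat_names=categorical_columns,
  y_names=LIST_OF_DEPENDENT_VARS,
  splits=splits
)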
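To see what the ordering step buys you, here is a small self-contained sketch on made-up data (the names and the helper `emb_size` are hypothetical). It normalizes the `lawyers` column the same way as above, counts the distinct panels, and applies the size rule quoted earlier. The rounding inside `emb_size` is my assumption about how the rule is applied; as far as I know it mirrors fastai's `emb_sz_rule`, but check the library source before relying on it.

```python
import pandas as pd

def emb_size(n_categories: int) -> int:
    """Lower of 600 and 1.6 * cardinality ** 0.56 (rounding is an assumption)."""
    return min(600, round(1.6 * n_categories ** 0.56))

# Hypothetical toy data: the same two-person panel written in both orders.
df = pd.DataFrame({"lawyers": ["Santa, Claus", "Claus, Santa", "Rudolph"]})

# Strip whitespace, sort the names, and join them back into one canonical string.
normalized = df.lawyers.str.split(",").apply(
    lambda names: ", ".join(sorted(n.strip() for n in names))
)

n_panels = normalized.nunique()
print(normalized.tolist())
print(n_panels, emb_size(n_panels))
```

Without the sort, the first two rows would count as two different categories even though the panel is the same.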

Does this help?

Simon-lawyer · June 15, 2022, 5:43pm

This is great.

I’ve tried #2 and #3 (good to know that I’m on the right path!). An advantage of your proposed solution #1 is that it might capture how a panel of specific judges might be more than the sum of their individual parts. Let me try implementing.

Thank you for taking the time!