Entity Embeddings Parameter Tying

liammmm · February 21, 2020, 6:37am

I have a tabular dataset with two categorical columns, Person1 and Person2, each of which I am embedding as a 50-dimensional vector. Ideally, whether a person shows up as Person1 or Person2 in a particular row shouldn’t make a difference, so I would like these two variables to have the same embedding. Is it possible to tell the tabular_learner that these embeddings should share parameters?

muellerzr · February 21, 2020, 6:42am

Yes, you can pass an embedding dictionary into it. It’s emb_szs.

liammmm · February 21, 2020, 6:47am

I know I can make them both the same size (i.e. 50), but I actually want them to have the same weights as well, both initially and after each SGD update.

muellerzr · February 21, 2020, 6:55am

You could define a relationship between the two columns and combine them into one, i.e. A and B turns into AB if both are present, A if just one, etc. Otherwise others may chime in on the weights themselves. (You may need to adjust the TabularModel to get the weight idea working.)

liammmm · February 21, 2020, 7:14am

A person, let’s call him Zach, may show up in the Person1 column 100 times and in the Person2 column 100 times. Another person, let’s call him Liam, may similarly show up 100 times in each column. But the two of them may appear together in the same row only once or twice in the whole dataset. As we would have very few examples of each pairing, I don’t think your solution would be particularly ideal.

My current approach is to show my model 2 copies of the dataset, one unchanged and one with Person1 and Person2 swapped. This is my best approximation, but the model will still come up with different embeddings for Person1 and Person2. Thus Zach has a different embedding depending on whether he shows up in the Person1 or Person2 column, which isn’t ideal, because there is only one Zach, and he is a unique snowflake.
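The swap-and-duplicate approach described above can be sketched with pandas. This is a minimal illustration, not the poster's actual code: the toy rows and the `target` column are made up, and it assumes the target does not depend on which person is listed first (a directional target would need to be flipped in the swapped copy too).

```python
import pandas as pd

# Toy data; names and target values are invented for illustration.
df = pd.DataFrame({
    "Person1": ["Zach", "Liam", "Zach"],
    "Person2": ["Liam", "Jeremy", "Jeremy"],
    "target": [1, 0, 1],
})

# Swap the two person columns, then restore the original column order.
swapped = df.rename(columns={"Person1": "Person2", "Person2": "Person1"})[df.columns]

# Stack the unchanged and swapped copies into one training frame.
doubled = pd.concat([df, swapped], ignore_index=True)
```

A side effect is that both columns of `doubled` contain the same set of people. The two columns still get separate embedding matrices unless the weights are tied explicitly.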

nestorDemeure · February 21, 2020, 7:38am

If you look at the tabular model implementation, and in particular where category ids are turned into embeddings, you should be able to copy/modify the code so that you use a single embedding dictionary for both columns.

liammmm · February 22, 2020, 1:22am

Thanks nestorDemeure.

I’d guess the relevant line in the TabularModel implementation code is:

    self.embeds = nn.ModuleList([embedding(ni, nf) for ni,nf in emb_szs])

Are you able to point me in the right direction on how I might modify this?

jeremy · February 22, 2020, 4:28am

If you want, for instance, the first embedding to be the same as the second, just add this line after the above:

    self.embeds[0] = self.embeds[1]
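To see what that one-line change does, here is a minimal PyTorch sketch. `TinyTabular` is a hypothetical stand-in, not fastai's TabularModel, and it uses plain `nn.Embedding` in place of fastai's `embedding` helper. It assumes both columns have the same number of categories and that index i refers to the same person in both columns; otherwise tying the weights would not be meaningful.

```python
import torch
import torch.nn as nn

class TinyTabular(nn.Module):
    """Hypothetical stand-in for TabularModel: one embedding per categorical column."""
    def __init__(self, emb_szs, n_out=2):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        # Tie column 0 to column 1: both slots now hold the same module object.
        self.embeds[0] = self.embeds[1]
        self.head = nn.Linear(sum(nf for _, nf in emb_szs), n_out)

    def forward(self, x_cat):
        # Look up each column with its (possibly shared) embedding and concatenate.
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        return self.head(x)

torch.manual_seed(0)
model = TinyTabular([(10, 50), (10, 50)])
x_cat = torch.randint(0, 10, (8, 2))   # 8 rows, 2 categorical columns
y = torch.randint(0, 2, (8,))

opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(model(x_cat), y)
loss.backward()
opt.step()

# The two slots share one weight tensor, so every update affects both columns.
tied = model.embeds[0].weight is model.embeds[1].weight
# parameters() yields the shared weight once: embedding, head weight, head bias.
n_param_tensors = len(list(model.parameters()))
```

Because there is only one weight tensor, gradients from both columns accumulate into it and a single optimizer step updates the embedding for both.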

liammmm · February 22, 2020, 5:51am

Thanks Jeremy!

Is it enough to simply write:

    learn = tabular_learner(data, layers=layers, emb_szs=emb_szs, emb_drop=emb_drop, ps=ps, metrics=accuracy)
    learn.model.embeds[0] = learn.model.embeds[1]
    learn.fit_one_cycle(epochs, lr, wd=wd, callbacks=[SaveModelCallback(learn, every='improvement', monitor='valid_loss', name=name)])

Or do I actually need to define a new class TabularModel2, which is identical to TabularModel except for the additional line you suggested, and then define a new tabular_learner function which takes TabularModel2 in place of TabularModel?
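In plain PyTorch, tying after construction works as long as the optimizer is created afterwards. Whether a fastai learner has already captured the old parameters by that point is not something this sketch verifies. The helper `tie_embeddings` below is hypothetical (not a fastai or PyTorch function) and operates on a bare `nn.ModuleList` rather than a real learner.

```python
import torch
import torch.nn as nn

def tie_embeddings(embeds, src, dst):
    """Hypothetical helper: make slot `dst` reuse the module stored in slot `src`."""
    a, b = embeds[src], embeds[dst]
    if a.weight.shape != b.weight.shape:
        raise ValueError(
            f"shape mismatch: {tuple(a.weight.shape)} vs {tuple(b.weight.shape)}"
        )
    embeds[dst] = a

embeds = nn.ModuleList([nn.Embedding(10, 50), nn.Embedding(10, 50)])
tie_embeddings(embeds, src=1, dst=0)

# Build the optimizer only after tying, so it tracks the shared weight.
opt = torch.optim.SGD(embeds.parameters(), lr=0.1)
n_tracked = sum(len(g["params"]) for g in opt.param_groups)

# The same category index now yields the same vector in both columns.
idx = torch.tensor([3, 7])
same_vectors = torch.equal(embeds[0](idx), embeds[1](idx))
```

A quick check such as `learn.model.embeds[0] is learn.model.embeds[1]` before and after training would confirm whether the tie held in the real model.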

ecatkins · February 27, 2021, 3:28am

Hey Liam, did you end up working through this? I’m trying to implement something similar myself.

The other thing I’m interested in is how you handle the case where the vocabs of (to use your example) Person1 and Person2 aren’t exactly the same, e.g. if there is a person who only appears in one of the two columns.
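One way to handle mismatched vocabularies is to give both columns the same category list before any preprocessing, so a given person maps to the same integer code in either column. This is a sketch with made-up names using pandas only; whether fastai's own categorification step preserves pre-set categories is an assumption to check.

```python
import pandas as pd

# Toy data: "Edward" appears only in Person2, "Zach" only in Person1.
df = pd.DataFrame({
    "Person1": ["Zach", "Liam", "Zach"],
    "Person2": ["Liam", "Jeremy", "Edward"],
})

# Shared vocabulary: the sorted union of names from both columns.
people = sorted(set(df["Person1"]) | set(df["Person2"]))

# Encode both columns against the same category list.
for col in ["Person1", "Person2"]:
    df[col] = pd.Categorical(df[col], categories=people)

# Integer codes are now comparable across the two columns.
codes = {col: df[col].cat.codes.tolist() for col in ["Person1", "Person2"]}
```

With a shared vocabulary, both embedding tables have the same number of rows and the same row-to-person mapping, which is what a tied embedding needs.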

liammmm · February 27, 2021, 3:51am

Hi Edward, it’s been a long while since I looked at that. Can you just ‘double’ your data so you have every observation twice, once as-is and once with the Person1 and Person2 variables (and any other 1,2 variables) swapped?

ecatkins · February 28, 2021, 6:46pm

Thanks! I was trying to get in the weeds to modify the code where the vocab is created, and wasn’t having a lot of success. So that solution sounds like a good workaround.