I have a tabular dataset with two categorical columns, Person1 and Person2, each of which I am embedding as a 50-dimensional vector. Ideally, whether a person shows up as Person1 or Person2 in a particular row shouldn’t make a difference, so I would like these two variables to have the same embedding. Is it possible to tell the tabular_learner that these embeddings should share parameters?
Yes, you can pass it a dictionary of embedding sizes. The argument is emb_szs.
I know I can make them both the same size (i.e. 50), but I actually want them to have the same weights as well, both initially and after each SGD update.
You could define a relationship between the two columns and combine them into one, i.e. A and B turn into AB if both are present, A if just one, etc. Otherwise others may chime in on the weights themselves. (You may need to adjust the TabularModel to get the weight idea working.)
A person, let’s call him Zach, may show up in the Person1 column 100 times and in the Person2 column 100 times. Another person, let’s call him Liam, may similarly show up 100 times in each column. But together they may show up in the same row only once or twice in the whole dataset. As we would have very few examples of each pairing, I don’t think your solution would be particularly ideal.
My current approach is to show my model 2 copies of the dataset, one unchanged and one with Person1 and Person2 swapped. This is my best approximation, but the model will still come up with different embeddings for Person1 and Person2. Thus Zach has a different embedding depending on whether he shows up in the Person1 or Person2 column, which isn’t ideal, because there is only one Zach, and he is a unique snowflake.
If you look at the tabular model implementation, and in particular where category ids are turned into embeddings, you should be able to copy/modify the code so that you use a single embedding for both columns.
I’d guess the relevant line in the TabularModel implementation code is:
self.embeds = nn.ModuleList([embedding(ni, nf) for ni,nf in emb_szs])
Are you able to point me in the right direction to how I might modify this?
If you want, for instance, the first embedding to be the same as the second, just add this line after the above:
self.embeds[0] = self.embeds[1]
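To make the suggestion concrete, here is a minimal sketch in plain PyTorch, outside of fastai, of what assigning one slot of an nn.ModuleList to another does. The sizes and the toy batch are made up for illustration, and it assumes both columns use the same id for the same person. It has not been checked against tabular_learner itself.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 1000 distinct people, 50-dimensional embeddings.
n_people, emb_dim = 1000, 50

# Mirrors the TabularModel line: one embedding per categorical column.
embeds = nn.ModuleList([nn.Embedding(n_people, emb_dim) for _ in range(2)])

# Tie column 0 (Person1) to column 1 (Person2): both slots now hold the same module.
embeds[0] = embeds[1]

# The shared weight is only listed once, so an optimizer would see a single tensor.
shared_params = list(embeds.parameters())

# Toy batch (assumed ids): column 0 is Person1, column 1 is Person2.
x_cat = torch.tensor([[3, 7], [7, 3]])
out = torch.cat([e(x_cat[:, i]) for i, e in enumerate(embeds)], dim=1)

# Gradients from both columns accumulate into the one weight matrix.
out.sum().backward()
```

With the weights tied like this, person 3 gets the same 50 numbers whichever column he appears in, and every SGD step updates that single vector.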
Is it enough to simply write:
learn = tabular_learner(data, layers=layers, emb_szs=emb_szs, emb_drop=emb_drop, ps=ps, metrics=accuracy)
learn.model.embeds[0] = learn.model.embeds[1]
learn.fit_one_cycle(epochs, lr, wd=wd, callbacks=[SaveModelCallback(learn, every='improvement', monitor='valid_loss', name=name)])
Or do I actually need to define a new class TabularModel2, which is identical to TabularModel except for the additional line you suggested, and then define a new tabular_learner function which takes TabularModel2 in place of TabularModel?
Hey Liam, did you end up working through this? I’m trying to implement something similar myself
The other thing I’m interested in is how you handle the case where the vocabs of (to use your example) Person1 and Person2 aren’t exactly the same, e.g. if there is a person who only appears in one of the two columns.
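One possible way to handle mismatched vocabs, sketched here with pandas only, is to build a single category list from the union of both columns before the data reaches the learner. The toy names are made up, and whether fastai's own preprocessing keeps these categories as-is (or shifts the codes to reserve a slot for missing values) is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical toy data: "Edward" only appears as Person1, "Mia" only as Person2.
df = pd.DataFrame({
    "Person1": ["Zach", "Liam", "Edward"],
    "Person2": ["Liam", "Zach", "Mia"],
})

# Build one vocabulary from the union of both columns.
vocab = sorted(set(df["Person1"]) | set(df["Person2"]))

# Give both columns the same category list so the integer codes line up.
for col in ["Person1", "Person2"]:
    df[col] = pd.Categorical(df[col], categories=vocab)

# Same person, same code, regardless of column.
codes = {col: df[col].cat.codes.tolist() for col in ["Person1", "Person2"]}
```

Because both columns share the full list, a person who only ever appears in one column still has a valid id in the other.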
Hi Edward, it’s been a long while since I looked at that. Can you just ‘double’ your data so you have every observation twice, and swap the Person1 and Person2 variables (and any other 1,2 variables)?
Thanks! I was trying to get into the weeds to modify the code where the vocab is created, and not having a lot of success. So that solution sounds like a good workaround.
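A rough pandas sketch of that doubling, with a made-up three-column frame. It assumes the target does not depend on which person is listed first. Any other paired columns would need to be added to the rename mapping.

```python
import pandas as pd

# Hypothetical toy data with a target that is symmetric in the two people.
df = pd.DataFrame({
    "Person1": ["Zach", "Liam"],
    "Person2": ["Liam", "Edward"],
    "target": [1, 0],
})

# Swap the two columns by renaming each to the other.
swapped = df.rename(columns={"Person1": "Person2", "Person2": "Person1"})

# Stack original and swapped rows; concat aligns them by column name.
doubled = pd.concat([df, swapped], ignore_index=True)
```

After doubling, both columns contain exactly the same set of people, which also sidesteps the mismatched-vocab problem. It is probably worth keeping each row and its swapped copy on the same side of the train/validation split.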