How do I calculate the cosine similarity of words in an embedding matrix?

Here’s my ultimate goal:

Create a search feature for a web app that shows the information most relevant to what the user types in the search bar. In other words, I have a bunch of tables, and if the user types something like “plants”, I want all the tables containing words similar to “plant” to show up. This could reduce the need to label each table (or at least reduce the workload).

Now, I’m going to do transfer learning on my dataset. From what I understand, I can then use the new embedding matrix to find out which words are most similar to a given word. How would I do this with the fastai library?

Thanks!

Check out the collab chapter of fastbook.
Specifically, the Embedding Distance section shows what you are looking for.


Thanks for the quick reply! Let me see if I understand:

Get the matrix of learned embedding vectors, one per movie (?):

movie_factors = learn.model.i_weight.weight

Find the index of a given movie (here, The Silence of the Lambs):

idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']

Calculate the cosine similarity between the movie at idx and every other movie in the dataset:

distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])  # nn is torch.nn

Sort by similarity in descending order, then print the most similar movie (skipping the movie itself):

idx = distances.argsort(descending=True)[1]  # [0] is the query movie itself
dls.classes['title'][idx]

This seems like exactly what I want to do with text. However, I’m still confused about the equivalent in fastai.text: there is no i_weight.weight in fastai.text.learner, so how would I go about getting the word vectors?

You have to look inside learner.model, like so:
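For example, with a language_model_learner using the default AWD_LSTM architecture, the token embedding lives at learn.model[0].encoder (for a text_classifier_learner it sits one level deeper, at learn.model[0].module.encoder — print learn.model to confirm your exact layout). A minimal sketch, assuming a trained learn and that the query word is in the vocab ('plant' is just an illustrative token):

import torch.nn as nn

# Token embedding matrix: one row per token in the vocab
word_factors = learn.model[0].encoder.weight

# Row index of the query word
vocab = list(learn.dls.vocab)
idx = vocab.index('plant')

# Cosine similarity between 'plant' and every token in the vocab
distances = nn.CosineSimilarity(dim=1)(word_factors, word_factors[idx][None])

# Entry 0 is the word itself, so take the next-closest token
nearest = distances.argsort(descending=True)[1]
vocab[nearest.item()]

It’s the same recipe as the collab example above; only the place where the embedding weights live changes.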


Fantastic! Thank you. I will try this once I get to this point in the project. :+1:
