How to calculate cosine similarity of words in embedding matrix?

Here’s my ultimate goal:

Create a search feature for a web app that shows the information most relevant to what the user types in the search bar. In other words, I have a bunch of tables, and if the user types something like “plants”, I want all the tables that contain words similar to “plant” to show up. This could reduce the need to label each table by hand (or at least reduce the workload).

Now, I’m going to do transfer learning on my dataset. From what I understand, I can then use the new embedding matrix to find out which words are the most similar. How would I do this with the fastai library?


Check out the collab chapter of fastbook.
Specifically the Embedding Distance paragraph shows what you are looking for.

Thanks for the quick reply! Let me see if I understand:

Create list of word vectors (?):

movie_factors = learn.model.i_weight.weight

Find index of movie x (in this case, SotL):

idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']

Calculate cosine similarity between the movie at idx and every other movie in dataset:

distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])

Sort by descending similarity and take the closest match (index 0 is the movie itself, so use index 1):

idx = distances.argsort(descending=True)[1]
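
To check my understanding, here is a minimal self-contained sketch of the same steps in plain PyTorch. The vocabulary, the vectors, and the `most_similar` helper are all made up for illustration; they only stand in for the real embedding matrix and class list.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary and embedding matrix (one row per item).
vocab = ["plant", "tree", "flower", "car"]
factors = torch.tensor([
    [1.0, 0.9, 0.0],  # plant
    [0.9, 1.0, 0.1],  # tree
    [0.8, 0.7, 0.3],  # flower
    [0.0, 0.1, 1.0],  # car
])

def most_similar(query, k=2):
    """Return the k items whose vectors are closest to `query` by cosine similarity."""
    idx = vocab.index(query)
    # Compare the query row against every row; [None] adds a batch dimension.
    sims = nn.CosineSimilarity(dim=1)(factors, factors[idx][None])
    order = sims.argsort(descending=True).tolist()
    # Drop the query itself explicitly instead of relying on it being first.
    neighbours = [i for i in order if i != idx][:k]
    return [(vocab[i], sims[i].item()) for i in neighbours]

result = most_similar("plant")
print(result)
```

With these toy vectors, "tree" and "flower" rank above "car" for the query "plant", which is the behaviour I am hoping to get from real word vectors.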

This seems like exactly what I want to do with text. However, I’m still confused as to what is the equivalent in fastai.text. In other words, there is no i_weight.weight in fastai.text.learner so how would I go about getting the word vectors?

You have to look inside learner.model, like so:
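
As a rough sketch of that kind of inspection, using a small stand-in PyTorch model rather than a real fastai text learner (the layer sizes and the `find_embeddings` helper are made up for illustration): printing the model lists its layers, and any `nn.Embedding` module it contains holds the word vectors in its `weight`. In a real learner the layer names and nesting will differ, so treat this as a pattern, not the exact path.

```python
import torch.nn as nn

# Stand-in for learner.model: a tiny model with one embedding layer.
# Sizes are hypothetical (10 tokens, 4-dimensional vectors).
model = nn.Sequential(
    nn.Embedding(num_embeddings=10, embedding_dim=4),
    nn.Linear(4, 2),
)

# Printing a PyTorch module shows its layers and their shapes.
print(model)

def find_embeddings(model):
    """Collect every nn.Embedding in the model, keyed by its module name."""
    return {
        name: module
        for name, module in model.named_modules()
        if isinstance(module, nn.Embedding)
    }

embs = find_embeddings(model)
for name, emb in embs.items():
    print(name, tuple(emb.weight.shape))

# One row per token; detach so the tensor can be used outside of training.
word_vectors = embs["0"].weight.detach()
```

Once you have that weight matrix, the cosine similarity steps from the collab example apply unchanged, with row indices mapped back to tokens through the vocab.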

Fantastic! Thank you. I will try this once I get to this point in the project. :+1:

