Can ULMFiT be used for semantic similarity?

(Zach Eberhart) #1

I’m currently working on a project where I’m testing various NLP models on the task of semantic similarity.

I know that Word2Vec, GloVe, FastText, and Google’s USE are all fairly adept at this task, but I wanted to try some of the latest SOTA methods. However, when I began trying BERT and ELMo, someone mentioned that those models aren’t meant for semantic similarity.

So if I were to build an LM using ULMFiT and get the embeddings, could I use them to calculate semantic similarity (using cosine similarity, etc.)?
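For concreteness, this is the usual recipe: mean-pool the per-token embeddings into one vector per text, then take the cosine of the angle between the pooled vectors. A minimal sketch (the random matrices below just stand in for a real encoder’s 400-dim ULMFiT outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_embedding(token_vectors):
    """Mean-pool a (num_tokens, dim) matrix into a single sentence vector."""
    return np.mean(token_vectors, axis=0)

# Toy 400-dim embeddings standing in for ULMFiT encoder outputs
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(5, 400))   # 5 tokens
doc_b = rng.normal(size=(7, 400))   # 7 tokens
score = cosine_similarity(sentence_embedding(doc_a), sentence_embedding(doc_b))
```

Mean pooling is the simplest choice; max pooling or using the final hidden state are common alternatives.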

Thanks!

2 Likes

(Michael) #2

This could be interesting for you:


With this you can easily get BERT-encoded text and then use the output for a similarity search.

This should also work with ULMFiT, but I haven’t come across an implementation.

1 Like

(Zach Eberhart) #3

Thank you, that is very helpful! I will dig into the code to see what they’re doing and if it’ll transfer to ULMFiT :slight_smile:

0 Likes

(Michael) #4

I also stumbled upon this some time ago for similarity search and clustering of dense vectors:

I also remember that Jeremy recommended a Python library for this in a past part 2 course, but I didn’t find the name in my notes.

Maybe somebody knows the name of the library?

0 Likes

(Zach Eberhart) #5

So I looked through the code and it doesn’t seem like they are doing anything particularly special (they’re really just providing an API for using BERT embeddings).

Do you know why people don’t recommend using BERT or ELMo for semantic similarity? One thing that I’ve read is that they aren’t trained on it, but then why can we use FastText? And how might I train one of these models on similarity?

0 Likes

(Michael) #6

Do you have a source for the discussions on why they don’t recommend it?

I’m not an expert, but my guesses would be:

  • Always start with a simple solution first.
  • It will depend highly on the data these models were trained on. If the data doesn’t help them learn the similarities, it won’t work.
  • Using such a big model just to get embeddings can be an over-engineered solution.

Before training your own model, I would first test, for example, the BERT-as-a-service model to check the semantic similarities.
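Once you have sentence vectors from any encoder, the check itself is just nearest-neighbor ranking by cosine similarity. A hedged sketch with random 768-dim vectors standing in for real BERT sentence encodings:

```python
import numpy as np

def rank_by_similarity(query_vec, corpus_vecs):
    """Rank corpus rows by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)          # best match first
    return order, scores[order]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10, 768))      # stand-ins for 768-dim BERT vectors
query = corpus[3] + 0.01 * rng.normal(size=768)  # near-duplicate of row 3
order, scores = rank_by_similarity(query, corpus)
```

If the near-duplicate doesn’t come back as the top hit on your real data, the embeddings probably aren’t capturing the similarity you care about.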

1 Like

(Abhimanyu) #7

Hi zache, I wanted to ask how you got the embeddings of the vocabulary after training the model. When we run ‘learn.model??’ it shows that the first layer of the neural net is the embedding layer of size (vocabulary size, 400), but I wasn’t able to get those embeddings separately after training the model. I also want to use them for computing similarities between different terms.

0 Likes

(Bobak Farzin) #8

You can get embedding weights like any other weight in a PyTorch model.

In AWD_LSTM, it looks like:
learn.model[0].encoder.weight

You can also pull it out of the state_dict() if you prefer:
learn.model.state_dict()['0.encoder.weight']
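Once you have the weight matrix, term-to-term similarity is just cosine similarity between its rows. A sketch in plain PyTorch, where a tiny `nn.Embedding` and a hypothetical `vocab` dict stand in for fastai’s `learn.model[0].encoder` and `data.vocab`:

```python
import torch
import torch.nn as nn

# Tiny stand-in for ULMFiT's encoder (in fastai: learn.model[0].encoder,
# shape [vocab_size, 400])
vocab = {"movie": 0, "film": 1, "banana": 2}   # hypothetical toy vocab
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=400)

# Same idea as learn.model[0].encoder.weight
weights = emb.weight.detach()

def term_similarity(w1, w2):
    """Cosine similarity between two vocabulary terms' embedding rows."""
    v1, v2 = weights[vocab[w1]], weights[vocab[w2]]
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

sim = term_similarity("movie", "film")
```

With a trained model you would index `weights` using your real vocabulary’s token-to-id mapping instead of the toy dict.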

1 Like

(Abhimanyu) #9

That worked! Thank you so much

1 Like

(Zach Eberhart) #10

Do you have a source to the discussions on why they don’t recommend it?

Yeah, I am referencing a discussion here:

Using such a big model to get embeddings can be a over-engineered solution.

My use case is cleaning search-result data based on queries (in a specific domain, entertainment). I’m not really satisfied with the similarity scores of any of the models I’ve tried so far, and I’m working with a very large dataset (1+ billion search results), so I don’t think it would be over-engineered. Plus, I’m using this as a learning experience to get familiar with the various NLP models/methods, so I don’t mind :slight_smile:

Before training your own model I would first test for example the BERT-as-a-service model to check for semantic similarities.

I did try it, but it’s just not giving me the results I hoped for… FastText has actually given me the best results so far – which leads me to believe that I could get better results with BERT (or any one of the other SOTA methods) if it were fine-tuned.

1 Like

(Zach Eberhart) #11

I decided not to move forward with any fine-tuning because I got good results by throwing all of the similarity scores into a random forest, but here are the final results if you are curious:

Models:

GloVe (English Wikipedia)
FastText (English Wikipedia & Web Crawl)
Universal Sentence Encoder (Transformer & DAN)
BERT (Uncased Whole Word Masking)

Note: I didn’t include Word2Vec because I’m working with user-generated data and there were so many OOV errors that it was basically useless.

Feature Importance:
(average after 1000 fits / iterations)

  • ft_wiki_overall_similarity: 0.340653
  • avg_all: 0.299219
  • use_transformer_overall_similarity: 0.102947
  • ft_crawl_overall_similarity: 0.100621
  • use_dan_overall_similarity: 0.095025
  • glove_overall_similarity: 0.033930
  • bert_overall_similarity: 0.027601

Linear Regression:
(average after 1000 fits / iterations)

  • bert_overall_similarity: 45.243495
  • avg_all: 15.504321
  • ft_wiki_overall_similarity: 4.546034
  • ft_crawl_overall_similarity: 1.412537
  • glove_overall_similarity: -4.815655
  • use_dan_overall_similarity: -4.867751
  • use_transformer_overall_similarity: -10.585182

It’s interesting that the linear model relies so heavily on BERT when the RF barely uses it. But there must be some non-linear relationship between the similarity scores because the RF gets a ~15% lower error rate.
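The stacking approach described above (each model’s similarity score becomes one feature, and a random forest learns how to combine them) can be sketched like this, with synthetic scores standing in for the real per-model outputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for per-pair similarity scores from four models
# (e.g. ft_wiki, use_transformer, glove, bert -- hypothetical columns)
X = rng.uniform(0, 1, size=(n, 4))
# A "true" relevance with an interaction term, so the relationship
# between scores and the target is non-linear
y = 0.6 * X[:, 0] + 0.4 * X[:, 0] * X[:, 3]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_   # sklearn normalizes these to sum to 1
```

The interaction term in `y` is one way a random forest can beat a linear model on the same features, which may explain the ~15% error gap noted above.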

0 Likes

(Michael) #12

I would be curious how you did it in the end.

I’m still wondering whether, for example, ULMFiT LM (pre)training, which tries to predict the next word, could generate good embeddings, because word2vec works “similarly” but trains on the surrounding words (i.e., in its two training modes: CBOW predicts a word from its surrounding words; skip-gram predicts the surrounding words from a word).

However, I still have to think this through in detail and look for publications in this field.
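The CBOW/skip-gram distinction mentioned above comes down to how training examples are generated from a token window. A small illustrative sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center, context) pairs -- predict each surrounding
    word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (context, center) examples -- predict the center word
    from its surrounding words."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

toks = ["the", "cat", "sat", "on", "the", "mat"]
```

A next-word LM like ULMFiT’s is closer to a one-sided window of size 1, which is part of why its embeddings may behave differently from word2vec’s.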

1 Like

(Michael) #13

From the TWIML slack group and maybe FYI: https://telegra.ph/Building-a-Search-Engine-with-BERT-and-TensorFlow-07-09

0 Likes

(Zach Eberhart) #14

I labeled a sample of the data (~2.5k, 99% ±3% CI) with how similar the pairs were (0, 0.5, 1, with 0 & 1 balanced), and then trained an RF on the similarity scores using that data. I know I could probably get better results if I fine-tuned one of the models, but that would’ve required some more scraping to build the LM, and I don’t really have time for that (though tbh I thought testing all of the models would’ve been quicker than it was).

My end goal is to use this data to train a model (I’m using the “similarity threshold” score to clean the data), so if I’m not getting the results I want, I’ll probably go back and build an LM, which would give me better similarity scores.
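The threshold-based cleaning step amounts to keeping only results whose similarity score clears a cutoff. A trivial sketch (the names and the 0.5 cutoff are illustrative, not from the original post):

```python
def clean_results(results, scores, threshold=0.5):
    """Keep only results whose similarity score meets the threshold."""
    return [r for r, s in zip(results, scores) if s >= threshold]

kept = clean_results(["a", "b", "c"], [0.9, 0.2, 0.6], threshold=0.5)
```

In practice the threshold would be tuned against the labeled sample described above.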

From the TWIML slack group and maybe FYI: https://telegra.ph/Building-a-Search-Engine-with-BERT-and-TensorFlow-07-09

I actually used that guide to get my BERT similarity scores :slight_smile: was super helpful!

1 Like