I’m currently working on a project where I’m testing various NLP models on the task of semantic similarity.
I know that Word2Vec, GloVe, FastText, and Google’s USE are all fairly adept at this task, but I wanted to try some of the latest SOTA methods. However, when I started trying BERT and ELMo, someone mentioned that those models aren’t meant for semantic similarity.
So if I were to build an LM using ULMFiT and get the embeddings, could I use them to calculate semantic similarity (using cosine similarity, etc.)?
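(For what it’s worth, the similarity step itself is simple once you have the vectors; here is a minimal numpy sketch, with random placeholder vectors standing in for ULMFiT word/document embeddings:)

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# placeholder 400-d vectors standing in for real ULMFiT embeddings
v1 = np.random.rand(400)
v2 = np.random.rand(400)
print(cosine_similarity(v1, v2))
```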
So I looked through the code and it doesn’t seem like they are doing anything particularly special (they’re really just providing an API for using BERT embeddings).
Do you know why people don’t recommend using BERT or ELMo for semantic similarity? One thing I’ve read is that they aren’t trained for it, but then why does FastText work? And how might I train one of these models on similarity?
Hi zache, I wanted to ask how you got the embeddings of the vocabulary after training the model. When we run ‘learn.model??’ it shows that the first layer of the neural net is an embedding layer of size (vocabulary size, 400), but I wasn’t able to get those embeddings separately after training the model. I also want to use them for computing similarities between different terms.
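In case it’s useful, here is a rough sketch of one way to pull the embedding matrix out of the learner and compare terms. The attribute names (learn.model[0].encoder, learn.data.vocab.itos) assume fastai v1’s AWD-LSTM layout and may differ in other versions:

```python
import numpy as np

# assumes `learn` is a trained language_model_learner (fastai v1, AWD-LSTM)
emb_layer = learn.model[0].encoder                  # nn.Embedding of shape (vocab_size, 400)
emb_matrix = emb_layer.weight.data.cpu().numpy()    # -> numpy array (vocab_size, 400)

itos = learn.data.vocab.itos                        # index -> token
stoi = {tok: i for i, tok in enumerate(itos)}       # token -> index

def term_vector(term: str) -> np.ndarray:
    """Look up the embedding row for a vocabulary term."""
    return emb_matrix[stoi[term]]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical example terms
print(cosine(term_vector('movie'), term_vector('film')))
```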
Using such a big model to get embeddings can be an over-engineered solution.
My use case is to clean search result data based on queries (in a specific domain, entertainment), and I’m not really satisfied with the similarity scores of any of the models I’ve tried so far. I’m also working with a very large dataset (1+ billion search results), so I don’t think it would be over-engineered. Plus, I’m using this as a learning experience to get familiar with the various NLP models/methods, so I don’t mind.
Before training your own model, I would first test something like the BERT-as-a-service model to check for semantic similarities.
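For example, a minimal sketch of querying BERT-as-a-service from Python. It assumes the server is already running (started with bert-serving-start and a downloaded BERT checkpoint); the example queries are made up:

```python
import numpy as np
from bert_serving.client import BertClient

# assumes `bert-serving-start -model_dir /path/to/bert_checkpoint` is already running
bc = BertClient()

# encode returns one fixed-size sentence vector per input string
vecs = bc.encode(['the big lebowski', 'the big lebowski 1998 full movie'])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))
```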
I did try it, but it’s just not giving me the results I hoped for… FastText has actually given me the best results so far – which leads me to believe that I could get better results with BERT (or any one of the other SOTA methods) if it were fine-tuned.
I decided not to move forward with any fine-tuning because I got good results by throwing all of the similarity scores into a random forest, but here are the final results if you are curious:
Models:
GloVe (English Wikipedia)
FastText (English Wikipedia & Web Crawl)
Universal Sentence Encoder (Transformer & DAN)
BERT (Uncased Whole Word Masking)
Note: I didn’t include Word2Vec because I’m working with user-generated data and there were so many OOV errors that it was basically useless.
Feature Importance:
(average after 1000 fits / iterations)
ft_wiki_overall_similarity: 0.340653
avg_all: 0.299219
use_transformer_overall_similarity: 0.102947
ft_crawl_overall_similarity: 0.100621
use_dan_overall_similarity: 0.095025
glove_overall_similarity: 0.033930
bert_overall_similarity: 0.027601
Linear Regression:
(average after 1000 fits / iterations)
bert_overall_similarity: 45.243495
avg_all: 15.504321
ft_wiki_overall_similarity: 4.546034
ft_crawl_overall_similarity: 1.412537
glove_overall_similarity: -4.815655
use_dan_overall_similarity: -4.867751
use_transformer_overall_similarity: -10.585182
It’s interesting that the linear model relies so heavily on BERT when the RF barely uses it. But there must be some non-linear relationship between the similarity scores because the RF gets a ~15% lower error rate.
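For anyone curious how numbers like these could be produced, here is a rough scikit-learn sketch. The feature names match the list above, but the CSV file, label column, and the bootstrap resampling used for the 1000-fit averages are placeholders of mine, not necessarily the exact pipeline used:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# one similarity-score column per model, plus the hand-labeled target (0 / 0.5 / 1)
feature_cols = [
    'ft_wiki_overall_similarity', 'avg_all', 'use_transformer_overall_similarity',
    'ft_crawl_overall_similarity', 'use_dan_overall_similarity',
    'glove_overall_similarity', 'bert_overall_similarity',
]
df = pd.read_csv('labeled_similarity_scores.csv')   # hypothetical file name
X, y = df[feature_cols].values, df['label'].values

n_iter = 1000
rf_importances = np.zeros(len(feature_cols))
lr_coefs = np.zeros(len(feature_cols))

for seed in range(n_iter):
    # random forest: average feature importances over many fits
    rf = RandomForestRegressor(random_state=seed).fit(X, y)
    rf_importances += rf.feature_importances_

    # linear regression: refit on a bootstrap resample so averaging over fits is meaningful
    idx = np.random.RandomState(seed).choice(len(X), len(X), replace=True)
    lr_coefs += LinearRegression().fit(X[idx], y[idx]).coef_

for name, imp, coef in zip(feature_cols, rf_importances / n_iter, lr_coefs / n_iter):
    print(f'{name}: RF importance {imp:.6f}, LR coefficient {coef:.6f}')
```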
I’m still thinking about whether, for example, ULMFiT LM (pre)training, which tries to predict the next word, could generate good embeddings, because word2vec works in a “similar” way but uses the surrounding words for training (i.e., in both training modes: CBOW predicts a word from its surrounding words; skip-gram predicts the surrounding words from a word).
However, I still have to think this through in detail and look for publications in this field.
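As a reference point, the two word2vec training modes mentioned above map to a single flag in gensim (4.x API); the corpus below is just illustrative:

```python
from gensim.models import Word2Vec

# tiny illustrative corpus; in practice this would be tokenized search queries
sentences = [
    ['the', 'big', 'lebowski', 'full', 'movie'],
    ['watch', 'the', 'big', 'lebowski', 'online'],
    ['lebowski', 'streaming', 'free'],
]

# sg=0 -> CBOW: predict a word from its surrounding words
cbow = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the surrounding words from a word
skipgram = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)

print(cbow.wv.similarity('lebowski', 'movie'))
print(skipgram.wv.similarity('lebowski', 'movie'))
```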
I labeled a sample of the data (~2.5k, 99% confidence ±3% margin) with how similar they were (0, 0.5, 1, with 0 & 1 balanced), and then trained a RF on the similarity scores against those labels. I know I could probably get better results if I fine-tuned one of the models, but that would’ve required me to do some more scraping to build the LM, and I don’t really have time for that (though tbh I thought testing all of the models would’ve been quicker than it was).
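(Quick sanity check on that sample size, using the standard normal-approximation formula with a worst-case proportion of 0.5:)

```python
import math

z = 2.576      # z-score for 99% confidence
margin = 0.03  # ±3% margin of error
p = 0.5        # worst-case proportion

n = (z ** 2) * p * (1 - p) / margin ** 2
print(math.ceil(n))  # ~1844, so a ~2.5k labeled sample comfortably covers it
```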
My end goal is to use this data to train a model (I’m using the “Similarity Threshold” score to clean the data), so if I’m not getting the results that I want, I’ll probably go back and build an LM, which would give me better similarity scores.