Sentence similarity

Hello, I hope this is the correct forum for this discussion!

Given pre-trained word embeddings we know we can calculate similarity between words by e.g. taking the dot product of the word vectors.

Extending from that idea, I am interested in unsupervised sentence similarity: so given two sentences, are they saying close to the same thing?

My initial idea is to simply calculate the cosine distance between the sparse matrices formed by the embeddings of each word in the sentence.

Before I jump in, is anyone aware of a more appropriate methodology? Or can point me to some applicable research?


Your initial idea is a good starting point, but it’s possible to do better. You can also try averaging the word vectors before computing the similarity, or you can use some type of neural encoder and compare its representations, which in theory will have learnt many important features such as order, context, syntax, and semantics. Good representations will achieve good similarity scores.
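
A minimal sketch of the averaging idea, assuming you already have a word-to-vector lookup. The tiny 3-dimensional `TOY_EMBEDDINGS` table and the helper names are invented for illustration; real embeddings would come from word2vec, GloVe, fastText or similar.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_similarity(sentence_a, sentence_b, embeddings):
    """Cosine similarity of averaged word vectors; unknown words are skipped."""
    vecs_a = [embeddings[w] for w in sentence_a.lower().split() if w in embeddings]
    vecs_b = [embeddings[w] for w in sentence_b.lower().split() if w in embeddings]
    if not vecs_a or not vecs_b:
        return 0.0
    return cosine_similarity(average_vectors(vecs_a), average_vectors(vecs_b))

# Toy embeddings, made up purely for this example.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.1],
    "naps": [0.2, 0.8, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}

close = sentence_similarity("cat sleeps", "kitten naps", TOY_EMBEDDINGS)
far = sentence_similarity("cat sleeps", "stocks fell", TOY_EMBEDDINGS)
print(close, far)
```

Note that averaging throws away word order, which is one reason an encoder can do better.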


By averaging them, do you mean calculating the average of the embeddings over the sentence and then calculating the distance?

Coincidentally, this was literally announced today:

The related papers are:


You could also use a language model then compare the cosine similarity of the output states. I think there was something similar done in this thread.


I’ve spent quite some time exploring ways to tackle this problem. There is quite a large and growing body of academic research in this area, but it’s a really difficult problem and no one has really found a good solution as far as I am aware.

Regarding the specific idea of using a distance measure based on word vectors, you usually need a way to aggregate the contents of each sentence to form a fixed length vector encoding for the sentence. So, regardless of the length of the sentence, the sentence vector has to be the same shape. This is the tricky part - how do you combine the individual word vectors in a sentence to encode their meaning.

One way of comparing the two sentences’ sets of word vectors without aggregating to a fixed length vector is the Word Mover’s Distance idea set out in this paper. I understand it has now been implemented as part of the Gensim library, but I haven’t tried it myself.
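
To give a feel for the idea, here is a rough sketch of the relaxed variant of Word Mover’s Distance, in which every word simply sends all of its weight to the nearest word in the other sentence. This is a cheap lower bound on the full distance, not the exact optimal-transport computation and not Gensim’s implementation. Uniform per-token weights, the function names, and the toy embeddings are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd_one_way(words_a, words_b, embeddings):
    """Each token in A moves its (uniform) mass to its nearest token in B."""
    weight = 1.0 / len(words_a)
    return sum(
        weight * min(euclidean(embeddings[wa], embeddings[wb]) for wb in words_b)
        for wa in words_a
    )

def relaxed_wmd(words_a, words_b, embeddings):
    """Maximum of both one-way relaxations: a lower bound on the full WMD."""
    return max(
        relaxed_wmd_one_way(words_a, words_b, embeddings),
        relaxed_wmd_one_way(words_b, words_a, embeddings),
    )

# Toy embeddings, made up purely for this example.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.1],
    "naps": [0.2, 0.8, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}

print(relaxed_wmd(["cat", "sleeps"], ["kitten", "naps"], TOY_EMBEDDINGS))
print(relaxed_wmd(["cat", "sleeps"], ["stocks", "fell"], TOY_EMBEDDINGS))
```

Unlike the averaging approach, this works directly on the two variable-length sets of word vectors, so no fixed-length sentence vector is needed.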

I would recommend having a look at this blog post, which gives a great explanation and details of a thorough approach to sentence similarity in the context of the Quora pairs dataset. This dataset has also been used as part of a Kaggle competition, which I would also recommend taking a look at - lots of good ideas to be found in the discussions and kernels.

As Alexandre has pointed out, using a neural encoder (usually based on an RNN architecture of some kind) will give you better results than a simple average of word vectors. However, you need a way of training your neural net. A number of approaches have been tried:

  1. Supervised training on a dataset of sentence pairs with human-annotated labels which indicate whether the pairs mean the same thing or not. There are a number of datasets around, including the Microsoft Paraphrase Corpus (too small to be of much use for training a neural net), the Stanford Natural Language Inference corpus (SNLI), and the Quora pairs corpus.

  2. Supervised training on machine-generated sentence pairs. There is a big database of paraphrase pairs. In addition, a few people are now trying to create more pairs using a technique known as back-translation. The idea is that you use machine translation to translate a sentence from, say, English to French, and then translate it back from French to English to generate a sentence with the same meaning but different words.

  3. Un- or semi-supervised training. The basic idea in this category is to get your model to predict some part of an existing text, or perhaps the order of the text. As such, no labels are required. For example, similar to a language model predicting the next word, you could get the model to predict the next sentence. One of the first attempts in this area was the skip-thoughts model, which mimics the model used to create the word2vec word vectors but applies it at the sentence level.

    I would put the latest Google work into this final category. They are using conversational data, and predicting the queries that fit with specific answers.

    Another really interesting idea that I saw (can’t find the paper - but will add a reference if I do) looked at getting a model to predict the conjunction between two clauses of a sentence. From memory, the authors encoded both clauses, and then asked their model to predict which of a fixed list of conjunction words (e.g. but, because, therefore, and, etc.) joined the clauses. The idea is that the encoder would need to have encoded a good understanding of the clauses to guess the right conjunction.
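
As a rough illustration of how training data for the conjunction idea could be produced without any labels, the sketch below splits a sentence at a conjunction and keeps the conjunction as the target. The conjunction list, the regular expression, and the function name are assumptions for illustration and are not taken from the paper mentioned above.

```python
import re

# Hypothetical fixed label set; the actual list used in the paper is not known here.
CONJUNCTIONS = ["because", "but", "therefore", "and"]

# A conjunction surrounded by whitespace, optionally preceded by a comma.
_PATTERN = re.compile(
    r"\s*,?\s+(" + "|".join(CONJUNCTIONS) + r")\s+", re.IGNORECASE
)

def make_conjunction_example(sentence):
    """Split at the leftmost conjunction; return (left, right, label) or None."""
    match = _PATTERN.search(sentence)
    if match is None:
        return None
    left = sentence[:match.start()].strip()
    right = sentence[match.end():].strip().rstrip(".")
    if not left or not right:
        return None
    return left, right, match.group(1).lower()

for text in [
    "I stayed home because it rained.",
    "It was late, but we kept working.",
    "Nothing to split here.",
]:
    print(make_conjunction_example(text))
```

Each resulting triple gives two clauses to encode and a class label to predict, so any unlabeled text corpus becomes a supervised classification set.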


We did extensive work on this problem. Our FitLam model was getting better results than the TensorFlow Hub Universal Sentence Encoder (Vaswani Transformer).
But I see they have a lite version tokenized with SentencePiece. I’ll need to work on that next.
As a person who spent a good chunk of last year on all the above methods to build chatbots for enterprise clients, I’ve been most impressed by classifiers built with transfer learning from an LM.
Unlike other DL systems where you can run inference as a batch process, there’s a speed - accuracy trade-off when operating at chat speed.

Also, the Quora corpus has burnt many a bot developer as it’s not representative of a real-world paraphrase corpus. SNLI, MultiNLI, and ParaNMT are better bets.

Sylvain posted a link to a thread above where we documented our experiments - hope they are helpful.
Best wishes!


Thanks for the comprehensive reply. I still need to work through everything you posted. The clause conjunction idea is very interesting.


I would like to run some experiments on the Universal Sentence Encoder from TensorFlow Hub. I have managed to use it in TensorFlow but I would like to work with it in PyTorch. Would you have any idea how I could do this? I think that I need to recreate the network architecture and load the trained weights manually, but I don’t know how to do it in TF. If I manage to do it I will release it to the community as I think it could help people.

I would also like to point to the Semantic Textual Similarity Wiki (I don’t think it has been linked in this topic yet) that contains a lot of references on this subject.

Regarding the Word Mover’s Distance, the reported results are not very encouraging:

Based on our results, there’s little reason to use Word Mover’s Distance rather than simple word2vec averages. Only on STS-TEST, and only in combination with a stoplist, can WMD compete with the simpler baselines.

Here’s a fairly recent overview comparing different approaches: . State-of-the-art results were achieved with

ULMFiT (@sebastianruder) was mentioned but not included in the comparison. It would be great to see how it performs. Is anybody interested in hooking up ULMFiT to SentEval so we can get those comparisons?


I’ve been trying to implement something similar to the approach in this blog post:

Looks like it works pretty well.

I’ve built a LM with the SNLI corpus.

The LM worked and I was able to generate new sentences that were reasonable.

I’ve been getting stuck when trying to make sentence vectors. The vectors that I’m getting have no predictive power.

I’m using PyTorch with the lib. Every time I try to modify the lib I get horribly lost.

Here is my LM code, any ideas where I’m going wrong creating the sentence vector? I’m using forward to train the LM and sentence_vector to create the vectors.

import torch
import torch.nn as nn

#based on
class LSTMLM(nn.Module):
	"""Container module with an encoder, a module, and a decoder."""

	def __init__(self, ntoken, nhid, nlayers, dropout=0.5):
		super(LSTMLM, self).__init__()
		self.drop = nn.Dropout(dropout)
		self.encoder = nn.Embedding(ntoken, nhid)
		self.lstm = nn.LSTM(nhid, nhid, nlayers, dropout=dropout)
		self.decoder = nn.Linear(nhid, ntoken)

		# "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
		# and
		# "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
		self.decoder.weight = self.encoder.weight


		self.nhid = nhid
		self.nlayers = nlayers

	def init_weights(self):
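		initrange = 0.1
		self.encoder.weight.data.uniform_(-initrange, initrange)
		self.decoder.bias.data.zero_()
		self.decoder.weight.data.uniform_(-initrange, initrange)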

	def sentence_vector(self, input, hidden):
		# input: (num_words, batch_size) token ids.
		# Call model.eval() first so that dropout is disabled.
		emb = self.drop(self.encoder(input))
		output, hidden = self.lstm(emb, hidden)
		output = self.drop(output)

		# output has shape (num_words, batch_size, nhid). Pool over the word
		# dimension (dim 0) to get a single vector per element in the batch.
		# Note: .view() only reinterprets memory; it does not swap dimensions.
		m = output.max(dim=0)[0]   # max pooling     -> (batch_size, nhid)
		a = output.mean(dim=0)     # average pooling -> (batch_size, nhid)
		l = output[-1]             # last time step  -> (batch_size, nhid)
		sentence_vec = torch.cat([m, a, l], dim=1)  # (batch_size, 3 * nhid)
		return sentence_vec

	def forward(self, input, hidden):
		emb = self.drop(self.encoder(input))
		output, hidden = self.lstm(emb, hidden)
		output = self.drop(output)
		decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
		return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

	def init_hidden(self, bsz):
		weight = next(self.parameters())
		return (weight.new_zeros(self.nlayers, bsz, self.nhid),
				weight.new_zeros(self.nlayers, bsz, self.nhid))

The lib has this code for pooling, but I have no idea what the inputs are.

	class PoolingLinearClassifier(nn.Module):
		def __init__(self, layers, drops):
			super().__init__()
			self.layers = nn.ModuleList([
				LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

		def pool(self, x, bs, is_max):
			f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
			return f(x.permute(1,2,0), (1,)).view(bs,-1)

		def forward(self, input):
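			# input is the (raw_outputs, outputs) pair from the RNN encoder:
			# one tensor per layer, each of shape (seq_len, batch_size, features)
			raw_outputs, outputs = input
			output = outputs[-1]
			sl,bs,_ = output.size()
			avgpool = self.pool(output, bs, False)
			mxpool = self.pool(output, bs, True)
			x = torch.cat([output[-1], mxpool, avgpool], 1)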
			for l in self.layers:
				l_x = l(x)
				x = F.relu(l_x)
			return l_x, raw_outputs, outputs

Brian, have you tried the language model from Lesson 4 or 10 to retrieve sentence vectors? See my example in “Get vectors from an rnn during evaluation” for how to fetch them.

I’m still experimenting with different versions of how to compile the RNN encoder hidden state into good sentence vectors. The basic version of using the hidden state of the last input word does not yield great results. I got better results with using a mean of the hidden states for all input words. Other ideas could be concatenating the average and max state into a sentence vector or even using ideas from the p-mean paper I quoted above.
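
The pooling options described above can be sketched in plain Python on a list of per-word hidden states. The function names and the toy numbers are made up for illustration; on a PyTorch tensor of shape (seq_len, hidden) the same operations would be `.mean(0)`, `.max(0)[0]` and indexing with `[-1]`.

```python
def mean_pool(states):
    """Average each hidden dimension over all input words."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]

def max_pool(states):
    """Take the maximum of each hidden dimension over all input words."""
    return [max(column) for column in zip(*states)]

def sentence_vector(states, mode="mean"):
    """states: one hidden-state vector per input word, in sentence order."""
    if mode == "mean":
        return mean_pool(states)
    if mode == "max":
        return max_pool(states)
    if mode == "last":
        return list(states[-1])
    if mode == "concat":
        # mean + max + last, three times the hidden size
        return mean_pool(states) + max_pool(states) + list(states[-1])
    raise ValueError("unknown mode: " + mode)

# Toy hidden states for a 3-word sentence with hidden size 2.
toy_states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
for mode in ("mean", "max", "last", "concat"):
    print(mode, sentence_vector(toy_states, mode))
```

Which variant works best seems to depend on the task, so it is worth comparing them on your own data.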


@JensF Thank you for the reply.
I’ll use your example from the other post to double check my work.

I have used the lesson 10 model as the basis for creating vectors. I wrote up a bit of that in this post:

I used the avgpool+maxpool+last technique to create the vectors, but they didn’t produce good results.

I’ve been re-reading some of the papers on this topic including Facebook’s InferSent.

I think that I’m going to try their approach of encoding two sentences and then using a fully connected layer to classify the pair as entailment, contradiction, or neutral. I’d be pretty happy with greater than 80% accuracy like they claimed.
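
A small sketch of what that pair classifier could look like. The InferSent paper combines the two sentence vectors u and v as [u, v, |u - v|, u * v] before the fully connected layers; everything else here (a single linear layer, the zero weights, the function names) is a simplification invented for illustration, and a real model would learn the weights.

```python
import math

LABELS = ["entailment", "contradiction", "neutral"]

def pair_features(u, v):
    """Concatenate u, v, |u - v| and the element-wise product u * v."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod

def linear(x, weights, bias):
    """One fully connected layer: one weight row and one bias per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pair(u, v, weights, bias):
    """Return (predicted label, class probabilities) for a sentence pair."""
    probs = softmax(linear(pair_features(u, v), weights, bias))
    return LABELS[probs.index(max(probs))], probs

# Toy sentence vectors and untrained (all-zero) weights, for illustration only.
u, v = [1.0, 2.0], [3.0, -1.0]
toy_weights = [[0.0] * 8 for _ in LABELS]
toy_bias = [0.0, 1.0, 0.0]
print(pair_features(u, v))
print(classify_pair(u, v, toy_weights, toy_bias))
```

The difference and product terms give the classifier direct access to how the two vectors relate, which plain concatenation of u and v does not.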

Meanwhile I also experimented more with concatenating avg, max, and last state like this:

hidden_states, outs = model[0](<variable with input> )
hidden_states_last_layer = hidden_states[-1]
# return avg-pooling, max-pooling, and last hidden state
state_mean = hidden_states_last_layer.mean(0).squeeze().data
state_max = hidden_states_last_layer.max(0)[0].squeeze().data
hidden_state_last_word = hidden_states_last_layer[-1].squeeze().data
feature_vec = torch.cat([state_mean, state_max, hidden_state_last_word])

It would be interesting to compare the output with your PoolingVector from the other post. I assume they should be the same.

In my experience, simply concatenating all 3 and using this as a sentence vector gave me worse results than just using the mean of all hidden states. You might want to check if mean itself works better for you as well.

When comparing the mean-based sentence vector with my baseline model (which is using fasttext word vectors) I don’t see too much improvement yet (it’s sometimes better and sometimes worse). I’m wondering if using the cos-distance between sentence vectors as a similarity measure is just limited in some ways.

Based on that thinking, I’m also looking into using a classifier to replace the cos-distance measurement, similar to your PoolingLinearClassifier above.

As part of that, I’m still exploring which datasets to use for training. Unfortunately, my domain-specific dataset is unlabeled and hence I’d like to start with something like the Quora or SNLI dataset. What gives me pause is that I read some negative comments in the forums here about the usefulness of the Quora dataset when using it for other domains. And SNLI seems to have some issues as well:

They all pointed out some peculiarities in the data that enable that. For example, hypotheses of contradicting examples tend to contain more negative words. This happens because the premises are image captions (“a dog is running in the park”). Image captions rarely describe something that doesn’t happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: “a dog is not running in the park”.
Funnily, 1 also showed that the appearance of the word “cat” in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn’t immediately contradict any other sentence.

What’s your experience with those datasets (including MultiNLI and ParaNMT)? Any other dataset you would recommend?

I’m planning on using duplicate GitHub issues eventually. I thought that it was best to start with a known quantity like the SNLI set first. I haven’t used any other yet. The issues that you mentioned are concerning, but I think I’m going to stick with SNLI until I can get close to the published accuracies.

I might make my GitHub duplicates available once I verify that they are in good shape. You have to link the issues together based on issue comments, so it’s not that straightforward.
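
For anyone curious what linking issues via comments might involve, here is a rough sketch. It assumes duplicates are flagged with a comment of the form "Duplicate of #123"; that comment convention, the input layout, and the function name are assumptions for illustration, and real issue threads will be messier.

```python
import re

# Assumed comment convention: "duplicate of #<issue number>", case-insensitive.
DUPLICATE_RE = re.compile(r"\bduplicate\s+of\s+#(\d+)", re.IGNORECASE)

def duplicate_pairs(comments_by_issue):
    """Map {issue_number: [comment_text, ...]} to a sorted list of (low, high) pairs."""
    pairs = set()
    for issue, comments in comments_by_issue.items():
        for text in comments:
            for match in DUPLICATE_RE.finditer(text):
                other = int(match.group(1))
                if other != issue:  # ignore self-references
                    pairs.add((min(issue, other), max(issue, other)))
    return sorted(pairs)

# Toy comment data, invented for this example.
toy_comments = {
    12: ["Duplicate of #7", "thanks"],
    7: ["see also #3"],
    30: ["This is a duplicate of #12."],
    5: ["duplicate of #5"],
}
print(duplicate_pairs(toy_comments))
```

Storing each pair with the lower issue number first avoids counting the same link twice when both issues reference each other.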

This isn’t an unsupervised method, but a Siamese architecture is pretty good for similarity measurements.


Thanks for bringing this up @TomLisankie. I saw a few people using this for the Kaggle Quora competition, e.g. here and here, and it sounds interesting. The article here mentioned they were beating one of the other two Siamese networks by using xgboost and a bunch of hand-selected features. They also described a DL architecture afterwards which looks a little bit like a Siamese network to me with lots of extra layers.

Since this is all more than a year old, I’m wondering if we can simplify this now with more powerful approaches such as ULMFiT. Did you by any chance try out ULMFiT with Siamese networks? I would be curious about the experience. Otherwise that might be one of the next things I’ll try.

MultiNLI looks like it might be a better version of SNLI

I’ve cleaned up my notebooks and made them available on GitHub at SiameseULMFiT.

I’ve taken the model from lesson 10 and retrained it on the SNLI dataset. Then I’ve used that encoder to create a vector for each sentence, concatenated those vectors and passed them to a classifier network.

I’m not sure what I’m doing wrong here, but I’m only getting about 40% accuracy when predicting the SNLI entailment category.

I’d love any advice on how to improve this system.


Just a couple of observations

  • when you train your language model in ULMFiT_Tokenize.ipynb, you define your BOS marker as x_bos. The wikipedia pre-trained model from lesson 10 was using xbos instead. The lesson 10 model also prefixes text with a FLD token. ({BOS} {FLD} 1). I wonder if/how much impact using different tokens has on your language model.
  • I noticed you are using the files snli_dev.json and snli_test.json from the SNLI dataset when fine-tuning your language model and when building the Siamese classifier. Why are you not using snli_1.0_train.json? This file is much larger than the other two. So maybe you just don’t provide enough training data?

I haven’t looked in detail into your Siamese network architecture, maybe the larger training set will already solve your issue.


Thanks for having a look.
Not including the training set is an embarrassing oversight, I’ve fixed that now.

I’ve re-run my pre-training phase with the full data set. I’m getting a loss of 2.9 and an accuracy of 43% when training the language model on the SNLI corpus. That’s a very good result. For comparison, lesson 10 got a loss of 3.9 and an accuracy of 31% in pre-training. So I think the tokens are fine.

I’ve tried again with the bigger dataset, and got the same result.

Another point of comparison is that the lesson 10 classifier got 93% accuracy on the first epoch. I think that if my pre-trained classifier starts at 36% for the first epoch that it’s doomed from the start.
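
One way to put those accuracy figures in context is to compare them with a majority-class baseline. SNLI’s three labels are, as far as I know, roughly balanced, so always predicting a single class would already score about a third. The sketch below computes that baseline; the toy label list and the function name are invented for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy, perfectly balanced 3-class label list (made up for this example).
toy_labels = ["entailment", "contradiction", "neutral"] * 3
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

Against a baseline of roughly one in three, 36% to 40% means the classifier has learnt very little from the sentence vectors.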

I think that one of the following must be true:

  1. ULMFiT doesn’t produce vectors that are suitable for semantic similarity tasks.
  2. I’ve made some kind of mistake in my coding that’s preventing the network from training.
  3. The Siamese Architecture is a bad fit for this task.

I still feel like #2 is most likely, but I can’t find out where I’m going wrong.
I’ve tried using just the last hidden state to see if that helps and I get the same result.
I’ve tried changing the hidden size of the classifier layer too.