Sentence similarity

(Andrich van Wyk) #7

Thanks for the comprehensive reply. I still need to work through everything you posted. The clause conjunction idea is very interesting.

(Chatel Gregory) #8


I would like to run some experiments on the Universal Sentence Encoder from Tensorflow Hub. I have managed to use it in Tensorflow but I would like to work with it in PyTorch. Would you have any idea on how I could do this? I think that I need to recreate the network architecture and load the trained weights manually but I don’t know how to do it in TF. If I manage to do it I will release it to the community as I think it could help people.

I would also like to point to the Semantic Textual Similarity Wiki (I don’t think it has been linked in this topic yet) that contains a lot of references on this subject.


Regarding the Word Mover’s Distance, the results shown at are not very encouraging

Based on our results, there’s little reason to use Word Mover’s Distance rather than simple word2vec averages. Only on STS-TEST, and only in combination with a stoplist, can WMD compete with the simpler baselines.

Here’s a fairly recent overview comparing different approaches: . State-of-the-art results were achieved with

ULMFiT (@sebastianruder) was mentioned but not included in the comparison. Would be great to see how it performs. Anybody interested in hooking up ULMFiT to SentEval so we can get those comparisons?

(Brian) #10

I’ve been trying to implement something similar to the approach in this blog post:

Looks like it works pretty well.

I’ve built a LM with the SNLI corpus.

The LM worked and I was able to generate new sentences that were reasonable.

I’ve been getting stuck when trying to make sentence vectors. The vectors that I’m getting have no predictive power.

I’m using PyTorch with the lib. Every time I try to modify the lib I get horribly lost.

Here is my LM code, any ideas where I’m going wrong creating the sentence vector? I’m using forward to train the LM and sentence_vector to create the vectors.

#based on
class LSTMLM(nn.Module):
	"""Container module with an encoder, a module, and a decoder."""

	def __init__(self, ntoken, nhid, nlayers, dropout=0.5):
		super(LSTMLM, self).__init__()
		self.drop = nn.Dropout(dropout)
		self.encoder = nn.Embedding(ntoken, nhid)
		self.lstm = nn.LSTM(nhid, nhid, nlayers, dropout=dropout)
		self.decoder = nn.Linear(nhid, ntoken)

		# "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
		# and
		# "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
		self.decoder.weight = self.encoder.weight


		self.nhid = nhid
		self.nlayers = nlayers

	def init_weights(self):
		initrange = 0.1
		self.encoder.weight.data.uniform_(-initrange, initrange)
		self.decoder.weight.data.uniform_(-initrange, initrange)

	def sentence_vector(self, input, hidden):
		emb = self.drop(self.encoder(input))
		output, hidden = self.lstm(emb, hidden)
		output = self.drop(output)

		num_words = output.shape[0]
		batch_size = output.shape[1]

		#flip the outputs first 2 parts so the pooling will do the right thing
		#we want the a single vector from the sentence, one per element in the batch
		output = output.view(batch_size, num_words, hidden_size)
		m = max_pool(output)[0][0]
		a = avg_pool(output)[0][0]
		l = output[0][-1]
		#sentence_vec = torch.cat([m,a,l])
		return l
		return sentence_vec

	def forward(self, input, hidden):
		emb = self.drop(self.encoder(input))
		output, hidden = self.lstm(emb, hidden)
		output = self.drop(output)
		decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
		return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

	def init_hidden(self, bsz):
		weight = next(self.parameters())
		return (weight.new_zeros(self.nlayers, bsz, self.nhid),
				weight.new_zeros(self.nlayers, bsz, self.nhid))

The lib has this code for pooling, but I have no idea what the inputs are.

	class PoolingLinearClassifier(nn.Module):
		def __init__(self, layers, drops):
			super().__init__()
			self.layers = nn.ModuleList([
				LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

		def pool(self, x, bs, is_max):
			f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
			return f(x.permute(1,2,0), (1,)).view(bs,-1)

		def forward(self, input):
			raw_outputs, outputs = input
			output = outputs[-1]
			sl,bs,_ = output.size()
			avgpool = self.pool(output, bs, False)
			mxpool = self.pool(output, bs, True)
			x = torch.cat([output[-1], mxpool, avgpool], 1)
			for l in self.layers:
				l_x = l(x)
				x = F.relu(l_x)
			return l_x, raw_outputs, outputs
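
On the question of what the inputs to `pool` are: reading the code above, `output` appears to be the last RNN layer's output with shape (seq_len, batch, hidden). `permute(1,2,0)` turns that into (batch, hidden, seq_len), so the 1d pooling runs over the time steps. The sketch below replays that on a random tensor. `concat_pool` is a hypothetical helper name, not part of the lib, and the shapes are assumptions taken from the `sl,bs,_ = output.size()` line. It also shows why `view` may not be a safe substitute for `permute`/`transpose` when swapping the first two dimensions, which could be worth checking in `sentence_vector`.

```python
import torch
import torch.nn.functional as F

def concat_pool(output):
    """Concat pooling over RNN outputs of shape (seq_len, batch, hidden)."""
    sl, bs, nh = output.size()
    # (seq_len, batch, hidden) -> (batch, hidden, seq_len): pooling runs over the last dim
    x = output.permute(1, 2, 0)
    avgpool = F.adaptive_avg_pool1d(x, 1).view(bs, -1)  # (batch, hidden)
    mxpool = F.adaptive_max_pool1d(x, 1).view(bs, -1)   # (batch, hidden)
    last = output[-1]                                   # (batch, hidden), last time step
    return torch.cat([last, mxpool, avgpool], 1)        # (batch, 3 * hidden)

# Random tensor standing in for LSTM output: 7 time steps, batch of 4, hidden size 5
output = torch.randn(7, 4, 5)
vec = concat_pool(output)
print(vec.shape)  # torch.Size([4, 15])

# view() only reinterprets memory; it does not swap the seq_len and batch axes
t = torch.arange(6).view(3, 2, 1)         # (seq_len=3, batch=2, hidden=1)
mixed = t.view(2, 3, 1)[0].flatten()      # tensor([0, 1, 2]): values from both sentences
proper = t.transpose(0, 1)[0].flatten()   # tensor([0, 2, 4]): time steps of sentence 0
print(mixed, proper)
```

If the goal is one vector per sentence in the batch, pooling directly on the (seq_len, batch, hidden) tensor as above avoids the reshape altogether.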


Brian, have you tried the language model from Lesson 4 or 10 to retrieve sentence vectors? See my example in “Get vectors from an rnn during evaluation” on how to fetch them.

I’m still experimenting with different versions of how to compile the RNN encoder hidden state into good sentence vectors. The basic version of using the hidden state of the last input word does not yield great results. I got better results with using a mean of the hidden states for all input words. Other ideas could be concatenating the average and max state into a sentence vector or even using ideas from the p-mean paper I quoted above.

(Brian) #12

@JensF Thank you for the reply.
I’ll use your example from the other post to double check my work.

I have used the lesson 10 model as the basis for creating vectors. I wrote up a bit of that in this post:

I used the avgpool+maxpool+last technique to create the vectors, but they didn’t produce good results.

I’ve been re-reading some of the papers on this topic including Facebook’s InferSent.

I think that I’m going to try their approach of encoding 2 sentences then using a fully connected layer to classify them into entailed, contradictory or non-entailed. I’d be pretty happy with greater than 80% accuracy, like they claimed.


Meanwhile I also experimented more with concatenating avg, max, and last state like this:

hidden_states, outs = model[0](<variable with input> )
hidden_states_last_layer = hidden_states[-1]
# return avg-pooling, max-pooling, and last hidden 
state_mean = hidden_states_last_layer.mean(0).squeeze().data
state_max = hidden_states_last_layer.max(0)[0].squeeze().data
hidden_state_last_word = hidden_states_last_layer[-1].squeeze().data
feature_vec = torch.cat([state_mean, state_max, hidden_state_last_word])

It would be interesting to compare the output with your PoolingVector from the other post. I assume they should be the same.

In my experience, simply concatenating all 3 and using this as a sentence vector gave me worse results than just using the mean of all hidden states. You might want to check if mean itself works better for you as well.

When comparing the mean-based sentence vector with my baseline model (which is using fasttext word vectors) I don’t see too much improvement yet (it’s sometimes better and sometimes worse). I’m wondering if using the cos-distance between sentence vectors as a similarity measure is just limited in some ways.
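
For anyone following along, here is a minimal sketch of the comparison described above: mean-pool the hidden states of each sentence and score a pair with cosine similarity. The tensors are random stand-ins for encoder hidden states, not real model output, and `mean_sentence_vector` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def mean_sentence_vector(hidden_states):
    """Average per-token hidden states of shape (seq_len, hidden) into one vector."""
    return hidden_states.mean(0)

torch.manual_seed(0)
# Random tensors standing in for encoder hidden states (assumption, not real model output)
sent_a = torch.randn(6, 8)
sent_b = sent_a + 0.01 * torch.randn(6, 8)  # near-duplicate of sent_a
sent_c = torch.randn(9, 8)                  # unrelated sentence with a different length

vec_a, vec_b, vec_c = (mean_sentence_vector(s) for s in (sent_a, sent_b, sent_c))
sim_ab = F.cosine_similarity(vec_a, vec_b, dim=0).item()
sim_ac = F.cosine_similarity(vec_a, vec_c, dim=0).item()
print(f"near-duplicate: {sim_ab:.3f}, unrelated: {sim_ac:.3f}")
```

Cosine similarity only looks at the angle between the two vectors and weights every dimension equally, which may be part of why a trained classifier on top of the vectors is worth trying.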

Based on that thinking, I’m also looking into using a classifier to replace the cos-distance measurement, similar to your PoolingLinearClassifier above.

As part of that, I’m still exploring which datasets to use for training. Unfortunately, my domain-specific dataset is unlabeled and hence I’d like to start with something like the Quora or SNLI dataset. What gives me pause is that I read some negative comments in the forums here about the usefulness of the Quora dataset when using it for other domains. And SNLI seems to have some issues as well:

They all pointed out some peculiarities in the data that enable that. For example, hypotheses of contradicting examples tend to contain more negative words. This happens because the premises are image captions (“a dog is running in the park”). Image captions rarely describe something that doesn’t happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: “a dog is not running in the park”.
Funnily, 1 also showed that the appearance of the word “cat” in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn’t immediately contradict any other sentence.

What’s your experience with those datasets (including multinli, para-nmt)? Any other dataset you would recommend?

(Brian) #14

I’m planning on using duplicate GitHub issues eventually. I thought that it was best to start with a known quantity like the SNLI set first. I haven’t used any others yet. The issues that you mentioned are concerning, but I think I’m going to stick with SNLI until I can get close to the published accuracies.

I might make my GitHub duplicates available once I verify that they are in good shape. You have to link the issues together based on issue comments, so it’s not that straightforward.

(Thomas Lisankie) #15

This isn’t an unsupervised method, but a Siamese architecture is pretty good for similarity measurements.


Thanks for bringing this up @TomLisankie. I saw a few people using this for the Kaggle Quora competition, e.g. here and here, and it sounds interesting. The article here mentioned they were beating one of the other two Siamese networks by using xgboost and a bunch of hand-selected features. They also described a DL architecture afterwards which looks a little bit like a Siamese network to me with lots of extra layers.

Since this is all more than a year old, I’m wondering if now, with more powerful approaches such as ULMFiT, we can simplify this. Did you by any chance try out ULMFiT with Siamese networks? I would be curious about the experience. Otherwise that might be one of the next things I’ll try.

(Brian) #17

MultiNLI looks like it might be a better version of SNLI

(Brian) #18

I’ve cleaned up my notebooks and made them available on GitHub at SiameseULMFiT.

I’ve taken the model from lesson 10 and retrained it on the SNLI dataset. Then I’ve used that encoder to create a vector for each sentence, concatenated those vectors and passed them to a classifier network.

I’m not sure what I’m doing wrong here, but I’m only getting about 40% accuracy when predicting the SNLI entailment category.

I’d love any advice on how to improve this system.


Just a couple of observations

  • when you train your language model in ULMFiT_Tokenize.ipynb, you define your BOS marker as x_bos. The wikipedia pre-trained model from lesson 10 was using xbos instead. The lesson 10 model also prefixes text with a FLD token. ({BOS} {FLD} 1). I wonder if/how much impact using different tokens has on your language model.
  • I noticed you are using the files snli_dev.json and snli_test.json from the SNLI dataset when fine-tuning your language model and when building the Siamese classifier. Why are you not using snli_1.0_train.json? This file is much larger than the other two. So maybe you just don’t provide enough training data?

I haven’t looked in detail into your Siamese network architecture, maybe the larger training set will already solve your issue.

(Brian) #20

Thanks for having a look.
Not including the training set was an embarrassing oversight; I’ve fixed that now.

I’ve re-run my pre-training phase with the full data set. I’m getting a loss of 2.9 and an accuracy of 43% when training the language model on the SNLI corpus. That’s a very good result. For comparison, lesson 10 got a loss of 3.9 and an accuracy of 31% in pre-training. So I think the tokens are fine.

I’ve tried again with the bigger dataset, and got the same result.

Another point of comparison is that the lesson 10 classifier got 93% accuracy on the first epoch. I think that if my pre-trained classifier starts at 36% for the first epoch, it’s doomed from the start.

I think that one of the following must be true:

  1. ULMFit doesn’t produce vectors that are suitable for semantic similarity tasks.
  2. I’ve made some kind of mistake in my coding that’s preventing the network from training.
  3. The Siamese Architecture is a bad fit for this task.

I still feel like #2 is most likely, but I can’t find out where I’m going wrong.
I’ve tried using just the last hidden state to see if that helps and I get the same result.
I’ve tried changing the hidden size of the classifier layer too.

(Brian) #21

@jeremy I’ve been attempting to use the pre-trained LM from lesson 10 to create sentence vectors. I’d like to use the vectors to create a semantic search system.
My first attempt at using pooled hidden states as vectors ( described here ) showed that semantically different sentences weren’t appreciably different from semantically similar ones. Further attempts to build a classifier from the LM to predict entailment yielded similar results. The classifier is a Siamese Network available here


  1. Should the pooled hidden states of a LM produce vectors suitable for determining sentence similarity? In other words, would you expect 2 semantically similar sentences to have a greater cosine similarity than 2 unrelated sentences?
  2. I’m not sure how to proceed. Does this look like a reasonable approach? What do you do when you get stuck on a problem like this?
  3. Am I missing something obvious?

Any insight is greatly appreciated.

(Jeremy Howard (Admin)) #22

ULMFit is all about fine-tuning. I wouldn’t expect it to work without that step. I would expect it to work well for semantic similarity if you fine tune a siamese network on a ULMFit encoder.

(Brian) #23

Thanks for your input!

(Brian) #24

Update on my progress:
I made 2 changes that have boosted my performance from 40% to 50%.
The first was to sort my sentences by length.
The other was that I switched to the MultiBatchRNN encoder.

50% is still a very poor result, so I’m going to dig further into the InferSent code to see what might be different.

The other thing I did was to validate my loader and model code with the original IMDB task.
I was able to get good results, but not as good as the original.

Update: I’ve gotten 61% accuracy now. Better but not great. Infersent gets an accuracy of 84.5% on SNLI.
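
For reference, the length-sorting change mentioned in this update could look roughly like the sketch below. This is an assumption about what sorting by length means here (grouping sentences of similar length into the same batch so less padding is needed), not the code from the notebooks, and `length_sorted_batches` is a hypothetical helper name.

```python
def length_sorted_batches(sentences, batch_size):
    """Yield (indices, batch) pairs with sentences grouped by similar length."""
    # Sort indices rather than the sentences themselves so labels can be looked up later
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]

sentences = [
    ["a", "dog", "runs"],
    ["hi"],
    ["the", "cat", "sat", "on", "a", "mat"],
    ["two", "words"],
]
batches = list(length_sorted_batches(sentences, batch_size=2))
for idx, batch in batches:
    print(idx, [len(s) for s in batch])
# [1, 3] [1, 2]
# [0, 2] [3, 6]
```

Keeping the original indices makes it possible to pair each sentence back up with its partner and label after encoding.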

(Jeremy Howard (Admin)) #25

You’re making quite some progress!

(Brian) #26

Update: I changed my vector concatenation to the way that InferSent does it. So my forward pass now looks like this:

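def forward(self, in1, in2):
    u = self.encode(in1)
    v = self.encode(in2)
    # InferSent-style features: both vectors, their absolute difference, and their product
    features = torch.cat((u, v, torch.abs(u-v), u*v), 1)
    out = self.linear(features)
    return out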

This has improved my accuracy to about 71%
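
To make the shapes in that forward pass concrete, here is a self-contained sketch. `ToyEncoder` is a hypothetical stand-in (embedding plus mean pooling) for the fine-tuned LM encoder, and the vocabulary size, hidden size and single linear layer are assumptions for illustration, not the architecture from the notebooks.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the LM encoder: embeds tokens and mean-pools them."""
    def __init__(self, vocab_size, hidden):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, tokens):            # tokens: (batch, seq_len)
        return self.emb(tokens).mean(1)   # (batch, hidden)

class SiameseClassifier(nn.Module):
    def __init__(self, encoder, hidden, n_classes=3):
        super().__init__()
        self.encoder = encoder                          # same weights encode both sentences
        self.linear = nn.Linear(4 * hidden, n_classes)  # 4 feature blocks of size hidden

    def forward(self, in1, in2):
        u = self.encoder(in1)
        v = self.encoder(in2)
        features = torch.cat((u, v, torch.abs(u - v), u * v), 1)  # (batch, 4 * hidden)
        return self.linear(features)

model = SiameseClassifier(ToyEncoder(vocab_size=100, hidden=16), hidden=16)
premise = torch.randint(0, 100, (2, 7))     # batch of 2 premises, 7 tokens each
hypothesis = torch.randint(0, 100, (2, 5))  # batch of 2 hypotheses, 5 tokens each
logits = model(premise, hypothesis)
print(logits.shape)  # torch.Size([2, 3])
```

Because the two sentences go through the same encoder separately, they do not need to have the same length, only the same batch size.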