Sentence similarity

Thanks for bringing this up @TomLisankie. I saw a few people using this for the Kaggle Quora competition, e.g. here and here, and it sounds interesting. The article here mentioned they beat one of the other two Siamese networks by using xgboost and a bunch of hand-selected features. They also described a DL architecture afterwards that looks a little like a Siamese network to me, with lots of extra layers.

Since this is all more than a year old, I’m wondering if we can simplify it now with more powerful approaches such as ULMFiT. Did you by any chance try out ULMFiT with Siamese networks? I’d be curious about the experience. Otherwise, that might be one of the next things I’ll try.

MultiNLI looks like it might be a better version of SNLI

I’ve cleaned up my notebooks and made them available on GitHub at SiameseULMFiT.

I’ve taken the model from lesson 10 and retrained it on the SNLI dataset. Then I used that encoder to create a vector for each sentence, concatenated those vectors, and passed them to a classifier network.
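Roughly, the encoding step looks like the sketch below. This is not the notebook code: a toy LSTM stands in for the fine-tuned AWD-LSTM encoder, and the concat-pooling mirrors the lesson 10 classifier head (last time step plus average and max over time).

    import torch
    import torch.nn as nn

    encoder = nn.LSTM(input_size=400, hidden_size=400)  # stand-in, not the real encoder

    def sentence_vector(embedded):             # embedded: (seq_len, batch, 400)
        outputs, _ = encoder(embedded)         # hidden states for every time step
        last = outputs[-1]                     # final time step
        avg_pool = outputs.mean(dim=0)         # average over time
        max_pool = outputs.max(dim=0)[0]       # max over time
        return torch.cat([last, avg_pool, max_pool], dim=1)  # (batch, 1200)

    vec = sentence_vector(torch.randn(12, 1, 400))  # one 12-token "sentence"
    print(vec.shape)  # torch.Size([1, 1200])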

I’m not sure what I’m doing wrong here, but I’m only getting about 40% accuracy when predicting the SNLI entailment category.

I’d love any advice on how to improve this system.


Just a couple of observations:

  • when you train your language model in ULMFiT_Tokenize.ipynb, you define your BOS marker as x_bos. The wikipedia pre-trained model from lesson 10 used xbos instead, and it also prefixes each text with a FLD token ({BOS} {FLD} 1); see the snippet after this list. I wonder if/how much impact using different tokens has on your language model.
  • I noticed you are using the files snli_dev.json and snli_test.json from the SNLI dataset when fine-tuning your language model and when building the Siamese classifier. Why are you not using snli_1.0_train.json? That file is much larger than the other two, so maybe you just aren’t providing enough training data.
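For reference, a minimal sketch of the lesson 10 convention (the prefix helper is hypothetical, but the marker strings are the ones that notebook defines; to the wikitext-pretrained model, a new spelling like x_bos is just an unknown token):

    BOS = 'xbos'   # beginning-of-stream marker from the lesson 10 notebook
    FLD = 'xfld'   # data-field marker; each document starts "xbos xfld 1 ..."

    def prefix(text, field_id=1):
        return f'{BOS} {FLD} {field_id} {text}'

    print(prefix('A man inspects the uniform of a figure.'))
    # -> xbos xfld 1 A man inspects the uniform of a figure.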

I haven’t looked at your Siamese network architecture in detail; maybe the larger training set will already solve your issue.


Thanks for having a look.
Not including the training set was an embarrassing oversight; I’ve fixed that now.

I’ve re-run my pre-training phase with the full data set. I’m getting a loss of 2.9 and an accuracy of 43% when training the language model on the SNLI corpus. That’s a very good result. For comparison, lesson 10 got a loss of 3.9 and an accuracy of 31% in pre-training. So I think the tokens are fine.

I’ve tried the classifier again with the bigger dataset, and got the same result.

Another point of comparison: the lesson 10 classifier got 93% accuracy on the first epoch. If my pre-trained classifier starts at 36% on the first epoch, I think it’s doomed from the start.

I think that one of the following must be true:

  1. ULMFit doesn’t produce vectors that are suitable for semantic similarity tasks.
  2. I’ve made some kind of mistake in my coding that’s preventing the network from training.
  3. The Siamese Architecture is a bad fit for this task.

I still feel like #2 is most likely, but I can’t find where I’m going wrong.
I’ve tried using just the last hidden state to see if that helps, and I get the same result.
I’ve tried changing the hidden size of the classifier layer too.

@jeremy I’ve been attempting to use the pre-trained LM from lesson 10 to create sentence vectors. I’d like to use the vectors to build a semantic search system.
My first attempt at using pooled hidden states as vectors (described here) showed that semantically different sentences weren’t appreciably different from semantically similar ones. Further attempts to build a classifier from the LM to predict entailment yielded similar results. The classifier is a Siamese network, available here.

Questions:

  1. Should the pooled hidden states of a LM produce vectors suitable for determining sentence similarity? In other words, would you expect 2 semantically similar sentences to have a greater cosine similarity than 2 unrelated sentences? (A quick check along these lines is sketched after the questions.)
  2. I’m not sure how to proceed. Does this look like a reasonable approach? What do you do when you get stuck on a problem like this?
  3. Am I missing something obvious?
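A minimal sketch of the check behind question 1 (random tensors stand in for the pooled LM vectors so the snippet runs on its own):

    import torch
    import torch.nn.functional as F

    # Stand-ins for two pooled sentence vectors from the encoder.
    u, v = torch.randn(1, 1200), torch.randn(1, 1200)
    print(F.cosine_similarity(u, v, dim=1).item())
    # If the vectors capture semantics, paraphrase pairs should score
    # clearly higher than unrelated pairs under this measure.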

Any insight is greatly appreciated.

ULMFiT is all about fine-tuning. I wouldn’t expect it to work without that step. I would expect it to work well for semantic similarity if you fine-tune a Siamese network on a ULMFiT encoder.


Thanks for your input!

Update on my progress:
I made 2 changes that have boosted my performance from 40% to 50%.
The first was to sort my sentences by length (sketched below).
The other was that I switched to the MultiBatchRNN encoder.
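The length sorting works roughly like this (not my actual loader, just the idea, with a hypothetical length_sorted_batches helper: batching sentences of similar length together keeps padding to a minimum):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def length_sorted_batches(seqs, batch_size, pad_id=1):  # pad id 1 as in lesson 10
        order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
        for start in range(0, len(order), batch_size):
            batch = [torch.tensor(seqs[i]) for i in order[start:start + batch_size]]
            yield pad_sequence(batch, batch_first=True, padding_value=pad_id)

    for b in length_sorted_batches([[5, 6], [7, 8, 9, 10], [11, 12, 13]], 2):
        print(b.shape)  # (2, 3) then (1, 4): little wasted padding per batch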

50% is still a very poor result, so I’m going to dig further into the InferSent code to see what might be different.

The other thing I did was to validate my loader and model code on the original IMDB task.
I was able to get good results, though not quite as good as lesson 10’s.

Update: I’ve gotten 61% accuracy now. Better, but not great: InferSent gets an accuracy of 84.5% on SNLI.

You’re making good progress!


Update: I changed my vector concatenation to the way InferSent does it. My forward pass now looks like this:

def forward(self, in1, in2):
    u = self.encode(in1)    # sentence vector for the first input
    v = self.encode(in2)    # sentence vector for the second input
    # InferSent-style features: both encodings plus their element-wise
    # absolute difference and product
    features = torch.cat((u, v, torch.abs(u - v), u * v), 1)
    out = self.linear(features)
    return out
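As I understand InferSent’s design, the extra |u - v| and u * v terms are the point: they hand the linear layer a direct per-dimension view of the distance and agreement between the two encodings, which a plain concatenation would force the classifier to learn on its own.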

This has improved my accuracy to about 71%


Hi,
I read the code in your ULMFiT_Classify notebook, and it seems that you don’t tokenize your input when making the Dataset. Sorry if I’m wrong.


Tokenizing happens in a different notebook. I wanted to split up tokenizing, pre-training, and classification to make the notebooks clearer.


When I run your ULMFit_classify notebook after running the other two notebooks, it gives me an error: “ValueError: optimizing a parameter that doesn’t require gradients”. I’ve been stuck here for the past few hours!

Hey, sorry you got stuck!
It looks like you’re using an earlier version of PyTorch. Fastai defaults to 0.3.1, but I’m using 0.4.1.

Please upgrade and try again.


Did upgrading to 0.4.1 work for you?

Hello Brian, I got it to work, but I’m not able to train it. When I run the fit function with multiple epochs, I run out of GPU memory. I’m using Google Compute Engine with a Tesla GPU with 11 GB of memory. Which GPU did you use? Also, if it’s possible, can you please upload your pre-trained model? That would be a really big help for me. I’ve attached a picture of where I run out of memory. After running this cell, it gives me a CUDA memory error. I’ve also tried reducing the batch size.

[screenshot: cell output showing the CUDA out-of-memory error]

I’m using a GTX 1080 Ti. It also has 11 GB of memory. You can try reducing the BPTT parameter as well as the batch size.
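If you’re following the lesson 10 style loaders (fastai 0.7.x), both knobs are just numbers passed when the data is built; a hypothetical sketch for the language-model stage (variable names follow that notebook, yours may differ, and the tiny trn_lm here is a toy stand-in so the snippet runs):

    import numpy as np
    from fastai.text import LanguageModelLoader  # fastai 0.7.x

    bs   = 16  # batch size, e.g. halved from 32
    bptt = 35  # backprop-through-time window, e.g. halved from 70

    # toy stand-in for the numericalized training documents
    trn_lm = [np.array([2, 3, 4, 5]), np.array([6, 7, 8])]
    trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)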

I can upload my retrained model, but it’ll take a while. I’ve got a slow upload speed.

Thanks, Brian. I really appreciate your help. Are you uploading to github or another source?

@avinregmi Here is my pre-trained SNLI language model. Hope it helps:
