Using ULMFiT for Natural Language Inference

I think the two approaches are different. In the first approach, you have one encoder: you use it to encode the premise and then reuse it to encode the hypothesis (the weights are shared and updated in both cases). The inputs are always coherent texts, and you train your model as a document/sentence encoder. In the second approach, you also have one encoder, but the input is no longer a coherent text; it is a concatenation of two texts that might not be related to each other. After training, the first approach gives us a good document/sentence encoder that can be used to represent a text, while the second gives us an encoder that knows whether its two parts are related or not. In addition, in the second approach, you have to double the size (and number of parameters) of the input layer to be able to take the concatenation of the two texts. This is expensive if you are using a complex encoder. In the first approach, we only need to double the input size of the simple classifier that follows, which is cheaper.
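For readers following along, the first approach can be sketched in plain PyTorch roughly as below. This is only an illustration under stated assumptions: `TinyEncoder` and `SiameseNLI` are hypothetical names, the small LSTM stands in for a pretrained ULMFiT encoder, and none of this is fastai API.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder:
    embed tokens, run an LSTM, max-pool over time."""

    def __init__(self, vocab_size=100, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        out, _ = self.rnn(self.emb(tokens))   # out: (batch, seq_len, hidden_dim)
        return out.max(dim=1).values          # (batch, hidden_dim)


class SiameseNLI(nn.Module):
    """One encoder applied to premise and hypothesis separately,
    followed by a simple classifier on the two concatenated vectors."""

    def __init__(self, encoder, hidden_dim=32, n_classes=3):
        super().__init__()
        self.encoder = encoder                # the same module is reused, so weights are shared
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, premise, hypothesis):
        u = self.encoder(premise)
        v = self.encoder(hypothesis)
        return self.classifier(torch.cat([u, v], dim=1))


model = SiameseNLI(TinyEncoder())
premise = torch.randint(0, 100, (4, 7))       # batch of 4 premises, 7 tokens each
hypothesis = torch.randint(0, 100, (4, 5))    # batch of 4 hypotheses, 5 tokens each
logits = model(premise, hypothesis)
logits.sum().backward()                       # gradients from both branches reach the one encoder
```

Because the encoder object appears only once in the model, calling it twice shares the weights, and only the classifier's input size doubles.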

Ok, I see your point. I was too focused on an end-to-end solution.

I googled ‘pytorch weight sharing’
Would this link be helpful to you?

Thanks, @urmas.pitsi
Yes, I think in PyTorch we just need to reuse a Module. But I find it hard to implement this end-to-end in the current fastai ULMFiT framework. I tried two ways: either creating two ModelData objects, one storing the premise and one storing the hypothesis, and then calling the learner twice; or creating one ModelData object that has two fields of data. In the first case, I don’t know how to combine the outputs and build a classifier on top so that the whole system is end-to-end and uses only one learner object. In the second case, I don’t know how to encode the different fields of the ModelData object separately, combine them, and pass the result to the next module (layer).

I agree that encoding the hypothesis and premise separately has some benefits. If you look on the NLI website, a lot of models encode them separately and then use a concatenation of certain combinations of the representations (the difference, the sum, etc.) as the input to the final layer.
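As a concrete example of such a combination, InferSent feeds its classifier the two sentence vectors together with their absolute difference and element-wise product. A minimal sketch, assuming `u` and `v` are already-encoded premise and hypothesis vectors (the function name `combine` is made up for illustration):

```python
import torch


def combine(u, v):
    """InferSent-style features: [u, v, |u - v|, u * v].

    u, v: (batch, hidden_dim) sentence vectors.
    Returns a (batch, 4 * hidden_dim) tensor for the final classifier.
    """
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=1)


u = torch.ones(2, 3)
v = 2 * torch.ones(2, 3)
features = combine(u, v)   # shape (2, 12)
```

The classifier that follows then needs an input size of four times the encoder's output size.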

I think the main point in favour of concatenation (besides its simplicity) is that the model can be trained to condition the processing of one sequence on the other, optionally using a special token. If the model has been pretrained to capture long-term dependencies via language modelling, I would assume that this conditioning might be more beneficial than encoding them independently of each other.
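The concatenation idea could look something like this at the data-preparation level. It is a toy sketch: the separator token name `xxsep`, the whitespace tokenizer, and the helper names are all assumptions for illustration, not what ULMFiT or fastai actually use.

```python
SEP = "xxsep"  # hypothetical separator token; the name is an assumption


def build_vocab(texts):
    """Map tokens to ids, reserving ids for unknown, padding and separator."""
    itos = ["xxunk", "xxpad", SEP]
    for text in texts:
        for tok in text.lower().split():
            if tok not in itos:
                itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}


def encode_pair(premise, hypothesis, stoi):
    """Turn a premise/hypothesis pair into one id sequence:
    premise tokens, then the separator, then hypothesis tokens."""
    tokens = premise.lower().split() + [SEP] + hypothesis.lower().split()
    return [stoi.get(tok, stoi["xxunk"]) for tok in tokens]


stoi = build_vocab(["A man sleeps", "A man is awake"])
ids = encode_pair("A man sleeps", "A man is awake", stoi)
```

The single sequence `ids` can then be fed to one encoder, which sees the premise before it reads the hypothesis.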

Thanks, @sebastianruder. Yes, I will try both approaches and see how it goes. I’m thinking about how to implement the approach that encodes the hypothesis and premise separately in the ULMFiT fastai framework effectively. If you or someone else has some ideas on this, please share. Thanks so much!

I’ve tried using the lesson 10 network to build a classifier for the SNLI dataset: SiameseULMFiT

It’s a siamese network where the encoder is reused for each of the sentences. The 2 vectors are then concatenated and fed into a classifier network.

I haven’t been getting good results, just a little better than random.
@asotov What results did you get with your attempts?

@sebastianruder I built a Siamese network and posted it here.
I’m only getting about 50% accuracy.
@Samuel what accuracy did you get in your experiments?

Did you try to do some debugging? Does the model learn anything when you don’t load a pretrained model but train it from scratch? If it doesn’t, then maybe the problem is with the training.

Actually, this is my problem too: training an end-to-end model with multiple inputs and shared weights. If anyone has any ideas on making datasets and dataloaders compatible with fastai, it would be very helpful. Also, in this case I was wondering if we could use some kind of attention on the encoders’ outputs. I think it would be helpful. That requires two separate encoders.
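Outside of fastai, a plain PyTorch `Dataset` with a custom `collate_fn` is one way to feed two inputs per example. The sketch below is an assumption-laden illustration: `PairDataset`, `collate_pairs`, and the padding index `PAD_ID = 1` are made-up names and values, and wiring this into a fastai learner is not shown.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_ID = 1  # assumed padding index; use whatever your vocabulary defines


class PairDataset(Dataset):
    """Holds already-numericalized premises and hypotheses plus labels."""

    def __init__(self, premises, hypotheses, labels):
        assert len(premises) == len(hypotheses) == len(labels)
        self.premises, self.hypotheses, self.labels = premises, hypotheses, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return (torch.tensor(self.premises[i]),
                torch.tensor(self.hypotheses[i]),
                self.labels[i])


def collate_pairs(batch):
    """Pad premises and hypotheses separately, each to its own batch maximum."""
    premises, hypotheses, labels = zip(*batch)
    p = pad_sequence(list(premises), batch_first=True, padding_value=PAD_ID)
    h = pad_sequence(list(hypotheses), batch_first=True, padding_value=PAD_ID)
    return p, h, torch.tensor(labels)


ds = PairDataset([[5, 6, 7], [8]], [[9], [10, 11]], [0, 2])
dl = DataLoader(ds, batch_size=2, collate_fn=collate_pairs)
p, h, y = next(iter(dl))   # p: (2, 3), h: (2, 2), y: (2,)
```

Each batch then carries two token tensors, which a model with a two-argument `forward(premise, hypothesis)` can consume directly.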

Thanks, I’ve made a number of changes and now I’m getting about 71%. One of the big improvements was to concatenate the vectors in the same manner as InferSent. The other big problem was that I wasn’t padding my data correctly.
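The post does not say what the padding problem actually was. As one common precaution (an assumption, not the poster's fix), padded positions can be masked out before pooling so they never contribute to the sentence vector. `masked_max_pool` and the padding index `pad_id=1` are illustrative names and values.

```python
import torch


def masked_max_pool(outputs, tokens, pad_id=1):
    """Max-pool encoder outputs over time while ignoring padded positions.

    outputs: (batch, seq_len, hidden_dim) encoder outputs
    tokens:  (batch, seq_len) token ids, with pad_id marking padding
    Note: a row made up entirely of padding would pool to -inf.
    """
    pad_mask = (tokens == pad_id).unsqueeze(-1)            # True where padded
    return outputs.masked_fill(pad_mask, float("-inf")).max(dim=1).values


outputs = torch.tensor([[[1.0, 2.0], [9.0, 9.0]]])   # second time step is padding
tokens = torch.tensor([[5, 1]])
pooled = masked_max_pool(outputs, tokens)
```

Without the mask, the large values at the padded step would leak into the pooled vector.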

Awesome! That’s great to hear! Would love if you could share your code or maybe even create a PR.

I’ve posted my work to this repo.
I didn’t use the lib for the final training of the model. I’m not sure how to make the data loader and learner work with the siamese architecture. If I can figure out how to make it work with the lib, I’d love to make a PR.

Thanks for the pointer! That’d be awesome. Even merging your siamese architecture into the fastai library might be useful.

Could you give me a pointer to the changes you made for padding data? I may be making a similar mistake.

Good work, @brian! I’m trying to do the same thing. Thanks for sharing.

Yep, I just went back and cleaned up my notebooks. I’m retraining and will push them to GitHub after it’s done. Look for an update tonight.

I just pushed my latest and it’s at 81% accuracy. Not quite at InferSent level, but close! I’m going to switch gears and try to make the network produce vectors that can be used directly by nmslib and find the entailed sentence in the 5 to 10 nearest neighbors.
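To make the retrieval idea concrete, here is a brute-force stand-in in pure Python. It is not nmslib (which builds an approximate index for speed); it only illustrates ranking stored sentence vectors by cosine similarity to a query vector. The function names and toy vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query, vectors, k=5):
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]   # toy encoder outputs
top = nearest([1.0, 0.0], sentence_vectors, k=2)
```

An approximate-nearest-neighbor library would replace `nearest` once the number of stored sentences grows large.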


I have a simple question. What type of word embedding (e.g. word2vec or ELMo) is used in the embedding layers of the ULMFiT architecture? Does ULMFiT use such embeddings, or does it create its own embeddings related to the task and datasets?

Can anybody help me, please?

The embeddings in ULMFiT are trained during language model pre-training on the WikiText-103 dataset. When you then fine-tune the language model on a specific dataset, you are fine-tuning these pre-trained embeddings.
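A toy PyTorch illustration of what "fine-tuning the pre-trained embeddings" means: the embedding matrix is just another trainable parameter that starts from the pretrained values. Everything here (the sizes, the dummy loss, the variable names) is an assumption made for the sketch, not ULMFiT code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pretrained = nn.Embedding(10, 4)               # stands in for embeddings learned during LM pre-training
emb = nn.Embedding(10, 4)
emb.load_state_dict(pretrained.state_dict())   # start fine-tuning from the pretrained weights
before = emb.weight.detach().clone()

optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
tokens = torch.tensor([[1, 2, 3]])             # a tiny "target dataset" batch
loss = emb(tokens).pow(2).sum()                # dummy loss standing in for the real objective
loss.backward()
optimizer.step()

# Rows for tokens seen in the batch have moved; the others are still the pretrained values.
changed = (emb.weight.detach() != before).any(dim=1)
```

Only the rows for tokens that appear in the new data receive gradient in this sketch, which is why the rest keep their pretrained values.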

Thanks for replying.

So I cannot use BERT or ELMo embeddings in stages 2 and 3 of the ULMFiT model?