Using ULMFiT for Natural Language Inference


I would like to try the ULMFiT language model on Natural Language Inference (NLI) datasets. Given two textual snippets s_1 and s_2, a premise and a hypothesis, the goal is to assign the pair one of three labels: entailment, contradiction, or neutral. I want to encode the premise and the hypothesis with the ULMFiT language model’s encoder, as follows:

ULMFiT_enc(s_1) = h_c(s_1)
ULMFiT_enc(s_2) = h_c(s_2)
where h_c = [h_T, maxpool(H), meanpool(H)] and H = {h_1, ..., h_T} are the hidden states given by the ULMFiT language model, either pre-trained or fine-tuned.
Then I concatenate h_c(s_1) and h_c(s_2) and build a classifier on top.
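
For readers following along, here is a minimal NumPy sketch of the h_c definition above and of the pair concatenation. The random arrays are only stand-ins for the real encoder hidden states, and the function name concat_pool is made up for this illustration.

```python
import numpy as np

def concat_pool(H):
    """H has shape (T, d): the hidden states h_1..h_T of one text.
    Returns h_c = [h_T, maxpool(H), meanpool(H)] with shape (3*d,)."""
    last = H[-1]            # h_T, the final hidden state
    mx = H.max(axis=0)      # element-wise max over time
    mean = H.mean(axis=0)   # element-wise mean over time
    return np.concatenate([last, mx, mean])

rng = np.random.default_rng(0)
d = 4
H_premise = rng.normal(size=(7, d))      # stand-in for the states of s_1
H_hypothesis = rng.normal(size=(5, d))   # stand-in for the states of s_2

# Input to the classifier: [h_c(s_1), h_c(s_2)]
pair_features = np.concatenate([concat_pool(H_premise), concat_pool(H_hypothesis)])
print(pair_features.shape)  # (24,)
```

The two texts may have different lengths T; pooling over time is what makes both h_c vectors the same size.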

I am struggling with how to use TextDataset/DataLoader for NLI datasets, i.e., how to represent both the premise and the hypothesis, and how to modify RNN_Encoder/get_rnn_classifier to encode both texts while reusing the encoder. I would be grateful if somebody could give some suggestions or direct me to a tutorial or sample code (if any).

@sebastianruder @jeremy @nickl @wgpubs @asotov


Hi, @Samuel!
Would you like to create an end-to-end model? If not, then you can just run the set of s_1 and s_2 through the pre-trained RNN and get the h_c(s_i) features. Then build any model you like over these features to classify.

If you want an end-to-end model with the encoder, then you need to double the parameters in TextDataset so that it holds two sequences of sentences (for s_1 and s_2), and you also need two sets of hidden layers in your RNN.
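
The feature-based route described above could look roughly like the sketch below. It assumes the h_c features for premises and hypotheses have already been extracted; here they are random stand-ins, and the softmax-regression classifier is just one simple choice, not something prescribed in this thread.

```python
import numpy as np

def train_softmax(X, y, n_classes=3, lr=0.5, epochs=200):
    """Softmax regression trained with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # class probabilities
        grad = (P - Y) / n                            # gradient of mean cross-entropy
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    return (X @ W + b).argmax(axis=1)

# Random stand-ins for precomputed h_c(s_1) and h_c(s_2) features
rng = np.random.default_rng(0)
n, feat = 60, 6
prem = rng.normal(size=(n, feat))
hyp = rng.normal(size=(n, feat))
labels = rng.integers(0, 3, size=n)       # 0/1/2 for the three NLI labels

X = np.concatenate([prem, hyp], axis=1)   # [h_c(s_1), h_c(s_2)] per pair
W, b = train_softmax(X, labels)
preds = predict(X, W, b)
```

Any off-the-shelf classifier could replace train_softmax here; the only NLI-specific step is the concatenation of the two feature vectors.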

@asotov: Thanks for your reply. I would like to build an end-to-end model, but at this stage, getting the h_c(s_i) features and building a model over them would be fine for me. It will help me get a better understanding of ULMFiT before building an end-to-end model. Do you know how to get the language model’s RNN hidden states, or h_c, for a set of textual snippets, i.e., ULMFiT_enc({s_{i_1}, s_{i_2}, ..., s_{i_n}}) = {h_c(s_{i_1}), h_c(s_{i_2}), ..., h_c(s_{i_n})}?

Hi, @Samuel. As it happens, I have already used the ULMFiT model to get h_c from the pre-trained RNN. Here is the code I used to get features for clustering:

# Assumes fastai 0.7 as used in the DL2 lesson 10 (imdb) notebook
from fastai.text import *

bptt,em_sz,nh,nl = 64,200,512,3
vs = len(itos)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48

# 10393

y = np.zeros((len(test_clas), 3*em_sz))
test_ds = TextDataset(test_clas, y)
test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1)
# A small hack: we do not need to train the model, so we just pass test_dl as the trn_dl argument
md = ModelData(PATH, test_dl, None)

#define our custom head to just get h_c
class PoolingLinearClustering(nn.Module):
    def __init__(self, layers, drops):
        super().__init__()  # must run before submodules are registered
        # With the default layers=[em_sz*3] this list is empty; forward() does not use it
        self.layers = nn.ModuleList([
            LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

    def pool(self, x, bs, is_max):
        f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
        return f(x.permute(1,2,0), (1,)).view(bs,-1)

    def forward(self, input):
        raw_outputs, outputs = input
        output = outputs[-1]  # states of the last RNN layer: (seq_len, bs, hidden)
        sl,bs,_ = output.size()
        avgpool = self.pool(output, bs, False)
        mxpool = self.pool(output, bs, True)
        # h_c = [h_T, maxpool(H), meanpool(H)]
        x = torch.cat([output[-1], mxpool, avgpool], 1)
        return x, raw_outputs, outputs

#define our function to get rnn learner with head defined above
def get_rnn_clustering(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token, layers=[em_sz*3], drops=None, bidir=False,
                      dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                      dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop)
    return SequentialRNN(rnn_enc, PoolingLinearClustering(layers, drops))

dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.5

#get model
m = get_rnn_clustering(bptt, 20*70, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

#define learn as usual
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.metrics = [accuracy]

# Here we just feed test_dl into the data object
learn.data.test_dl = test_dl

# Then use the ordinary predict on the test set
predictions = learn.predict(is_test=True)

# predictions has shape (len(test_clas), 3*em_sz), in my case (10393, 600)

Are you sure you want learn.clip to be 25. and not .25?

learn.clip can be removed entirely, because we do not train the model. I just copied the clip value from the IMDb lesson in the DL2 course.

@asotov: Thanks so much for sharing your code. It would definitely help my case.

@asotov: Running your code multiple times will result in different results for the same input. Do you know why? And do we need to set dropouts = 0, i.e., set dropmult = 0 since we are not training?

@Samuel, dropout is used only at training time. When you call predict, it internally calls model.eval() to switch the model to evaluation mode.

I am sorry, I forgot the main thing: after creating the learner, you need to load the encoder from the pre-trained language model. Then you get the correct, expected result.
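
Both points can be checked on a toy PyTorch module. This is only an illustrative sketch with made-up layers, not the ULMFiT model: dropout stops being a source of randomness in eval mode, and two freshly created models agree only once they hold the same weights. The load_state_dict call stands in for loading the saved encoder (the lesson 10 notebook uses learn.load_encoder for that step).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(4, 8)

net.train()                    # dropout active: repeated passes generally differ
a, b = net(x), net(x)

net.eval()                     # dropout disabled: repeated passes are identical
with torch.no_grad():
    c, d = net(x), net(x)
print(torch.equal(c, d))       # True

# Two freshly constructed models start from different random weights,
# so their outputs differ until the same weights are loaded into both.
m1, m2 = nn.Linear(8, 8), nn.Linear(8, 8)
m2.load_state_dict(m1.state_dict())   # stand-in for loading a saved encoder
with torch.no_grad():
    same_after_load = torch.equal(m1(x), m2(x))
print(same_after_load)         # True
```

So if results change between runs even though predict uses eval mode, randomly initialised (not loaded) weights are a likely cause.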

@asotov, thanks a lot for your help!

For training an end-to-end model, the most straightforward approach is probably just to concatenate the premise and the hypothesis (with a delimiter in the middle). You can then feed the concatenated sequence into the model and train the model to produce the right label.


You could use the same mechanism Jeremy does in the lesson 10 imdb notebook to do this via the xfld data tag. Simply pass both your premise and hypothesis columns as text columns to:

def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)
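
To see what get_texts would produce for an NLI row, here is a small pure-Python sketch that mirrors only its string building for one premise/hypothesis pair. The tag values 'xbos' and 'xfld' are assumed from the lesson 10 notebook, and join_fields is a made-up helper for this illustration.

```python
BOS, FLD = 'xbos', 'xfld'   # assumed tag values from the lesson 10 notebook

def join_fields(fields):
    """Mirror the string building in get_texts for one row's text columns."""
    text = f'\n{BOS} {FLD} 1 ' + fields[0]
    # get_texts numbers later columns as i - n_lbls, so the column right
    # after the first text column also receives the number 1
    for k, field in enumerate(fields[1:], start=1):
        text += f' {FLD} {k} ' + field
    return text

example = join_fields(['A man is sleeping.', 'A person is awake.'])
print(repr(example))
# '\nxbos xfld 1 A man is sleeping. xfld 1 A person is awake.'
```

Note that with this indexing both fields are tagged `xfld 1`. If the model should be able to tell premise from hypothesis by the field number, the second tag would have to be numbered differently.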

Curious as to why you are passing the concatenated output through a relu here?

Why not just return x, raw_outputs, outputs?

Applying ReLU here is not necessary at all, because it loses information from the hidden states. That code came from my experiments on the clustering problem. So as not to confuse anybody, I have removed F.relu(x) from the code above. Thank you, @wgpubs, for your attentiveness.

Cool and no problem.

Also another kinda related question, how are you dealing with padding when it comes to the hidden states?

Are you running documents through with no padding, one at a time, or do the hidden vectors reflect padded docs? I’m wondering whether the latter in particular would affect the interpretability of the vectors.

@wgpubs, @asotov

I tried @asotov’s code with different kinds of padding, i.e., SortishSampler, SortSampler, and no padding, and there was no significant difference in performance in a downstream classification task.
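
If padding did turn out to matter, one option would be to mask the padded positions before pooling. The NumPy sketch below is hypothetical and not part of @asotov’s code: it assumes right padding (real tokens first) and that the true length of each text is known. With left padding the mask and the "last state" index would have to change accordingly.

```python
import numpy as np

def masked_concat_pool(H, lengths):
    """H: (T, bs, d) padded hidden states; lengths: true length of each text.
    Assumes right padding, i.e. real tokens occupy positions 0..length-1."""
    lengths = np.asarray(lengths)
    T, bs, d = H.shape
    mask = np.arange(T)[:, None] < lengths[None, :]        # (T, bs), True on real tokens
    last = H[lengths - 1, np.arange(bs)]                    # last real state per text
    mx = np.where(mask[:, :, None], H, -np.inf).max(axis=0)
    mean = (H * mask[:, :, None]).sum(axis=0) / lengths[:, None]
    return np.concatenate([last, mx, mean], axis=1)         # (bs, 3*d)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 2, 3))
lengths = [5, 3]
H[3:, 1] = 99.0     # pretend these states were computed on pad tokens
out = masked_concat_pool(H, lengths)
print(out.shape)    # (2, 9)
```

The second text's features then equal what pooling over its three real positions alone would give, regardless of what the encoder produced on the pad tokens.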

@wgpubs, @asotov

I tried an experiment to see if @asotov’s code works well.

In the first case, I got rid of all the techniques in the ULMFiT paper, such as LM fine-tuning, discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. That means I only used the LM pre-trained on Wikitext 103 to get the h_c representations and then built a classifier on top of that.

+ python path-to-clasification-data data/wt103 --lm-id pretrain_wt103 --notrain True 
(notrain = True: no fine-tuning the LM)
+ python path-to-clasification-data --lm-id pretrain_wt103 --clas-id pretrain_wt103 --startat 2 --unfreeze False --use_clr False --use_regular_schedule True --use_discriminative False 
(startat > 1: no gradual unfreezing)

In the second case, I tried @asotov’s code to get h_c representations with the LM pre-trained on Wikitext 103 and saved them to a file. Then I loaded the file and built the same classifier as in the first case on top.

In principle, the performance in both cases should match. However, accuracy in the first case was 9% higher than in the second. I suspect that in the first case I forgot to disable something, which led to the gap in performance. I would really like to understand why this happened, to get a better understanding of the code. Do you know what might cause the difference?

@sebastianruder @jeremy @wgpubs: Is there any way that I can encode the premise and the hypothesis separately (re-use the encoder) instead of concatenating and encoding them as a whole? In another experiment, I got much better results when encoding the premise and the hypothesis separately (with tensorflow).

You can build a 2-branch model, learning in parallel. One takes premise as input and other takes hypothesis. You can merge them in the end or earlier as you wish. Is that what you had in mind?

Thanks, @urmas.pitsi. Yes, that’s what I had in mind, though it’s not necessary to encode the two in parallel. I just want to reuse the encoder. In TensorFlow, this can be done using variable scopes for sharing variables:

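# Option 1: one scope, switched to reuse mode after the first call
# (assumes LM_encoder creates its variables with tf.get_variable)
with tf.variable_scope("scope_name") as scope:
    premise_enc = LM_encoder(premise)
    scope.reuse_variables()
    hypothesis_enc = LM_encoder(hypothesis)

# Option 2: re-enter the same scope with reuse=True
with tf.variable_scope("scope_name"):
    premise_enc = LM_encoder(premise)
with tf.variable_scope("scope_name", reuse=True):
    hypothesis_enc = LM_encoder(hypothesis)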

Then, I concatenate the two together and build a classifier on top

h = tf.concat([premise_enc, hypothesis_enc], 1)

But I don’t know how to implement this with fastai. Do you know any tutorial/sample code?
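
In plain PyTorch, on which fastai is built, weight sharing needs no special mechanism: calling the same module object twice reuses its parameters. Below is a minimal sketch under that assumption. ToyEncoder is a made-up stand-in for the ULMFiT encoder, and SharedEncoderNLI is a hypothetical wrapper, not a fastai class; wiring this into the fastai DataLoader and learner is not covered here.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the LM encoder: embed, run an LSTM, concat-pool."""
    def __init__(self, vocab, emb, hid):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, batch_first=True)

    def forward(self, tokens):                    # tokens: (bs, T)
        out, _ = self.rnn(self.emb(tokens))       # (bs, T, hid)
        # h_c = [h_T, maxpool(H), meanpool(H)]
        return torch.cat([out[:, -1], out.max(dim=1)[0], out.mean(dim=1)], dim=1)

class SharedEncoderNLI(nn.Module):
    def __init__(self, encoder, enc_dim, n_classes=3):
        super().__init__()
        self.encoder = encoder                    # one module, so one set of weights
        self.classifier = nn.Linear(2 * enc_dim, n_classes)

    def forward(self, premise, hypothesis):
        p = self.encoder(premise)                 # same weights ...
        h = self.encoder(hypothesis)              # ... reused for the second text
        return self.classifier(torch.cat([p, h], dim=1))

enc = ToyEncoder(vocab=100, emb=16, hid=32)
model = SharedEncoderNLI(enc, enc_dim=3 * 32)
premise = torch.randint(0, 100, (4, 9))           # batch of 4, length 9
hypothesis = torch.randint(0, 100, (4, 6))        # batch of 4, length 6
logits = model(premise, hypothesis)
print(logits.shape)   # torch.Size([4, 3])
```

Gradients from both the premise and the hypothesis branch accumulate in the single encoder during backpropagation, which is the behaviour the TensorFlow reuse=True pattern achieves.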

I read the TF link and tried to understand the scope-reuse logic. I get that the model has two inputs that share variables and weights. But logically this should be pretty much the same as concatenating the two inputs into one input for the model, which is the alternative others proposed above. Am I missing something?