Using ULMFiT for Natural Language Inference

Samuel · July 8, 2018, 10:07pm

Hi,

I would like to try the ULMFiT language model on Natural Language Inference (NLI) datasets. Given two textual snippets s_1 and s_2, a premise and a hypothesis, the goal is to label the pair with one of three labels: entailment, contradiction, or neutral. I want to encode the premise and hypothesis with the ULMFiT language model’s encoder, as follows:

ULMFiT_enc(s_1) = h_c(s_1)
ULMFiT_enc(s_2) = h_c(s_2)
where h_c = [h_T, maxpool(H), meanpool(H)] and H = {h_1, ..., h_T} are the hidden states given by the ULMFiT language model, either pre-trained or fine-tuned.
Then I concatenate h_c(s_1) and h_c(s_2) and build a classifier on top.
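
As a rough illustration of the pooling described above, here is a minimal NumPy sketch. The function name `concat_pool`, the random arrays standing in for hidden states, and the toy sizes are assumptions made for the example; they are not part of fastai.

```python
import numpy as np

def concat_pool(H):
    """H has shape (T, d): one hidden state per time step.
    Returns [h_T, maxpool(H), meanpool(H)] with shape (3 * d,)."""
    return np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])

rng = np.random.default_rng(0)
H_premise = rng.normal(size=(5, 4))     # toy premise: T=5 tokens, d=4
H_hypothesis = rng.normal(size=(3, 4))  # toy hypothesis: T=3 tokens, d=4

# Encode each snippet separately, then concatenate the two vectors
# to form the input of the classifier.
pair = np.concatenate([concat_pool(H_premise), concat_pool(H_hypothesis)])
print(pair.shape)
```

Each snippet yields a vector of size 3*d regardless of its length, so the classifier input has a fixed size of 6*d.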

I am struggling with how to use the TextDataset/DataLoader for NLI datasets, i.e., how to represent both the premise and the hypothesis, and how to modify the RNN_Encoder/get_rnn_classifier to encode both texts (re-using the encoder). I would be grateful if somebody could give some suggestions or direct me to a tutorial/sample code (if any).

@sebastianruder @jeremy @nickl @wgpubs @asotov

asotov · July 9, 2018, 11:23am

Hi, @Samuel!
Do you want to create an end-to-end model? If not, you can just run the sets of s_1 and s_2 through the pre-trained RNN and get the h_c(s_i) features. Then build any model you like on top of these features to classify.

If you want an end-to-end model with the encoder, then you need to double the parameters in TextDataset so that it holds two sequences of sentences (one for s_1 and one for s_2), and you also need two sets of hidden layers in your RNN.

Samuel · July 9, 2018, 2:20pm

@asotov: Thanks for your reply. I would like to build an end-to-end model, but at this stage getting the h_c(s_i) features and building a model over them would be fine for me. It will help me get a better understanding of ULMFiT before building an end-to-end model. Do you know how to get the language model’s RNN hidden states, or h_c, for a set of textual snippets, i.e., ULMFiT_enc({s_{i_1}, s_{i_2}, ..., s_{i_n}}) = {h_c(s_{i_1}), h_c(s_{i_2}), ..., h_c(s_{i_n})}?

asotov · July 10, 2018, 1:00pm

Hi, @Samuel. As it happens, I have already used the ULMFiT model to get h_c from the pre-trained RNN. Here is the code that I used to get features for clustering:

from fastai.text import *  # fastai 0.7 star import, as in the DL2 imdb notebook

bptt,em_sz,nh,nl = 64,200,512,3
vs = len(itos)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48

len(test_clas)
# 10393

y = np.zeros((len(test_clas), 3*em_sz))
test_ds = TextDataset(test_clas, y)
test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1)
# a small hack: we do not need to train the model, so we just pass test_dl as the trn_dl parameter
md = ModelData(PATH, test_dl, None)

#define our custom head to just get h_c
class PoolingLinearClustering(nn.Module):
    def __init__(self, layers, drops):
        super().__init__()
        self.layers = nn.ModuleList([
            LinearBlock(layers[i], layers[i + 1], drops[i]) for i in range(len(layers) - 1)])

    def pool(self, x, bs, is_max):
        f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
        return f(x.permute(1,2,0), (1,)).view(bs,-1)

    def forward(self, input):
        raw_outputs, outputs = input
        output = outputs[-1]
        sl,bs,_ = output.size()
        avgpool = self.pool(output, bs, False)
        mxpool = self.pool(output, bs, True)
        x = torch.cat([output[-1], mxpool, avgpool], 1)
        return x, raw_outputs, outputs

#define our function to get rnn learner with head defined above
def get_rnn_clustering(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token, layers=[em_sz*3], drops=None, bidir=False,
                      dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                      dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop)
    return SequentialRNN(rnn_enc, PoolingLinearClustering(layers, drops))

dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.5

#get model
m = get_rnn_clustering(bptt, 20*70, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3],
          drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

#define learn as usual
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

#Here we just feed test_dl into data object:
learn.data.test_dl = test_dl

#Then use the ordinary predict on the test set
predictions = learn.predict(is_test=True)

predictions.shape
# the shape will be (len(test_clas), 3*em_sz), in my case (10393, 600)

urmas.pitsi · July 10, 2018, 1:25pm

Are you sure you want learn.clip to be 25. and not .25?

asotov · July 10, 2018, 6:23pm

learn.clip can be removed entirely, because we do not train the model. I just copied the clip value from the imdb lesson in the DL2 course.

Samuel · July 10, 2018, 10:20pm

@asotov: Thanks so much for sharing your code. It would definitely help my case.

Samuel · July 11, 2018, 3:45am

@asotov: Running your code multiple times will result in different results for the same input. Do you know why? And do we need to set dropouts = 0, i.e., set dropmult = 0 since we are not training?

asotov · July 11, 2018, 4:23am

@Samuel, dropout is used only at training time. When you execute predict, it internally calls model.eval() to switch the model to evaluation mode.

I am sorry, I forgot the main thing: after creating the learner you need to load the encoder from the pre-trained language model. Then you get the correct, expected result.
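
To illustrate both points in plain PyTorch, here is a small sketch. It assumes a toy stand-in encoder (`make_encoder` is a hypothetical helper, not the ULMFiT encoder and not a fastai function): in eval mode dropout is inactive, so one model gives repeatable outputs, while a freshly created model differs from another one until the same weights are loaded.

```python
import torch
import torch.nn as nn

def make_encoder():
    # Hypothetical stand-in for the real encoder: a linear layer plus dropout.
    return nn.Sequential(nn.Linear(4, 3), nn.Dropout(p=0.5))

torch.manual_seed(0)
x = torch.ones(2, 4)

pretrained = make_encoder().eval()  # pretend these are the pre-trained weights
fresh = make_encoder().eval()       # a new model starts from its own random init

with torch.no_grad():
    # Dropout is a no-op in eval mode, so repeated calls agree.
    same_model_twice = torch.equal(pretrained(x), pretrained(x))
    # Different random weights give different features.
    before_loading = torch.equal(pretrained(x), fresh(x))
    # After copying the weights, both models produce the same features.
    fresh.load_state_dict(pretrained.state_dict())
    after_loading = torch.equal(pretrained(x), fresh(x))

print(same_model_twice, before_loading, after_loading)
```

So if the features change from run to run, the likely cause is the random initialisation of a model whose pre-trained weights were never loaded, rather than dropout.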

sebastianruder · July 11, 2018, 11:57am

@asotov, thanks a lot for your help!

For training an end-to-end model, the most straightforward approach is probably just to concatenate the premise and the hypothesis (with a delimiter in the middle). You can then feed the concatenated sequence into the model and train the model to produce the right label.

wgpubs · July 13, 2018, 4:19am

You could use the same mechanism Jeremy does in the lesson 10 imdb notebook to do this via the xfld data tag. Simply pass both your premise and hypothesis columns as text columns to:

def get_texts(df, n_lbls=1):
    # the first n_lbls columns hold the labels; every remaining column is a text field
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    # number the remaining fields 2, 3, ... (i-n_lbls alone would tag the second field as 1 again)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)
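
To show what the field-tagged text looks like for one premise/hypothesis pair, here is a small standalone sketch that needs neither pandas nor fastai. The helper `tag_fields` and the example sentences are made up for illustration, and the marker values 'xbos' and 'xfld' are assumed to match the BOS and FLD constants used in the notebook.

```python
BOS, FLD = 'xbos', 'xfld'  # assumed values of the notebook's marker tokens

def tag_fields(fields):
    """Join several text fields into one string, numbering the fields 1, 2, ..."""
    text = f'\n{BOS} {FLD} 1 {fields[0]}'
    for i, field in enumerate(fields[1:], start=2):
        text += f' {FLD} {i} {field}'
    return text

example = tag_fields(['A man is playing a guitar.', 'A person is making music.'])
print(repr(example))
```

The field marker then acts as the delimiter between premise and hypothesis, and the model sees the pair as a single sequence.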

wgpubs · July 13, 2018, 4:21am

Curious as to why you are passing the concatenated output through a relu here?

Why not just return x, raw_outputs, outputs?

asotov · July 16, 2018, 11:21am

It is not necessary to use ReLU here at all, because it leads to losing information from the hidden states. It is code from my experiments on the clustering problem. So, to avoid confusing anybody, I removed F.relu(x) from the code above. Thank you @wgpubs for your attentiveness.

wgpubs · July 16, 2018, 3:50pm

Cool and no problem.

Also, another kinda related question: how are you dealing with padding when it comes to the hidden states?

Are you running documents through with no padding, one at a time, or do the hidden vectors reflect padded docs? I’m wondering whether the latter in particular would really affect the interpretability of the vectors.
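
To make the concern concrete, here is a toy NumPy sketch comparing pooling over all positions with pooling over real tokens only. It is not taken from the code above: the function `masked_concat_pool`, the mask, the toy values, and the assumption that padding sits at the end of the sequence are all made up for illustration.

```python
import numpy as np

def masked_concat_pool(H, mask):
    """H: (T, d) hidden states; mask: (T,) booleans, True for real tokens.
    Pools over real tokens only and returns [last, max, mean]."""
    real = H[mask]
    return np.concatenate([real[-1], real.max(axis=0), real.mean(axis=0)])

# Two real tokens followed by two padding positions (toy values).
H = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [9.0, 9.0],
              [9.0, 9.0]])
mask = np.array([True, True, False, False])

naive = np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])
masked = masked_concat_pool(H, mask)

print(naive)
print(masked)
```

In this toy case the padded positions dominate all three parts of the naive vector, which is why it matters where the padding goes and whether it is masked out before pooling.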

Samuel · July 16, 2018, 4:16pm

@wgpubs, @asotov

I tried @asotov’s code with different kinds of padding, i.e., SortishSampler, SortSampler, and no padding, and there was no significant difference in performance in a downstream classification task.

Samuel · July 16, 2018, 4:35pm

@wgpubs, @asotov

I tried an experiment to see if @asotov’s code works well.

In the first case, I got rid of all the techniques in the ULMFiT paper, such as fine-tuning the LM, discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. That means I only used the LM pre-trained on Wikitext 103 to get the h_c representations and then built a classifier on top of them.

+ python finetune_lm.py path-to-clasification-data data/wt103 --lm-id pretrain_wt103 --notrain True 
(notrain = True: no fine-tuning the LM)
+ python train_clas.py path-to-clasification-data --lm-id pretrain_wt103 --clas-id pretrain_wt103 --startat 2 --unfreeze False --use_clr False --use_regular_schedule True --use_discriminative False 
(startat > 1: no gradual unfreezing)

In the second case, I tried @asotov’s code to get h_c representations with the LM pre-trained on Wikitext 103 and saved them to a file. Then I loaded the file and built the same classifier as in the first case on top.

In principle, the performance in both cases should match. However, accuracy in the first case was 9% higher than in the second. I suspect that in the first case I forgot to disable something, which led to the gap in performance. I really want to understand why this happened, to get a better understanding of the code. Do you know what might cause the difference?

Samuel · July 18, 2018, 9:42pm

@sebastianruder @jeremy @wgpubs: Is there any way that I can encode the premise and the hypothesis separately (re-use the encoder) instead of concatenating and encoding them as a whole? In another experiment, I got much better results when encoding the premise and the hypothesis separately (with tensorflow).

urmas.pitsi · July 20, 2018, 11:17am

You can build a 2-branch model, learning in parallel. One takes premise as input and other takes hypothesis. You can merge them in the end or earlier as you wish. Is that what you had in mind?

Samuel · July 20, 2018, 3:45pm

Thanks, @urmas.pitsi. Yes, that’s what I had in mind, though it’s not necessary to encode the two in parallel. I just want to reuse the encoder. In tensorflow, this can be done using Variable Scope for sharing variables:

with tf.variable_scope("scope_name") as scope:
    premise_enc = LM_encoder(premise)
    scope.reuse_variables()
    hypothesis_enc = LM_encoder(hypothesis)

Or

with tf.variable_scope("scope_name"):
    premise_enc = LM_encoder(premise)
with tf.variable_scope("scope_name", reuse=True):
    hypothesis_enc = LM_encoder(hypothesis)

https://www.tensorflow.org/versions/r1.1/programmers_guide/variable_scope

Then, I concatenate the two together and build a classifier on top:

h = tf.concat([premise_enc, hypothesis_enc], 1)
MLP_classifier(h)

But I don’t know how to implement this with fastai. Do you know any tutorial/sample code?
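
In plain PyTorch, weight sharing of this kind needs no special scope: calling the same module object twice reuses its parameters. Below is a minimal sketch under that assumption. `ToyEncoder` is a hypothetical stand-in for the ULMFiT encoder plus concat pooling, `SharedEncoderClassifier` and all sizes are made up for illustration, and feeding two inputs through the fastai data pipeline is not covered here.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the ULMFiT encoder followed by concat pooling."""
    def __init__(self, vocab_sz, emb_sz, n_hid):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, n_hid, batch_first=True)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        out, _ = self.rnn(self.emb(tokens))   # out: (batch, seq_len, n_hid)
        # [h_T, maxpool(H), meanpool(H)] -> (batch, 3 * n_hid)
        return torch.cat([out[:, -1], out.max(dim=1)[0], out.mean(dim=1)], dim=1)

class SharedEncoderClassifier(nn.Module):
    def __init__(self, encoder, enc_dim, n_classes=3):
        super().__init__()
        self.encoder = encoder                # one module, hence one set of weights
        self.head = nn.Sequential(
            nn.Linear(2 * enc_dim, 50), nn.ReLU(), nn.Linear(50, n_classes))

    def forward(self, premise, hypothesis):
        p = self.encoder(premise)
        h = self.encoder(hypothesis)          # second call reuses the same parameters
        return self.head(torch.cat([p, h], dim=1))

model = SharedEncoderClassifier(ToyEncoder(100, 8, 16), enc_dim=3 * 16)
premise = torch.randint(0, 100, (4, 7))      # batch of 4 premises, 7 tokens each
hypothesis = torch.randint(0, 100, (4, 5))   # batch of 4 hypotheses, 5 tokens each
logits = model(premise, hypothesis)
print(logits.shape)
```

During training, gradients from both the premise and the hypothesis branch accumulate in the single set of encoder weights, which is the same effect as `reuse=True` in the TensorFlow snippets.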

urmas.pitsi · July 20, 2018, 4:56pm

I read the TF link and tried to understand the scope reuse logic. I get that the model has two inputs that share variables and weights. But logically this should be pretty much the same as concatenating the two inputs into one input for the model, the alternative proposed above by others. Am I missing something?