ULMFit for sequence tagging

(Peter) #1


I saw on this question people discussing adapting ULMFit for regression instead of classification. This is a great adaptation.

I was wondering now about classification/regression at the word level, i.e. classifying or regressing each word of the input text (NER, POS tagging, …).
Which class should we change to get a predictor for sequence tagging?
@sebastianruder @jeremy


(Nathan Glenn) #2

I would also love this for tokenization of non-spaced text (Japanese, Mandarin, etc.). I’m watching this space very closely: http://nlp.fast.ai/category/seq_label.html


(Hong Emrys) #3

I have similar thoughts, and I am planning to explore this area. I am inspired by previous approaches; paper: https://arxiv.org/pdf/1603.01360.pdf, code: https://github.com/guillaumegenthial/sequence_tagging (sadly it is written in TensorFlow). I think adding a bi-LSTM and a CRF (conditional random field) layer on top of the AWD-LSTM might work. Or we can add just the CRF layer first if the full stack is hard to train.

But I am not sure whether Jeremy and Sebastian have worked on similar tasks before. If so, could you give some suggestions on how to do it?


(Peter) #4

In my understanding, the get_rnn_classifier function in the lm_rnn file:

def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops, bidir=False,
                       dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))

returns the SequentialRNN wrapper containing the rnn_enc backbone and the classifier layer.

Similarly, the language model function:

def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
                       dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, tie_weights=True, qrnn=False, bias=False):
    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                          dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))

returns a SequentialRNN wrapper with the rnn_enc and the linear decoder.
Here the linear decoder outputs, for every position, one score per vocabulary word for the next-word prediction (a softmax turns these scores into probabilities):

class LinearDecoder(nn.Module):
    initrange = 0.1  # class attribute read by the uniform_ init below

    def __init__(self, n_out, n_hid, dropout, tie_encoder=None, bias=False):
        super().__init__()
        self.decoder = nn.Linear(n_hid, n_out, bias=bias)
        self.decoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.dropout = LockedDropout(dropout)
        if bias: self.decoder.bias.data.zero_()
        # Weight tying: reuse the embedding matrix as the output projection.
        if tie_encoder: self.decoder.weight = tie_encoder.weight

    def forward(self, input):
        raw_outputs, outputs = input
        # Only the output of the last RNN layer is decoded.
        output = self.dropout(outputs[-1])
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        result = decoded.view(-1, decoded.size(1))
        return result, raw_outputs, outputs

Hence, I guess that if we want to output custom probabilities at each word step, that’s the class to adapt.
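A minimal sketch of what such an adaptation might look like, assuming PyTorch and the same (raw_outputs, outputs) input convention as LinearDecoder. The class name TokenTaggingDecoder and the n_tags parameter are hypothetical, and plain nn.Dropout stands in for fastai's LockedDropout:

```python
import torch
import torch.nn as nn

class TokenTaggingDecoder(nn.Module):
    """Hypothetical per-token head: one score vector over tags per time step."""
    def __init__(self, n_tags, n_hid, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # stand-in for LockedDropout
        self.decoder = nn.Linear(n_hid, n_tags)

    def forward(self, input):
        raw_outputs, outputs = input
        output = self.dropout(outputs[-1])              # (seq_len, bs, n_hid)
        seq_len, bs, n_hid = output.size()
        logits = self.decoder(output.view(seq_len * bs, n_hid))
        # Keep the time axis so that every token gets its own tag scores.
        return logits.view(seq_len, bs, -1), raw_outputs, outputs

# Fake encoder output: 3 layers, the last one 400 wide (the ULMFiT default).
seq_len, bs = 7, 2
outputs = [torch.randn(seq_len, bs, d) for d in (1150, 1150, 400)]
head = TokenTaggingDecoder(n_tags=5, n_hid=400)
logits, _, _ = head((outputs, outputs))
print(logits.shape)  # torch.Size([7, 2, 5])
```

Training would then apply a cross-entropy loss over the flattened (seq_len * bs, n_tags) scores against one tag id per token.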

Any thoughts about this?


(Hong Emrys) #5

I gave it a try in my free time and simply put a Conditional Random Field layer on top of the linear decoder to model the dependencies between successive tags.
I had to change a lot of fastai built-in functions to add the CRF layer. For now the accuracy is around 70%, which is really bad. I will keep improving it. Let me know if you are interested in the code and we can discuss it. Feel free to improve it as well. Here is the code: https://github.com/Emrys-Hong/fastai_sequence_tagging
Some interesting findings:
Adding a BiLSTM layer on top of the LM encoder, followed by a CNN decoder, gives a jump in F1 score.
The paper by Peters et al. (https://arxiv.org/abs/1705.00108) says that using the first layer of the language model seems to give higher accuracy, as sequence tagging requires low-level language features (correct me if I am wrong), but in my experiment using the third layer gave me higher accuracy.
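For readers unfamiliar with what a CRF layer contributes at prediction time, here is a minimal sketch of Viterbi decoding for a linear-chain CRF in plain Python. The function name viterbi_decode and the toy scores are illustrative and are not taken from the linked repository:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   list of per-token score lists, emissions[t][tag]
    transitions: transitions[prev][cur], score of moving from prev to cur
    """
    n_tags = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for cur in range(n_tags):
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][cur])
            ptrs.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
        score = new_score
        backptrs.append(ptrs)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for ptrs in reversed(backptrs):
        best = ptrs[best]
        path.append(best)
    return path[::-1]

# Toy example with two tags; the transition scores penalise switching tags.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[0.0, -1.0], [-1.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [0, 0, 0]
```

A per-token argmax over the same emissions would pick tag 1 for the middle token; the transition scores are what pull the decoded sequence back to a consistent tagging.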


(Tuhin Chakrabarty) #6

Can any of you tell me, or point me to, how to get contextualised word representations from the ULMFiT-trained LM?


(Piotr Czapla) #7

You mean like in ELMO?


(Tuhin Chakrabarty) #8

Yes, that would be very helpful, Piotr 🙂 Some code, if possible.


(Piotr Czapla) #9

ULMFiT is a unidirectional LSTM, while ELMO was trained with a BiLSTM, so I’m not sure you will get the desired effect. I’m on holiday until the end of next week, but I remember that @mkardas was playing with embeddings; maybe he has the code handy.


(Tuhin Chakrabarty) #10

I agree, but I believe both would give contextual word representations. Instead of the concatenation [hf, hb], the hidden state will be a single state. I can wait until you are back, or if @mkardas can help, that would be very beneficial for my research.
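The difference in shape can be illustrated with a small PyTorch sketch. The sizes are toy values and the two nn.LSTM modules are untrained stand-ins, not the ULMFiT or ELMO models:

```python
import torch
import torch.nn as nn

seq_len, bs, emb_sz, n_hid = 5, 2, 8, 16
x = torch.randn(seq_len, bs, emb_sz)

uni = nn.LSTM(emb_sz, n_hid)                      # forward direction only
bi = nn.LSTM(emb_sz, n_hid, bidirectional=True)   # forward and backward

h_uni, _ = uni(x)
h_bi, _ = bi(x)
print(h_uni.shape)  # torch.Size([5, 2, 16]), one state per token
print(h_bi.shape)   # torch.Size([5, 2, 32]), [hf, hb] concatenated per token
```

In the unidirectional case the state for a token depends only on that token and the ones before it, which is the main practical difference from the bidirectional representation.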


(Marcin Kardas) #11

ULMFiT uses an RNN to get the encoding; there are 3 layers by default. You can take the output of the last layer, or also concatenate the outputs of the previous layers and of the embedding layer. If you create an RNN_Encoder instance and load it using load_model, then calling the encoder gives you a pair of raw outputs (i.e., before dropout) and outputs. Both are lists of tensors, where the i-th element is the output of the i-th layer. Each tensor has shape len x bs x dim, where len is the sentence/sequence length, bs is the batch size and dim is the layer's output dimension (400 by default for the last layer, 1150 for the rest). So it would be something like this:

from fastai.text import *
from sampled_sm import *

encoder = RNN_Encoder(n_tokens, 400, n_hid=1150, n_layers=3, pad_token=1, qrnn=False,
                      dropouth=0, dropouti=0, dropoute=0, wdrop=0.5)
load_model(encoder, encoder_filename)
# ... prepare batch of shape len x bs

raw_outputs, outputs = encoder(batch)
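As a sketch of the "concatenate" option, the snippet below builds one vector per token from all layer outputs. It uses random tensors with the default ULMFiT layer widths in place of a real encoder call, and the helper name token_features is hypothetical:

```python
import torch

def token_features(outputs, last_only=False):
    """outputs: list of (seq_len, bs, dim) tensors, one per RNN layer."""
    if last_only:
        return outputs[-1]
    # Concatenate along the feature axis: one long vector per token.
    return torch.cat(outputs, dim=-1)

seq_len, bs = 12, 4
# Stand-ins for the `outputs` returned by the encoder (1150, 1150, 400).
outputs = [torch.randn(seq_len, bs, d) for d in (1150, 1150, 400)]

print(token_features(outputs, last_only=True).shape)  # torch.Size([12, 4, 400])
print(token_features(outputs).shape)                  # torch.Size([12, 4, 2700])
```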

Hope it helps.



I used ULMFit for sequence tagging with results far better than random. This is the decoder I used to generate tag sequences:

It is quite rudimentary, but you can apply some advanced stuff such as beam search over the returned probabilities.


(Karl) #13

Do you have a notebook showing how you structured the dataloader? Currently trying to do something similar, and I’m stuck at getting a list of labels (one label for each token) into a standard dataloader format.
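One possible shape for such a batch, sketched in plain Python and not tied to any fastai class: pad the token ids and the per-token labels to the same length, and mark padded label positions with an index that the loss can ignore. The names pad_batch, PAD_TOKEN and IGNORE_LABEL are hypothetical:

```python
PAD_TOKEN = 1        # matches pad_token=1 used for the encoder in this thread
IGNORE_LABEL = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def pad_batch(samples):
    """samples: list of (token_ids, label_ids) pairs, one label per token.

    Returns two lists of lists, both padded to the longest sequence.
    """
    max_len = max(len(tokens) for tokens, _ in samples)
    xs, ys = [], []
    for tokens, labels in samples:
        assert len(tokens) == len(labels), "need exactly one label per token"
        n_pad = max_len - len(tokens)
        xs.append(list(tokens) + [PAD_TOKEN] * n_pad)
        ys.append(list(labels) + [IGNORE_LABEL] * n_pad)
    return xs, ys

batch = [([5, 8, 13], [0, 2, 0]), ([7, 9], [1, 1])]
xs, ys = pad_batch(batch)
print(xs)  # [[5, 8, 13], [7, 9, 1]]
print(ys)  # [[0, 2, 0], [1, 1, -100]]
```

Converting xs and ys to tensors and transposing them would give the len x bs layout mentioned earlier in the thread.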


(hari rajeev) #14

I tried GloVe + BiLSTM + CRF (not using fastai) and it gave very good results. So are you trying LM + BiLSTM + CRF?


(Hong Emrys) #15

Yeah, it is one of the classic ways to deal with sequence tagging. I have tried 3 x AWD-LSTM as the LM plus two linear layers and a CRF, so it's BiLSTM + linear head + CRF, but the result is not very good.


(Shivani Malhotra) #16

Hello Marcin
What does the conventional linear decoder of ULMFiT receive? The output of the last hidden layer only, or the combined output of all hidden layers?
Also, the intrinsic attention is confusing me a bit because of this.


(Marcin Kardas) #17

Hi Shivani,
the decoder layer gets the output of the last RNN layer. Here’s a schema of the MultiFiT classifier:

If you change QRNN to LSTM and 1550 to 1152, and remove the QRNN3 layer and the blue blocks (the classifying head), you get the standard ULMFiT encoder with the default parameters. The decoder is simply a dense layer whose weight matrix has the size of the transposed embedding matrix (if weight tying is enabled, which is the default, it is exactly the transposed embedding matrix, optionally plus a bias), followed by a softmax layer. The decoder is applied separately to the output of each time step of the last RNN layer.
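The weight tying described above can be sketched in a few lines of PyTorch. The sizes are toy values and the variable names are illustrative; this is not the fastai implementation:

```python
import torch
import torch.nn as nn

vocab_sz, emb_sz = 50, 8
embedding = nn.Embedding(vocab_sz, emb_sz)

# A dense layer mapping emb_sz -> vocab_sz. nn.Linear stores its weight as
# (out_features, in_features) = (vocab_sz, emb_sz), the same shape as the
# embedding matrix, so the two can share one parameter.
decoder = nn.Linear(emb_sz, vocab_sz, bias=False)
decoder.weight = embedding.weight

# Output of the last RNN layer: (seq_len, bs, emb_sz).
last_layer_out = torch.randn(6, 3, emb_sz)

# nn.Linear acts on the last axis, so every time step is decoded separately.
logits = decoder(last_layer_out)              # (6, 3, vocab_sz)
probs = torch.softmax(logits, dim=-1)

print(probs.shape)                            # torch.Size([6, 3, 50])
print(decoder.weight is embedding.weight)     # True
```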


(Shivani Malhotra) #18

Hi Marcin

Thanks for the explanation. Now, if the decoder layer gets its input from the last LSTM layer (AWD-LSTM), can we add an attention layer to ULMFiT so that every hidden layer of its encoder contributes to the decoder? I wanted to use attention in ULMFiT without using the Transformer concept.