ULMFit for sequence tagging

petermartigny · August 5, 2018, 11:20am

Hello,

I saw in this question that people were discussing how to adapt ULMFiT for regression instead of classification. This is a great adaptation.

I am now wondering about classification/regression at the word level, i.e. classifying or regressing each word of the input text (NER, POS tagging, …).
Which class should we change to get a predictor for sequence tagging?
@sebastianruder @jeremy

garfieldnate · August 5, 2018, 3:19pm

I would also love this for tokenization of non-spaced text (Japanese, Mandarin, etc.). I’m watching this space very closely: http://nlp.fast.ai/category/seq_label.html

Emrys-Hong · August 7, 2018, 1:47pm

I have similar thoughts, and I am planning to explore this area. I am inspired by previous approaches (paper: https://arxiv.org/pdf/1603.01360.pdf, code: https://github.com/guillaumegenthial/sequence_tagging, sadly written in TensorFlow). I think adding a bi-LSTM and a CRF (conditional random field) layer on top of the AWD-LSTM might work. Or, if that is hard to train, we can add just the CRF layer first.

But I am not sure whether Jeremy and Sebastian have done similar tasks before. If so, could you give some suggestions on how to do it?

petermartigny · August 7, 2018, 8:55pm

In my understanding, the get_rnn_classifier function in the lm_rnn file:

def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops, bidir=False,
                  dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                  dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))

returns the SequentialRNN wrapper containing the rnn_enc backbone and the classifier layer.

Similarly, the language model function:

def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
             dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, tie_weights=True, qrnn=False, bias=False):

    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
             dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))

returns a SequentialRNN wrapper with the rnn_enc and the linear decoder.
Here the linear decoder outputs the probabilities for the next word:

class LinearDecoder(nn.Module):
    initrange=0.1
    def __init__(self, n_out, n_hid, dropout, tie_encoder=None, bias=False):
        super().__init__()
        self.decoder = nn.Linear(n_hid, n_out, bias=bias)
        self.decoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.dropout = LockedDropout(dropout)
        if bias: self.decoder.bias.data.zero_()
        if tie_encoder: self.decoder.weight = tie_encoder.weight

    def forward(self, input):
        raw_outputs, outputs = input
        output = self.dropout(outputs[-1])
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        result = decoded.view(-1, decoded.size(1))
        return result, raw_outputs, outputs

Hence, I guess that if we want to output custom probabilities at each word step, that’s the class to adapt.

Any thoughts about this?
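For readers who want something concrete, here is a minimal sketch of what adapting LinearDecoder into a per-token tagging head might look like. It is not code from the fastai library: the class name TokenTaggingHead is hypothetical, plain nn.Dropout stands in for LockedDropout, and the encoder outputs are random tensors with assumed sizes (three layers, the last one 400-dimensional). The only real change from LinearDecoder is that the linear layer maps to the number of tags instead of the vocabulary size.

```python
import torch
import torch.nn as nn


class TokenTaggingHead(nn.Module):
    """Hypothetical per-token head; mirrors LinearDecoder but emits tag scores."""

    def __init__(self, n_tags, n_hid, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # assumption: stands in for LockedDropout
        self.decoder = nn.Linear(n_hid, n_tags)

    def forward(self, input):
        raw_outputs, outputs = input
        last = self.dropout(outputs[-1])             # (seq_len, bs, n_hid)
        seq_len, bs, n_hid = last.size()
        # Flatten time and batch, as LinearDecoder does: one row of scores per token.
        logits = self.decoder(last.reshape(seq_len * bs, n_hid))
        return logits, raw_outputs, outputs


# Dummy encoder output with assumed sizes: 3 layers, last one 400-dimensional.
seq_len, bs, n_tags = 7, 2, 5
outputs = [torch.randn(seq_len, bs, 1150),
           torch.randn(seq_len, bs, 1150),
           torch.randn(seq_len, bs, 400)]
head = TokenTaggingHead(n_tags=n_tags, n_hid=400)
logits, _, _ = head((outputs, outputs))

# Labels are time-major too, so flattening them lines up token by token.
labels = torch.randint(0, n_tags, (seq_len, bs))
loss = nn.functional.cross_entropy(logits, labels.reshape(-1))
```

The head keeps the (result, raw_outputs, outputs) return convention of LinearDecoder so it could, in principle, be dropped into the same SequentialRNN wrapper; whether the rest of the training loop accepts per-token labels is a separate question.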

Emrys-Hong · August 17, 2018, 9:42am

I gave it a try in my free time and simply put a conditional random field (CRF) layer on top of the linear decoder to model the dependencies between successive tags.
I had to change a lot of fastai built-in functions to add the CRF layer. For now the accuracy is around 70%, which is really bad. I will keep improving it. Let me know if you are interested in the code and we can discuss it; feel free to improve it too. Here is the code: https://github.com/Emrys-Hong/fastai_sequence_tagging
Some interesting findings:
Adding a BiLSTM layer on top of the LM encoder, followed by a CNN decoder, gives a jump in F1 score.
The paper by Peters et al. (https://arxiv.org/abs/1705.00108) says that using the first layer of the language model seems to give higher accuracy, since sequence tagging requires low-level language features (correct me if I am wrong), but in my experiment using the third layer gave me higher accuracy.
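To illustrate what "modelling the dependencies between successive tags" buys you at prediction time, here is a small Viterbi decoding sketch in plain Python. It is not taken from the linked repository; the function name, the three-tag scheme (0 = O, 1 = B, 2 = I) and all scores are made up for the example. Emission scores would come from the linear decoder, and transition scores are what a CRF layer learns.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]: score of tag j at position t (e.g. from a linear decoder).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at step t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]


# Toy scores: tags are 0 = O, 1 = B, 2 = I; the O -> I transition is heavily penalised.
emissions = [[2.0, 1.0, 0.0],
             [0.0, 0.5, 3.0],
             [1.0, 0.0, 0.5]]
transitions = [[0.0, 0.0, -100.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
path, best_score = viterbi(emissions, transitions)
print(path, best_score)  # [1, 2, 0] 5.0
```

Picking the best tag independently at each position would give O, I, O here, which contains the penalised O -> I step; the transition scores steer decoding to B, I, O instead.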

tuhincolumbia · September 29, 2018, 9:35am

Can any of you tell me, or point me to, how to get contextualised word representations from the ULMFiT-trained LM?

piotr.czapla · October 2, 2018, 4:46pm

You mean like in ELMo?

tuhincolumbia · October 2, 2018, 4:58pm

Yes, that would be very helpful, Piotr. Some code, if possible.

piotr.czapla · October 4, 2018, 8:28pm

ULMFiT uses a unidirectional LSTM, while ELMo was trained with a BiLSTM, so I'm not sure you will get the desired effect. I'm on holiday until the end of next week, but I remember that @mkardas was playing with embeddings; maybe he has the code handy.

tuhincolumbia · October 4, 2018, 8:41pm

I agree, but I believe both would give contextual word representations. Instead of the concatenation [hf, hb], the hidden state will be a single state. I can wait until you are back, or if @mkardas can help, it would be very beneficial for my research.

mkardas · October 5, 2018, 2:13am

ULMFiT uses an RNN to get the encoding; there are 3 layers by default. You can take the output of the last layer, or also concatenate the outputs of the previous layers and the embedding layer. If you create an RNN_Encoder instance and load it using load_model, then calling the encoder gives you a pair of raw outputs (i.e., before dropout) and outputs. Both are lists of tensors; the i-th element is the output of the i-th layer. Each tensor has shape len x bs x dim, where len is the sentence/sequence length, bs is the batch size and dim is the output dimension of that layer (400 by default for the last layer, 1150 for the rest). So it would be something like this:

from fastai.text import *
from sampled_sm import *

encoder = RNN_Encoder(n_tokens, 400, n_hid=1150, n_layers=3, pad_token=1, qrnn=False,
    dropouth=0, dropouti=0, dropoute=0, wdrop=0.5)
load_model(encoder, encoder_filename)
encoder.reset()
encoder.eval()
# ... prepare batch of shape len x bs

raw_outputs, outputs = encoder(batch)

Hope it helps.
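As a follow-up sketch of the "last layer or concatenate" choice described above: the snippet below does not load a real encoder. It uses random tensors as a stand-in for the `outputs` list, with the layer sizes quoted in the post (1150, 1150, 400), and shows how per-token vectors could be pulled out of it. The variable names are illustrative only.

```python
import torch

seq_len, bs = 6, 2
# Stand-in for `outputs` as returned by the encoder: one tensor per layer,
# each of shape (seq_len, bs, dim). Sizes follow the defaults quoted above.
outputs = [torch.randn(seq_len, bs, 1150),
           torch.randn(seq_len, bs, 1150),
           torch.randn(seq_len, bs, 400)]

# Option 1: contextual vectors from the last layer only.
last_layer = outputs[-1]                   # (seq_len, bs, 400)

# Option 2: concatenate every layer along the feature axis.
all_layers = torch.cat(outputs, dim=-1)    # (seq_len, bs, 2700)

# Vector for the token at position 3 of the first sequence in the batch.
token_vec = all_layers[3, 0]
```

Because the model is unidirectional, the vector at position t only reflects tokens up to and including t.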

noisefield · February 20, 2019, 3:11pm

I used ULMFit for sequence tagging with results far better than random. This is the decoder I used to generate tag sequences:

https://gist.github.com/mamamot/822944e245622e904e9bccb32633cd97

decoder.py

class LinearDecoder(nn.Module):
    def __init__(self, encoder, output_size, dropout_p=0.1, weights=None):
        """
        :param encoder: is the AWD-LSTM encoder from fast.ai 
        :param output_size: number of tags
        :param dropout_p: dropout to apply to *raw* encoder outputs
        :param weights: loss weights to deal with class imbalance
        """
        super(LinearDecoder, self).__init__()
        self.encoder = encoder

This file has been truncated; the full version is at the gist link above.

It is quite rudimentary, but you can apply some advanced stuff such as beam search over the returned probabilities.
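A rough sketch of beam search over per-token tag probabilities, for illustration only. It is not part of the gist: the function names, the three-tag scheme (0 = O, 1 = B, 2 = I) and the probabilities are invented. Note that with independent per-token probabilities and no constraints, beam search just reproduces the per-token argmax, so the sketch assumes a hypothetical `is_allowed` rule that forbids some tag sequences.

```python
import math


def beam_search(probs, is_allowed, beam_size=3):
    """probs[t][j]: probability of tag j at step t.

    Returns (best tag path, its log-probability) among paths that satisfy
    is_allowed(previous_tag, tag); previous_tag is None at the first step.
    """
    beams = [([], 0.0)]
    for step in probs:
        candidates = []
        for path, logp in beams:
            prev = path[-1] if path else None
            for tag, p in enumerate(step):
                if p > 0 and is_allowed(prev, tag):
                    candidates.append((path + [tag], logp + math.log(p)))
        # Keep only the highest-scoring partial paths.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


def bio_allowed(prev, tag):
    """Assumed constraint: tag 2 (I) may only follow tag 1 (B) or tag 2 (I)."""
    return tag != 2 or prev in (1, 2)


probs = [[0.6, 0.3, 0.1],
         [0.2, 0.1, 0.7],
         [0.8, 0.1, 0.1]]
best_path, best_logp = beam_search(probs, bio_allowed)
print(best_path)  # [1, 2, 0]
```

Per-token argmax would give O, I, O, which breaks the constraint; the beam keeps the B start alive long enough to end up with B, I, O.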

KarlH · April 10, 2019, 8:13pm

Do you have a notebook showing how you structured the dataloader? Currently trying to do something similar, and I’m stuck at getting a list of labels (one label for each token) into a standard dataloader format.
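One possible way to get per-token labels into batch form, sketched in plain Python rather than with any fastai dataloader class. The function name and the sample ids are made up. It assumes pad id 1 (the pad_token=1 used earlier in this thread) and uses -100 for padded label positions, which is the default ignore_index of PyTorch's cross-entropy loss.

```python
PAD_TOKEN_ID = 1      # assumption: same pad index as pad_token=1 used earlier in the thread
IGNORE_LABEL = -100   # default ignore_index of PyTorch's cross-entropy loss


def collate_tagged(batch):
    """batch: list of (token_ids, tag_ids) pairs, one tag per token.

    Pads every sequence on the right to the longest length in the batch.
    Padded label positions get IGNORE_LABEL so the loss can skip them.
    """
    max_len = max(len(tokens) for tokens, _ in batch)
    padded_tokens, padded_tags = [], []
    for tokens, tags in batch:
        assert len(tokens) == len(tags), "expected one tag per token"
        n_pad = max_len - len(tokens)
        padded_tokens.append(list(tokens) + [PAD_TOKEN_ID] * n_pad)
        padded_tags.append(list(tags) + [IGNORE_LABEL] * n_pad)
    return padded_tokens, padded_tags


batch = [([5, 8, 13], [0, 1, 2]),
         ([7, 9], [0, 0])]
tokens, tags = collate_tagged(batch)
print(tokens)  # [[5, 8, 13], [7, 9, 1]]
print(tags)    # [[0, 1, 2], [0, 0, -100]]
```

The lists come out batch-first; the encoder discussed above expects len x bs, so they would still need to be turned into tensors and transposed before being fed to it.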

harikrishnanrajeev · April 12, 2019, 2:17am

I tried GloVe + BiLSTM + CRF (not using fastai) and it gave very good results. So are you trying LM + BiLSTM + CRF?

Emrys-Hong · April 13, 2019, 10:25am

Yeah, it is one of the classic ways to deal with sequence tagging. I have tried 3 x AWD-LSTM as the LM, plus two linear layers and a CRF, so it's BiLSTM + linear head + CRF, but the result is not very good.

ulmfitter · December 17, 2019, 7:30am

Hello Marcin,
What does the conventional linear decoder of ULMFiT receive: the output of the last hidden layer of the encoder, or the combined output of all hidden layers?
Also, the intrinsic attention is confusing me a bit because of this.

mkardas · December 18, 2019, 5:34am

Hi Shivani,
The decoder layer receives the output of the last RNN layer. Here’s a schema of the MultiFiT classifier:

If you change QRNN to LSTM and 1550 to 1152, and remove the QRNN₃ layer and the blue blocks (the classifying head), you get the standard ULMFiT encoder with the default parameters. The decoder is simply a dense layer whose size equals that of the transposed embedding matrix (if weight tying is enabled, which is the default, it is exactly the transposed embedding matrix plus an optional bias), followed by a softmax layer. The decoder is applied separately to the output of each time step of the last RNN layer.
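The tied decoder described above can be sketched in a few lines of PyTorch. This is an illustration with a toy vocabulary size and a random tensor in place of the last RNN layer's output, not the fastai implementation; only the 400-dimensional embedding size is taken from the thread.

```python
import torch
import torch.nn as nn

n_tok, emb_sz = 50, 400                     # toy vocabulary; 400 is the default embedding size
embedding = nn.Embedding(n_tok, emb_sz)
decoder = nn.Linear(emb_sz, n_tok, bias=False)
# Weight tying: nn.Linear stores its weight as (n_tok, emb_sz), the same shape
# as the embedding matrix, so both modules can share a single parameter.
decoder.weight = embedding.weight

seq_len, bs = 6, 2
last_layer = torch.randn(seq_len, bs, emb_sz)   # stand-in for the last RNN layer's output
logits = decoder(last_layer)                    # applied independently at every time step
probs = torch.softmax(logits, dim=-1)           # (seq_len, bs, n_tok)
```

Computing `last_layer @ embedding.weight.T` gives the same logits, which is the "transposed embedding matrix" view of the decoder.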

ulmfitter · December 18, 2019, 7:19am

Hi Marcin

Thanks for the explanation. Now, if the decoder layer gets the output of the last LSTM layer (AWD-LSTM), can we add an attention layer to ULMFiT so that every hidden layer of its encoder responds to the decoder? I want to use attention in ULMFiT without using the Transformer concept.
Thank you.

harikrishnanrajeev · February 18, 2020, 6:14am

Hi @Emrys-Hong, have you been able to make progress on ULMFiT for sequence tagging?

stephen13 · April 24, 2021, 12:08pm

Did anyone finally try running POS tagging with fastai’s pre-trained ULMFiT model and fine-tuning it for POS tagging?