ULMFiT for sequence tagging


(Peter) #1

Hello,

I saw people on this question discussing adapting ULMFiT for regression instead of classification, which is a great adaptation.

I was wondering now about classification/regression at the word level, to classify/regress each word from the input text (NER, POS tagging, …).
Which class should we change to get a predictor for sequence tagging?
@sebastianruder @jeremy


(Nathan Glenn) #2

I would also love this for tokenization of non-spaced text (Japanese, Mandarin, etc.). I’m watching this space very closely: http://nlp.fast.ai/category/seq_label.html


(Hong Emrys) #3

I have similar thoughts, and I am planning to explore this field. I was inspired by previous approaches (paper: https://arxiv.org/pdf/1603.01360.pdf, code: https://github.com/guillaumegenthial/sequence_tagging, sadly written in TensorFlow). I think adding a bi-LSTM and a CRF (conditional random field) layer on top of the AWD-LSTM might work, or we could start with just the CRF layer if that combination is hard to train.

But I am not sure whether Jeremy and Sebastian have done similar tasks before. If so, could they give some suggestions on how to do it?


(Peter) #4

In my understanding, the get_rnn_classifier function in the lm_rnn.py file:

def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops, bidir=False,
                       dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))

returns the SequentialRNN wrapper containing the rnn_enc backbone and the classifier layer.

Similarly, the language model function:

def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
                       dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, tie_weights=True, qrnn=False, bias=False):
    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                          dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))

returns a SequentialRNN wrapper with the rnn_enc and the linear decoder.
Here the linear decoder outputs, for every position, the scores over the vocabulary for the next word:

class LinearDecoder(nn.Module):
    initrange = 0.1
    def __init__(self, n_out, n_hid, dropout, tie_encoder=None, bias=False):
        super().__init__()
        self.decoder = nn.Linear(n_hid, n_out, bias=bias)
        self.decoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.dropout = LockedDropout(dropout)
        if bias: self.decoder.bias.data.zero_()
        if tie_encoder: self.decoder.weight = tie_encoder.weight

    def forward(self, input):
        raw_outputs, outputs = input                # per-layer outputs from the encoder
        output = self.dropout(outputs[-1])          # last layer: seq_len x bs x emb_sz
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        result = decoded.view(-1, decoded.size(1))  # (seq_len*bs) x n_out scores over the vocabulary
        return result, raw_outputs, outputs

Hence, I guess that if we want to output custom probabilities at each word step, that’s the class to adapt.
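
To make that concrete, here is a minimal sketch of what such a per-token decoder could look like. The LinearTaggerDecoder name and the n_tags argument are just illustrative, not existing fastai code:

class LinearTaggerDecoder(nn.Module):
    "Emits a vector of n_tags scores for every token, instead of a next-word distribution."
    initrange = 0.1
    def __init__(self, n_tags, n_hid, dropout):
        super().__init__()
        self.decoder = nn.Linear(n_hid, n_tags)
        self.decoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.decoder.bias.data.zero_()
        self.dropout = LockedDropout(dropout)

    def forward(self, input):
        raw_outputs, outputs = input              # per-layer outputs from the encoder
        output = self.dropout(outputs[-1])        # seq_len x bs x emb_sz
        sl, bs, _ = output.size()
        decoded = self.decoder(output.view(sl * bs, -1))
        return decoded, raw_outputs, outputs      # decoded: (seq_len*bs) x n_tags, ready for a token-level cross-entropy loss

It could then be wired up like the language model, e.g. SequentialRNN(rnn_enc, LinearTaggerDecoder(n_tags, emb_sz, dropout=0.3)), with the targets flattened to one tag id per token.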

Any thoughts about this?


(Hong Emrys) #5

I gave it a try using my free time and simply put a Conditional Random Field layer on top of the linear decoder to model the dependencies between successive tags.
I had to change a lot of fastai's built-in functions to add in the CRF layers. For now the accuracy is around 70%, which is really bad, but I will keep improving it. Let me know if you are interested in the code and we can discuss; feel free to improve it as well. Here is the code: https://github.com/Emrys-Hong/fastai_sequence_tagging.
Some interesting findings:
Adding a BiLSTM layer on top of the LM encoder, followed by a CNN decoder, gives a jump in F1 score.
The paper by Peters et al. (https://arxiv.org/abs/1705.00108) says that using the first layer of the language model gives higher accuracy, since sequence tagging needs lower-level language features (correct me if I am wrong), but in my experiments using the third layer gives higher accuracy.
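
To make these ideas concrete, here is a rough sketch of a BiLSTM tagging head with a CRF on top of the ULMFiT encoder outputs. The BiLSTMCRFTagger class, its sizes, and the use of the third-party pytorch-crf package are illustrative assumptions, not the exact code in the repository above:

from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    "BiLSTM over the encoder's last-layer outputs, projected to per-token tag scores, scored by a CRF."
    def __init__(self, n_tags, emb_sz, n_hid, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(emb_sz, n_hid, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.emit = nn.Linear(2 * n_hid, n_tags)  # per-token emission scores
        self.crf = CRF(n_tags)                    # models dependencies between successive tags

    def forward(self, input):
        raw_outputs, outputs = input
        lstm_out, _ = self.lstm(outputs[-1])      # seq_len x bs x (2*n_hid)
        return self.emit(self.drop(lstm_out))     # emissions: seq_len x bs x n_tags

    def loss(self, emissions, tags):
        return -self.crf(emissions, tags)         # negative log-likelihood of the gold tag sequence

    def decode(self, emissions):
        return self.crf.decode(emissions)         # best tag sequence for each batch element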


(Tuhin Chakrabarty) #6

Can any of you tell me / point me to how to get contextualised word representations from the ULMFiT-trained LM?


(Piotr Czapla) #7

You mean like in ELMo?


(Tuhin Chakrabarty) #8

Yes, that would be very helpful, Piotr :slight_smile: Some code, possibly?


(Piotr Czapla) #9

ULMFiT is a unidirectional LSTM, while ELMo was trained as a BiLSTM, so I'm not sure you'll get the desired effect. I'm on holidays until the end of next week, but I remember that @mkardas was playing with embeddings; maybe he has the code handy.


(Tuhin Chakrabarty) #10

I agree, but both would give contextual word representations, I believe; the hidden state will just be a single state instead of the concatenation [hf, hb]. I can wait till you are back, or if @mkardas can help it will be very beneficial for my research.


(Marcin Kardas) #11

ULMFiT uses an RNN to get the encoding; there are 3 layers by default. You can take the output from the last layer or also concatenate the previous outputs and the embedding layer. If you create an RNN_Encoder instance and load it using load_model, then by calling the encoder you get a pair of raw outputs (i.e., before dropout) and outputs. Both are lists of tensors; the i-th element is the output of the i-th layer. Each tensor has shape len x bs x dim, where len is the sentence/sequence length, bs is the batch size and dim is the layer's output dimension (400 by default for the last layer, 1150 for the rest). So it would be something like this:

from fastai.text import *
from sampled_sm import *

encoder = RNN_Encoder(n_tokens, 400, n_hid=1150, n_layers=3, pad_token=1, qrnn=False,
                      dropouth=0, dropouti=0, dropoute=0, wdrop=0.5)
load_model(encoder, encoder_filename)
encoder.reset()
encoder.eval()
# ... prepare batch of shape len x bs

raw_outputs, outputs = encoder(batch)
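
For example, one possible way (my own sketch, not the only option) to turn these outputs into ELMo-style contextual word vectors is to concatenate the layers along the feature dimension:

# outputs[i] has shape len x bs x dim_i (1150, 1150 and 400 by default)
contextual = torch.cat(outputs, dim=2)   # len x bs x (1150 + 1150 + 400)
first_seq_vectors = contextual[:, 0]     # one 2700-dim vector per token of the first sequence in the batch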

Hope it helps.