Fastai transfer learning for NER

Is it possible to use the current fastai NLP transfer learning implementation to do word tagging tasks like NER, or would this require going down to the pytorch level and doing extra work? From looking at it, it seems like the current setup is designed more for document (maybe even sentence if treated as IID) classification, but I could be wrong…


It’s not trivial to do, but it’s not a big step either - you’d need to tweak the current classification model to do sequence labeling.
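To make the "tweak" concrete: the main change is swapping the pooled document head for a head that emits one prediction per token. A minimal sketch (class name, hidden size, and tag count are illustrative assumptions, not fastai's actual API):

```python
import torch
import torch.nn as nn

class TokenTagger(nn.Module):
    """Per-token classification head: one tag prediction per timestep,
    instead of pooling the whole sequence into a single document vector.
    hidden_dim and n_tags here are illustrative, not fastai defaults."""
    def __init__(self, hidden_dim=400, n_tags=9, dropout=0.1):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, n_tags)

    def forward(self, encoded):
        # encoded: (batch, seq_len, hidden_dim) from the LM encoder
        # nn.Linear is applied independently at each timestep,
        # so the output is (batch, seq_len, n_tags)
        return self.out(self.drop(encoded))
```

The encoder (e.g. a fine-tuned AWD-LSTM) stays exactly as in classification; only this head differs.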


I’ve been trying to get sequence labeling going for some time without success. Would anyone be interested in forming a team / study group to do this together?

Hi David, any progress on this? I’m looking for a similar solution and would like to contribute if needed to take it forward.

I’ve been bogged down with other stuff and will not get to this before December, possibly a lot longer.

I am working on an NER project at the moment. I want to build the simplest/fastest pipeline to find entities in Chilean Spanish across various areas (claims, reviews), all in Chile. It will most likely require fine-tuning a Spanish LM on each labelled dataset. And I want to add an NER module to fastai.

I am going with implementing NER in fastai because I see MultiFiT-style transfer learning being much faster than current transformers. It also claims to require less labelled data (which is the problem).

Here’s what I’ve done so far
The only working piece (kind of) is a classification model predicting a tag for a single word on the English CoNLL dataset.

That’s crude, but I wanted to start with something I could do that actually works.

Here are the next steps:

  1. Tweak the classification model to predict tags for the whole text (not just for one word). I need to add a decoder on top instead of PoolingLinearClassifier.
  2. Create a databunch.
  3. Go with Spanish and verify how many labels it needs to reach accuracy suitable for production.
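For step 2, whatever the databunch looks like internally, it has to produce padded token-id and tag-id tensors of equal length plus a mask marking the real positions (the mask is what step 4 of the decoder below needs). A sketch of that collation logic, with hypothetical padding ids:

```python
import torch

PAD_TOKEN, PAD_TAG = 1, -1  # illustrative padding ids, not fastai's

def collate_tagging_batch(samples):
    """Pad variable-length (token_ids, tag_ids) pairs to a common length
    and build a mask of real (non-pad) positions. This is the shape of
    data a sequence-labeling databunch would need to yield per batch."""
    max_len = max(len(toks) for toks, _ in samples)
    xs, ys, mask = [], [], []
    for toks, tags in samples:
        pad = max_len - len(toks)
        xs.append(toks + [PAD_TOKEN] * pad)
        ys.append(tags + [PAD_TAG] * pad)
        mask.append([1] * len(toks) + [0] * pad)
    return torch.tensor(xs), torch.tensor(ys), torch.tensor(mask)
```

Note that unlike classification, the targets here are sequences, so the usual label handling in the text databunch would need to change as well.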

The notebook also contains work done by @davidpfahler - big thanks. He updated the seq2seq notebook from the NLP course. I have yet to check whether it works and what the results are.


How should the decoder module look?

Here’s what I am thinking

  1. From the encoder we get a 128 (words) × 1336 × 1152 (encoded states) tensor per text sequence.
  2. I run the encoded states through Dropout and then a Linear layer of size 1152 × 24 (the number of tags).
  3. Then through a LogSoftmax activation over the 24 tags?
  4. Compute the loss and multiply it by the mask.

What’s not clear to me is how to do this for the whole sentence and where to apply the activation.
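The steps above can be sketched as follows (sizes 1152 and 24 taken from the post; names and dropout value are hypothetical). The "whole sentence" part comes for free: `nn.Linear` and `log_softmax` operate on the last dimension, so one call covers every timestep at once, and the activation goes over the tag dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NERDecoder(nn.Module):
    """Steps 1-3 above: Dropout, then a 1152 -> 24 Linear, then
    LogSoftmax over the tag dimension. The Linear layer is applied
    independently at every timestep, so one forward pass handles the
    whole sentence."""
    def __init__(self, hidden=1152, n_tags=24, p=0.1):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.lin = nn.Linear(hidden, n_tags)

    def forward(self, enc):                 # enc: (seq_len, batch, 1152)
        logits = self.lin(self.drop(enc))   # (seq_len, batch, 24)
        return F.log_softmax(logits, dim=-1)

def masked_nll(log_probs, targets, mask):
    """Step 4: per-token NLL loss, zeroed out at padded positions.
    Padded positions in `targets` must hold any valid tag id, since
    the mask removes their contribution anyway."""
    # log_probs: (seq_len, batch, n_tags); targets, mask: (seq_len, batch)
    nll = F.nll_loss(log_probs.view(-1, log_probs.size(-1)),
                     targets.view(-1), reduction='none')
    nll = nll * mask.view(-1).float()
    return nll.sum() / mask.float().sum()
```

Averaging over `mask.sum()` rather than the full tensor size keeps the loss scale independent of how much padding a batch happens to contain.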

The code I have

There is a paper, KALM (Knowledge-Augmented Language Model), that claims to achieve results matching BiLSTM-CRF NER. Has anybody worked with it?