FastHugs - fastai-v2 and HuggingFace Transformers

I have already tried to export the model.
I used a custom transformers model to get the logits like this:

import torch.nn as nn
from transformers import PreTrainedModel

class CustomTransformerModel(nn.Module):
    def __init__(self, transformer_model: PreTrainedModel):
        super(CustomTransformerModel, self).__init__()
        self.transformer = transformer_model

    def forward(self, input_ids):
        # Return only the logits from the transformer
        logits = self.transformer(input_ids)[0]
        return logits

and I have defined the learner as follows:

loss_func = nn.BCEWithLogitsLoss()
custom_transformer_model = CustomTransformerModel(transformer_model=bert_model)
from fastai.callbacks import *
learner = Learner(databunch, custom_transformer_model, loss_func=loss_func)

but when loading the model I get this error:

You need to declare what your custom model is before calling it, i.e. in a .py file you import or in a cell, so it can be referenced.


I don’t know how to define the custom model for the learner (without using the data)?

Just the architecture you’re using. That custom transformer model you defined earlier needs to be in that notebook.
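Something like this works (a minimal sketch, assuming the learner was exported with learner.export(); my_models.py is just a placeholder for wherever you keep the class, or you can simply re-run the cell that defines it):

from fastai.basic_train import load_learner  # fastai v1 import path

# re-run the cell that defines CustomTransformerModel, or import it
from my_models import CustomTransformerModel  # hypothetical module name

learn = load_learner(path, 'export.pkl')  # 'path' and the file name are placeholders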


Hi, I’m also working on trying to get multi-label text classification to work. This is what I have done so far, any help would be really appreciated :smiley:

My data looks like this:

text          label            is_valid
Lorem ipsum   tag_01 | tag_02  False

And I initialise the model and tokenizer with the following, plus a few modifications to your FastHugsTokenizer and FastHugsModel functions.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

transformer_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
transformer_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name)

Following the rest of your notebook works with single labels, but I’m having some issues trying to adapt it to handle multi-labels. I first get a list of all labels with

import itertools

a = [x.split('|') for x in df.label]
chain = itertools.chain(*a)
b = list(set(chain))      # unique labels

And then I create a dataset with

splits = ColSplitter()(df)
x_tfms = [attrgetter("text"), Tokenizer.from_df('text', fasthugstok), Numericalize(vocab=transformer_vocab)]
dsets = Datasets(df, splits=splits, tfms=[x_tfms, [ColReader('label', label_delim='|'), MultiCategorize(vocab=b)]], dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(bs=bs, device='cuda', before_batch=transformer_padding(transformer_tokenizer))

The problem is that although I do get something when I run dls.train_ds[0]

(TensorText([  102, 4078, 30952, 
...
           157, 12264,  3937, 103]),
 TensorMultiCategory([3]))

The dataloader is not working and I can’t start any training.

Could not do one pass in your dataloader, there is something wrong in it

My guess is that something is going on with tokenization and numericalize, as this works (with fastai’s own tokenization):

text_cols = ['text']
dsets = DataBlock(blocks=(TextBlock.from_df(text_cols), MultiCategoryBlock(vocab=b)),
                      get_x = [attrgetter('text')],
                      get_y = ColReader('label', label_delim='|'),
                      splitter = RandomSplitter(valid_pct=0.2),
                      dl_type = SortedDL,
                     )
bs = 16
dls = dsets.dataloaders(df, 
                        bs=bs, 
                        seq_len=80,
                        device='cuda', 
                        before_batch=transformer_padding(transformer_tokenizer),
                       )

Update:
This seems to work:

new_df = df.copy()
new_df = pd.concat([new_df, new_df['label'].str.get_dummies(sep='|')], axis=1)
x_tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok),
    Numericalize(vocab=transformer_vocab)
]

y_tfms = [
    ColReader(b),
    EncodedMultiCategorize(vocab=b)
]

dsets = Datasets(items=new_df,
                 tfms=[x_tfms, y_tfms],
                 splits=ColSplitter(col='is_valid')(new_df),
                 dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(
    bs=bs, 
#     device='cuda', 
    device='cpu',
    before_batch=transformer_padding(transformer_tokenizer),
)
...
opt_func = partial(Adam, decouple_wd=True)
cbs = [MixedPrecision(clip=0.1), SaveModelCallback()]
# loss = CrossEntropyLossFlat() #LabelSmoothingCrossEntropy
loss = nn.BCEWithLogitsLoss()
splitter = splitters[transformer_model.config.model_type]
learn = Learner(dls, 
                fasthugs_model, 
                opt_func=opt_func, 
                splitter=splitter, 
                loss_func=loss, 
                cbs=cbs, 
                metrics=[accuracy],
               )
learn.fit_one_cycle(3, lr_max=1e-2)

But training happens on the CPU. What might be the issue with CUDA here?
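For reference, this is the kind of sanity check I plan to run (standard PyTorch/fastai2 calls, nothing FastHugs-specific):

import torch

print(torch.cuda.is_available())      # must be True, otherwise everything silently falls back to CPU
dls = dls.to(torch.device('cuda'))    # fastai2 DataLoaders can be moved after creation
# or equivalently: learn.dls.cuda() before calling fit_one_cycle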

Thank you!
Regards


MLM Language Modelling with HuggingFace transformers - RoBERTa pre-training edition

With this you can fine-tune a RoBERTa model on your specific dataset before training it on a downstream task like sequence classification.

The main trick for me was the creation of an MLM Transform (MLMTokensLabels) that takes the numericalized input x, does the masking and outputs a tuple (x, y), where x has 15% of its tokens masked and y is the original input with the other 85% of its tokens masked out (so the loss is only computed on the masked 15%).

I have seen others use a Callback to do the masking, but by using Transforms I was able to use dls.show_batch to see the decoded inputs and targets.

The MLM transform is more or less a rewrite of the masking function used in HuggingFace’s “how to train a language model from scratch” tutorial.
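In rough pseudocode, the masking boils down to something like this (a simplified sketch of the HuggingFace-style logic rather than the exact FastHugs transform; it assumes a HF tokenizer tok and a 1-D tensor of token ids):

import torch

def mask_tokens(inputs, tok, mlm_prob=0.15):
    "Return (masked_inputs, labels): mask ~15% of tokens with an 80/10/10 split, ignore the rest in the loss."
    labels = inputs.clone()
    # choose ~15% of positions, never the special tokens
    prob_matrix = torch.full(labels.shape, mlm_prob)
    special_mask = torch.tensor(tok.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True), dtype=torch.bool)
    prob_matrix.masked_fill_(special_mask, value=0.0)
    masked_idxs = torch.bernoulli(prob_matrix).bool()
    labels[~masked_idxs] = -100                       # loss is only computed on the masked 15%

    # 80% of the chosen positions -> <mask>
    replace_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_idxs
    inputs[replace_mask] = tok.mask_token_id

    # half of the remaining 20% -> random token, the rest left unchanged
    random_mask = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_idxs & ~replace_mask
    random_words = torch.randint(len(tok), labels.shape, dtype=torch.long)
    inputs[random_mask] = random_words[random_mask]
    return inputs, labels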

I also had to override one line in the Datasets class, as it would otherwise try to make a tuple out of my tuple.

Blog post and code can be found here


Hi @morgan, the FastHugs repo looks cool. I found this repo named fast-bert, which was made for fastai v1; it makes training BERT with fastai v1 very simple.
For example:

import logging
import torch
from transformers import BertTokenizer
from fast_bert.data import BertDataBunch
from fast_bert.learner import BertLearner
from fast_bert.metrics import accuracy

logger = logging.getLogger()
device = torch.device('cuda')

metrics = [{'name': 'accuracy', 'function': accuracy}]
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
databunch = BertDataBunch(
                          [PATH_TO_DATA],[PATH_TO_LABELS],
                          tokenizer,
                          train_file=[TRAIN_CSV],
                          val_file=[VAL_CSV],
                          test_data=[TEST_CSV],
                          text_col=[TEST_FEATURE_COL],
                          label_col=[0],
                          bs=64,
                          maxlen=140,
                          multi_gpu=False,
                          multi_label=False)
learner = BertLearner.from_pretrained_model(
                                            databunch,
                                            'bert-base-uncased',
                                            metrics,
                                            device,
                                            logger,
                                            is_fp16=False,
                                            multi_gpu=False,
                                            multi_label=False)

learner.fit(3, lr=1e-2)

is all you need to train a BERT model with fastai v1. Is there a plan to make “FastHugs” for v2 what “fast-bert” was for v1? It would make training all the models in the transformers library really accessible with fastai2.


Thanks! No plans right now to extend FastHugs to include wrappers like BertDataBunch etc. My intention was more to demonstrate how to extend fastai2 by showing what transforms or callbacks can be used, similar to the Transformers tutorial in the fastai docs.

I realise that this makes things a little more difficult for beginners; the blurr library might be another good option, as it includes wrappers.

Having said that, you should be able to train a BERT-like model from scratch using the MLM notebook in FastHugs. (Note this notebook doesn’t include the Next Sentence Prediction training task that BERT also used, as subsequent researchers found that this task didn’t help performance; see the RoBERTa paper for more.)


@morgan have you tried training your own ByteLevelBPETokenizer from the Tokenizers library? I tried using the .encodes method of the tokenizer to adapt it to fastai’s tokenizer, but then I am unable to call next(iter(dls.train)). However, dls.one_batch() works, although I can’t use that while training a model in the Learner. Have you encountered such a problem before?

PS: I have also tried using the BERT tokenizer’s encode and encode_plus methods, but neither seems to work. I noticed in your notebook that you used the tokenize method instead. The BPE tokenizer from the tokenizers library does not have this tokenize method; it only has encode.
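For context, the API difference I mean (a quick sketch; vocab.json and merges.txt stand in for a trained tokenizer’s files):

from transformers import RobertaTokenizer
from tokenizers import ByteLevelBPETokenizer

hf_tok = RobertaTokenizer.from_pretrained('roberta-base')
print(hf_tok.tokenize("hello world"))                        # .tokenize -> list of string tokens

bpe_tok = ByteLevelBPETokenizer('vocab.json', 'merges.txt')  # placeholder files
enc = bpe_tok.encode("hello world")                          # .encode -> Encoding object
print(enc.tokens, enc.ids)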

Hmm, that’s annoying; I’m actually planning on training a ByteLevelBPETokenizer shortly!

Have you tried importing the RoBERTa tokenizer from the transformers library? It should be byte-level BPE. Not sure if that one includes the ability to train or not, though…

Sorry I can’t be much help!

I doubt it has the ability to train. I will check, though. I’m trying to do translation with a low-resource language which I strongly doubt RoBERTa has been trained on. I’ll keep trying to get it working and share my results. I also suspect that it’s a multiprocessing problem with fastai (I noticed you had the same conclusion in your notebook).
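For reference, this is roughly how I’m training the tokenizer itself (a sketch with the tokenizers library; corpus.txt and the output directory are placeholders):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],                # raw text for the low-resource language
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")    # recent tokenizers versions; older ones use .save(dir)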


Good to know. I’ll be trying to train it for Irish, also low-resource; however, XLM-R was trained on it. Maybe that might be worth a go if the language you’re training (or a cousin of it) is in XLM-R?


Hello @Morgan,

You did awesome work with FastHugs, wrapping Hugging Face BERT-like models into fastai v2. Congratulations!

I’m currently using your Language Modelling code in order to adapt my fine-tuning method (see post) for a generative model.

I have 2 questions about class FastHugsTokenizer() and class MLMTokensLabels(Transform):

  1. It allows creating a sequence of at most max_seq_len - 2 tokens (e.g. for RoBERTa and BERT: 512 - 2 = 510 tokens) from each text cell of the training and validation dataset. But what about the tokens after this limit, for a text of 1000 tokens for example? Are they thrown away?
  2. About the 15% of tokens that are masked (80%), changed to another token (10%) or left unchanged (10%): are they (re)created at each batch generation (within the DataLoaders), which would make this a kind of data augmentation technique, or are they always the same?

Note: about your code in class MLMTokensLabels(Transform) > def _replace_with_other() > random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long): it also allows the special tokens (<s>, </s>, <pad>, <unk>, <mask>) to replace one of the 15% of tokens that were chosen but not replaced by the <mask> token. Don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing a sequence to the model with the <pad> token inside, for example?

Correct. If I recall correctly, I don’t think the RoBERTa authors made an effort to use the remainder of the text samples; they just took the first 510 tokens.
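So for a 1000-token text the tail just gets dropped, roughly like this (a sketch with a generic HF tokenizer, transformer_tokenizer as earlier in the thread; text is a placeholder):

max_seq_len = 512
tokens = transformer_tokenizer.tokenize(text)[: max_seq_len - 2]   # leave room for <s> and </s>
ids = transformer_tokenizer.convert_tokens_to_ids(tokens)
ids = [transformer_tokenizer.cls_token_id] + ids + [transformer_tokenizer.sep_token_id]
# everything after the first 510 tokens is simply never seen by the model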

(Interestingly in the NLP Checklist paper (youtube) they point out that performance on the Quora Question Pairs (QQP) task suffers when the questions’ positions are swapped, with models tending to focus more on the first question. Maybe this chopping of text is related…)

Correct, I believe that is one of the advantages of MLM training.

That’s an excellent point! I don’t recall if the authors took that into account; that section of code was a rewrite of HuggingFace’s implementation. Maybe they accounted for it elsewhere, but if not, then their models were also trained like that. Well spotted. I would avoid the special tokens, yes, as it wouldn’t make sense, even for data augmentation, to be adding those tokens.
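Something along these lines would probably do it (a sketch, assuming a HF tokenizer tok and the labels tensor from that part of the transform; not the current FastHugs code):

import torch

# sample replacement ids only from non-special tokens
special_ids = set(tok.all_special_ids)
candidate_ids = torch.tensor([i for i in range(len(tok)) if i not in special_ids], dtype=torch.long)
idxs = torch.randint(len(candidate_ids), labels.shape, dtype=torch.long)
random_words = candidate_ids[idxs]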


Hi Morgan.
Do you plan to update your code with the Whole Word Masking technique?
Thanks.

Oh interesting, I’ll add it to my to do list!

Hi, I’m taking baby steps into the world of transformers here, so thank you for this repository! I had a question on configuring the _num_labels because it is not working in my case.

I am sending _num_labels as an argument as I initialize the model:

fasthugs_model = FastHugsModel(transformer_cls=model_class, config_dict=config_dict, n_class=fct_dls.c, pretrained=True)

And I just traced it with pdb, and I can see that the config’s _num_labels is indeed updated to 30 (the number of my classes):

-> if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
(Pdb) self.config
RobertaConfig {
  "_num_labels": 30,
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

However, when I print the model, the classification head still has two output features. Is there some other place where it is getting overwritten?

    )
    (classifier): RobertaClassificationHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (out_proj): Linear(in_features=768, out_features=2, bias=True)
    )
  )
)
(Pdb) q

Thanks!

Not sure if that is the source of the issue, but your config file says:

"architectures": [
    "RobertaForMaskedLM"
  ]

Shouldn’t it be RobertaForSequenceClassification if you have a classification task?


For anyone else who might face the same issue: it should be config.num_labels, not config._num_labels.
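i.e. something along these lines (a sketch with the standard transformers Auto* API; pretrained_model_name as earlier in the thread):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = 30   # num_labels, not _num_labels: this is what sizes the classification head
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=config)
# for RoBERTa-style models, model.classifier.out_proj now has out_features=30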


Thanks @shimsan, this solved the same problem I had.
