Masked Language Modelling (MLM) with HuggingFace transformers - RoBERTa pre-training edition
With this you can fine-tune a RoBERTa model on your specific dataset before training it on a downstream task like sequence classification.
The main trick for me was creating an MLM Transform (MLMTokensLabels) that takes the numericalized input x, does the masking and outputs a tuple (x, y), where x has 15% of its tokens masked and y is the original input with the other 85% of its tokens masked out so the loss is only computed on the masked positions.
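A minimal sketch of what such a transform can look like (assuming a HuggingFace tokenizer and the usual 15% selection with the 80/10/10 mask/random/keep split; names and details here are illustrative, and the decodes is what later lets show_batch print readable text):

```python
import torch
from fastai.text.all import Transform, TitledStr

class MLMTokensLabels(Transform):
    "Sketch: mask 15% of tokens and return (masked_input, labels) for MLM"
    def __init__(self, tokenizer, mlm_prob=0.15):
        self.tok, self.mlm_prob = tokenizer, mlm_prob

    def encodes(self, x):
        inp, labels = x.clone(), x.clone()
        # sample 15% of the non-special tokens to predict
        probs = torch.full(labels.shape, self.mlm_prob)
        special = torch.tensor(
            self.tok.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True),
            dtype=torch.bool)
        probs.masked_fill_(special, 0.0)
        masked = torch.bernoulli(probs).bool()
        labels[~masked] = -100                    # loss ignores every non-masked position
        # of the selected tokens: 80% -> <mask>, 10% -> random token, 10% -> left unchanged
        to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        inp[to_mask] = self.tok.mask_token_id
        to_rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
        inp[to_rand] = torch.randint(len(self.tok), labels.shape, dtype=torch.long)[to_rand]
        return inp, labels

    def decodes(self, x):
        # fastai applies decodes to each element of the (x, y) tuple separately;
        # drop the -100 ignore positions before turning ids back into text
        return TitledStr(self.tok.decode([t for t in x.tolist() if t != -100]))
```

Using -100 for the non-masked label positions means the cross-entropy loss (and HuggingFace's masked LM head) skips them, since -100 is the default ignore index.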
I have seen others use a Callback to do the masking here, but by using Transforms I was able to use dls.show_batch to see the decoded inputs and targets.
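Roughly how it gets wired up and inspected (a sketch with placeholder names: token_ids stands for the already-numericalized sequences, tokenizer is the HuggingFace tokenizer, and MLMDatasets is the small Datasets tweak described below):

```python
# hypothetical: 1,000 numericalized sequences, 80/20 train/valid split
splits = [list(range(800)), list(range(800, 1000))]
dsets = MLMDatasets(token_ids, tfms=[MLMTokensLabels(tokenizer)], splits=splits)
dls = dsets.dataloaders(bs=16)
dls.show_batch(max_n=4)   # decoded inputs (with <mask> tokens) next to their targets
```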
The MLM transform is more or less a rewrite of the masking function used in HuggingFace's "How to train a language model from scratch" tutorial.
I also had to override one line in fastai's Datasets class, as it would try to make a tuple out of my tuple.
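For reference, the change amounts to something like this (a minimal subclass sketch rather than an in-place edit of the library; I'm assuming Datasets.__getitem__ is the spot that builds a tuple with one element per transform pipeline):

```python
from fastai.data.core import Datasets

class MLMDatasets(Datasets):
    "Datasets variant that doesn't re-wrap the transform's (x, y) output in another tuple"
    def __getitem__(self, it):
        res = super().__getitem__(it)
        # with a single pipeline whose transform already returns (x, y),
        # the parent gives back ((x, y),) - unwrap that extra layer
        if len(res) == 1 and isinstance(res[0], tuple): res = res[0]
        return res
```

Subclassing keeps the tweak local instead of monkey-patching fastai; everything else about Datasets (splits, dataloaders, decoding for show_batch) carries over unchanged.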