How to load data for Masked Language Modeling

Hi, I am trying to pretrain a model from scratch with a masked language modeling (MLM) objective, as in BERT, and I am wondering how to load the data.

  • Is there an off-the-shelf dataloader or snippet I can use?

  • If there’s no such thing, at which level (dataset, datablock, …) should I start? How would you do it?
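For context, here is a minimal sketch of the masking step that an MLM data pipeline has to perform, assuming the text is already tokenized into integer ids. It follows the masking rule described for BERT (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). The function name `mask_tokens` and its parameters are hypothetical, chosen for illustration; this is plain Python and not part of any library's API.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence using BERT-style masking.

    Hypothetical helper for illustration; mask_id, vocab_size and special_ids
    depend on whichever tokenizer is actually used.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 marks positions excluded from the loss; it matches the default
    # ignore_index of PyTorch's CrossEntropyLoss.
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; skip tokens not selected for prediction.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

One possible design choice is to apply a function like this per batch (for example in a collate function or batch-level transform) rather than once when building the dataset, so that each epoch sees a fresh mask. Note that this simple version may draw a special token id as the random replacement; a real pipeline might want to exclude those.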


Hey @Richard-Wang, did you manage to find an answer to this? Looking to share a language model tutorial notebook too :)

Hi @morgan !

Sorry for the very late reply. I was focused on pulling everything together first.

So the result is this post (and the following releases)

It will be a series going from pretraining an MLM model to multi-task fine-tuning on GLUE. You can follow my Twitter for updates on the series.

I would also be grateful if you could try the code on your GPUs with a larger corpus to see whether it truly reaches the reported accuracy (which it should), and if you could help me debug the fp16 issue mentioned in the post.
