How to load data for Masked Language Modeling

Richard-Wang · March 24, 2020, 12:43am

Hi, I am trying to pretrain a model from scratch with a masked language modeling (MLM) objective, as in BERT, and I am wondering how to load the data.

Is there an off-the-shelf dataloader or snippet I can use?
If there is no such thing, at which level (dataset, datablock, …) should I start? How would you do it?

Thanks

morgan · April 20, 2020, 9:34am

Hey @Richard-Wang, did you manage to find an answer to this? I'm looking to share a language model tutorial notebook too.

Richard-Wang · May 3, 2020, 2:49pm

Hi @morgan!

Sorry for the late reply. I was focused on pulling everything together first.

The result is this post (and the releases that will follow).

It will be a series covering everything from pretraining an MLM model to multi-task fine-tuning on GLUE. You can follow my Twitter for updates on this series.

I would also be grateful if you could try the code on your GPUs with a larger corpus to see whether it truly reaches the reported accuracy (which it should), and if you could help me debug the fp16 issue mentioned in the post.
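
For anyone landing here from the original question, below is a minimal sketch of the masking step that an MLM data pipeline needs. It is not the code from the post: `mask_tokens` and all of its parameters are hypothetical names chosen for illustration, and it assumes the usual BERT recipe (select about 15% of tokens, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged). The label value `-100` is an assumption that matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence, BERT-style.

    labels holds the original token at positions the model must predict
    and -100 everywhere else (assumed ignore value for the loss).
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens such as [CLS] or [SEP];
        # select every other position with probability mlm_prob.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical ids: 101 and 102 stand in for [CLS] and [SEP], 103 for [MASK].
ids = [101, 7, 8, 9, 10, 102]
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=1000,
                             special_ids=(101, 102), rng=random.Random(0))
```

Applying this per batch (for example in a collate function or a batch-level transform) instead of once up front gives each epoch a different mask, which is usually what you want for pretraining. Note that the random replacement in this sketch can occasionally draw a special token or the original token itself.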