I created callbacks for masked language modeling and replaced token detection (ELECTRA). Please help me test, check, and improve the code; I'm also hoping for a chance to PR it to fastai2.
Take a peek at it:
```python
from functools import partial
from fastai2.basics import *  # Callback, delegates

class MaskedLanguageModel(Callback):
    "Replace each batch with BERT-style masked inputs and MLM labels."
    @delegates(mask_tokens)  # mask_tokens must be defined/imported beforehand
    def __init__(self, mask_tok_id, special_tok_ids, vocab_size, **kwargs):
        self.mask_tokens = partial(mask_tokens,
                                   mask_token_index=mask_tok_id,
                                   special_token_indices=special_tok_ids,
                                   vocab_size=vocab_size,
                                   **kwargs)

    def begin_batch(self):
        # self.xb is a tuple of model inputs; take the token-id tensor
        text_indices = self.xb[0]
        masked_inputs, labels = self.mask_tokens(text_indices)
        # hand the masked inputs and MLM labels back to the Learner
        self.learn.xb, self.learn.yb = (masked_inputs,), (labels,)
```
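For context, `mask_tokens` itself isn't shown above. Here is a minimal sketch of what such a function might look like, following BERT's 80/10/10 masking scheme; the signature matches the `partial` call in the callback, but the probabilities and internal logic are my assumptions, not necessarily the actual implementation:

```python
import torch

def mask_tokens(inputs, mask_token_index, special_token_indices, vocab_size,
                mlm_probability=0.15):
    "Mask ~15% of non-special tokens: 80% [MASK], 10% random token, 10% unchanged."
    inputs = inputs.clone()          # don't mutate the dataloader's batch
    labels = inputs.clone()          # labels keep the original token ids

    # Sample positions to mask, never touching special tokens ([CLS], [SEP], [PAD], ...)
    probs = torch.full(labels.shape, mlm_probability, device=inputs.device)
    for sp_id in special_token_indices:
        probs.masked_fill_(labels == sp_id, 0.0)
    mlm_mask = torch.bernoulli(probs).bool()
    labels[~mlm_mask] = -100         # ignored by CrossEntropyLoss(ignore_index=-100)

    # 80% of the masked positions -> [MASK]
    replaced = torch.bernoulli(
        torch.full(labels.shape, 0.8, device=inputs.device)).bool() & mlm_mask
    inputs[replaced] = mask_token_index

    # 10% -> a random token (half of the remaining 20%)
    randomized = torch.bernoulli(
        torch.full(labels.shape, 0.5, device=inputs.device)).bool() & mlm_mask & ~replaced
    random_tokens = torch.randint(vocab_size, labels.shape,
                                  dtype=torch.long, device=inputs.device)
    inputs[randomized] = random_tokens[randomized]

    # the remaining 10% of masked positions are left unchanged
    return inputs, labels
```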
There are still some things that need your help; let's facilitate research on NLP pretraining!
- The loss becomes NaN when training with fp16, for both MLM and ELECTRA.
- I haven't pretrained it to the reported accuracy; I hope someone can spend a few GPU hours training a small model on a suitable corpus and check its accuracy (see the sketch after this list for how the callback plugs into training).
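If anyone wants to try, here is a hedged sketch of how the callback could plug into a fastai2 training loop. The `tokenizer` here is assumed to be a HuggingFace-style tokenizer object, and `dls` / `model` are placeholders for your dataloaders of token ids and any LM with a token-level head returning `(bs, seq_len, vocab_size)` logits:

```python
from fastai2.text.all import *

# hypothetical setup: tokenizer, dls, and model are not part of the code above
mlm_cb = MaskedLanguageModel(mask_tok_id=tokenizer.mask_token_id,
                             special_tok_ids=tokenizer.all_special_ids,
                             vocab_size=tokenizer.vocab_size)
learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(ignore_index=-100),
                cbs=[mlm_cb])
learn.fit_one_cycle(1, 1e-4)
```

`ignore_index=-100` matches the label convention in the masking sketch, so loss is only computed on the masked positions.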
Please help me tag anyone who might be interested or could help!
Also, follow this thread or my Twitter (Richard Wang); I will keep updating this series.
(Spoiler alert: a custom dataloader that makes the most of the max sequence length, GLUE data preparation, and single-/multi-task training on GLUE are in the pipeline!)