Pretrain MLM and finetune on GLUE with fastai - 1 - Masked language model callback and ELECTRA callback

I created callbacks for masked language modeling and replaced token detection (ELECTRA). Please help me test/check/improve the code; I am also hoping for a chance to PR to fastai2.

richardyy1188/Pretrain-MLM-and-finetune-on-GLUE-with-fastai/MLM_pretraing_with_fastai.ipynb

Take a peek at it:

class MaskedLanguageModel(Callback):
  def __init__(self, mask_tok_id, special_tok_ids, vocab_size, **kwargs):
    # keyword names assumed to match the notebook's mask_tokens helper
    self.mask_tokens = partial(mask_tokens,
                               mask_token_index=mask_tok_id,
                               special_token_indices=special_tok_ids,
                               vocab_size=vocab_size, **kwargs)
  def begin_batch(self):
    text_indices = self.xb[0]
    masked_inputs, labels = self.mask_tokens(text_indices)
    self.learn.xb, self.learn.yb = (masked_inputs,), (labels,)
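For reference, here is a minimal sketch of what a `mask_tokens` helper following BERT's 80/10/10 recipe could look like. The parameter names and the exact recipe are my assumptions; the notebook's actual implementation may differ.

```python
import torch

def mask_tokens(inputs, mask_token_index, vocab_size, special_token_indices,
                mlm_probability=0.15, ignore_index=-100):
    # Select ~15% of non-special positions; of those: 80% -> [MASK],
    # 10% -> random token, 10% left unchanged. Unselected positions get
    # ignore_index in labels so cross-entropy skips them.
    labels = inputs.clone()
    special_mask = torch.zeros_like(inputs, dtype=torch.bool)
    for tok in special_token_indices:
        special_mask |= inputs == tok
    probs = torch.full(inputs.shape, mlm_probability)
    probs.masked_fill_(special_mask, 0.0)
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = ignore_index

    inputs = inputs.clone()
    # 80% of selected positions become the mask token
    mask_pos = torch.bernoulli(torch.full(inputs.shape, 0.8)).bool() & selected
    inputs[mask_pos] = mask_token_index
    # half of the rest (10% overall) become a random vocab token
    rand_pos = (torch.bernoulli(torch.full(inputs.shape, 0.5)).bool()
                & selected & ~mask_pos)
    inputs[rand_pos] = torch.randint(vocab_size, inputs.shape)[rand_pos]
    # the final 10% keep their original token
    return inputs, labels
```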

There is still some work that needs your help, so let's facilitate research on NLP pretraining!

  • It gives nan loss when using fp16, in both the MLM and ELECTRA cases.
  • I haven't pretrained it to the reported accuracy; I hope someone can spend a few GPU hours training a small model on a proper corpus and check its accuracy.

Help me tag anyone who might be interested or could help!
Also follow this thread or my Twitter (Richard Wang); I will keep updating this series.

(Spoiler alert: a custom dataloader that makes the most of the max length, GLUE data preparation, and single/multi-task training on GLUE are in the pipeline!!)


How's the progress going with ELECTRA? Would love to see it working; I might have time to look at it in a couple of weeks :frowning:

A few questions, maybe you’ve thought of them already:

  • Are you loading randomly initialised weights? The authors found that loading pre-trained weights didn’t perform so well

  • Your generator DistilBERT is super small (1 layer); maybe it can't learn anything useful. The authors found in their setup that a generator of half the discriminator size worked best (so with DistilBERT that would be 3 layers for the generator, I think?).

  • Did you consider using ALBERT-base? It looks like it only has 11M params vs 65M for DistilBERT-base. You could even use ALBERT-base (11M) -> ALBERT-large (17M) for your gen->discrim setup? (Param sizes mentioned here)

Hi, thanks for your interest.

Are you loading randomly initialised weights? The authors found that loading pre-trained weights didn’t perform so well

I did load pretrained weights. Thanks for the info; it makes me wonder whether different pretraining objectives produce representations that can't be reused by each other.

Your generator DistilBERT is super small (1 layer), …, the authors found in their setup that a generator of half the discriminator size worked best .

Thanks for your comment. I quickly reexamined the ELECTRA paper and config, and as far as I understand, I found two things.

  • As Figure 3 shows, the authors seem to suggest that the generator hidden size is best set to 1/2 of the discriminator hidden size. The config on GitHub uses 1/4, but referring to the graph, that seems to cause no big drop in accuracy while saving time and parameters.

  • According to the config, the generator has the same number of layers as the discriminator but a smaller hidden size.
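For concreteness, these are the sizes I believe the released small configs use (values recalled from the google/electra-small-* configs on the HuggingFace hub; please verify with `ElectraConfig.from_pretrained` before relying on them):

```python
# Assumed values from google/electra-small-generator and
# google/electra-small-discriminator: same layer count, but the
# generator's hidden size is 1/4 of the discriminator's.
generator = {"num_hidden_layers": 12, "hidden_size": 64}
discriminator = {"num_hidden_layers": 12, "hidden_size": 256}

assert generator["num_hidden_layers"] == discriminator["num_hidden_layers"]
assert generator["hidden_size"] * 4 == discriminator["hidden_size"]
```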

Did you consider using ALBERT-base? …

I might have seen a graph showing that weight-sharing transformers (Universal Transformer and ALBERT), although they have far fewer parameters, have a higher computational cost than non-weight-sharing ones. (Sorry, I couldn't find the graph.) One layer of ALBERT may be larger than one layer of BERT, so running an ALBERT layer 12 times can cost more than running through 12 BERT layers.

I will update the settings in the notebook according to the config in the ELECTRA repository tomorrow. By the way, I think I should use the newly released ELECTRA in Transformers (although there are no pretrained weights yet?).

I would like to test it with a reasonable setting, but the only GPU available to me now is Colab :cry:, so I am wondering if someone could help me try this callback. Otherwise I would need to train a 6.25%-trained ELECTRA-small myself after I finish the work at hand…

  • Yep definitely, makes life a bit easier!
  • I might be able to help train it in a couple of days depending on whether I'm in writing, reading, or modelling mode :slight_smile:; I have a 2080 here

It seems that I can’t edit my original post, so I’ll post the updates here. All updates are reflected in the original notebook.

==== 2020/05/17 ====

  • Use the ELECTRA pretrained model (with pretrained weights) instead of the original DistilBERT
  • Normal results when using fp16; this seems to be because of the change in the point above.
  • Print the masked inputs and labels for easier understanding of the MLM callback
  • Use the TextDataloader introduced here for broader context (which may make you load more tokens per batch than the original paper, but I haven't figured out how they load their data).
  • Stick to the paper's hyperparameters:
  1. Use the google/electra-small-generator and google/electra-small-discriminator pretrained model configs, so we should be correct about the model part (including the number of layers).
  2. batch_size == 128
  3. Scale the discriminator loss by 50
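A minimal sketch of how the scaling in point 3 combines the two objectives (the function and tensor names here are my own; the paper weights the discriminator term by λ = 50):

```python
import torch
import torch.nn.functional as F

def electra_loss(gen_logits, mlm_labels, disc_logits, is_replaced,
                 loss_weight=50.0):
    """ELECTRA's combined objective: generator MLM cross-entropy plus the
    replaced-token-detection BCE, with the discriminator term scaled by
    loss_weight (50 in the paper). mlm_labels uses -100 at unmasked positions."""
    gen_loss = F.cross_entropy(gen_logits.view(-1, gen_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits,
                                                   is_replaced.float())
    return gen_loss + loss_weight * disc_loss
```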

Please let me know if there are other differences from the original implementation.

Note: The accuracy is about 7%. I don’t know if this is normal; nobody has reported “MLM acc” after all. And I am waiting for huggingface/nlp to resolve issue #148 so I can try it on the Wikipedia corpus (which the pretrained model was trained on) instead of the SimpleBooks corpus (which contains many children’s stories, so maybe there is a discrepancy in the data?).
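For what it's worth, my reading of "MLM acc" is top-1 accuracy computed only over the masked positions; a small sketch (the function name and the -100 ignore-label convention are my assumptions):

```python
import torch

def mlm_accuracy(logits, labels, ignore_index=-100):
    """Top-1 accuracy over masked positions only (labels != ignore_index)."""
    preds = logits.argmax(dim=-1)
    mask = labels != ignore_index
    return (preds[mask] == labels[mask]).float().mean()
```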

Note: You may not be able to view the notebook on GitHub (I don’t know why; maybe it takes time), but you can download it and open it locally.


==== 2020/05/24 ====
Show what we mask!

It surprised me how easy it is to make a MaskedLMCallback pretty-print the results. :heart_eyes: :heart_eyes: :heart_eyes:

def show_batch(self, dl, **kwargs):
    b = dl.one_batch()
    new_b = self.mask_tokens(b[0])
    dl.show_batch(b=new_b, **kwargs)
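Outside fastai, the same idea boils down to decoding both the original and the masked sequence and printing them side by side; a toy illustration with a made-up vocab (everything here is hypothetical, just to show the effect):

```python
# Toy illustration of the pretty-printing idea; the vocab and ids are made up.
vocab = {0: "[PAD]", 1: "[MASK]", 2: "the", 3: "cat", 4: "sat"}

def render(ids):
    return " ".join(vocab[i] for i in ids)

original = [2, 3, 4]
masked   = [2, 1, 4]
print(render(original))  # the cat sat
print(render(masked))    # the [MASK] sat
```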