FastHugs - fastai-v2 and HuggingFace Transformers

Hi Morgan. After some tests, I'm coming back to you about this issue: the MASK token distribution (masking applied to 80% of the 15% of tokens selected in each training and validation sequence) performed by your class MLMTokensLabels(Transform).
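Just to make sure we are talking about the same recipe, here is my own rough sketch of the standard BERT masking rule as I understand it (not your implementation; bert_style_mask, mlm_prob, mask_token_id and vocab_size are just placeholder names, and special tokens are ignored for brevity):

import torch

def bert_style_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    "Select ~15% of tokens as targets; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."
    labels = input_ids.clone()
    # choose ~15% of positions as prediction targets
    select = torch.rand(input_ids.shape) < mlm_prob
    labels[~select] = -100                                  # ignore non-selected positions in the loss
    # 80% of the selected positions get the [MASK] token
    mask_here = select & (torch.rand(input_ids.shape) < 0.8)
    input_ids[mask_here] = mask_token_id
    # 10% of the selected positions get a random token (half of the remaining 20%)
    random_here = select & ~mask_here & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_here] = torch.randint(vocab_size, (int(random_here.sum()),))
    # the remaining 10% keep their original token
    return input_ids, labels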

Are you sure that this masking is computed at the DataLoaders level, when batches are generated (that would be great, since it would mean the masked positions change with each batch), and not at the Datasets level (which would mean the masking is computed only once and never changes)?

Looking at the following code from your notebook, the time it takes to run, and the size of the resulting files (dsets and dls), I suspect the masking is done only once. What do you think?

tfms=[attrgetter("text"), fastai_tokenizer, Numericalize(vocab=tokenizer_vocab_ls), 
      AddSpecialTokens(tokenizer), MLMTokensLabels(tokenizer)]
dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)

padding = transformer_mlm_padding(tokenizer, max_seq_len=max_seq_len)
dls = dsets.dataloaders(bs=bs, before_batch=[padding])
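
Here is the rough check I had in mind (my own sketch, not from your notebook): if MLMTokensLabels is applied lazily, fetching the same training item twice should give different masked positions. I'm assuming the masked input ids can be pulled out of the returned item as below; the indexing may need adjusting depending on the actual tuple structure.

import torch

item_a = dsets.train[0]
item_b = dsets.train[0]
# assuming the first element holds the masked input ids
print(torch.equal(item_a[0], item_b[0]))  # mostly False => masking is redrawn on each access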