FastHugs - fastai-v2 and HuggingFace Transformers

Hello @Morgan,

You did awesome work with FastHugs, wrapping Hugging Face BERT-like models for fastai v2. Congratulations!

I’m currently using your Masked Language Model code to adapt my fine-tuning method (see post) to a generative model.

I have 2 questions about the classes FastHugsTokenizer() and MLMTokensLabels(Transform):

  1. They create a sequence of at most max_seq_len - 2 tokens (e.g. for RoBERTa and BERT: 512 - 2 = 510) from each text cell of the training and validation datasets. But what happens to the tokens beyond this limit, for a text of 1000 tokens for example? Are they simply thrown away?
  2. About the 15% of tokens that are selected, then masked (80%), replaced by another token (10%) or left unchanged (10%): are they re-drawn at each batch generation (within the DataLoaders), which would act as a kind of data augmentation, or are they always the same? (See the sketch right after this list for the masking rule I mean.)
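
To make question 2 concrete, here is a minimal sketch of the standard BERT-style 80/10/10 masking rule I am referring to; the function and variable names are mine, not FastHugs’, and `tok` is assumed to be a Hugging Face tokenizer. If such a transform runs inside the DataLoaders, a new mask is drawn every time a batch is built, which is the "data augmentation" behaviour I am asking about:

```python
import torch

def mask_tokens(input_ids, tok, mlm_prob=0.15):
    "Illustrative BERT-style 80/10/10 masking; not FastHugs' actual code."
    labels = input_ids.clone()

    # choose ~15% of positions as prediction targets, never the special tokens
    special = torch.tensor(
        tok.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool)
    probs = torch.full(labels.shape, mlm_prob)
    probs.masked_fill_(special, 0.)
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100                 # loss is computed only on the selected positions

    # 80% of the selected tokens -> <mask>
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = tok.mask_token_id

    # half of the remainder (10% overall) -> a random token; the rest stays unchanged
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    random_words = torch.randint(len(tok), labels.shape, dtype=torch.long)
    input_ids[randomized] = random_words[randomized]
    return input_ids, labels
```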

Note: in your code in class MLMTokensLabels(Transform) > def _replace_with_other() > random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long), the special tokens (<s>, </s>, <pad>, <unk>, <mask>) can also be drawn to replace one of the 15% chosen tokens that are not replaced by the <mask> token: don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing the model a sequence with a <pad> token in the middle, for example?
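
For what it’s worth, here is one possible way to keep the special tokens out of the random replacements; this is only a sketch on top of your line above (`tok` stands for self.tok, assumed to be a Hugging Face tokenizer), not a tested patch:

```python
import torch

# vocabulary ids minus the special tokens (<s>, </s>, <pad>, <unk>, <mask>)
allowed_ids = torch.tensor(
    [i for i in range(len(tok)) if i not in set(tok.all_special_ids)])

# draw an index into `allowed_ids` instead of into the full vocabulary,
# so a special token can never be picked as a replacement
random_words = allowed_ids[torch.randint(len(allowed_ids), labels.shape)]
```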