Hello @Morgan,
You did awesome work with FastHugs, wrapping Hugging Face BERT-like models into fastai v2. Congratulations!
I’m currently using your language model code to adapt my fine-tuning method (see post) to generative models.
I have 2 questions about `class FastHugsTokenizer()` and `class MLMTokensLabels(Transform)`:
- they allow creating a sequence of at most `max_seq_len - 2` tokens (e.g., for RoBERTa and BERT: 512 - 2 = 510 tokens) from each text cell of the training and validation datasets. But what about the tokens beyond this limit, for a text of 1000 tokens for example? Are they simply thrown away? (See the first sketch after this list.)
- and about the 15% of tokens that are masked (80%), changed to another token (10%), or left unchanged (10%): are they (re)created at each batch generation (within the `Dataloaders`), which would act as a kind of Data Augmentation technique, or are they always the same? (See the second sketch after this list.)
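
To make the first question concrete, here is a minimal sketch of what I understand plain truncation to do, assuming the standard Hugging Face `transformers` tokenizer API (not your actual `FastHugsTokenizer` code):

```python
from transformers import RobertaTokenizer

# Hypothetical illustration: standard truncation simply drops everything past max_length
tok = RobertaTokenizer.from_pretrained("roberta-base")
long_text = "word " * 1000  # tokenizes to well over 512 tokens

enc = tok(long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))  # 512 -> the tail of the document never reaches the model
```

My question is whether `FastHugsTokenizer` behaves the same way (the tail is lost) or whether it splits long texts into several 510-token chunks.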
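
And to make the second question concrete, here is a simplified sketch of the BERT-style 80/10/10 recipe as I understand it (the function name and exact logic are mine, not your code):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Simplified BERT-style masking: pick ~15% of positions, then
    replace 80% with <mask>, 10% with a random token, leave 10% unchanged."""
    labels = input_ids.clone()

    # choose the ~15% of positions that will contribute to the MLM loss
    masked_indices = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked_indices] = -100  # ignored by the loss

    # 80% of the chosen positions -> <mask>
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = mask_token_id

    # half of the remaining 20% (i.e. 10% overall) -> a random token
    random_idx = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~replaced
    random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    input_ids[random_idx] = random_words[random_idx]

    # the rest of the chosen positions (10% overall) stay unchanged
    return input_ids, labels
```

If something like this runs inside the `Dataloaders` at every batch, the masked positions change from epoch to epoch (a kind of dynamic masking / data augmentation); if it runs once when the datasets are built, they stay fixed forever.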
Note: about your code in `class MLMTokensLabels(Transform)` > `def _replace_with_other()` > `random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long)`: it also allows the special tokens (`<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`) to replace one of the 15% chosen tokens that are not replaced by the `<mask>` token. Don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing the model a sequence with a `<pad>` token in the middle, for example?
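
For example (just a sketch of what I mean, assuming the tokenizer exposes `all_special_ids` and `len()` like the Hugging Face tokenizers; this is not your actual implementation), the random replacements could be drawn from the non-special ids only:

```python
import torch

def replace_with_other(input_ids, indices_random, tokenizer):
    """Sketch: draw random replacement tokens while excluding the tokenizer's
    special ids (<s>, </s>, <pad>, <unk>, <mask>, ...)."""
    special = set(tokenizer.all_special_ids)
    # candidate ids = every vocabulary id that is not a special token
    candidates = torch.tensor(
        [i for i in range(len(tokenizer)) if i not in special], dtype=torch.long
    )
    # sample uniformly from the allowed ids only
    random_words = candidates[torch.randint(len(candidates), input_ids.shape)]
    input_ids[indices_random] = random_words[indices_random]
    return input_ids
```

That way a chosen position can never be turned into `<pad>` or `</s>` in the middle of a sentence.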