FastHugs - fastai-v2 and HuggingFace Transformers

Thanks! No plans right now to extend FastHugs to include wrappers like BertDataBunch etc. My intention was more to demonstrate how to extend fastai2 by showing which transforms or callbacks can be used, similar to the Transformers tutorial in the fastai docs.

I realise that this makes things a little more difficult for beginners; the blurr library might be another good option, as it includes wrappers.

Having said that, you should be able to train a BERT-like model from scratch using the MLM notebook in FastHugs. (Note this notebook doesn’t include the Next Sentence Prediction training task that BERT also used, as subsequent researchers found that it didn’t help performance; see the RoBERTa paper for more.)

4 Likes

@morgan have you tried training your own ByteLevelBPETokenizer from the Tokenizers library? I tried using the .encodes method of the tokenizer to adapt it to fastai’s tokenizer, but then I am unable to call next(iter(dls.train)) or next(iter(dls.valid)). However, dls.one_batch() works, though I can’t use that while training a model in the Learner. Have you encountered such a problem before?

PS: I have also tried using the BERT tokenizer’s encode and encode_plus methods, but neither seems to work. I noticed in your notebook that you used the tokenize method instead. The BPE tokenizer from the tokenizers library does not have this tokenize method; it only has encode.
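For reference, here is roughly how I trained and called it (a minimal sketch; corpus.txt and the vocab settings are placeholders):

    from tokenizers import ByteLevelBPETokenizer

    bpe_tok = ByteLevelBPETokenizer()
    bpe_tok.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2,
                  special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

    enc = bpe_tok.encode("Hello world")
    print(enc.tokens)  # string pieces, the closest thing to a `tokenize` output
    print(enc.ids)     # the integer ids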

Hmm, that’s annoying. I’m actually planning on training a ByteLevelBPETokenizer shortly!

Have you tried importing the RoBERTa tokenizer from the transformers library? It should be byte-level BPE. Not sure if this one includes the ability to train or not though…

Sorry can’t be much help!

I doubt it has the ability to train, but I will check. I’m trying to do translation with a low-resource language which I strongly doubt RoBERTa has been trained on. I’ll keep trying to get it working and share my results. I also suspect that it’s a multiprocessing problem with fastai (I noticed you had the same conclusion in your notebook).

1 Like

Good to know. I’ll be trying to train it for Irish, also low resource; however, XLM-R was trained on it. Maybe XLM-R would be worth a go if the language you’re training (or a cousin of it) is included in it?

1 Like

Hello @Morgan,

You did awesome work with FastHugs, wrapping Hugging Face BERT-like models for fastai v2. Congratulations!

I’m currently using your language modelling code to adapt my fine-tuning method (see post) for a generative model.

I have 2 questions about the classes FastHugsTokenizer() and MLMTokensLabels(Transform):

  1. It allows creating a sequence of max_seq_len - 2 tokens (e.g. for RoBERTa and BERT: 512 - 2 = 510 tokens) from each text cell of the training and validation datasets. But what about the tokens beyond this limit for a text of, say, 1000 tokens? Are they thrown away?
  2. And about the 15% of tokens that are masked (80%), changed to another token (10%) or left unchanged (10%): are they (re)created at each batch generation (within the DataLoaders), which would be a kind of data augmentation technique, or are they always the same?

Note: your code in class MLMTokensLabels(Transform) > def _replace_with_other() > random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long) also allows the special tokens (<s>, </s>, <pad>, <unk>, <mask>) to replace one of the 15% of chosen tokens not replaced by the <mask> token: don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing the model a sequence with a <pad> token in the middle, for example?

Correct. If I recall correctly, the RoBERTa authors didn’t make an effort to use the remainder of the text samples; they just took the first 510 tokens.
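For illustration (a rough sketch using HuggingFace’s RobertaTokenizer and a placeholder text), truncation simply drops everything past the limit:

    from transformers import RobertaTokenizer

    tok = RobertaTokenizer.from_pretrained("roberta-base")
    long_text = "a very long document " * 1000
    ids = tok.encode(long_text, truncation=True, max_length=512)
    print(len(ids))  # 512: <s> + 510 content tokens + </s>; the remainder is discarded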

(Interestingly, in the NLP CheckList paper (YouTube) they point out that performance on the Quora Question Pairs (QQP) task suffers when the questions’ positions are swapped, with models tending to focus more on the first question. Maybe this chopping of text is related…)

Correct, the masked tokens are re-generated at each batch; I believe that is one of the advantages of MLM training.

That’s an excellent point! I don’t recall if the authors took that into account. That section of code was a re-write of HuggingFace’s implementation; maybe they accounted for it elsewhere, but if not then their models are also trained like that. Well spotted. Yes, I would avoid the special tokens, as it wouldn’t make sense, even for data augmentation, to be adding those tokens.
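Something like this (a rough sketch, not the notebook’s exact code; it assumes a HuggingFace tokenizer exposing all_special_ids) would re-draw any random replacement that lands on a special token:

    import torch

    def random_words_avoiding_specials(tok, shape):
        # Sample candidate replacement ids, then re-draw any that hit a
        # special token id (<s>, </s>, <pad>, <unk>, <mask>)
        special_ids = torch.tensor(tok.all_special_ids)
        random_words = torch.randint(len(tok), shape, dtype=torch.long)
        is_special = (random_words.unsqueeze(-1) == special_ids).any(-1)
        while is_special.any():
            random_words[is_special] = torch.randint(len(tok), (int(is_special.sum()),), dtype=torch.long)
            is_special = (random_words.unsqueeze(-1) == special_ids).any(-1)
        return random_words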

1 Like

Hi Morgan.
Do you plan to update your code with the Whole Word Masking technique?
Thanks.

Oh interesting, I’ll add it to my to do list!
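For reference, the selection step would look roughly like this (a sketch only, not FastHugs code; it assumes BERT-style "##" continuation markers, so it would need adjusting for byte-level BPE):

    import random, torch

    def whole_word_mask(tokens, mlm_prob=0.15):
        # Group subword pieces into words: a token starting with "##"
        # continues the previous word (WordPiece convention)
        spans, current = [], []
        for i, t in enumerate(tokens):
            if t.startswith("##") and current:
                current.append(i)
            else:
                if current: spans.append(current)
                current = [i]
        if current: spans.append(current)

        # Pick whole words at random until ~15% of tokens are covered,
        # then mask every piece of each picked word together
        budget = max(1, round(len(tokens) * mlm_prob))
        random.shuffle(spans)
        picked = []
        for span in spans:
            if len(picked) >= budget: break
            picked.extend(span)
        mask = torch.zeros(len(tokens), dtype=torch.bool)
        mask[picked] = True
        return mask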

Hi, I’m taking baby steps into the world of transformers here, so thank you for this repository! I had a question about configuring _num_labels, because it is not working in my case.

I am sending _num_labels as an argument as I initialize the model:

fasthugs_model = FastHugsModel(transformer_cls=model_class, config_dict=config_dict, n_class=fct_dls.c, pretrained=True)

I just traced it with pdb and can see that the config’s _num_labels is indeed updated to 30 (the number of my classes):

-> if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
(Pdb) self.config
RobertaConfig {
  "_num_labels": 30,
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

However, when I print the model, the classification head still has two output features. Is it getting overwritten somewhere else?

    )
    (classifier): RobertaClassificationHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (out_proj): Linear(in_features=768, out_features=2, bias=True)
    )
  )
)
(Pdb) q

Thanks!

Not sure if that is the source of the issue, but your config file says:

"architectures": [
    "RobertaForMaskedLM"
  ]

Shouldn’t it be RobertaForSequenceClassification if you have a classification task?

1 Like

For anyone else who might face the same issue: it should be config.num_labels, not config._num_labels.
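A minimal sketch of the fix with plain transformers (model name and label count are placeholders):

    from transformers import RobertaConfig, RobertaForSequenceClassification

    config = RobertaConfig.from_pretrained("roberta-base", num_labels=30)  # not _num_labels
    model = RobertaForSequenceClassification.from_pretrained("roberta-base", config=config)
    print(model.classifier.out_proj)  # Linear(in_features=768, out_features=30, bias=True)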

1 Like

Thanks @shimsan, this solved the same problem I had.

1 Like

Hi @morgan, I appreciate your help with the Roberta example. I have a few questions about the code. Thanks for your time and help.

  1. When saving the model, is using TextLearner.save_encoder more appropriate? I am guessing it saves the unique vocabulary in the tokenizer too.

  2. What part of tokenization updates the vocabulary with new words? I am doing language modelling on domain-specific text.

  3. Is it possible to change the sequence length for the batching process? I understand 510 is the max sequence length for the model, but is it possible to keep this 510 sequence length for the model while shortening the batch sequence length to reduce memory?

Hey @nickgeoca

  1. save_encoder will only save the model, not the optimizer, vocab, dls, etc.

  2. The tokenizer isn’t trained in this version. You’d have to train your own tokenizer if your text distribution is very different from the text RoBERTa was trained on. If you’re using the pre-trained model you’d also have to modify the embedding layer to account for this difference in vocab (see the sketch after this list).

  3. No need to change the batch length: SortedDL cleverly sorts batches according to (roughly) the length of the sequences, and the padding function only pads each batch to the length of its longest sequence, so it should already be pretty efficient in that regard!
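On point 2, a rough sketch of what I mean by modifying the embedding layer (the tokenizer path is a placeholder; resize_token_embeddings comes from transformers):

    from transformers import RobertaForMaskedLM, RobertaTokenizerFast

    new_tok = RobertaTokenizerFast.from_pretrained("./my-domain-tokenizer")  # your own trained tokenizer
    model = RobertaForMaskedLM.from_pretrained("roberta-base")

    # Resize the (tied) input/output embeddings to the new vocab size;
    # the newly added rows are randomly initialised and need further MLM training
    model.resize_token_embeddings(len(new_tok))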

Hi Morgan. After some tests, I’m back to you about this issue: the MASK token distribution (the masking of 80% of the 15% of tokens selected in each training and validation sequence) produced by your class MLMTokensLabels(Transform).

Are you sure that this MASK token distribution is calculated at the DataLoaders level when batches are generated (that would be great, as it would mean the MASK token distribution changes at each batch generation) and not at the Datasets level (which would mean the MASK token distribution is calculated just once and never changes)?

Looking at the following code in your notebook, the time needed to run it, and the size of the saved files (dsets and dls), I think the MASK token distribution is computed just once. What do you think?

tfms=[attrgetter("text"), fastai_tokenizer, Numericalize(vocab=tokenizer_vocab_ls), 
      AddSpecialTokens(tokenizer), MLMTokensLabels(tokenizer)]
dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)

padding = transformer_mlm_padding(tokenizer, max_seq_len=max_seq_len)
dls = dsets.dataloaders(bs=bs, before_batch=[padding])

@pierreguillou I think a new mask distribution is drawn for every batch, as MLMTokensLabels.encodes is called each time a batch is built, and in each call to encodes we get a different masked_indices:

I’m pretty sure that’s the case anyway; happy to be corrected if my thinking is fuzzy!

        # Create random mask indices according to probability matrix
        masked_indices = torch.bernoulli(probability_matrix).bool()
        
        ....
        
        # Randomly replace with mask token
        inputs, indices_replaced = self._replace_with_mask(inputs, labels, masked_indices)
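You can see this with plain PyTorch (separate from the notebook’s code): torch.bernoulli draws a fresh sample on every call, so the same item gets different masked positions on different passes:

    import torch

    probability_matrix = torch.full((1, 20), 0.15)
    print(torch.bernoulli(probability_matrix).bool())
    print(torch.bernoulli(probability_matrix).bool())  # (almost certainly) different positions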

I understand, @morgan, but let me explain why I’m still wondering what is happening.

  1. I did check with dls.show_batch() that the same sequence appears with a different MASK token distribution each time I run this function (i.e., each time a batch is created). One point for you :slight_smile:

  2. But when I save my dls (DataLoaders) with torch.save(dls, path_to_dls_file), the file is 6 GB while the training dataset is only 2 GB. Why this gigantic size if dls is only a series of instructions for generating batches of sequences?

  3. On top of that, I remember Jeremy mentioning in a video that, to avoid a bottleneck between batch generation (on the CPU) and the speed of data processing at the input of the GPU, fastai’s text transforms are applied when the Datasets and DataLoaders are created (see your code in my post). This would mean that the text transforms are applied only once to the sequences (with just a bit of randomness among sequences of similar length applied at batch-generation time). But this goes against point 1, no?

What do you think? Thank you.

Hi @morgan.

I have 2 questions about the code in your notebook “FastHugs: Language Modelling with Transformers and Fastai”:

special token_ids

  1. Special token ids (the ids of the [CLS] and [SEP] tokens in the case of BERT, for example) are added to the sequence of numericalized tokens by the object instantiated from the class AddSpecialTokens(Transform).

  2. This operation is done during the Datasets creation by the following code:

    splits = ColSplitter()(df)
    tfms=[attrgetter("text"), fastai_tokenizer, Numericalize(vocab=tokenizer_vocab_ls), 
          AddSpecialTokens(tokenizer), MLMTokensLabels(tokenizer)]
    dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)
    
  3. My understanding is that the special token ids are added before the sequence is transformed into a sequence with MASK tokens by the object instantiated from the class MLMTokensLabels(). If that is correct, it means the special token ids can themselves be replaced by the MASK token id.

What do you think? If I’m right, this could be a problem, because BERT-like models are trained and used in production with sequences that always include the special token ids.
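If so, maybe the masking probability matrix could simply zero out the positions of the special tokens before sampling (a rough sketch, assuming a HuggingFace tokenizer and its get_special_tokens_mask method):

    import torch

    def mlm_probability_matrix(input_ids, tokenizer, mlm_prob=0.15):
        # Uniform 15% selection probability per position...
        probability_matrix = torch.full(input_ids.shape, mlm_prob)
        # ...except positions already holding special tokens, which get probability 0
        special_mask = tokenizer.get_special_tokens_mask(input_ids.tolist(),
                                                         already_has_special_tokens=True)
        probability_matrix.masked_fill_(torch.tensor(special_mask, dtype=torch.bool), value=0.0)
        return probability_matrix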

dl_type=SortedDL

Do you think it would be an efficient idea to shuffle the object instantiated from the class SortedDL, by changing the code

dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=SortedDL)

to

dsets = Datasets(df, splits=splits, tfms=[tfms], dl_type=partial(SortedDL, shuffle=True))

?

Hi @morgan,

Today I tried to run your notebook 2020-04-24-fasthugs_language_model.ipynb, but it appears that fastcore and/or fastai v2 have changed, no?

For example, I do not see how your class MLMTokenizer, which inherits from the class Tokenizer, can build fastai_tokenizer (via MLMTokenizer.from_df) with the arguments tok_func and post_rules: the function tokenize_df no longer has these arguments, but tok and rules instead (see my screenshots).
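With the current fastai API, the call would presumably need to look something like this (a sketch only; WordTokenizer() is just a stand-in for the transformer tokenizer wrapper):

    from fastai.text.all import Tokenizer, WordTokenizer

    # tokenize_df now takes `tok` (a tokenizer instance) and `rules`
    # instead of `tok_func` and `post_rules`, and Tokenizer.from_df delegates to it
    fastai_tokenizer = Tokenizer.from_df(text_cols='text', tok=WordTokenizer(), rules=[])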

What do you think?

The following screenshot shows the error in your notebook 2020-04-24-fasthugs_language_model.ipynb that comes from recent changes in the fastai v2 libraries (I’m still searching for exactly where…):