FastHugs - fastai-v2 and HuggingFace Transformers

I have already tried to export the model.
I used a custom transformers model to get the logits like this:

import torch.nn as nn
from transformers import PreTrainedModel

class CustomTransformerModel(nn.Module):
    def __init__(self, transformer_model: PreTrainedModel):
        super(CustomTransformerModel, self).__init__()
        self.transformer = transformer_model

    def forward(self, input_ids):
        # Return only the logits from the transformer
        logits = self.transformer(input_ids)[0]
        return logits

and I have defined the learner as follows:

loss_func = nn.BCEWithLogitsLoss()
custom_transformer_model = CustomTransformerModel(transformer_model=bert_model)
from fastai.callbacks import *
learner = Learner(databunch, custom_transformer_model, loss_func=loss_func)

but when loading the model I get this error:

You need to declare what your custom model is before calling it, i.e. in a .py file you import or in a cell, so it can be referenced.


I don’t know how to define the custom model for the learner (without using the data)?

Just the architecture you’re using. That custom transformer model you defined earlier needs to be in that notebook.
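Something like this works (a minimal sketch, assuming the learner was exported with learner.export(); my_models.py is just a placeholder for wherever you keep the class, or you can simply re-run the cell that defines it):

from fastai.basic_train import load_learner  # fastai v1 import path

# re-run the cell that defines CustomTransformerModel, or import it
from my_models import CustomTransformerModel  # hypothetical module name

learn = load_learner(path, 'export.pkl')  # 'path' and the file name are placeholders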


Hi, I’m also working on trying to get multi-label text classification to work. This is what I have done so far, any help would be really appreciated :smiley:

My data looks like this:

text          label            is_valid
Lorem ipsum   tag_01 | tag_02  False

And I initialise the model and tokenizer with the following, plus a few modifications to your FastHugsTokenizer and FastHugsModel functions.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

transformer_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
transformer_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name)

Following the rest of your notebook works with single labels, but I’m having some issues trying to adapt it to handle multi-labels. I first get a list of all labels with

import itertools

a = [x.split('|') for x in df.label]
chain = itertools.chain(*a)
b = list(set(chain))      # unique labels

And then I create a dataset with

splits = ColSplitter()(df)
x_tfms = [attrgetter("text"), Tokenizer.from_df('text', fasthugstok), Numericalize(vocab=transformer_vocab)]
dsets = Datasets(df, splits=splits, tfms=[x_tfms, [ColReader('label', label_delim='|'), MultiCategorize(vocab=b)]], dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(bs=bs, device='cuda', before_batch=transformer_padding(transformer_tokenizer))

The problem is that although I do get something when I run dls.train_ds[0]

(TensorText([  102, 4078, 30952, 
...
           157, 12264,  3937, 103]),
 TensorMultiCategory([3]))

The dataloader is not working and I can’t start any training.

Could not do one pass in your dataloader, there is something wrong in it

My guess is that something is going on with tokenization and numericalize, as this works (with fastai’s own tokenization):

text_cols = ['text']
dsets = DataBlock(blocks=(TextBlock.from_df(text_cols), MultiCategoryBlock(vocab=b)),
                      get_x = [attrgetter('text')],
                      get_y = ColReader('label', label_delim='|'),
                      splitter = RandomSplitter(valid_pct=0.2),
                      dl_type = SortedDL,
                     )
bs = 16
dls = dsets.dataloaders(df, 
                        bs=bs, 
                        seq_len=80,
                        device='cuda', 
                        before_batch=transformer_padding(transformer_tokenizer),
                       )

Update:
This seems to work:

new_df = df.copy()
new_df = pd.concat([new_df, new_df['label'].str.get_dummies(sep='|')], axis=1)
x_tfms = [
    attrgetter('text'),
    Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok),
    Numericalize(vocab=transformer_vocab)
]

y_tfms = [
    ColReader(b),
    EncodedMultiCategorize(vocab=b)
]

dsets = Datasets(items=new_df,
                 tfms=[x_tfms, y_tfms],
                 splits=ColSplitter(col='is_valid')(new_df),
                 dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(
    bs=bs, 
#     device='cuda', 
    device='cpu',
    before_batch=transformer_padding(transformer_tokenizer),
)
...
opt_func = partial(Adam, decouple_wd=True)
cbs = [MixedPrecision(clip=0.1), SaveModelCallback()]
# loss = CrossEntropyLossFlat() #LabelSmoothingCrossEntropy
loss = nn.BCEWithLogitsLoss()
splitter = splitters[transformer_model.config.model_type]
learn = Learner(dls, 
                fasthugs_model, 
                opt_func=opt_func, 
                splitter=splitter, 
                loss_func=loss, 
                cbs=cbs, 
                metrics=[accuracy],
               )
learn.fit_one_cycle(3, lr_max=1e-2)

But training happens on the CPU. What might be the issue with CUDA here?
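For reference, this is the kind of sanity check I plan to run (standard PyTorch/fastai2 calls, nothing FastHugs-specific):

import torch

print(torch.cuda.is_available())      # must be True, otherwise everything silently falls back to CPU
dls = dls.to(torch.device('cuda'))    # fastai2 DataLoaders can be moved after creation
# or equivalently: learn.dls.cuda() before calling fit_one_cycle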

Thank you!
Regards


MLM Language Modelling with HuggingFace transformers - RoBERTa pre-training edition

With this you can fine-tune a RoBERTa model on your specific dataset before training it on a downstream task like sequence classification.

The main trick for me was the creation of an MLM Transform (MLMTokensLabels) that takes the numericalized input x, does the masking and outputs a tuple (x, y), where x has 15% of its tokens masked and y is the original input with the other 85% of its tokens masked out (so the loss is only computed on the masked 15%).

I have seen others use a Callback to do the masking, but by using Transforms I was able to use dls.show_batch to see the decoded inputs and targets.

The MLM transform is more or less a rewrite of the masking function used in HuggingFace’s “how to train a language model from scratch” tutorial.
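In rough pseudocode, the masking boils down to something like this (a simplified sketch of the HuggingFace-style logic rather than the exact FastHugs transform; it assumes a HF tokenizer tok and a 1-D tensor of token ids):

import torch

def mask_tokens(inputs, tok, mlm_prob=0.15):
    "Return (masked_inputs, labels): mask ~15% of tokens with an 80/10/10 split, ignore the rest in the loss."
    labels = inputs.clone()
    # choose ~15% of positions, never the special tokens
    prob_matrix = torch.full(labels.shape, mlm_prob)
    special_mask = torch.tensor(tok.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True), dtype=torch.bool)
    prob_matrix.masked_fill_(special_mask, value=0.0)
    masked_idxs = torch.bernoulli(prob_matrix).bool()
    labels[~masked_idxs] = -100                       # loss is only computed on the masked 15%

    # 80% of the chosen positions -> <mask>
    replace_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_idxs
    inputs[replace_mask] = tok.mask_token_id

    # half of the remaining 20% -> random token, the rest left unchanged
    random_mask = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_idxs & ~replace_mask
    random_words = torch.randint(len(tok), labels.shape, dtype=torch.long)
    inputs[random_mask] = random_words[random_mask]
    return inputs, labels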

I also had to override one line in the Datasets class, as it would otherwise try to make a tuple out of my tuple.

Blog post and code can be found here


Hi @morgan, the FastHugs repo looks cool. I found this repo named fast-bert, which was made for fastai v1; it makes training BERT with fastai v1 very simple.
For example:

import logging
import torch
from transformers import BertTokenizer
from fast_bert.data import BertDataBunch
from fast_bert.learner import BertLearner
from fast_bert.metrics import accuracy

logger = logging.getLogger()
device = torch.device('cuda')

metrics = [{'name': 'accuracy', 'function': accuracy}]
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
databunch = BertDataBunch(
                          [PATH_TO_DATA],[PATH_TO_LABELS],
                          tokenizer,
                          train_file=[TRAIN_CSV],
                          val_file=[VAL_CSV],
                          test_data=[TEST_CSV],
                          text_col=[TEST_FEATURE_COL],
                          label_col=[0],
                          bs=64,
                          maxlen=140,
                          multi_gpu=False,
                          multi_label=False)
learner = BertLearner.from_pretrained_model(
                                            databunch,
                                            'bert-base-uncased',
                                            metrics,
                                            device,
                                            logger,
                                            is_fp16=False,
                                            multi_gpu=False,
                                            multi_label=False)

learner.fit(3, lr=1e-2)

is all you need to train a BERT model with fastai v1. Is there a plan to make “FastHugs” for v2 what “fast-bert” was for v1? It would make training all the models in the transformers library really accessible with fastai2.


Thanks! No plans right now to extend FastHugs to include wrappers like BertDataBunch etc. My intention was more to demonstrate how to extend fastai2 by showing what transforms or callbacks can be used, similar to the Transformers tutorial in the fastai docs.

I realise that this makes things a little more difficult for beginners; the blurr library might be another good option, as it includes wrappers.

Having said that, you should be able to train a BERT-like model from scratch using the MLM notebook in FastHugs. (Note this notebook doesn’t include the Next Sentence Prediction training task that BERT also used, as subsequent researchers found that this task didn’t help performance; see the RoBERTa paper for more.)


@morgan have you tried training your own ByteLevelBPETokenizer from the Tokenizers library? I tried using the .encodes method of the tokenizer to adapt it to fastai’s tokenizer, but then I am unable to call next(iter(dls.train)). However, dls.one_batch() works, although I can’t use that while training a model in the Learner. Have you encountered such a problem before?

PS: I have also tried using the BERT tokenizer’s encode and encode_plus methods, but neither seems to work. I noticed in your notebook that you used the tokenize method instead. The BPE tokenizer from the tokenizers library does not have this tokenize method; it only has encode.
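For context, the API difference I mean (a quick sketch; vocab.json and merges.txt stand in for a trained tokenizer’s files):

from transformers import RobertaTokenizer
from tokenizers import ByteLevelBPETokenizer

hf_tok = RobertaTokenizer.from_pretrained('roberta-base')
print(hf_tok.tokenize("hello world"))                        # .tokenize -> list of string tokens

bpe_tok = ByteLevelBPETokenizer('vocab.json', 'merges.txt')  # placeholder files
enc = bpe_tok.encode("hello world")                          # .encode -> Encoding object
print(enc.tokens, enc.ids)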

Hmm, that’s annoying; I’m actually planning on training a ByteLevelBPETokenizer shortly!

Have you tried importing the RoBERTa tokenizer from the transformers library? It should be byte-level BPE. Not sure if that one includes the ability to train or not, though…

Sorry I can’t be much help!

I doubt it has the ability to train. I will check, though. I’m trying to do translation with a low-resource language which I strongly doubt RoBERTa has been trained on. I’ll keep trying to get it working and share my results. I also suspect that it’s a multiprocessing problem with fastai (I noticed you had the same conclusion in your notebook).
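For reference, this is roughly how I’m training the tokenizer itself (a sketch with the tokenizers library; corpus.txt and the output directory are placeholders):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],                # raw text for the low-resource language
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")    # recent tokenizers versions; older ones use .save(dir)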


Good to know. I’ll be trying to train it for Irish, also low-resource; however, XLM-R was trained on it. Maybe that might be worth a go if the language you’re training (or a cousin of it) is in XLM-R?


Hello @Morgan,

You did awesome work with FastHugs, wrapping Hugging Face BERT-like models into fastai v2. Congratulations!

I’m currently using your Language Modelling code in order to adapt my fine-tuning method (see post) for a generative model.

I have 2 questions about class FastHugsTokenizer() and class MLMTokensLabels(Transform):

  1. It allows creating a sequence of at most max_seq_len - 2 tokens (e.g. for RoBERTa and BERT: 512 - 2 = 510 tokens) from each text cell of the training and validation dataset. But what about the tokens after this limit, for a text of 1000 tokens for example? Are they thrown away?
  2. About the 15% of tokens that are masked (80%), changed to another token (10%) or left unchanged (10%): are they (re)created at each batch generation (within the DataLoaders), which would make this a kind of data augmentation technique, or are they always the same?

Note: about your code in class MLMTokensLabels(Transform) > def _replace_with_other() > random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long): it also allows the special tokens (<s>, </s>, <pad>, <unk>, <mask>) to replace one of the 15% of tokens that were chosen but not replaced by the <mask> token. Don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing a sequence to the model with the <pad> token inside, for example?

Correct. If I recall correctly, I don’t think the RoBERTa authors made an effort to use the remainder of the text samples; they just took the first 510 tokens.
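So for a 1000-token text the tail just gets dropped, roughly like this (a sketch with a generic HF tokenizer, transformer_tokenizer as earlier in the thread; text is a placeholder):

max_seq_len = 512
tokens = transformer_tokenizer.tokenize(text)[: max_seq_len - 2]   # leave room for <s> and </s>
ids = transformer_tokenizer.convert_tokens_to_ids(tokens)
ids = [transformer_tokenizer.cls_token_id] + ids + [transformer_tokenizer.sep_token_id]
# everything after the first 510 tokens is simply never seen by the model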

(Interestingly in the NLP Checklist paper (youtube) they point out that performance on the Quora Question Pairs (QQP) task suffers when the questions’ positions are swapped, with models tending to focus more on the first question. Maybe this chopping of text is related…)

Correct, I believe that is one of the advantages of MLM training.

That’s an excellent point! I don’t recall if the authors took that into account; that section of code was a rewrite of HuggingFace’s implementation. Maybe they accounted for it elsewhere, but if not, then their models were also trained like that. Well spotted. I would avoid the special tokens, yes, as it wouldn’t make sense, even for data augmentation, to be adding those tokens.
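Something along these lines would probably do it (a sketch, assuming a HF tokenizer tok and the labels tensor from that part of the transform; not the current FastHugs code):

import torch

# sample replacement ids only from non-special tokens
special_ids = set(tok.all_special_ids)
candidate_ids = torch.tensor([i for i in range(len(tok)) if i not in special_ids], dtype=torch.long)
idxs = torch.randint(len(candidate_ids), labels.shape, dtype=torch.long)
random_words = candidate_ids[idxs]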


Hi Morgan.
Do you plan to update your code with the Whole Word Masking technique?
Thanks.

Oh interesting, I’ll add it to my to do list!

Hi, I’m taking baby steps into the world of transformers here, so thank you for this repository! I had a question on configuring the _num_labels because it is not working in my case.

I am sending _num_labels as an argument as I initialize the model:

fasthugs_model = FastHugsModel(transformer_cls=model_class, config_dict=config_dict, n_class=fct_dls.c, pretrained=True)

And I just traced it with pdb, and I can see that the config’s _num_labels is indeed updated to 30 (the number of my classes):

-> if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
(Pdb) self.config
RobertaConfig {
  "_num_labels": 30,
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

However, when I print the model, the classification head still has two output features. Is there some other place where it is getting overwritten?

    )
    (classifier): RobertaClassificationHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (out_proj): Linear(in_features=768, out_features=2, bias=True)
    )
  )
)
(Pdb) q

Thanks!

Not sure if that is the source of the issue, but your config file says:

"architectures": [
    "RobertaForMaskedLM"
  ]

Shouldn’t it be RobertaForSequenceClassification if you have a classification task?


For anyone else who might face the same issue: it should be config.num_labels, not config._num_labels.
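i.e. something along these lines (a sketch with the standard transformers Auto* API; pretrained_model_name as earlier in the thread):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = 30   # num_labels, not _num_labels: this is what sizes the classification head
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=config)
# for RoBERTa-style models, model.classifier.out_proj now has out_features=30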


Thanks @shimsan, this solved the same problem I had.
