I don’t know how to define the custom model for the learner (without using data)?
Just the architecture you’re using. That custom transformer model you defined earlier needs to be in that notebook.
Hi, I’m also working on trying to get multi-label text classification to work. This is what I have done so far; any help would be really appreciated.
My data looks like this:
| text | label | is_valid |
|---|---|---|
| Lorem ipsum | tag_01\|tag_02 | False |
And I initialise the model and tokenizer with the following, plus a few modifications to your FastHugsTokenizer and FastHugsModel functions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
transformer_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
transformer_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name)
Following the rest of your notebook works with single labels, but I’m having some issues trying to adapt it to handle multi-labels. I first get a list of all labels with
import itertools

a = [x.split('|') for x in df.label]
chain = itertools.chain(*a)
b = list(set(chain))
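# b is now the deduplicated list of all individual tags, e.g. ['tag_01', 'tag_02', ...]
# (note: set() has no guaranteed order, so b can differ between runs)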
And then I create a dataset with
splits = ColSplitter()(df)
x_tfms = [attrgetter("text"), Tokenizer.from_df('text', fasthugstok), Numericalize(vocab=transformer_vocab)]
dsets = Datasets(df, splits=splits, tfms=[x_tfms, [ColReader('label', label_delim='|'), MultiCategorize(vocab=b)]], dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(bs=bs, device='cuda', before_batch=transformer_padding(transformer_tokenizer))
The problem is that although I do get something when I run dls.train_ds[0]
(TensorText([ 102, 4078, 30952,
...
157, 12264, 3937, 103]),
TensorMultiCategory([3]))
The dataloader is not working, though, and I can’t start any training; it fails with:
Could not do one pass in your dataloader, there is something wrong in it
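For what it’s worth, a quick way to surface the underlying error in cases like this is to pull one sample and one batch by hand (a debugging sketch reusing the dsets/dls objects above, not something from the original thread):

```python
# inspect one raw (numericalized) sample from the training set
x, y = dsets.train[0]
print(type(x), x.shape, y)

# try to collate a single batch; this usually re-raises the real error
# with a full traceback instead of the generic fastai message
xb, yb = dls.train.one_batch()
print(xb.shape, yb.shape)
```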
My guess is that something is going on with tokenization and numericalize, as this works (with fastai’s own tokenization):
text_cols = ['text']
dsets = DataBlock(blocks=(TextBlock.from_df(text_cols), MultiCategoryBlock(vocab=b)),
get_x = [attrgetter('text')],
get_y = ColReader('label', label_delim='|'),
splitter = RandomSplitter(valid_pct=0.2),
dl_type = SortedDL,
)
bs = 16
dls = dsets.dataloaders(df,
bs=bs,
seq_len=80,
device='cuda',
before_batch=transformer_padding(transformer_tokenizer),
)
Update:
This seems to work:
new_df = df.copy()
new_df = pd.concat([new_df, new_df['label'].str.get_dummies(sep='|')], axis=1)
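# (illustration, not part of the original post: get_dummies turns the pipe-delimited
#  label column into one 0/1 indicator column per tag, e.g. "tag_01|tag_02" becomes
#  tag_01=1, tag_02=1, which is the encoded form EncodedMultiCategorize below expects)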
x_tfms = [
attrgetter('text'),
Tokenizer.from_df(text_cols='text', res_col_name='text', tok_func=fasthugstok),
Numericalize(vocab=transformer_vocab)
]
y_tfms = [
ColReader(b),
EncodedMultiCategorize(vocab=b)
]
dsets = Datasets(items=new_df,
tfms=[x_tfms, y_tfms],
splits=ColSplitter(col='is_valid')(new_df),
dl_type=SortedDL)
bs = 16
dls = dsets.dataloaders(
bs=bs,
# device='cuda',
device='cpu',
before_batch=transformer_padding(transformer_tokenizer),
)
...
opt_func = partial(Adam, decouple_wd=True)
cbs = [MixedPrecision(clip=0.1), SaveModelCallback()]
# loss = CrossEntropyLossFlat() #LabelSmoothingCrossEntropy
loss = nn.BCEWithLogitsLoss()
splitter = splitters[transformer_model.config.model_type]
learn = Learner(dls,
fasthugs_model,
opt_func=opt_func,
splitter=splitter,
loss_func=loss,
cbs=cbs,
metrics=[accuracy_multi],  # accuracy_multi (not accuracy) for multi-label, one-hot targets
)
learn.fit_one_cycle(3, lr_max=1e-2)
But training happens on the CPU. What might be the issue with CUDA here?
Thank you!
Regards
MLM Language Modelling with HuggingFace transformers - RoBERTa pre-training edition
With this you can fine-tune a RoBERTa model on your specific dataset before training it on a downstream task like sequence classification.
The main trick for me was the creation of an MLM Transform (MLMTokensLabels) that takes the numericalized input x, does the masking and outputs a tuple (x, y), where x has 15% of its tokens masked and y is the original input with the other 85% of its tokens masked out, so the loss is only computed on the masked positions.
I have seen others use a Callback to do the masking here, but by using Transforms I was able to use dls.show_batch to see the decoded inputs and targets.
The MLM transform is more or less a rewrite of the masking function used in HuggingFace’s “How to train a language model from scratch” tutorial.
I also had to overwrite one line in the Datasets class, as it would try and make a tuple out of my tuple.
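For readers who want to see the shape of that logic, here is a minimal sketch of such a masking Transform (a hypothetical class, not the FastHugs MLMTokensLabels code; it assumes a 1-D tensor of token ids and that -100 is the ignore index used by the loss):

```python
import torch
from fastcore.transform import Transform

class MaskedLMTokensLabels(Transform):
    "Hypothetical re-implementation of the 80/10/10 masking logic described above."
    def __init__(self, vocab_sz, mask_tok_id, special_tok_ids, mlm_prob=0.15):
        self.vocab_sz, self.mask_tok_id = vocab_sz, mask_tok_id
        self.special_tok_ids, self.mlm_prob = set(special_tok_ids), mlm_prob

    def encodes(self, x):
        x, labels = x.clone(), x.clone()
        # choose ~15% of the non-special tokens as prediction targets
        probs = torch.full(x.shape, self.mlm_prob)
        is_special = torch.tensor([int(t) in self.special_tok_ids for t in x])
        probs.masked_fill_(is_special, 0.)
        chosen = torch.bernoulli(probs).bool()
        labels[~chosen] = -100                                  # ignored by the MLM loss
        # 80% of the chosen tokens become the <mask> token
        masked = torch.bernoulli(torch.full(x.shape, 0.8)).bool() & chosen
        x[masked] = self.mask_tok_id
        # half of the remainder (10% overall) become a random token, the rest stay unchanged
        rand = torch.bernoulli(torch.full(x.shape, 0.5)).bool() & chosen & ~masked
        x[rand] = torch.randint(self.vocab_sz, x.shape, dtype=x.dtype)[rand]
        return x, labels
```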
Hi @morgan, the FastHugs repo looks cool. I found this repo named fast-bert, which was made for fastai v1; it makes training BERT with fastai v1 very simple.
For example:
import logging
import torch
from transformers import BertTokenizer
from fast_bert.data import BertDataBunch
from fast_bert.learner import BertLearner
from fast_bert.metrics import accuracy
device = torch.device('cuda')
logger = logging.getLogger()  # fast-bert expects a logger instance
metrics = [{'name': 'accuracy', 'function': accuracy}]
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
databunch = BertDataBunch(
[PATH_TO_DATA],[PATH_TO_LABELS],
tokenizer,
train_file=[TRAIN_CSV],
val_file=[VAL_CSV],
test_data=[TEST_CSV],
text_col=[TEST_FEATURE_COL],
label_col=[0],
bs=64,
maxlen=140,
multi_gpu=False,
multi_label=False)
learner = BertLearner.from_pretrained_model(
databunch,
'bert-base-uncased',
metrics,
device,
logger,
is_fp16=False,
multi_gpu=False,
multi_label=False)
learner.fit(3, lr=1e-2)
is all you need to train a BERT model with fastai v1. Is there a plan to make “FastHugs” for v2 what “fast-bert” was for v1? It would make training all the models in the transformers library really accessible with fastai2.
Thanks! No plans right now to extend FastHugs to include wrappers like BertDataBunch etc. My intention was more to demonstrate how to extend fastai2 by showing what transforms or callbacks can be used, similar to the Transformers tutorial in the fastai docs.
I realise that this makes things a little more difficult for beginners; the blurr library might be another good option, as it includes wrappers.
Having said that, you should be able to train a BERT-like model from scratch using the MLM notebook in FastHugs. (Note this notebook doesn’t include the Next Sentence Prediction training task that BERT also used, as subsequent researchers found that this task didn’t help performance; see the RoBERTa paper for more.)
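For anyone going that route, instantiating a (small) RoBERTa-style model from scratch with the transformers API looks roughly like this; the hyperparameters below are illustrative, not the notebook’s:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=30_000,             # must match your tokenizer
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=514,   # RoBERTa reserves two positions for special tokens
)
model = RobertaForMaskedLM(config)  # randomly initialised, ready for MLM pre-training
```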
@morgan have you tried training your own ByteLevelBPETokenizer from the Tokenizers library? I tried using the .encodes method of the tokenizer to adapt it to fastai’s tokenizer, but then I’m unable to call next(iter(dls.train)). However, dls.one_batch() works, but I can’t use that while training a model in the Learner. Have you encountered such a problem before?
PS: I have also tried using the BERT tokenizer’s encode and encode_plus methods, but none seem to work. I noticed in your notebook that you used the tokenize method instead. The BPE tokenizer from the tokenizers library does not have this tokenize method, it just has encode.
Hmm, that’s annoying; I’m actually planning on training a ByteLevelBPETokenizer shortly!
Have you tried importing the RoBERTa tokenizer from the transformers library? It should be byte-level BPE. Not sure if this one includes the ability to train or not tho…
Sorry I can’t be much help!
I doubt it has the ability to train. I will check though. I’m trying to do translation with a low-resource language which I strongly doubt RoBERTa has been trained on. I’ll keep trying to get it working and share my results. I’m also suspecting that it’s a multiprocessing problem with fastai (I noticed you had the same conclusion in your notebook).
Good to know. I’ll be trying to train it for Irish, also low resource; however, XLM-R was trained on it. Maybe that might be worth a go if the language you’re training (or a cousin of it) is in XLM-R?
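For reference, the standalone tokenizers library does let you train a ByteLevelBPETokenizer from scratch. A minimal sketch (the file and directory names are illustrative, and it assumes a reasonably recent tokenizers release):

```python
from tokenizers import ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer()
tok.train(
    files=["my_corpus.txt"],     # hypothetical plain-text corpus
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tok.save_model("tokenizer_dir")  # writes vocab.json and merges.txt
```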
Hello @Morgan,
You did awesome work with FastHugs, wrapping Hugging Face BERT-like models into fastai v2. Congratulations!
I’m currently using your masked language modelling code to adapt my fine-tuning method (see post) for a generative model.
I have 2 questions about class FastHugsTokenizer() and class MLMTokensLabels(Transform):
- it allows creating a sequence of up to max_seq_len - 2 tokens (e.g. for RoBERTa and BERT: 512 - 2 = 510 tokens) from each text cell of the training and validation dataset. But what about the tokens after this limit, for a text of 1000 tokens for example? Are they thrown away?
- and about the 15% of tokens that are masked (80%), changed to another token (10%) or left unchanged (10%): are they (re)created at each batch generation (within the Dataloaders), which would be a kind of data augmentation technique, or are they always the same?
Note: about your code in class MLMTokensLabels(Transform) > def _replace_with_other() > random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long): it also allows the special tokens (<s>, </s>, <pad>, <unk>, <mask>) to replace one of the 15% of tokens chosen but not replaced by the <mask> token. Don’t you think it would be better not to allow the special tokens here? What would be the meaning of passing a sequence to the model with the <pad> token inside, for example?
Correct. If I recall correctly I don’t think the Roberta authors made an effort to use the remainder of the text samples, they just took the first 510.
(Interestingly in the NLP Checklist paper (youtube) they point out that performance on the Quora Question Pairs (QQP) task suffers when the questions’ positions are swapped, with models tending to focus more on the first question. Maybe this chopping of text is related…)
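(With the HuggingFace tokenizer that truncation is explicit; a small sketch, assuming a reasonably recent transformers version and reusing the transformer_tokenizer from earlier in the thread:)

```python
# assumes `transformer_tokenizer` from earlier in the thread
enc = transformer_tokenizer(
    "a very long document ...",  # placeholder text
    truncation=True,             # anything beyond max_length is simply dropped
    max_length=512,              # 510 content tokens + 2 special tokens
)
print(len(enc["input_ids"]))     # <= 512
```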
Correct, the 15% is re-drawn at each batch; I believe that is one of the advantages of MLM training.
That’s an excellent point! I don’t recall if the authors took that into account. That section of code was a rewrite of HuggingFace’s implementation; maybe they accounted for it elsewhere, but if not then their models are also trained like that. Well spotted. Yep, I would avoid the special tokens, as it wouldn’t make sense, even for data augmentation, to be adding those tokens I would say.
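One possible way to exclude the special tokens when drawing the random replacements (a sketch, not the FastHugs code; it assumes the HuggingFace tokenizer exposes all_special_ids):

```python
import torch

def random_nonspecial_tokens(shape, vocab_sz, special_ids):
    "Sample random token ids uniformly while excluding the special tokens."
    allowed = torch.tensor([i for i in range(vocab_sz) if i not in set(special_ids)])
    return allowed[torch.randint(len(allowed), shape)]

# e.g. inside a _replace_with_other-style step, instead of
#   random_words = torch.randint(len(self.tok), labels.shape, dtype=torch.long)
# one could use
#   random_words = random_nonspecial_tokens(labels.shape, len(self.tok), tokenizer.all_special_ids)
```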
Hi Morgan.
Do you plan to update your code with the Whole Word Masking technique?
Thanks.
Oh interesting, I’ll add it to my to do list!
Hi, I’m taking baby steps into the world of transformers here, so thank you for this repository! I had a question on configuring _num_labels, because it is not working in my case.
I am passing _num_labels as an argument when I initialize the model:
fasthugs_model = FastHugsModel(transformer_cls=model_class, config_dict=config_dict, n_class=fct_dls.c, pretrained=True)
I just traced it with pdb, and can see that the config’s _num_labels is indeed updated to 30 (the number of my classes):
-> if pretrained: self.transformer = transformer_cls.from_pretrained(model_name, config=self.config)
(Pdb) self.config
RobertaConfig {
"_num_labels": 30,
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 50265
}
However, when I print the model, the classification head still has two output features. Is there some other place where it is getting overwritten?
)
(classifier): RobertaClassificationHead(
(dense): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(out_proj): Linear(in_features=768, out_features=2, bias=True)
)
)
)
(Pdb) q
Thanks!
Not sure if that is the source of the issue, but your config file says:
"architectures": [
"RobertaForMaskedLM"
]
Shouldn’t it be RobertaForSequenceClassification if you have a classification task?
For anyone else who might face the same issue: it should be config.num_labels, not config._num_labels.
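For example, with the plain transformers API (the model name and label count are illustrative), num_labels is what resizes the classification head:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("roberta-base", num_labels=30)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", config=config)
print(model.classifier.out_proj)  # Linear(in_features=768, out_features=30, bias=True)
```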
Thanks @shimsan, this solved the same problem I had.
Hi @morgan, I appreciate your help with the Roberta example. I have a few questions about the code. Thanks for your time and help.
- When saving the model, is using TextLearner.save_encoder more appropriate? I am guessing it saves the unique vocabulary in the tokenizer too.
- What part of tokenization updates the vocabulary with new words? I am doing language modelling on domain-specific text.
- Is it possible to change the sequence length for the batching process? I understand 510 is the max sequence length for the model, but is it possible to keep this 510 sequence length for the model while shortening the batch sequence length to reduce memory?
Hey @nickgeoca
- save_encoder will only save the model, not the optimizer, vocab, dls etc.
- The tokenizer isn’t trained in this version. You’d have to train your own tokenizer if your text distribution is very different from the text RoBERTa was trained on. If you’re using the pre-trained model, you’d also have to modify the embedding layer to account for this difference in vocab (see the sketch below).
- No need to change the batch length: SortedDL cleverly sorts batches according to the length of the sequences (roughly), and the padding function only pads to the length of the longest sequence, so it should already be pretty efficient in that regard!
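A sketch of the embedding-resizing step mentioned in the second point, using the transformers API; the added tokens below are purely illustrative:

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# add domain-specific tokens (purely illustrative), then grow the embedding
# matrix so the model can handle the new ids; the new rows are randomly
# initialised and get learned during fine-tuning
num_added = tokenizer.add_tokens(["myocarditis", "angioplasty"])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```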