Text transform functions fail due to byte-level BPE from the RoBERTa tokenizer

Hi,

I want to use fastai to train a text classifier using the Dutch Roberta model as the pretrained language model.

#Downloading the Dutch Roberta language model
from transformers import RobertaTokenizer, RobertaForSequenceClassification
dtokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
dmodel = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")

However, when I then follow the code on https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6 to use the RoBERTa model with fastai, I get beginning-of-sentence tokens nearly everywhere:

| ik itch . verbinding itch . |action > internetproblemslow|

when running:

fastai_tokenizer = Tokenizer(tok_func=FastAiRobertaTokenizer(dtokenizer, max_seq_len=256),
                             pre_rules=[replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces],
                             post_rules=[replace_all_caps, deal_caps])  # rules from fastai text.transform

processor = get_roberta_processor(tokenizer=fastai_tokenizer, vocab=fastai_roberta_vocab_cleand)
data = (RobertaTextList.from_df(df, ".", cols=feat_cols, processor=processor)
        .split_from_df(col='valid')
        .label_from_df(cols=label_cols, label_cls=CategoryList)
        .databunch(bs=4, pad_first=False, pad_idx=0))

This is due to the byte-level BPE tokenization from RoBERTa, which puts a strange 'Ġ' in front of most words:
Ik Ġprobeer Ġte Ġstreamen Ġop Ġtw itch . ĠNu Ġheeft Ġdat Ġeen Ġmaand Ġzonder Ġproblemen Ġgewerkt Ġmaar Ġde Ġlaatste Ġ3 Ġdagen Ġdrop Ġik Ġalleen Ġmaar Ġframes Ġdoor Ġde Ġnetwerk verbinding Ġtussen Ġmij Ġen Ġtw itch . ĠIk Ġheb Ġverschillende Ġservers Ġgeprobeerd Ġen Ġhetzelfde Ġblijft Ġgebeuren . ĠMijn Ġupload snelheid Ġzit Ġnormaal Ġrond Ġde Ġ12 ĠM bps , Ġnu Ġzit Ġik Ġaf Ġen Ġtoe Ġeen Ġsnelheid stest Ġte Ġdoen Ġen | action > internetproblemslow
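For reference, the same behaviour can be reproduced directly on the HuggingFace tokenizer (a minimal sketch; the exact token splits depend on the RobBERT vocabulary, so the printed tokens are only illustrative):

# Sketch: inspect the raw byte-level BPE output of the tokenizer loaded above
toks = dtokenizer.tokenize("Ik probeer te streamen op twitch.")
print(toks)
# e.g. ['Ik', 'Ġprobeer', 'Ġte', 'Ġstreamen', 'Ġop', 'Ġtw', 'itch', '.']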

Is there a way to avoid this strange Ġ being put in front, so that it doesn't treat all those words as the beginning of a sentence?

Thanks,
Wendy

The Medium link you shared seems to be broken, so I'm not sure what's going on in FastAiRobertaTokenizer. But if I remember correctly, Ġ simply indicates a whitespace, so it's not a problem and it should be there. It should disappear from printed outputs by passing skip_special_tokens=True wherever the tokenizer's decode method is used.
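For example, something along these lines should give readable text back (a minimal sketch, assuming dtokenizer is the RobertaTokenizer you loaded above):

toks = dtokenizer.tokenize("Ik probeer te streamen op twitch.")
# Ġ only marks "this token was preceded by a space"; joining the tokens restores the text
print(dtokenizer.convert_tokens_to_string(toks))

ids = dtokenizer.encode("Ik probeer te streamen op twitch.")
# decode() reverses the byte-level BPE, so no Ġ appears in the output;
# skip_special_tokens=True additionally drops <s> and </s>
print(dtokenizer.decode(ids, skip_special_tokens=True))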

And just in case: I'm using HuggingFace models with fastai and have a lightweight utility lib for that; you can find an example for text classification here: fasthugs, and I can help with using it if needed. You can also try the blurr library here.

Thanks, I’ll try that.

By the way, the correct link is:
[Using RoBERTa with fast.ai for NLP | by Dev Sharma | Analytics Vidhya | Medium](https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6)

That's a nice article, but it uses fastai v1. I'd recommend adapting its code to fastai v2, or going with one of the options I listed before. There is also the official transformers tutorial (Tutorial - Transformers | fastai) in case you missed it.
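If it helps, the core of that tutorial is wrapping a HuggingFace tokenizer in a fastai v2 Transform, roughly like this (a sketch from memory, so check the tutorial for the exact code; the max_length value is only an example):

from fastai.text.all import *

class TransformersTokenizer(Transform):
    "Wrap a HuggingFace tokenizer so fastai v2 can use it in a pipeline"
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        # text -> token ids as a tensor (byte-level BPE happens inside encode)
        return tensor(self.tokenizer.encode(x, truncation=True, max_length=256))
    def decodes(self, x):
        # token ids -> readable string; no Ġ shows up in the decoded output
        return TitledStr(self.tokenizer.decode(x.cpu().numpy(), skip_special_tokens=True))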