Text transform functions fail due to byte-level BPE from the RoBERTa tokenizer


I want to use fastai to train a text classifier with the Dutch RoBERTa model as the pretrained language model.

# Downloading the Dutch RoBERTa language model
from transformers import RobertaTokenizer, RobertaForSequenceClassification
dtokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
dmodel = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")

However, if I then follow the code on https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6 to use the RoBERTa model with fastai, I get beginning-of-sentence tokens nearly everywhere:

| ik itch . verbinding itch . |action > internetproblemslow|

when running:

fastai_tokenizer = Tokenizer(tok_func=FastAiRobertaTokenizer(dtokenizer, max_seq_len=256),
                             pre_rules=[replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces],
                             post_rules=[replace_all_caps, deal_caps])  # text.transform | fastai

processor = get_roberta_processor(tokenizer=fastai_tokenizer, vocab=fastai_roberta_vocab_cleand)
data = (RobertaTextList.from_df(df, ".", cols=feat_cols, processor=processor)
        .databunch(bs=4, pad_first=False, pad_idx=0))

This is due to RoBERTa's byte-level BPE tokenization, which puts a strange 'Ġ' character in front of most words:
Ik Ġprobeer Ġte Ġstreamen Ġop Ġtw itch . ĠNu Ġheeft Ġdat Ġeen Ġmaand Ġzonder Ġproblemen Ġgewerkt Ġmaar Ġde Ġlaatste Ġ3 Ġdagen Ġdrop Ġik Ġalleen Ġmaar Ġframes Ġdoor Ġde Ġnetwerk verbinding Ġtussen Ġmij Ġen Ġtw itch . ĠIk Ġheb Ġverschillende Ġservers Ġgeprobeerd Ġen Ġhetzelfde Ġblijft Ġgebeuren . ĠMijn Ġupload snelheid Ġzit Ġnormaal Ġrond Ġde Ġ12 ĠM bps , Ġnu Ġzit Ġik Ġaf Ġen Ġtoe Ġeen Ġsnelheid stest Ġte Ġdoen Ġen | action > internetproblemslow
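For anyone wondering where that character comes from: byte-level BPE first remaps every raw byte to a printable Unicode character so the vocabulary never contains control characters or raw whitespace. In the GPT-2/RoBERTa byte mapping, the space byte (0x20) lands on U+0120 ('Ġ'), so every token that was preceded by a space carries that prefix. Below is a minimal sketch of that mapping, mirroring the `bytes_to_unicode` helper from the GPT-2 reference code; it is an illustration, not the actual tokenizer internals:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character.

    Printable byte ranges are kept as-is; every other byte b is shifted to
    chr(256 + n), where n counts the non-printable bytes in order.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_to_char = bytes_to_unicode()
print(byte_to_char[ord(" ")])  # Ġ  -- the space byte maps to U+0120
print(byte_to_char[ord("a")])  # a  -- printable bytes are unchanged
```

So the 'Ġ' is not a beginning-of-sentence marker at all; it is just how a leading space is spelled inside the token vocabulary.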

Is there a way to avoid this strange 'Ġ' being added, so that the model doesn't treat all those words as the beginning of a sentence?


The Medium link you shared seems to be broken, so I'm not sure what's going on in FastAiRobertaTokenizer. But if I remember correctly, Ġ simply indicates a preceding whitespace, so it's not a problem and should be there. It should disappear from printed outputs by passing skip_special_tokens=True where the tokenizer's decode method is used.
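To illustrate: since Ġ is just the byte-level stand-in for a leading space, concatenating the tokens and mapping Ġ back to a space recovers the original text. That is roughly what the tokenizer's `convert_tokens_to_string`/`decode` methods do internally; this hand-rolled sketch avoids loading the model and only handles the space byte, whereas the real decode reverses the full byte-to-unicode table:

```python
# Tokens as the byte-level BPE produces them, with 'Ġ' marking a preceding space
tokens = ["Ik", "Ġprobeer", "Ġte", "Ġstreamen", "Ġop", "Ġtw", "itch", "."]

# Simplified reversal: concatenate, then turn each 'Ġ' back into a space
text = "".join(tokens).replace("Ġ", " ")
print(text)  # Ik probeer te streamen op twitch.
```

So the Ġ only looks alarming when you print the raw tokens; the decoded text is unaffected.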

And just in case: I'm using HuggingFace models with fastai and have a lightweight utility library for that; you can find a text-classification example here | fasthugs, and I can help with using it if needed. You can also try the blurr library here.

Thanks, I’ll try that.

The correct link, by the way, is:
Using RoBERTa with fast.ai for NLP | by Dev Sharma | Analytics Vidhya | Medium

That's a nice article, but it uses fastai v1. I'd recommend adapting its code to fastai v2, or going with one of the options I listed before. There is also the official transformers tutorial (Tutorial - Transformers | fastai) in case you missed it.