Hi,
I want to use fastai to train a text classifier, using the Dutch RoBERTa model as the pretrained language model.
# Download the Dutch RoBERTa language model
from transformers import RobertaTokenizer, RobertaForSequenceClassification
dtokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
dmodel = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")
However, if I then follow the code at https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6 to use the RoBERTa model with fastai, I get beginning-of-sentence tokens nearly everywhere:
| ik itch . verbinding itch . |action > internetproblemslow|
when running:
fastai_tokenizer = Tokenizer(tok_func=FastAiRobertaTokenizer(dtokenizer, max_seq_len=256),
                             pre_rules=[replace_rep, replace_wrep, spec_add_spaces, rm_useless_spaces],
                             post_rules=[replace_all_caps, deal_caps])  # rules from fastai.text.transform
processor = get_roberta_processor(tokenizer=fastai_tokenizer, vocab=fastai_roberta_vocab_cleand)
data = (RobertaTextList.from_df(df, ".", cols=feat_cols, processor=processor)
        .split_from_df(col='valid')
        .label_from_df(cols=label_cols, label_cls=CategoryList)
        .databunch(bs=4, pad_first=False, pad_idx=0))
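For reference, the FastAiRobertaTokenizer used above comes from that post; it is roughly the following (my paraphrase from memory, so details may differ slightly from the original):

from fastai.text import BaseTokenizer

class FastAiRobertaTokenizer(BaseTokenizer):
    # Wraps a Hugging Face RobertaTokenizer so fastai v1's Tokenizer can call it
    def __init__(self, tokenizer, max_seq_len=128, **kwargs):
        self._pretrained_tokenizer = tokenizer
        self.max_seq_len = max_seq_len
    def __call__(self, *args, **kwargs):
        return self
    def tokenizer(self, t):
        # RoBERTa expects <s> ... </s> around the subword tokens
        return ["<s>"] + self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2] + ["</s>"]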
This is due to the byte-level BPE tokenization from RoBERTa, which puts a strange 'Ġ' in front of most words:
Ik Ġprobeer Ġte Ġstreamen Ġop Ġtw itch . ĠNu Ġheeft Ġdat Ġeen Ġmaand Ġzonder Ġproblemen Ġgewerkt Ġmaar Ġde Ġlaatste Ġ3 Ġdagen Ġdrop Ġik Ġalleen Ġmaar Ġframes Ġdoor Ġde Ġnetwerk verbinding Ġtussen Ġmij Ġen Ġtw itch . ĠIk Ġheb Ġverschillende Ġservers Ġgeprobeerd Ġen Ġhetzelfde Ġblijft Ġgebeuren . ĠMijn Ġupload snelheid Ġzit Ġnormaal Ġrond Ġde Ġ12 ĠM bps , Ġnu Ġzit Ġik Ġaf Ġen Ġtoe Ġeen Ġsnelheid stest Ġte Ġdoen Ġen | action > internetproblemslow
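To illustrate, tokenizing one of my sentences directly with the dtokenizer loaded above (a minimal check):

# Ġ (U+0120) is the byte-level BPE marker for "this token follows a space";
# decoding joins the tokens back into normal text.
toks = dtokenizer.tokenize("Ik probeer te streamen op twitch.")
print(toks)
# ['Ik', 'Ġprobeer', 'Ġte', 'Ġstreamen', 'Ġop', 'Ġtw', 'itch', '.']
print(dtokenizer.convert_tokens_to_string(toks))
# Ik probeer te streamen op twitch.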
Is there a way to stop the tokenizer from putting this strange Ġ in front, so that it doesn't treat all those words as the beginning of a sentence?
Thanks,
Wendy