Hi all,
After having trouble getting decent results with the default fastai transformer for language modelling, I tried to integrate a pretrained transformer from huggingface into fastai, following this tutorial.
Since I am looking at language generation, I used the pretrained GPT2LMHeadModel.
I initialized a LanguageLearner with this model and, without any further training, tried to generate text with it.
However, this produces complete nonsense. I suspect this is due to tokenization.
These are the relevant snippets from my code:
gpt2_tok = GPT2Tokenizer.from_pretrained('gpt2')
fastai_tokenizer = Tokenizer(tok_func=FastAiGPT2Tokenizer(gpt2_tok), pre_rules=[], post_rules=[])
gpt2_transformer_vocab = TransformersVocab(tokenizer=gpt2_tok)
numericalize_processor = NumericalizeProcessor(vocab=gpt2_transformer_vocab)
tokenize_processor = TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False, include_eos=False)
transformer_processor = [OpenFileProcessor(), tokenize_processor, numericalize_processor]
Since a language learner needs a databunch to be initialized, I passed it the following one, which is created from a single file containing only a few lines of text.
data_lm = (TextList.from_folder(samples_path, processor=transformer_processor)
           .split_none()
           .label_for_lm()
           .databunch(bs=bs))
My prediction using
learner.predict("It had been a beatiful day and", 40, temperature=0.75)
generated
‘It had been a beatiful day and Disclaimer Ġof Ġthe Ġthe Ġfinal Ġfirst . 23 " ĠGL . Ċ Ċ , Ġand Ġgoing Ġby Ġdid Ġto Ġanimal , Ġa Ġsix Ġteachings Ġby , Ġand Ġmore Ġof Ġthe Ġsuper Ġtracking Ġin Ġthe Ġtimes Ġof Ġhis Ġto Ġtake Ġfor’
Has anyone got any ideas on this? I suspect the Ċ is a special token which is not translated back for some reason. But even ignoring Ċ, the text is nonsense, although the model is meant to be pretrained.
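For reference, Ġ and Ċ are not special tokens in the usual sense: GPT-2's byte-level BPE stores every byte as a printable character, and Ġ (U+0120) and Ċ (U+010A) are the stand-ins for the space and newline bytes. Below is a minimal, dependency-free sketch of how such tokens map back to plain text. It reproduces the byte table from GPT-2's published encoder; the names `BYTE_DECODER` and `decode_byte_level_tokens` are made up for illustration and are not part of fastai or transformers. In transformers itself, `GPT2Tokenizer.convert_tokens_to_string` (or `decode` on the ids) should do the same job. Note that this only accounts for the stray markers; it would not explain why the generated words themselves are incoherent.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2's byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace/control bytes are shifted past U+00FF,
            # e.g. space (32) -> 'Ġ' and newline (10) -> 'Ċ'.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_byte_level_tokens(tokens):
    """Hypothetical helper: join BPE tokens and undo the byte mapping."""
    text = "".join(tokens)  # tokens are concatenated WITHOUT separators
    raw = bytes(BYTE_DECODER[ch] for ch in text)
    return raw.decode("utf-8", errors="replace")


# A few tokens taken from the generated output above.
sample = ["Ġof", "Ġthe", "Ġfinal", ".", "Ċ", "Ċ", ",", "Ġand"]
print(repr(decode_byte_level_tokens(sample)))
```

The tokens in the prediction above are separated by spaces and still carry their Ġ prefixes, which suggests that the output is being built by joining the raw vocabulary strings rather than by going through the tokenizer's own decoding step.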
I’m new to fast.ai and I would really appreciate some advice!
Thanks a lot in advance,
David
P.S.: I have posted a different question about using fastai’s default Transformer here.