Integrating pretrained Hugging Face transformers for language modelling

Hi all,

After having problems getting decent results with the default fastai transformer for language modelling, I tried integrating a pretrained transformer from Hugging Face into fastai, following this tutorial.
Since I am looking at language generation, I used the pretrained GPT2LMHeadModel.
I initialized a LanguageLearner with this model and, without further training, tried to generate text with it.
However, this produces complete nonsense, which I suspect is due to tokenization.

These are the relevant snippets from my code:

gpt2_tok = GPT2Tokenizer.from_pretrained('gpt2')
fastai_tokenizer = Tokenizer(tok_func=FastAiGPT2Tokenizer(gpt2_tok), pre_rules=[], post_rules=[])
gpt2_transformer_vocab = TransformersVocab(tokenizer = gpt2_tok)

numericalize_processor = NumericalizeProcessor(vocab=gpt2_transformer_vocab)
tokenize_processor = TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False, include_eos=False)
transformer_processor = [OpenFileProcessor(), tokenize_processor, numericalize_processor]
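For reference, the FastAiGPT2Tokenizer wrapper used above is an adapter exposing fastai v1's BaseTokenizer interface (a `tokenizer(text) -> list[str]` method) on top of the Hugging Face tokenizer. Here is a rough sketch of its shape, with a trivial whitespace tokenizer standing in for GPT2Tokenizer so the snippet runs on its own:

```python
class StubGPT2Tokenizer:
    """Stand-in for GPT2Tokenizer, purely for illustration."""
    def tokenize(self, text):
        return text.split()

class FastAiGPT2Tokenizer:
    """Adapter exposing the `tokenizer(text)` method that fastai's
    Tokenizer expects, delegating to a pretrained Hugging Face
    tokenizer's own `tokenize`."""
    def __init__(self, pretrained_tok):
        self._tok = pretrained_tok

    def tokenizer(self, text):
        return self._tok.tokenize(text)

tok = FastAiGPT2Tokenizer(StubGPT2Tokenizer())
print(tok.tokenizer("hello world"))  # ['hello', 'world']
```

The real class in the tutorial wraps the actual GPT2Tokenizer and also handles special tokens, but the delegation pattern is the same.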

Since a language learner needs a databunch to be initialized, I passed it the following databunch, which is created from a single file with only a few lines of text.

data_lm = (TextList.from_folder(samples_path, processor=transformer_processor)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=4))

My prediction using

learner.predict('It had been a beautiful day and', 40, temperature=0.75)

returns:

'It had been a beautiful day and Disclaimer Ġof Ġthe Ġthe Ġfinal Ġfirst . 23 " ĠGL . Ċ Ċ , Ġand Ġgoing Ġby Ġdid Ġto Ġanimal , Ġa Ġsix Ġteachings Ġby , Ġand Ġmore Ġof Ġthe Ġsuper Ġtracking Ġin Ġthe Ġtimes Ġof Ġhis Ġto Ġtake Ġfor'

Has anyone got any ideas on this? I suspect Ċ is a special token that is not translated back for some reason. But even ignoring Ċ, the text is nonsense, although the model is meant to be pretrained.
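(For context on where Ġ and Ċ come from: GPT-2 uses byte-level BPE, which maps every raw byte to a printable Unicode character; bytes that aren't printable ASCII/Latin-1, including space and newline, are shifted into code points above 255, so a space becomes Ġ and a newline becomes Ċ. The markers leak through when tokens are joined with spaces instead of being decoded by the tokenizer itself. A self-contained re-implementation of the relevant mapping — the real table lives in GPT2Tokenizer's byte encoder — illustrates this:)

```python
def bytes_to_unicode():
    """Sketch of GPT-2's byte-to-unicode alphabet (for illustration)."""
    # Bytes kept as-is: printable ASCII and most of Latin-1.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    mapping = {b: chr(b) for b in keep}
    n = 0
    for b in range(256):
        if b not in mapping:
            # Everything else (space, newline, control bytes, ...)
            # is shifted into printable code points above 255.
            mapping[b] = chr(256 + n)
            n += 1
    return mapping

enc = bytes_to_unicode()
print(enc[ord(" ")])   # 'Ġ' -- a leading space inside a GPT-2 token
print(enc[ord("\n")])  # 'Ċ' -- a newline

# Reversing the table turns the garbled tokens back into text:
dec = {v: k for k, v in enc.items()}
tokens = ["Ġof", "Ġthe", "Ċ"]
text = bytes(dec[ch] for tok in tokens for ch in tok).decode("utf-8")
print(repr(text))      # ' of the\n'
```

This suggests the generated tokens should be joined and decoded through the tokenizer's own string conversion rather than fastai's default space-joining.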

I’m new to fastai and would really appreciate some advice!
Thanks a lot in advance,


P.S.: I have posted a different question about using fastai’s default Transformer here.

Don’t know if you saw it, but Sylvain released a GPT2 tutorial for fastai v2 here:

Does anyone have an example of a simple multi-category classifier building on Sylvain’s GPT2 tutorial?

I think Jeremy is working on something, but in the meantime you can have a look at for some fastai + Transformers demos.