Training Transformer models

Has anyone had experience training the Transformer/Transformer-XL models? I tried using the Transformer for ULMFiT on the IMDB dataset and found that the frozen language model reached only about 1/3 of the validation accuracy of the AWD-LSTM. It may be that the model needs much longer to train, but has anyone been able to get good results using Transformers?



The pretrained model doesn’t use the same tokenization as fastai. It is GPT-1 from OpenAI, so you should use their tokenization.

Ah ok, how should I use their tokenization? And if I use a non-pretrained model, will the default fastai tokenization work?



If you use a non-pretrained model, any tokenization will work, yes. The tokenization mismatch is just the likely reason you got bad results with the pretrained one.
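
To make the mismatch concrete, here is a minimal sketch in plain Python. Both the vocabulary and the tokens are invented for illustration (they are not the real GPT-1 or fastai vocabularies), and `coverage` is a hypothetical helper. It only checks how many tokens from a word-level tokenizer appear verbatim in a subword vocabulary; what a library does with the missing ones is not modelled here.

```python
# Invented subword vocabulary for illustration only; NOT the real GPT-1 vocabulary.
pretrained_vocab = {"<unk>", "the</w>", "mov", "ie</w>", "was</w>", "great</w>"}

# Tokens roughly as a word-level tokenizer with fastai-style special tokens might emit them.
word_level_tokens = ["xxbos", "the", "movie", "was", "great"]

def coverage(tokens, vocab):
    """Return (fraction of tokens found verbatim in vocab, list of missing tokens)."""
    missing = [t for t in tokens if t not in vocab]
    return 1 - len(missing) / len(tokens), missing

frac, missing = coverage(word_level_tokens, pretrained_vocab)
print(f"covered: {frac:.0%}, missing: {missing}")
```

Any token without a match has no pretrained embedding to start from, so the pretrained weights give little benefit for it.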


Thanks for your help! A few last questions: does Transformer-XL use a different tokenization as well? If we want to use a pretrained Transformer/Transformer-XL, how would we go about it (or will fastai implement those in the future)? I’m assuming I would have to implement their tokenizers and wrap them in a BaseTokenizer?
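
A rough sketch of what such a wrapper could look like. To keep it self-contained, `BaseTokenizer` below is a local stand-in that mirrors the interface I believe fastai v1 exposes (`__init__(lang)`, `tokenizer(t)`, `add_special_cases(toks)`). `toy_subword_split` and `ExternalTokenizer` are hypothetical names, and the splitting rule is a placeholder, not GPT-1's actual BPE.

```python
from typing import Callable, Collection, List

class BaseTokenizer:
    """Local stand-in mirroring the assumed interface of fastai v1's BaseTokenizer."""
    def __init__(self, lang: str):
        self.lang = lang
    def tokenizer(self, t: str) -> List[str]:
        return t.split(" ")
    def add_special_cases(self, toks: Collection[str]):
        pass

def toy_subword_split(text: str) -> List[str]:
    """Hypothetical placeholder for an external subword tokenizer.

    Lowercases, splits on whitespace, cuts each word into chunks of at most
    4 characters, and marks the last chunk of each word with '</w>'.
    """
    pieces = []
    for word in text.lower().split():
        chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
        chunks[-1] += "</w>"
        pieces.extend(chunks)
    return pieces

class ExternalTokenizer(BaseTokenizer):
    """Exposes an external tokenization function through the BaseTokenizer interface."""
    def __init__(self, lang: str = "en", split_fn: Callable[[str], List[str]] = toy_subword_split):
        super().__init__(lang)
        self.split_fn = split_fn
    def tokenizer(self, t: str) -> List[str]:
        return self.split_fn(t)
    def add_special_cases(self, toks: Collection[str]):
        # The external tokenizer owns its vocabulary, so nothing is registered here.
        pass

tok = ExternalTokenizer("en")
print(tok.tokenizer("Transformers are great"))
```

As far as I know, fastai v1 takes the class itself (not an instance) as `tok_func` in `Tokenizer` and constructs it with the language code, which is why `lang` is the first argument here. Worth checking against the installed version. The vocabulary would also have to be the pretrained model's own, not one rebuilt from the dataset.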



There is no pretrained Transformer-XL model yet. We’ll be releasing one soon, and it will use the default fastai tokenization.


Hey @sgugger,

Is the usage of GPT-1 documented in a notebook in docs_src so I can see a working example? (If not, I’d love to volunteer.)

Does fastai provide a GPT-1 tokenizer?
I could not locate one in text.transform.

How would one go about using the GPT-1 tokenizer along with the default pretrained model for Transformer to train a language model?



No, it’s not documented anywhere, and no one has suggested a working example AFAICT. There is no GPT-1 tokenizer, although we have SentencePiece with BPE.
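
For anyone unfamiliar with BPE, here is a toy sketch of how merges are learned, in plain Python. It is a simplified illustration, not SentencePiece's or GPT-1's implementation. `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are choices made for this example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = learn_bpe(["lower", "lowest", "low", "low"], num_merges=2)
print(merges)
```

The learned merge list is what ties a BPE tokenizer to a particular model: a pretrained model expects text split with the same merges it was trained with.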


Hey @sgugger,

Thanks for the response.
If I understand correctly you’re referring to this

What I’m trying to understand is as follows:

Throughout his lectures, Jeremy shows a default way of doing a lot of things: approaches that work out of the box because they’ve been fine-tuned.

I’m trying to build a language model and build a host of classifiers on top of it.
The default setup for a language model is the AWD_LSTM.

However, if I want to try out the Transformer architecture, it seems (from the discussion above) that the default SpacyTokenizer will not give the best results.
So, what Tokenizer should I be using?
I’m not doing neural translation; I want to build a domain-specific language model (as Jeremy does in the course for the IMDB classifier) using the Transformer architecture.