Fastai integration with huggingface pytorch-transformers?

Has anyone used the huggingface pytorch-transformers repo?

They have SOTA language models, including the very recent XLNet.

How difficult would it be to integrate their models for compatibility with fastai? It looks like even right now Transformer support in Fastai is still very early. Thanks.


Hi, I recently integrated fastai with BERT (using huggingface’s pretrained models). I will write a Medium article and share the notebook as soon as possible. Let me know if that helps.



Hey, I just posted a link to my Medium article in the “Share your work here” section.

Here is the link:


Thank you for the implementation!

You used the “old” pytorch_pretrained_bert library instead of the new pytorch_transformers one. There is a breaking change, where model outputs are now tuples.
They give the instructions to just do the following:

# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in pytorch-transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

How would you integrate this change with your existing notebook? Would one have to write a custom to achieve this? I can’t figure it out.

Your notebook gives me a 404 when I try the link. Can you please re-post the link here?

Oh, I will check the links in the article shortly. Meanwhile, here is the Kaggle link:

Hey Fabian

Yes, it so happened that just as I finished my project, they changed from pytorch_pretrained_bert to pytorch_transformers.

So I haven’t tried this on the newer version. In a few days I will see whether it can be done easily on the newer version as well.



Has anybody tried pytorch-transformers to build a QA model on a custom dataset?

I have started using it on custom datasets for sentiment analysis. I just finished the IMDB movie review one and will put the code on GitHub soon.


I posted this article, Running Pytorch-Transformers on Custom Datasets, which includes the code and other details.


Hello Fabian,

I think this article written by David Zhao may help you!

He found an easy solution to make the new pytorch_transformers library compatible with fastai.

He also made a notebook that you can find here.
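
For readers who cannot open the article, the general shape of such a fix can be sketched as follows. This is only an illustration under stated assumptions, not necessarily what the linked notebook does: fastai expects a model whose forward pass returns a single tensor, so a thin wrapper can pick the first element out of the tuple that pytorch_transformers models return (when no labels are passed to a classification model, that first element is the logits). The names FirstOutputOnly and TupleReturningStub are hypothetical, and the stub only stands in for a real pretrained model so the snippet runs without downloading weights.

```python
import torch
from torch import nn


class FirstOutputOnly(nn.Module):
    """Hypothetical wrapper: forwards input_ids and returns only outputs[0]."""

    def __init__(self, transformer_model: nn.Module):
        super().__init__()
        self.transformer = transformer_model

    def forward(self, input_ids):
        outputs = self.transformer(input_ids)
        # The wrapped model returns a tuple; without labels, the logits come first.
        return outputs[0]


class TupleReturningStub(nn.Module):
    """Stand-in for a pretrained classifier (assumption, for offline use only)."""

    def __init__(self, vocab_size=100, hidden=8, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        hidden_states = self.embed(input_ids)                 # (batch, seq, hidden)
        logits = self.classifier(hidden_states.mean(dim=1))   # (batch, num_labels)
        return (logits, hidden_states)                        # tuple, like the new library


model = FirstOutputOnly(TupleReturningStub())
input_ids = torch.randint(0, 100, (4, 16))  # fake batch: 4 sequences of 16 token ids
logits = model(input_ids)
print(logits.shape)
```

With a real model, the stub would be replaced by something like BertForSequenceClassification.from_pretrained('bert-base-uncased'), and the wrapped module could then be handed to a fastai Learner together with a standard loss function.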


Thank you @maroberti! I already stumbled upon this link through the huggingface twitter account but that’s exactly what I was looking for.



I’m currently trying to use:

from transformers import *

This works great for BERT:

pretrained_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
fastai_vocab = Vocab(list(pretrained_tokenizer.vocab.keys()))

But if I try the same thing with XLNet:

pretrained_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
fastai_vocab = Vocab(list(pretrained_tokenizer.vocab.keys()))

I get this error:

AttributeError: 'XLNetTokenizer' object has no attribute 'vocab'

Does anyone know how to get the vocab for XLNet?



I wrote an article that should resolve your problem.

Hope it will help!
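
Independently of the article, one possible way to list the tokens without relying on a `vocab` attribute is sketched below. XLNet's tokenizer is SentencePiece-based, which is why it has no `vocab` dict; the sketch instead assumes the installed tokenizer supports `len(tokenizer)` and `convert_ids_to_tokens`, and walks every id in order. The helper name tokens_by_id and the StubTokenizer class are hypothetical; the stub only mimics those two methods so the snippet runs without downloading a model.

```python
def tokens_by_id(tokenizer):
    """Return all token strings, each placed at the index equal to its id."""
    return tokenizer.convert_ids_to_tokens(list(range(len(tokenizer))))


class StubTokenizer:
    """Stand-in exposing the two methods used above (assumption, offline only)."""

    _pieces = ["<unk>", "<s>", "</s>", "hello", "world"]

    def __len__(self):
        return len(self._pieces)

    def convert_ids_to_tokens(self, ids):
        return [self._pieces[i] for i in ids]


itos = tokens_by_id(StubTokenizer())
print(itos)
```

With a real tokenizer this would become something like Vocab(tokens_by_id(pretrained_tokenizer)). Keeping the list in id order matters, since fastai's Vocab maps each token to its position in the list.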

I found that, compared to this article, the results are a lot worse. Do you have any idea why this might be the case?

Hello Max,

Thank you for the question!
Dev Sharma’s article doesn’t use the same dataset, so comparing it with my implementation isn’t really meaningful.

Oh, you’re right, my mistake.

In addition, from my experiments, I’ve been finding that unfreezing and training the entire model seems to perform as well as, if not better than, training the head first and then gradually unfreezing the model. Often it also seems to save a lot of time to just train the unfrozen model.


Amazing, thank you for this article.


Yes, you are right!

It’s weird but it seems that sometimes it gives better results.

I didn’t take the time to check whether the tools provided by fastai, such as Discriminative Learning Rates, Gradual Unfreezing or Slanted Triangular Learning Rates, give better results with transformer architectures. So it’s worth experimenting with these parameters!

I included Gradual Unfreezing so that people have the option of using it. Maybe Gradual Unfreezing performs better with other model types or other datasets…

Thank you very much for all your remarks. If you have other questions, don’t hesitate to ask!

Hi @maroberti

Thanks for your work and the article. Really helpful.

I am currently participating in a Kaggle competition (Google QUEST) in which I would like to use the Transformers integration with fastai. The problem is that it’s an “internet off” competition. Any idea how to use your Kaggle Kernel with internet off?