Using fastai 2.0 with other languages

Hello, community,

In the fastai v1 world I had a lot of success with this repository for German text classification:

However, that code is not compatible with 2.0, and I generally want to know: what is the best way to use fastai 2.0 with languages other than English?

First question: can I use the pre-trained model from that repository with fastai 2.0?

Looking at Hugging Face and the coverage they have built up with BERT models trained on all kinds of languages, it is really easy to just load the language model of choice. Does fastai 2.0 offer a similar list of user-trained language models that can be used?
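For reference, this is roughly what I mean by "easy to load" on the Hugging Face side (just a minimal sketch with the transformers library; bert-base-german-cased is only one example checkpoint, and num_labels is a placeholder for my task):

```python
# Minimal sketch: loading a German checkpoint from the Hugging Face hub.
# "bert-base-german-cased" is just one example; any German model on the hub
# can be loaded the same way. num_labels=3 is a placeholder.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tokenize a German snippet and run it through the (still untrained) classification head
inputs = tokenizer("Wir suchen einen Data Scientist (m/w/d) in Berlin.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 3])
```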

Thank you for your help!
Best regards,
Alex

The guidance I have seen so far is to leverage the blurr library created by @wgpubs, which integrates fastai with Hugging Face transformers. Please check the examples here
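Something along these lines should get you started (a rough sketch only - blurr's import paths and the exact Blearner API have changed between versions, so treat the names below as assumptions and verify them against the blurr README):

```python
# Rough sketch of blurr's high-level API (import paths / class names may differ
# between blurr versions - check the README of the version you install).
import pandas as pd
from blurr.text.data.all import *      # assumption: blurr >= 1.0 module layout
from blurr.text.modeling.all import *  # assumption: blurr >= 1.0 module layout

# Hypothetical dataframe; blurr's defaults expect "text" and "label" columns
df = pd.DataFrame({
    "text":  ["Wir suchen einen Data Scientist ...", "Pflegekraft (m/w/d) gesucht ..."],
    "label": ["tech", "healthcare"],
})

# Wrap a Hugging Face checkpoint in a fastai Learner in one call
learn = BlearnerForSequenceClassification.from_data(
    df, "bert-base-german-cased", dl_kwargs={"bs": 4}
)
learn.fit_one_cycle(1, lr_max=1e-3)
```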


Thank you for your reply. I will start exploring immediately and I will update the post.

Still, if someone else has deeper experience with user-created ULMFiT models and how they can be used in fastai 2.0, that would be much appreciated.

Best,
Axel

Hey Axel,

It’s easy to build your own language model and text classifier in v2 in the same manner as you did in v1 - check out the docs. However, v2 is a rewrite from scratch, so models created with v1 are not compatible with v2.
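Roughly, the language-model step looks like this in v2 (just a sketch - it assumes a dataframe df with a "text" column of German documents, and note that the default AWD_LSTM weights are English/Wikipedia, so for German you would train from scratch or load community weights):

```python
from fastai.text.all import *
import pandas as pd

# Sketch only: df is assumed to be a dataframe with a "text" column of German documents,
# e.g. df = pd.read_csv("job_descriptions.csv")  # hypothetical file

# Language-model DataLoaders (is_lm=True sets up next-token-prediction targets)
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)

# AWD_LSTM downloads English (Wikitext-103) weights by default; for German,
# pass pretrained=False and train from scratch, or load your own weights.
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fit_one_cycle(1, 1e-2)

# Save the encoder so the classifier can reuse it later
learn_lm.save_encoder("finetuned_encoder")
```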

Hello Orendar,

Thank you for your message. I will explore this.

In the past I used the community-pretrained models for German. If I recall correctly, the data comes mostly from Wikipedia and news articles.

My use case is multilabel or multiclass classification of job descriptions. I have tons of data, so I guess my question is: does it make sense to train my LM on general data (Wikipedia, for example), or should I focus immediately on the domain data (job descriptions)?

Thank you for your help.

Hey Axel,

I see - I don’t know if there’s a model “zoo” yet for v2, but you’re welcome to search. In any case, the ULMFiT approach is to first pre-train a language model on a large, generic corpus such as Wikipedia, then fine-tune that language model on your own domain data, and finally use its encoder in a classifier trained on your labelled domain data.
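Concretely, the classifier stage looks roughly like this in v2 (a sketch only - it assumes the encoder saved in my earlier sketch, a dataframe df with "text" and "label" columns, and a single-label setup; multilabel needs a different target block and loss):

```python
from fastai.text.all import *

# Sketch of the ULMFiT classifier stage. Assumes "finetuned_encoder" was saved by the
# language-model step and that df has "text" and "label" columns (single-label case).
dls_clas = TextDataLoaders.from_df(
    df, text_col="text", label_col="label", valid_pct=0.1,
    text_vocab=dls_lm.vocab,  # reuse the LM vocabulary so the encoder weights line up
)

learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn_clas.load_encoder("finetuned_encoder")

# Gradual unfreezing with discriminative learning rates, as in the ULMFiT paper
learn_clas.fit_one_cycle(1, 2e-2)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))
```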

Let us know if you run into any problems, and good luck! 🙂
