Using fastai 2.0 with other languages

Hello, community,

In the fastai v1 world I had a lot of success with this repository for German text classification:

However, that code is not compatible with 2.0, so I generally want to know: what is the best way to use fast.ai 2.0 with languages other than English?

First question: Can I use the pre-trained model from the repository with fast.ai 2.0?

Looking at Hugging Face and the coverage they have built up with BERT models trained on all kinds of languages, it is really easy to just load the language model of your choice. Does fast.ai 2.0 offer a similar list of user-trained language models that can be used?

Thank you for your help!
Best regards,
Alex


The guidance I have seen so far is to leverage the blurr library created by @wgpubs, which integrates fastai with Hugging Face. Please check the examples here
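To give a feel for how little code loading such a model takes, here is a minimal sketch using the transformers library directly rather than blurr (blurr wraps models like these for fastai training); "bert-base-german-cased" is just one example checkpoint, and the two-class setup is an assumption:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# any checkpoint from https://huggingface.co/models works the same way
model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per class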


Thank you for your reply. I will start exploring immediately and I will update the post.

Still, if someone has deeper experience with user-created ULMFiT models and how they can be used in fast.ai 2.0, that would be much appreciated.

Best,
Axel

Hey Axel,

It’s easy to build your own language model and text classifier in the same manner as you did in v1 - check out the docs. However, v2 is a rewrite from scratch and so models created using v1 are not compatible with v2.

Hello Orendar,

Thank you for your message. I will explore this.

In the past I used the community-pretrained models for German. If I recall correctly, the data comes mostly from Wikipedia and news articles.

My use case is multi-label or multi-class classification of job descriptions. I have tons of data, so my question is: does it make sense to train my LM on general data (Wikipedia, for example), or should I focus immediately on the domain data (job descriptions)?

Thank you for your help.

Hey Axel,

I see - I don’t know if there’s a model “zoo” for v2 yet, but you’re welcome to search. In any case, the ULMFiT approach is to first pre-train a language model on a large, generic corpus such as Wikipedia, then fine-tune it on your own domain data, and finally use the fine-tuned language model’s encoder in a classifier trained on your domain data.
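As a minimal sketch of those stages in fastai v2 (df, the 'text'/'label' column names, and the encoder file name are placeholders for your own data; epochs and learning rates are just illustrative):

from fastai.text.all import *

# Stage 1: pre-training on a generic corpus. fastai ships English
# Wikitext-103 weights for AWD_LSTM; for German you would pre-train
# your own model.

# Stage 2: fine-tune the language model on your domain corpus
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fine_tune(4, 1e-2)
learn_lm.save_encoder('finetuned_enc')

# Stage 3: train a classifier on top of the fine-tuned encoder,
# reusing the language model's vocab
dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                   text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('finetuned_enc')
learn_clas.fine_tune(4, 1e-2)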

Let us know if you run into any problems, good luck! :slight_smile:


I’m trying to build a ULMFiT model from the German Wikipedia corpus with fastai v2. However, when creating the DataBlock (code below), my 32 GB of CPU memory fill up and the Jupyter kernel crashes, just stating:

The kernel appears to have died. It will restart automatically.
I have tried reducing the batch size (from the original 128 down to 64, 48, or 8), as well as reducing the corpus from the complete set (~100k docs / 4.7 GB) down to 16k documents. I’m running on Ubuntu 20.04 with a GTX 1650 and an i9-9980HK.
At first I thought the SentencePieceTokenizer might be the issue, but even without it the kernel crashes after some time. I can see a progress bar appear multiple times before the kernel dies; sometimes it also gets stuck on the progress bar.

Does anybody have a suggestion?

from fastai.text.all import *

bs = 48
lang = "de"
tok = SentencePieceTokenizer(lang=lang)
dblock = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True, tok=tok, encoding='utf8'),
                   get_items=get_files,
                   splitter=RandomSplitter(valid_pct=0.1, seed=42))
# bs is only used when building the DataLoaders, not in the DataBlock itself
dls = dblock.dataloaders(path, bs=bs)

My next attempt is to run it on a different machine with a GTX 1070, but I’m afraid that won’t help, since I suspect a CPU memory leak is the issue; that machine also has 32 GB. (I also tried Colab, but it has issues with the thousands of files in one Google Drive folder.)

Once I get the model built, I’m happy to share it!
Greetings,
Felix

Hi Felix,

I am afraid I am not of much help yet, but my team is also in the process of setting up fast.ai 2.0 and training a German language model. I will notify you if we have the same problem.

Best,
Axel

Not sure what is going on, but these would be my next attempts (see the sketch after this list):

  • Try the sentencepiece version that comes along with fastai
  • Reduce the vocab size to 10k or 15k rather than the default max vocab size of 30k
  • Try DataBlock.summary to see where things are going wrong
  • Maybe create Datasets first to isolate the problem
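As a rough sketch of the last three points, reusing path and get_files from the snippet above (the 15k figure is just an example):

from fastai.text.all import *

# smaller SentencePiece vocab instead of the 30k default
tok = SentencePieceTokenizer(lang='de', vocab_sz=15000)
dblock = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True, tok=tok),
                   get_items=get_files,
                   splitter=RandomSplitter(valid_pct=0.1, seed=42))

# summary runs a few samples through every transform and prints each
# step, which often shows where things blow up or get stuck
dblock.summary(path)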

@m1active520 @msivanes
Thanks for your replies! I got everything set up on the second PC and it ran smoothly. :pray:
Let’s see if the model trains fine over the weekend.
I’ll try your suggestions anyway; if I find a solution, I’ll share it.

I recently trained a German language model with SentencePiece (vocab size 30k for now). Here are the notebooks:

I did some experiments with QRNN and a 15k vocab size, but they didn’t really work better or faster than the LSTM with 30k.

If anyone is interested, I can provide the pre-trained model too.

Florian


Florian, from the bottom of my heart, Dankeschön :wink:

You just made my weekend a lot more interesting! May I ask what machine you used and how long the training took?

Best regards,
Axel

Das freut mich :wink:

I trained it on an RTX 3090; training on 160k wiki articles took about 25 minutes per epoch, so about 4 hours in total. I did the fine-tuning on GermEval 2018 + 2019 tweets, which took minutes, and then trained a classifier on one of the 2018 GermEval tasks.

The notebooks don’t have any documentation right now … I’ll try to add that later. The preparation steps (notebooks 1+2) might not work (I had issues with wikiextract), but notebooks 3-5 should work fine.


I found the solution:

Allocate more swap space, as much as needed. Of course this slows down the process, but it prevented the crashes. I used GParted.

Here you can find a tutorial:

Hi @florianl,

nice work! I’m currently working on a lyrics generator for various German artists as a learning/toy project over the holidays. However, since I’m fairly new to data science/AI and to Python in general, I’m struggling to run and debug your notebooks 1+2. Would it be possible to provide me with the pretrained language model, so that I can fine-tune it on my own lyrics data as shown in your notebook 4?

Thanks in advance and best regards,
Tobias

Hi Tobias,

yes, notebooks 1 and 2 unfortunately involved a lot of hacking. I’ve uploaded the pretrained weights - see the README for the links. You can fine-tune the LM as in notebook 4.
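Roughly like this - a sketch only, where lyrics_df and the weight/vocab file names ('de_wiki_lm', 'de_wiki_vocab') are placeholders for your data and for the files from the README, which go into dls.path/'models':

from fastai.text.all import *

dls = TextDataLoaders.from_df(lyrics_df, text_col='text', is_lm=True)
learn = language_model_learner(
    dls, AWD_LSTM, pretrained=True,
    pretrained_fnames=['de_wiki_lm', 'de_wiki_vocab'],  # .pth and .pkl, extensions omitted
    drop_mult=0.3)
learn.fine_tune(4, 1e-2)

# untested sketch of text generation with the fine-tuned LM
print(learn.predict("Ich bin", n_words=40, temperature=0.75))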

I haven’t tried text generation myself, but as far as I know, transformer models (BERT, GPT-2, etc.) work better for generation tasks.

See https://huggingface.co and blurr (https://github.com/ohmeow/blurr, a Hugging Face integration for fastai).