Using fastai 2.0 with other languages

Hello, community,

In the fastai v1 world I had a lot of success with this repository for German text classification:

However, that code is not compatible with 2.0, so I generally want to know: what is the best way to use fast.ai 2.0 with languages other than English?

First question: Can I use the pre-trained model from the repository with fast.ai 2.0?

Looking at Hugging Face and the coverage they have built up with BERT models trained on all kinds of languages, it is really easy to just load the language model of your choice. Does fast.ai 2.0 offer a similar list of user-trained language models that can be used?

Thank you for your help!
Best regards,
Alex


The guidance I have seen so far is to leverage the blurr library created by @wgpubs, which integrates fastai with Hugging Face. Please check the examples here
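To give a feel for how little code loading such a model takes, here is a minimal sketch using the transformers library directly rather than blurr (blurr wraps models like these for fastai training); "bert-base-german-cased" is just one example checkpoint, and the two-class setup is an assumption:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# any checkpoint from https://huggingface.co/models works the same way
model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per class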


Thank you for your reply. I will start exploring immediately and I will update the post.

Still, if someone has deeper experience with user-created ULMFiT models and how they can be used in fast.ai 2.0, that would be much appreciated.

Best,
Axel

Hey Axel,

It’s easy to build your own language model and text classifier in the same manner as you did in v1 - check out the docs. However, v2 is a rewrite from scratch and so models created using v1 are not compatible with v2.

Hello Orendar,

Thank you for your message. I will explore this.

In the past I used the community-pretrained models for German. If I recall correctly, the data comes mostly from Wikipedia and news articles.

My use case is multi-label or multi-class classification of job descriptions. I have tons of data, so my question is: does it make sense to train my LM on general data (Wikipedia, for example), or should I focus immediately on the domain data (job descriptions)?

Thank you for your help.

Hey Axel,

I see - I don’t know if there’s a model “zoo” for v2 yet, but you’re welcome to search. In any case, the ULMFiT approach is to first pre-train a language model on a large, generic corpus such as Wikipedia, then fine-tune it on your own domain data, and finally use the fine-tuned language model’s encoder in a classifier trained on your domain data.
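As a minimal sketch of those stages in fastai v2 (df, the 'text'/'label' column names, and the encoder file name are placeholders for your own data; epochs and learning rates are just illustrative):

from fastai.text.all import *

# Stage 1: pre-training on a generic corpus. fastai ships English
# Wikitext-103 weights for AWD_LSTM; for German you would pre-train
# your own model.

# Stage 2: fine-tune the language model on your domain corpus
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fine_tune(4, 1e-2)
learn_lm.save_encoder('finetuned_enc')

# Stage 3: train a classifier on top of the fine-tuned encoder,
# reusing the language model's vocab
dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                   text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('finetuned_enc')
learn_clas.fine_tune(4, 1e-2)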

Let us know if you run into any problems, good luck! :slight_smile:


I’m trying to build a ULMFiT model from the German Wikipedia corpus with fastai v2. However, when creating the DataBlock (code below), my 32 GB of CPU memory fill up and the Jupyter kernel crashes, just stating:

The kernel appears to have died. It will restart automatically.
I have tried reducing the batch size (from the original 128 down to 64, 48, or 8), as well as reducing the corpus from the complete set (~100k docs / 4.7 GB) down to 16k documents. I’m running on Ubuntu 20.04 with a GTX 1650 and an i9-9980HK.
At first I thought the SentencePieceTokenizer might be the issue, but even without it the kernel crashes after some time. I can see a progress bar appear multiple times before the kernel dies; sometimes it also gets stuck on the progress bar.

Does anybody have a suggestion?

from fastai.text.all import *

bs = 48
lang = "de"
tok = SentencePieceTokenizer(lang=lang)
dblock = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True, tok=tok, encoding='utf8'),
                   get_items=get_files,
                   splitter=RandomSplitter(valid_pct=0.1, seed=42))
# bs is only used when building the DataLoaders, not in the DataBlock itself
dls = dblock.dataloaders(path, bs=bs)

My next attempt is to run it on a different machine with a GTX 1070, but I’m afraid that won’t help, since I suspect a CPU memory leak is the issue; that machine also has 32 GB. (I also tried Colab, but it has issues with the thousands of files in one Google Drive folder.)

Once I get the model built, I’m happy to share it!
Greetings,
Felix

Hi Felix,

I am afraid I am not of much help yet, but my team is also in the process of setting up fast.ai 2.0 and training a German language model. I will notify you if we have the same problem.

Best,
Axel

Not sure what is going on, but these would be my next attempts (see the sketch after this list):

  • Try the sentencepiece version that comes along with fastai
  • Reduce the vocab size to 10k or 15k rather than the default max vocab size of 30k
  • Try DataBlock.summary to see where things are going wrong
  • Maybe create Datasets first to isolate the problem
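As a rough sketch of the last three points, reusing path and get_files from the snippet above (the 15k figure is just an example):

from fastai.text.all import *

# smaller SentencePiece vocab instead of the 30k default
tok = SentencePieceTokenizer(lang='de', vocab_sz=15000)
dblock = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True, tok=tok),
                   get_items=get_files,
                   splitter=RandomSplitter(valid_pct=0.1, seed=42))

# summary runs a few samples through every transform and prints each
# step, which often shows where things blow up or get stuck
dblock.summary(path)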

@m1active520 @msivanes
Thanks for your replies! I got everything set up on the second PC and it ran smoothly. :pray:
Let’s see if the model trains fine over the weekend.
I’ll try your suggestions anyway; if I find a solution, I’ll share it.

I recently trained a German language model with SentencePiece (vocab size 30k for now). Here are the notebooks:

I did some experiments with QRNN and a 15k vocab size, but they didn’t really work better or faster than the LSTM with 30k.

If anyone is interested, I can provide the pre-trained model too.

Florian


Florian, from the bottom of my heart, Dankeschön :wink:

You just made my weekend a lot more interesting! May I ask what machine you used and how long the training took?

Best regards,
Axel

Das freut mich :wink:

I trained it on an RTX 3090; training on 160k wiki articles took about 25 minutes per epoch, so about 4 hours in total. I did the fine-tuning on GermEval 2018 + 2019 tweets, which took minutes, and then trained a classifier on one of the 2018 GermEval tasks.

The notebooks don’t have any documentation right now … I’ll try to add that later. The preparation steps (notebooks 1+2) might not work (I had issues with wikiextract), but notebooks 3-5 should work fine.


I found the solution:

Allocate more swap space, as much as needed. Of course this slows down the process, but it prevented the crashes. I used GParted.

Here you can find a tutorial:

Hi @florianl,

nice work! I’m currently working on a lyrics generator for various German artists as a learning/toy project over the holidays. However, since I’m fairly new to data science/AI and to Python in general, I’m struggling to run and debug your notebooks 1+2. Would it be possible to provide me with the pretrained language model, so that I can fine-tune it on my own lyrics data as shown in your notebook 4?

Thanks in advance and best regards,
Tobias

Hi Tobias,

yes, notebooks 1 and 2 unfortunately involved a lot of hacking. I’ve uploaded the pretrained weights - see the README for the links. You can fine-tune the LM as in notebook 4.
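Roughly like this - a sketch only, where lyrics_df and the weight/vocab file names ('de_wiki_lm', 'de_wiki_vocab') are placeholders for your data and for the files from the README, which go into dls.path/'models':

from fastai.text.all import *

dls = TextDataLoaders.from_df(lyrics_df, text_col='text', is_lm=True)
learn = language_model_learner(
    dls, AWD_LSTM, pretrained=True,
    pretrained_fnames=['de_wiki_lm', 'de_wiki_vocab'],  # .pth and .pkl, extensions omitted
    drop_mult=0.3)
learn.fine_tune(4, 1e-2)

# untested sketch of text generation with the fine-tuned LM
print(learn.predict("Ich bin", n_words=40, temperature=0.75))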

I haven’t tried text generation myself, but as far as I know, transformer models (BERT, GPT-2, etc.) work better for generation tasks.

See https://huggingface.co and blurr (https://github.com/ohmeow/blurr, a Hugging Face integration for fastai).