NLU - German Language Model

Am I right that fastai uses a language model pretrained on the English WikiText corpus? Is there also a German model, or are there any instructions on how to train one on my own?
Thanks!

The German model is the one from n-waves, but there haven't been any commits for two years. Has anyone already tried this model with the new fastai version?

The readme says: “This is early commits based on the Poleval2018, it won’t work well for the time being.” This doesn’t sound great, does it?

Hi, I trained a German language model on Wikipedia last year. I didn't actually use it (beyond playing around with it), because I ended up using plain tf-idf for that project.

I can provide you with the (undocumented ;)) notebooks and model files if you like. I'd be interested in training a German SentencePiece-based model, and this time I'd like to do it right (last year I didn't really know what I was doing ;)). Is there a guide on how to preprocess the Wikipedia dump and train the model correctly? Below is a rough sketch of what I have in mind.
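
A minimal sketch of what I'd try with fastai v2's SentencePieceTokenizer; lang, vocab_sz, and path are placeholders I picked, not settings from any guide:

from fastai.text.all import *

# SentencePiece subword tokenization instead of the default word tokenizer;
# lang='de' and vocab_sz=30000 are assumptions on my part
tok = SentencePieceTokenizer(lang='de', vocab_sz=30000)

# path would point at the folder of extracted Wikipedia text files
dls = DataBlock(
    blocks=TextBlock.from_folder(path, tok=tok, is_lm=True),
    get_items=get_text_files, splitter=RandomSplitter(0.1)
).dataloaders(path, bs=64, seq_len=80)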

Right now I don't know what the right way to go is for German NLP: ULMFiT? MultiFiT? BERT?

Hi, thanks! Yes, I would be interested in your notebooks. I have also done some experiments with n-waves but somehow lost the models. I wonder whether this will work with version 2 of fastai.

I want to try a language model to generate texts, but I am not quite sure if I am doing it correctly. I followed the instructions on using pretrained weights mentioned in this post (and used these pretrained weights):

Then I set up my dataloaders with texts about turmeric that I had scraped from different blogs.

from fastai.text.all import *

# language-model DataBlock over the scraped texts in BASE, 10% validation split
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(BASE, is_lm=True),
    get_items=get_text_files, splitter=RandomSplitter(0.1)
).dataloaders(BASE, path=BASE, bs=64, seq_len=80)
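
To sanity-check the tokenization I look at a batch with fastai's standard call:

dls_lm.show_batch(max_n=2)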

I set up my learner and trained for 1 epoch:

# config holds the model configuration set up earlier in the notebook (not shown);
# pretrained_fnames expects [weights, itos vocab] file names without extensions
learn = language_model_learner(dls_lm, 
                               AWD_LSTM, 
                               config=config, 
                               pretrained_fnames=[FILE_LM_ENCODER, FILE_ITOS],
                               drop_mult=0.3,
                               metrics=[accuracy, Perplexity()]).to_fp16()

Then I unfroze down to layer group -1 and trained, then unfroze down to layer group -2 and trained again.
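
In fastai v2 that gradual-unfreezing schedule looks roughly like this; the learning rates are placeholders I would pick with lr_find, not the exact values from my notebook:

learn.fit_one_cycle(1, 2e-2)               # first epoch with the body frozen
learn.freeze_to(-2)                        # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-3, 2e-3))
learn.unfreeze()                           # unfreeze the whole model
learn.fit_one_cycle(1, slice(1e-4, 1e-3))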

Now I get an accuracy of up to 55%, and I tried to generate some text. But the generated text often contains repeats of the same group of words.
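
For generation I use the language-model learner's predict; the prompt, n_words, and temperature here are just illustrative values, not the exact ones from my notebook:

# temperature scales the sampling randomness; nudging it up (or setting min_p)
# tends to cut down on verbatim repeats
print(learn.predict('Kurkuma ist', n_words=50, temperature=0.75))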

I am uncertain whether I am in fact fine-tuning an English-language model on German texts (because my texts are in German…).
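
One sanity check I can think of is to look at the pretrained vocabulary passed as FILE_ITOS: if the frequent tokens look German, the weights should be the German ones. This assumes the itos file is the usual pickled list of tokens:

import pickle

# FILE_ITOS is the name passed to pretrained_fnames; fastai appends .pkl
with open(f'{FILE_ITOS}.pkl', 'rb') as f:
    itos = pickle.load(f)
print(itos[:50])  # the most frequent tokens should be German, not English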

What could I improve to get more meaningful texts?

If you want to have a look at the whole notebook, please see my GitHub repo: https://github.com/we-make-ai/kurkuma_textgenerator

There you can also see repeats like this one:

Diese können die Symptome von Entzündungen , Entzündungen , Entzündungen , Entzündungen , Pickel und Krankheiten lindern

(roughly: "These can relieve the symptoms of inflammation, inflammation, inflammation, inflammation, pimples and diseases.")

Here the term Entzündungen ("inflammation") is repeated four times within one sentence.

A bit late, but here are the notebooks for training a language model with SentencePiece and fastai2.
