ULMFiT - German

Hi @jyr1,
So you pretrained the model on Amazon reviews, right?

No, on Wikipedia data.

I can’t check your notebook as the GitHub viewer isn’t working, but I have a feeling about what the issue might be.

Have you fine-tuned ULMFiT on the entire dataset? I think the point of the “extreme sample-efficiency” is that you use all your unsupervised text, which can be very large, and then only use a few labeled examples. The reasoning behind this is that raw text is cheap, while labels are expensive.
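Roughly, in fastai v1 terms (a minimal sketch; the data frame names and the pretrained file names are placeholders, not from any specific setup):

from fastai.text import *

path = Path('.')

# 1) Fine-tune the pretrained language model on ALL raw text (no labels needed here)
data_lm = TextLMDataBunch.from_df(path, train_df=df_all_text, valid_df=df_valid)
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                                  pretrained_fnames=['de-wiki-lm', 'de-wiki-itos'])
learn_lm.fit_one_cycle(3, 1e-2)
learn_lm.save_encoder('ft_enc')

# 2) Train the classifier on only the few labeled examples
data_clas = TextClasDataBunch.from_df(path, train_df=df_few_labeled, valid_df=df_valid,
                                      vocab=data_lm.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(3, 1e-2)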

Has anyone trained a German language model on TransformerXL?

@sylvain tried to train a Transformer XL and then use it in place of the LSTM in ULMFiT, but I’m not sure whether that worked.


Hi @jyr1,

I’m currently working with your pretrained weights and it works great!
Thank you very much for your work.

Can you please provide some additional information about the weights?
Why did you “only” choose a vocab-size of 30k?
How many wikipedia articles have you used in total?
And how many words do the articles contain in total?

Best regards
Philipp

I also trained a German language model with sub-word tokenization. You can download it from GitHub, where I also give more background information: https://github.com/jfilter/ulmfit-for-german

I experimented on 10kGNAD and achieved an accuracy on the validation set (1k samples from the training set) and test set of 91% and 88.3%, respectively. Check out the notebook for the details.
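For anyone wondering how such a sub-word vocabulary is usually built: here is a rough SentencePiece sketch (file names and parameter values are illustrative only, not necessarily what the repo above uses).

import sentencepiece as spm

# Train a unigram sub-word model on the raw corpus (one sentence per line)
spm.SentencePieceTrainer.Train(
    '--input=dewiki.txt --model_prefix=de_spm '
    '--vocab_size=25000 --model_type=unigram --character_coverage=0.9995')

# Load the model and split text into sub-word pieces
sp = spm.SentencePieceProcessor()
sp.Load('de_spm.model')
print(sp.EncodeAsPieces('Das ist ein Beispielsatz.'))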


@jfilter,
Thanks a lot for sharing.
Appreciate your effort for the community.

I trained a German language model on TransformerXL, just slightly adapting your Spanish LM code (thank you for your work and for @kaspar’s hyperparameter experiments!):

  • on a dataset of ≈ 100 million words from German Wikipedia (10% of all articles with > 1,000 characters)

  • with a vocabulary size of 60k and mem_len=50 (to increase training speed)

  • reaching a perplexity of 18.9 at the best validation loss, but:

  • the out-of-vocabulary (OOV) percentage is quite high: 6.5% and 6.9% (train and validation set, respectively), making the low validation error and perplexity less meaningful (as @piotr.czapla explained further up)

Notebook and link to weights and vocab here.
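Roughly, the corresponding fastai v1 setup looks like this (a sketch, not the exact notebook code; data_lm is assumed to be a DataBunch built with a 60k vocabulary, the other hyperparameters keep fastai’s TransformerXL defaults, and the schedule and drop_mult are just example values):

from fastai.text import *

config = tfmerXL_lm_config.copy()
config['mem_len'] = 50    # shorter memory to speed up training
learn = language_model_learner(data_lm, TransformerXL, config=config,
                               pretrained=False, drop_mult=0.1)
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))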

EDIT: it seems that under fastai 1.0.52, Transformer.py isn’t fully functional. The fastai wizards have already fixed the problem; Sylvain Gugger suggests either waiting for the next fastai release or staying with 1.0.51 when using the Transformer.


Hello, I am trying to clone the multilingual ULMFiT repository from GitHub in Google Colab. However, I keep running into the error below. I tried the various suggestions on StackExchange, to no avail. I’m quite new to fast.ai, so I have no idea what’s going on. Any help would be appreciated.

ERROR: Command “python setup.py egg_info” failed with error code 1 in /tmp/pip-req-build-rrjon9hs/

As an alternative: We have just decided to share our German BERT model with the community. It outperforms the multilingual BERT in 4 out of 5 tasks. You can find model and evaluation details here. Hope this helps some of you working on German NLP downstream tasks!


Thank you for sharing this! Reading your blog post about training another model with more data: are you considering training a German GPT-2 (345M)? That could take German natural language generation (which seems hard or unfeasible with BERT and is generally lagging behind English NLG) to a new level.

Our priority for now is training more BERT models on larger datasets (incl. domain-specific ones) and simplifying their usage for standard downstream tasks like document classification, NER, QA …

But maybe we will move on to GPT-2 afterwards (or whatever NLG model is out there by then).


Have you considered getting the model exposed through torch.hub, either via the fastai repo or huggingface?

It would be nice if someone could simply state:
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertTokenizer', 'bert-de-deepset-base', do_basic_tokenize=True, do_lower_case=False)
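…and, continuing the same hypothetical (the shortcut name 'bert-de-deepset-base' is made up, just as above, and this assumes a bertModel hub entrypoint analogous to the tokenizer one):

import torch

model = torch.hub.load('huggingface/pytorch-pretrained-BERT', 'bertModel', 'bert-de-deepset-base')
model.eval()  # switch to inference mode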


From my understanding, there is no pretrained German language model for ULMFiT that works with the current fastai v1.0.54?

Even @jyr1’s model gives me an error when I try to load it:

weights='30k-pre-ger'
vocs='30k-pre-ger-itos'
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.5, pretrained_fnames=[weights, vocs])

That’s because, as of fastai 1.0.53, all language model shapes are divisible by 8 by default (reason: half-precision training is much faster that way). Thankfully, Sylvain Gugger posted a workaround here:
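In essence (a sketch; it assumes the pretrained weights were trained with the old default hidden size of 1150):

config = awd_lstm_lm_config.copy()
config['n_hid'] = 1150  # fastai >= 1.0.53 defaults to 1152 to keep shapes divisible by 8
learn = language_model_learner(data_lm, AWD_LSTM, config=config,
                               pretrained_fnames=[weights, vocs], drop_mult=0.5)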


I was completely unaware of that. Will try today, thank you!

Hi
In the meantime, is there a trained German model that can be downloaded somewhere, or will the ‘official’ German model come out soon (in the official model zoo)?
Regards, Felix

Hi @felixsmueller. @jyr1 trained a language model a while back that works well for me:

If using a recent fastai version (1.0.53 and later), you will have to read in the weights according to Sylvain Gugger’s workaround (see post a bit further up).

Thanks a lot.
Just some more descriptions for newbies like me:
Download the language model files (from https://drive.google.com/open?id=1gkuY3Tz6LBmcehAnZ95jssV80CBQh7L1) and store them in your Google Drive account.

# Then the following code allows you to access your Google Drive:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'Colab Notebooks/FastAIGermanModel'  # Adapt path

# The following code then reads in the model:
FILE_LM_ENCODER = base_dir + '/30k-pre-ger'
FILE_ITOS = base_dir + '/30k-pre-ger-itos'
config = awd_lstm_lm_config.copy()
config['n_hid'] = 1150
learn = language_model_learner(data_lm, AWD_LSTM, config=config, pretrained_fnames=[FILE_LM_ENCODER, FILE_ITOS], drop_mult=0.3)
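From there, fine-tuning on your own texts follows the usual fastai pattern (a sketch; learning rate and epoch count are just example values):

learn.lr_find()                      # pick a learning rate from the plot
learn.fit_one_cycle(1, 1e-2)         # fine-tune the language model on your own data
learn.save_encoder('ft_enc_german')  # save the encoder for a downstream classifier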
