ULMFiT - German

Great summary of your experiments, thank you!

Let’s update the table above! Can you give me the accuracy for datasets 1 and 2?

@aayush Let’s try to tackle GermEval 2018. @mkardas achieved SOTA as far as I can tell, but maybe with this nice preprocessing we can beat it as well.

Btw, re SentencePiece: did you use the same preprocessing, and what values did you get?

Sure, I’ll run the tests and confirm.

A little caught up at work but I will do my best.

Yes, the pre-processing steps are common for all experiments on a given dataset. I’m not sure I understand the question. How do you mean, “values”?

1 Like

By “values” I meant the perplexity and the accuracy you managed to achieve using SentencePiece.

The score for the model corresponds to the SENTP GE17 experiment in the table I posted. The perplexity is 52.45 and the accuracy is 0.33*. I haven’t fine-tuned on the GE ’18 dataset yet, mainly because the results weren’t very good on GE ’17.

* I’m considering the perplexity score on the validation set.

I created and published a German topic classification dataset based on ten thousand German news articles categorized into nine classes. I thought this might be interesting for someone looking here.

I trained a German LM, fine-tuned it, and built a classifier on top which achieves 89% test accuracy. Additionally, I compared the low-shot learning part of the ULMFiT paper to fastText, a linear SVM, and a TensorFlow NN. I’ll post the results here in the following weeks.
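In case the overall recipe is useful to anyone, here is a minimal sketch of that pipeline in fastai v1 (column names, paths, and hyperparameters are placeholders, not my exact settings):

from fastai.text import *

# Language model data and learner; train_df/valid_df hold the articles.
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols='text')
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the LM on the target corpus
learn.save_encoder('ft_enc')  # keep the encoder for the classifier

# Classifier on top of the fine-tuned encoder, sharing the LM vocab.
data_clas = TextClasDataBunch.from_df(path, train_df, valid_df, vocab=data_lm.train_ds.vocab, text_cols='text', label_cols='label')
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ft_enc')
clf.fit_one_cycle(1, 1e-2)  # gradual unfreezing would follow in practice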

@tblock let me know how it goes. Btw, are you sure you can include the scraped text in CSV form? It might be better to include just the links and the code to fetch the articles from the websites; otherwise your repo should have a license: non-commercial research only.

@piotr.czapla I’ll keep you updated. I’m finishing my thesis about it at the moment.

Regarding the licensing, please check out https://github.com/tblock/10kGNAD for more detail on the dataset. I didn’t scrape the news articles; they are extracted from the One Million Posts Corpus. I detail the license in the project readme and on the project page. But thanks for the heads-up!

3 Likes

Hi @tblock,
I am also doing similar work and I am using the dataset from your repo.
Unfortunately, certain rows seem to contain more than one delimiter.
Is “;” the delimiter? If so, there is more than one in certain rows.
Is my understanding right?

Yes, that’s correct.

I tried to keep changes from the original source to a minimum, so some texts contain one or even multiple “;” characters. However, this should not be a problem, since texts containing separators are quoted in the typical Pythonic manner.

See the code folder for examples using the Python csv lib.
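For illustration, a minimal sketch with the standard library (“;” is the delimiter as described above; the quote character here is an assumption, so check the code folder for the exact settings used):

import csv

# Each row is label;text, and texts containing ';' are quoted.
with open('train.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, delimiter=';', quotechar="'")
    for label, text in reader:
        print(label, text[:50])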

Hey there,

I’m quite new to this field and I’m wondering if there are already language models for German that I could use? So if I’m considering doing a new project on text classification, what would you recommend?

Going through the ULMFiT steps myself?

Thanks in advance!

Dear @tblock,
Thanks a lot for your reply.
Regards,
Pappu Prasad

1 Like

Dear @LisaN,
There are a number of approaches that have achieved state-of-the-art (SOTA) scores.
It’s an evolving field, where an approach published a few months ago becomes a “classical” approach and something new takes over. See OpenAI GPT-2 for example.
Since you are new to this field, I would recommend going through some baseline models such as random forests, SVMs, etc.
See their approach and build your basics.
Then go for ULMFiT.
Other models to look at after mastering ULMFiT: Google BERT, ELMo, OpenAI GPT, Transformer-XL (please check the internet for the correct names).
I hope these tips will be useful to you :slight_smile:
Regards,
Pappu

Dear @Skeptic,

Thanks for the advice. I did some basic neural network and SGD models in scikit-learn and TensorFlow, but I’m trying a little project on German now. Of course I could do it without a pretrained model, but I’m wondering if any of you have already used a pretrained model for German, and if so, would you only “trust” your own or would you reuse others’?

And if you reuse others, which ones? I found one in this thread (https://lernapparat.de/german-lm/), thanks by the way, but it didn’t work on the first try. So do you think I should invest time in getting that one to run, or should I create my own?

Being state of the art is not that important to me, I just want to do well :slight_smile:

Hello @LisaN ,
That language model (generously provided by @t-v, thank you!) didn’t work for me either on my first attempt. I tried:

learn = language_model_learner(data_lm, arch=AWD_LSTM, pretrained=False, drop_mult=0.3)

learn.load_pretrained(wgts_fname="German-LM/DE_model_dropout_0.1_1cycle_10epochs.pt", itos_fname="German-LM/DE_spacy_itos.pkl")

… but I got this error message:

I’m not sure, but it might have something to do with the fastai library developing/changing fast? Thomas’ model is from last summer, I believe.
What problems did you have while trying to import the language model?
If we don’t find an easy fix, we might ask Thomas for inspiration & enlightenment.
I am very interested in using the same model as a basis for further fine-tuning on my pet project (text regression on German fiction).

All the best,
Johannes

1 Like

Hey @jolackner,

wow, cool (for me) that you are encountering the same problem! So your code looks as if you create an empty learner first and then load the pretrained model into it, right? That was also one of my attempts. I downloaded the files into the working directory, but also renamed them (DE_model.pt and DE_itos.pkl) because I thought two dots in a filename might not be good.

My second attempt was going in the PyTorch direction as described here. This one can load the model weights via

DE_model = torch.load('/kaggle/working/DE_model.pt', map_location=lambda storage, loc: storage)

and the pickle goes like this:
DE_itos = pickle.load((Path('/kaggle/working/DE_itos.pkl')).open('rb'))

So I was able to load the weights and the pickle, but I didn’t know how to build the architecture from there…

I tried

learn = language_model_learner(data_lm, arch=DE_model, model_dir=path, vocab=DE_itos)

which gives me this error:

TypeError: unhashable type: 'collections.OrderedDict'

I must admit I don’t know what I’m doing; sometimes I just copy and run…

1 Like

I think there are only three predefined architectures you can load for the time being; see the fastai docs here:
https://docs.fast.ai/text.learner.html#language_model_learner

Trying arch=DE_model won’t work, according to the docs. I guess you’ll need arch=AWD_LSTM, and then you would load your custom DE_model via load_pretrained (https://docs.fast.ai/text.learner.html#RNNLearner.load_pretrained).
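Something along these lines might work (an untested sketch; the filenames are the renamed ones from your post, and I’m assuming the files sit where the learner can find them):

from fastai.text import *

# Build the AWD_LSTM learner without downloading the English weights ...
learn = language_model_learner(data_lm, arch=AWD_LSTM, pretrained=False, drop_mult=0.3)

# ... then load the German weights and vocab into it.
learn.load_pretrained(wgts_fname='DE_model.pt', itos_fname='DE_itos.pkl')

Alternatively, language_model_learner has a pretrained_fnames argument that takes the two filenames (without extensions) from the models folder under the data path and does both steps in one go.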

This just for now, I’ll provide updates if I can get the pretrained language model to work.

I pretrained a model on German Wikipedia data and scraped about 60k German Amazon reviews to ultimately create a sentiment classifier. It achieves about 93% accuracy on the training and validation sets as well as on an independent test set, so I’m happy with that.

If it’s helpful for anyone, I’ll upload my pretrained model here.

2 Likes

Hi,
that would be wonderful! How long did it take you to train?

Hi @jyr1,

I’d love to try it out, thanks :slight_smile:

Please do upload it!