ULMFiT - German

Yeah, we assume that emoji are escaped with <e> and </e> tags (or <emoji> and </emoji> to avoid confusion). So a sentence

Ich liebe das :heart:.

could be encoded as

Ich liebe das <emoji> rotes Herz </emoji>.

It is hard to tell at the moment whether such an encoding will be helpful; it mostly depends on how Twitter users use emoji.
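Roughly, such an encoding can be produced with a simple lookup and substitution. A minimal sketch (the tiny emoji → German name table is only a placeholder for a full emoji description list):

```python
import re

# placeholder lookup table: emoji character -> German short name
EMOJI_NAMES = {"❤": "rotes Herz", "😉": "zwinkerndes Gesicht"}
EMOJI_RE = re.compile("|".join(map(re.escape, EMOJI_NAMES)))

def encode_emoji(text):
    # wrap every known emoji in <emoji> ... </emoji> tags
    return EMOJI_RE.sub(lambda m: f"<emoji> {EMOJI_NAMES[m.group(0)]} </emoji>", text)

print(encode_emoji("Ich liebe das ❤."))
# Ich liebe das <emoji> rotes Herz </emoji>.
```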


FYI, a user needs trust level 3 to be able to
“make their own posts wiki (that is, editable by any TL1+ users)”,
according to https://blog.discourse.org/2018/06/understanding-discourse-trust-levels/

Dear Marcin,

the first and the second list look very good.

Were you already able to get the first list with jQuery?
(The approx. 2.7k emoji are only so many because their list also contains every variant of each emoji, e.g. the different color versions. Without those it is not that many. :wink: )

I am playing around with BeautifulSoup and Python, more or less successfully, to get the list from the last link.

If you already have the first list in Python, I would guess we wouldn’t need the other lists on top of it and could just check this one in more detail?

I can have a look at the lists (German is my mother tongue) and then upload them somewhere.

Best regards
Michael

Hi Michael,

thanks for verifying the lists and implementing an import script. The jQuery code was broken because I formatted it here using a block quote instead of a code block, and as a result the apostrophes were changed into Unicode quotes ‘’. It’s now edited and works (you can paste it into a web console), but the great thing is that we don’t need it. After your message I started checking the license of the first list and found the source at https://unicode.org/repos/cldr/trunk/common/annotations with many more languages :slight_smile: And it’s all in XML. I’m going to test if it helps with classification and I’ll keep you informed.
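In case it helps, a minimal sketch of pulling the German names out of such an annotations file (assuming de.xml from that repository has been downloaded locally; as far as I can tell, the entries with type="tts" carry the short emoji names):

```python
import xml.etree.ElementTree as ET

# de.xml: the German annotations file from the CLDR repository linked above
tree = ET.parse("de.xml")

emoji_names = {}
for node in tree.getroot().iter("annotation"):
    # type="tts" entries hold the short name (e.g. "rotes Herz"),
    # the others hold "|"-separated keywords
    if node.get("type") == "tts":
        emoji_names[node.get("cp")] = node.text.strip()

print(len(emoji_names), emoji_names.get("❤"))
```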

Good to know that you’re a native German speaker. Have you already played with ULMFiT, i.e., predicting the end of a sentence based on its beginning (Language model text generation giving weird results in a consistent manner)? I was fiddling with a Polish language model and I’m surprised how good it is at declension and conjugation, maintaining even long-distance dependencies.

Hi Marcin,

that XML-file is great!
If you have results let us/me know. :slight_smile:

So far I just played around with the ULMFiT RNN with non-language sequence data.
But I am still trying to get a setup which trains in a reasonable time.
I have to see how I will proceed with this, as I will need more GPU power than I have currently available.

Best regards
Michael

Model parameters: bptt=70, em_sz=300, nh=1150, nl=3

Training parameters: lr=1.2e-2, bs=64, use_clr_beta=(10, 10, 0.95, 0.85/0.95), wd=3e-6, clip=0.2
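For anyone less familiar with the abbreviations, a rough sketch of the network shape these parameters describe, in plain PyTorch (only an illustration; the actual fastai AWD-LSTM additionally uses weight tying, the various dropouts, etc.):

```python
import torch
import torch.nn as nn

bptt, em_sz, nh, nl = 70, 300, 1150, 3
vocab_size = 25000  # 25k for the sentencepiece runs, 80k for the spacy runs

class SketchLM(nn.Module):
    # embedding -> 3-layer LSTM -> decoder over the vocabulary
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, em_sz)
        self.rnn = nn.LSTM(em_sz, nh, num_layers=nl)
        self.dec = nn.Linear(nh, vocab_size)

    def forward(self, x):  # x: (bptt, batch) token ids
        out, _ = self.rnn(self.emb(x))
        return self.dec(out)  # next-token logits at every position

model = SketchLM()
# "clip=0.2" refers to gradient clipping, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), 0.2)
```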

GERMAN WIKIPEDIA (PRE-TRAINING)

| Exp. | Train loss | Val loss | PPX | Acc | N cyc., CL |
| --- | --- | --- | --- | --- | --- |
| SENTP DEWIKI (25K) | 5.57 | 4.72 | 112.17 | 0.28 | 1, 12 |
| SENTP GE17 (25K)* | 5.26 | 3.96 | 52.45 | 0.33 | 1, 12 |
| SENTP GE18 (25K)^ | | | | | |
| SPACY (80K) | 5.07 | 4.32 | 75.19 | 0.34 | 1, 12 |

GERMEVAL ‘17 (LM)

| Exp. | Train loss | Val loss | PPX | Acc | N cyc., CL |
| --- | --- | --- | --- | --- | --- |
| SENTP DEWIKI (25K) | | | | | |
| SENTP GE17 (25K)* | 4.24 | 4.41 | 82.27 | 0.32 | 1, 80 |
| SENTP GE18 (25K)^ | NA | NA | NA | NA | NA |
| SPACY (80K) | 4.19 | 4.07 | 58.55 | 0.35 | 2, 20 |

*GE '17 data pre-processing steps

  • Clean dirty characters
  • Remove repeated occurrences (e.g. !!!, ----, …) or substitute them with a single occurrence.
  • URLs are coded as <url> and e-mails as <email>.
  • Any @mentions of Deutsche Bahn are coded as <dbahn> and all other @mentions are coded as @mention.
  • Emojis and emoticons are coded as <e> Description </e> as recommended by @mkardas (a rough sketch of these steps follows this list).
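A rough sketch of what these steps could look like with regular expressions (the exact patterns, and the Deutsche Bahn handles matched, in the linked notebooks may differ):

```python
import re

def preprocess_ge17(text):
    # collapse repeated punctuation, e.g. "!!!" -> "!", "----" -> "-"
    text = re.sub(r"([!?.\-])\1+", r"\1", text)
    # URLs and e-mail addresses
    text = re.sub(r"https?://\S+", "<url>", text)
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)
    # @mentions: Deutsche Bahn accounts get their own token, all others a generic one
    text = re.sub(r"@DB_Bahn\b", "<dbahn>", text, flags=re.IGNORECASE)  # handle used here is an example
    text = re.sub(r"@\w+", "@mention", text)
    # emoji would additionally be wrapped in <e> ... </e> as in the sketch earlier in the thread
    return text

print(preprocess_ge17("Danke @DB_Bahn!!! Mehr unter https://example.com @jemand"))
# Danke <dbahn>! Mehr unter <url> @mention
```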

^GE '18 data pre-processing steps

  • Clean dirty characters
  • Remove repeated occurrences (e.g. !!!, ----, …) or substitute them with a single occurrence.
  • @mentions are chosen based on a frequency count. All @mentions below frequency 10 are simply coded as @mention.
  • Emojis were kept as they were, because no visible improvements were seen from using the encodings in GE '17. Moreover, the tokenization method should be able to treat emoji as separate Unicode entities, and the language model should be able to model the occurrence of emoji just as well as any other word/character. One possible refinement would be to space-pad continuous emoji such as :facepunch::fire: into :facepunch: :fire: and to substitute double+ occurrences such as :face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth: with a single :face_with_symbols_over_mouth: (see the sketch below).
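If one did want to normalize emoji runs like that, a small sketch (the character ranges below are a simplification; a dedicated emoji regex or the emoji package would be more robust):

```python
import re

# very rough emoji character class -- real emoji span more Unicode ranges than this
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"

def normalize_emoji_runs(text):
    text = re.sub(f"({EMOJI})\\1+", r"\1", text)           # 🤬🤬🤬 -> 🤬
    text = re.sub(f"({EMOJI})(?={EMOJI})", r"\1 ", text)   # 👊🔥 -> 👊 🔥
    return text

print(normalize_emoji_runs("Super 👊🔥🔥🔥"))
# Super 👊 🔥
```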

Performance on downstream task

  1. GermEval '17 sentiment classification task (SPACY 80K, pre-trained on the Wikipedia corpus): accuracy on validation: 77.89%

These are my numbers so far. EDIT: I will upload my language model, fine-tuned model and the datasets. The language model can be downloaded from here, and the pre-processing scripts can be found here and here* for others to experiment with.

Key observation: in terms of LM performance, the vanilla spacy tokenization method seems to work better in practice than the sentencepiece implementation. I have not been able to train the SPM-based classifier yet, but I’ll try to get those numbers by tomorrow as well – my guess is that it will not be better than the vanilla implementation.


*@piotr.czapla: sorry for committing directly into your repository; I wanted to commit to my fork, but only later did I remember that I had write access to n-waves/ulmfit4de. Please let me know if you would like me to revert the commit.


Thank you for committing the notebooks, nice preprocessing. I’m now working on GermEval '18 and use similar preprocessing; the biggest difference is that I encode all mentions, frequent or not. Great idea to leave the popular ones untouched. I didn’t encode emails (and it seems neither did you :wink:, https://github.com/n-waves/ulmfit4de/blob/master/kernels/germeval17-prep.ipynb?short_path=2591b1f#L260 should be _re3?)

It’s hard to tell whether encoding emoji as text helps. In my experiments on GE18 it resulted in a test macro-F1 increase of 1 pp, but the results of the experiments vary a lot (the training set is almost 20x smaller than in GE17).

Interesting observation; how do you measure LM performance?


It should be _re3, thanks for pointing it out. It must have happened at some later point, because all my pre-processed data is already encoded correctly.

By “performance” here, I meant the perplexity score on the fine-tuned data – which would naturally affect whatever downstream task we choose to do with the LM.
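(For reference, the PPX column in the tables above is just the exponential of the validation cross-entropy loss:)

```python
import math

# e.g. the SENTP GE17 rows: exp(3.96) on wiki, exp(4.41) after fine-tuning
print(math.exp(3.96), math.exp(4.41))  # ~52.5 and ~82.3, matching the PPX column
```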

As you can see in the table, the perplexity of the sentencepiece-trained model (SENTP GE17) actually got worse after fine-tuning, which leads me to conclude that the learned sentencepiece tokens failed to capture important “language characteristics” of the secondary (in this case the GermEval '17) dataset.

This could also be a flaw in my implementation: the sentencepieces are trained on the GE '17 corpus and then the pre-training is carried out on the Wikipedia dataset with the GE '17 learned tokens. A second way of sentencepiecing could be as in the SENTP DEWIKI experiment, where the sentencepieces are learned from the Wikipedia dataset.
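A minimal sketch of the two variants with the sentencepiece Python package (file names and the 25k vocabulary size are placeholders matching the table above):

```python
import sentencepiece as spm

# SENTP GE17: learn the subword vocabulary on the GermEval '17 text,
# then use it to tokenize both the Wikipedia pre-training data and GE '17
spm.SentencePieceTrainer.Train("--input=ge17_train.txt --model_prefix=sp_ge17 --vocab_size=25000")

# SENTP DEWIKI: learn the pieces on the Wikipedia dump instead
spm.SentencePieceTrainer.Train("--input=dewiki.txt --model_prefix=sp_dewiki --vocab_size=25000")

sp = spm.SentencePieceProcessor()
sp.Load("sp_ge17.model")
print(sp.EncodeAsPieces("Die Deutsche Bahn ist mal wieder zu spät."))
```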

I agree, this is something that still needs to be ascertained. Maybe @MicPie will be able to give us some insights.


You mean 52.45 -> 82.27? I see that Piotrek had a similar increase (https://github.com/n-waves/ulmfit4de/issues/4), but in those experiments the sentencepiece model was trained on wiki and there was no preprocessing, so the LM had to predict links, user names, etc. The spacy result (75.19 -> 58.55) is promising, but for the complete picture we would need the out-of-vocabulary rate (i.e., number of unknown tokens / all tokens) for both the wiki and GE17 datasets.
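Something like this would do for the rate I mean (a sketch; the token lists and the vocabulary set are whatever comes out of the tokenization):

```python
def oov_rate(tokens, vocab):
    # fraction of tokens that the LM has to map to the unknown token
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

# hypothetical usage, with vocab as a set built from the 25k/80k itos entries:
# print(oov_rate(wiki_valid_tokens, vocab), oov_rate(ge17_tokens, vocab))
```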

Great summary of your experiments, thank you!

Let’s update the table above! Can you give me the accuracy for datasets 1 and 2?

@aayush Let’s try to tackle GermEval 2018. @mkardas achieved SOTA as far as I can tell, but maybe with this nice preprocessing we can tackle it as well.

Btw, re sentencepiece: have you used the same preprocessing, and what values did you get?

Sure, I’ll run the tests and confirm.

I’m a little caught up at work, but I will do my best.

Yes, the pre-processing steps are common to all experiments on a given dataset. I’m not sure I understand the question, though. What do you mean by “values”?


By “values” I meant the perplexity and the accuracy you managed to achieve using sentencepiece.

The score for the model corresponds to the SENTP GE17 experiment in the table I posted. The perplexity is 52.45 and the accuracy is 0.33*. I haven’t fine-tuned on the GE '18 dataset as yet, mainly because the results weren’t very good on GE '17.

* I’m considering the perplexity score on the validation set.

I created and published a German topic classification dataset based on ten thousand German news articles categorized into nine classes. I thought this might be interesting for someone looking here.

I trained a German LM, fine-tuned it and built a classifier on top, which reaches 89% test accuracy. Additionally I compared the low-shot learning part of the ULMFiT paper to fastText, a linear SVM and a TensorFlow NN. I’ll post the results here in the following weeks.

@tblock, let me know how it is going. Btw, are you sure you can include the scraped text in CSV format? It might be better to include just the links and the code to fetch the articles from the websites; otherwise your repo should have a license: non-commercial research only.

@piotr.czapla I’ll keep you updated. I’m finishing my thesis about it at the moment.

Regarding the licensing, please check out https://github.com/tblock/10kGNAD for more details on the dataset. I didn’t scrape the news articles; they are extracted from the One Million Posts Corpus. I detail the license in the project readme and on the project page. But thanks for the heads-up!


Hi @tblock,
I am also doing similar work and I am using the dataset from your repo.
Unfortunately, the dataset seems to have more than one delimiter in certain rows.
Is “;” the delimiter? If yes, then there is more than one in certain rows.
Is my understanding right?

Yes, that’s correct.

I tried to keep changes from the original source to a minimum. Some texts can contain one or even multiple “;”. However, this should not be a problem, since texts containing separators are quoted in the typical Pythonic manner.

See the code folder for examples using the Python csv lib.
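Roughly, with the standard csv module it looks like this (a minimal sketch; the file name, column order and quoting here are assumptions, the examples in the code folder show the exact dialect):

```python
import csv

with open("train.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        label, text = row[0], row[1]  # assumed column order
        print(label, text[:60])
        break
```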

Hey there,

I’m quite new to this field and I’m wondering if there are already language models for German that I could use. So if I’m considering doing a new project on text classification, what would you recommend?

Going through the ULMFiT steps myself?

Thanks in advance!