ULMFiT - German


(Piotr Czapla) #41

@MicPie
You wanted to participate, and I think we have a really cool way for you to join.

We would like to try escaping emojis to German words. Here is how this looks in English:

The idea is to use a dictionary such as Emojipedia (https://emojipedia.org/), translate the emoji descriptions to German, and replace the emojis with these translations.

So this ":smile:" would be translated to "<e>Lächeln</e>".
That way the model will be able to use its knowledge about words to interpret emojis, and we might get better results.
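A minimal sketch of this escaping step in Python (the dictionary entries below are just illustrative, not a complete list):

```python
# Illustrative emoji -> German description table (not a complete list)
EMOJI_DE = {
    "\U0001F604": "Lächeln",      # smiling face
    "\u2764\uFE0F": "rotes Herz", # red heart
}

def escape_emoji(text, table=EMOJI_DE):
    """Replace each known emoji with its German description wrapped in <e> tags."""
    for emo, name in table.items():
        text = text.replace(emo, f"<e> {name} </e>")
    return text

print(escape_emoji("Ich liebe das \u2764\uFE0F."))  # Ich liebe das <e> rotes Herz </e>.
```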

Let me know what you think.


(Michael) #42

Dear Piotr,

Emojipedia seems to have 2,823 emoji. That sounds like a lot.
Would we need them all?

I would guess 20% of the emoji account for 80% of actual emoji usage?
Can we estimate it from the text data used?

What do you think?


(Piotr Czapla) #43

I’m more for testing multiple vocab sizes. In PolEval, a larger vocab did not correspond to better accuracy. In fact, we decided to go with a 25k vocab size as it performed better than 50k.

That changes a lot, thank you for the explanation!

We haven’t tested vanilla ULMFiT that much. With SentencePiece, vocab sizes of 20k-25k worked best. Attached are our experiments (up_low means that we did not normalize case; most_low means that we lowercased the first letter of each word and left the other letters untouched).

Definitely just get the most popular emoji translated. You can extract all emojis from the text and then just translate the ones that are most frequent.
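Extracting and counting the emojis could be sketched like this (the Unicode ranges below are a rough approximation, not a complete emoji definition):

```python
import re
from collections import Counter

# Rough character class over common emoji codepoint ranges (approximate)
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols
    "\u2600-\u27BF"          # misc symbols & dingbats
    "]"
)

def emoji_counts(texts):
    """Count emoji occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(EMOJI_RE.findall(text))
    return counts

tweets = ["Ich liebe das \U0001F600\U0001F600", "Super \U0001F525!"]
print(emoji_counts(tweets).most_common())
```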


(Leon Dummer) #44

Wouldn’t it possibly be better to process the emojis simply as emojis, because the weight of their meaning can be stronger than that of a similar word? Or shouldn’t they at least be tagged with a token marking that the word was transferred from an emoji?


(Marcin Kardas) #45

3k emoji doesn’t sound like a lot :wink: I’ve found a few lists with names in German; if you speak German well, you can select one (or find a better list) that is most appropriate for our task:

Using some jQuery kung fu you can export such a list to a Python dict. For example, to get the first list it would be something like:

var x = {};
// Build a dict from two columns of each table row (emoji -> German name)
$('#unicode_code_points tbody tr').each(
    function(i, tr){
        x[tr.children[1].textContent.trim()] = tr.children[2].textContent.trim();
    });
console.log(JSON.stringify(x));

The dict can then be used to escape emoji as a preprocessing step.


(Marcin Kardas) #46

Yeah, we assume that emoji are escaped with <e> and </e> tags (or <emoji> and </emoji> to avoid confusion). So a sentence

Ich liebe das :heart:.

could be encoded as

Ich liebe das <emoji> rotes Herz </emoji>.

It is hard to tell at the moment if such an encoding will be helpful, it mostly depends on emoji usage by twitter users.


(Piotr Czapla) #47

Fyi, user needs trust level 3 to be able to
“* make their own posts wiki (that is, editable by any TL1+ users)”
according to https://blog.discourse.org/2018/06/understanding-discourse-trust-levels/


(Michael) #48

Dear Marcin,

the first and the second lists look very good.

Could you already get the first list with jQuery?
(The approx. 2.7k emoji are so many because the list includes every variant of each emoji, like different color versions. Without those it is not so many. :wink: )

I am playing around with BeautifulSoup and Python, more or less successfully, to get the list from the last link.

If you already have the first list in Python, I would guess we wouldn’t need the other lists on top and could check this one in more detail?

I can have a look at the lists (German is my mother tongue) and then upload the result somewhere.

Best regards
Michael


(Marcin Kardas) #49

Hi Michael,

thanks for verifying the lists and implementing an import script. The jQuery code was broken because I formatted it here using a block quote instead of a code block, and as a result the apostrophes were changed into Unicode quotes ‘’. It’s now edited and works (you can paste it into a web console), but the great thing is that we don’t need it. After your message I started checking the license of the first list and found the source at https://unicode.org/repos/cldr/trunk/common/annotations with many more languages :slight_smile: And it’s all in XML. I’m going to test if it helps with classification and I’ll keep you informed.
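For reference, the CLDR annotation files are simple XML that could be parsed along these lines (the fragment below is a hand-made sample in that format; treat the exact tag and attribute names as an assumption and check them against the real files):

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the CLDR annotations format (the real de.xml at
# unicode.org contains thousands of entries)
CLDR_SAMPLE = """<ldml>
  <annotations>
    <annotation cp="\U0001F600">Gesicht | grinsen</annotation>
    <annotation cp="\U0001F600" type="tts">grinsendes Gesicht</annotation>
  </annotations>
</ldml>"""

def cldr_emoji_names(xml_text):
    """Map each emoji to its short name (the type="tts" annotation)."""
    root = ET.fromstring(xml_text)
    return {a.get("cp"): a.text.strip()
            for a in root.iter("annotation")
            if a.get("type") == "tts"}

print(cldr_emoji_names(CLDR_SAMPLE))
```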

Good to know that you're a native German speaker. Have you already played with ULMFiT, i.e., predicting the end of a sentence based on its beginning (Language model text generation giving weird results in a consistent manner)? I was fiddling with the Polish language model and I’m surprised how good it is at declension and conjugation, maintaining even long-distance dependencies.


(Michael) #50

Hi Marcin,

that XML-file is great!
If you have results let us/me know. :slight_smile:

So far I just played around with the ULMFiT RNN with non-language sequence data.
But I am still trying to get a setup which trains in a reasonable time.
I have to see how I will proceed with this, as I will need more GPU power than I have currently available.

Best regards
Michael


#51

Model Parameters: bptt 70, em_sz 300, nh 1150, nl 3

Training Parameters: lr 1.2e-2, bs 64, use_clr_beta 10, 10, 0.95, 0.85/0.95, wd 3e-6, clip 0.2

GERMAN WIKIPEDIA (PRE-TRAINING)

EXP.                 TRAIN  VAL   PPX     ACC   NCYC., CL
SENTP DEWIKI (25K)   5.57   4.72  112.17  0.28  1, 12
SENTP GE17 (25K)*    5.26   3.96  52.45   0.33  1, 12
SENTP GE18 (25K)^
SPACY (80K)          5.07   4.32  75.19   0.34  1, 12

GERMEVAL ‘17 (LM)

EXP.                 TRAIN  VAL   PPX    ACC   NCYC., CL
SENTP DEWIKI (25K)
SENTP GE17 (25K)*    4.24   4.41  82.27  0.32  1, 80
SENTP GE18 (25K)^    NA     NA    NA     NA    NA
SPACY (80K)          4.19   4.07  58.55  0.35  2, 20

*GE '17 data pre-processing steps

  • Clean dirty characters
  • Remove double+ occurrences (e.g. !!! ---- …) or substitute them with a single occurrence.
  • URLs are coded as <url> and emails as <email>
  • Any @mentions of Deutsche Bahn are coded as <dbahn> and all other @mentions are coded as @mention
  • Emojis and emoticons are coded as <e> Description </e> as recommended by @mkardas
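The double+ removal step above could be sketched with a regex like this (a guess at the intended behaviour, not the actual preprocessing code):

```python
import re

def collapse_repeats(text):
    """Collapse runs of the same punctuation character into one occurrence."""
    return re.sub(r"([!?.\-])\1+", r"\1", text)

print(collapse_repeats("Was soll das!!! ---- ..."))  # Was soll das! - .
```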

^GE '18 data pre-processing steps

  • Clean dirty characters
  • Remove double+ occurrences (e.g. !!! ---- …) or substitute them with a single occurrence.
  • @mentions are chosen based on a frequency count. All @mentions below frequency 10 are simply coded as @mention.
  • Emojis were kept as they were, because no visible improvement was seen from using the encodings in GE '17. Moreover, the tokenization method should be able to characterize emojis as separate Unicode entities, and the language model should be able to model the occurrence of emojis just as well as any other word/character. One possible refinement would be to space-pad continuous emojis, such as :facepunch::fire: into :facepunch: :fire:, and to substitute double+ occurrences such as :face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth: with a single :face_with_symbols_over_mouth:.
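That space-padding and substitution idea could be sketched as follows (the emoji character class is a rough approximation, not a complete emoji definition):

```python
import re

# Rough emoji character class (approximate, not exhaustive)
E = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

def pad_and_dedupe(text):
    """Collapse runs of an identical emoji, then separate adjacent emojis with spaces."""
    text = re.sub(f"({E})\\1+", r"\1", text)      # e.g. six angry faces -> one
    text = re.sub(f"({E})(?={E})", r"\1 ", text)  # e.g. facepunch+fire -> facepunch fire
    return text

print(pad_and_dedupe("\U0001F44A\U0001F525"))  # facepunch, space, fire
```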

Performance on downstream task

  1. GermEval '17 sentiment classification task: (SPACY 80K on Wikipedia corpus) Accuracy on validation: 77.89%

These are my numbers so far. EDIT: I will upload my language model, fine-tuned model, and the datasets. The language model can be downloaded from here, and the pre-processing scripts can be found here and here* for others to experiment with.

Key observation: In terms of LM performance the vanilla spacy tokenization method seems to work better in practice than the sentence-piece implementation. I have not been able to train the SPM based classifier, but I’ll try to get those numbers by tomorrow as well – my guess is that it will not be better than the vanilla implementation.


*@piotr.czapla: sorry for committing directly into your repository, I wanted to commit to my fork but only later did I remember that I had write access to n-waves/ulmfit4de. Please let me know if you would like for me to revert the commit.


(Marcin Kardas) #52

Thank you for committing the notebooks, nice preprocessing. I’m now working on GermEval '18 and use similar preprocessing; the biggest difference is that I encode all mentions, frequent or not. Great idea to leave the popular ones untouched. I didn’t encode emails (and it seems neither did you :wink: (https://github.com/n-waves/ulmfit4de/blob/master/kernels/germeval17-prep.ipynb?short_path=2591b1f#L260 should be _re3?))

It’s hard to tell whether encoding emoji as text helps. In my experiments on GE18 it resulted in a test macro F1 increase of 1 pp, but the results of the experiments vary a lot (the training set is almost 20x smaller than in GE17).

Interesting observation, how do you measure LM performance?


#53

It should be _re3, thanks for pointing it out. It must have happened at some later point, because all my pre-processed data is already encoded correctly.

By “performance” here, I meant the perplexity score on the fine-tuned data – which would naturally affect whatever downstream task we choose to do with the LM.
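For reference, the perplexity numbers in the tables above are just the exponential of the average (natural-log) cross-entropy loss, e.g. the SENTP DEWIKI validation loss of 4.72 gives the reported 112.17:

```python
import math

def perplexity(avg_nll):
    """Perplexity from an average natural-log cross-entropy loss."""
    return math.exp(avg_nll)

print(round(perplexity(4.72), 2))  # 112.17
```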

If you look at the table, the perplexity of the SentencePiece-trained model (SENTP GE17) actually got worse after fine-tuning, which leads me to conclude that the learned SentencePiece tokens failed to capture important “language characteristics” of the secondary (in this case the GermEval '17) dataset.

This could also be a flaw in my implementation: the SentencePiece model is trained on the GE '17 corpus and then the pre-training is carried out on the Wikipedia dataset with the GE '17-learned tokens. A second way of applying SentencePiece would be, as in the SENTP DEWIKI experiment, to learn the pieces from the Wikipedia dataset.

I agree, this is something that still needs to be ascertained. Maybe @MicPie will be able to give us some insights.


(Marcin Kardas) #54

You mean 52.45 -> 82.27? I see that Piotrek had similar increase (https://github.com/n-waves/ulmfit4de/issues/4), but in these experiments the sentencepiece model was trained on wiki and there was no preprocessing, so LM had to predict links, user names, etc. The spacy result (75.19 -> 58.55) is promising, but for the complete picture we would need out-of-vocabulary rate (i.e., number of unknown tokens / all tokens) for both wiki and GE17 datasets.