Model Parameters: bptt=70, em_sz=300, nh=1150, nl=3
Training Parameters: lr=1.2e-2, bs=64, use_clr_beta=(10, 10, 0.95, 0.85/0.95), wd=3e-6, clip=0.2
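For reference, these abbreviations follow the usual fastai/ULMFiT naming. A minimal sketch that just collects them in Python (the dict names and the 1cycle reading of `use_clr_beta` are my assumptions, not part of the original scripts):

```python
# Hyperparameters as reported above, with the usual ULMFiT/AWD-LSTM meaning in comments.
model_params = dict(
    bptt=70,    # backprop-through-time sequence length
    em_sz=300,  # embedding size
    nh=1150,    # hidden units per LSTM layer
    nl=3,       # number of stacked LSTM layers
)

training_params = dict(
    lr=1.2e-2,                          # peak learning rate
    bs=64,                              # batch size
    use_clr_beta=(10, 10, 0.95, 0.85),  # 1cycle-style schedule; "0.85/0.95" above means both minimum momenta were tried
    wd=3e-6,                            # weight decay
    clip=0.2,                           # gradient clipping
)
```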
GERMAN WIKIPEDIA (PRE-TRAINING)
| Exp. | Train loss | Val loss | Perplexity | Accuracy | Cycles, cycle len |
|---|---|---|---|---|---|
| SENTP DEWIKI (25K) | 5.57 | 4.72 | 112.17 | 0.28 | 1, 12 |
| SENTP GE17 (25K)* | 5.26 | 3.96 | 52.45 | 0.33 | 1, 12 |
| SENTP GE18 (25K)^ | – | – | – | – | – |
| SPACY (80K) | 5.07 | 4.32 | 75.19 | 0.34 | 1, 12 |
GERMEVAL '17 (LM)
| Exp. | Train loss | Val loss | Perplexity | Accuracy | Cycles, cycle len |
|---|---|---|---|---|---|
| SENTP DEWIKI (25K) | – | – | – | – | – |
| SENTP GE17 (25K)* | 4.24 | 4.41 | 82.27 | 0.32 | 1, 80 |
| SENTP GE18 (25K)^ | NA | NA | NA | NA | NA |
| SPACY (80K) | 4.19 | 4.07 | 58.55 | 0.35 | 2, 20 |
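For context, the perplexity column in both tables is just the exponential of the validation loss, and it matches the numbers above up to rounding:

```python
import math

# Perplexity = exp(validation cross-entropy loss); compare with the perplexity column above.
for val_loss in (4.72, 3.96, 4.32, 4.41, 4.07):
    print(f"val loss {val_loss:.2f} -> perplexity {math.exp(val_loss):.2f}")
# 4.72 -> 112.17, 3.96 -> 52.46, 4.32 -> 75.19, 4.41 -> 82.27, 4.07 -> 58.56
```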
*GE '17 data pre-processing steps:
- Clean dirty characters.
- Remove repeated characters (e.g. !!! ---- …) or substitute them with a single occurrence.
- URLs are coded as `<url>` and email addresses as `<email>`.
- @mentions of Deutsche Bahn are coded as `<dbahn>`; all other @mentions are coded as `@mention`.
- Emojis and emoticons are coded as `<e> Description </e>`, as recommended by @mkardas.
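For anyone who wants to reproduce roughly the same substitutions, here is a minimal sketch (not the actual pre-processing script linked further down; the regexes, the @DB_Bahn handle and the `emoji` dependency are assumptions):

```python
import re

import emoji  # assumed helper for emoji descriptions; not necessarily what the real script uses

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
DBAHN_RE = re.compile(r"@DB_Bahn\b", re.IGNORECASE)  # assumed Deutsche Bahn handle
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"([!?.\-])\1+")              # collapse runs such as !!! or ----

def preprocess_ge17(text: str) -> str:
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    text = DBAHN_RE.sub("<dbahn>", text)
    text = MENTION_RE.sub("@mention", text)
    text = REPEAT_RE.sub(r"\1", text)
    # code every emoji as "<e> description </e>"
    text = "".join(
        f" <e> {emoji.demojize(ch).strip(':').replace('_', ' ')} </e> " if emoji.is_emoji(ch) else ch
        for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()
```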
^GE '18 data pre-processing steps:
- Clean dirty characters.
- Remove repeated characters (e.g. !!! ---- …) or substitute them with a single occurrence.
- @mentions are kept or replaced based on a frequency count: all @mentions with frequency below 10 are coded as `@mention`.
- Emojis were kept as they were, because no visible improvement was seen from the `<e> ... </e>` encoding used for GE '17. Moreover, the tokenization method should be able to treat an emoji as a separate unicode entity, and the language model should be able to model the occurrence of emojis just as well as any other word/character. One possible refinement would be to space-pad runs of consecutive emojis and to substitute repeated occurrences of the same emoji with a single one (see the sketch below).
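A minimal sketch of the frequency-based @mention coding and of the suggested emoji padding/de-duplication (the threshold of 10 comes from the list above; the function names and the `emoji` dependency are assumptions):

```python
import re
from collections import Counter

import emoji  # assumed helper for detecting emoji characters

MENTION_RE = re.compile(r"@\w+")

def build_mention_vocab(corpus, min_freq=10):
    """Count @mentions over the whole corpus and keep only the frequent ones."""
    counts = Counter(m for text in corpus for m in MENTION_RE.findall(text))
    return {mention for mention, count in counts.items() if count >= min_freq}

def preprocess_ge18(text, mention_vocab):
    # rare @mentions -> generic @mention token
    text = MENTION_RE.sub(
        lambda m: m.group(0) if m.group(0) in mention_vocab else "@mention", text
    )
    # optional refinement discussed above: space-pad emojis and drop immediate repeats
    out, prev = [], None
    for ch in text:
        if emoji.is_emoji(ch):
            if ch == prev:
                continue            # collapse repeated identical emojis
            out.append(f" {ch} ")   # space-pad so consecutive emojis become separate tokens
        else:
            out.append(ch)
        prev = ch
    return re.sub(r"\s+", " ", "".join(out)).strip()
```

Note that `build_mention_vocab` needs one pass over the full training corpus before the per-tweet call to `preprocess_ge18`.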
Performance on downstream task
- GermEval '17 sentiment classification task (SPACY 80K language model pre-trained on the Wikipedia corpus): validation accuracy 77.89%
These are my numbers so far. EDIT: I will upload my language model, the fine-tuned model and the datasets for others to experiment with. The language model can be downloaded from here, and the pre-processing scripts can be found here and here*.
Key observation: in terms of LM performance, the vanilla spaCy tokenization seems to work better in practice than the SentencePiece implementation. I have not been able to train the SPM-based classifier yet, but I'll try to get those numbers by tomorrow as well; my guess is that it will not be better than the vanilla implementation.
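To make the comparison concrete, here is what the two tokenization routes look like side by side (the SentencePiece model path and the example sentence are placeholders, not artefacts from this experiment):

```python
import sentencepiece as spm
import spacy

text = "Die Deutsche Bahn hat heute mal wieder Verspätung!!!"

# SentencePiece subword tokenization -- assumes a 25k model trained on the corpus
sp = spm.SentencePieceProcessor(model_file="de_wiki_25k.model")  # hypothetical model file
print(sp.encode(text, out_type=str))

# spaCy word-level tokenization
nlp = spacy.blank("de")  # tokenizer only, no pretrained pipeline needed
print([tok.text for tok in nlp(text)])
```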
*@piotr.czapla: sorry for committing directly into your repository; I wanted to commit to my fork, but only later remembered that I had write access to n-waves/ulmfit4de. Please let me know if you would like me to revert the commit.