ULMFiT - German

Yeah, we assume that emoji are escaped with <e> and </e> tags (or <emoji> and </emoji> to avoid confusion). So a sentence

Ich liebe das :heart:.

could be encoded as

Ich liebe das <emoji> rotes Herz </emoji>.

It is hard to tell at the moment whether such an encoding will be helpful; it mostly depends on how Twitter users use emoji.
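Roughly, such an encoding can be produced with a simple lookup and substitution. A minimal sketch (the tiny emoji → German name table is only a placeholder for a full emoji description list):

```python
import re

# placeholder lookup table: emoji character -> German short name
EMOJI_NAMES = {"❤": "rotes Herz", "😉": "zwinkerndes Gesicht"}
EMOJI_RE = re.compile("|".join(map(re.escape, EMOJI_NAMES)))

def encode_emoji(text):
    # wrap every known emoji in <emoji> ... </emoji> tags
    return EMOJI_RE.sub(lambda m: f"<emoji> {EMOJI_NAMES[m.group(0)]} </emoji>", text)

print(encode_emoji("Ich liebe das ❤."))
# Ich liebe das <emoji> rotes Herz </emoji>.
```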


FYI, a user needs trust level 3 to be able to
“make their own posts wiki (that is, editable by any TL1+ users)”,
according to https://blog.discourse.org/2018/06/understanding-discourse-trust-levels/

Dear Marcin,

the first and the second list look very good.

Were you already able to get the first list with jQuery?
(The approx. 2.7k emoji are only so many because their list also contains every variant of each emoji, e.g. the different color versions. Without those it is not that many. :wink: )

I am playing around with BeautifulSoup and Python, more or less successfully, to get the list from the last link.

If you already have the first list in Python, I would guess we wouldn’t need the other lists on top of it and could just check this one in more detail?

I can have a look at the lists (German is my mother tongue) and then upload them somewhere.

Best regards
Michael

Hi Michael,

thanks for verifying the lists and implementing an import script. The jQuery code was broken because I formatted it here using a block quote instead of a code block, and as a result the apostrophes were changed into Unicode quotes ‘’. It’s now edited and works (you can paste it into a web console), but the great thing is that we don’t need it. After your message I started checking the license of the first list and found the source at https://unicode.org/repos/cldr/trunk/common/annotations with many more languages :slight_smile: And it’s all in XML. I’m going to test if it helps with classification and I’ll keep you informed.
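In case it helps, a minimal sketch of pulling the German names out of such an annotations file (assuming de.xml from that repository has been downloaded locally; as far as I can tell, the entries with type="tts" carry the short emoji names):

```python
import xml.etree.ElementTree as ET

# de.xml: the German annotations file from the CLDR repository linked above
tree = ET.parse("de.xml")

emoji_names = {}
for node in tree.getroot().iter("annotation"):
    # type="tts" entries hold the short name (e.g. "rotes Herz"),
    # the others hold "|"-separated keywords
    if node.get("type") == "tts":
        emoji_names[node.get("cp")] = node.text.strip()

print(len(emoji_names), emoji_names.get("❤"))
```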

Good to know that you’re a native German speaker. Have you already played with ULMFiT, i.e., predicting the end of a sentence based on its beginning (Language model text generation giving weird results in a consistent manner)? I was fiddling with a Polish language model and I’m surprised how good it is at declension and conjugation, maintaining even long-distance dependencies.

Hi Marcin,

that XML-file is great!
If you have results let us/me know. :slight_smile:

So far I just played around with the ULMFiT RNN with non-language sequence data.
But I am still trying to get a setup which trains in a reasonable time.
I have to see how I will proceed with this, as I will need more GPU power than I have currently available.

Best regards
Michael

Model parameters: bptt=70, em_sz=300, nh=1150, nl=3

Training parameters: lr=1.2e-2, bs=64, use_clr_beta=(10, 10, 0.95, 0.85/0.95), wd=3e-6, clip=0.2
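For anyone less familiar with the abbreviations, a rough sketch of the network shape these parameters describe, in plain PyTorch (only an illustration; the actual fastai AWD-LSTM additionally uses weight tying, the various dropouts, etc.):

```python
import torch
import torch.nn as nn

bptt, em_sz, nh, nl = 70, 300, 1150, 3
vocab_size = 25000  # 25k for the sentencepiece runs, 80k for the spacy runs

class SketchLM(nn.Module):
    # embedding -> 3-layer LSTM -> decoder over the vocabulary
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, em_sz)
        self.rnn = nn.LSTM(em_sz, nh, num_layers=nl)
        self.dec = nn.Linear(nh, vocab_size)

    def forward(self, x):  # x: (bptt, batch) token ids
        out, _ = self.rnn(self.emb(x))
        return self.dec(out)  # next-token logits at every position

model = SketchLM()
# "clip=0.2" refers to gradient clipping, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), 0.2)
```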

GERMAN WIKIPEDIA (PRE-TRAINING)

| Exp. | Train loss | Val loss | PPX | Acc | N cyc., CL |
| --- | --- | --- | --- | --- | --- |
| SENTP DEWIKI (25K) | 5.57 | 4.72 | 112.17 | 0.28 | 1, 12 |
| SENTP GE17 (25K)* | 5.26 | 3.96 | 52.45 | 0.33 | 1, 12 |
| SENTP GE18 (25K)^ | | | | | |
| SPACY (80K) | 5.07 | 4.32 | 75.19 | 0.34 | 1, 12 |

GERMEVAL ‘17 (LM)

| Exp. | Train loss | Val loss | PPX | Acc | N cyc., CL |
| --- | --- | --- | --- | --- | --- |
| SENTP DEWIKI (25K) | | | | | |
| SENTP GE17 (25K)* | 4.24 | 4.41 | 82.27 | 0.32 | 1, 80 |
| SENTP GE18 (25K)^ | NA | NA | NA | NA | NA |
| SPACY (80K) | 4.19 | 4.07 | 58.55 | 0.35 | 2, 20 |

*GE '17 data pre-processing steps

  • Clean dirty characters
  • Remove repeated occurrences (e.g. !!!, ----, …) or substitute them with a single occurrence.
  • URLs are coded as <url> and e-mails as <email>.
  • Any @mentions of Deutsche Bahn are coded as <dbahn> and all other @mentions are coded as @mention.
  • Emojis and emoticons are coded as <e> Description </e> as recommended by @mkardas (a rough sketch of these steps follows this list).
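A rough sketch of what these steps could look like with regular expressions (the exact patterns, and the Deutsche Bahn handles matched, in the linked notebooks may differ):

```python
import re

def preprocess_ge17(text):
    # collapse repeated punctuation, e.g. "!!!" -> "!", "----" -> "-"
    text = re.sub(r"([!?.\-])\1+", r"\1", text)
    # URLs and e-mail addresses
    text = re.sub(r"https?://\S+", "<url>", text)
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)
    # @mentions: Deutsche Bahn accounts get their own token, all others a generic one
    text = re.sub(r"@DB_Bahn\b", "<dbahn>", text, flags=re.IGNORECASE)  # handle used here is an example
    text = re.sub(r"@\w+", "@mention", text)
    # emoji would additionally be wrapped in <e> ... </e> as in the sketch earlier in the thread
    return text

print(preprocess_ge17("Danke @DB_Bahn!!! Mehr unter https://example.com @jemand"))
# Danke <dbahn>! Mehr unter <url> @mention
```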

^GE '18 data pre-processing steps

  • Clean dirty characters
  • Remove repeated occurrences (e.g. !!!, ----, …) or substitute them with a single occurrence.
  • @mentions are chosen based on a frequency count. All @mentions below frequency 10 are simply coded as @mention.
  • Emojis were kept as they were, because no visible improvements were seen from using the encodings in GE '17. Moreover, the tokenization method should be able to treat emoji as separate Unicode entities, and the language model should be able to model the occurrence of emoji just as well as any other word/character. One possible refinement would be to space-pad continuous emoji such as :facepunch::fire: into :facepunch: :fire: and to substitute double+ occurrences such as :face_with_symbols_over_mouth::face_with_symbols_over_mouth::face_with_symbols_over_mouth: with a single :face_with_symbols_over_mouth: (see the sketch below).
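If one did want to normalize emoji runs like that, a small sketch (the character ranges below are a simplification; a dedicated emoji regex or the emoji package would be more robust):

```python
import re

# very rough emoji character class -- real emoji span more Unicode ranges than this
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"

def normalize_emoji_runs(text):
    text = re.sub(f"({EMOJI})\\1+", r"\1", text)           # 🤬🤬🤬 -> 🤬
    text = re.sub(f"({EMOJI})(?={EMOJI})", r"\1 ", text)   # 👊🔥 -> 👊 🔥
    return text

print(normalize_emoji_runs("Super 👊🔥🔥🔥"))
# Super 👊 🔥
```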

Performance on downstream task

  1. GermEval '17 sentiment classification task (SPACY 80K, pre-trained on the Wikipedia corpus): accuracy on validation: 77.89%

These are my numbers so far. EDIT: I will upload my language model, fine-tuned model and the datasets. The language model can be downloaded from here, and the pre-processing scripts can be found here and here* for others to experiment with.

Key observation: in terms of LM performance, the vanilla spacy tokenization method seems to work better in practice than the sentencepiece implementation. I have not been able to train the SPM-based classifier yet, but I’ll try to get those numbers by tomorrow as well – my guess is that it will not be better than the vanilla implementation.


*@piotr.czapla: sorry for committing directly into your repository; I wanted to commit to my fork, but only later did I remember that I had write access to n-waves/ulmfit4de. Please let me know if you would like me to revert the commit.


Thank you for committing the notebooks, nice preprocessing. I’m now working on GermEval '18 and use similar preprocessing; the biggest difference is that I encode all mentions, frequent or not. Great idea to leave the popular ones untouched. I didn’t encode emails (and it seems neither did you :wink:, https://github.com/n-waves/ulmfit4de/blob/master/kernels/germeval17-prep.ipynb?short_path=2591b1f#L260 should be _re3?)

It’s hard to tell whether encoding emoji as text helps. In my experiments on GE18 it resulted in a test macro-F1 increase of 1 pp, but the results of the experiments vary a lot (the training set is almost 20x smaller than in GE17).

Interesting observation; how do you measure LM performance?


It should be _re3, thanks for pointing it out. It must have happened at some later point, because all my pre-processed data is already encoded correctly.

By “performance” here, I meant the perplexity score on the fine-tuned data – which would naturally affect whatever downstream task we choose to do with the LM.
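(For reference, the PPX column in the tables above is just the exponential of the validation cross-entropy loss:)

```python
import math

# e.g. the SENTP GE17 rows: exp(3.96) on wiki, exp(4.41) after fine-tuning
print(math.exp(3.96), math.exp(4.41))  # ~52.5 and ~82.3, matching the PPX column
```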

As you can see in the table, the perplexity of the sentencepiece-trained model (SENTP GE17) actually got worse after fine-tuning, which leads me to conclude that the learned sentencepiece tokens failed to capture important “language characteristics” of the secondary (in this case the GermEval '17) dataset.

This could also be a flaw in my implementation: the sentencepieces are trained on the GE '17 corpus and then the pre-training is carried out on the Wikipedia dataset with the GE '17 learned tokens. A second way of sentencepiecing could be as in the SENTP DEWIKI experiment, where the sentencepieces are learned from the Wikipedia dataset.
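A minimal sketch of the two variants with the sentencepiece Python package (file names and the 25k vocabulary size are placeholders matching the table above):

```python
import sentencepiece as spm

# SENTP GE17: learn the subword vocabulary on the GermEval '17 text,
# then use it to tokenize both the Wikipedia pre-training data and GE '17
spm.SentencePieceTrainer.Train("--input=ge17_train.txt --model_prefix=sp_ge17 --vocab_size=25000")

# SENTP DEWIKI: learn the pieces on the Wikipedia dump instead
spm.SentencePieceTrainer.Train("--input=dewiki.txt --model_prefix=sp_dewiki --vocab_size=25000")

sp = spm.SentencePieceProcessor()
sp.Load("sp_ge17.model")
print(sp.EncodeAsPieces("Die Deutsche Bahn ist mal wieder zu spät."))
```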

I agree, this is something that still needs to be ascertained. Maybe @MicPie will be able to give us some insights.


You mean 52.45 -> 82.27? I see that Piotrek had a similar increase (https://github.com/n-waves/ulmfit4de/issues/4), but in those experiments the sentencepiece model was trained on wiki and there was no preprocessing, so the LM had to predict links, user names, etc. The spacy result (75.19 -> 58.55) is promising, but for the complete picture we would need the out-of-vocabulary rate (i.e., number of unknown tokens / all tokens) for both the wiki and GE17 datasets.
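Something like this would do for the rate I mean (a sketch; the token lists and the vocabulary set are whatever comes out of the tokenization):

```python
def oov_rate(tokens, vocab):
    # fraction of tokens that the LM has to map to the unknown token
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

# hypothetical usage, with vocab as a set built from the 25k/80k itos entries:
# print(oov_rate(wiki_valid_tokens, vocab), oov_rate(ge17_tokens, vocab))
```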

Great summary of your experiments, thank you!

Let’s update the table above! Can you give me the accuracy for datasets 1 and 2?

@aayush Let’s try to tackle GermEval 2018. @mkardas achieved SOTA as far as I can tell, but maybe with this nice preprocessing we can tackle it as well.

Btw, re sentencepiece: have you used the same preprocessing, and what values did you get?

Sure, I’ll run the tests and confirm.

I’m a little caught up at work, but I will do my best.

Yes, the pre-processing steps are common to all experiments on a given dataset. I’m not sure I understand the question, though. What do you mean by “values”?


By “values” I meant the perplexity and the accuracy you managed to achieve using sentencepiece.

The score for the model corresponds to the SENTP GE17 experiment in the table I posted. The perplexity is 52.45 and the accuracy is 0.33*. I haven’t fine-tuned on the GE '18 dataset as yet, mainly because the results weren’t very good on GE '17.

* I’m considering the perplexity score on the validation set.

I created and published a German topic classification dataset based on ten thousand German news articles categorized into nine classes. I thought this might be interesting for someone looking here.

I trained a German LM, fine-tuned it and built a classifier on top, which reaches 89% test accuracy. Additionally I compared the low-shot learning part of the ULMFiT paper to fastText, a linear SVM and a TensorFlow NN. I’ll post the results here in the following weeks.

@tblock, let me know how it is going. Btw, are you sure you can include the scraped text in CSV format? It might be better to include just the links and the code to fetch the articles from the websites; otherwise your repo should have a license: non-commercial research only.

@piotr.czapla I’ll keep you updated. I’m finishing my thesis about it at the moment.

Regarding the licensing, please check out https://github.com/tblock/10kGNAD for more details on the dataset. I didn’t scrape the news articles; they are extracted from the One Million Posts Corpus. I detail the license in the project readme and on the project page. But thanks for the heads-up!


Hi @tblock,
I am also doing similar work and I am using the dataset from your repo.
Unfortunately, the dataset seems to have more than one delimiter in certain rows.
Is “;” the delimiter? If yes, then there is more than one in certain rows.
Is my understanding right?

Yes, that’s correct.

I tried to keep changes from the original source to a minimum. Some texts can contain one or even multiple “;”. However, this should not be a problem, since texts containing separators are quoted in the typical Pythonic manner.

See the code folder for examples using the Python csv lib.
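Roughly, with the standard csv module it looks like this (a minimal sketch; the file name, column order and quoting here are assumptions, the examples in the code folder show the exact dialect):

```python
import csv

with open("train.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        label, text = row[0], row[1]  # assumed column order
        print(label, text[:60])
        break
```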

Hey there,

I’m quite new to this field and I’m wondering if there are already language models for German that I could use. So if I’m considering doing a new project on text classification, what would you recommend?

Going through the ULMFiT steps myself?

Thanks in advance!