Hello. I’m working on ULMFiT for the Russian language. I forked https://github.com/n-waves/ulmfit-multilingual and was mostly inspired by @piotr.czapla’s work on Multilingual ULMFiT.
So far I have:
- pretrained a language model on a 100M Russian Wikipedia dump
- fine-tuned the LM and experimented on the ruSentEval task (http://www.dialog-21.ru/evaluation/2016/sentiment/) with different data sample sizes, reaching an F1 score of ~0.98 (which probably beats the benchmark), though only for positive vs. negative (data from http://study.mokoron.com/)
- worked on the “Rusentiment” classification task, which is multiclass and noisier than the previous one. After some experiments I managed to replicate the SOTA (~0.73 F1 score). A rough sketch of the training pipeline follows this list.
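For anyone who wants to reproduce the general flow, here is a minimal sketch of the fastai v1 ULMFiT steps (pretrain LM → fine-tune LM → train classifier with gradual unfreezing). The file names, column names, and hyperparameters below are illustrative placeholders, not the exact settings from my notebooks:

```python
from fastai.text import *

# 1) Pretrain an LM on the Russian Wikipedia dump, from scratch (pretrained=False).
#    'wiki.csv' and the 'text' column are placeholder names.
data_lm = TextLMDataBunch.from_csv(Path('data'), 'wiki.csv', text_cols='text')
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm.fit_one_cycle(10, 1e-2)
lm.save_encoder('ru_wiki_enc')

# 2) Fine-tune the LM on target-domain texts (same vocab!), save the encoder again.
data_ft = TextLMDataBunch.from_csv(Path('data'), 'rusentiment.csv',
                                   text_cols='text', vocab=data_lm.vocab)
lm = language_model_learner(data_ft, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm.load_encoder('ru_wiki_enc')
lm.fit_one_cycle(5, 1e-3)
lm.save_encoder('ru_ft_enc')

# 3) Train the classifier on top of the fine-tuned encoder,
#    unfreezing gradually as in the ULMFiT paper.
data_clas = TextClasDataBunch.from_csv(Path('data'), 'rusentiment.csv',
                                       text_cols='text', label_cols='label',
                                       vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder('ru_ft_enc')
clf.fit_one_cycle(1, 2e-2)
clf.freeze_to(-2)
clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
clf.unfreeze()
clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```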
Benchmark
| Type | Model | Dataset | Metric | Value |
|---|---|---|---|---|
| Language Model | ULMFiT | Russian Wikipedia | Perplexity | 27.11 |
| Classification | NN + FastText | Rusentiment | F1-score | 0.728 |
| Classification | ULMFiT | Rusentiment | F1-score | 0.732 |
Training was performed with standard fastai tokenization.
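To be concrete, “standard fastai tokenization” means the default fastai v1 `Tokenizer`: spaCy word splitting plus fastai’s pre/post rules that insert special markers. A tiny illustration (the `lang='ru'` choice and the sample sentence are just for demonstration, and spaCy’s Russian extras such as pymorphy2 need to be installed for it to run):

```python
from fastai.text import Tokenizer

# Default fastai v1 tokenizer: spaCy word splitting plus fastai's rules
# (replace_rep, replace_all_caps -> xxup, deal_caps -> xxmaj, ...).
tok = Tokenizer(lang='ru')  # lang='ru' is an assumption; requires spaCy Russian support
print(tok.process_all(["Отличный фильм, ОЧЕНЬ рекомендую!"]))
# roughly: [['xxmaj', 'отличный', 'фильм', ',', 'xxup', 'очень', 'рекомендую', '!']]
# (the xxbos marker is added later, when the DataBunch processes the texts)
```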
My fork is https://github.com/ademyanchuk/ulmfit-multilingual. It keeps all the READMEs from the parent repo, and my experiments live in the experiments folder. This work is built on fastai v1. All notebooks are self-explanatory and include some comments. Feel free to ask questions, comment, and offer suggestions.
Also, I would like to mention previous work: