ULMFiT - Russian


(Alexey) #1

Hello. I’m working on ULMFiT for the Russian language. I forked https://github.com/n-waves/ulmfit-multilingual and was mostly inspired by @piotr.czapla’s work on Multilingual ULMFiT.

So far:

Benchmark

| Type | Model | Dataset | Metric | Value |
|---|---|---|---|---|
| Language Model | ULMFiT | Russian Wikipedia | Perplexity | 27.11 |
| Classification | NN + FastText | Rusentiment | F1-score | 0.728 |
| Classification | ULMFiT | Rusentiment | F1-score | 0.732 |

Training was performed with standard fastai tokenization.
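To make that concrete, here is a minimal sketch (not taken from my fork; paths and column names are placeholders) of what building and training a language model with fastai v1’s default tokenization roughly looks like:

```python
# Minimal sketch, assuming fastai v1; paths and column names are placeholders.
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Build a language-model DataBunch; fastai v1 applies its default
# tokenization rules and numericalization under the hood.
data_lm = TextLMDataBunch.from_csv('data/ru_wiki', 'train.csv',
                                   text_cols='text', bs=64)

# AWD_LSTM is the architecture ULMFiT uses by default in fastai v1.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.save_encoder('lm_encoder')   # encoder reused later for classification
```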

My fork is https://github.com/ademyanchuk/ulmfit-multilingual. It has all the READMEs from the parent repo, and my experiments are in the experiments folder. This work is on fastai v1. All notebooks are self-explanatory and have some comments. Feel free to ask questions, comment, and offer suggestions.

Also, I would like to mention the previous work:


Multilingual ULMFiT
#2

Great!
It will be interesting to see your results on SentiRuEval-2016. I also trained a Russian language model on Wikipedia and tried to beat the state of the art on it, but did not succeed.


(Alexey) #3

Go ahead)) It might be that I made some silly mistake there and that is why I got such a good result. But at least I couldn’t find any flaw in the code myself.


#4

I mean, I have already conducted the experiment, and it failed :slight_smile: This is the task: https://drive.google.com/drive/folders/0BxlA8wH3PTUfV1F1UTBwVTJPd3c


(Alexey) #5

Sorry, my previous message might have been a bit confusing. I understand that you already did the experiments. I meant that there might be some bugs in my code, and it would be great if someone took a look at it)))


(Alexey) #6

Actually, so far I have only done positive/negative classification using all the data, which is located

, so I will continue my work and try multiclass as in the original task (that was my mistake: I didn’t understand the original task; now I see).


#7

By the way, what perplexity did you manage to achieve?


(Alexey) #8

I’m a bit of a newbie in all of this, but according to Jeremy, given the default loss function for training a language model, we can roughly compute perplexity as exp(valid_loss). If that’s correct, I achieved a perplexity of ~28 for the wiki language model and ~62 when fine-tuning the LM.
Now I’m working on fine-tuning the LM with a much bigger dataset (nearly 2 million tweets). I hope it will be better.
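For reference, the relationship is just a one-liner; the loss value below is a placeholder (exp(3.30) ≈ 27.1, the same ballpark as the numbers above):

```python
# Minimal sketch: perplexity from the validation cross-entropy loss.
import math

valid_loss = 3.30                  # placeholder, e.g. from learn.validate()
perplexity = math.exp(valid_loss)  # default LM loss is per-token cross entropy
print(f'perplexity ~ {perplexity:.2f}')  # exp(3.30) ~ 27.11
```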


#9

That is fine. I am working on a language model based on news media, and I get about 22.4. But newspaper language is more restricted and predictable.

You are correct about the way to calculate perplexity (you can see some reference on LM evaluation here: http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture06-rnnlm.pdf, page 41).


(Alexey) #10

Thank you) By the way, I recently joined a study group for this course (DL in NLP, from the NLP lab at MIPT). In case you are interested, here is the link to join: https://docs.google.com/forms/d/e/1FAIpQLSe_iP5pfx2eKvWOjja_lMNcGZacuAg0d7Q229vxJ_8lFIxZ7A/viewform


(Piotr Czapla) #11

@noisefield, do you have any results for the News Classification and a previous benchmark?
@ademyanchuk, have you finished the classification?

FYI, we’ve tested ULMFiT + sentencepiece (30k vocab) on Russian MLDoc and we have quite encouraging results (better than LASER and the previous baseline for MLDoc).
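For anyone who wants to reproduce the subword setup, here is a minimal sketch of training a 30k-vocab sentencepiece model and tokenizing a sentence with it; it assumes the sentencepiece Python package, and the corpus/model names are placeholders:

```python
# Minimal sketch, assuming the sentencepiece Python package;
# corpus and model names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='ru_wiki.txt',         # plain-text corpus, one sentence per line
    model_prefix='ru_sp30k',     # writes ru_sp30k.model and ru_sp30k.vocab
    vocab_size=30000,            # the 30k vocab mentioned above
    character_coverage=0.9995,   # keep rare Cyrillic characters
)

sp = spm.SentencePieceProcessor(model_file='ru_sp30k.model')
print(sp.encode('Пример токенизации предложения.', out_type=str))
```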


#12

Hi! Could you please provide links to the tasks? I will be happy to try them out. For now, I use it for some personal tasks (and I am quite happy with the results).
EDIT: If you mean MLDoc, I can do that by the end of the week :slight_smile:


(Alexey) #13

@piotr.czapla, I finished with rusentiment and got a result similar to SOTA (even a bit better).
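For context, the classification step in fastai v1 roughly looks like the sketch below; it is not the exact code from my fork, and the file and column names are placeholders:

```python
# Rough sketch of the standard fastai v1 ULMFiT classification step;
# file and column names are placeholders, not the exact code from the fork.
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

data_clas = TextClasDataBunch.from_csv(
    'data/rusentiment', 'train.csv',
    text_cols='text', label_cols='label',  # placeholder column names
    bs=32)  # in practice, pass vocab=data_lm.vocab to reuse the LM vocabulary

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('lm_encoder')  # encoder saved after LM fine-tuning
learn.fit_one_cycle(1, 2e-2)      # then gradually unfreeze, as in the ULMFiT paper
```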


(Piotr Czapla) #14

Alexey, I’m a bit in a rush today; would you be so kind as to make a table showing your results vs. the previous SOTA? It would be awesome if you would also note which tokenization you used.
Here is an excellent example:

Thx


(Alexey) #15

Piotr, I edited the thread’s main post and added a benchmark table there.


#16

Hi!

I have published a language model trained on the newspaper subset of the Taiga corpus. You can get it here:

As mentioned previously, it achieves 21.98 perplexity on a 20-million-token validation set.


Language Model Zoo :gorilla: