ULMFiT - Russian

Hello. I’m working on ULMFiT for the Russian language. I forked from https://github.com/n-waves/ulmfit-multilingual and was mostly inspired by @piotr.czapla's work on Multilingual ULMFiT.

As of now:

Benchmark

| Type | Model | Dataset | Metric | Value |
|------|-------|---------|--------|-------|
| Language Model | ULMFiT | Russian Wikipedia | Perplexity | 27.11 |
| Classification | NN + FastText | RuSentiment | F1-score | 0.728 |
| Classification | ULMFiT | RuSentiment | F1-score | 0.732 |

Training was performed with standard fastai tokenization.
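In case it is useful, here is a minimal sketch of what that standard fastai v1 tokenization produces (the sample sentence and the use of spaCy's 'xx' multi-language model for Russian are my illustrative assumptions):

```python
from fastai.text import Tokenizer

# fastai v1's Tokenizer wraps spaCy and applies the default pre/post rules,
# e.g. replacing a capitalized word with the 'xxmaj' marker plus its
# lowercase form. 'xx' is spaCy's blank multi-language model.
tok = Tokenizer(lang='xx')
tokens = tok.process_all(['Привет, мир! Это пример токенизации.'])
print(tokens[0])  # e.g. ['xxmaj', 'привет', ',', 'мир', '!', ...]
```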

My fork is https://github.com/ademyanchuk/ulmfit-multilingual. It keeps all the READMEs from the parent repo, and my experiments are in the `experiments` folder. This work is built on fastai v1. All notebooks are self-explanatory and include some comments. Feel free to ask questions, comment, and make suggestions.

I would also like to mention the previous work in this area.


Great!
It will be interesting to see your results on SentiRuEval-2016. I also trained a Russian language model on Wikipedia and tried to beat the state of the art on it, but did not succeed.

Go ahead)) It might be that I made some silly mistake there and that is why I got such a good result. But at least I couldn’t find any flaw in the code myself.

I mean that I have already conducted the experiment, and it failed :slight_smile: This is the task: https://drive.google.com/drive/folders/0BxlA8wH3PTUfV1F1UTBwVTJPd3c

Sorry, my previous message might have been a bit confusing. I understand that you already did the experiments. I meant that there might be some bugs in my code and it would be great if someone took a look at it)))

Actually, as of now, I have only done positive/negative classification on all the data from the task above, so I will continue my work and try multiclass classification as in the original task (that was my mistake: I didn’t understand the original task, now I see).

By the way, what perplexity did you manage to achieve?

I’m a bit of a newbie at all this. But according to Jeremy, given the default loss function for training a language model, we can roughly compute perplexity as exp(valid_loss). If that’s correct, I achieved a perplexity of ~28 for the wiki language model and ~62 after fine-tuning the LM.
Now I’m working on fine-tuning the LM on a much bigger dataset (nearly 2 million tweets). I hope it will do better.
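If it helps, the calculation is easy to check; here is a tiny self-contained sketch (the loss value 3.33 is just an illustrative number, not my actual one):

```python
import math

def perplexity_from_loss(valid_loss: float) -> float:
    # With the default cross-entropy loss (natural log base),
    # perplexity is the exponential of the per-token validation loss.
    return math.exp(valid_loss)

# e.g. a validation loss of ~3.33 corresponds to a perplexity of ~28
print(perplexity_from_loss(3.33))  # ~27.9
```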

That is fine. I am working on a language model based on news media, and I get a perplexity of about 22.4. But newspaper language is more restricted and predictable.

You are correct about the way to calculate perplexity (see the reference on LM evaluation here: http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture06-rnnlm.pdf, page 41).

Thank you) By the way, I recently joined a study group (DL in NLP, from the NLP lab at MIPT) for this course. In case you are interested, here is a link to join: https://docs.google.com/forms/d/e/1FAIpQLSe_iP5pfx2eKvWOjja_lMNcGZacuAg0d7Q229vxJ_8lFIxZ7A/viewform


@noisefield, do you have any results for the News Classification and the previous benchmark?
@ademyanchuk, have you finished with the classification?

FYI, we’ve tested ULMFiT + sentencepiece (30k vocab) on the Russian MLDoc and we have quite encouraging results (better than LASER and the previous baseline for MLDoc).
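For anyone who wants to reproduce the subword setup, a 30k sentencepiece vocab can be trained roughly like this (the file names and option values here are illustrative assumptions, not our exact configuration):

```python
import sentencepiece as spm

# Train a 30k-piece subword model on a raw-text corpus;
# 'ru_corpus.txt' is a hypothetical one-sentence-per-line file.
spm.SentencePieceTrainer.train(
    input='ru_corpus.txt',
    model_prefix='ru_sp30k',
    vocab_size=30000,
    model_type='unigram',       # sentencepiece's default algorithm
    character_coverage=0.9995,  # keep coverage high for Cyrillic
)

sp = spm.SentencePieceProcessor(model_file='ru_sp30k.model')
print(sp.encode('Привет, мир!', out_type=str))
```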

Hi! Could you please provide links to the tasks? I will be happy to try them out. As of now, I use it for some personal tasks (and I’m quite happy with the results).
EDIT: If you mean MLDoc, I can do that by the end of the week :slight_smile:

@piotr.czapla, I finished with RuSentiment and got a result similar to the SOTA (even a bit better).

Alexey, I’m a bit in a rush today; would you be so kind as to make a table showing your results vs. the previous SOTA? It would be awesome if you would also note what tokenization you used.
Here is an excellent example:

Thx

Piotr, I edited the thread’s main post and added the benchmark table there.

Hi!

I have published a language model trained on the newspaper subset of the Taiga corpus. You can get it here:

As mentioned previously, it achieves a perplexity of 21.98 on a 20-million-token validation set.


Does anyone still have an active link for the RuSentiment dataset? It looks like all the ones mentioned here are dead now :frowning:

It seems access was suspended due to a request from VKontakte.
Quote: “Access to the data is temporarily suspended due to a request from VKontakte.”

Hi.

Sorry, I’m new here.
Is there any way to get your final model to experiment with?

Hi. Unfortunately, no. The final model weights are located on a server to which I no longer have access.
Sorry.