I’m working on implementing ULMFiT for the Serbian language using fastai v1.
This post is intended as a part of the Language Model Zoo.
A peculiarity of Serbian is that it has two official scripts (Latin and Cyrillic) and two accents: see the Serbian Wikipedia explanation.
Datasets
- WT103: downloaded and prepared with the prepare_wiki.sh script, language code sr
- SerbMR: a curated, balanced movie review dataset used for fine-tuning and classification, from here: http://vukbatanovic.github.io/SerbMR/
I used fastai v1 to train the LM. Current progress can be tracked in this fork, in the notebooks under the experiments directory.
[WIP] Results
Perplexity on a 60k vocabulary after 3 epochs of training: ~73.5
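
For anyone comparing numbers: a minimal sketch of how a perplexity figure relates to the loss shown during training. It assumes perplexity is the exponential of the mean per-token cross-entropy loss in nats, which is the usual convention. The post does not say exactly how the figure was computed, and the function names here are just illustrative.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity from a mean per-token cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_loss)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse of perplexity(): the loss that corresponds to a perplexity."""
    return math.log(ppl)


# Assumption: the reported ~73.5 is exp(validation loss).
reported_ppl = 73.5
val_loss = loss_from_perplexity(reported_ppl)
print(f"validation loss ~ {val_loss:.2f} nats per token")
print(f"round trip perplexity ~ {perplexity(val_loss):.1f}")
```

Under that assumption, a perplexity of ~73.5 corresponds to a validation loss of roughly 4.3.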