ULMFIT - Serbian

I’m working on implementing ULMFiT for the Serbian language using fastai v1.
This post is intended as part of the Language Model Zoo.

A peculiarity of Serbian is that it has two official scripts (Latin and Cyrillic) and two accents; see the Serbian Wikipedia explanation.

Datasets

  • WT103: downloaded and prepared with the prepare_wiki.sh script, language code sr
  • SerbMR: curated, balanced, movie review dataset used for fine-tuning and classification, from here: http://vukbatanovic.github.io/SerbMR/

I used fastai v1 to train the LM. Current progress can be tracked in this fork, in the notebooks under the experiments directory.

[WIP] Results

Perplexity with a 60k-token vocabulary after 3 epochs of training: ~73.5
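
For anyone comparing against a training log: perplexity is usually reported as the exponential of the mean per-token cross-entropy loss. The post does not say how the 73.5 was computed, so the sketch below assumes that convention, with the loss measured in nats. The function names are made up for illustration. Under that assumption, a perplexity of 73.5 corresponds to a validation loss of about 4.30.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (loss in nats)."""
    return math.exp(cross_entropy_loss)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse of perplexity(): the loss that yields the given perplexity."""
    return math.log(ppl)


# Assumption: the reported 73.5 follows the exp(loss) convention.
val_loss = loss_from_perplexity(73.5)
print(f"loss ~ {val_loss:.2f}, perplexity ~ {perplexity(val_loss):.1f}")
```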

Hi, I just noticed your work. Have you made any progress since? Did you use a Latin or Cyrillic corpus when working on ULMFiT?

From a quick check, prepare_wiki.sh is not meant to convert Latin to Cyrillic or vice versa.
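
If the corpus does turn out to mix both scripts, one option is to normalize everything to Latin before tokenization. Below is a minimal sketch of that idea in Python. It is not part of prepare_wiki.sh, and the name cyr_to_lat is made up for illustration. It assumes the standard one-to-one mapping between Serbian Cyrillic and Gaj's Latin alphabet, where љ, њ and џ become the digraphs lj, nj and dž.

```python
# Serbian Cyrillic -> Latin mapping (lowercase); uppercase is derived in code.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyr_to_lat(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # not Serbian Cyrillic: keep unchanged
        elif ch.isupper():
            nxt = text[i + 1:i + 2]
            prev = text[i - 1] if i > 0 else ""
            # Inside an all-caps word a digraph becomes "LJ"; otherwise "Lj".
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


print(cyr_to_lat("Српски језик"))  # Srpski jezik
print(cyr_to_lat("ЏЕМПЕР"))        # DŽEMPER
```

Going the other way (Latin to Cyrillic) is harder to do with a plain character table, because a Latin sequence such as "nj" is not always the single letter њ (for example in "injekcija"), so normalizing to Latin looks like the simpler direction.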