ULMFiT - Serbian

duxan · March 8, 2019, 12:45pm

I’m working on implementing ULMFiT for the Serbian language using fastai v1.
This post is intended as part of the Language Model Zoo.

Two specifics of Serbian are worth noting: it has 2 official scripts (Latin and Cyrillic) and 2 accents; see the Serbian Wikipedia explanation.

WT103: downloaded and prepared with the prepare_wiki.sh script, using language code sr
SerbMR: a curated, balanced movie review dataset used for fine-tuning and classification, available here: http://vukbatanovic.github.io/SerbMR/

I used fastai v1 to train the language model. Current progress can be tracked in this fork, in the notebooks under the experiments directory.

[WIP] Results

Perplexity with a 60k vocabulary after 3 epochs of training: ~73.5
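
For readers comparing this number against a training log: perplexity is usually the exponential of the mean per-token cross-entropy loss. A minimal sketch of the conversion, assuming the loss is measured in nats (natural log), which is what PyTorch's cross-entropy computes; the helper names are made up for illustration:

```python
import math


def perplexity_from_loss(cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: perplexity back to cross-entropy in nats."""
    return math.log(perplexity)


# A perplexity of ~73.5 corresponds to a validation loss of roughly 4.3 nats.
loss = loss_from_perplexity(73.5)
print(f"loss = {loss:.2f} nats")
```

Perplexity values are only comparable between runs that share the same tokenization and vocabulary size, so the 60k vocabulary matters when comparing against other results.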

prosti · December 11, 2019, 6:20pm

Hi, I just noticed your work. Have you made any progress since? Did you use a Latin or a Cyrillic corpus when working on ULMFiT?

From a quick check, prepare_wiki.sh is not meant to convert Latin to Cyrillic or the reverse.