fast.ai Course Forums

ULMFIT - Hungarian

Part 2 & Alumni (2018)

pmamico (Miklós Papp (Guidance Zrt.)) April 26, 2020, 9:49am

Continuing the discussion from Language Model Zoo:

Starting this thread to share progress on the Hungarian Language Model (aka ‘Lángos’) and the classification results.

repository: github/pmamico/langos

Current results

Wiki based Language Model

Perplexity of language model: ~26
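
For anyone comparing this number with a training log: perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats). A minimal sketch, assuming the ~26 above was computed that way (the post does not say how it was measured); the function names are illustrative and not taken from the repository:

```python
import math

def perplexity_from_loss(cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy loss (natural log) to perplexity."""
    return math.exp(cross_entropy_nats)

def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: the loss that corresponds to a given perplexity."""
    return math.log(perplexity)

# Under the assumption above, a perplexity of ~26 corresponds to a
# validation loss of roughly 3.26 nats per token.
approx_val_loss = loss_from_perplexity(26.0)
print(f"{approx_val_loss:.2f}")
```

Perplexity depends on the tokenization and vocabulary, so it is only directly comparable between models that share both.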

Sentiment Classifier

Accuracy of classification model: ~89%

Dataset

Download the Hungarian Wikipedia Articles Dataset (108k articles), which I used to train the language model.
Download the Hungarian Movie Review Dataset, which I scraped and used to train the classifier
(46k clean reviews from port.hu).
The scraper script: notebook (hu)

Pretrained Language Model

Download the pretrained Language Model from here and the vocab here

Classifier

Download the classifier from here and the encoder here

Next steps

Classifier tuning
The goal for sentiment classification is 93% accuracy!
LM fine-tuning
In the future I plan to fine-tune the Wiki LM on classical literature.
For this task I have 75 books from the BME Corpus Project (see here), and the goal is 500.
If you have any classical books in plain-text UTF-8, they would help reach that goal!
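
If you want to check a file before sending it, here is a minimal sketch of a UTF-8 sanity check. The helper name `is_utf8_text` is hypothetical and not part of the repository. Older Hungarian texts are often stored in ISO-8859-2 or Windows-1250 instead, and those usually fail this check as soon as an accented letter appears:

```python
from pathlib import Path

def is_utf8_text(path) -> bool:
    """Return True if the file decodes as UTF-8 and contains no NUL bytes.

    Hypothetical helper for illustration; NUL bytes are treated as a sign
    of binary data (or UTF-16) rather than plain text.
    """
    data = Path(path).read_bytes()
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails can usually be converted by reading it with its real encoding and writing it back out as UTF-8.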