Continuing the discussion from Language Model Zoo:
Starting this thread to share progress on the Hungarian Language Model (aka ‘Lángos’) and its classification results.
repository: github/pmamico/langos
Current results
Wiki-based Language Model
- Perplexity of language model: ~26
Sentiment Classifier
- Accuracy of classification model: ~89%
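For readers comparing these numbers with their own runs, here is a minimal sketch of how the two metrics are usually defined. It assumes perplexity is the exponential of the mean per-token cross-entropy loss in nats, which is the common convention but is not confirmed in this post; the function names are my own and purely illustrative. Under that assumption, a perplexity of ~26 corresponds to a validation loss of about ln(26) ≈ 3.26.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity, assuming the loss is mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy_nats)


def accuracy(predictions, labels) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # A loss of ln(26) gives back a perplexity of 26.
    print(round(perplexity(math.log(26)), 2))
    # Three of four toy predictions match their labels.
    print(accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"]))
```

If the loss were reported in bits instead of nats, the base would be 2 rather than e, so it is worth checking which one your training loop prints.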
Dataset
- Download the Hungarian Wikipedia Articles Dataset (108k articles), which I used to train the language model
- Download the Hungarian Movie Review Dataset (46k clean reviews from port.hu), which I scraped and used to train the classifier
The scraper script: notebook (hu)
Pretrained Language Model
Download pretrained Language Model from here and the vocab here
Classifier
Download classifier from here and the encoder here
Next steps
- Classifier tuning
The goal for sentiment classification is 93% accuracy!
- LM fine-tuning
In the future I plan to fine-tune the Wiki LM on classical literature.
For this task I have 75 books from the BME Corpus Project (see here), and the goal is 500.
If you have any classical books in plain-text UTF-8, they would help me reach that goal!
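If you want to check your files before sending them, the sketch below shows one way to test whether plain-text files decode cleanly as UTF-8. It is only an illustration: the helper names, the `*.txt` pattern, and the `books` folder are assumptions of mine, not part of the Lángos repository.

```python
from pathlib import Path


def is_valid_utf8_bytes(data: bytes) -> bool:
    """Return True if the raw bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def collect_utf8_books(folder):
    """List the .txt files in `folder` (hypothetical layout) that are valid UTF-8."""
    return sorted(
        path
        for path in Path(folder).glob("*.txt")
        if is_valid_utf8_bytes(path.read_bytes())
    )


if __name__ == "__main__":
    # "books" is a placeholder folder name; point it at your own directory.
    for book in collect_utf8_books("books"):
        print(book)
```

Older Hungarian texts are often stored as ISO-8859-2, and accented letters in those files usually fail this check, so a failing file most likely just needs re-encoding.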