Continuing the discussion from Language Model Zoo:
I'm starting this thread to share progress on the Hungarian Language Model (aka 'Lángos') and its classification results.
Wiki-based Language Model
- Perplexity of language model: ~26
- Accuracy of classification model: ~89%
- Download Hungarian Wikipedia Articles Dataset (108k articles), which I used to train the language model
- Download Hungarian Movie Review Dataset (46k clean reviews from port.hu), which I scraped and used to train the classifier
The scraper script: notebook (hu)
Pretrained Language Model
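For anyone comparing numbers: perplexity is usually the exponential of the mean per-token cross-entropy loss (in nats). The sketch below assumes the ~26 figure above follows that convention, which I have not confirmed against the training notebook. The function names are made up for illustration and are not part of any library.

```python
import math

def perplexity_from_loss(mean_cross_entropy_nats):
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy_nats)

def perplexity_from_token_probs(probs):
    """Perplexity from the probabilities the model gave to each true next token."""
    mean_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_nll)

# A validation loss of ln(26), roughly 3.26, corresponds to a perplexity of about 26.
loss = math.log(26)
print(round(perplexity_from_loss(loss), 2))
```

Read this way, a perplexity of 26 means the model is on average about as uncertain as if it were choosing uniformly among 26 tokens at each step.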
The goal for sentiment classification is 93% accuracy!
In the future I plan to fine-tune the Wiki LM on classical literature.
For this task I have 75 books from the BME Corpus Project (see here), and the goal is 500.
If you have any classical books in plain-text UTF-8, they would help me reach that goal!