Starting this thread to share progress on the Malayalam LM and classification results @piotr.czapla @Moody
Repository: NLP for Malyalam
Dataset
- Download the Malayalam Wikipedia Articles Dataset (12,388 articles), which I scraped, cleaned and used to train the language model
- Download the Malayalam News Classification Dataset, which I scraped and used to train the classifier
Results
Language Model
On a 30% validation set:
- Perplexity of language model: ~26
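For anyone checking the number against their own training logs: perplexity is usually the exponential of the mean per-token cross-entropy (in nats), so ~26 would correspond to a validation loss of roughly 3.26. A minimal sketch of that conversion, assuming natural-log loss; the function names are hypothetical and not taken from the repository:

```python
import math


def mean_cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the probabilities
    the model assigned to the correct next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(loss_nats):
    """Perplexity is exp of the mean per-token cross-entropy."""
    return math.exp(loss_nats)


# Illustrative only: a model that spreads probability uniformly over
# 26 candidate tokens at every step has a perplexity of 26.
uniform_probs = [1 / 26] * 100
loss = mean_cross_entropy(uniform_probs)
ppl = perplexity(loss)
```

Read this way, a perplexity of ~26 means the model is on average about as uncertain as if it were choosing uniformly among 26 subword tokens.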
Classifier
- Accuracy of classification model: ~94%
- Kappa score of classification model: ~0.91
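For reference, this is how the two classifier metrics relate. Cohen's kappa corrects accuracy for the agreement expected by chance, so it is lower than accuracy when the classes are imbalanced. The sketch below is a plain-Python illustration, not the code used to produce the numbers above, and the labels in the example are made up:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies per class.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1:
        # Only one class on both sides: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, for illustration only.
y_true = ["sports", "sports", "business", "business"]
y_pred = ["sports", "sports", "business", "sports"]
acc = accuracy(y_true, y_pred)       # 0.75
kappa = cohen_kappa(y_true, y_pred)  # 0.5
```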
Pretrained Language Model
Download pretrained Language Model from here
Classifier
Download classifier from here
Tokenizer
The tokenizer was trained using Google’s SentencePiece.
Download the trained model and vocabulary from here
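If you want to train a similar tokenizer yourself or load the downloaded one, something along these lines should work with the `sentencepiece` Python package. This is a rough sketch, not the exact setup used here: the file names, `vocab_size` and `model_type` are placeholder assumptions, so check the repository for the real values.

```python
def train_tokenizer(corpus_path, model_prefix, vocab_size=10000):
    """Train a SentencePiece model on a plain-text corpus (one sentence
    or article per line). vocab_size is an assumed placeholder value."""
    import sentencepiece as spm  # imported here so the helpers can be defined without the package

    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,  # writes <model_prefix>.model and <model_prefix>.vocab
        vocab_size=vocab_size,
        model_type="unigram",       # assumption: SentencePiece's default model type
    )


def tokenize(model_path, text):
    """Load a trained SentencePiece model and split text into subword pieces."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return sp.encode(text, out_type=str)
```

Usage would look like `train_tokenizer("malayalam_wiki.txt", "malayalam_lm")` followed by `tokenize("malayalam_lm.model", some_text)`, where both file names are hypothetical.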