Starting this thread to share the progress on the Sanskrit LM and classification results @piotr.czapla @Moody
Repository: NLP for Sanskrit
Dataset
* Download the Sanskrit Wikipedia Articles Dataset (22,273 articles), which I scraped, cleaned, and used to train the language model
* Download the Sanskrit Shlokas Dataset, which I scraped and used to train the classifier
Results
Language Model
* Perplexity of the language model: ~6 (on a 30% validation split)
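For anyone comparing against this number: perplexity is the exponential of the mean per-token cross-entropy loss. Below is a minimal sketch, assuming the validation loss is a natural-log cross-entropy averaged over tokens. The function name and the sample losses are hypothetical and not taken from the repository.

```python
import math


def perplexity(token_losses):
    """Return exp(mean negative log-likelihood), with losses given in nats."""
    if not token_losses:
        raise ValueError("need at least one per-token loss")
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Hypothetical per-token losses, for illustration only
example_losses = [1.7, 1.9, 1.8]
print(f"perplexity: {perplexity(example_losses):.2f}")
```

A perplexity of ~6 corresponds to a mean loss of about ln 6 ≈ 1.79 nats per token. If the LM uses the SentencePiece tokenizer mentioned below, that figure is per subword token rather than per word.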
Classifier
* Accuracy of classification model: ~70%
* Kappa score of classification model: ~0.56 (~56 on a 0-100 scale)
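To make the two classifier metrics easier to interpret side by side, here is a rough sketch of how a kappa score relates to plain accuracy, assuming the figure reported is Cohen's kappa (agreement corrected for chance). The function name and the toy labels are hypothetical and are not the actual classifier outputs.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement is just the accuracy
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if labels and predictions were independent
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case: a single class everywhere, treated here as perfect agreement
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, for illustration only
labels = [0, 0, 1, 1]
predictions = [0, 0, 1, 0]
print(cohen_kappa(labels, predictions))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, which is why the two numbers differ.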
Pretrained Language Model
Download the pretrained language model from here
Classifier
Download the trained classifier from here
Tokenizer
The tokenizer was trained using Google’s SentencePiece
Download the trained model and vocabulary from here
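For readers who want to train a similar tokenizer on their own corpus, here is a rough sketch using the `sentencepiece` Python package. Everything specific in it is an assumption: the file names, the vocabulary size, the unigram model type, and the character coverage are illustrative defaults, not the settings used for the downloadable model.

```python
def build_trainer_args(corpus_path, model_prefix="sanskrit_sp", vocab_size=8000):
    """Keyword arguments for SentencePieceTrainer.train (all values are assumptions)."""
    return {
        "input": corpus_path,          # plain-text file, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",
        "character_coverage": 1.0,     # keep every character seen in the corpus
    }


def train_and_load(corpus_path, **overrides):
    """Train a tokenizer on corpus_path and return the loaded processor."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    args = build_trainer_args(corpus_path, **overrides)
    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=args["model_prefix"] + ".model")
```

With a corpus file at a hypothetical path such as `sanskrit_wiki.txt`, calling `sp = train_and_load("sanskrit_wiki.txt")` and then `sp.encode(text, out_type=str)` should return the subword pieces for a given string.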