ULMFIT - Sanskrit

(Gaurav) #1

Starting this thread to share progress on the Sanskrit language model (LM) and classification results. @piotr.czapla @Moody

Repository: NLP for Sanskrit

Dataset

* Download the Sanskrit Wikipedia Articles Dataset (22,273 articles), which I scraped, cleaned, and used to train the language model

* Download the Sanskrit Shlokas Dataset, which I scraped and used to train the classifier

Results

Language Model

* Perplexity of the language model: ~6 (on a 30% validation split)
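
For readers who want to relate the perplexity figure to a training log, here is a minimal sketch. It assumes perplexity is computed as the exponential of the mean per-token cross-entropy loss (in nats) on the validation set, which is the usual convention but is not stated in this post. The function name and the loss values are hypothetical and are not taken from the Sanskrit model.

```python
import math


def perplexity(token_losses):
    """Return exp(mean per-token cross-entropy), with losses given in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token validation losses, for illustration only.
example_losses = [1.6, 1.9, 1.8, 1.87]
print(f"perplexity: {perplexity(example_losses):.2f}")
```

Under that assumption, a perplexity of about 6 corresponds to a mean validation loss of about ln 6 ≈ 1.79.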

Classifier

* Accuracy of the classification model: ~70%
* Kappa score of the classification model: ~0.56
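
The kappa score complements accuracy by correcting for agreement expected by chance. The sketch below assumes the metric is unweighted Cohen's kappa between true and predicted labels; the post does not say which variant was used. The function name and the labels are hypothetical placeholders, not the actual shloka classes.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where prediction matches the label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case: a single class everywhere, so agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, for illustration only.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
print(f"kappa: {cohen_kappa(y_true, y_pred):.2f}")
```

If scikit-learn is available, `sklearn.metrics.cohen_kappa_score(y_true, y_pred)` computes the same unweighted statistic.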

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the classifier from here

Tokenizer

The tokenizer was trained using Google’s SentencePiece.

Download the trained model and vocabulary from here

