Starting this thread to share the progress on the Punjabi LM and classification results @piotr.czapla @Moody
Datasets:
- Download the Wikipedia articles dataset (44,000 articles), which I scraped, cleaned, and trained the model on, from here
- Check out the BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on, here
Results:
Perplexity of the language model: ~13 (on a 20% validation set)
Kappa Score of classification model: ~49
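
For anyone who wants to reproduce or sanity-check these two numbers, here is a minimal sketch of how they are usually computed. It assumes perplexity is defined as the exponential of the mean per-token cross-entropy (in nats) on the validation set, and that the kappa is Cohen's kappa between true and predicted labels. The function names and the toy labels are made up for illustration; this is not the code used for the results above.

```python
import math
from collections import Counter


def perplexity(cross_entropy_nats):
    """Perplexity from mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(cross_entropy_nats)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    # Fraction of items where prediction and truth agree.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Both lists use a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels, not from the BBC Punjabi dataset).
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
print(cohen_kappa(y_true, y_pred))
print(perplexity(math.log(13)))
```

Under that definition, a perplexity of about 13 corresponds to a validation cross-entropy of roughly ln 13 ≈ 2.56 nats per token.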
Pretrained Language Model
Download the pretrained language model from here
Classifier
Download the classifier from here
Tokenizer
Trained unsupervised on the raw text using Google's SentencePiece
Download the trained model and vocabulary from here
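
If you would rather train the tokenizer yourself than download it, a rough sketch with the `sentencepiece` Python package could look like the following. The corpus file name, the model prefix, the vocabulary size of 8000, the unigram model type, and the full character coverage are all assumptions for illustration; the post does not state which settings were actually used.

```python
from pathlib import Path


def spm_train_args(corpus_path, model_prefix="punjabi_spm", vocab_size=8000):
    """Keyword arguments for SentencePieceTrainer.train (all values are assumptions)."""
    return dict(
        input=str(corpus_path),    # plain-text file, one sentence or paragraph per line
        model_prefix=model_prefix, # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,     # assumed; not stated in the post
        model_type="unigram",      # SentencePiece's default algorithm
        character_coverage=1.0,    # keep every Gurmukhi character in the vocabulary
    )


def train_tokenizer(corpus_path, **overrides):
    """Train a SentencePiece model on raw text and return the path to the .model file."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    args = spm_train_args(corpus_path, **overrides)
    spm.SentencePieceTrainer.train(**args)
    return Path(args["model_prefix"] + ".model")


def tokenize(model_path, text):
    """Split text into subword pieces with a trained (or downloaded) model."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=str(model_path))
    return sp.encode(text, out_type=str)
```

The `tokenize` helper should also work with the downloaded model file, for example `tokenize("path/to/downloaded.model", some_text)`, where the path is a placeholder for wherever you saved it.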