ULMFIT - Punjabi

I'm starting this thread to share progress on the Punjabi language model (LM) and classification results. @piotr.czapla @Moody

Repository: NLP for Punjabi

Datasets:

  • Download the Wikipedia Articles Dataset (44,000 articles), which I scraped, cleaned, and trained the model on, from here
  • Check out the BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on, here

Results:

Perplexity of Language Model: ~13 (on 20% validation set)

Kappa Score of classification model: ~60

Accuracy of classification model: 89%

The above classification results were obtained on a validation set that had ~84% negatives and ~16% positives.
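For readers comparing the perplexity figure with a training log, here is a minimal sketch of the usual relationship between perplexity and validation loss. It assumes the loss is the mean per-token cross-entropy in nats, which is the common convention but is not confirmed in this thread; the function name `perplexity` is only illustrative.

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy.

    Assumption: the loss is measured in nats (natural log), not bits.
    """
    return math.exp(mean_cross_entropy_nats)


# Under that assumption, a perplexity of ~13 corresponds to a loss of ln(13).
loss_for_ppl_13 = math.log(13)
print(f"loss = {loss_for_ppl_13:.3f}, perplexity = {perplexity(loss_for_ppl_13):.1f}")
```

Under this convention, a perplexity of ~13 corresponds to a validation loss of roughly 2.56.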

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the classifier from here

Tokenizer

Trained unsupervised using Google’s SentencePiece

Download the trained model and vocabulary from here


Hi Gaurav, good work! What accuracy did you get? I want to compare the results to the LASER results on MLDoc.

Accuracy would have been the wrong metric for the above dataset, as it was highly imbalanced, with

114 Positive Examples
670 Negative Examples

Hence, I calculated the Kappa score (~49) and didn’t calculate accuracy.
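To illustrate why accuracy is misleading with these class counts, here is a minimal sketch that computes accuracy and Cohen's kappa from a binary confusion matrix. The helper names are hypothetical, and the "always predict negative" classifier is an assumed baseline for illustration, not the model from this thread. The kappa values quoted in the thread appear to be on a 0 to 100 scale, whereas this sketch returns values on the conventional -1 to 1 scale.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement between predictions and labels, corrected for chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement: product of the label and prediction marginals, per class.
    p_pos = ((tp + fn) / total) * ((tp + fp) / total)
    p_neg = ((tn + fp) / total) * ((tn + fn) / total)
    expected = p_pos + p_neg
    if expected == 1.0:
        return 0.0  # degenerate case: only one class in both labels and predictions
    return (observed - expected) / (1.0 - expected)


# Hypothetical baseline that predicts "negative" for all 114 + 670 examples.
baseline_acc = accuracy(tp=0, fp=0, fn=114, tn=670)
baseline_kappa = cohen_kappa(tp=0, fp=0, fn=114, tn=670)
print(f"accuracy = {baseline_acc:.3f}, kappa = {baseline_kappa:.3f}")
```

A classifier that never predicts the positive class still gets about 85% accuracy here, while its kappa is 0, which is why kappa is the more informative number on this split.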

I see. We use accuracy for our evaluations, though, so maybe you could cut out a balanced test set?
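One way to cut out such a balanced set is to downsample every class to the size of the smallest one. The sketch below assumes simple random downsampling with a fixed seed; the function `balanced_subset` and the placeholder document IDs are hypothetical and stand in for the real examples.

```python
import random


def balanced_subset(examples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)
    n_per_class = min(len(items) for items in by_label.values())
    subset = []
    for label, items in by_label.items():
        for example in rng.sample(items, n_per_class):
            subset.append((example, label))
    rng.shuffle(subset)  # avoid having all examples of one class first
    return subset


# Placeholder data with the class counts mentioned in this thread.
labels = [1] * 114 + [0] * 670
examples = [f"doc_{i}" for i in range(len(labels))]
subset = balanced_subset(examples, labels)
print(len(subset))
```

With 114 positives and 670 negatives this leaves 228 examples, which is small, so accuracy measured on it will be noisy.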

Do you have any similar corpus that contains sentences with sentiment? It does not have to have labels; tweets would be fine, or product reviews / comments.
If so, you could fine-tune the LM on that dataset and you should get much better results. It would be interesting to see how much you can improve.

Yes, sure, I’ll do this and report back.

Unfortunately no. :frowning: But I’ll check again if I can get/scrape a balanced/better dataset from somewhere!


Hey, I’ve updated the notebook and GitHub repo to reflect that the above classification results [89% accuracy and ~60 kappa score] were obtained on a validation set that had ~84% negatives and ~16% positives. Do you think that would help ensure reproducibility?