ULMFIT - Punjabi

(Gaurav) #1

Starting this thread to share the progress on the Punjabi LM and classification results @piotr.czapla @Moody


  • Download the Wikipedia articles dataset (44,000 articles), which I scraped, cleaned, and trained the model on, from here
  • Check out the BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on, from here


Perplexity of the language model: ~13 (on a 20% validation set)
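For anyone reproducing this: perplexity is just the exponential of the mean per-token cross-entropy loss on the validation set. A minimal sketch (the loss value below is hypothetical, chosen so it maps to the ~13 reported above):

```python
import math

def perplexity(avg_nll: float) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)
    measured on held-out text."""
    return math.exp(avg_nll)

# Hypothetical validation loss; ~2.565 nats/token corresponds
# to a perplexity of ~13.
val_loss = 2.565
print(round(perplexity(val_loss), 1))  # → 13.0
```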

Kappa score of the classification model: ~49

Pretrained Language Model

Download the pretrained language model from here


Download the classifier from here


Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here

Language Model Zoo :gorilla:
(Piotr Czapla) #2

Hi Gaurav, good work! What accuracy did you get? I want to compare the results to the LASER results on MLDoc.

(Gaurav) #3

Accuracy would have been the wrong metric for the above dataset, as it is highly imbalanced, with

114 positive examples
670 negative examples

Hence, I calculated the Kappa score (~49) and didn’t calculate accuracy.
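The thread doesn’t show how the Kappa score was computed, but for a binary classifier it’s Cohen’s kappa from the confusion matrix: (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch; the confusion-matrix counts below are made up, only the 114/670 class totals come from the post:

```python
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from binary confusion-matrix counts
    (tp/fn: true positives/negatives misclassified, etc.)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal label/prediction frequencies.
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts consistent with 114 positive / 670 negative examples.
kappa = cohens_kappa(tp=70, fp=60, fn=44, tn=610)
print(round(kappa, 2))  # → 0.5
```

Note why kappa is the right choice here: a degenerate classifier that predicts "negative" for everything scores ~85% accuracy on this split but exactly 0 kappa, since it agrees with the labels no better than chance.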

(Piotr Czapla) #4

I see. Although we use accuracy for our evaluations, maybe you can cut out a balanced test data set?
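Cutting out a balanced test set amounts to undersampling the majority class. A small sketch under that assumption (the function name and the example lists are hypothetical; only the 114/670 counts come from the thread):

```python
import random

def balanced_subset(pos, neg, seed=42):
    """Undersample the larger class so both classes have equal size,
    making plain accuracy a meaningful metric on the result."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = min(len(pos), len(neg))
    return rng.sample(pos, k), rng.sample(neg, k)

# Hypothetical example mirroring the 114/670 split above.
pos = [f"pos_{i}" for i in range(114)]
neg = [f"neg_{i}" for i in range(670)]
bal_pos, bal_neg = balanced_subset(pos, neg)
print(len(bal_pos), len(bal_neg))  # → 114 114
```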

Do you have any similar corpus with sentiment-bearing sentences? It does not have to have labels; tweets would be fine, or product reviews/comments.
If so, you could fine-tune the LM on that data set, and you should get much better results. It would be interesting to see how much you can improve.

(Gaurav) #5

Yes, sure. I’ll do this and report back.

Unfortunately, no. :frowning: But I’ll check again to see whether I can get/scrape a balanced/better dataset from somewhere!