ULMFIT - Punjabi


(Gaurav) #1

Starting this thread to share the progress on the Punjabi LM and classification results @piotr.czapla @Moody

Datasets:

  • Wikipedia articles dataset (44,000 articles), which I scraped, cleaned, and trained the model on: download it from here
  • BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on: check it out from here

Results:

Perplexity of Language Model: ~13 (on a 20% validation set)

Kappa Score of classification model: ~49
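For anyone comparing numbers, here is a minimal sketch of how a perplexity figure like the one above relates to the validation loss, assuming the loss is the mean per-token cross-entropy in nats. The function name and the example probabilities are hypothetical and are not taken from the training code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token log-probabilities, for illustration only.
example_log_probs = [math.log(0.10), math.log(0.05), math.log(0.08)]
example_ppl = perplexity(example_log_probs)

# Going the other way: a perplexity of 13 corresponds to a mean
# cross-entropy loss of ln(13) nats per token.
loss_for_ppl_13 = math.log(13)
```

Under that assumption, a perplexity of ~13 corresponds to a validation loss of roughly 2.56.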

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download classifier from here

Tokenizer

Trained unsupervised using Google’s SentencePiece

Download the trained model and vocabulary from here


Language Model Zoo :gorilla:
(Piotr Czapla) #2

Hi Gaurav, good work! What accuracy did you get? I want to compare the results to the LASER results on MLDoc.


(Gaurav) #3

Accuracy would have been the wrong metric for the above dataset, as it is highly unbalanced, with

114 Positive Examples
670 Negative Examples

Hence, I calculated Kappa Score (~49) and didn’t calculate accuracy.
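To illustrate why accuracy is misleading here, below is a small sketch using the class counts above. It is a plain-Python implementation of Cohen's kappa written for this post, not the code used for the reported score, and the "always predict negative" baseline is a hypothetical classifier chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = accuracy(y_true, y_pred)
    # Chance agreement from the marginal label frequencies of each side.
    expected = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if expected == 1.0:
        # Degenerate case (only one label on both sides); treated as 0 here,
        # which is an assumption of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)

# Class counts from the dataset: 114 positive, 670 negative.
y_true = [1] * 114 + [0] * 670

# Hypothetical baseline that always predicts the majority (negative) class.
y_majority = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_majority)   # about 0.855
baseline_kappa = cohen_kappa(y_true, y_majority)   # 0.0
```

A model that never predicts the positive class still gets about 85% accuracy on this data, while its kappa is 0, which is why kappa is the more informative number here.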


(Piotr Czapla) #4

I see. We use accuracy for our evaluations, though, so maybe you could cut out a balanced test set?

Do you have any similar corpus that contains sentences with sentiment? It does not need labels; tweets would be fine, or product reviews / comments.
If so, you could fine-tune the LM on that dataset and you should get much better results. It would be interesting to see how much you can improve.
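A balanced test set like the one suggested above could be cut out by downsampling every class to the size of the smallest one. This is only a sketch: the `balanced_subset` helper, the `(text, label)` tuple layout, and the placeholder texts are assumptions for illustration, with the class counts borrowed from the earlier post.

```python
import random

def balanced_subset(examples, label_of, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = {}
    for ex in examples:
        by_label.setdefault(label_of(ex), []).append(ex)
    smallest = min(len(group) for group in by_label.values())
    subset = []
    for group in by_label.values():
        subset.extend(rng.sample(group, smallest))
    rng.shuffle(subset)
    return subset

# Hypothetical stand-in data: 114 positive and 670 negative examples.
examples = [("text %d" % i, "pos") for i in range(114)]
examples += [("text %d" % i, "neg") for i in range(114, 784)]

balanced = balanced_subset(examples, label_of=lambda ex: ex[1])
```

On the balanced subset, a majority-class baseline scores 50% accuracy, so accuracy becomes comparable across models again. The trade-off is that most of the negative examples are left out of the evaluation.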


(Gaurav) #5

Yes, sure, I’ll do this and report back.

Unfortunately no. :frowning: But I’ll check again if I can get/scrape a balanced/better dataset from somewhere!