ULMFIT - Punjabi

(Gaurav) #1

Starting this thread to share the progress on the Punjabi LM and classification results @piotr.czapla @Moody


  • Download the Wikipedia articles dataset (44,000 articles), which I scraped, cleaned, and trained the model on, from here
  • Check out the BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on, from here


Perplexity of the language model: ~13 (on a 20% validation set)
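For anyone reproducing this: perplexity is just the exponential of the mean per-token cross-entropy loss on the validation set. A minimal sketch (the loss value below is hypothetical, chosen so it maps to the ~13 reported above):

```python
import math

def perplexity(avg_nll: float) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)
    measured on held-out text."""
    return math.exp(avg_nll)

# Hypothetical validation loss; ~2.565 nats/token corresponds
# to a perplexity of ~13.
val_loss = 2.565
print(round(perplexity(val_loss), 1))  # → 13.0
```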

Kappa score of the classification model: ~49

Pretrained Language Model

Download the pretrained language model from here


Download the classifier from here


Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here

Language Model Zoo :gorilla:
(Piotr Czapla) #2

Hi Gaurav, good work! What accuracy did you get? I want to compare the results to the LASER results on MLDoc.

(Gaurav) #3

Accuracy would have been the wrong metric for the above dataset, as it is highly imbalanced, with

114 positive examples
670 negative examples

Hence, I calculated the Kappa score (~49) and didn’t calculate accuracy.
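The thread doesn’t show how the Kappa score was computed, but for a binary classifier it’s Cohen’s kappa from the confusion matrix: (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch; the confusion-matrix counts below are made up, only the 114/670 class totals come from the post:

```python
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from binary confusion-matrix counts
    (tp/fn: true positives/negatives misclassified, etc.)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal label/prediction frequencies.
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts consistent with 114 positive / 670 negative examples.
kappa = cohens_kappa(tp=70, fp=60, fn=44, tn=610)
print(round(kappa, 2))  # → 0.5
```

Note why kappa is the right choice here: a degenerate classifier that predicts "negative" for everything scores ~85% accuracy on this split but exactly 0 kappa, since it agrees with the labels no better than chance.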

(Piotr Czapla) #4

I see. Although we use accuracy for our evaluations, maybe you can cut out a balanced test data set?
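Cutting out a balanced test set amounts to undersampling the majority class. A small sketch under that assumption (the function name and the example lists are hypothetical; only the 114/670 counts come from the thread):

```python
import random

def balanced_subset(pos, neg, seed=42):
    """Undersample the larger class so both classes have equal size,
    making plain accuracy a meaningful metric on the result."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = min(len(pos), len(neg))
    return rng.sample(pos, k), rng.sample(neg, k)

# Hypothetical example mirroring the 114/670 split above.
pos = [f"pos_{i}" for i in range(114)]
neg = [f"neg_{i}" for i in range(670)]
bal_pos, bal_neg = balanced_subset(pos, neg)
print(len(bal_pos), len(bal_neg))  # → 114 114
```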

Do you have any similar corpus with sentiment-bearing sentences? It does not have to have labels; tweets would be fine, or product reviews/comments.
If so, you could fine-tune the LM on that data set, and you should get much better results. It would be interesting to see how much you can improve.

(Gaurav) #5

Yes, sure. I’ll do this and report back.

Unfortunately, no. :frowning: But I’ll check again to see whether I can get/scrape a balanced/better dataset from somewhere!