ULMFiT - Gujarati

Starting this thread to share the progress on the Gujarati LM and classification results @piotr.czapla @Moody

Repository: NLP for Gujarati

Dataset

Results

Language Model

On a 30% validation set:

  • Perplexity of language model: ~34
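For anyone comparing this figure with a training log: perplexity is conventionally the exponential of the mean per-token cross-entropy loss, so a perplexity of ~34 would correspond to a validation loss of roughly 3.5. A minimal sketch, assuming the loss is measured in nats (natural logarithm); the function names are illustrative and not taken from the repository:

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: the loss that yields a given perplexity."""
    return math.log(perplexity)

# The ~34 reported in this thread, expressed as a cross-entropy loss.
val_loss = loss_from_perplexity(34.0)
print(round(val_loss, 2))  # 3.53
```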

Classifier

  • Accuracy of classification model: ~91%
  • Kappa score of classification model: ~0.85
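For reference, here is how the two metrics relate, assuming the kappa reported here is Cohen's kappa (the thread does not say which variant was used). Kappa corrects the observed accuracy for the agreement expected by chance given the class frequencies. The labels below are toy data made up for illustration, not the Gujarati dataset:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected).

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both label lists contain a single identical class.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both pick the same class independently.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical three-class example.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]
print(round(accuracy(y_true, y_pred), 2))     # 0.83
print(round(cohen_kappa(y_true, y_pred), 2))  # 0.75
```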

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download classifier from here

Tokenizer

Trained tokenizer using Google’s sentencepiece

Download the trained model and vocabulary from here

Hey, how do I increase the accuracy here? I have seen the Thai model by @cstorm125 performing really well. I am trying to classify data in Gujarati script into three classes. I do not have the resources to train my own language model, so I am using your pretrained model for now.

I think more data for classification would be helpful.

https://colab.research.google.com/gist/AParasramka/15d698c26080fca4e5b8eb16c5a7d423/gujarati_classification_model.ipynb
I cannot understand why my classifier’s accuracy decreases as the number of epochs increases. Can you help me out? @disisbig

@aditya8952 Looks like you’re overfitting! Your training loss is going down while your validation loss is increasing with more epochs. Your classification dataset is too small. Increase the size of your dataset, or use regularization techniques (maybe increase dropout). If you can’t do either, train for only 1 epoch, because that leaves little room for overfitting.
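The symptom described above (validation loss rising while training loss keeps falling) can be checked mechanically from the per-epoch validation losses. The sketch below is a simple form of early stopping, which is a related technique rather than something proposed in this thread: it returns the epoch after which validation loss stopped improving. The function name and the loss values are hypothetical, made up for illustration:

```python
def best_epoch(val_losses, patience=1):
    """Return the 0-based index of the best epoch under early stopping.

    Scanning stops once `patience` consecutive epochs have passed
    without a new lowest validation loss.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_idx = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            # No improvement for `patience` epochs: stop scanning.
            break
    return best_idx

# Hypothetical run: validation loss bottoms out at the second epoch.
val_losses = [0.62, 0.55, 0.58, 0.64, 0.71]
print(best_epoch(val_losses, patience=2))  # 1
```

In practice you would keep the weights saved at that epoch instead of the final ones. A larger `patience` tolerates noisy validation curves at the cost of extra training time.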

I did that, but how do I increase my accuracy? Have you used fasttext? With the same amount of data it shows much better accuracy, but I can’t understand why. @disisbig

@aditya8952 Can you share the accuracy numbers and notebook where you’re using fasttext?