Starting this thread to share the progress on the Gujarati LM and classification results @piotr.czapla @Moody
Repository: NLP for Gujarati
Results on a 30% held-out validation set:
- Perplexity of language model: ~34
- Accuracy of classification model: ~91%
- Kappa score of classification model: ~85
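For anyone who wants to reproduce these three numbers on their own split, here is a minimal sketch in plain Python. It assumes perplexity is taken as the exponential of the mean cross-entropy loss (in nats), which is the usual convention but is not confirmed in this thread, and that the kappa reported is Cohen's kappa. The function names and the toy labels are made up for illustration.

```python
import math
from collections import Counter


def perplexity(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy loss (assumed to be in nats)."""
    return math.exp(mean_cross_entropy)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if predictions were independent of the true labels.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: only one class on both sides
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels, purely illustrative (not the Gujarati dataset).
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
print(accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
print(perplexity(math.log(34)))
```

Kappa computed this way lies between -1 and 1, so a value written as ~85 would correspond to roughly 0.85 on this scale.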
Pretrained Language Model
Download pretrained Language Model from here
Download classifier from here
Trained tokenizer using Google’s sentencepiece
Download the trained model and vocabulary from here
Hey, how do I increase the accuracy here? I have seen that the Thai model by @cstorm125 performs really well. I am trying to classify text in Gujarati script into three classes. I do not have the resources to train my own language model, so I am using your pretrained model for now.
I think more data for classification would be helpful.
@aditya8952 Looks like you’re overfitting! Your training loss is going down while your validation loss increases with more epochs. Your classification dataset is too small. Increase the size of your dataset or use regularization techniques (maybe increase dropout). If you can’t do either, train for only 1 epoch, because then there’s much less chance of overfitting.
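A rough way to act on this without guessing the number of epochs is to track validation loss per epoch and stop once it has not improved for a few epochs in a row. The sketch below is framework-agnostic plain Python. The `stop_point` helper, the `patience` value, and the loss numbers are all hypothetical and only illustrate the pattern described above (validation loss rising after an early minimum).

```python
def stop_point(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    best_epoch is the index of the lowest validation loss seen before stopping;
    stop_epoch is the index at which training would be halted because the loss
    failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran through every epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses that fall, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.65, 0.70, 0.80, 0.95]
best, stopped = stop_point(val_losses, patience=2)
print(best, stopped)
```

In practice you would keep the weights saved at `best_epoch` instead of the final ones.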
I did that, but how do I increase my accuracy? Have you used fastText? With the same amount of data it gives much better accuracy, but I can’t understand why. @disisbig
@aditya8952 Can you share the accuracy numbers and notebook where you’re using fasttext?