ULMFIT - Gujarati

disisbig · March 20, 2019, 3:10pm

Starting this thread to share the progress on the Gujarati LM and classification results @piotr.czapla @Moody

Repository: NLP for Gujarati

Dataset

Download the Gujarati Wikipedia Articles Dataset (31,913 articles), which I scraped, cleaned, and used to train the language model
Download the Gujarati News Classification Dataset, which I scraped and used to train the classifier

Results

Language Model

Evaluated on a 30% validation split:

Perplexity of language model: ~34
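
For anyone comparing this number against their own training logs: a minimal sketch of how perplexity relates to validation loss, assuming the loss is the mean per-token cross-entropy in nats (the usual convention, but an assumption here, since the exact metric code is not shown in this post). The function names are illustrative only.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity from mean per-token cross-entropy (assumed to be in nats)."""
    return math.exp(mean_cross_entropy)


def loss_for_perplexity(ppl: float) -> float:
    """Inverse mapping: the mean cross-entropy that yields a given perplexity."""
    return math.log(ppl)


# Under the assumption above, a perplexity of ~34 corresponds to a
# validation loss of roughly 3.53.
val_loss = loss_for_perplexity(34.0)
print(round(val_loss, 2))
print(round(perplexity(val_loss), 1))
```

Perplexity is only comparable between models that share the same tokenization, so a figure computed over sentencepiece tokens should not be compared directly with a word-level one.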

Classifier

Accuracy of classification model: ~91%
Kappa score of classification model: ~85
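
For reference, a small sketch of how accuracy and Cohen's kappa can be computed from predictions. The labels below are made up for illustration and are not taken from the news dataset. Kappa normally lies between -1 and 1, so the ~85 above presumably corresponds to about 0.85 on that scale.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both label the same class at random,
    # given each side's own class frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one class is present")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels for a three-class problem.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]

print(round(accuracy(y_true, y_pred), 3))
print(round(cohen_kappa(y_true, y_pred), 2))
```

Kappa is useful next to accuracy when classes are imbalanced, because a model that always predicts the majority class can score high accuracy but gets a kappa near zero.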

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download classifier from here

Tokenizer

The tokenizer was trained using Google’s SentencePiece.

Download the trained model and vocabulary from here

aditya8952 · June 13, 2019, 11:56am

Hey, how do I increase the accuracy here? I have seen the Thai model by @cstorm125 perform really well. I am trying to classify Gujarati-script text into three classes. I do not have the resources to train my own language model, so I am using your pretrained model for now.

disisbig · June 15, 2019, 1:30pm

I think more data for classification would be helpful.

aditya8952 · June 17, 2019, 10:35am

https://colab.research.google.com/gist/AParasramka/15d698c26080fca4e5b8eb16c5a7d423/gujarati_classification_model.ipynb
I cannot understand why my classifier’s accuracy decreases as the number of epochs increases. Can you help me out? @disisbig

disisbig · June 20, 2019, 2:48pm

@aditya8952 Looks like you’re overfitting! Your training loss is going down while your validation loss increases with more epochs. Your classification dataset is too small. Increase the size of your dataset and use regularization techniques (maybe increase dropout). If you can’t do either, train for only 1 epoch, because that leaves far less room to overfit.
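
A minimal sketch of how to spot this from the per-epoch validation losses and pick a stopping point, written in plain Python rather than with any particular fastai callback. The function name, the `patience` parameter, and the loss values are all hypothetical and are not taken from your notebook.

```python
def early_stop(val_losses, patience=2, min_delta=0.0):
    """Simulate early stopping over a list of per-epoch validation losses.

    Returns (best_epoch, stop_epoch): the 0-based epoch with the lowest
    validation loss seen so far, and the epoch at which training would halt
    after `patience` epochs without an improvement of more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never ran out of patience: training ran to the last epoch.
    return best_epoch, len(val_losses) - 1


# Made-up losses showing the overfitting pattern: training loss keeps
# falling while validation loss turns upward after the second epoch.
train_losses = [0.95, 0.70, 0.50, 0.35, 0.20]
val_losses = [0.90, 0.80, 0.85, 0.95, 1.10]

best, stopped = early_stop(val_losses, patience=2)
print(best, stopped)  # 1 3
```

In practice you would keep the weights saved at the best epoch instead of the ones from the final epoch.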

aditya8952 · June 24, 2019, 9:33am

I did that, but how do I increase my accuracy? Have you used fastText? With the same amount of data it shows much better accuracy, and I can’t understand why. @disisbig

disisbig · June 24, 2019, 10:13am

@aditya8952 Can you share the accuracy numbers and notebook where you’re using fasttext?