ULMFiT - Hindi


(Gaurav) #1

Starting this thread to share the progress on the Hindi LM and classification results @piotr.czapla @Moody

Dataset

Download the Hindi Wikipedia Articles Dataset (55,000 articles), which I scraped and cleaned and on which I trained the model, from here.

Note: There are more than 1.5 lakh (150,000) Hindi Wikipedia articles, whose URLs you can find in the pickled object in the above folder. I chose to work with only 55,000 articles because of computational constraints.

Get the Hindi Movie Reviews Dataset, which I scraped and cleaned and on which I trained the classification model, from the repository path datasets-preparation/hindi-movie-review-dataset

Thanks to nirantk for the BBC Hindi News dataset

Results

Perplexity of Language Model: ~36 (on 20% validation set)

Accuracy of Movie Review classification model: ~53%

Kappa Score of Movie Review classification model: ~30

Accuracy of BBC News classification model: ~79%

Kappa Score of BBC News classification model: ~72
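For reference on how such numbers are typically computed (a minimal sketch, not code from the repo; the loss value and label arrays below are illustrative): perplexity is the exponential of the mean cross-entropy loss on the validation set, and accuracy/kappa come from scikit-learn.

```python
import math
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Perplexity = exp(mean cross-entropy loss on the validation set).
val_loss = 3.58                      # illustrative; exp(3.58) ~ 36
perplexity = math.exp(val_loss)

# Classifier metrics from true vs. predicted labels (toy arrays).
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 0]
print(f"perplexity: {perplexity:.1f}")
print(f"accuracy:   {accuracy_score(y_true, y_pred):.1%}")
print(f"kappa:      {cohen_kappa_score(y_true, y_pred):.2f}")
```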

Note: nirantk has done previous SOTA work on a Hindi language model, achieving a perplexity of ~46. I have achieved a better perplexity (~36), but these scores aren't directly comparable, because he used the Hindi Wikipedia dumps for training, whereas I scraped 55,000 articles and cleaned them through the scripts in datasets-preparation. One big reason I feel my results should be better, though, is that I'm using sentencepiece for unsupervised tokenization, whereas nirantk was using spacy.

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download Movie Review classifier from here

Download BBC News classifier from here

Tokenizer

Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here
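For readers who want to reproduce the tokenizer, this is roughly what training a sentencepiece model looks like (a minimal sketch; the input file name and the 30k vocabulary size are assumptions, not taken from the repo):

```python
import sentencepiece as spm

# Train an unsupervised subword tokenizer on the cleaned articles
# (plain text, one sentence/article per line).
spm.SentencePieceTrainer.Train(
    '--input=hindi_wiki.txt '        # assumed file of cleaned text
    '--model_prefix=hindi_sp '
    '--vocab_size=30000 '
    '--model_type=unigram'
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load('hindi_sp.model')
print(sp.EncodeAsPieces('यह एक उदाहरण वाक्य है'))   # subword pieces
print(sp.EncodeAsIds('यह एक उदाहरण वाक्य है'))      # integer ids
```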


(Dhairya Patel) #2

I am not able to find the Hindi movie review dataset… can you please share the link to that?


(Gaurav) #3

You can find it here: https://github.com/goru001/nlp-for-hindi/tree/master/datasets-preparation/hindi-movie-review-dataset


(Dhairya Patel) #4

Thanks!!


(Piotr Czapla) #5

Gaurav, the accuracy of 53% is quite low (close to random for 2-class classification). Do you know why that is? I had similar problems with IMDb when the LM was trained on sentences instead of full Wikipedia articles.

Would you like to try to retrain your models using ulmfit-multilingual?


(Gaurav) #6

The movie review classification dataset has 3 classes [Positive, Neutral, Negative], not 2. I settled for an accuracy of 53% (which is better than random for 3 classes) because the dataset had only

335 Positive Examples
270 Neutral Examples
293 Negative Examples

which I thought were too few to give higher accuracy.
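(For context on "better than random": with this class balance, always predicting the majority class gives about 37%, so 53% is clearly above chance. A quick sketch of that arithmetic, using the counts above:)

```python
counts = {'positive': 335, 'neutral': 270, 'negative': 293}
total = sum(counts.values())                       # 898 examples
baseline = max(counts.values()) / total            # always predict 'positive'
print(f"majority-class baseline: {baseline:.1%}")  # ~37.3%
```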

But I can try with ulmfit-multilingual as well to see if it gives better results. One question though: what do you mean when you say

when lm was trained on sentences instead of full wikipedia articles

According to my understanding, for training the LM we concatenate all Wikipedia articles, break them into chunks of batch size, and then train. So where do sentences fit into this?
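(For concreteness, here is a minimal sketch of that concatenate-and-chunk scheme as I understand it; the token ids are toy values, not code from either repo:)

```python
import numpy as np

# Toy token-id sequences standing in for tokenized Wikipedia articles.
articles = [[5, 8, 2, 9], [7, 3, 4, 6, 1], [2, 2, 5]]
stream = np.concatenate([np.asarray(a) for a in articles])

# Reshape the stream so each of the batch_size rows is one
# contiguous slice of text.
batch_size = 4
n = len(stream) // batch_size
data = stream[:n * batch_size].reshape(batch_size, -1)

# A training batch is a window of bptt consecutive columns;
# the targets are the same window shifted right by one token.
bptt = 2
x, y = data[:, :bptt], data[:, 1:bptt + 1]
```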


(Piotr Czapla) #7

Gaurav, yeah, that is right. To be exact, the previous version of ULMFiT shuffled all training examples and then concatenated them together. The problem starts when the examples are just sentences: once they are shuffled, the language model still learns, but it has terrible accuracy on the IMDb dataset.


(Gaurav) #8

I see! That's weird though; it shouldn't happen, right?

Also, I don't see that kind of problem here, do you? I'll check with ulmfit-multilingual though, and see what accuracy I get with it.


(Piotr Czapla) #9

Here is how to add a dataset:

  • data/<dataset_name>/<lang>-<size+other info>/

    • <lang>.train.csv
    • <lang>.test.csv
    • <lang>.unsup.csv
    • <lang>.dev.csv
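A minimal sketch of creating that layout for the Hindi movie-review data (the dataset/size directory names and the toy rows are assumptions, not taken from the repo):

```python
from pathlib import Path
import pandas as pd

root = Path('data/hi-movie-reviews/hi-900')   # <dataset_name>/<lang>-<size>
root.mkdir(parents=True, exist_ok=True)

# Toy label/text rows; the real files would hold the scraped reviews.
toy = pd.DataFrame({'label': [0, 2],
                    'text': ['बहुत बुरी फिल्म', 'शानदार फिल्म']})
for split in ('train', 'test', 'unsup', 'dev'):
    toy.to_csv(root / f'hi.{split}.csv', index=False, header=False)
```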

Then adding a new classifier to work with the scripts would be as easy as adding these 3 lines to configure it:

Then you can check the commands for training here: https://github.com/n-waves/ulmfit-multilingual/blob/master/results/logs/de.md

If you can, try to use sentencepiece with a 30k vocabulary.


(Gaurav) #10

Thanks for that! I’ll get back to you with the results I get with ulmfit-multilingual!


(Piotr Czapla) #11

If you are going to use ulmfit_multilingual:master, then use n-waves/fastai:ulmfit_multilingual, as master does not yet work with fastai/fastai:master


(Piotr Czapla) #12

@disisbig One question: if you are collecting the reviews, why don't you fetch more? Is the website you collected the reviews from really that small?


(Gaurav) #13

Yes, right. No more reviews are available. Even these reviews are a collection from two websites.