ULMFiT - Hindi

Starting this thread to share the progress on the Hindi LM and classification results @piotr.czapla @Moody


Download the Hindi Wikipedia articles dataset (55,000 articles), which I scraped, cleaned and trained the model on, from here.

Note: There are more than 1.5 lakh (150,000) Hindi Wikipedia articles, whose URLs you can find in the pickled object in the above folder. I chose to work with only 55,000 articles because of computational constraints.

Get the Hindi movie reviews dataset, which I scraped, cleaned and trained the classification model on, from the repository path datasets-preparation/hindi-movie-review-dataset

Thanks to nirantk for BBC Hindi News dataset


Perplexity of Language Model: ~36 (on 20% validation set)

Accuracy of Movie Review classification model: ~53

Kappa Score of Movie Review classification model: ~30

Accuracy of BBC News classification model: ~79

Kappa Score of BBC News classification model: ~72

Note: nirantk has done previous SOTA work on a Hindi language model, achieving a perplexity of ~46. I have achieved a better perplexity of ~36, but the scores aren't directly comparable because he trained on the Hindi Wikipedia dumps, whereas I scraped 55,000 articles and cleaned them with the scripts in datasets-preparation. One big reason I believe my results should be better is that I'm using sentencepiece for unsupervised tokenization, whereas nirantk was using spacy.

Pretrained Language Model

Download pretrained Language Model from here


Download Movie Review classifier from here

Download BBC News classifier from here


Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here


I am not able to find the Hindi movie review dataset … can you please share the link to it?

You can find it here: https://github.com/goru001/nlp-for-hindi/tree/master/datasets-preparation/hindi-movie-review-dataset


Gaurav, the accuracy of 53% is quite low (close to random for 2-class classification). Do you know why that is? I had similar problems with IMDb when the LM was trained on sentences instead of full Wikipedia articles.

Would you like to try to retrain your models using ulmfit-multilingual?

The movie review classification dataset has 3 classes [Positive, Neutral, Negative], not 2. I settled for an accuracy of 53% (which is better than random for 3 classes) because the dataset had only

335 Positive Examples
270 Neutral Examples
293 Negative Examples

which I thought was too little data to reach higher accuracy.

But I can try with ulmfit-multilingual as well to see if it gives better results. One question though, what do you mean when you say

when lm was trained on sentences instead of full wikipedia articles

According to my understanding, for training the LM we concatenate all Wikipedia articles and break them into chunks of batch size, and then train. So where do sentences fit into this?
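My mental model of that concatenation step, as a toy sketch (the token ids, batch size and BPTT length below are made up): everything is joined into one stream, split into `bs` parallel rows, and read `bptt` tokens at a time, with targets shifted by one.

```python
# Hypothetical token ids, one list per article.
articles = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]

stream = [tok for art in articles for tok in art]   # concatenate all articles
bs, bptt = 2, 3                                     # batch size, sequence length

n = len(stream) // bs                               # tokens per parallel row
rows = [stream[i * n:(i + 1) * n] for i in range(bs)]

batches = []
for j in range(0, n - bptt, bptt):                  # stop so targets stay in range
    x = [row[j:j + bptt] for row in rows]           # inputs
    y = [row[j + 1:j + bptt + 1] for row in rows]   # targets, shifted by one
    batches.append((x, y))
```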

Gaurav, yeah, that is right. To be exact, the previous version of ULMFiT shuffled all training examples and then concatenated them together. The problem starts when the examples are just sentences: once they are shuffled, the language model still learns, but it has terrible accuracy on the IMDb dataset.

I see! That's weird though; it shouldn't happen, right?

Also, I don't see that kind of problem here, do you? I'll check with ulmfit-multilingual though, and see what accuracy I get with it.

Here is how to add a dataset:

  • data/<dataset_name>/<lang>-<size+other info>/

    • <lang>.train.csv
    • <lang>.test.csv
    • <lang>.unsup.csv
    • <lang>.dev.csv
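A small sketch of generating that layout, assuming the CSVs hold a label column followed by a text column (the exact column order and the `hi-900` size tag here are my guesses, not something specified in this thread):

```python
import csv
from pathlib import Path

root = Path('data/hi-movies/hi-900')     # data/<dataset_name>/<lang>-<size>
root.mkdir(parents=True, exist_ok=True)

# Hypothetical rows: (label, text)
splits = {
    'hi.train.csv': [('positive', 'बहुत अच्छी फिल्म'), ('negative', 'खराब कहानी')],
    'hi.test.csv':  [('neutral', 'ठीक-ठाक फिल्म')],
}
for name, rows in splits.items():
    with open(root / name, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f).writerows(rows)
```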

Then adding a new classifier to work with the scripts should be as easy as adding these 3 lines to configure it:

Then you can check the commands for training here: https://github.com/n-waves/ulmfit-multilingual/blob/master/results/logs/de.md

If you can, try to use sentencepiece with a 30k vocab.

Thanks for that! I’ll get back to you with the results I get with ulmfit-multilingual!

If you are going to use ulmfit_multilingual:master then use n-waves/fastai:ulmfit_multilingual, as master does not yet work with fastai/fastai:master

@disisbig One question: if you are collecting the reviews, why don't you fetch more? Is the website you collected the reviews from really that small?

Yes, that's right. No more reviews are available; even these are collected from two websites.

@disisbig hey, I am training an LM for Hindi from scratch, using your code in Google Colab. However, I prepared the dataset myself (not using your wiki-articles folder).

I am using the following code to create a databunch for a ~60k train+test set, and I always get a "session crashed after using all available RAM" error. Is there an iterative way to create the databunch in chunks?

data_lm = TextLMDataBunch.from_folder(path='./HindiDataset/', tokenizer=tokenizer, vocab=hindi_vocab)

I'm using SentencePiece for tokenization.

Also, what cloud setup (or local) did you use to train the model?
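One way to sidestep the RAM spike is to tokenize the files in fixed-size chunks and persist the results, then build the databunch from the pre-tokenized ids. This is a generic sketch, not a fastai API; `tokenize` below is a stand-in for the real sentencepiece `encode` call, and the paths are hypothetical:

```python
import pickle
from pathlib import Path

def tokenize(text):
    """Stand-in for sp.encode(text); replace with the real tokenizer."""
    return text.split()

def tokenize_in_chunks(src_dir, out_dir, chunk_size=1000):
    """Tokenize *.txt files chunk by chunk so only one chunk is in RAM."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(src_dir).glob('*.txt'))
    for i in range(0, len(files), chunk_size):
        chunk = [tokenize(p.read_text(encoding='utf-8'))
                 for p in files[i:i + chunk_size]]
        with open(out / f'tok_{i // chunk_size}.pkl', 'wb') as f:
            pickle.dump(chunk, f)        # flush chunk to disk, free memory
```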


Thanks for reaching out. Check out the discussion here:

Let me know if that helps.


Yeah, it works fine for small datasets on Colab. For the vocab, I tried both 30k and 40k. I wanted to train on a bigger dataset now, but Google Colab has only ~11 GB of RAM. I think I'll stick to smaller datasets until I get more RAM (or until I figure out some iterative process to create this databunch).

Do you have RAM issues while tokenizing or while creating the databunch?

The databunch, while running the following line:

data_lm = TextLMDataBunch.from_folder(path='./HindiDataset/', tokenizer=tokenizer, bs=32, vocab=hindi_vocab)

How big is your dataset?
There are interesting discussions here and here. These should definitely help you out!

Cool, thanks! I have ~60k files.