ULMFiT - Hindi

Starting this thread to share the progress on the Hindi LM and classification results @piotr.czapla @Moody

Dataset

Download the Wikipedia Articles Dataset (55,000 articles), which I scraped, cleaned and trained the model on, from here.

Note: There are more than 1.5 lakh (150,000) Hindi Wikipedia articles; you can find their URLs in the pickled object in the above folder. I chose to work with only 55,000 articles because of computational constraints.

Get the Hindi Movie Reviews Dataset, which I scraped, cleaned and trained the classification model on, from the repository path datasets-preparation/hindi-movie-review-dataset

Thanks to nirantk for BBC Hindi News dataset

Results

Perplexity of Language Model: ~36 (on 20% validation set)

Accuracy of Movie Review classification model: ~53%

Kappa Score of Movie Review classification model: ~30

Accuracy of BBC News classification model: ~79%

Kappa Score of BBC News classification model: ~72

Note: nirantk has done previous SOTA work with a Hindi Language Model and achieved a perplexity of ~46. I have achieved a better perplexity, i.e. ~36, but these scores aren’t directly comparable because he used the Hindi Wikipedia dumps for training whereas I scraped 55,000 articles and cleaned them with the scripts in datasets-preparation. That said, one big reason I feel my results should be better is that I’m using sentencepiece for unsupervised tokenization whereas nirantk was using spacy.
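For reference, the perplexity here is just the exponential of the validation cross-entropy loss, and the kappa scores are presumably Cohen's kappa on the classifier predictions. A tiny illustration (the loss value and label arrays below are made up purely to show the calculation, not actual outputs):

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Perplexity is exp(cross-entropy loss): a validation loss of ~3.58
# corresponds to the ~36 perplexity reported above.
val_loss = 3.58
print("perplexity:", np.exp(val_loss))  # ~35.9

# Cohen's kappa for a 3-class classifier (Positive/Neutral/Negative);
# these label arrays are invented just to show the call.
y_true = [0, 1, 2, 2, 1, 0, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]
print("kappa:", cohen_kappa_score(y_true, y_pred))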

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download Movie Review classifier from here

Download BBC News classifier from here

Tokenizer

Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here
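For anyone curious, training such a tokenizer looks roughly like this with a recent sentencepiece Python package. This is only a sketch: the input file name, vocab size and other settings are assumptions, not the exact ones used for the released model.

import sentencepiece as spm

# Train an unsupervised subword model on the cleaned Wikipedia text
# (assumed to be one article or sentence per line in hi_wiki_all.txt).
spm.SentencePieceTrainer.train(
    input="hi_wiki_all.txt",
    model_prefix="hi_lm_tokenizer",   # writes hi_lm_tokenizer.model / .vocab
    vocab_size=30000,
    character_coverage=0.9995,        # keep almost all Devanagari characters
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="hi_lm_tokenizer.model")
print(sp.encode("यह एक उदाहरण वाक्य है", out_type=str))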


I am not able to find the Hindi movie review dataset… can you please share the link to it?

You can find it here: https://github.com/goru001/nlp-for-hindi/tree/master/datasets-preparation/hindi-movie-review-dataset

Thanks!!

Gaurav, the accuracy of 53% is quite low (close to random for 2-class classification). Do you know why that is? I had similar problems with IMDb when the LM was trained on sentences instead of full Wikipedia articles.

Would you like to try to retrain your models using ulmfit-multilingual?

The movie review classification dataset has 3 classes [Positive, Neutral, Negative], not 2. I settled for an accuracy of 53% (which is better than random for 3 classes) because the dataset had only

335 Positive Examples
270 Neutral Examples
293 Negative Examples

which I thought were too few to give higher accuracy.

But I can try with ulmfit-multilingual as well to see if it gives better results. One question though, what do you mean when you say

when lm was trained on sentences instead of full wikipedia articles

According to my understanding, for training the LM we concatenate all Wikipedia articles, break them into chunks of batch size, and then train (a toy sketch of what I mean is below). So where do sentences fit into this?
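A made-up illustration of the batching I am describing (the token ids, bs and bptt values are arbitrary):

import numpy as np

# Concatenate tokenised articles into one long stream, then reshape it into
# `bs` parallel streams that the LM reads left to right, bptt tokens at a time.
articles = [[5, 8, 2, 9], [7, 7, 3], [1, 4, 6, 6, 2, 8]]   # made-up token ids
stream = np.concatenate(articles)

bs, bptt = 2, 3
n = (len(stream) // bs) * bs           # drop the ragged tail
batched = stream[:n].reshape(bs, -1)   # shape: (bs, n // bs)

for i in range(0, batched.shape[1] - 1, bptt):
    x = batched[:, i:i + bptt]
    y = batched[:, i + 1:i + 1 + bptt]  # next-token targets
    print(x.tolist(), "->", y.tolist())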

Gaurav, yeah, that is right. To be exact, the previous version of ULMFiT shuffled all training examples and then concatenated them together. The problem starts when the examples are just sentences: once they are shuffled, the language model still learns, but it has terrible accuracy on the IMDb dataset.

I see! That’s weird, but it shouldn’t happen, right?

Also, I don’t see that kind of problem here, do you? I’ll check with ulmfit-multilingual though, and see what accuracy I get with it.

Here is how to add a dataset (a rough sketch of creating this layout follows the list):

  • data/<dataset_name>/<lang>-<size+other info>/

    • <lang>.train.csv
    • <lang>.test.csv
    • <lang>.unsup.csv
    • <lang>.dev.csv
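A rough sketch of producing that layout for the Hindi movie-review data; the source CSV name, its columns and the 80/20 split are my assumptions, not something defined by ulmfit-multilingual:

from pathlib import Path
import pandas as pd

lang = "hi"
out = Path("data") / "hi-movie-reviews" / f"{lang}-900"   # ~900 labelled examples
out.mkdir(parents=True, exist_ok=True)

# Assumed input: one CSV with 'label' and 'text' columns from the scraped reviews.
df = pd.read_csv("hindi-movie-review-dataset/reviews.csv")
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

train.to_csv(out / f"{lang}.train.csv", index=False)
test.to_csv(out / f"{lang}.test.csv", index=False)
# <lang>.unsup.csv and <lang>.dev.csv can be written the same way if available.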

Then adding a new classifier to work with the scripts would be as easy as adding these 3 lines to configure it:

Then you can check the commands for training here: https://github.com/n-waves/ulmfit-multilingual/blob/master/results/logs/de.md

If you can, try to use sentencepiece with a 30k vocabulary.

Thanks for that! I’ll get back to you with the results I get with ulmfit-multilingual!

If you are going to use ulmfit_multilingual:master then use n-waves/fastai:ulmfit_multilingual, as master does not yet work with fastai/fastai:master

@disisbig One question: if you are collecting the reviews, why don’t you fetch more? Is the website you collected the reviews from that small?

Yes, right. No more reviews are available. Even these reviews are a collection from two websites.

@disisbig Hey, I am training an LM for Hindi from scratch, using your code in Google Colab. However, I prepared the dataset myself (not using your wiki-articles folder).

I am using the following code to create a databunch for a ~60k train+test set. The session always crashes because all of the RAM gets consumed. Is there an iterative solution to create the databunch in chunks?

data_lm = TextLMDataBunch.from_folder(path='./HindiDataset/', tokenizer=tokenizer, vocab=hindi_vocab)

Using sentencePiece for tokenization.

Also, what cloud setup (or local) did you use to train the model?


Thanks for reaching out. Check out the discussion here:

Let me know if that helps.


Yeah, it works fine for small datasets on Colab. For the vocab I tried both 30k and 40k. I wanted to train on a bigger dataset now, but Google Colab has only ~11 GB of RAM. I think I’ll stick to smaller datasets till I get more RAM (or until I figure out some iterative process to create this databunch).

Do you have RAM issues while tokenizing or while creating the databunch?

Databunch. While running the following line

data_lm = TextLMDataBunch.from_folder(path='./HindiDataset/', tokenizer=tokenizer, bs=32, vocab=hindi_vocab)

How big is your dataset?
There are interesting discussions here and here. These should definitely help you out!
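In case it helps others hitting the same crash, here is a rough, untested sketch of one possible low-memory route: tokenise the files to ids with the already-trained sentencepiece model first, then build the databunch from those ids so the fastai tokenizer never has to hold all the raw text at once. The paths, the fastai v1 TextLMDataBunch.from_ids call and the Vocab handling are assumptions on my side.

from pathlib import Path
import numpy as np
import sentencepiece as spm
from fastai.text import TextLMDataBunch, Vocab

sp = spm.SentencePieceProcessor(model_file="hi_lm_tokenizer.model")  # assumed model path

def ids_for(folder):
    # One numpy array of sentencepiece ids per text file, read one file at a time.
    return [np.array(sp.encode(f.read_text(encoding="utf-8")))
            for f in sorted(Path(folder).glob("*.txt"))]

train_ids = ids_for("HindiDataset/train")
valid_ids = ids_for("HindiDataset/valid")

itos = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
data_lm = TextLMDataBunch.from_ids(path="HindiDataset", vocab=Vocab(itos),
                                   train_ids=train_ids, valid_ids=valid_ids, bs=32)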

Cool, thanks! I have ~60k files.