Training a Bangla LM from Wikipedia data

abyaadrafid · June 11, 2019, 9:13am

Hi,
I am trying to train a Bangla language model using Wikipedia articles. I used this script to download from the wiki data dumps and got a file of about 150 MB. However, I have also found some database dumps in here that are 2.5 to 2.8 GB in size.
Is the script using only a subset of the dump? If so, how can I get the complete dataset?

Regards,
Rafid

abyaadrafid · June 11, 2019, 9:16am

Mentioning @tanny411 for obvious reasons.
Have you used this script for the data you’re using? If so, how good is the model?

tanny411 · June 11, 2019, 12:00pm

No, I didn’t use this exact script, but what I did was basically the same. What this script does is take the ‘latest’ dump; see this. I am not too familiar with wiki dumps either, but as I understand it there are lots of them, so it may be a good idea to look into the 2 GB+ files you found.
Also, I haven’t trained the model completely yet (because Colab!).

Edit: Sorry, but where are the 2 GB files? I can’t find any downloads of that size. I could be mistaken, but the “all files” download link probably refers to every file provided for the dump, whereas we just want the text, which is “pages-articles”.
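
To make the “latest” plus “pages-articles” point concrete, here is a minimal sketch of how that download URL is put together. The helper name `latest_articles_dump_url` is made up for illustration, and the URL layout is an assumption based on how dumps.wikimedia.org is organised at the moment, so check it in a browser before relying on it.

```python
def latest_articles_dump_url(lang: str) -> str:
    """Build the URL of the 'latest' pages-articles dump for a language code.

    Hypothetical helper; the path layout is assumed from the public
    dumps.wikimedia.org listing and may change.
    """
    wiki = f"{lang}wiki"  # e.g. "bn" -> "bnwiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/latest/"
        f"{wiki}-latest-pages-articles.xml.bz2"
    )


bangla_url = latest_articles_dump_url("bn")
print(bangla_url)
```

Only the article text lives in this file; the other files listed for a dump (history, SQL tables, abstracts and so on) are separate downloads.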

nirantk · June 11, 2019, 12:59pm

Hey @abyaadrafid, two quick pointers:

You can find pretrained models in Bengali/Bangla at iNLTK here: https://github.com/goru001/inltk
You can find the latest full wiki dump for Bangla using this script: https://github.com/NirantK/bharatNLP/blob/dev/prepare_wiki.sh

My best guess, from working with Hindi and Indonesian wiki dumps: the compressed (*.tar.gz) file might be small, but the extracted full dumps are often much larger and can go up to 2-2.5 GB, as we see in the archive.

abyaadrafid · June 11, 2019, 2:09pm

@nirantk Thank you for the head start.

You’re right, the extracted file is 1.1 GB of XML.
After wikiextractor worked its magic, though, it shrank to ~250 MB.
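
For anyone wondering where the size goes: most of the XML is markup and per-page metadata rather than article text. Below is a rough sketch using only the Python standard library, run on a tiny made-up sample instead of the real dump. It is not what wikiextractor does internally (wikiextractor also strips wiki markup and templates); the function names and the sample page are my own assumptions.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(xml_stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, ""
    # Stream the file so a multi-GB dump never has to fit in memory.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children


# Made-up sample in the general shape of a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Dhaka</title>
    <revision>
      <id>1</id>
      <text>Dhaka is the capital of Bangladesh.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_page_texts(io.BytesIO(SAMPLE)))
xml_bytes = len(SAMPLE)
text_bytes = sum(len(body.encode("utf-8")) for _title, body in pages)
print(f"{len(pages)} page(s): {xml_bytes} bytes of XML, {text_bytes} bytes of text")
```

On the real file you would pass an opened dump instead of the `io.BytesIO` sample; the ratio will differ, but the text is always a fraction of the XML.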

One question: did you use the wikitext103 pre-trained model for Hindi/Indonesian, or did you train from scratch?

abyaadrafid · June 11, 2019, 2:14pm

I was mistakenly assuming the whole thing was relevant. Exactly as you mentioned, we only need “pages-articles”. Also, “abstracts” contains the first two paragraphs of every article, which could be useful for toy purposes.
“meta-history” might provide some more data, unless of course wikiextractor already uses it.

nirantk · June 11, 2019, 5:02pm

250 MB of pure language text sounds like a reasonable starting point.

I built the language models for both Hindi and Indonesian using the code I shared above.
The wikitext-103 pretrained model is English-only, so it cannot be used for any other language.

abyaadrafid · June 11, 2019, 5:32pm

That makes sense. Jeremy mentioned something about languages being semantically too different, like Chinese. Still wanted to double check; wouldn’t want to miss out if there was a way.
Cheers.