I am trying to train a Bangla language model using Wikipedia articles. I used this script to download from the Wikipedia data dumps and got a ~150 MB file. But I have found some database dumps here with sizes of 2.5–2.8 GB.
Is the script using a subset of the dump? If so, how can I get the complete dataset?
Mentioning @tanny411 for obvious reasons.
Have you used this script for the data you’re using? If so, how good is the model?
No, I didn’t use this exact script, but what I did was basically the same. What this script does is take the ‘latest’ dump; see this. I am also not too familiar with wiki dumps, but as I understand it there are lots of them, so maybe it would be a good idea to look into the 2 GB+ files you found.
Also, I haven’t yet trained the model completely (because Colab!).
Edit: sorry, but where are the 2 GB files? I can’t find any downloads of that size. I could be mistaken, but the “all files” download link probably refers to downloading all the files provided, whereas we just want the text, which is “pages-articles”.
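To make the “we just want pages-articles” point concrete, here is a minimal sketch of how the ‘latest’ pages-articles file is addressed on dumps.wikimedia.org. It assumes the standard URL layout for the “latest” dump directory (the wiki code `bnwiki` is the Bangla Wikipedia); other file types on the dump page use different names and compression extensions, so this only covers the article-text file.

```python
def pages_articles_url(wiki="bnwiki"):
    """Build the URL of the 'latest' article-text dump for a given wiki.

    Assumes the standard dumps.wikimedia.org layout:
    <base>/<wiki>/latest/<wiki>-latest-pages-articles.xml.bz2
    """
    base = "https://dumps.wikimedia.org"
    return f"{base}/{wiki}/latest/{wiki}-latest-pages-articles.xml.bz2"

print(pages_articles_url())
# https://dumps.wikimedia.org/bnwiki/latest/bnwiki-latest-pages-articles.xml.bz2
```

The 2 GB+ “all files” bundle includes things like full revision history and abstracts, which is why it dwarfs the pages-articles file alone.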
Hey @abyaadrafid, two quick pointers:
From my best guess, working with the Hindi and Indonesian wiki dumps: the compressed (*.tar.gz) file might be small, but the extracted full dumps are often larger and can go up to 2–2.5 GB, as we see in the archive.
@nirantk Thank you for the head start.
You’re right, the extracted file is 1.1 GB of XML. After wikiextractor worked its magic, though, it shrank to ~250 MB.
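The shrink from 1.1 GB of XML to ~250 MB makes sense: wikiextractor discards the XML wrapper and wiki markup and keeps only the article text. As a toy illustration (this is not wikiextractor’s actual logic, just a sketch of the idea), a few regex passes over wikitext already remove a lot of non-text bulk:

```python
import re

def strip_markup(wikitext):
    """Toy markup stripper: drop templates, keep link labels, drop quotes.

    Illustrates why cleaned text is much smaller than the raw dump;
    real wikiextractor handles far more cases (nested templates, tables, refs).
    """
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                               # drop ''italic''/'''bold''' quotes
    return text.strip()

raw = "'''Dhaka''' is the capital of [[Bangladesh]].{{Infobox city|...}}"
print(strip_markup(raw))  # -> Dhaka is the capital of Bangladesh.
```

For language modelling this is exactly what we want: only the running prose survives.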
One question: did you use the wikitext-103 pre-trained model for Hindi/Indonesian, or did you train from scratch?
I was mistakenly assuming the whole thing was relevant. Exactly as you mentioned, we only need “pages-articles”. Also, “abstracts” are the first two paragraphs of every article; they could be useful for toy purposes.
“meta-history” might provide us with some more data, unless of course wikiextractor already uses it.
250 MB of pure language text sounds like a reasonable starting point.
I made the language models for both Hindi and Indonesian using the code I shared above.
wikitext-103 is an English-only pretrained model; it cannot be used for any other language.
That makes sense. Jeremy mentioned something about some languages being semantically too different, like Chinese. Still, I wanted to double-check; I wouldn’t want to miss out if there was a way.