Hi,
I am trying to train a Bangla language model using Wikipedia articles. I used this script to download from the wiki data dumps and got a file of roughly 150 MB. But I have found some database dumps here that are 2.5-2.8 GB in size.
Is the script using a subset of the dump? If so, how can I get the complete dataset?
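In case it's useful context: as far as I can tell, the download step boils down to roughly this (a simplified sketch, not the script's actual code; the URL follows the standard dumps.wikimedia.org layout, and "bn" is the Bangla language code):

```python
import urllib.request

# The "latest" pages-articles dump for a given language code.
lang = 'bn'
fname = f'{lang}wiki-latest-pages-articles.xml.bz2'
url = f'https://dumps.wikimedia.org/{lang}wiki/latest/{fname}'

# Saves the compressed XML dump to the current directory.
urllib.request.urlretrieve(url, fname)
```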
No, I didn't use this exact script, but what I did was basically the same. What this script does is take the "latest" dump; see this. I am also not too familiar with wiki dumps, but as I understand it there are lots of them, so maybe it would be a good idea to look into the 2 GB+ files you found.
Also, I haven't finished training the model yet (because Colab!).
Edit: Sorry, but where are the 2 GB files? I can't find any downloads of that size. I could be mistaken, but the "all files" download link probably refers to downloading all the files provided, whereas we just want the text, which is "pages-articles".
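For completeness, turning that pages-articles file into plain text is typically done with wikiextractor. A rough sketch of the call (hedged: the CLI flags have changed between wikiextractor versions, so check `--help` on yours):

```python
import subprocess

# Runs wikiextractor's CLI on the compressed dump; it writes plain-text
# article files under ./extracted_bn/. Flag names vary by version.
subprocess.run(
    ['python', '-m', 'wikiextractor.WikiExtractor',
     'bnwiki-latest-pages-articles.xml.bz2',
     '-o', 'extracted_bn'],
    check=True,
)
```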
My best guess, from working with the Hindi and Indonesian wiki dumps: the compressed file (*.bz2 or similar) might be small, but the extracted full dumps are often much larger and can go up to 2-2.5 GB, as we see in the archive.
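One way to sanity-check this without extracting everything to disk is to stream-decompress the dump and count bytes; a small stdlib-only sketch (the filename is a placeholder for whichever dump you downloaded):

```python
import bz2
import os

dump = 'bnwiki-latest-pages-articles.xml.bz2'  # placeholder filename
print(f'compressed:   {os.path.getsize(dump) / 2**20:,.0f} MB')

# Stream in 1 MB chunks so the whole dump never sits in memory.
total = 0
with bz2.open(dump, 'rb') as f:
    for chunk in iter(lambda: f.read(2**20), b''):
        total += len(chunk)
print(f'uncompressed: {total / 2**20:,.0f} MB')
```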
I was mistakenly assuming the whole thing to be relevant. Exactly as you mentioned, we only need "pages-articles". Also, "abstracts" are the first two paragraphs of every article; those could be useful for toy purposes.
"meta-history" (the full revision history of every page) might provide us with some more data, unless of course wikiextractor already uses it.
250 MB of pure language text sounds like a reasonable starting point.
I made the language models for both Hindi and Indonesian using the code that I shared above.
The wikitext-103 model is pretrained on English only, so it cannot be used for any other language.
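For what it's worth, training from scratch looks roughly like this (a minimal sketch assuming fastai v1, whose exact signatures have shifted between releases; the CSV filename and the 'text' column holding your extracted Wikipedia chunks are assumptions about your setup):

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Assumption: one chunk of extracted Wikipedia text per row, in a 'text' column.
df = pd.read_csv('bn_wiki_text.csv')
df_train, df_valid = df[:-1000], df[-1000:]

# Tokenizes and numericalizes the text into a language-model databunch.
data_lm = TextLMDataBunch.from_df(
    '.', train_df=df_train, valid_df=df_valid, text_cols='text')

# pretrained=False trains the AWD-LSTM from scratch, since the
# wikitext-103 weights only cover English.
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
```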
That makes sense. Jeremy mentioned something about languages being semantically too different, like Chinese. Still, I wanted to double-check; I wouldn't want to miss out if there was a way.
Cheers.