No, I didn't use this exact script, but what I did was basically the same. What this script does is take the 'latest' dump; see this. I'm also not too familiar with wiki dumps, but as I understand it, there are lots of them, so it might be a good idea to look into the 2GB+ files you found.
Also, I haven't yet trained the model completely (because Colab!).
Edit: sorry, but where are the 2GB files? I can't find any downloads of that size. I could be mistaken, but the “all files” download link probably refers to downloading all of the files provided, whereas we just want the text, which is “pages-articles”.
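For reference, here is a minimal sketch of what "taking the 'latest' pages-articles dump" amounts to, assuming the standard dumps.wikimedia.org layout for English Wikipedia. The exact URL/filename the script builds may differ; this is just to show which file is meant:

```python
# Sketch: stream the "latest" English Wikipedia pages-articles dump to disk.
# Assumes the usual dumps.wikimedia.org naming; the referenced script may
# construct this URL differently or use another language edition.
import urllib.request

DUMP_URL = (
    "https://dumps.wikimedia.org/enwiki/latest/"
    "enwiki-latest-pages-articles.xml.bz2"
)

def download_dump(dest="enwiki-latest-pages-articles.xml.bz2"):
    """Download in chunks; the file is many GB, so don't load it into memory."""
    with urllib.request.urlopen(DUMP_URL) as resp, open(dest, "wb") as out:
        while True:
            chunk = resp.read(1 << 20)  # 1 MiB at a time
            if not chunk:
                break
            out.write(chunk)
    return dest

if __name__ == "__main__":
    print("Downloading", DUMP_URL)
    download_dump()
```

The other files in the "all files" listing (page links, abstracts, SQL tables, etc.) aren't needed if you only want article text.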