Language Model Zoo 🦍

The issue is multiprocessing: your training dataset seems to be larger than what pickle can handle. Either make the dataset smaller, add additional chunking, or turn off multiprocessing and leave it running for a week.

I will train a language model for Turkish. I dumped the wiki articles and extracted them into json thanks to @Moody :slight_smile:

Now, I am wondering if there are any best practices for preprocessing the wiki text before the tokenization step.

For example, some trivial things I’ve come across are replacing \n\n with a single whitespace, and removing Section:::: markers and remaining HTML tags from the articles, since they are not language related but rather artifacts of Wikipedia’s structure.
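Those cleanup rules could be sketched roughly like this (a minimal sketch with hand-written regexes; fastai’s own `fix_html` rules differ, and the exact patterns here are my assumptions):

```python
import re

def clean_wiki_text(text):
    """Strip Wikipedia-structure artifacts before tokenization (rough sketch)."""
    text = re.sub(r"Section::::", "", text)   # section markers left by the dump
    text = re.sub(r"<[^>]+>", " ", text)      # leftover HTML tags like <br>, </ref>
    text = re.sub(r"\n\n+", " ", text)        # paragraph breaks -> single whitespace
    text = re.sub(r"[ \t]+", " ", text)       # collapse repeated spaces/tabs
    return text.strip()
```

The order matters a little: tags are removed before whitespace is collapsed, so a tag replaced by a space does not leave double spaces behind.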

Here is a sample text (SPOILER ALERT)

Albus Percival Wulfric Brian Dumbledore (1881-1997), J.K. Rowling tarafından yazılmış Harry Potter serisindeki bir kurgusal karakterdir.

Çok zeki, araştırmacı, sakin ve kendini duygularına kaptırmayan çok güçlü bir büyücüdür. Gençliğinde aşırı güç meraklısıdır, daha sonra daha mantıklı davranmaya karar verir. (Kibar, biraz ilginç ve güçlü yapısıyla tipik iyi büyücü özelliklerini taşımaktadır.) Harry Potter'ın sorunlarını anlayışla karşılamasıyla ona diğer öğretmenlerden daha 'iyi' davrandığı söylenebilir. Herkes tarafından sevilen ve sayılan bir büyücü olan Dumbledore, Lord Voldemort'un korktuğu yegane büyücüdür. Sihir otoritelerinin genel kanısına göre de gelmiş geçmiş en güçlü büyücüdür. Dumbledore'un yaşamı 116 yıl sürmüştür. Altıncı kitapta (Harry Potter ve Melez Prens) Severus Snape tarafından Avada Kedavra lanetiyle öldürülen Dumbledore, 1944-1997 tarihleri arasında Hogwarts Cadılık ve Büyücülük Okulu'nun müdürlüğünü yapmıştır.

Uzun ve ince olarak betimlenen Dumbledore'un uzun saç ve sakalları vardır. Ünlü büyücünün, mavi gözleri, çok uzun ve kancalı bir burnu ve uzun parmakları vardır. Yarım ay çerçeveli gözlükleri ve şaşaalı cübbesi ilk göze çarpan şeylerdir. Sol dizinin üstünde Londra metro'sunun haritasını gösteren, bir düellodan kalma bir yara izi vardır. Dumbledore'un Çikolatalı Kurbağa kartına göre oda müziği ve on lobutlu bowlingden hoşlanmaktadır. 1945'te kara büyücü Grindelwald'u yenmesi ejderha kanının 12 ayrı konuda kullanılışını bulması ve arkadaşı Nicholas Flamel ile simya konusunda yürüttüğü çalışmalarla ünlüdür. Sihirli ya da sihirsiz bütün şekerli yiyeceklere karşı bir zaafı vardır. Ofisini koruyan heykelin şifresini de genellikle bu tatlı isimlerinden seçer. Ancak, Bertie Bott 1930 doğumlu olduğu için, Dumbledore'un "gençlik" ile neyi kastetitiği anlaşılamamıştır. En sevdiği tatlar ise, Böğürtlen ve Marmelat'tır. Dumbledore, aynı zamanda bir örgü meraklısıdır. Ayrıca yazar J.K Rowling' in yaptığı açıklamaya göre kendisi eşcinseldir ve Gellert Grindelwald'a aşıktır.

Yazarın böylesine bilge bir kişiye "Albus Dumbledore" ismini vermesi rastgele yapılmış bir seçim değildir. Albus, Latince "beyaz" anlamına gelir ve "bilgelik" ile "aydınlanmayı" temsil eder. Dumbledore ise "yabanarısı" (İngilizce "bumblebee") anlamına gelmekle yazar tarafından özellikle seçilmiştir çünkü İngilizce'de "bumble around", "etrafta dikkatsizce gezinmek" demektir. Yazar Dumbledore' u yaratırken onun Hogwarts koridorlarında dolaştığını hayal ettiği için bu fiille ilintili bir isim seçmiştir.

**Section::::Karakter gelişimi.**
Dumbledore'un bir kız kardeşi ve bir erkek kardeşi vardır

Are there any best practices to follow for this dataset?

Thanks! :smiley:

def fix_html() doesn’t seem to account for all HTML tags. For example:

 ('<nowiki>', 7815),
 ('<br>', 7165),
 ('<BR>', 582),
 ('</div>', 572),
 ('<onlyinclude>', 555),
 ('</onlyinclude>', 539),
 ('<br \\n>', 461),
 ('<li>', 447),
 ('</ref>', 445),
 ('<noinclude>', 194),
 ('<ENTER>', 59),
 ('</noinclude>', 58),
 ('</poem>', 54),
...

P.S. Only 6,305/328,830 articles contain such HTML tags, so I can simply discard them. But I was just curious to know what more experienced language modelers do as a best practice.
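For what it’s worth, a tag count list like the one above can be produced with a small audit helper (a sketch; the regex is an assumption and will also catch non-HTML angle-bracket strings such as <ENTER>):

```python
import re
from collections import Counter

def count_html_tags(texts):
    """Tally leftover HTML-like tags across a corpus, most common first."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"</?\w+[^>]*>", text))
    return counts.most_common()
```

Running it over the article list shows which tags survive fix_html, which makes it easier to decide between extending the cleanup rules and discarding the affected articles.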

1 Like

Are there any forum threads explaining how to deal with GPU memory issues while training language models from scratch? I am not able to fit batches with bs > 4, and fp16 doesn’t seem to work in this case.

Right now I am testing stuff with a single 2080 Ti.

Hi,

I attempted to train a Bulgarian language model using the Wikipedia corpus, following the Telugu model notebook, and thought I’d share some stats and results below. Gradient stops runs after 12 hrs, so I did 3×5 epochs instead of 15. It seems to still be underfitting; I will try playing with reducing dropout to address this. Currently looking for suitable data for classification. Not sure where to find benchmark numbers for Bulgarian language models.

Bulgarian wiki stats:
• Number of documents: 251,877
• Number of words: 53,811,583
• Number of unique tokens: 3,273,966

Results:

[image: training results]

Hi Everyone,

I forgot to post to this thread a while back when I had a go at applying fastai v1 and ULMFiT to Greek.

The results on the only “recent” Greek-language NLP classification task I could find were better than those in the corresponding paper. I intend to update the code to include recent changes to the API and the forward-and-backward approach from the ULMFiT paper.

1 Like

I made a Norwegian bokmål/nynorsk language model.
Perplexity: 22.3
https://github.com/AugustIndal/Norwegian-no-nn-ULMFiT-language-model

Hi! I trained a model on the Finnish Wikipedia without SentencePiece and got some OK results. We are using the model in our hospital biobank to classify patient smoking-status sentences etc., with really good results. The idea is to also train the model on our own patient dictation text (we have about 20 gigabytes of it and of course cannot release it anywhere).

I’m really interested in trying out SentencePiece to rerun this, especially when we train the actual model on our own text. What kind of results did you get? If you have a working SentencePiece version already, could you try the classification from here, so maybe we could compare results?

If you can’t find a benchmark, think about creating one. You could build it from some movie reviews, or predict the genre or book author from text fragments. You can make a baseline using the NBSVM presented by Jeremy: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
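For a quick sense of what such a baseline looks like, here is a rough NB-SVM-style sketch: logistic regression trained on features scaled by the naive-Bayes log-count ratio. The function names are made up for illustration, and details differ from the kernel linked above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def nbsvm_fit(texts, y):
    """Fit an NB-SVM-style binary classifier on raw texts (sketch)."""
    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts).sign()          # binarized term counts
    y = np.asarray(y)
    p = X[y == 1].sum(axis=0) + 1                # smoothed positive-class counts
    q = X[y == 0].sum(axis=0) + 1                # smoothed negative-class counts
    r = np.log((p / p.sum()) / (q / q.sum()))    # naive-Bayes log-count ratio
    clf = LogisticRegression(max_iter=1000).fit(X.multiply(r), y)
    return vec, np.asarray(r), clf

def nbsvm_predict(model, texts):
    """Predict labels for new texts with a fitted nbsvm_fit model."""
    vec, r, clf = model
    X = vec.transform(texts).sign()
    return clf.predict(X.multiply(r))
```

Despite its simplicity, this kind of linear baseline is hard to beat on small text-classification sets, which makes it a useful sanity check before fine-tuning a language model.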

Unfortunately, perplexity alone does not tell you much about how the model will perform on classification. Could you try testing it on some existing benchmark?

We will have high-level command-line tools to do just that. I’ve made a preliminary attempt to create them, but we still need a data layer that detects the dataset format automatically. I will keep you posted.

Hi @t-v.

I downloaded the latest French Wikipedia corpus, and with the following code, which counts the number of tokens (mostly words) per text file in the docs folder created thanks to nlputils.py, I got about 492 million tokens.

If I understand your post (and Jeremy’s) correctly, I should keep only 100 million tokens in docs (i.e., a number of articles whose total comes to 100 million tokens) before creating my LM databunch.

That means deleting a lot of training data. Can you confirm the process to follow? Thanks.

dest = path/'docs'
files = dest.ls()
num_tokens = 0

# Count whitespace-separated tokens across all text files in docs/
for i, f in enumerate(files):
    words = open(f, 'r', encoding='utf8').read()
    num_tokens += len(words.split())
print(f'{i + 1} files, {num_tokens} tokens')
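If the goal is indeed to keep only ~100 million tokens, one possible continuation is to shuffle the files and accumulate them until the budget is reached (a hypothetical helper, not something from nlputils.py):

```python
import random

def sample_files(files, max_tokens=100_000_000, seed=42):
    """Keep a random subset of article files totalling at most max_tokens words."""
    files = list(files)
    random.Random(seed).shuffle(files)   # fixed seed for a reproducible subset
    kept, total = [], 0
    for f in files:
        n = len(open(f, 'r', encoding='utf8').read().split())
        if total + n > max_tokens:
            break                        # stop at the first file that overflows
        kept.append(f)
        total += n
    return kept, total
```

Shuffling before accumulating avoids keeping only the alphabetically (or chronologically) first articles, which could bias the language model.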
3 Likes

If you are looking for a labeled dataset in US English, UK English, French, German, or Japanese to train and test your ULMFiT classifier, you can download the Amazon Customer Reviews datasets.

If you need help, I published a guide on downloading them.

1 Like

And ULMFiT has superior results on the CLS dataset: https://arxiv.org/pdf/1909.04761.pdf

2 Likes

Is anyone willing to share pretrained (English) ULMFiT or MultiFiT LM weights with the SentencePiece tokenizer?

Update:
I trained it myself: https://www.kaggle.com/manyregression/fastai-en-wiki-500kk-pretrained-sp

There is also more in the versions of this notebook (100kk, i.e. 100 million, tokens; AWD-LSTM weights):
https://www.kaggle.com/manyregression/sp-wikitext-vocab-lm-ipynb?scriptVersionId=27995530

Another question: there’s no point in using the pretrained https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd if I chose SentencePiece, right?

Correct. You need consistent indices and tokens for encoding (training) and decoding (inference).
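A toy illustration of why mixing tokenizers breaks things: embedding rows are indexed by vocabulary position, so the same index points at different tokens under a spaCy-style vocab and a SentencePiece-style vocab (both vocab lists below are made up):

```python
# Hypothetical vocabularies: row 1 of the pretrained embedding matrix
# would be reused for a completely different token after switching tokenizers.
spacy_vocab = ['the', 'cat', 'sat']
sp_vocab = ['▁the', '▁c', 'at']

idx = 1
assert spacy_vocab[idx] != sp_vocab[idx]  # 'cat' vs '▁c'
```

So pretrained weights are only reusable together with the exact vocabulary (and tokenizer) they were trained with.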

Funny, but I got slightly worse results when I fine-tuned the pretrained spaCy weights with SP and then trained a classifier: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-classifier-spacy?scriptVersionId=27771121

Any ideas why the English ULMFiT regression model pretrained on 500kk (500 million) wiki tokens failed, while the 100kk one just gave worse results?

Here’s the 500kk version: https://www.kaggle.com/manyregression/fastai-ulmfit-google-quest-sp?scriptVersionId=28040078

For 100kk, the Spearman metric was 0.26 at best.

Hi, I built a Persian language model.
Here is the topic:

Hi, I’m interested in knowing about your work. I’m a PhD student at Tehran University.

Could someone guide me on how to implement MultiFiT for a new language (Persian)?

This is the notebook.

It reads a pretrained model for Japanese, but I guess there is no such model for Persian. Also, I don’t know the format of the models. I found a pretrained model for Persian at the following link,

however, I don’t know whether that model fits the project above.

I was so glad to have Ines and Matt presenting in person about the new features of spaCy v3.0. Highlights include the data pipeline that stores all the configs and hyperparameters in one place, and integrations with other popular open-source tools (such as Weights & Biases and FastAPI). My favorite feature is the ability to build in (i.e., hard-code) your own acronyms for specific domains or use cases. Enjoy!