Hi @t-v.
I downloaded the latest French Wikipedia corpus, and with the following code, which counts the number of tokens (mostly words) per text file in the docs folder created by nlputils.py, I got about 492 million tokens.
If I understand your post (and Jeremy's) correctly, I should keep only 100 million tokens in docs (i.e., a subset of articles whose total comes to 100 million tokens) before creating my LM databunch.
I’m going to delete a lot of training data. Can you confirm the process to follow? Thanks.
dest = path/'docs'   # docs folder created by nlputils.py
files = dest.ls()

# count the total number of tokens (whitespace-separated words) across all files
num_tokens = 0
for i, f in enumerate(files):
    words = open(f, 'r', encoding='utf8').read()
    num_tokens += len(words.split())
    print(i+1, num_tokens)   # running file count and running token total
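
To make sure I understand the process, here is a minimal sketch of what I had in mind for trimming docs to roughly 100 million tokens. The budget value and the idea of simply deleting whole article files once the budget is reached are my assumptions, not something from your post:

# minimal sketch (my assumption): keep whole article files until ~100M tokens,
# then delete the remaining files before building the LM databunch
budget = 100_000_000
kept_tokens = 0
for f in files:
    n = len(open(f, 'r', encoding='utf8').read().split())
    if kept_tokens + n <= budget:
        kept_tokens += n   # keep this article
    else:
        f.unlink()         # delete the article file past the budget
print(kept_tokens)

Is this the right idea, or should the articles be sampled differently (e.g. at random) before cutting?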