A note to those folks building language models: there’s no reason to go beyond 100 million tokens - in my experiments, going bigger didn’t help. So if your corpus is larger than that, remove some of the smaller articles (for instance) until it’s down to that size. Really large corpuses are a pain to work with, and didn’t offer any benefit that I could see.
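
If it helps, here’s a minimal sketch of that trimming step. It assumes the corpus is a list of article strings and uses whitespace splitting as a rough stand-in for whatever tokenizer you actually use; swap in your own token counter.

```python
TARGET_TOKENS = 100_000_000  # the ~100M-token budget discussed above

def trim_corpus(articles, target=TARGET_TOKENS):
    """Drop the smallest articles until the corpus fits the token budget."""
    # Count tokens per article (whitespace split is a rough proxy).
    sized = [(len(a.split()), a) for a in articles]
    total = sum(n for n, _ in sized)

    # Discard articles smallest-first until the total fits.
    sized.sort(key=lambda pair: pair[0])
    kept = []
    for n, article in sized:
        if total <= target:
            kept.append(article)
        else:
            total -= n  # this article is dropped
    return kept
```

Dropping smallest-first keeps the longer, presumably more substantive articles intact, which matches the "remove some of the smaller articles" suggestion above.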