@fredguth: It’s true that regular Wikipedia dumps are not as clean as Wikitext-103. However, I’ve trained a few language models on the Dutch Wikipedia and they performed pretty well, i.e., I was able to train good classifiers on very small datasets on top of them.
For example, I achieved 88% accuracy on a binary sentiment classification problem (two polarities: neutral and positive) with only 250 labelled examples. That suggests the LM picked up the basics of the language well enough that the classifier didn’t have to learn them from the tiny target dataset.
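Roughly, the workflow looks like this (a minimal sketch using the fastai v2 API; the file paths, dataframe and column names, and hyperparameters are placeholders, not my exact setup):

```python
import pandas as pd
from fastai.text.all import *

# 1) Train a language model from scratch on a Dutch Wikipedia dump.
#    'nl_wiki_texts.csv' is a placeholder: one chunk of raw article text per row.
wiki_df = pd.read_csv('nl_wiki_texts.csv')
dls_lm = TextDataLoaders.from_df(wiki_df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, pretrained=False,
                                  drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn_lm.fit_one_cycle(10, 2e-3)
learn_lm.save_encoder('nl_wiki_enc')   # keep the encoder, drop the LM head

# 2) Fine-tune a classifier on a tiny labelled dataset (e.g. ~250 examples),
#    reusing the LM's vocabulary and encoder weights.
small_df = pd.read_csv('sentiment_250.csv')   # placeholder labelled set
dls_clf = TextDataLoaders.from_df(small_df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, pretrained=False,
                                    drop_mult=0.5, metrics=accuracy)
learn_clf.load_encoder('nl_wiki_enc')
learn_clf.fit_one_cycle(4, 2e-2)       # gradual unfreezing can help further
```

The point is that the encoder trained on the Wikipedia dump already carries the general language knowledge, so the classifier only has to learn the task-specific part from the small labelled set.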
The open question is whether you really need curated data to train your LM. Instead of spending time on curation, it might be more productive to simply use more data, e.g., CommonCrawl. I don’t know the answer to that, though.