Hello @pierreguillou. It’s really not impressive at all; I simply reduced the Wikipedia dataset successively until I could generate a databunch without the GCP session freezing up on me, since SentencePiece tokenization is more memory-intensive than spaCy tokenization.
In the end my German fwd databunch had a size of 939 MB (same size for the bwd one, obviously). Batch size was bs=128, and I don’t think I changed drop_mult from the notebook suggestions (but I can check once I’m home; I can’t SSH into GCP from here). I used the default SentencePiece vocab size of 30k. A rough sketch of the databunch step is below.
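For reference, here is a minimal sketch of how that kind of databunch can be built with fastai v1’s SentencePiece processor. The path and the exact pipeline are assumptions for illustration, not copied from my notebook; only bs=128 and the 30k vocab are the actual values I used.

```python
from fastai.text import *

bs = 128
path = Path('data/dewiki')  # assumed location of the reduced German Wikipedia text

# SPProcessor wraps SentencePiece; max_vocab_sz=30000 is the default subword vocab size
data_lm = (TextList.from_folder(path,
                                processor=[OpenFileProcessor(),
                                           SPProcessor(lang='de', max_vocab_sz=30000)])
           .split_by_rand_pct(0.1, seed=42)
           .label_for_lm()
           .databunch(bs=bs, num_workers=1))

data_lm.save('de_databunch_fwd')  # the saved databunch is what came out at ~939 MB
```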
EDIT: The 939 MB databunch contained 113 million subword tokens. I followed the “Turkish ULMFiT from scratch” notebook, not straying from the suggested drop_mult=0.1 and wd=0.1.
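The corresponding learner setup, again just a sketch following the notebook’s suggested hyperparameters (the schedule and learning rate here are placeholders, not necessarily what I ran):

```python
# AWD-LSTM language model trained from scratch, as in the notebook,
# with the suggested drop_mult=0.1 and wd=0.1, in mixed precision
learn = language_model_learner(data_lm, AWD_LSTM,
                               drop_mult=0.1, wd=0.1,
                               pretrained=False,
                               metrics=[accuracy]).to_fp16()

learn.unfreeze()
learn.fit_one_cycle(10, 1e-2, moms=(0.8, 0.7))  # illustrative schedule only
```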