Thanks for your comments about SP tokenization; they are very insightful!
I think fast.ai does not support subword sampling in the usual setup. You would have to replace the LanguageModelLoader class in text.py (careful, there is one in nlp.py, too; I hope I didn't get it wrong).
This could be involved because the iterator tries to return a fixed number of tokens, so how much text you grab on each iteration depends on the subword sampling (and you would need to deal with funny splits). Maybe the best strategy is to keep a queue of tokens to be returned next and do the conversion and subword sampling in batches, roughly as sketched below.
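Something along these lines is what I have in mind. This is just a rough sketch, assuming a trained SentencePiece model and a list of raw text chunks; the class name, constructor arguments, and yielded format are made up for illustration and are not the actual fastai LanguageModelLoader API:

```python
import numpy as np
import sentencepiece as spm

class SubwordSamplingIterator:
    """Hypothetical loader: sample subword splits on the fly and
    hand out fixed-size token sequences from a queue."""
    def __init__(self, texts, sp_model_path, bptt=70, nbest_size=-1, alpha=0.1):
        self.texts = texts            # raw text chunks, not pre-numericalized
        self.bptt = bptt              # tokens to return per iteration
        self.nbest_size = nbest_size  # -1 = sample over all segmentations
        self.alpha = alpha            # sampling smoothing parameter
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(sp_model_path)

    def __iter__(self):
        queue = []                    # token ids waiting to be emitted
        for text in self.texts:
            # subword-sample this chunk; the split differs on every pass
            queue.extend(self.sp.SampleEncodeAsIds(text, self.nbest_size, self.alpha))
            # emit fixed-size windows while enough tokens are queued
            while len(queue) > self.bptt:
                seq = np.array(queue[:self.bptt + 1])
                del queue[:self.bptt]
                yield seq[:-1], seq[1:]   # input ids / shifted targets
```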
That would mean moving quite a bit of the preprocessing into the training itself, probably with some performance impact. On the other hand, doing data augmentation on the fly is very common in PyTorch, and it might not be too bad given that the vocabulary is much smaller with SP than with full words.
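The augmentation effect comes from SentencePiece re-sampling the segmentation on each call, so the model sees a slightly different subword sequence for the same text every epoch. A quick illustration (the model path and the sentence are just placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("de_sp.model")  # placeholder SP model path
for _ in range(3):
    # nbest_size=-1, alpha=0.1: sample from all possible segmentations
    print(sp.SampleEncodeAsPieces("Das ist ein Beispielsatz.", -1, 0.1))
```

Each run of the loop typically prints a different split of the same sentence.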
I’m looking into testing SP-based training for German, so I hope to learn from your experience and maybe also be able to add some of my own.