Adding SentencePieceTokenizer to fastai.text.data

Your code looks reasonable to me. I would proceed by trying it out on a small amount of data and confirming that the SentencePiece tokenizer produces the output you expect.

Another gotcha I discovered today: the default .load() method will not load your custom tokenizer as the processor. So if you save and then load, predict() will not work unless you specify your processor explicitly. Have a look at the code below; I think it will get you going once you reach that point.

from fastai.text.data import _get_processor

# Rebuild the processor from your custom tokenizer and SentencePiece vocab,
# then pass it to load() so predict() works again
my_processor = _get_processor(mycust_tok, sp_vocab)
data = TextLMDataBunch.load(path=PATH, cache_name='sp_tokenizer', processor=my_processor)