I have recently used SubwordTokenizer in fastai v2 (which I strongly recommend); according to the source code it is actually a wrapper around SentencePiece.
See this post: Tokenizer with pretrained vocab in fastai
When you create your dataloaders (the equivalent of a databunch in fastai v1 terminology) with SubwordTokenizer, fastai trains a SentencePiece tokenizer on your corpus and saves the resulting model under 'tmp/spm.model' for later use.