This makes me believe that SentencePiece is not able to perform subword tokenization. My understanding is that SPProcessor needs to create a tmp/spm.model and a tmp/spm.vocab in the directory where this notebook is located.
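For reference, here is a minimal sketch of the v1 pattern I believe the notebook follows (the file and column names are placeholders, not my actual code):

```python
from fastai.text import *

# fastai v1 sketch (hypothetical file/column names): SPProcessor trains
# SentencePiece during preprocessing and is expected to write
# tmp/spm.model and tmp/spm.vocab under the given path.
path = Path('.')
data = (TextList.from_csv(path, 'texts.csv', cols='text',
                          processor=SPProcessor())
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=64))
```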
Also, the code is using fastai v1, and I haven't used SentencePiece with the v1 version.
I have recently used SubwordTokenizer, which according to the source code actually uses SentencePieceTokenizer under the hood; however, that was in fastai v2 (which I strongly recommend).
When you create your dataloaders (i.e. a DataBunch in fastai v1 terminology) with SubwordTokenizer, fastai trains a tokenizer on your corpus and saves it under 'tmp/spm.model' for later use.
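For example (a minimal sketch, assuming a CSV with a 'text' column; the file name and vocab_sz are placeholders):

```python
from fastai.text.all import *

# Hypothetical data: a CSV with a 'text' column.
df = pd.read_csv('texts.csv')

# Building LM dataloaders with a SubwordTokenizer. On the first run,
# fastai trains a SentencePiece model on the corpus and caches it
# under 'tmp/spm.model' for later use.
dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True,
                             tok=SubwordTokenizer(vocab_sz=10000)),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1),
).dataloaders(df, bs=64)
```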
Ah, I see. Yes, I was going off of an NLP notebook Jeremy wrote last year, so I guess it was using the older version of fastai.
If I were training a language model from scratch, would I need to train the tokenizer separately? Or can I just include tok=SentencePieceTokenizer() as an argument in my DataBlock?
I don’t think you need to explicitly train the tokenizer as I did in the post I shared (there I just wanted to demonstrate how to load a previously trained tokenizer). Setting up the dataloaders with tok=SubwordTokenizer() should create and save the tokenizer automatically.
If I want to instantiate the dls object again without retraining the subword tokenizer, how would I do that? Would it just be tok='tmp/some.model'?
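Or maybe something like this, going off the sp_model argument I see in the v2 source (paths and column names are hypothetical)?

```python
from fastai.text.all import *

# Same hypothetical CSV as before.
df = pd.read_csv('texts.csv')

# Guess: point SubwordTokenizer (an alias for SentencePieceTokenizer
# in fastai v2) at the previously cached model via sp_model, so it
# loads the trained model instead of retraining it.
tok = SubwordTokenizer(sp_model=Path('tmp/spm.model'))

dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1),
).dataloaders(df, bs=64)
```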