Ah alright. I assumed you used the default pre-trained language model.
In this case, you need to save the tokenizer that was used for the pre-trained model, which is done automatically when you call setup:
from fastai.text.all import *
sp = SubwordTokenizer(vocab_sz=10000)
sp.setup(texts)  # texts: the corpus the pre-trained model was trained on
This trains and saves a SentencePiece model; setup returns its location:
{'sp_model': Path('tmp/spm.model')}
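If you'd rather not hard-code that path later, you can capture setup's return value instead of just calling it (a small sketch; sp_model_path is a name I'm introducing here):
res = sp.setup(texts)              # won't retrain if the tokenizer is already trained
sp_model_path = res['sp_model']    # -> Path('tmp/spm.model')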
Then, when you create your DataLoaders for fine-tuning, you need to load this saved model by passing it as sp_model; it will then be applied to your new texts.
tok = SubwordTokenizer(vocab_sz=10000, sp_model='tmp/spm.model')  # reload the trained tokenizer
dblock_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
                      get_x=ColReader('text'),
                      splitter=RandomSplitter(0.2))
dls_lm = dblock_lm.dataloaders(df, bs=64)
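From there you can build the learner for fine-tuning. A minimal sketch, assuming your custom pre-trained weights and vocab are saved in the models folder as .pth and .pkl files ('lm_weights' and 'lm_vocab' are hypothetical names; pass your own, without extensions, to pretrained_fnames):
# 'lm_weights' and 'lm_vocab' are hypothetical filenames for your own
# pre-trained weights (models/lm_weights.pth) and vocab (models/lm_vocab.pkl)
learn = language_model_learner(dls_lm, AWD_LSTM,
                               pretrained_fnames=('lm_weights', 'lm_vocab'))
learn.fine_tune(1)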
Does that answer your question?