Tokenizer with pretrained vocab in fastai

Ah alright. I assumed you used the default pre-trained language model.

In that case, you need to save the tokenizer that was used for the pre-trained model; this is done automatically when you call setup:

from fastai.text.all import *

sp = SubwordTokenizer(vocab_sz=10000)  # SentencePiece-based subword tokenizer
sp.setup(texts)                        # texts: an iterable of raw strings from your corpus

Calling setup trains the SentencePiece model, writes it to disk, and returns the path to the saved file:

{'sp_model': Path('tmp/spm.model')}
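
As an optional sanity check (just a sketch; sp_pretrained and the sample sentence are names I made up), you can recreate the tokenizer from that file. Since the saved model is loaded rather than retrained, it splits new text with the same subword vocabulary:

# Reload the saved SentencePiece model; nothing is retrained here.
sp_pretrained = SubwordTokenizer(sp_model='tmp/spm.model')
print(first(sp_pretrained(['a quick sanity-check sentence'])))  # list of subword pieces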

Then, when you create your DataLoaders for fine-tuning, you load this saved model by passing its path as sp_model; the pretrained tokenizer is then applied to your new texts.

dblock_lm = DataBlock(
    # Reuse the saved SentencePiece model so the same subword vocab is applied
    blocks=TextBlock.from_df('text', is_lm=True,
                             tok=SubwordTokenizer(vocab_sz=10000, sp_model='tmp/spm.model')),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.2))

dls_lm = dblock_lm.dataloaders(df, bs=64)  # df holds the 'text' column of your new corpus
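
If you want to confirm the pretrained vocabulary was picked up, a quick check like this should work (assuming the dls_lm from above; the exact vocab size also depends on fastai's special tokens and min_freq):

dls_lm.show_batch(max_n=2)   # tokens should look like SentencePiece subword pieces
print(len(dls_lm.vocab))     # should be close to the 10,000 subwords trained above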

Does that answer your question?
