Tokenizer with pretrained vocab in fastai

I am trying to do transfer learning with text classification.

My language requires a subword tokenizer, but if I use the SubwordTokenizer on my new data, most of the tokens do not exist in the vocab of the old base model and I get a lot of “xxunk” tokens, which results in a bad new model.

My idea is to use only tokens that exist in the vocab of the base language model.
I experimented a lot with SubwordTokenizer, but could not find a way to do this.

Is this somehow possible, or does fastai require creating a new vocab with each tokenizer?


Hi @chris3,

You cannot use subword tokenization in connection with the default vocab that comes with the pre-trained Wikitext 103 model.

By default, the pre-trained Wikitext 103 language model comes with a vocab from a word-based spaCy tokenizer. When you then create DataLoaders for language model fine-tuning using the default tokenizer, the vocab from pre-training is aligned and extended with the new vocab.

Now, if you use a different tokenization approach, e.g. SubwordTokenizer, many tokens in your new vocab will be parts of words that do not exist in the word-based vocab from pre-training. In this case, as you found out correctly, these tokens will be mapped to the unknown token.
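To illustrate the mismatch, here is a minimal plain-Python sketch (hypothetical vocab and token IDs — not fastai's actual `Numericalize`):

```python
# Hypothetical word-based vocab from pre-training (token -> ID)
pretrained_vocab = {"xxunk": 0, "the": 1, "extreme": 2}

def numericalize(tokens, vocab):
    """Map each token to its ID; unknown tokens fall back to xxunk (ID 0)."""
    return [vocab.get(t, vocab["xxunk"]) for t in tokens]

# Word tokens from a spaCy-style tokenizer line up with the vocab:
print(numericalize(["the", "extreme"], pretrained_vocab))   # [1, 2]

# Subword pieces from a different tokenizer do not, so everything maps to xxunk:
print(numericalize(["_ex", "tr", "eme"], pretrained_vocab))  # [0, 0, 0]
```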

I think that there is no pre-trained language model using SubwordTokenizer available at the moment (someone please correct me if I’m wrong), so you have two options:

  • Train a language model from scratch using SubwordTokenizer and see how it performs
  • Fine-tune the pre-trained language model using word-based tokenization and compare to above

Hi @stefan-ai,
thanks a lot for your reply!
I think there is some misunderstanding regarding my message. I do not use Wikitext 103.
My pre-trained model is for German and was trained on newspaper articles. German requires subword tokenization, so my pre-trained model also uses a subword tokenizer.

For example, the word “extreme” appears in the pre-trained vocab as “_extrem e”, but on the new data the new subword tokenizer produced “_ex tr eme”, which finally results in “xxunk”.

Now I would like to tell the new subword tokenizer to use (only) the vocab of the previous tokenizer, and thus produce “_extrem e” instead of “_ex tr eme”.

I think this behaviour is really holding my results back, as I get “xxunk” many, many times for words that the vocab of the pre-trained model could actually cover with a different segmentation.

It would be great if there is any way to solve this in fastai. At least in theory it should be possible to create a tokenizer that parses the text and looks up tokens from the pre-trained vocab.
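In principle that lookup could be a greedy longest-match over the pre-trained piece vocab — a rough, hypothetical sketch in plain Python (not fastai's or SentencePiece's actual algorithm):

```python
def greedy_segment(word, vocab, unk="xxunk"):
    """Split a word into the longest pieces found in vocab, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(unk)  # no piece in the vocab matches this position
            i += 1
    return pieces

# With the pre-trained pieces available, "_extreme" segments as in the old model:
print(greedy_segment("_extreme", {"_extrem", "e", "_ex", "tr"}))  # ['_extrem', 'e']
```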
Thanks again

Ah alright. I assumed you used the default pre-trained language model.

In this case you need to save the tokenizer that was used for the pre-trained model, which is automatically done when calling setup:

sp = SubwordTokenizer(vocab_sz=10000)

This saves the following file:

{'sp_model': Path('tmp/spm.model')}

Then when you create your DataLoaders for fine-tuning, you need to load this saved model, which will then be applied to your new texts.

dblock_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=SubwordTokenizer(vocab_sz=10000, sp_model='tmp/spm.model')))
dls_lm = dblock_lm.dataloaders(df, bs=64)

Does that answer your question?

Wow, that was exactly the hint I needed! My accuracy values go through the roof.

Thank you so much. I wish there were better docs where I could read up on those details.

Btw, if anyone is interested, I am using the following model now:
It also exists in many other languages (downloadable from the same website).


Glad it helped 🙂

Thanks for sharing that resource. I am also working with German text and didn’t know about it.

This is a cool resource. I’m in the process of trying to sort it out, but I’m curious how you take the vocab and model to load into a pretrained model. It wasn’t clear to me, but I’m somewhat assuming that if I get the .model file, it will contain the vocab. Is that right?

Similar to you, I don’t want to tokenize my domain-specific text differently from the pretrained model.

Hi Aaron
It will be similar (but different) to this below.
Regards Conwyn

Train Wiki IMDB

Google Colab pwd to /content

save the model without the head


#Note it saves it in path/models where path is /root/.fastai/data/imdb

#Copy it for safety to my Google Drive

!cp /root/.fastai/data/imdb/models/finetunedF.pth /content/gdrive/MyDrive

#now pickle the data loaders

import pickle

pickle.dump(dls_lm, open("savelm.p", "wb"))

#And copy for safety
!cp /content/savelm.p /content/gdrive/MyDrive

If you are on a different machine or new machine (Colab)

import pickle

copy the headless model from above to the models directory (note that it ignores path)

!mkdir /content/models
!cp /content/gdrive/MyDrive/finetunedF.pth /content/models

#copy the Data Loader with the vocab

!cp /content/gdrive/MyDrive/savelm.p /content
dls_lm = pickle.load(open("/content/savelm.p", "rb"))

Now prepare your actual text, but point text_vocab to the original vocab from the pickle imported above

dlsr = TextDataLoaders.from_df(df=dfr, text_vocab=dls_lm.vocab, text_col='Review', label_col='Latency', label_delim=";", y_block=MultiCategoryBlock, splitter=RandomSplitter(0.2))
learnr = text_classifier_learner(dlsr, AWD_LSTM, drop_mult=0.5, n_out=len(dlsr.vocab[1]), metrics=[]).to_fp16()

Now use the imported headless model

learnr.load_encoder('finetunedF') #Described in the Course Book
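The save/load round-trip above boils down to pickling the DataLoaders so the vocab travels with it. A minimal stand-alone sketch with a plain-Python stand-in object (not a real fastai DataLoaders):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for dls_lm: any picklable object carrying the vocab
dls_stub = {"vocab": ["xxunk", "xxpad", "_extrem", "e"]}

path = os.path.join(tempfile.mkdtemp(), "savelm.p")
with open(path, "wb") as f:
    pickle.dump(dls_stub, f)      # save on the training machine

with open(path, "rb") as f:
    restored = pickle.load(f)     # load on the new machine

print(restored["vocab"] == dls_stub["vocab"])  # True
```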


Here’s a notebook on how to load a sentencepiece model:

I pickle/unpickle the vocab and pass it to the datablock:

with open(lm_ft_fns[1], 'wb') as f:
    pickle.dump(learn.dls.vocab, f)

then to load the model and vocab:

tok = SentencePieceTokenizer(lang=lang, sp_model=spm_path/'spm.model')

with open(f'{lm_fns[1]}.pkl', 'rb') as f:
    vocab = pickle.load(f)

dblocks = DataBlock(blocks=(TextBlock.from_df('text', tok=tok, vocab=vocab, backwards=backwards), CategoryBlock))
dls = dblocks.dataloaders(df, bs=bs, num_workers=num_workers)

you could also reconstruct the vocab from the tokenizer - but I didn’t verify how the special tokens are handled that way.

tok = SentencePieceTokenizer(lang=lang, sp_model=spm_path/'spm.model')
tok_vocab = [tok.tok.id_to_piece(i) for i in range(tok.tok.get_piece_size())]
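On the special-token question: fastai prepends its own special tokens when it builds a vocab, so a reconstruction from raw pieces may need them added manually. A hedged sketch — the exact token list below is an assumption (check `fastai.text.core.defaults.text_spec_tok` and compare against your saved vocab):

```python
# Assumed fastai default special tokens (verify against your fastai version)
special_tokens = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
                  "xxrep", "xxwrep", "xxup", "xxmaj"]

# Hypothetical raw pieces reconstructed from the SentencePiece model
raw_pieces = ["_the", "_extrem", "e", "_weather"]

# Prepend the specials, skipping any piece that duplicates a special token
vocab = special_tokens + [p for p in raw_pieces if p not in special_tokens]
print(vocab[0], len(vocab))  # xxunk 13
```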