Building a tokenizer with pretrained vocab in fastai

I am trying to train a text classifier.

I am working with a small, specialised dataset. When I train a language model on my data, most of my tokens do not exist in the base model's vocab, so they all get mapped to "xxunk", which results in a poor model.

I have a vocab.txt file (one word per line) containing most of the words that appear in my dataset.

Is it possible to build my vocab from that file before I train my language model?
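In case it helps clarify what I'm after, this is roughly what I imagined: read the word list from vocab.txt and prepend fastai's special tokens so the result looks like a fastai vocab (the exact special-token list below is my assumption based on fastai v2's `defaults.text_spec_tok`; `load_vocab` is just a hypothetical helper I wrote):

```python
# Sketch: build a fastai-style vocab list from a plain-text word list,
# instead of letting fastai infer the vocab from the corpus (which maps
# my unseen domain words to xxunk).

# Assumed fastai v2 special tokens (defaults.text_spec_tok) -- please
# correct me if these belong elsewhere in the list.
SPECIAL_TOKENS = [
    "xxunk", "xxpad", "xxbos", "xxeos",
    "xxfld", "xxrep", "xxwrep", "xxup", "xxmaj",
]

def load_vocab(path):
    """Read one token per line, drop blanks and duplicates,
    and prepend the special tokens."""
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    seen = set(SPECIAL_TOKENS)
    vocab = list(SPECIAL_TOKENS)
    for w in words:
        if w not in seen:
            seen.add(w)
            vocab.append(w)
    return vocab
```

My hope was that a list like this could then be passed in when building the DataBlock, e.g. something like `TextBlock.from_df(..., is_lm=True, vocab=load_vocab("vocab.txt"))`, but I'm not sure whether that is the supported way to do it, or whether the pretrained weights would still line up afterwards.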