Pre-trained and fine-tuned vocabularies in Transfer Learning for NLP

Hi everyone! First of all, I’d like to thank @jeremy and the rest of the team for providing such a neat library as well as the online course(s), which I really enjoyed.

I am currently working on a transfer learning task in NLP. In the language model fine-tuning step, my intuition at first was that the fine-tuned model vocabulary would be the union of the pre-trained dataset vocabulary (wikitext-103) and the one found in the target corpus(*). A quick glance at the source code revealed instead that wikitext words that do not also appear in the new corpus are discarded.
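To make the behaviour concrete, here is a minimal sketch of what "discarding" amounts to at the embedding level. It is not the library's actual code: the function name `remap_embeddings`, the toy vocabularies, and the choice to initialise unseen target words with the mean pre-trained row are all assumptions made for illustration.

```python
def remap_embeddings(pretrained_vocab, pretrained_emb, target_vocab):
    """Build one embedding row per target token (hypothetical helper).

    Rows are copied from the pre-trained matrix when the token is known.
    Tokens unseen in pre-training get the column-wise mean row, which is
    an assumed initialisation, not something stated in the post.
    """
    old_index = {tok: i for i, tok in enumerate(pretrained_vocab)}
    n_rows = len(pretrained_emb)
    mean_row = [sum(col) / n_rows for col in zip(*pretrained_emb)]

    new_emb = []
    for tok in target_vocab:
        i = old_index.get(tok)
        new_emb.append(list(pretrained_emb[i]) if i is not None else list(mean_row))
    return new_emb


# Toy data: 4 pre-trained tokens with 2-dimensional embeddings.
pretrained_vocab = ["the", "cat", "senate", "river"]
pretrained_emb = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
target_vocab = ["the", "cat", "crosswalk"]

new_emb = remap_embeddings(pretrained_vocab, pretrained_emb, target_vocab)

# Pre-trained tokens with no row in the fine-tuned model.
target_set = set(target_vocab)
dropped = [tok for tok in pretrained_vocab if tok not in target_set]
```

In this toy case "senate" and "river" lose their rows entirely, which is exactly the situation I am worried about if they show up at test time.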

This way we end up with a specialized (and likely smaller in number of parameters) model for our target domain, which is good. Intuitively, however, it feels weird to me to discard part of the language info learnt in the pre-training part, especially because at test time, or after deployment, words in wikitext - but not in the fine-tuned model vocab - may still turn up.

Am I missing something really obvious here? :thinking:
I would be very curious to know about experiments addressing these considerations, if any. :face_with_monocle:

Thank you all and sorry for the long post!

(*) Actually, considering a maximum vocab size of 60000 (as it is for wikitext), we would still need to drop the N least frequent words of the pre-trained dataset vocabulary (N being the number of target corpus words not already in the pre-trained vocabulary, so at most the size of the target corpus vocabulary)…
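A rough sketch of the capped union described in the footnote, assuming the pre-trained vocabulary is already sorted from most to least frequent. The helper `merge_vocab` and the toy word lists are hypothetical, and the cap is shrunk from 60000 to 5 to keep the example readable.

```python
def merge_vocab(pretrained_by_freq, target_vocab, max_size):
    """Keep every target token, then fill the remaining slots with the
    most frequent pre-trained tokens until max_size is reached."""
    merged = list(target_vocab[:max_size])
    seen = set(merged)
    for tok in pretrained_by_freq:
        if len(merged) >= max_size:
            break
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged


# Pre-trained vocabulary is already at the cap (5 tokens here).
pretrained_by_freq = ["the", "of", "cat", "senate", "river"]
target_vocab = ["the", "crosswalk", "cat"]

merged = merge_vocab(pretrained_by_freq, target_vocab, max_size=5)
dropped = [tok for tok in pretrained_by_freq if tok not in set(merged)]
```

Only one target word ("crosswalk") is new here, so only the single least frequent pre-trained word ("river") has to go.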

hi isabella,

good question! you can solve this by preprocessing your text differently, e.g. by choosing a different kind of tokeniser, like sentencepiece. say, for example, that you have the word ‘crosswalk’ in your training data, but somehow it never appears in the vocab of the pre-trained wiki dataset. sentencepiece could then tokenise crosswalk into something like ‘_cross walk’, instead of disregarding it as unknown. this has some nice side effects: 1. you are training on subwords, 2. because of that, your vocabulary is considerably smaller, 3. you will keep almost all words, because they consist of subsequences which are known.
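to illustrate the idea, here is a toy greedy longest-match splitter. it is not how sentencepiece works internally (sentencepiece learns its pieces from data with a unigram language model or BPE); the function name and the hand-written piece set are made up for this example. the only detail borrowed from sentencepiece is the word-start marker "▁" (U+2581), written as ‘_’ above.

```python
WORD_START = "\u2581"  # the marker sentencepiece puts at the start of a word


def greedy_subword_tokenize(word, pieces, unk="<unk>"):
    """Split a word into the longest known pieces, scanning left to right.

    Toy stand-in for a learned subword model; characters that match no
    piece are emitted as the unk token one at a time.
    """
    text = WORD_START + word
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), start, -1):
            if text[start:end] in pieces:
                tokens.append(text[start:end])
                start = end
                break
        else:
            tokens.append(unk)
            start += 1
    return tokens


# Hypothetical piece inventory: neither "crosswalk" nor "crosswalks" is in it.
pieces = {WORD_START + "cross", "walk", "s"}

tokens = greedy_subword_tokenize("crosswalk", pieces)
plural = greedy_subword_tokenize("crosswalks", pieces)
```

neither word is in the piece set as a whole, yet both are covered by known pieces, so nothing falls back to unknown.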

hope this helps.

best, phillip
