Hi everyone! First of all, I'd like to thank @jeremy and the rest of the fast.ai team for providing such a neat library, as well as the online course(s), which I really enjoyed.
I am currently working on a transfer learning task in NLP. In the language model fine-tuning step, my initial intuition was that the fine-tuned model's vocabulary would be the union of the pre-training dataset vocabulary (wikitext-103) and the one found in the target corpus (*). A quick glance at the source code revealed instead that wikitext words that do not also appear in the new corpus are discarded.
This way we end up with a specialized (and likely smaller, in number of parameters) model for our target domain, which is good. Intuitively, however, it feels weird to discard part of the language information learnt during pre-training, especially because at test time, or after deployment, words that are in wikitext but not in the fine-tuned model's vocab may still turn up.
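To make sure I'm reading the behaviour right, here's a toy sketch of what I understand the library to be doing (the word lists are made up for illustration; this is not the actual fastai code, just the set logic I'm describing):

```python
# Toy illustration of vocab handling during fine-tuning, as I understand it.
pretrained_vocab = ["the", "of", "and", "neuron", "synapse"]   # stands in for wikitext-103
target_vocab = ["the", "and", "mitochondria"]                  # stands in for the target corpus

target_set = set(target_vocab)
pretrained_set = set(pretrained_vocab)

# Fine-tuned vocab = target corpus vocab. Pre-trained words absent from the
# target corpus are discarded; genuinely new words get fresh embeddings.
kept = [w for w in pretrained_vocab if w in target_set]
discarded = [w for w in pretrained_vocab if w not in target_set]
new_words = [w for w in target_vocab if w not in pretrained_set]

print(kept)       # words whose pre-trained embeddings survive
print(discarded)  # pre-trained words dropped, even though they may appear at test time
print(new_words)  # target-corpus words initialised from scratch
```

In this toy example, `of`, `neuron`, and `synapse` are dropped along with whatever the language model learnt about them, which is exactly the part that feels wasteful to me.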
Am I missing something really obvious here?
I would be very curious to know about experiments addressing these considerations, if any.
Thank you all and sorry for the long post!
(*) Actually, considering a maximum vocab size of 60000 (as used for wikitext), we would still need to drop the N least frequent words of the pre-trained vocabulary (N being the number of target-corpus words not already in it)…