Hello @morgan. That is not what fastai (v1 and v2) does: the vocabulary used to fine-tune the pre-trained Language Model is new (built from the fine-tuning corpus), and the embeddings of its tokens are copied from the corresponding tokens in the old vocabulary (the vocabulary of the pre-trained LM). If a token has no counterpart in the old vocabulary, its embedding is set to the mean of the embeddings of the old (pre-trained) vocabulary.
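To make that mechanism concrete, here is a minimal sketch of the matching step in plain PyTorch (not fastai's actual code; the names `old_wgts`, `old_vocab`, `new_vocab` just mirror the ones used below):

```python
import torch

def match_embeddings_sketch(old_wgts, old_vocab, new_vocab):
    # old_wgts: (len(old_vocab), emb_size) embedding matrix of the pre-trained LM
    mean_wgt = old_wgts.mean(dim=0)                      # fallback for unseen tokens
    old_idx = {tok: i for i, tok in enumerate(old_vocab)}
    new_wgts = old_wgts.new_zeros((len(new_vocab), old_wgts.size(1)))
    for i, tok in enumerate(new_vocab):
        j = old_idx.get(tok, -1)
        # copy the pre-trained embedding if the token exists in the old vocab,
        # otherwise use the mean of all pre-trained embeddings
        new_wgts[i] = old_wgts[j] if j >= 0 else mean_wgt
    return new_wgts

# tiny example
old_vocab = ['the', 'cat', 'sat']
new_vocab = ['the', 'dog', 'sat']
old_wgts  = torch.randn(len(old_vocab), 4)
new_wgts  = match_embeddings_sketch(old_wgts, old_vocab, new_vocab)
# 'the' and 'sat' keep their pre-trained embeddings, 'dog' gets the mean embedding
```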
In fastai v2, follow this path (a minimal usage sketch follows the list):
- `language_model_learner(dls, arch, config=None, drop_mult=1., pretrained=True, pretrained_fnames=None, **kwargs)` (“Create a `Learner` with a language model from `dls` and `arch`.”), which calls `learn.load_pretrained(*fnames)` when loading pre-trained weights.
- `dls`: DataLoaders of the training and validation datasets, carrying the vocabulary of the new corpus (the one used to fine-tune the language model (LM)). Let’s call this (new) vocabulary `new_vocab`.
- `arch` and `pretrained_fnames`: the architecture of the pre-trained LM with its weights and (old) vocabulary (the one used to train the first LM). Let’s call this (old) vocabulary `old_vocab`.
- `load_pretrained(self, wgts_fname, vocab_fname, model=None)` (“Load a pretrained model and adapt it to the data vocabulary.”), which gets new embeddings (`new_wgts`) for the new vocabulary `new_vocab` with this line: `wgts = match_embeds(wgts, old_vocab, new_vocab)`.
- We can read this line as `new_wgts = match_embeds(old_wgts, old_vocab, new_vocab)`.
- `match_embeds(old_wgts, old_vocab, new_vocab)` (“Convert the embedding in `old_wgts` to go from `old_vocab` to `new_vocab`.”), which copies the embedding from `old_wgts` whenever a token of `new_vocab` is also in `old_vocab` (and falls back to the mean of the old embeddings otherwise, as described above).
- Check this line in particular.
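Putting the path together, a minimal fastai v2 usage sketch could look like the one below. The toy DataFrame, `drop_mult=0.3` and the training hyperparameters are just example choices; with `pretrained=True`, the default AWD_LSTM weights and vocabulary are downloaded and `load_pretrained` / `match_embeds` run under the hood.

```python
from fastai.text.all import *
import pandas as pd

# Toy fine-tuning corpus (hypothetical data); the DataLoaders carry new_vocab
df = pd.DataFrame({'text': ['some text from the fine-tuning corpus',
                            'another document of the new domain'] * 200})
dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True, bs=8)

# pretrained=True downloads the AWD_LSTM weights and old_vocab;
# language_model_learner then calls learn.load_pretrained(...), which calls
# match_embeds(old_wgts, old_vocab, new_vocab) to adapt the embedding layer
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, pretrained=True)

learn.fit_one_cycle(1, 2e-3)   # fine-tune the LM on the new corpus
```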
Note: the 3-step path is exactly the same in fastai v1, starting with language_model_learner (fastai v1).
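For reference, a rough fastai v1 equivalent (again a sketch with hypothetical toy data; the DataBunch builds new_vocab from the fine-tuning corpus, and the pre-trained weights are adapted to it when the learner is created):

```python
from fastai.text import *   # fastai v1
import pandas as pd

# Toy fine-tuning corpus (hypothetical data)
train_df = pd.DataFrame({'text': ['some text from the fine-tuning corpus'] * 200})
valid_df = pd.DataFrame({'text': ['another document of the new domain'] * 50})

# The DataBunch builds new_vocab from the fine-tuning corpus
data_lm = TextLMDataBunch.from_df('.', train_df, valid_df, text_cols='text')

# Same 3-step path: language_model_learner -> load pre-trained weights -> embedding matching
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)
learn.fit_one_cycle(1, 2e-3)
```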