MultiFiT vocabulary

pierreguillou · June 17, 2020, 6:25pm

Hello @morgan. It is not what fastai (v1 e v2) does: the vocabulary for fine-tuning the pre-trained Language Model is new (built from the fine-tune corpus) and the embeddings of its tokens are the ones of the corresponding tokens in the old vocabulary (vocabulary of pre-trained LM). If there is no corresponding token, the values of embeddings are the mean of embeddings of the old vocabulary (pre-trained one).

In fastai v2, follow this path:

language_model_learner(dls, arch, config=None, drop_mult=1., pretrained=True, pretrained_fnames=None, **kwargs) (“Create a Learner with a language model from dls and arch.”) that returns learn.load_pretrained(*fnames).

dls = Dataloaders of training and validation datasets, and with the vocabulary of the new corpus (the one for fine-tuning the language model (LM)). Let’s call this (new) vocabulary new_vocab.
arch and pretrained_fnames: architecture of the pre-trained LM with its weights and (old) vocabulary (the one used for training the first LM). Let’s call the (old) vocabulary old_vocab.

load_pretrained(self, wgts_fname, vocab_fname, model=None) (“Load a pretrained model and adapt it to the data vocabulary.”) that gets new embeddings (new_wgts) for the new vocabulary vocab2 with this code: wgts = match_embeds(wgts, old_vocab, new_vocab).

We can understand this code as new_wgts = match_embeds(old_wgts, old_vocab, new_vocab))

match_embeds(old_wgts, old_vocab, new_vocab) (“Convert the embedding in old_wgts to go from old_vocab to new_vocab.”) that takes old_wgts (embeddings of a token of the old_vocab) when a token of the new_vocab is as well in old_vocab.

Check this line in particular

Note: the 3-step path is exactly the same in fastai v1 starting with language_model_learner (fastai v1).