Can someone help me figure out some doubts I have?
I am testing the (Italian) MultiFiT model (https://github.com/n-waves/multifit).
Looking at the spm.vocab file coming with the model, I have noticed many Japanese/Chinese/… characters appearing towards the end of the file. I was wondering why… the model should have been pretrained on the Italian wikipedia where I doubt many of those characters appear at all.
Moreover, I have also experimented with the Italian ULMFiT model (https://github.com/Quantyca/deepitalian). The way the vocabulary for LM fine-tuning is built looks different from the MultiFiT approach: in fact, my data_lm vocabulary only contains words appearing in the fine-tuning dataset (i.e., words appearing in the wiki pretraining vocab but not in the new dataset are discarded).
This does not seem to happen with it_multifit, otherwise I would expect not to find, as I do, non-latin characters in my data_lm.vocab.
I think that if MultiFiT uses a vocab size of 60k and was pretrained on the Italian Wikipedia, it is possible that some non-Latin characters, like Japanese or Chinese, will also end up in the vocabulary (most probably at the end of the list, due to their low frequency). It is also normal for the vocabulary from pretraining to be reduced to the vocabulary used in fine-tuning.
But @piotr.czapla might know better
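The vocab-reduction behaviour described above can be sketched in plain Python. This is a hypothetical illustration, not fastai's actual code; build_finetune_vocab, the special-token list, and the sample tokens are all made up for the example:

```python
from collections import Counter

def build_finetune_vocab(pretrained_vocab, finetune_texts, specials=("<unk>", "<pad>")):
    """Keep only the pretrained tokens that actually occur in the fine-tuning corpus.

    Toy sketch of the vocab-reduction idea discussed in this thread;
    not the fastai implementation.
    """
    counts = Counter(tok for text in finetune_texts for tok in text.split())
    kept = [t for t in pretrained_vocab if t in counts]
    # Special tokens are always preserved; rare pretraining tokens simply drop out.
    return list(specials) + [t for t in kept if t not in specials]

pretrained = ["<unk>", "<pad>", "ciao", "mondo", "日", "本"]
texts = ["ciao mondo", "ciao ciao"]
print(build_finetune_vocab(pretrained, texts))
# ['<unk>', '<pad>', 'ciao', 'mondo'] — non-Latin tokens absent from the corpus are dropped
```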
It is also normal for the vocabulary from pretraining to be reduced to the vocabulary used in fine-tuning.
I agree (at least that’s the behavior I observed when using the “general” ULMFiT approach). However, when using the MultiFiT model I end up with a fine-tuning vocabulary that still contains those non-Latin characters (which don’t appear in the fine-tuning dataset, so I was expecting to lose them).
By the way, MultiFiT actually uses a 15K-token vocabulary.
I guess there might be a problem with the code I am using, probably something involving the tokenizer (SentencePiece vs spaCy).
Another thing I noticed is that with spaCy I am able to get emojis in my fine-tuning vocab, which I am currently unable to do with SentencePiece. Are you aware of any limitation in this sense?
Maybe MultiFiT uses a 15k vocab because it uses SentencePiece. spaCy uses word tokenisation, so you will find emojis there, but SentencePiece uses subword tokenisation, and therefore an emoji might be split into its characters.
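A toy greedy longest-match tokenizer can illustrate the difference being described. This is not SentencePiece itself, just a sketch of why a symbol missing from a fixed piece inventory ends up split into characters or mapped to an unknown marker instead of surviving as a whole token; the piece set and <unk> convention are invented for the example:

```python
def subword_tokenize(text, pieces):
    """Greedy longest-match subword tokenisation with a fallback for
    characters that are not covered by any piece (mapped to <unk>)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in pieces:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")          # character not in the inventory
            i += 1
    return tokens

pieces = {"ciao", "mon", "do", " "}
print(subword_tokenize("ciao mondo", pieces))  # ['ciao', ' ', 'mon', 'do']
print(subword_tokenize("ciao 😀", pieces))     # ['ciao', ' ', '<unk>'] — the emoji is lost
```

A word-level tokenizer, by contrast, would keep the emoji as its own token, which matches the spaCy behaviour reported above.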
This way, my data_lm.vocab contains only tokens from my dataset (cleaned_df), including emojis.
Still, I am unsure whether this is the correct way to go…
Also, the resulting tokens are more at the word level than the subword level (compared with the pretraining vocab), but that might be because my fine-tuning dataset is rather small.
Can someone with previous experience with the SentencePiece tokenizer and the MultiFiT approach please provide some guidance? @pierreguillou perhaps (I am using your notebooks on GitHub as a basis for my experiments)?
Hello @morgan. That is not what fastai (v1 and v2) does: the vocabulary for fine-tuning the pretrained Language Model is new (built from the fine-tuning corpus), and the embeddings of its tokens are those of the corresponding tokens in the old vocabulary (the vocabulary of the pretrained LM). If there is no corresponding token, the embedding values are the mean of the embeddings of the old (pretrained) vocabulary.
We can understand this code as new_wgts = match_embeds(old_wgts, old_vocab, new_vocab).
match_embeds(old_wgts, old_vocab, new_vocab) (“Convert the embedding in old_wgts to go from old_vocab to new_vocab.”) reuses the old_wgts embedding of a token of new_vocab whenever that token is also present in old_vocab.
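The remapping described above can be sketched with NumPy. This mirrors the behaviour being described (shared tokens keep their old row, new tokens get the mean embedding), not fastai's exact implementation; the tiny vocabularies and weight matrix are made up:

```python
import numpy as np

def match_embeds(old_wgts, old_vocab, new_vocab):
    """Remap an embedding matrix from old_vocab to new_vocab: a token present
    in both vocabularies keeps its old row; an unseen token gets the mean
    of all old embeddings. Sketch only, not fastai's convert_weights code."""
    old_idx = {tok: i for i, tok in enumerate(old_vocab)}
    mean = old_wgts.mean(axis=0)
    return np.stack([
        old_wgts[old_idx[tok]] if tok in old_idx else mean
        for tok in new_vocab
    ])

old_vocab = ["a", "b", "c"]
old_wgts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
new_vocab = ["b", "z"]   # "z" is new, so it receives the mean embedding [2., 2.]
print(match_embeds(old_wgts, old_vocab, new_vocab))
# [[2. 2.]
#  [2. 2.]]
```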
Thank you so much for taking the time to look at my question. The only difference I can see with your code is in pretrained_fnames=lm_fns3: you are using the weights from the Portuguese bidirectional model you trained, while I am using the it_multifit_paper_version files I downloaded from the MultiFiT project (and the specific folder where the spm.vocab and spm.model files reside).
Moreover, I think that the tokenization and creation of the “new vocabulary” should take place in the TextList.from_df(...).databunch() instruction (whereas the call to language_model_learner() should take care of remapping new tokens to old ones, as you explained in your previous post).
After running that first instruction, I would expect data_lm.vocab.itos to only contain the tokens found in the fine-tuning dataset, but it is unchanged with respect to the original vocabulary. It looks like the SPProcessor.load(dest) instruction only loads the previous vocabulary and tokenizes the new dataset based on that vocab alone.
To force the adaptation to the new vocab, the only way I found is to initialize a new processor forcing sp_model and sp_vocab to be None, based on this part of the process method of class SPProcessor:
if self.sp_model is None or self.sp_vocab is None:
cache_dir = self.train_func(ds.items, ds.path)
self.sp_model,self.sp_vocab = cache_dir/'spm.model',cache_dir/'spm.vocab'
Still wondering whether this work-around makes sense, though…
It seems to me that the convert_weights function, called in the load_pretrained method of the language learner, takes care of keeping the weights of a word W the same as in the pretrained model, even if W has a new id in the new vocab. However, the tokenization of the fine-tuning dataset and the construction of the new vocab (new vocab = list of tokens actually appearing in the new dataset) happen before that; after some tests, my understanding is that this is done when label_for_lm is called.
When using the SentencePiece processor for tokenization (as opposed to spaCy), my understanding so far (based on my tests and a deep dive into the source code, in particular the process function of class SPProcessor) is that the fine-tuning dataset is tokenized using tokens from the original dataset only, which means that, e.g., emojis are simply discarded because they were not present in the pretrained SentencePiece model.
Perhaps we are looking at different package versions? I am currently using fastai 1.0.60 and multifit 1.0.