Can someone help me figure out some doubts I have?
I am testing the (Italian) MultiFiT model (https://github.com/n-waves/multifit).
Looking at the spm.vocab file coming with the model, I have noticed many Japanese/Chinese/… characters appearing towards the end of the file. I was wondering why… the model should have been pretrained on the Italian wikipedia where I doubt many of those characters appear at all.
Moreover, I have also experimented with the Italian ULMFiT model (https://github.com/Quantyca/deepitalian). The way the vocabulary for LM fine-tuning is built looks different from the MultiFiT approach: in fact, my data_lm vocabulary only contains words appearing in the fine-tuning dataset (i.e., words appearing in the wiki pretraining vocab but not in the new dataset are discarded).
This does not seem to happen with it_multifit, otherwise I would expect not to find, as I do, non-latin characters in my data_lm.vocab.
I think that if MultiFiT uses a vocab size of 60k and was pretrained on the Italian Wikipedia, it is possible that some non-Latin characters, like Japanese or Chinese, will also end up in the vocabulary (most probably at the end of the list, due to their low frequency). It is also normal for the vocabulary from pretraining to be reduced to the vocabulary used in fine-tuning.
But @piotr.czapla might know better
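The vocab-reduction behaviour described above can be sketched in plain Python. This is a hypothetical illustration, not fastai's actual code; build_finetune_vocab, the special-token list, and the sample tokens are all made up for the example:

```python
from collections import Counter

def build_finetune_vocab(pretrained_vocab, finetune_texts, specials=("<unk>", "<pad>")):
    """Keep only the pretrained tokens that actually occur in the fine-tuning corpus.

    Toy sketch of the vocab-reduction idea discussed in this thread;
    not the fastai implementation.
    """
    counts = Counter(tok for text in finetune_texts for tok in text.split())
    kept = [t for t in pretrained_vocab if t in counts]
    # Special tokens are always preserved; rare pretraining tokens simply drop out.
    return list(specials) + [t for t in kept if t not in specials]

pretrained = ["<unk>", "<pad>", "ciao", "mondo", "日", "本"]
texts = ["ciao mondo", "ciao ciao"]
print(build_finetune_vocab(pretrained, texts))
# ['<unk>', '<pad>', 'ciao', 'mondo'] — non-Latin tokens absent from the corpus are dropped
```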
It is also normal for the vocabulary from pretraining to be reduced to the vocabulary used in fine-tuning.
I agree (at least that’s the behavior I observed when using the “general” ULMFiT approach). However, when using the MultiFiT model I end up with a fine-tuning vocabulary that still contains those non-Latin characters (which don’t appear in the fine-tuning dataset, so I was expecting to lose them).
By the way, MultiFiT actually uses a 15K-token vocabulary.
I guess there might be a problem with the code I am using, probably something involving the tokenizer (SentencePiece vs spaCy).
Another thing I noticed is that with spaCy I am able to get emojis in my fine-tuning vocab, which I am currently unable to do with SentencePiece. Are you aware of any limitation in this sense?
Maybe MultiFiT uses a 15k vocab because it uses SentencePiece. spaCy uses word tokenisation, so you will find emojis there, but SentencePiece uses subword tokenisation, and therefore an emoji might be split into its characters.
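A toy greedy longest-match tokenizer can illustrate the difference being described. This is not SentencePiece itself, just a sketch of why a symbol missing from a fixed piece inventory ends up split into characters or mapped to an unknown marker instead of surviving as a whole token; the piece set and <unk> convention are invented for the example:

```python
def subword_tokenize(text, pieces):
    """Greedy longest-match subword tokenisation with a fallback for
    characters that are not covered by any piece (mapped to <unk>)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in pieces:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")          # character not in the inventory
            i += 1
    return tokens

pieces = {"ciao", "mon", "do", " "}
print(subword_tokenize("ciao mondo", pieces))  # ['ciao', ' ', 'mon', 'do']
print(subword_tokenize("ciao 😀", pieces))     # ['ciao', ' ', '<unk>'] — the emoji is lost
```

A word-level tokenizer, by contrast, would keep the emoji as its own token, which matches the spaCy behaviour reported above.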
This way, my data_lm.vocab contains only tokens from my dataset (cleaned_df), including emojis.
Still, I am unsure whether this is the correct way to go…
Also, the resulting tokens are more at the word level than the subword level (compared with the pretraining vocab), but that might be because my fine-tuning dataset is rather small.
Can someone with previous experience with the SentencePiece tokenizer and the MultiFiT approach please provide some guidance? @pierreguillou perhaps (I am using your notebooks on GitHub as a basis for my experiments)?
Hello @morgan. That is not what fastai (v1 and v2) does: the vocabulary for fine-tuning the pretrained Language Model is new (built from the fine-tuning corpus), and the embeddings of its tokens are those of the corresponding tokens in the old vocabulary (the vocabulary of the pretrained LM). If there is no corresponding token, the embedding values are the mean of the embeddings of the old (pretrained) vocabulary.
We can understand this code as new_wgts = match_embeds(old_wgts, old_vocab, new_vocab).
match_embeds(old_wgts, old_vocab, new_vocab) (“Convert the embedding in old_wgts to go from old_vocab to new_vocab.”) reuses the old_wgts embedding of a token of new_vocab whenever that token is also present in old_vocab.
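The remapping described above can be sketched with NumPy. This mirrors the behaviour being described (shared tokens keep their old row, new tokens get the mean embedding), not fastai's exact implementation; the tiny vocabularies and weight matrix are made up:

```python
import numpy as np

def match_embeds(old_wgts, old_vocab, new_vocab):
    """Remap an embedding matrix from old_vocab to new_vocab: a token present
    in both vocabularies keeps its old row; an unseen token gets the mean
    of all old embeddings. Sketch only, not fastai's convert_weights code."""
    old_idx = {tok: i for i, tok in enumerate(old_vocab)}
    mean = old_wgts.mean(axis=0)
    return np.stack([
        old_wgts[old_idx[tok]] if tok in old_idx else mean
        for tok in new_vocab
    ])

old_vocab = ["a", "b", "c"]
old_wgts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
new_vocab = ["b", "z"]   # "z" is new, so it receives the mean embedding [2., 2.]
print(match_embeds(old_wgts, old_vocab, new_vocab))
# [[2. 2.]
#  [2. 2.]]
```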
Thank you so much for taking the time to look at my question. The only difference I can see with your code is in pretrained_fnames=lm_fns3: you are using the weights from the Portuguese bidirectional model you trained, while I am using the it_multifit_paper_version files I downloaded from the MultiFiT project (and the specific folder where the spm.vocab and spm.model files reside).
Moreover, I think that the tokenization and creation of the “new vocabulary” should take place in the TextList.from_df(...).databunch() instruction (whereas the call to language_model_learner() should take care of remapping new tokens to old ones, as you explained in your previous post).
After running that first instruction, I would expect data_lm.vocab.itos to only contain the tokens found in the fine-tuning dataset, but it is unchanged with respect to the original vocabulary. It looks like the SPProcessor.load(dest) instruction only loads the previous vocabulary and tokenizes the new dataset based on that vocab alone.
To force the adaptation to the new vocab, the only way I found is to initialize a new processor forcing sp_model and sp_vocab to be None, based on this part of the process method of class SPProcessor:
if self.sp_model is None or self.sp_vocab is None:
cache_dir = self.train_func(ds.items, ds.path)
self.sp_model,self.sp_vocab = cache_dir/'spm.model',cache_dir/'spm.vocab'
Still wondering whether this work-around makes sense, though…
It seems to me that the convert_weights function, called in the load_pretrained method of the language learner, takes care of keeping the weights of a word W the same as in the pretrained model, even if W has a new id in the new vocab. However, the tokenization of the fine-tuning dataset and the construction of the new vocab (new vocab = list of tokens actually appearing in the new dataset) happen before that; after some tests, my understanding is that this is done when label_for_lm is called.
When using the SentencePiece processor for tokenization (as opposed to spaCy), my understanding so far (based on my tests and a deep dive into the source code, in particular the process function of class SPProcessor) is that the fine-tuning dataset is tokenized using tokens from the original dataset only, which means that, e.g., emojis are simply discarded because they were not present in the pretrained SentencePiece model.
Perhaps we are looking at different package versions? I am currently using fastai 1.0.60 and multifit 1.0.