Hello everyone,
Can someone help me clear up a couple of doubts?
I am testing the (Italian) MultiFiT model (https://github.com/n-waves/multifit).
Looking at the spm.vocab file that ships with the model, I noticed many Japanese/Chinese/… characters towards the end of the file. I am wondering why: the model should have been pretrained on the Italian Wikipedia, where I doubt many of those characters appear at all.
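For reference, this is roughly how I checked (a minimal sketch; I am assuming the standard SentencePiece .vocab format of one tab-separated piece and log-probability per line, and 'spm.vocab' is the path on my machine):

```python
import unicodedata

def is_cjk(ch):
    # Unicode character names for CJK ideographs, kana and hangul
    # start with these prefixes
    name = unicodedata.name(ch, '')
    return name.startswith(('CJK UNIFIED', 'HIRAGANA', 'KATAKANA', 'HANGUL'))

# SentencePiece .vocab lines are tab-separated: <piece>\t<log_prob>
with open('spm.vocab', encoding='utf-8') as f:
    pieces = [line.split('\t')[0] for line in f]

cjk_pieces = [p for p in pieces if any(is_cjk(ch) for ch in p)]
print(f'{len(cjk_pieces)} of {len(pieces)} pieces contain CJK characters')
print('examples:', cjk_pieces[:10])
```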
Moreover, I have also experimented with the Italian ULMFiT model (https://github.com/Quantyca/deepitalian). The way the vocabulary for LM fine-tuning is built looks different from the MultiFiT approach: there, my data_lm vocabulary only contains words that appear in the fine-tuning dataset (i.e., words present in the wiki pretraining vocab but absent from the new dataset are discarded). This does not seem to happen with it_multifit; otherwise I would not expect to find non-Latin characters in my data_lm.vocab, which I do.
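And this is roughly how I compared the two vocabularies (again a sketch, not a definitive check: it assumes fastai v1, where data_lm.vocab.itos is the id-to-token list, and the file names are placeholders for however you saved things):

```python
from fastai.text import load_data

def load_spm_pieces(path):
    # SentencePiece .vocab lines are tab-separated: <piece>\t<log_prob>
    with open(path, encoding='utf-8') as f:
        return {line.split('\t')[0] for line in f}

# Placeholder paths: adjust to wherever your databunch and vocab live
data_lm = load_data('.', 'data_lm.pkl')
spm_pieces = load_spm_pieces('spm.vocab')
lm_tokens = set(data_lm.vocab.itos)   # fastai v1 Vocab: id -> token list

kept = spm_pieces & lm_tokens         # pretraining pieces still in data_lm
dropped = spm_pieces - lm_tokens      # pretraining pieces pruned away
print(f'kept {len(kept)} / dropped {len(dropped)} pretraining pieces')
```

With ULMFiT I see many pieces dropped, while with it_multifit the pretraining pieces (non-Latin characters included) seem to be kept, which is what prompted my question.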
Thank you very much in advance for your help!