I am trying to understand examples/text.ipynb. It creates a DataBunch with its own vocab and itos. When I then want to use the WT103 pretrained model (which has its own vocab and itos), the weights need to be converted. I get that much, but I don’t understand what the code is doing.
def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model weights to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
How does it translate from one vocab to the other?
Using version release-1.0.15
It doesn’t translate anything. What it does instead is look up in the embedding matrix in the pretrained model and put the line for one word there at the place where this word is in the new vocabulary. It’s exactly like what was done in old fastai in the imdb notebook.
I also tried understanding the way it works and it was confusing for me.
Could you tell me if my understanding is correct?
When you say:
“What it does instead is look up in the embedding matrix in the pretrained model and put the line for one word there at the place where this word is in the new vocabulary.”
What I understand is that I will reuse the weights of the old embedding matrix for the words that are contained in the new vocabulary and that it will initialize new values for the words in the new vocabulary that are not contained in the old one.
For instance, if my previous vocabulary was [‘hi’, ‘apple’] and the new vocabulary is [‘hi’, ‘orange’], then it will reuse the weights of the word ‘hi’ but create a new, untrained embedding vector for the word ‘orange’.
Therefore, if I want to leverage the old vocabulary, I would have to use a set of words that is closely related to the old vocabulary.
Thank you very much for the answer.
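Yup, it’s just that if your new vocabulary is [‘orange’, ‘hi’], it will place the weights for ‘hi’ in the correct position (index 1 instead of index 0). As for ‘orange’, it will initialize it with the mean of all the pretrained embedding weights, and its bias with the mean of the pretrained decoder biases.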
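To make the remapping concrete, here is a minimal plain-Python sketch of the same logic on a toy embedding table. The vocabularies, the 2-dimensional vectors, and the helper name `remap_rows` are made up for illustration; the real `convert_weights` works on PyTorch tensors and also handles the decoder bias.

```python
from typing import Dict, List

def remap_rows(old_rows: List[List[float]], stoi_old: Dict[str, int],
               itos_new: List[str]) -> List[List[float]]:
    """Reorder pretrained rows to match a new vocabulary (toy version)."""
    dim = len(old_rows[0])
    # Column-wise mean of all pretrained rows, used for words the old vocab lacks.
    mean_row = [sum(r[d] for r in old_rows) / len(old_rows) for d in range(dim)]
    new_rows = []
    for word in itos_new:
        if word in stoi_old:
            # Known word: copy its pretrained row into the new position.
            new_rows.append(list(old_rows[stoi_old[word]]))
        else:
            # Unknown word: start from the mean row.
            new_rows.append(list(mean_row))
    return new_rows

# Hypothetical toy data: 'hi' is row 0 and 'apple' is row 1 in the old table.
itos_old = ["hi", "apple"]
stoi_old = {w: i for i, w in enumerate(itos_old)}
old_rows = [[1.0, 2.0], [3.0, 4.0]]

new_rows = remap_rows(old_rows, stoi_old, ["orange", "hi"])
print(new_rows)  # [[2.0, 3.0], [1.0, 2.0]]
```

Row 0 of the result is the mean row for ‘orange’, and row 1 is the pretrained row for ‘hi’, moved from index 0 to index 1. The row for ‘apple’ is not carried over.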
Will the weights of the word tokens in the old vocab be dropped if they do not appear in the new vocab? Wondering if this would result in the model “forgetting” its previous learnings.
The embedding of the final network will only contain weights for the words inside its vocab, so the other weights will be dropped. If they were kept, they would not be trained during the fine-tuning / classification training anyway, since those words won’t appear in the training dataset.
However, one could argue that keeping the weights for words that don’t appear in the training dataset (as distinct from the pretraining dataset) might lead to better performance than mapping those words to the unknown token and its embedding.
Yes, I was thinking along the same lines too that those pre-trained weights would still be useful, even if their associated word tokens were not present in the training dataset used to fine-tune the language model / train the classifier. They should not be dropped!
E.g. suppose the pre-trained vocab had an embedding / weight for the word “speed”. The training dataset contains the sentence “How fast does light travel?”, but the word “speed” does not appear anywhere in it. Instead of dropping the weight of “speed”, it would be useful to keep it, in case there is an input query like “What is the speed of light?” during model deployment. Assuming that the weights / vectors for “fast” and “speed” are closer to each other (by cosine similarity) than, say, “fast” and “donut” (also not in the training dataset), the model would be able to predict “What is the speed of light?” as belonging to the same class as “How fast does light travel?” (which was present in the training dataset) with higher probability than “What is the donut of light?”.
Would be happy to hear others’ experiences / thoughts / opinions on this.
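One way to experiment with this idea is to append the pretrained-only words to the new vocabulary before converting, so their rows survive the conversion. This is only a sketch: `extend_vocab` is a hypothetical helper, not part of fastai, and the toy vocabularies are made up.

```python
from typing import Dict, List

def extend_vocab(itos_new: List[str], stoi_old: Dict[str, int]) -> List[str]:
    """Return itos_new followed by the pretrained words it lacks (hypothetical helper)."""
    seen = set(itos_new)
    # Walk the old vocab in index order so the appended order is deterministic.
    old_in_order = [w for w, _ in sorted(stoi_old.items(), key=lambda kv: kv[1])]
    extra = [w for w in old_in_order if w not in seen]
    return list(itos_new) + extra

# Hypothetical toy vocabularies.
stoi_old = {"hi": 0, "speed": 1, "apple": 2}
itos_ext = extend_vocab(["fast", "hi"], stoi_old)
print(itos_ext)  # ['fast', 'hi', 'speed', 'apple']
```

The extended list could then be passed as `itos_new` to `convert_weights`, provided the dataset is numericalized with the same extended vocabulary. The trade-off is a larger embedding matrix and output layer, and the appended rows would stay at their pretrained values unless those words show up during fine-tuning.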