Something I am struggling with is understanding how we can use a pre-trained language model, say one with wikitext-103 weights. We then create a language model databunch with TextLMDataBunch.from_df(target_corpus); however, this databunch builds its own vocabulary. How can it take advantage of the pretrained model weights when the vocab used is obviously different from the Wikipedia corpus vocab?
To me this means that when getting the embedding for the word ‘apple’, for instance, which gets numericalized to say 123 in our target corpus, we could potentially get the embedding of an arbitrary other word, since 123 may be the index of a totally different word in the wikitext-103 vocab.
I remember the old word2vec used to save the vocab file so it could be reused (I could be wrong, it’s been a while).
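To make the mismatch concrete, here is a toy sketch (the vocab contents are made up) showing how the same word can numericalize to different ids in two vocabularies:

```python
# The same word maps to different ids in two independently built vocabs.
pretrained_itos = ["<unk>", "the", "of", "apple", "and"]  # made-up wikitext-style vocab
target_itos = ["<unk>", "apple", "juice", "the"]          # made-up target-corpus vocab

pretrained_stoi = {w: i for i, w in enumerate(pretrained_itos)}
target_stoi = {w: i for i, w in enumerate(target_itos)}

# 'apple' is id 3 in the pretrained vocab but id 1 in the target vocab,
# so embedding rows cannot be reused by position alone.
print(pretrained_stoi["apple"], target_stoi["apple"])  # 3 1
```

This is exactly why the embedding rows have to be remapped by word, not by index.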
Yes. Adapting the pretrained model’s vocab to the new corpus happens in the convert_weights() function, which is called when you call load_pretrained.
What happens in convert_weights(): old tokens from the pretrained vocab that are also present in the new vocab keep their pretrained embeddings, while unseen new tokens are initialized with the mean of the pretrained embeddings. All other LSTM/QRNN layer weights are loaded directly, so the architectures must match.
def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    # Mean embedding/bias, used to initialize tokens unseen during pretraining.
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        # Row of this word in the pretrained vocab, or -1 if unseen.
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
    return wgts
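The row remapping above can be sketched in plain Python (no torch, toy 2-dim embedding rows, made-up vocab): rows for words the pretrained model knows are copied over, and new words get the mean row.

```python
# Sketch of convert_weights' row remapping, using plain lists instead of tensors.
old_itos = ["the", "apple", "of"]
old_stoi = {w: i for i, w in enumerate(old_itos)}
old_wgts = [[1.0, 1.0], [2.0, 2.0], [6.0, 6.0]]  # toy pretrained embedding rows

# Mean embedding, used to initialize tokens unseen during pretraining.
n = len(old_wgts)
mean_row = [sum(col) / n for col in zip(*old_wgts)]  # [3.0, 3.0]

new_itos = ["apple", "xyzzy"]  # 'xyzzy' was not in the pretrained vocab
new_wgts = []
for w in new_itos:
    r = old_stoi.get(w, -1)
    new_wgts.append(list(old_wgts[r]) if r >= 0 else list(mean_row))

print(new_wgts)  # [[2.0, 2.0], [3.0, 3.0]]
```

So ‘apple’ keeps its pretrained row even though its index changed, and ‘xyzzy’ starts from the mean embedding.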
It looks like we are assuming tie_weights=True when loading a pretrained model (the decoder weight is set to a clone of the encoder weight), but I’m not sure.
Hi @kcturgutlu
Do you know how to access the new vocab list?
when i do this
learn.data.vocab.stoi
I get a huge mapping whose length is bigger than 60K, even though 60K is supposed to be the vocab size?
I’d like to make sure that some important vocabulary items are in the new, modified vocabulary after language model finetuning, and I’d also like to compare the pretrained model’s vocab before and after the language model finetuning.
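On the size question: in fastai v1, Vocab.stoi is built as a collections.defaultdict that returns 0 (the unknown-token id) for missing words, and a defaultdict lookup on a missing key also inserts that key, which is likely why len(learn.data.vocab.stoi) grows past the 60K size of itos. A small sketch of the same effect (the vocab contents here are made up):

```python
from collections import defaultdict

# fastai v1 builds stoi roughly like this: a defaultdict mapping
# unknown words to 0 (the unknown-token id).
itos = ["xxunk", "xxpad", "apple"]  # made-up tiny vocab
stoi = defaultdict(int, {w: i for i, w in enumerate(itos)})

stoi["never_seen_word"]  # missing lookup returns 0 *and* inserts the key
print(len(itos), len(stoi))  # 3 4 -- stoi can outgrow itos
```

So len(learn.data.vocab.itos) is the real vocab size, and checking word in learn.data.vocab.itos is the safe way to test membership, since stoi may contain extra keys added by earlier lookups.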