Something I am struggling with is understanding how we can use a pre-trained language model, say one with wikitext-103 weights. We then create a language model databunch with TextLMDataBunch.from_df(target_corpus), but this databunch builds its own vocabulary. How can it take advantage of the pretrained model weights when the vocab it uses is obviously different from the Wikipedia corpus vocab?
To me this means that when getting the embedding for the word 'apple', which gets numericalized to, say, 123 in our target corpus, we could end up with the embedding of an arbitrary word, since 123 may be the index of a totally different token in the wikitext-103 vocab.
I remember the old word2vec used to save the vocab file so it could be reused (I could be wrong, it's been a while).
Can anyone shed some light on this?
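To make the concern concrete, here is a toy sketch (the vocabs are made up, not the real wikitext-103 mapping):

```python
# Two hypothetical vocabularies: the pretrained model's and our new corpus's.
pretrained_itos = ['xxunk', 'the', 'orange', 'apple']   # pretrained vocab
target_itos     = ['xxunk', 'apple', 'the', 'banana']   # new corpus vocab

pretrained_stoi = {w: i for i, w in enumerate(pretrained_itos)}
target_stoi     = {w: i for i, w in enumerate(target_itos)}

# 'apple' numericalizes to a different id in each vocab:
print(pretrained_stoi['apple'], target_stoi['apple'])   # 3 1

# Row 1 of the pretrained embedding matrix belongs to 'the', not 'apple',
# so naively loading the weights would hand 'apple' the wrong embedding.
print(pretrained_itos[target_stoi['apple']])            # the
```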
Yes. Adaptation of the pretrained model's vocab to the new corpus happens in the convert_weights() function, which runs when you call load_pretrained(). Inside convert_weights(), tokens from the pretrained vocab that also appear in the new vocab keep their pretrained embeddings; new, unseen tokens are initialized with the mean of the pretrained embeddings. All other LSTM/QRNN layer weights are loaded directly, so the architectures must match.
```python
def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    # mean embedding/bias, used to initialize tokens the pretrained model never saw
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new), enc_wgts.size(1)))
    new_b = dec_bias.new_zeros((len(itos_new),))
    for i, w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1   # row in the pretrained vocab, -1 if unseen
        new_w[i] = enc_wgts[r] if r >= 0 else wgts_m
        new_b[i] = dec_bias[r] if r >= 0 else bias_m
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
    return wgts
```
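The row-remapping logic can be sketched with plain Python lists (remap_embeddings and the toy values below are hypothetical; the real function operates on torch tensors inside a state dict):

```python
def remap_embeddings(enc_wgts, stoi_wgts, itos_new):
    """Reorder pretrained embedding rows to match a new vocab; unseen words get the mean row."""
    dim = len(enc_wgts[0])
    # mean embedding, used for words the pretrained model has never seen
    mean = [sum(row[j] for row in enc_wgts) / len(enc_wgts) for j in range(dim)]
    return [list(enc_wgts[stoi_wgts[w]]) if w in stoi_wgts else mean[:] for w in itos_new]

old      = [[1.0, 1.0], [3.0, 3.0]]   # pretrained rows for ['the', 'apple']
stoi_old = {'the': 0, 'apple': 1}
new_itos = ['apple', 'banana']        # new corpus vocab

print(remap_embeddings(old, stoi_old, new_itos))
# → [[3.0, 3.0], [2.0, 2.0]]: 'apple' keeps its pretrained row,
#   'banana' is initialized with the mean embedding
```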
It looks like we are assuming tie_weights=True when loading a pretrained model, since the decoder weight is set to a clone of the new encoder matrix, but I'm not sure.
Ahh, I didn't look into that load_pretrained call when reading the text_classifier_learner code. Thanks for that.
Do you know how to access the new vocab list? When I do this, I get a very long list whose length is bigger than 60K, which is supposed to be the vocab size.
I'd like to make sure that some important vocabulary is in the new, modified vocabulary after language model finetuning, and I'd also like to compare the pretrained model's vocab before and after the finetuning.
In case anyone else is wondering, there are two mappings: stoi, in which many words map to 0 (unk), and itos, which contains our vocab.
itos length = vocab size
stoi length = every word that appeared in the corpus, whether or not it made it into our model's vocab
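A minimal sketch of that behavior, in the style of fastai v1's Vocab, which is built from a Counter with stoi as a defaultdict (the toy tokens and max_vocab are made up; fastai's default cap is 60000):

```python
from collections import Counter, defaultdict

# Toy corpus and a tiny vocab cap, standing in for max_vocab=60000.
tokens = ['the', 'apple', 'the', 'banana', 'the', 'cherry']
max_vocab = 3

freq = Counter(tokens)
itos = ['xxunk'] + [w for w, _ in freq.most_common(max_vocab - 1)]
stoi = defaultdict(int, {w: i for i, w in enumerate(itos)})  # missing words -> 0 (xxunk)

print(len(itos))       # 3: the capped vocab size
print(stoi['cherry'])  # 0: 'cherry' fell out of the capped vocab
print(len(stoi))       # 4: stoi grew on that lookup, so len(stoi) >= len(itos)
```

The defaultdict is why len(stoi) can end up much larger than the vocab size: every corpus word that is looked up gets added to stoi, even when it just maps to 0.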