Pretrained language models and training vocabulary

Something I am struggling with is understanding how we can use a pre-trained language model, say with wikitext-103 weights. We then create a language model databunch with TextLMDataBunch.from_df(target_corpus); however, this databunch creates its own vocabulary. How can it take advantage of the pretrained model weights when the vocab used is obviously different from the wikitext-103 (Wikipedia corpus) vocab?

To me this means that when we get the embedding for the word ‘apple’, which gets numericalized to say 123 in our target corpus, we could end up with the embedding for a completely different word, since 123 may be the index of a totally different word in the wikitext-103 vocab.
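To make the worry concrete, here is a toy illustration with two invented vocabs (the words and indices are made up for the demo, not the real wikitext-103 vocab):

```python
# Two different corpora produce two different itos (index-to-string) lists.
wiki_itos = ['the', 'of', 'banana']      # pretend pretrained vocab
target_itos = ['the', 'apple', 'of']     # pretend target-corpus vocab

i = target_itos.index('apple')           # 'apple' numericalizes to 1 here
mismatch = wiki_itos[i]                  # index 1 in the old vocab is 'of'
```

So naively indexing the pretrained embedding matrix with the new corpus's indices would fetch the wrong rows, which is exactly what the remapping step discussed below has to prevent.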

I remember old word2vec used to save the vocab file so it could be reused (I could be wrong, it’s been a while).

Can anyone shed some light on this?



Yes. Adaptation of the pretrained model’s vocab to the new corpus happens in the convert_weights() function, which is called when you call load_pretrained.

What happens in convert_weights() is this: tokens from the pretrained vocab that are also present in the new vocab keep their pretrained embedding rows, while unseen new tokens are initialized with the mean of the pretrained embeddings. All other LSTM/QRNN layer weights are loaded directly, so the architectures must match.

def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    # Mean bias/embedding, used to initialize tokens the pretrained model never saw.
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    # new_zeros already returns zero-filled tensors, so no extra .zero_() is needed.
    new_w = enc_wgts.new_zeros((len(itos_new), enc_wgts.size(1)))
    new_b = dec_bias.new_zeros((len(itos_new),))
    for i, w in enumerate(itos_new):
        # Reuse the pretrained row if the token exists in the old vocab, else the mean.
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r >= 0 else wgts_m
        new_b[i] = dec_bias[r] if r >= 0 else bias_m
    # The same remapped matrix is used for the encoder, its dropout wrapper,
    # and the decoder (tied weights).
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
    return wgts

It looks like we are assuming tie_weights=True when loading a pretrained model, since the decoder weight is set from the same matrix as the encoder embedding, but I’m not sure.


Ahh, I didn’t look into that load_pretrained call when reading the text_classifier_learner code. Thanks for that.

Hi @kcturgutlu
Do you know how to access the new vocab list?

When I do this,

I get a very large list whose length is bigger than 60k, even though 60k should be the vocab size?

I’d like to make sure that some important vocabulary items are in the new, modified vocabulary after language model fine-tuning, and I’d also like to compare the pretrained model’s vocab before and after the fine-tuning.



In case anyone else is wondering, there are two dicts: stoi, which maps many words to 0 (unk), and itos, which contains our vocab.

itos length = vocab size
stoi length = all words that appeared in the corpus, whether or not they are included in our model’s vocab.
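A small sketch of why len(stoi) can grow past len(itos): itos keeps only the vocab itself, while stoi behaves like a defaultdict that sends any out-of-vocab token to index 0 (unk) and records it. The token names here are invented for the demo:

```python
from collections import defaultdict

# itos: the fixed vocab list; index 0 plays the role of the unknown token.
itos = ['xxunk', 'the', 'apple']
# stoi: string-to-index mapping; unknown lookups default to 0 and get inserted.
stoi = defaultdict(int, {w: i for i, w in enumerate(itos)})

unk_index = stoi['durian']  # 'durian' is not in the vocab -> 0, and stoi grows
```

So every corpus word ever looked up ends up as a key in stoi, which is why its length exceeds the 60k vocab size, while itos stays at exactly the vocab size.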