Lesson 10: combining corpora in transfer learning and model conversion

When converting the language model to a classifier, it seems that only the embedding weights for tokens in the IMDB corpus are loaded:

new_w = np.zeros((vs, em_sz), dtype=np.float32)  # vs = IMDB vocab size, em_sz = embedding size
for i,w in enumerate(itos):                      # itos = IMDB LM vocabulary
    r = stoi2[w]                                 # look the token up in the wiki103 vocab
    new_w[i] = enc_wgts[r] if r>=0 else row_m    # copy the pretrained row, or the mean row if unseen

wgts['0.encoder.weight'] = T(new_w)                              # encoder embedding
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))  # dropout-wrapped copy of the embedding
wgts['1.decoder.weight'] = T(np.copy(new_w))                     # decoder shares the embedding (weight tying)
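
For context, this mapping relies on stoi2 returning -1 for out-of-vocabulary tokens, so the else branch falls back to the mean embedding row. A minimal sketch of the assumed setup (variable and path names follow my reading of the notebook and may differ):

enc_wgts = to_np(wgts['0.encoder.weight'])   # pretrained wiki103 embedding matrix
row_m = enc_wgts.mean(0)                     # mean embedding row, used for tokens wiki103 never saw
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))   # wiki103 vocabulary (path assumed)
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})  # -1 for OOV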

Presumably the wiki103 model has the larger vocabulary, and restricting the new embedding matrix to the IMDB vocabulary is for the sake of efficiency. But in the subsequent conversion to a classifier, the itos from the IMDB LM is used again:

import pickle, collections

itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))   # IMDB LM vocabulary
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})  # unseen tokens map to 0
len(itos)

If the classifier's corpus has tokens that never occur in the LM data, those tokens are lost: they all collapse to 'unknown'.
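
For example, with the defaultdict above, any out-of-vocabulary token silently maps to index 0, i.e. _unk_ (the token below is made up):

stoi['floccinaucinihilipilification']   # -> 0 unless it happens to be in the IMDB LM vocab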

Wouldn’t it be better to either load all of the wiki103 vocabulary weights, or to combine the LM and classifier corpora when building the vocabulary, whichever is feasible?
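
For concreteness, here is one way the 'combine the corpora' option could look: build a single vocabulary over both token streams before training the LM. This is only a sketch; build_combined_vocab is a hypothetical helper, and the max_vocab/min_freq filtering mirrors the frequency-based vocabulary construction used in the lesson:

from collections import Counter

def build_combined_vocab(lm_toks, clas_toks, max_vocab=60000, min_freq=2):
    # count tokens across both corpora so classifier-only words survive into the LM vocab
    freq = Counter(t for doc in lm_toks + clas_toks for t in doc)
    itos = [w for w, c in freq.most_common(max_vocab) if c > min_freq]
    itos.insert(0, '_pad_')
    itos.insert(0, '_unk_')
    return itos

The resulting itos would then serve as both the LM's and the classifier's vocabulary, so no token gets remapped to _unk_ between the two stages.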