When converting the language model to a classifier model, it seems that only the embedding weights for tokens in the imdb corpus are loaded:
new_w = np.zeros((vs, em_sz), dtype=np.float32)    # vs = imdb vocab size, em_sz = embedding size
for i,w in enumerate(itos):
    r = stoi2[w]                                   # index of w in the wiki103 vocab, -1 if absent
    new_w[i] = enc_wgts[r] if r>=0 else row_m      # copy the pretrained row, or the mean row for unseen tokens
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))       # decoder shares the same (tied) embedding matrix
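For reference, if I'm reading the notebook correctly, stoi2 maps a token to its index in the wiki103 vocabulary (defaulting to -1 when absent) and row_m is the mean embedding row, built earlier with something like this (paths and names from memory, so they may differ slightly):

enc_wgts = to_np(wgts['0.encoder.weight'])    # pretrained wiki103 embedding matrix
row_m = enc_wgts.mean(0)                      # mean row, used for tokens not in wiki103
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda: -1, {v: k for k, v in enumerate(itos2)})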
Presumably the wiki103 model has a much larger vocabulary, and restricting to the imdb vocabulary is done for efficiency. But in the subsequent conversion to a classifier, the itos from the imdb lm is reused:
itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))                   # vocabulary saved when the lm data was built
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})  # anything missing falls back to index 0
len(itos)
If the classifier corpus contains tokens that do not appear in the lm data, they all collapse to 'unknown' and their information is lost.
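Concretely, since stoi is a defaultdict falling back to 0, any such token (the name below is just a made-up example) ends up at the same index as the unknown token:

stoi['some_token_only_in_the_classifier_corpus']   # -> 0, i.e. treated as unknown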
Wouldn’t it be better to either use all the wiki103 vocabulary weights, or combine the lm and classifier corpuses for training, whichever is possible?