I was wondering if it is possible to add new words to the ULMFiT vocabulary in the fine-tuning step.
There is a section in finetune_lm.py dedicated to changing the word-to-index mapping, but it assigns the mean embedding when a token doesn't exist in the pretraining vocabulary:
import collections, pickle
import numpy as np

itos2 = pickle.load(open(pretrain_path / 'tmp' / 'itos.pkl', 'rb'))
stoi2 = collections.defaultdict(lambda: -1, {v: k for k, v in enumerate(itos2)})
row_m = ew.mean(0)                       # mean of the pretrained embedding rows
nw = np.zeros((vs, em_sz), dtype=np.float32)
nb = np.zeros((vs,), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]
    if r >= 0:
        nw[i] = ew[r]                    # reuse the pretrained vector
    else:
        nw[i] = row_m                    # fall back to the mean embedding
The idea is to extend this new weight matrix (nw) so that new words receive pretrained word vectors instead of the mean. Since the embeddings are fine-tuned in this step anyway, I think the new words' embeddings would be fine-tuned as well. But will there be any shape incompatibility or other problems?
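A minimal sketch of this idea, in pure Python so the shapes are easy to see. The toy vocabs and the external_vecs lookup (standing in for something like fastText) are made up for illustration, not part of the original script:

```python
# Sketch: extend nw so new words get vectors from an external source
# (e.g. fastText) instead of the mean. All names below are hypothetical.
itos_pre = ['the', 'cat', 'sat']                    # pretrained vocab
stoi_pre = {w: i for i, w in enumerate(itos_pre)}
ew = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]           # pretrained embedding rows

external_vecs = {'emoji': [9.0, 9.0]}               # hypothetical external lookup

itos = ['the', 'emoji', 'sat', 'dragon']            # target-corpus vocab
row_m = [sum(col) / len(ew) for col in zip(*ew)]    # mean row

nw = []
for w in itos:
    if w in stoi_pre:
        nw.append(ew[stoi_pre[w]])                  # reuse the pretrained row
    elif w in external_vecs:
        nw.append(external_vecs[w])                  # graft an external vector
    else:
        nw.append(row_m)                             # fall back to the mean
```

There should be no shape incompatibility: nw always has one row per entry in the new itos, whatever the source of each row.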
@amir97
Good question. I was wondering about this myself.
A good use case would be if the target corpus is very different from the language model corpus. Let's say wiki103 is used for pretraining the language model but then you want to classify Twitter posts. Tweets are much more colloquial, so the model probably should extend its vocab to contain things like emojis.
Take a closer look at convert_weights in fastai.text.learner:
def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
    return wgts
The replacement happens in the for loop. If a pretrained weight and bias exist for a new vocab item, we reuse them; if not, we initialize them with the mean.
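The loop's reuse-vs-mean logic can be mimicked in a few lines of plain Python (a sketch only; the real convert_weights operates on torch tensors, and the toy vocab below is made up):

```python
# Pure-Python mimic of the loop in convert_weights.
enc_wgts = [[1.0, 2.0], [3.0, 4.0]]           # pretrained embedding, 2 words
stoi_wgts = {'the': 0, 'cat': 1}
itos_new = ['cat', 'dragon']                  # 'dragon' was never pretrained

n = len(enc_wgts)
wgts_m = [sum(row[j] for row in enc_wgts) / n for j in range(2)]  # column mean

new_w = []
for w in itos_new:
    r = stoi_wgts.get(w, -1)
    new_w.append(enc_wgts[r] if r >= 0 else wgts_m)

print(new_w)  # [[3.0, 4.0], [2.0, 3.0]] -- 'cat' reused, 'dragon' gets the mean
```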
When we fall back to the mean vector it can still be trained in the 2nd step (lm finetuning). Right?
For example, let’s say the word “shinigami” was never seen during lm pre-training. But my corpus has this word and its occurrence is in the top 60000 words (my max vocab size). Then it will be initialized with the mean embedding but it will get trained in the language model fine tuning step.
Is this the correct behavior?
You may want to look at the new fast.ai NLP course, in particular the lessons covering the 5th notebook, since they cover the differences between the vocab from the pretrained wikitext-103 model and the vocab from the dataset (in this case IMDB) used for transfer learning.
Essentially, words that appear in the IMDB vocab but not the wikitext-103 vocab are initialized to the same random vector, and their weights are then updated through training.
Once training is complete, you can see that different weights have been learned for the words that were not in the pretrained wikitext-103 vocab.
Thanks Jon, I'll watch the videos you mentioned.
I just needed to know: if I already have a vocab of 60K and there are around 40K new vocab items in the IMDB dataset, those will be initialized to a random vector and then fine-tuned. My question is about the remaining 20K vocab items: how should we select them from the 60K?
Also, I was wrong about vocab access: Learn.data.vocab.stoi is a dictionary in which most words map to 0 (the unknown token), and the vocab actually included in the model can be obtained from Learn.data.vocab.itos.
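This stoi behavior can be mimicked with a defaultdict (a sketch of my understanding of fastai's Vocab internals, with a made-up toy vocab):

```python
import collections

# Minimal mimic of fastai's Vocab mappings: stoi behaves like a defaultdict
# that sends any unseen word to index 0 (the xxunk token).
itos = ['xxunk', 'xxpad', 'movie', 'great']
stoi = collections.defaultdict(int, {w: i for i, w in enumerate(itos)})

print(stoi['movie'])      # 2 -> a word in the model's vocab
print(stoi['shinigami'])  # 0 -> unknown words all map to xxunk
print(len(itos))          # 4 -> itos lists the actual model vocab
```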
Omar, my understanding is this: you have a vocab for the base model (e.g. wikitext-103) and a vocab for your specific text (e.g. IMDB). In transfer learning you are fine-tuning the model to your specific text, so I assume the vocab in the resulting model will be the IMDB one. However, prior to training, there are no weights associated with the words that were not in the base vocab, so these are all set to the same initial random values. These weights are then updated through training.
In this sense, I don’t think it is a case of combining or adding to the original vocab rather we are using the vocab for the text you want to model.
Looking at the following piece of code, you are right: the loop only iterates over the new itos
for the new vocabulary, and it only borrows the pretrained embedding for common words, if they exist in the original embedding matrix. I was suspicious because, in my case, the wiki vocab and my fine-tuned one were exactly the same size.
def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts.get('1.decoder.bias', None), wgts['0.encoder.weight']
    wgts_m = enc_wgts.mean(0)
    if dec_bias is not None: bias_m = dec_bias.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    if dec_bias is not None: new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        if dec_bias is not None: new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    if '0.encoder_dp.emb.weight' in wgts: wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    if dec_bias is not None: wgts['1.decoder.bias'] = new_b
    return wgts
After you sent me Rachel's notebook: she did the fine-tuning with the IMDB dataset and the resulting vocab and embedding size was around 50K tokens (not 60K), so it only creates the new embedding from the new vocabulary and only borrows the common words that were in the original vocabulary.
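The borrow-vs-mean split can be checked directly by comparing the two vocabs (a sketch; both toy vocabs below are made up):

```python
# Sketch: measure overlap between a pretrained vocab and a target-corpus
# vocab, as convert_weights would see it.
itos_pre = ['the', 'a', 'movie', 'good']           # e.g. the wikitext-103 side
itos_new = ['the', 'movie', 'shinigami', 'emoji']  # e.g. the IMDB side

pre = set(itos_pre)
reused    = [w for w in itos_new if w in pre]      # keep pretrained rows
mean_init = [w for w in itos_new if w not in pre]  # start from the mean

print(len(itos_new), len(reused), len(mean_init))  # 4 2 2
```

The final embedding is always sized by the new vocab; the old vocab only determines which rows are copied over.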
This notebook (thanks to @pabloc) also helps show how fine-tuning the language model handles the out-of-vocabulary problem (for words not appearing in wikitext).