Adding new words to ULMFiT in the fine-tuning step

I was wondering if it is possible to add new words to the ULMFiT vocabulary in the fine-tuning step.

There is a section in finetune_lm.py dedicated to changing the word-to-index mapping, but it assigns the mean embedding whenever a token doesn't exist in the pre-trained vocabulary.

import collections, pickle
import numpy as np

# itos/stoi of the pre-trained model; tokens missing from it map to -1
itos2 = pickle.load(open(pretrain_path / 'tmp' / 'itos.pkl', 'rb'))
stoi2 = collections.defaultdict(lambda: -1, {v:k for k,v in enumerate(itos2)})

nw = np.zeros((vs, em_sz), dtype=np.float32)   # new embedding matrix for the fine-tuning vocab
nb = np.zeros((vs,), dtype=np.float32)         # new decoder bias for the fine-tuning vocab
row_m = ew.mean(0)                             # mean of the pre-trained embedding rows (ew)
for i, w in enumerate(itos):
    r = stoi2[w]
    if r >= 0:
        nw[i] = ew[r]       # token exists in the pre-trained vocab: reuse its embedding
    else:
        nw[i] = row_m       # unseen token: fall back to the mean embedding

The idea is to extend this new embedding weight matrix (nw) so that new words get their own pre-trained word vectors. Since this step is fine-tuning anyway, I think it would also fine-tune the new words' embeddings. But I'm not sure whether there would be any shape incompatibility or other problems?
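Something like the following sketch is what I have in mind; external_vecs would be a hypothetical {word: vector} dict of pre-trained vectors (e.g. built from fastText) with dimensionality em_sz:

# Sketch of the idea: external_vecs is a hypothetical {word: vector} dict of
# pre-trained vectors (e.g. from fastText) with dimensionality em_sz.
for i, w in enumerate(itos):
    r = stoi2[w]
    if r >= 0:
        nw[i] = ew[r]                 # word already in the pre-trained ULMFiT vocab
    elif w in external_vecs:
        nw[i] = external_vecs[w]      # new word: seed it with an external pre-trained vector
    else:
        nw[i] = row_m                 # still unknown everywhere: keep the mean fallback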

Hey Amir, did you find out how to do this eventually?

@amir97
Good question. I was wondering about this myself.
A good use case would be if the target corpus is very different from the language model corpus. Let’s say wiki103 is used for pertaining the language model but then you want to classify twitter posts. Tweets are much more colloquial. So the model probably should extend its vocab to contain stuff like emojis.

Take a closer look at convert_weights in fastai.text.learner:

def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    wgts['1.decoder.bias'] = new_b
    return wgts

The replacement happens in the for loop. If a pretrained weight and bias exist for a new vocab item, we reuse them. If not, we initialise them with the mean.

This is how it works in fastai v1.0
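To make that concrete, here is a small self-contained sketch with toy tensors and made-up vocabularies (not the actual fastai call) that mirrors the loop above: rows for words already in the pre-trained vocab are copied, rows for new words get the mean:

import torch

# Toy pre-trained weights: vocab ['the', 'cat', 'sat'], embedding size 4.
itos_old = ['the', 'cat', 'sat']
stoi_old = {w: i for i, w in enumerate(itos_old)}
enc_wgts = torch.randn(len(itos_old), 4)
wgts_m = enc_wgts.mean(0)

# New vocab shares 'the' and 'cat' but adds 'shinigami'.
itos_new = ['the', 'cat', 'shinigami']
new_w = enc_wgts.new_zeros((len(itos_new), enc_wgts.size(1)))
for i, w in enumerate(itos_new):
    r = stoi_old.get(w, -1)
    new_w[i] = enc_wgts[r] if r >= 0 else wgts_m

print(torch.equal(new_w[0], enc_wgts[0]))   # True  -> 'the' row is reused
print(torch.equal(new_w[2], wgts_m))        # True  -> 'shinigami' row is the mean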

Does this mean that the default behavior is to expand the embedding matrix for new words (words which were not seen in pre-training)?

When we fall back to the mean vector, it can still be trained in the 2nd step (LM fine-tuning), right?
For example, let's say the word "shinigami" was never seen during LM pre-training, but my corpus has this word and its occurrence is in the top 60,000 words (my max vocab size). Then it will be initialized with the mean embedding but will get trained in the language model fine-tuning step.
Is this the correct behavior?

Yes. Exactly
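One rough way to convince yourself is a plain PyTorch sketch (not the fastai training loop; the toy sizes and the index are just stand-ins for an OOV word like "shinigami"): a row initialised with the mean still receives gradients as soon as the word shows up in a batch, so one optimiser step moves it.

import torch, torch.nn as nn

emb = nn.Embedding(5, 8)                       # toy embedding; index 4 stands in for a new word
with torch.no_grad():
    emb.weight[4] = emb.weight[:4].mean(0)     # initialise the new row with the mean of the others

before = emb.weight[4].detach().clone()
loss = emb(torch.tensor([4])).pow(2).sum()     # dummy loss standing in for the LM loss
loss.backward()
with torch.no_grad():
    emb.weight -= 0.1 * emb.weight.grad        # one SGD step
print(torch.equal(before, emb.weight[4]))      # False -> the mean-initialised row has been updated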

Thanks :slight_smile: @Andreas_Daiminger

@Andreas_Daiminger @yashuseth

Hi folks,

I'm curious: which words from the Wikipedia vocab get replaced by the new vocab from the new dataset?

We have a max of 60K. Does the old embedding stay the same, with only new words getting added?

I wasn't able to work that out from the code.

Also, if you don't mind, how can I access the new vocab list?

Learn.data.vocab.stoi is outputting a huge dictionary with more than 300K entries.

You may want to look at the new fast.ai NLP course, in particular the lessons covering the 5th notebook, since they cover the differences between the vocab from the pretrained wikitext-103 model and that of the dataset (in this case IMDB) used for transfer learning.

Essentially, words that appear in the IMDB but not the wikitext-103 vocab are initialized to the same random vector, and their weights are then updated through training.

Once training is complete, it can be seen that different weights are learned for the words that were not in the pre-trained wikitext-103 vocab.
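If you want to check this on your own fine-tuned model, a rough sketch might look like this (it assumes a fastai v1 learner named learn with an AWD_LSTM encoder, and picks two hypothetical words, 'shinigami' and 'hogwarts', that are in the IMDB vocab but were missing from wikitext-103):

import torch.nn.functional as F

# Sketch, assuming `learn` is the fine-tuned fastai v1 LM learner and that both
# (hypothetical) words are in the IMDB vocab but were absent from wikitext-103.
v = learn.data.vocab
emb = learn.model[0].encoder.weight            # input embedding of the AWD_LSTM encoder
a = emb[v.stoi['shinigami']]
b = emb[v.stoi['hogwarts']]
print(F.cosine_similarity(a, b, dim=0))        # both rows started identical; after fine-tuning they differ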

Thanks Jon, I'll watch the videos you mentioned.

I just needed to know: if I already have a vocab of 60K and there are around 40K new words in the IMDB dataset, those new words will be initialized to a random vector and then fine-tuned. My question is about the remaining 20K words: how should we select them from the 60K items?

And regarding the vocab access, I was wrong: Learn.data.vocab.stoi is a dictionary where most words map to 0 (the unknown token), and the vocab actually included in the model can be obtained from Learn.data.vocab.itos.
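For completeness, a quick sketch of the two views of the vocab (assuming a fastai v1 learner named learn); stoi is a defaultdict that maps unknown tokens to 0, while itos is the vocab the model actually uses:

# Sketch, assuming a fastai v1 learner named `learn`.
itos = learn.data.vocab.itos            # list: index -> token, the vocab the model actually uses
stoi = learn.data.vocab.stoi            # defaultdict: token -> index, unknown tokens map to 0

print(len(itos))                        # real vocab size (capped at max_vocab, e.g. 60k)
print(stoi['a-word-never-seen'])        # 0, i.e. the unknown token
print(len(stoi))                        # grows with every lookup of an unseen token, so it can be far larger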

Omar, my understanding is this. You have a vocab for the base model (e.g. wikitext-103) and a vocab for your specific text (e.g. IMDB). In transfer learning you are trying to fine-tune the model to your specific text (e.g. IMDB), so I assume the vocab in the resulting model will be that one. However, prior to training, there are no weights associated with the words that were not in the base vocab, so these are all set to the same initial random values. These weights are updated through training.

In this sense, I don't think it is a case of combining or adding to the original vocab; rather, we are using the vocab of the text you want to model.

Good, thank you. Now I understand.

I thought that, because we're doing transfer learning, we usually wouldn't have a big vocabulary (60K unique tokens).

That's why I was asking.

Thanks.

I think it is still worth experimenting yourself as my understanding may not be correct.

Looking at the following piece of code, you are right: the loop only iterates over the new itos for the new vocabulary, and it only borrows the pretrained embeddings for the common words, i.e. when the word exists in the original embedding matrix. I was suspicious because, for me, the wiki vocab and my fine-tuned one were exactly the same size.

def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts.get('1.decoder.bias', None), wgts['0.encoder.weight']
    wgts_m = enc_wgts.mean(0)
    if dec_bias is not None: bias_m = dec_bias.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()
    if dec_bias is not None: new_b = dec_bias.new_zeros((len(itos_new),)).zero_()
    for i,w in enumerate(itos_new):
        r = stoi_wgts[w] if w in stoi_wgts else -1
        new_w[i] = enc_wgts[r] if r>=0 else wgts_m
        if dec_bias is not None: new_b[i] = dec_bias[r] if r>=0 else bias_m
    wgts['0.encoder.weight'] = new_w
    if '0.encoder_dp.emb.weight' in wgts: wgts['0.encoder_dp.emb.weight'] = new_w.clone()
    wgts['1.decoder.weight'] = new_w.clone()
    if dec_bias is not None: wgts['1.decoder.bias'] = new_b
    return wgts

After you sent me Rachel's notebook (where she does the fine-tuning with the IMDB dataset), I saw that the resulting vocab and embedding size was around 50k tokens (not 60k), so it only creates the new embedding from the new vocabulary and only borrows the common words that were in the original vocabulary.
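As a quick sanity check along those lines (a sketch, assuming a fastai v1 language_model_learner named learn with an AWD_LSTM encoder), the embedding matrix should match the fine-tuning vocab size:

# Sketch, assuming a fastai v1 language_model_learner named `learn` after fine-tuning.
vocab_size = len(learn.data.vocab.itos)
emb_rows = learn.model[0].encoder.weight.shape[0]    # input embedding of the AWD_LSTM encoder
print(vocab_size, emb_rows)                          # both match the fine-tuning vocab (~50k here), not the wikitext-103 60k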

Great, thanks for confirming that.

This notebook (thanks to @pabloc) also helps show how fine-tuning the language model handles the out-of-vocabulary problem (for words not appearing in wikitext).
