ULMFiT: Size of vocabulary and lost (?) embeddings

I have a question regarding vocabulary size used to fine-tune a language model and train a classifier on top.

I have trained a custom language model on a large corpus (about 200 million tokens), using max_vocab=60000. I check the number of words in the vocab like this:

len(data.vocab.itos)  # 60004
len(data.vocab.stoi)  # 60004
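
For context, the databunch for the pretraining corpus was built along these lines (a sketch with placeholder dataframes and path, not my exact code):

lm_data = TextLMDataBunch.from_df(path="data_lm",
                                  train_df=lm_df_train,   # placeholder dataframes
                                  valid_df=lm_df_valid,
                                  text_cols="text",
                                  max_vocab=60000)
# len(lm_data.vocab.itos) comes out slightly above max_vocab (60004 here),
# presumably because of the special tokens (xxunk, xxpad, ...).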

So far, so good.

I saved the model and then loaded it with a databunch built from another (classification) dataset:

lm_ft_data = (TextList.from_df(cls_df_train, path="data_clf", cols="text")
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch())

learn = language_model_learner(lm_ft_data,  # this is the dataset to finetune LM to
                               path="./artifacts", 
                               pretrained_fnames=["lm_5_ep_lr2-3_5_stlr", "itos"])

Then I look at vocab size:

len(lm_ft_data.vocab.stoi)  # 71754
len(lm_ft_data.vocab.itos)  # 15400

This looks strange, so let’s look at the model, using learn.model:

SequentialRNN(
  (0): RNNCore(
    (encoder): Embedding(15400, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(15400, 400, padding_idx=1)
    )
...
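
So the embedding matrix is sized to the new itos (15400 rows), not the old 60k vocab. A quick sanity check:

emb_weight = learn.model[0].encoder.weight
print(emb_weight.shape)              # torch.Size([15400, 400])
print(len(lm_ft_data.vocab.itos))    # 15400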

If you look at stoi like this:

from collections import defaultdict

mapping = defaultdict(list)

for k, v in lm_ft_data.vocab.stoi.items():
    mapping[v].append(k)

len(mapping[0])  # 56355

you can see that 56355 strings are mapped to token 0, which is xxunk.
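
I suspect the stoi/itos mismatch itself is just defaultdict mechanics: if Vocab.stoi is a collections.defaultdict(int) (as in fastai v1), then every lookup of an out-of-vocabulary string during numericalization returns 0 (xxunk) and silently inserts that string as a new key, so stoi grows while itos stays fixed. A minimal sketch of that assumed mechanism (not fastai code):

from collections import defaultdict

itos = ["xxunk", "xxpad", "the", "cat"]          # toy vocab
stoi = defaultdict(int, {s: i for i, s in enumerate(itos)})

stoi["dog"]         # -> 0 (xxunk), and "dog" is now a key in stoi
print(len(itos))    # 4
print(len(stoi))    # 5  -- stoi grew, itos did not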

This looks very strange - it seems like all the embeddings learned from the big dataset are lost. Or are they? If so, the model may generalize worse when applied in the wild. Why does this inconsistency occur? Am I wrong in my understanding of how vocabulary expansion during fine-tuning happens?

P.S. I must note that the classifier trained using the encoder from this model performs very well - in fact, well beyond my expectations - so I cannot be more grateful to Jeremy, Sebastian and Sylvain for their hard work on this library! However, if this is a bug, it should probably be addressed.


I have dug into the code and I see that this is by design:

def convert_weights(wgts:Weights, stoi_wgts:Dict[str,int], itos_new:Collection[str]) -> Weights:
    "Convert the model `wgts` to go with a new vocabulary."
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new),enc_wgts.size(1))).zero_()

I am sure there were serious considerations behind making it work this way; however, why is it done like this instead of, say, adding previously unseen words to the matrix using the mean of the previously seen embeddings while preserving the old ones?

I’m not sure I follow. This is what that function does.
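
Concretely, the rest of the function copies the old embedding row whenever a word from the new itos already exists in the old vocab, and falls back to the mean row for genuinely new words. A rough sketch of that logic (convert_weights_sketch is my own name here, not a verbatim copy of the library source):

def convert_weights_sketch(wgts, stoi_old, itos_new):
    # Same setup as the quoted snippet: old decoder bias, old encoder
    # (embedding) weights, and their means used for unseen words.
    # wgts is the pretrained state dict (torch tensors).
    dec_bias, enc_wgts = wgts['1.decoder.bias'], wgts['0.encoder.weight']
    bias_m, wgts_m = dec_bias.mean(0), enc_wgts.mean(0)
    new_w = enc_wgts.new_zeros((len(itos_new), enc_wgts.size(1)))
    new_b = dec_bias.new_zeros((len(itos_new),))
    for i, w in enumerate(itos_new):
        idx = stoi_old.get(w, -1)
        new_w[i] = enc_wgts[idx] if idx >= 0 else wgts_m  # keep old embedding if word was seen
        new_b[i] = dec_bias[idx] if idx >= 0 else bias_m  # mean init for new words
    wgts['0.encoder.weight'] = new_w
    wgts['1.decoder.bias'] = new_b
    return wgts

The only rows that are dropped are those for words that never occur in the fine-tuning data, and those could not receive any gradient during fine-tuning anyway.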

I see. I am sorry for the misunderstanding.
What I meant was: if we have a set of embeddings from the initial model training, S_1, and a set of words from the new data, S_2, I thought it might be a good idea to keep embeddings for all words from S_1 OR S_2 (the union). But if you think about it, that is not a good idea, because you cannot backprop into words from S_1 - S_2 (they never occur in the new data), so my question was kinda dumb from the beginning :slight_smile: