I have a question regarding vocabulary size used to fine-tune a language model and train a classifier on top.
I have trained a custom language model on a large corpus (about 200 million tokens), using max_vocab=60000. I check the number of words in the vocab like this:
len(data.vocab.itos) # 60004
len(data.vocab.stoi) # 60004
So far, so good.
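I then saved the model and the vocabulary, roughly like this (file names and paths are illustrative):
import pickle

# `learn` here is the learner for the original LM trained on the big corpus;
# this produces the .pth file referenced later via pretrained_fnames
learn.save("lm_5_ep_lr2-3_5_stlr")

# save the vocabulary so it can be passed alongside the weights
with open("artifacts/models/itos.pkl", "wb") as f:
    pickle.dump(data.vocab.itos, f)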
Then I loaded these pretrained files with another (classification) dataset:
lm_ft_data = (TextList.from_df(cls_df_train, path="data_clf", cols="text")
              .random_split_by_pct(0.1)
              .label_for_lm()
              .databunch())
learn = language_model_learner(lm_ft_data,  # the dataset to fine-tune the LM on
                               path="./artifacts",
                               pretrained_fnames=["lm_5_ep_lr2-3_5_stlr", "itos"])
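As a sanity check, one can verify that the itos file passed via pretrained_fnames still holds the full vocabulary (the path below is an assumption, adjust to where the file actually lives):
import pickle

# the itos pickle should still contain the full 60004-entry vocabulary of the original LM
with open("artifacts/models/itos.pkl", "rb") as f:
    pretrained_itos = pickle.load(f)
print(len(pretrained_itos))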
Then I look at vocab size:
len(lm_ft_data.vocab.stoi) # 71754
len(lm_ft_data.vocab.itos) # 15400
This looks strange, so let’s look at the model using learn.model:
SequentialRNN(
(0): RNNCore(
(encoder): Embedding(15400, 400, padding_idx=1)
(encoder_dp): EmbeddingDropout(
(emb): Embedding(15400, 400, padding_idx=1)
)
...
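One can also load the saved .pth directly and check the shape of the pretrained encoder; the path and state-dict key below are assumptions about how the checkpoint is laid out:
import torch

wgts = torch.load("artifacts/models/lm_5_ep_lr2-3_5_stlr.pth", map_location="cpu")
if "model" in wgts:  # the checkpoint may wrap the state dict together with the optimizer state
    wgts = wgts["model"]
# if the big vocab was used for pretraining, this should be 60004 x 400
print(wgts["0.encoder.weight"].shape)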
If you look at stoi like this:
from collections import defaultdict

mapping = defaultdict(list)
for k, v in lm_ft_data.vocab.stoi.items():
    mapping[v].append(k)
you can see that 56355 strings are mapped to token 0, which is xxunk.
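The count above comes straight from that mapping:
# number of distinct strings the fine-tuning vocab sends to xxunk
print(len(mapping[0]))  # 56355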
This looks very strange - it seems like all the embeddings learned from the big dataset are lost. Or are they? If so, the model may generalize worse when applied in the wild. Why could this inconsistency occur? Am I wrong in my understanding of how vocabulary expansion during fine-tuning happens?
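For what it's worth, my mental model of what should happen when the vocabulary changes is roughly the sketch below - this is only my assumption, not what I know the library to actually do:
import torch

def remap_embedding(old_emb, old_itos, new_itos):
    "Copy pretrained rows for tokens present in both vocabs; use the mean row for new tokens."
    old_stoi = {w: i for i, w in enumerate(old_itos)}
    mean_row = old_emb.mean(0)
    new_emb = old_emb.new_zeros(len(new_itos), old_emb.size(1))
    for i, w in enumerate(new_itos):
        j = old_stoi.get(w, -1)
        new_emb[i] = old_emb[j] if j >= 0 else mean_row
    return new_emb

Under that assumption, nothing learned on the big corpus would be lost for tokens that survive into the new vocab, which is why the 15400-row embedding surprised me.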
P.S. I must note that the classifier trained using the encoder from this model performs very well - in fact, well beyond my expectations - so I cannot be more grateful to Jeremy, Sebastian and Sylvain for their hard work on this library! However, if this is a bug, it should probably be addressed.