Incremental training of LM with daily new data

It all depends on the problem you are solving. If you need to expand the vocab, I suspect you can “reserve” some extra space in the embedding matrix, then add new rows initialized to the mean of the old vectors and fine-tune on the new vocab (see the sketch below). But, in the end, the whole idea of an embedding matrix puts a hard limit on the vocabulary index.
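Here is a minimal sketch of that idea in plain PyTorch (not fast.ai's exact mechanism, just an illustration): grow the embedding matrix and start every new token's row at the mean of the existing vectors, so training only has to nudge them from a reasonable starting point.

```python
import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, n_new: int) -> nn.Embedding:
    """Return a larger embedding whose extra rows start at the mean of the old ones."""
    old_w = old_emb.weight.data
    mean_vec = old_w.mean(dim=0, keepdim=True)   # average of all existing token vectors
    new_rows = mean_vec.repeat(n_new, 1)         # one mean-initialized row per new token
    new_emb = nn.Embedding(old_w.size(0) + n_new, old_w.size(1))
    new_emb.weight.data = torch.cat([old_w, new_rows], dim=0)
    return new_emb

# hypothetical usage: add 500 reserved slots to a pretrained LM's embedding layer
# model.embedding = expand_embedding(model.embedding, n_new=500)
```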

If, for your problem, you want less OOV (out-of-vocab), there are a few approaches that can minimize it. I have experimented with SentencePiece and that helps a lot (but it requires some custom coding on your side to fit into fast.ai - nothing too hard, just some bits and bobs; a rough sketch follows).
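As a rough sketch of how that works (file names and vocab size here are placeholders, and wiring the pieces into fast.ai's tokenizer is left out): you train a subword model on your corpus and then encode text into pieces, so unseen words get split into known subwords instead of becoming OOV.

```python
import sentencepiece as spm

# Train a subword model on a raw-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",         # placeholder path to your training text
    model_prefix="lm_subword",  # writes lm_subword.model / lm_subword.vocab
    vocab_size=8000,            # placeholder size; tune for your data
)

# Load the trained model and tokenize; an unseen word is split into known pieces.
sp = spm.SentencePieceProcessor()
sp.load("lm_subword.model")
print(sp.encode_as_pieces("an unseen word like blorptastic"))
# e.g. ['▁an', '▁unseen', '▁word', '▁like', '▁bl', 'orp', 'tastic']
```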

The whole thing is based on either “unigram” or “byte-pair encoding” models. The links are the papers on each topic.
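If you want to choose between the two explicitly, SentencePiece exposes that as a training option (shown only to illustrate the flag; the default is usually fine):

```python
# "unigram" is SentencePiece's default; "bpe" selects byte-pair encoding.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="lm_subword_bpe",
    vocab_size=8000,
    model_type="bpe",  # or "unigram"
)
```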
