Subword Tokenization

In the course, Jeremy hints at some problems with subword tokenization and fine-tuning, but (1) mentions he already has ideas without going into details and (2) leaves it open whether the default might even change in the future.

Have there been any discussions or commits on that topic since then that might be worth following?

Since I want to run some experiments with subword tokenization right now, I also wonder what these problems might be, so that I don't run into them blindly, especially since LM training takes quite a lot of time. I also wonder whether pre-trained models (AWD_LSTM / QRNN, maybe from the MultiFiT paper or others) are available somewhere. If not, I guess that isn't too much of an issue, since I have the luxury of basically as much in-domain text as I could possibly want, so I could train directly on that, or initially train on Wikitext as shown in the tutorial.
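
For reference, here is roughly what I'm planning to try. This is just a minimal sketch, assuming fastai v2's `SentencePieceTokenizer` and the `DataBlock` API; the DataFrame, column name, vocab size, and hyperparameters are placeholders, not anything from the course:

```python
from fastai.text.all import *

# Assumption: a DataFrame `df` with an in-domain text column named 'text'.
# SentencePieceTokenizer swaps the default word-level tokenizer for a
# subword (unigram) model; vocab_sz=8000 is just a placeholder.
tok = SentencePieceTokenizer(vocab_sz=8000)

dls_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
    get_x=ColReader('text'),
    splitter=RandomSplitter(valid_pct=0.1),
).dataloaders(df, bs=64, seq_len=72)

# pretrained=False because the shipped AWD_LSTM weights assume the default
# word-level vocab; with plenty of in-domain text, training from scratch
# seems feasible.
learn = language_model_learner(
    dls_lm, AWD_LSTM, pretrained=False, metrics=[accuracy, Perplexity()]
)
learn.fit_one_cycle(1, 3e-3)
```

If the subword/fine-tuning problems Jeremy alluded to affect this setup, I'd rather know before sinking GPU time into it.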
