What if you want to keep the unknown words (rather than replacing them with xxx)? As long as a word is used more than once, will it be kept?
The core idea is that even if wikitext is not that similar to the corpus you are interested in, you can still pre-train on wikitext and then fine-tune on the corpus you’re interested in.
Is the fastai library available by default in Kaggle kernels? It seems they don’t allow installing custom packages in GPU kernels. There is a Quora kernel-only competition, so I wonder whether it is possible to use the library there.
How do you change the size of the vocabulary, say from 6000 to 8000?
Can Jeremy confirm?
Please post this kind of question in the Advanced Section. But yes, BERT is another example of transfer learning in NLP, using a different backbone (transformer) and solving a different task (masked language model + next sentence prediction).
Do we build the vocab again from scratch, or do we use the vocab from the pre-trained wikitext model?
Wouldn’t you be missing words if the vocab is limited to a small number?
Jeremy said the traditional approach is to convert everything to lower case. Then how do you predict case? Is there a separate model / technique for that?
You wouldn’t be able to use the built-in language models for that competition because external data is prohibited.
We build it from scratch because the new corpus has words that didn’t exist in Wikipedia, then we match the words that also appear in the pre-trained vocab.
What about languages other than English?
There is a token that specifies whether the next word is capitalized. At test time, it works like any other token.
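To make the "build from scratch, then match" step concrete, here is a minimal sketch in plain Python. It is not fastai's actual code: the function names, the `xxunk` token name, the `max_vocab`/`min_freq` values, and the choice to give brand-new words the mean of the pre-trained vectors are all assumptions made for illustration.

```python
from collections import Counter

UNK = "xxunk"  # assumed name for the unknown-word token


def build_vocab(tokens, max_vocab=60000, min_freq=2):
    """Build a vocab from the new corpus: keep at most max_vocab of the
    most frequent tokens, and drop any seen fewer than min_freq times."""
    counts = Counter(tokens)
    kept = [tok for tok, c in counts.most_common(max_vocab)
            if c >= min_freq and tok != UNK]
    return [UNK] + kept


def match_embeddings(new_itos, old_itos, old_vectors):
    """Copy the pre-trained vector for every word shared with the old vocab.
    Words the pre-trained model never saw get the mean vector (an assumption)."""
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    dim = len(old_vectors[0])
    mean = [sum(row[d] for row in old_vectors) / len(old_vectors)
            for d in range(dim)]
    return [list(old_vectors[old_stoi[tok]]) if tok in old_stoi else list(mean)
            for tok in new_itos]


# Toy "pre-trained" vocab and 2-dimensional embeddings (made-up numbers).
wiki_itos = ["xxunk", "the", "movie", "was"]
wiki_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy target corpus: "meh" is new, "oscarbait" appears only once.
imdb_tokens = "the movie was meh the movie was meh oscarbait".split()
imdb_itos = build_vocab(imdb_tokens, max_vocab=10, min_freq=2)
imdb_vectors = match_embeddings(imdb_itos, wiki_itos, wiki_vectors)
```

In this toy run, "oscarbait" falls below `min_freq` and would map to the unknown token, while "meh" is kept and starts from the mean vector, to be adjusted during fine-tuning. Raising `max_vocab` or lowering `min_freq` is how you would keep more rare words in a sketch like this.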
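A rough sketch of how such a caps token can work, in plain Python. The marker names `xxmaj` and `xxup` and both helper functions are assumptions for illustration, not the library's tokenizer. The point is that case becomes an ordinary token in the sequence, so the model predicts it like any other token and the original casing can be restored afterwards.

```python
MAJ = "xxmaj"  # assumed marker: the next word started with a capital letter
UP = "xxup"    # assumed marker: the next word was written in all caps


def encode_case(text):
    """Lowercase capitalized and all-caps words, inserting a marker before each.
    Other words pass through unchanged."""
    out = []
    for word in text.split():
        if len(word) > 1 and word.isupper():
            out += [UP, word.lower()]
        elif word[0].isupper():
            out += [MAJ, word.lower()]
        else:
            out.append(word)
    return out


def decode_case(tokens):
    """Reverse encode_case: apply each marker to the token that follows it."""
    out, pending = [], None
    for tok in tokens:
        if tok in (MAJ, UP):
            pending = tok
            continue
        if pending == MAJ:
            tok = tok.capitalize()
        elif pending == UP:
            tok = tok.upper()
        out.append(tok)
        pending = None
    return " ".join(out)


tokens = encode_case("Jeremy loves NLP models")
restored = decode_case(tokens)
```

Here `tokens` comes out as `['xxmaj', 'jeremy', 'loves', 'xxup', 'nlp', 'models']`, so "Jeremy" and "jeremy" share one vocab entry, and `restored` is the original sentence.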
Already answered in this same thread.
The language model we use is publicly available, though. No?
Yes, correct, but I was thinking of using the lib instead of writing a custom training loop and models =) Also, they already provide some embeddings.
Why not also train the language model on the unsupervised entries in the IMDB dataset?
If we don’t hold out a validation set, does that mean there is no possibility of overfitting when building the language model?
We do!
How do you expand the vocab from wikitext to medical records when using transfer learning, assuming the vocab only covers high-frequency English words from Wikipedia?