I also want to remind everyone that the course content has varied from year to year (this is our 4th year doing the course), so it’s not the case that part 1 always covers X topics.
One change this year was that the machine learning course was subsumed into the deep learning course (previously, whether to start with the ML or DL courses was a very common question we got), so we covered more core ML this year in part 1 than we have in the past.
How can we determine whether a given pretrained model (e.g. one trained on Wikipedia) is suitable/sufficient for our downstream task? If there is limited vocab overlap, do we need to add an additional dataset and create a language model from scratch?
Unless you are changing language, the basis of your language should overlap, which is what matters most. But don’t use an English pretrained model for French or Chinese, if that was the question.
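One rough way to sanity-check vocab overlap, as a sketch only (the vocab and corpus here are made up, and a real check would use the pretrained model's actual vocabulary file): count what fraction of the downstream corpus's tokens the pretrained vocab covers.

```python
# Rough vocab-overlap check (illustrative; the vocab and text are toy data).
# A low coverage ratio suggests many out-of-vocabulary words downstream.
from collections import Counter

pretrained_vocab = {"the", "of", "protein", "cell", "model", "data"}
downstream_text = "the protein binds the receptor of the cell membrane"

counts = Counter(downstream_text.split())
covered = sum(c for tok, c in counts.items() if tok in pretrained_vocab)
coverage = covered / sum(counts.values())
print(f"token coverage: {coverage:.0%}")  # → token coverage: 67%
```

If coverage is very low (a domain-heavy corpus against a general Wikipedia vocab, say), that is a signal that further language-model fine-tuning, or extra in-domain data, will help.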
In general, you should tokenize the same way your pretrained model did (if possible), that way you won’t surprise your pretrained model on the new corpus.
A model is not tokenized; a model is an architecture with weights.
I’m talking about the way the dataset on which your pretrained model was trained was tokenized. And it’s up to you to make sure you do it the same way on your downstream task. Fastai’s pretrained model was trained with fastai’s tokenization. If you use a HuggingFace pretrained model, they come with their own tokenizers.
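To make the point concrete, here is a toy word-level sketch (not fastai's or HuggingFace's actual tokenizers, just the idea): if the pretrained vocab was built from lowercased tokens, tokenizing the downstream text with a different rule makes words the model actually knows fall out of its vocab.

```python
# Illustrative only: a word-level "pretrained" vocab built with lowercased
# tokens. Tokenizing the downstream text a different way (keeping case)
# turns known words into unknowns.
pretrained_vocab = {"i", "love", "deep", "learning"}

def lookup(tokens, vocab):
    return [tok if tok in vocab else "<unk>" for tok in tokens]

text = "I LOVE Deep learning"
same_way  = lookup(text.lower().split(), pretrained_vocab)  # matches pretraining
other_way = lookup(text.split(), pretrained_vocab)          # different rule

print(same_way)   # ['i', 'love', 'deep', 'learning']
print(other_way)  # ['<unk>', '<unk>', '<unk>', 'learning']
```

The model's embeddings only exist for the tokens it was trained on, which is why you reuse the pretrained model's own tokenization pipeline rather than inventing your own.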
Yes, the Wikipedia-pretrained model gives a good basis to start with. What I was wondering was more: if the downstream task is in a technological or medical domain, but still in English, should we combine it with additional data sources, like wiki + [DOMAIN specific source]?
They are not the same token, though; they are the same word plus a grammatical rule. You could always say: “I’m not going to do anything to make life easier for my model, it can take care of itself.” But in practice, every little bit of help you give it is going to make your final metric better.
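This is the idea behind fastai's capitalization rule: instead of giving “They” and “they” two separate embeddings, lowercase the word and emit a marker token carrying the case information. The sketch below imitates that rule with a toy function (the real implementation is fastai's tokenizer; only the `xxmaj` marker name is taken from it).

```python
# Toy version of fastai's capitalization rule: a capitalized word becomes
# the marker token "xxmaj" followed by the lowercased word, so the model
# keeps one embedding per word while still seeing the grammatical signal.
def mark_caps(tokens):
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.append("xxmaj")
        out.append(tok.lower())
    return out

print(mark_caps(["They", "run", ".", "Dogs", "run", "."]))
# ['xxmaj', 'they', 'run', '.', 'xxmaj', 'dogs', 'run', '.']
```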
fastai allows you to train any model with any data. There is no function that gives you a Learner in one line of code for word tagging or an encoder/decoder model, but that does not mean you can’t use fastai to train such a model. You’ll just need to dig a bit deeper into the mid-level API to build your Learner yourself.