Lesson 8 - Official topic

I also want to remind everyone that the course content has varied from year to year (this is our 4th year doing the course), so it’s not the case that part 1 always covers X topics.

One change this year was that the machine learning course was subsumed into the deep learning course (previously, whether to start with the ML or DL courses was a very common question we got), so we covered more core ML this year in part 1 than we have in the past.

8 Likes

As a suggestion for my peers: Jeremy, Rachel, and Zach are great podcasters! Try converting the older lessons into audio and consuming them!

It’s one of my favourite ways to learn from older lectures.

3 Likes

How can we determine whether a given pretrained model (e.g. the Wikitext-103 one) is suitable or sufficient for our downstream task? If there is limited vocabulary overlap, should we add additional datasets, or create a language model from scratch?

3 Likes

Should we worry more about which base language model (Wikitext-103) to use, or about how we tokenize our data? Which one is more important?

1 Like

Unless you are changing language, the basics of your language should overlap, which is what matters most. But don’t use an English pretrained model for French or Chinese, if that was the question.

4 Likes

In general, you should tokenize the same way your pretrained model did (if possible); that way you won’t surprise your pretrained model with the new corpus.

Hugging Face also has very fast tokenizers for the SOTA models (e.g. GPT-2 or BERT). Here is how those could be used: http://dev.fast.ai/tutorial.transformers
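
For reference, a minimal sketch of calling one of those fast tokenizers directly (assuming the transformers library is installed; the sample sentence is made up):

```python
from transformers import GPT2TokenizerFast

# Load the tokenizer that matches the GPT-2 pretrained weights
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "fastai and transformers work well together"
tokens = tokenizer.tokenize(text)          # subword pieces
ids = tokenizer(text)["input_ids"]         # numericalized ids
print(tokens, ids)
```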

3 Likes

Are you referring to the way the language model was tokenized? How would we know this?

A model is not tokenized; a model is an architecture with weights.
I’m talking about the way the dataset on which your pretrained model was trained was tokenized, and it’s up to you to make sure you do it the same way on your downstream task. Fastai’s pretrained model was trained with fastai’s tokenization; if you use a Hugging Face pretrained model, it comes with its own tokenizer.
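
For example, a rough sketch with fastai’s defaults (the tiny DataFrame here is made up): the default Tokenizer and text-processing rules are the same ones used when pretraining the Wikitext-103 AWD_LSTM, so keeping those defaults keeps your downstream tokenization consistent with the pretrained weights.

```python
from fastai.text.all import *
import pandas as pd

# Made-up placeholder data; in practice this would be your own labelled texts
df = pd.DataFrame({'text':  ['great movie', 'terrible movie', 'loved it', 'hated it'],
                   'label': ['pos', 'neg', 'pos', 'neg']})

# Default tokenization rules match those used to pretrain the Wikitext-103 AWD_LSTM
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', valid_pct=0.25)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
```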

5 Likes

In this line of code, first(spacy([txt])), what does first do?

First takes the first element, since a tokenizer returns a generator (for easy parallelization internally) and not a list.
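
To make that concrete, a small sketch (the sample sentence is made up):

```python
from fastai.text.all import *

txt = "The fastai library makes text preprocessing easy."
spacy = WordTokenizer()   # spaCy-backed word tokenizer used in the lesson notebook

gen = spacy([txt])        # a generator yielding one token list per input text
toks = first(gen)         # first() pulls the token list for our single text out of the generator
print(toks)
```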

2 Likes

Yes, the Wiki-pretrained model gives a good basis to start with. If the downstream task is in English but associated with a technological or medical domain, that’s where I was wondering whether to maybe combine it with additional data sources, like wiki + [DOMAIN specific source].

As always, you should try it and see if you get better results :wink:
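
One rough sketch of what trying it could look like (domain_df is a hypothetical DataFrame holding your domain texts in a 'text' column): fine-tune the Wikitext-103 language model on the domain corpus, then reuse its encoder for the downstream task.

```python
from fastai.text.all import *

# Hypothetical: domain_df holds your domain-specific texts in a 'text' column
dls_lm = TextDataLoaders.from_df(domain_df, text_col='text', is_lm=True, valid_pct=0.1)

# Start from the Wikitext-103 pretrained AWD_LSTM and adapt it to the domain corpus
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=Perplexity())
learn.fit_one_cycle(1, 2e-3)
learn.unfreeze()
learn.fit_one_cycle(3, 2e-4)

# Save the adapted encoder so a downstream classifier can load it with load_encoder()
learn.save_encoder('domain_finetuned')
```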

1 Like

Isn’t the embedding supposed to figure out by distance that “it” and “It” are the same?

Is it possible to deactivate the case-sensitive part? I have a dataset where lower and upper case do not have any special meaning.

Does fastai2 support other text tasks apart from text classification? Word tagging? Encoder-decoder?

1 Like

They are not the same, though; they are the same word with a grammatical rule applied. You could always say: I’m not going to do anything to make life easier for my model, it can take care of itself. But in practice, every little bit of help you give it is going to make your final metric better.

You can specify your list of “rules” when defining a tokenizer.

2 Likes

I did this recently. When I was tokenizing, I just called the .lower() method in Python.
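
A sketch of the rules-based route mentioned above (the column name and the exact rule list are my own assumptions): keep some of fastai’s default preprocessing rules, but swap the capitalization-marker rules for a plain lowercasing one.

```python
from fastai.text.all import *

def to_lower(t):
    "Plain lowercasing rule, used instead of the default xxmaj/xxup capitalization markers"
    return t.lower()

# Keep most of fastai's default preprocessing rules, drop the case-marker ones
custom_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces,
                rm_useless_spaces, to_lower]

tok = Tokenizer.from_df(text_cols='text', rules=custom_rules)
```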

fastai allows you to train any model with any data. There is no function that gives you a learner in one line of code for word tagging or encoder-decoder models, but that does not mean you can’t use fastai to train such a model. You’ll just need to dig a bit more into the mid-level API to build your Learner yourself.
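
As an illustration only (everything below is made up, not an official fastai recipe): once you have built DataLoaders with the mid-level data API, any PyTorch module can be wrapped in a Learner, e.g. for word tagging.

```python
from fastai.text.all import *
import torch.nn as nn

class SimpleTagger(nn.Module):
    "Toy sequence-tagging model: predicts one tag per input token"
    def __init__(self, vocab_sz, n_tags, emb_sz=100, hid_sz=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)
        self.out = nn.Linear(hid_sz, n_tags)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

# 'dls' would come from the mid-level data API (Datasets/TfmdLists + .dataloaders()),
# with numericalized tokens as inputs and per-token tag ids as targets.
# learn = Learner(dls, SimpleTagger(vocab_sz=10000, n_tags=17),
#                 loss_func=CrossEntropyLossFlat(), metrics=accuracy)
# learn.fit_one_cycle(5, 1e-3)
```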

2 Likes