Lesson 8 - Official topic

I also want to remind everyone that the course content has varied from year to year (this is our 4th year doing the course), so it’s not the case that part 1 always covers X topics.

One change this year was that the machine learning course was subsumed into the deep learning course (previously, whether to start with the ML or DL courses was a very common question we got), so we covered more core ML this year in part 1 than we have in the past.

8 Likes

As a suggestion for my peers: Jeremy, Rachel, and Zach are great podcasters! Try converting the older lessons into audio and consuming them!

It’s one of my favourite ways to learn from older lectures.

3 Likes

How can we determine whether a given pretrained model (e.g. the Wikitext-103 one) is suitable or sufficient for our downstream task? If there is limited vocabulary overlap, should we add additional datasets, or create a language model from scratch?

3 Likes

Should we worry more about which base language model (Wikitext-103) to use, or about how we tokenize our data? Which one is more important?

1 Like

Unless you are changing language, the basics of your language should overlap, which is what matters most. But don’t use an English pretrained model for French or Chinese, if that was the question.

4 Likes

In general, you should tokenize the same way your pretrained model did (if possible); that way you won’t surprise your pretrained model with the new corpus.

Hugging Face also has very fast tokenizers for the SOTA models (e.g. GPT-2 or BERT). Here is how those could be used: http://dev.fast.ai/tutorial.transformers
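
For reference, a minimal sketch of calling one of those fast tokenizers directly (assuming the transformers library is installed; the sample sentence is made up):

```python
from transformers import GPT2TokenizerFast

# Load the tokenizer that matches the GPT-2 pretrained weights
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "fastai and transformers work well together"
tokens = tokenizer.tokenize(text)          # subword pieces
ids = tokenizer(text)["input_ids"]         # numericalized ids
print(tokens, ids)
```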

3 Likes

Are you referring to the way the language model was tokenized? How would we know this?

A model is not tokenized; a model is an architecture with weights.
I’m talking about the way the dataset on which your pretrained model was trained was tokenized, and it’s up to you to make sure you do it the same way on your downstream task. Fastai’s pretrained model was trained with fastai’s tokenization; if you use a Hugging Face pretrained model, it comes with its own tokenizer.
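
For example, a rough sketch with fastai’s defaults (the tiny DataFrame here is made up): the default Tokenizer and text-processing rules are the same ones used when pretraining the Wikitext-103 AWD_LSTM, so keeping those defaults keeps your downstream tokenization consistent with the pretrained weights.

```python
from fastai.text.all import *
import pandas as pd

# Made-up placeholder data; in practice this would be your own labelled texts
df = pd.DataFrame({'text':  ['great movie', 'terrible movie', 'loved it', 'hated it'],
                   'label': ['pos', 'neg', 'pos', 'neg']})

# Default tokenization rules match those used to pretrain the Wikitext-103 AWD_LSTM
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', valid_pct=0.25)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
```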

5 Likes

In this line of code, first(spacy([txt])), what does first do?

First takes the first element, since a tokenizer returns a generator (for easy parallelization internally) and not a list.
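
To make that concrete, a small sketch (the sample sentence is made up):

```python
from fastai.text.all import *

txt = "The fastai library makes text preprocessing easy."
spacy = WordTokenizer()   # spaCy-backed word tokenizer used in the lesson notebook

gen = spacy([txt])        # a generator yielding one token list per input text
toks = first(gen)         # first() pulls the token list for our single text out of the generator
print(toks)
```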

2 Likes

Yes, the Wiki-pretrained model gives a good basis to start with. If the downstream task is in English but associated with a technological or medical domain, that’s where I was wondering whether to maybe combine it with additional data sources, like wiki + [DOMAIN specific source].

As always, you should try it and see if you get better results :wink:
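
One rough sketch of what trying it could look like (domain_df is a hypothetical DataFrame holding your domain texts in a 'text' column): fine-tune the Wikitext-103 language model on the domain corpus, then reuse its encoder for the downstream task.

```python
from fastai.text.all import *

# Hypothetical: domain_df holds your domain-specific texts in a 'text' column
dls_lm = TextDataLoaders.from_df(domain_df, text_col='text', is_lm=True, valid_pct=0.1)

# Start from the Wikitext-103 pretrained AWD_LSTM and adapt it to the domain corpus
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=Perplexity())
learn.fit_one_cycle(1, 2e-3)
learn.unfreeze()
learn.fit_one_cycle(3, 2e-4)

# Save the adapted encoder so a downstream classifier can load it with load_encoder()
learn.save_encoder('domain_finetuned')
```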

1 Like

Isn’t the embedding supposed to figure out by distance that “it” and “It” are the same?

Is it possible to deactivate the case-sensitive part? I have a dataset where lower and upper case do not have any special meaning.

Does fastai2 support other text tasks apart from text classification? Word tagging? Encoder-decoder?

1 Like

They are not the same, though; they are the same word with a grammatical rule applied. You could always say: I’m not going to do anything to make life easier for my model, it can take care of itself. But in practice, every little bit of help you give it is going to make your final metric better.

You can specify your list of “rules” when defining a tokenizer.

2 Likes

I did this recently. When I was tokenizing, I just called the .lower() method in Python.
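
A sketch of the rules-based route mentioned above (the column name and the exact rule list are my own assumptions): keep some of fastai’s default preprocessing rules, but swap the capitalization-marker rules for a plain lowercasing one.

```python
from fastai.text.all import *

def to_lower(t):
    "Plain lowercasing rule, used instead of the default xxmaj/xxup capitalization markers"
    return t.lower()

# Keep most of fastai's default preprocessing rules, drop the case-marker ones
custom_rules = [fix_html, replace_rep, replace_wrep, spec_add_spaces,
                rm_useless_spaces, to_lower]

tok = Tokenizer.from_df(text_cols='text', rules=custom_rules)
```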

fastai allows you to train any model with any data. There is no function that gives you a learner in one line of code for word tagging or encoder-decoder models, but that does not mean you can’t use fastai to train such a model. You’ll just need to dig a bit more into the mid-level API to build your Learner yourself.
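
As an illustration only (everything below is made up, not an official fastai recipe): once you have built DataLoaders with the mid-level data API, any PyTorch module can be wrapped in a Learner, e.g. for word tagging.

```python
from fastai.text.all import *
import torch.nn as nn

class SimpleTagger(nn.Module):
    "Toy sequence-tagging model: predicts one tag per input token"
    def __init__(self, vocab_sz, n_tags, emb_sz=100, hid_sz=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)
        self.out = nn.Linear(hid_sz, n_tags)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

# 'dls' would come from the mid-level data API (Datasets/TfmdLists + .dataloaders()),
# with numericalized tokens as inputs and per-token tag ids as targets.
# learn = Learner(dls, SimpleTagger(vocab_sz=10000, n_tags=17),
#                 loss_func=CrossEntropyLossFlat(), metrics=accuracy)
# learn.fit_one_cycle(5, 1e-3)
```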

2 Likes