Lesson 8 - Official topic

Link to the ULMFiT paper

I'm thankful for this course for so many reasons, but another to add to the list: Jeremy speaking Mandarin Chinese in the course! Thank you to the Fastai team for consciously choosing to teach universal natural language concepts, and encouraging NLP work in so many other languages, rather than centering it around Western/English ideas.

9 Likes

"that might change" …

So what other tokenizers are y'all looking at making the default? And what are the criteria for choosing which one becomes the default, as well as which options are best for which use-case scenarios?

2 Likes

I am sorry for my poor wording. I was asking about having pretrained models and functionality suitable for other NLP tasks. For instance, is it easy to do POS tagging?

Are there pretrained language models for specific tasks? For example, if I were trying to see how similar two words are vs. generating a sentence?

Does setup do frequency-based sorting to determine the tokens? Or what else does it do to figure out common occurrences?

The pretrained model can work on many downstream tasks. The BERT language model, for instance, can be fine-tuned for question answering, translation, entity parsing, etc.
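The course uses the same idea via ULMFiT. Here is a minimal sketch of fine-tuning a pretrained AWD_LSTM backbone for a downstream classification task with fastai (it assumes you have already built a `dls` DataLoaders object from your own labelled texts):

```python
from fastai.text.all import *

# Assumes `dls` is a TextDataLoaders built from your own labelled dataset,
# e.g. dls = TextDataLoaders.from_folder(path, valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

# The AWD_LSTM encoder comes pretrained on WikiText-103; fine_tune adapts it
# to the downstream task (here: classification).
learn.fine_tune(4, 1e-2)
```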

2 Likes

You need to know your most common tokens when you build the vocabulary. To keep memory usage reasonable, we often cap the vocabulary at 60,000, only keeping the most common tokens.
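A rough sketch of the idea in plain Python (the token stream here is a made-up toy example; in fastai this is handled by `Numericalize` through its `max_vocab` and `min_freq` arguments):

```python
from collections import Counter

# Hypothetical token stream; in practice this comes from the tokenized corpus.
tokens = ["the", "movie", "was", "the", "best", "movie", "ever", "the"]

max_vocab = 60000                      # cap on vocabulary size
counts = Counter(tokens)               # frequency of every token
vocab = [tok for tok, _ in counts.most_common(max_vocab)]

# Anything outside the most common tokens maps to a special "unknown" token.
stoi = {tok: i for i, tok in enumerate(["xxunk"] + vocab)}
ids = [stoi.get(tok, 0) for tok in tokens]
```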

A few things I noticed while running the NB. First it asked me to install sentencepiece, which I did by running `!pip install sentencepiece`. After doing this, when running `subword(1000)`, `subword(200)`, or `subword(10000)`, I get the following error: `Not found: "tmp/texts.out": No such file or directory Error #2` :frowning:
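For reference, the `subword` helper being called here is roughly the following sketch based on the course notebook (`txts` and `txt` are the review texts loaded earlier in the notebook); the `setup` call is what trains SentencePiece on a temporary text dump, which appears to be where the missing `tmp/texts.out` file comes from:

```python
from fastai.text.all import *

# Sketch of the notebook's subword helper.
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)   # SentencePiece-backed tokenizer
    sp.setup(txts)                       # trains SentencePiece on the corpus
    return ' '.join(first(sp([txt]))[:40])
```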

3 Likes

For a batch size of 64, how do we get 90 tokens?

I don't understand your question.

The batch size was 6, each of the 6 rows spans 15 words/tokens, and there are 6 × 15 = 90 words/tokens in the corpus stream.
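A quick sketch of that split in plain Python (the token stream is just a toy stand-in for the 90-token corpus used in the lesson):

```python
# Toy stand-in for the 90-token corpus stream from the lesson.
stream = [f"tok{i}" for i in range(90)]

bs = 6                          # batch size: number of rows
row_len = len(stream) // bs     # 90 // 6 = 15 tokens per row

# Each row is a contiguous 15-token slice of the stream.
rows = [stream[i*row_len:(i+1)*row_len] for i in range(bs)]
print(len(rows), len(rows[0]))  # 6 15
```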

3 Likes

Would it be helpful to have overlap between the batches? Like part of the sentence from batch 1 is still in batch 2?

Why do we make the split across the columns rather than splitting along the rows? Won't that affect the essence of the sentence for our model to learn?

Do any language models attempt to provide meaning? For instance, "I'm going to the store" is the opposite of "I'm not going to the store". Or: "I barely understand this stuff" and "that ball came so close to my ear I heard it whistle". Both contain the idea of something almost happening, being right on the border. Is there a way to indicate this kind of subtlety in a language model?

1 Like

What was the reason it was done like that? Would it affect the training to use a continuous corpus per batch instead of the split that was done?

The model has a state, so it needs to read contiguous texts.
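Concretely, continuing the 6 × 15 toy example above: each row is chopped into shorter sequences, and batch b+1 picks up each row exactly where batch b left off, so the hidden state carried over between batches always sees contiguous text. A sketch (with a hypothetical sequence length of 5):

```python
# Continuing the 6 x 15 example: chop each 15-token row into sequences of 5,
# giving 3 consecutive mini-batches of shape (6, 5).
stream = [f"tok{i}" for i in range(90)]
bs, row_len, seq_len = 6, 15, 5
rows = [stream[i*row_len:(i+1)*row_len] for i in range(bs)]

batches = [[row[b*seq_len:(b+1)*seq_len] for row in rows]
           for b in range(row_len // seq_len)]

# Row i of batch b+1 starts exactly where row i of batch b ended, so the
# recurrent hidden state carried across batches always reads contiguous text.
assert batches[0][0] == ["tok0", "tok1", "tok2", "tok3", "tok4"]
assert batches[1][0] == ["tok5", "tok6", "tok7", "tok8", "tok9"]
```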

2 Likes

Why is it important to keep each row independent in the mini-batch?

That's the whole point of training a language model. It won't just learn things about words individually, but how they work in a sentence, grammar rules, etc.

1 Like

If you put the same rows in a mini-batch, you won't get very good gradients for SGD.

1 Like