Lesson 8 - Official topic

Link to the ULMFiT paper

I'm thankful for this course for so many reasons, but another to add to the list: Jeremy speaking Mandarin Chinese in the course! Thank you to the Fastai team for consciously choosing to teach universal natural language concepts, and encouraging NLP work in so many other languages, rather than centering it around Western/English ideas.

9 Likes

"that might change" …

So what other tokenizers are y'all looking at making the default? And what are the criteria for choosing which one becomes the default, as well as which options are best for which use-case scenarios?

2 Likes

I am sorry for my poor wording. I was asking about having pretrained models and functionality suitable for other NLP tasks. For instance, is it easy to do POS tagging?

Are there pretrained language models for specific tasks? For example, if I were trying to see how similar two words are vs. generating a sentence?

Does setup do frequency-based sorting to determine the tokens? Or what else does it do to figure out common occurrences?

The pretrained model can work on many downstream tasks. The BERT language model, for instance, can be fine-tuned for question answering, translation, entity parsing, etc.
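The course uses the same idea via ULMFiT. Here is a minimal sketch of fine-tuning a pretrained AWD_LSTM backbone for a downstream classification task with fastai (it assumes you have already built a `dls` DataLoaders object from your own labelled texts):

```python
from fastai.text.all import *

# Assumes `dls` is a TextDataLoaders built from your own labelled dataset,
# e.g. dls = TextDataLoaders.from_folder(path, valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

# The AWD_LSTM encoder comes pretrained on WikiText-103; fine_tune adapts it
# to the downstream task (here: classification).
learn.fine_tune(4, 1e-2)
```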

2 Likes

You need to know your most common tokens when you build the vocabulary. To keep memory usage reasonable, we often cap the vocabulary at 60,000, only keeping the most common tokens.
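A rough sketch of the idea in plain Python (the token stream here is a made-up toy example; in fastai this is handled by `Numericalize` through its `max_vocab` and `min_freq` arguments):

```python
from collections import Counter

# Hypothetical token stream; in practice this comes from the tokenized corpus.
tokens = ["the", "movie", "was", "the", "best", "movie", "ever", "the"]

max_vocab = 60000                      # cap on vocabulary size
counts = Counter(tokens)               # frequency of every token
vocab = [tok for tok, _ in counts.most_common(max_vocab)]

# Anything outside the most common tokens maps to a special "unknown" token.
stoi = {tok: i for i, tok in enumerate(["xxunk"] + vocab)}
ids = [stoi.get(tok, 0) for tok in tokens]
```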

A few things I noticed while running the NB. First it asked me to install sentencepiece, which I did by running `!pip install sentencepiece`. After doing this, when running `subword(1000)`, `subword(200)`, or `subword(10000)`, I get the following error: `Not found: "tmp/texts.out": No such file or directory Error #2` :frowning:
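For reference, the `subword` helper being called here is roughly the following sketch based on the course notebook (`txts` and `txt` are the review texts loaded earlier in the notebook); the `setup` call is what trains SentencePiece on a temporary text dump, which appears to be where the missing `tmp/texts.out` file comes from:

```python
from fastai.text.all import *

# Sketch of the notebook's subword helper.
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)   # SentencePiece-backed tokenizer
    sp.setup(txts)                       # trains SentencePiece on the corpus
    return ' '.join(first(sp([txt]))[:40])
```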

3 Likes

For a batch size of 64, how do we get 90 tokens?

I don't understand your question.

The batch size was 6, each of the 6 rows spans 15 words/tokens, and there are 6 × 15 = 90 words/tokens in the corpus stream.
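A quick sketch of that split in plain Python (the token stream is just a toy stand-in for the 90-token corpus used in the lesson):

```python
# Toy stand-in for the 90-token corpus stream from the lesson.
stream = [f"tok{i}" for i in range(90)]

bs = 6                          # batch size: number of rows
row_len = len(stream) // bs     # 90 // 6 = 15 tokens per row

# Each row is a contiguous 15-token slice of the stream.
rows = [stream[i*row_len:(i+1)*row_len] for i in range(bs)]
print(len(rows), len(rows[0]))  # 6 15
```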

3 Likes

Would it be helpful to have overlap between the batches? Like part of the sentence from batch 1 is still in batch 2?

Why do we make the split across the columns rather than splitting along the rows? Won't that affect the essence of the sentence for our model to learn?

Do any language models attempt to provide meaning? For instance, "I'm going to the store" is the opposite of "I'm not going to the store". Or: "I barely understand this stuff" and "that ball came so close to my ear I heard it whistle". Both contain the idea of something almost happening, being right on the border. Is there a way to indicate this kind of subtlety in a language model?

1 Like

What was the reason it was done like that? Would it affect the training to use a continuous corpus per batch instead of the split that was done?

The model has a state, so it needs to read contiguous texts.
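Concretely, continuing the 6 × 15 toy example above: each row is chopped into shorter sequences, and batch b+1 picks up each row exactly where batch b left off, so the hidden state carried over between batches always sees contiguous text. A sketch (with a hypothetical sequence length of 5):

```python
# Continuing the 6 x 15 example: chop each 15-token row into sequences of 5,
# giving 3 consecutive mini-batches of shape (6, 5).
stream = [f"tok{i}" for i in range(90)]
bs, row_len, seq_len = 6, 15, 5
rows = [stream[i*row_len:(i+1)*row_len] for i in range(bs)]

batches = [[row[b*seq_len:(b+1)*seq_len] for row in rows]
           for b in range(row_len // seq_len)]

# Row i of batch b+1 starts exactly where row i of batch b ended, so the
# recurrent hidden state carried across batches always reads contiguous text.
assert batches[0][0] == ["tok0", "tok1", "tok2", "tok3", "tok4"]
assert batches[1][0] == ["tok5", "tok6", "tok7", "tok8", "tok9"]
```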

2 Likes

Why is it important to keep each row independent in the mini-batch?

That's the whole point of training a language model. It won't just learn things about words individually, but how they work in a sentence, grammar rules, etc.

1 Like

If you put the same rows in a mini-batch, you won't get very good gradients for SGD.

1 Like