Lesson 8 - Official topic

How do we select the sequence length of the language model?

Ah, OK, this makes sense now. I was wondering: if a batch only contains a short run of tokens, how much could the model “learn” from it? But the model just continues “learning” in the next batch, right?

That’s an argument you can pass (called seq_len).
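
For example, it’s one of the keyword arguments passed when creating the DataLoaders. A minimal sketch, following the IMDb setup used in the notebook (80 here; the library default is 72):

from fastai.text.all import *

path = untar_data(URLs.IMDB)
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)  # seq_len set here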

2 Likes

So then would it somehow recognize that those two things are similar, say that they have a similar vector value, even though they have few or no words in common?

That is the ultimate goal, yes.

I think it will become clear in the next chapter, which covers how you train RNNs: you keep carrying the previous hidden states forward.
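
To make that concrete, here is a minimal PyTorch sketch (not fastai’s actual code; all names and sizes are made up) of carrying the hidden state from one batch of the token stream to the next:

import torch, torch.nn as nn

vocab_sz, emb_sz, n_hidden, bs, seq_len = 100, 16, 32, 4, 10
emb  = nn.Embedding(vocab_sz, emb_sz)
rnn  = nn.RNN(emb_sz, n_hidden, batch_first=True)
head = nn.Linear(n_hidden, vocab_sz)
opt  = torch.optim.SGD([*emb.parameters(), *rnn.parameters(), *head.parameters()], lr=0.1)

hidden = torch.zeros(1, bs, n_hidden)      # persists across batches
for _ in range(3):                         # three consecutive batches of the stream
    x = torch.randint(0, vocab_sz, (bs, seq_len))   # stand-in for real token ids
    y = torch.randint(0, vocab_sz, (bs, seq_len))   # stand-in for the shifted targets
    out, hidden = rnn(emb(x), hidden)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, vocab_sz), y.reshape(-1))
    loss.backward()
    opt.step(); opt.zero_grad()
    hidden = hidden.detach()               # keep the state, but don’t backprop through past batches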

2 Likes

I don’t understand your question: there is no short run of tokens, since we built one huge array with all our texts concatenated.
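
To illustrate: once everything is concatenated into one long stream, it just gets cut into bs parallel pieces and read seq_len tokens at a time. A toy sketch (not the library’s implementation):

import torch

tokens = torch.arange(240)                 # stand-in for the whole concatenated corpus
bs, seq_len = 4, 10
stream = tokens[:len(tokens) // bs * bs].view(bs, -1)   # bs rows, each a contiguous slice of text
for i in range(0, stream.shape[1] - seq_len, seq_len):
    x = stream[:, i     : i + seq_len]     # inputs
    y = stream[:, i + 1 : i + seq_len + 1] # targets: the same text shifted by one token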

PyTorch Lightning is coming to say hi :stuck_out_tongue: (in relation to the thunderstorm noises Jeremy was mentioning)

2 Likes

Regarding the DataBlock implementation, can I run it like this?

dls_lm = DataBlock(
    blocks=(TextBlock.from_folder(path, is_lm=True),), # I'm putting this in a tuple
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

In other words, if blocks only receives a single object, does it assume it’s being trained for the language model self-supervision task?

No, it’s the is_lm=True flag that indicates this to the library.
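
For comparison, the classifier version from the book passes two blocks and no is_lm flag (a sketch; it assumes path and the language-model DataLoaders dls_lm from earlier are already defined):

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)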

5 Likes

On the topic of NLP for languages other than English, Jeremy taught a lesson on this for the fast.ai NLP course:

video description:

Jeremy continues the discussion of ULMFiT, an approach where a language model trained on Wikipedia is fine-tuned for sentiment classification of movie reviews, and shows how to work with non-English languages, using Vietnamese as an example. He then adapts the model to Turkish, this time using sub-word pieces in order to capture Turkish morphemes. Rachel then begins explaining how RNNs work, to be continued in Video 11.

15 Likes

I mean that within a single batch, I wasn’t sure how the model gets enough context to learn from (the number of tokens per sequence in a batch is relatively short). But if the model state is maintained across batches, it makes sense…

I assume such language models are also behind autocomplete technologies, right?
(like in Gmail)

Do the tokenizers use any text-normalization techniques like stemming or lemmatization, or is that an outdated approach?

More basic ones, but yes. (Some of the most basic ones only use statistical rules, but the latest Gmail autocomplete uses a real language model.)

2 Likes

A couple of questions:

  • When training, is there an easy way to save the best model so far?
  • After 10 epochs of training the language model we got to ~35% accuracy. What is a good metric to determine when a language model is good enough?
1 Like

There’s a SaveModelCallback for saving the best model.
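
For example (a sketch; the metric to monitor and the file name are arbitrary choices, and learn is the language-model learner from the lesson):

from fastai.callback.tracker import SaveModelCallback

learn.fit_one_cycle(10, 2e-3,
    cbs=SaveModelCallback(monitor='accuracy', fname='best_lm'))
# saves to learn.path/learn.model_dir/'best_lm.pth' whenever the monitored
# metric improves, and loads the best weights back at the end of training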

6 Likes

Sometimes when I call .predict on the language model, I also get an extra word at the beginning of the sequence:

learn.predict('a|m e|m a|m e|m a|m g| em| a|m', temperature=1.0, n_words=1, no_unk=False, no_bar=True)

gives:

c| a|m e|m a|m e|m a|m g| c| a|m e|m

(note the “c|” at the beginning of the sequence)

Why is that?

I should read the reference documentation more :slight_smile:. Thank you!

1 Like

Don’t worry, there is so much functionality available that it’s hard to keep track of it all :slight_smile:

1 Like