Lesson 8 - Official topic

How do we select the sequence length of the language model?

Ah, OK, this makes sense now. I was wondering: if a batch only contains a short run of tokens, how much could the model “learn” from it? But the model just continues “learning” in the next batch, right?

That’s an argument you can pass (called seq_len).
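
For example, it’s one of the keyword arguments passed when creating the DataLoaders. A minimal sketch, following the IMDb setup used in the notebook (80 here; the library default is 72):

from fastai.text.all import *

path = untar_data(URLs.IMDB)
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)  # seq_len set here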

2 Likes

So then would it somehow recognize that those two things are similar, say that they have a similar vector value, even though they have few or no words in common?

That is the ultimate goal, yes.

I think it will become clear in the next chapter, which covers how you train RNNs: you keep carrying the previous hidden states forward.
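
To make that concrete, here is a minimal PyTorch sketch (not fastai’s actual code; all names and sizes are made up) of carrying the hidden state from one batch of the token stream to the next:

import torch, torch.nn as nn

vocab_sz, emb_sz, n_hidden, bs, seq_len = 100, 16, 32, 4, 10
emb  = nn.Embedding(vocab_sz, emb_sz)
rnn  = nn.RNN(emb_sz, n_hidden, batch_first=True)
head = nn.Linear(n_hidden, vocab_sz)
opt  = torch.optim.SGD([*emb.parameters(), *rnn.parameters(), *head.parameters()], lr=0.1)

hidden = torch.zeros(1, bs, n_hidden)      # persists across batches
for _ in range(3):                         # three consecutive batches of the stream
    x = torch.randint(0, vocab_sz, (bs, seq_len))   # stand-in for real token ids
    y = torch.randint(0, vocab_sz, (bs, seq_len))   # stand-in for the shifted targets
    out, hidden = rnn(emb(x), hidden)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, vocab_sz), y.reshape(-1))
    loss.backward()
    opt.step(); opt.zero_grad()
    hidden = hidden.detach()               # keep the state, but don’t backprop through past batches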

2 Likes

I don’t understand your question: there is no short run of tokens, since we built one huge array with all our texts concatenated.
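
To illustrate: once everything is concatenated into one long stream, it just gets cut into bs parallel pieces and read seq_len tokens at a time. A toy sketch (not the library’s implementation):

import torch

tokens = torch.arange(240)                 # stand-in for the whole concatenated corpus
bs, seq_len = 4, 10
stream = tokens[:len(tokens) // bs * bs].view(bs, -1)   # bs rows, each a contiguous slice of text
for i in range(0, stream.shape[1] - seq_len, seq_len):
    x = stream[:, i     : i + seq_len]     # inputs
    y = stream[:, i + 1 : i + seq_len + 1] # targets: the same text shifted by one token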

PyTorch Lightning is coming to say hi :stuck_out_tongue: (in relation to the thunderstorm noises Jeremy was mentioning)

2 Likes

Regarding the DataBlock implementation, can I run it like this?

dls_lm = DataBlock(
    blocks=(TextBlock.from_folder(path, is_lm=True),), # I'm putting this in a tuple
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

In other words, if blocks only receives a single object, does it assume it’s being trained for the language model self-supervision task?

No, it’s the is_lm=True flag that indicates this to the library.
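
For comparison, the classifier version from the book passes two blocks and no is_lm flag (a sketch; it assumes path and the language-model DataLoaders dls_lm from earlier are already defined):

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)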

5 Likes

On the topic of NLP for languages other than English, Jeremy taught a lesson on this for the fast.ai NLP course:

video description:

Jeremy continues the discussion of ULMFiT, an approach where a language model trained on Wikipedia is fine-tuned for sentiment classification of movie reviews, and shows how to work with non-English languages, using Vietnamese as an example. He then adapts the model to Turkish, this time using sub-word pieces in order to capture Turkish morphemes. Rachel then begins explaining how RNNs work, to be continued in Video 11.

15 Likes

I mean that within a single batch, I wasn’t sure how the model gets enough context to learn from (the number of tokens per sequence in a batch is relatively short). But if the model state is maintained across batches, it makes sense…

I assume such language models are also behind autocomplete technologies, right?
(like in Gmail)

Do the tokenizers use any text-normalization techniques like stemming or lemmatization, or is that an outdated approach?

More basic ones, but yes. (Some of the most basic ones only use statistical rules, but the latest Gmail autocomplete uses a real language model.)

2 Likes

A couple of questions:

  • When training, is there an easy way to save the best model so far?
  • After 10 epochs of training the language model we got to ~35% accuracy. What is a good metric to determine when a language model is good enough?
1 Like

There’s a SaveModelCallback for saving the best model.
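
For example (a sketch; the metric to monitor and the file name are arbitrary choices, and learn is the language-model learner from the lesson):

from fastai.callback.tracker import SaveModelCallback

learn.fit_one_cycle(10, 2e-3,
    cbs=SaveModelCallback(monitor='accuracy', fname='best_lm'))
# saves to learn.path/learn.model_dir/'best_lm.pth' whenever the monitored
# metric improves, and loads the best weights back at the end of training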

6 Likes

Sometimes when I call .predict on the language model, I also get an extra word at the beginning of the sequence:

learn.predict('a|m e|m a|m e|m a|m g| em| a|m', temperature=1.0, n_words=1, no_unk=False, no_bar=True)

gives:

c| a|m e|m a|m e|m a|m g| c| a|m e|m

(note the “c|” at the beginning of the sequence)

Why is that?

I should read the reference documentation more :slight_smile:. Thank you!

1 Like

Don’t worry, there is so much functionality available that it’s hard to keep track of it all :slight_smile:

1 Like