Fastbook Chapter 10 questionnaire solutions (wiki)

Here are the questions:

  1. What is self-supervised learning?

Training a model without the use of labels. An example is a language model.

  1. What is a language model?

A language model is a self-supervised model that tries to predict the next word of a given passage of text.

  1. Why is a language model considered self-supervised learning?

There are no labels (ex: sentiment) provided during training. Instead, the model learns to predict the next word by reading lots of provided text with no labels.

  1. What are self-supervised models usually used for?

Sometimes, they are used by themselves. For example, a language model can be used for autocomplete algorithms! But often, they are used as a pre-trained model for transfer learning.

  1. Why do we fine-tune language models?

We can fine-tune the language model on the corpus of the desired downstream task, since the original pre-trained language model was trained on a corpus that is slightly different than the one for the current task.

  1. What are the three steps to create a state-of-the-art text classifier?
  1. Train a language model on a large corpus of text (already done for ULMFiT by Sebastian Ruder and Jeremy!)
  2. Fine-tune the language model on the text classification dataset
  3. Fine-tune the language model as a text classifier
  1. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?

By learning how to predict the next word of a movie review, the model better understands the language style and structure of the text classification dataset and can, therefore, perform better when fine-tuned as a classifier.

  1. What are the three steps to prepare your data for a language model?
  1. Tokenization
  2. Numericalization
  3. Language model DataLoader
  1. What is tokenization? Why do we need it?

Tokenization is the process of converting text into a list of tokens (often words). It is not as simple as splitting on spaces, so we need a tokenizer that handles complicated cases like punctuation, hyphenated words, etc.

  1. Name three different approaches to tokenization.
  1. Word-based tokenization
  2. Subword-based tokenization
  3. Character-based tokenization
  1. What is ‘xxbos’?

This is a special token added by fastai that indicates the beginning of the text.

  1. List 4 rules that fastai applies to text during tokenization.

Here are all the rules:

  • fix_html :: replace special HTML characters with a readable version (IMDb reviews have quite a few of them, for instance) ;
  • replace_rep :: replace any character repeated three times or more with a special token for repetition (xxrep), the number of times it’s repeated, then the character ;
  • replace_wrep :: replace any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it’s repeated, then the word ;
  • spec_add_spaces :: add spaces around / and # ;
  • rm_useless_spaces :: remove all repetitions of the space character ;
  • replace_all_caps :: lowercase a word written in all caps and add a special token for all caps (xxup) in front of it ;
  • replace_maj :: lowercase a capitalized word and add a special token for capitalized (xxmaj) in front of it ;
  • lowercase :: lowercase all text and add a special token at the beginning (xxbos) and/or the end (xxeos).
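
Here is a minimal sketch showing several of these rules in action (it assumes fastai v2 is installed and mirrors the chapter's Tokenizer(WordTokenizer()) setup; the exact tokens can vary a little between versions):

from fastai.text.all import WordTokenizer, Tokenizer

# wrap the default word tokenizer so fastai's rules (fix_html, replace_rep, replace_all_caps, ...) are applied
tkn = Tokenizer(WordTokenizer())
print(tkn("This movie was GREAT. She loved it sooooo much"))
# roughly: ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'xxup', 'great', '.',
#           'xxmaj', 'she', 'loved', 'it', 's', 'xxrep', '5', 'o', 'much']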
  1. Why are repeated characters replaced with a token showing the number of repetitions, and the character that’s repeated?

We can expect that repeated characters could have special or different meaning than just a single character. By replacing them with a special token showing the number of repetitions, the model’s embedding matrix can encode information about general concepts such as repeated characters rather than requiring a separate token for every number of repetitions of every character.

  1. What is numericalization?

This refers to the mapping of the tokens to integers to be passed into the model.
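
A minimal sketch following the chapter's Numericalize API (the tiny corpus and min_freq=1 below are just for illustration):

from fastai.text.all import WordTokenizer, Tokenizer, Numericalize

tkn = Tokenizer(WordTokenizer())
toks = tkn("This movie was great . This movie was fun")
num = Numericalize(min_freq=1)   # the real default min_freq is 3; lowered here so nothing maps to xxunk
num.setup([toks])                # build the vocab from a (tiny) list of tokenized texts
print(num(toks))                 # a tensor of integer ids, one per token
print(num.vocab)                 # position i of the vocab is the token for id i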

  1. Why might there be words that are replaced with the “unknown word” token?

If every word in the dataset had its own token, the embedding matrix would be very large, increasing memory usage and slowing down training. Therefore, only words that appear at least min_freq times are assigned a token and then a number, while the rest are replaced with the “unknown word” token (xxunk).

  1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)

a. The dataset is split into 64 mini-streams (batch size)
b. Each batch has 64 rows (batch size) and 64 columns (sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65-128)
I think this visualisation will help us understand this question better.
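
Here is a plain NumPy sketch of that splitting (the token ids are just made-up integers, with bs = seq_len = 64 as in the question):

import numpy as np

bs, seq_len = 64, 64
stream = np.arange(bs * seq_len * 10)        # a pretend stream of token ids (10 batches' worth)

# split the whole stream into bs mini-streams: row i is the i-th mini-stream
mini_streams = stream.reshape(bs, -1)

first_batch  = mini_streams[:, :seq_len]             # row 0 holds tokens 1-64 of mini-stream 0
second_batch = mini_streams[:, seq_len:2 * seq_len]  # row 0 holds tokens 65-128 of mini-stream 0

print(first_batch[0, -1], second_batch[0, 0])        # consecutive ids: batch 2 row 0 continues batch 1 row 0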

  1. Why do we need padding for text classification? Why don’t we need it for language modeling?

Since the documents have variable sizes, padding is needed to collate the batch. Other approaches, like cropping or squishing, either negatively affect training or do not make sense in this context, so padding is used. It is not required for language modeling, since there the documents are all concatenated into a single stream.
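
fastai's classifier DataLoaders handle this padding for you; as a rough illustration of the underlying idea (plain PyTorch here, not fastai's exact mechanism):

import torch
from torch.nn.utils.rnn import pad_sequence

# three "documents" of different lengths, as token-id tensors
docs = [torch.tensor([4, 8, 15]), torch.tensor([16, 23]), torch.tensor([42, 4, 8, 15, 16])]

# pad with a dedicated pad id so every row has the same length and the batch can be stacked
batch = pad_sequence(docs, batch_first=True, padding_value=1)
print(batch.shape)   # torch.Size([3, 5])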

  1. What does an embedding matrix for NLP contain? What is its shape?

It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size (vocab_size x embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
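
In PyTorch terms (the sizes below are only illustrative; I believe the AWD-LSTM used in the chapter uses 400-dimensional embeddings):

import torch.nn as nn

vocab_size, emb_size = 60_000, 400   # sizes are illustrative
emb = nn.Embedding(vocab_size, emb_size)
print(emb.weight.shape)              # torch.Size([60000, 400]): one row (vector) per token in the vocab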

  1. What is perplexity?

Perplexity is a commonly used metric in NLP for language models. It is the exponential of the loss.
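
In other words, if the loss is cross-entropy, perplexity is just exp(loss); fastai also exposes it as the Perplexity() metric. A tiny sketch:

import torch
import torch.nn.functional as F

logits  = torch.randn(8, 100)          # a batch of 8 predictions over a 100-token vocab
targets = torch.randint(0, 100, (8,))  # the actual next tokens
loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)           # perplexity is the exponential of the loss
print(loss.item(), perplexity.item())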

  1. Why do we have to pass the vocabulary of the language model to the classifier data block?

This is to ensure the same correspondence of tokens to index so the model can appropriately use the embeddings learned during LM fine-tuning.
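
In the chapter this is done by passing vocab=dls_lm.vocab when building the classifier DataBlock, roughly like this (path and dls_lm come from earlier in the chapter):

from functools import partial
from fastai.text.all import *

# path points at the IMDb data and dls_lm is the language-model DataLoaders from earlier
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)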

  1. What is gradual unfreezing?

This refers to unfreezing the pretrained model a few layers at a time, starting from the final layers, and fine-tuning at each stage rather than unfreezing everything at once.
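
In the chapter this looks roughly like the following (learn is the text classifier learner and the exact learning rates are from memory), unfreezing a few more layer groups before each round of training:

learn.fit_one_cycle(1, 2e-2)                        # only the new classification head trains
learn.freeze_to(-2)                                 # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
learn.freeze_to(-3)                                 # unfreeze one more group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
learn.unfreeze()                                    # finally train the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))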

  1. Why is text generation always likely to be ahead of automatic identification of machine generated texts?

The classification models could be used to improve text generation algorithms (evading the classifier) so the text generation algorithms will always be ahead.

15 Likes

Question 16 is a little unclear IMO. Not sure exactly what it is asking.
It doesn’t seem Question 18 is really discussed in the chapter?

4 Likes

Thanks for posting your answers! I’m also looking for the official answers. What is the book’s website though? It seems it should be https://book.fast.ai but it redirects to the course site now. Is there an official place I’m unaware of?

The book and course website have been merged; everything now lives on the course site.

2 Likes

In regards to this question:

  1. What is “tokenization”? Why do we need it?

I don’t think the book really dwells on why this is needed, only on some of the challenges. Tokenization is needed because we’re examining texts for patterns. It won’t be possible to find such patterns if we only consider a text at the sentence level, since patterns would only emerge if certain sentences frequently appeared together in sets, and this is a very rare phenomenon. Needless to say, examining texts in their entirety would be even less productive. We only find interesting patterns when we examine texts at the level of individual words, subwords, or individual characters.

  1. What does an embedding matrix for NLP contain? What is its shape?

It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size (vocab_size x embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.

Please let me know if this is wrong or unclear!

  1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)

a. The batch has 64 rows (batch size) and also 64 columns (sequence length)
b. The second row of the first batch contains the tokens beginning at 1/64th of the dataset
c. The first row of the second batch continues the first row of the first batch (tokens 65-128)

I think the example would be a little clearer if we used different values for batch size and sequence length. For example:

With a batch size of 20, the first row of the tensor representing the first batch contains the first 50 tokens for the dataset.

a. The dataset is split into 20 mini-streams
b. Each batch has 20 rows and 50 columns
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-50)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 51-100)

@ilovescience, any thoughts?

2 Likes

I think this looks fine (I have to review some of this material myself lol) but the main post is a wiki so feel free to edit accordingly and add your answers! :slightly_smiling_face:

2 Likes

Wow, almost 3 years have passed! I’m not sure if this is helpful, but as an additional tip about Q18, I believe that if you read the passage below carefully, the concept should become much clearer:
“Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors names, for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words we won’t have anything, so we will just initialize the corresponding row with a random vector.”

So, just like in almost all (parametric) machine learning algorithms, we initialize our parameters (vectors, matrices, or tensors) with random values; then, as the learning phase iterates, we hope these values converge to a proper set of parameters, giving us a good estimate of our embedding matrix. If that’s still unclear, I encourage the curious reader to check out Chris Manning’s 2021 video lecture, which provides plenty of detail about how language models are trained.
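
Here is a toy sketch of what that passage describes (all the names and sizes below are made up for illustration):

import torch

emb_size = 4
old_vocab = ['xxunk', 'the', 'movie', 'good']                      # pretrained model's vocab
new_vocab = ['xxunk', 'the', 'movie', 'cinematography', 'good']    # our corpus adds a new word

old_emb = torch.randn(len(old_vocab), emb_size)   # stands in for the pretrained embedding weights
old_idx = {w: i for i, w in enumerate(old_vocab)}

# start from random rows, then copy over the rows of words the pretrained model already knows
new_emb = torch.randn(len(new_vocab), emb_size)
for i, w in enumerate(new_vocab):
    if w in old_idx:
        new_emb[i] = old_emb[old_idx[w]]
# 'cinematography' keeps its random row; every other row reuses the pretrained vector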

Hi, thanks for the solutions!

Regarding #22, the same argument could be used the other way round: one can always improve the detector by training it on the output of a generator trained to evade detection. It’s really symmetric this way. Training a generator/detector pair together in an arms race is the idea behind GANs (generative adversarial networks). In my opinion, the symmetry invalidates the argument for why generation will always be ahead of detection.

Please help with the question 16.

The book at fastbook/10_nlp.ipynb at master · fastai/fastbook · GitHub, in the section “Putting Our Texts into Batches for a Language Model”, in the paragraph starting with

“We now have 90 tokens, separated by spaces. Let’s say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:”

makes it look like each row (or mini-stream) is the next chunk of the split text.
So for text “abcdefg”, batch size of 2, and batch length of 3, we have:

abc  g???
def

However, in the explanation above, every row seems to be a copy of the same text, as if a batch size of 64, for example, meant 64 identical rows of text as a stream, which we then split into chunks of a certain size.
So for text “abcdefg”, batch size of 2, and batch length of 3, we have:

abc def g
abc def g

Please advise:

  1. How do we actually split the text? What exactly is in the batch row?
  2. What dictates the batch (sequence/chunk/batch row) length?
    And what is the right terminology?

Thank you!
D

I tried your sample text "abcdefg" with a batch size of 2 and sequence length of 3 and got the following result:

#hide_input
# assumes the chapter notebook's setup; recreated here so the snippet runs standalone
import numpy as np, pandas as pd
from IPython.display import display, HTML
from fastai.text.all import WordTokenizer, Tokenizer
tkn = Tokenizer(WordTokenizer())

stream = "a b c d e f g"
tokens = tkn(stream)
bs,seq_len = 2,3
# take bs consecutive chunks of seq_len tokens each from the start of the stream
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

Output:
xxbos  a  b
c      d  e

The beginning-of-text token xxbos takes up one of the 6 available spots, so only 5 of the letters appear. Each row represents one sequence in the batch (batch size of 2 = 2 rows) and each column represents one element in the sequence (sequence length of 3 = 3 columns).

I’m not sure what you are referring to when you say that:

However, in the explanation above, every row seems to be a copy of the same text

In the Chapter 10 examples, each displayed array is one batch, and every row within it is a different sequence (one per mini-stream). For example, here the batch size is defined as 6, so there are 6 rows of text and 5 columns (sequence length = 5) in this array:

[first 6x5 array from the chapter]

The next batch (the next 5 columns of each mini-stream) also has 6 rows, one per mini-stream; you can see that the first row in this array is a continuation of the sentence in the first row of the first array:

[second 6x5 array from the chapter]

And the final array continues those sentences (and the full 90-token stream is finished with a period at the bottom right-most token):

[third 6x5 array from the chapter]

Let me know if any of that helps!

@vbakshi
Thank you for the detailed reply!

1. Please explain how the text “wraps”.
In the example with the “abcdefg” string/text, you have a screenshot of the output of the first batch.
In it, the first 6 symbols wrap within this batch: “xxbos a b” on row 1, and “c d e” on row 2. And I assume the remaining “f g” is in the next sequence?

However,
in the Chapter 10 examples the text is cut into sequences, and each line seems to span across multiple batches. For example, the text is split as:

Line # | Sequence 1                  | Sequence 2               | Sequence 3
1      | xxbos xxmaj in this chapter | , we will go back        | over the example of classifying
2      | movie reviews we studied in | chapter 1 and dig deeper | under the surface . xxmaj

As you can see, the text started in row 1 of batch 1 and continued into row 1 of batch 2, etc.
Why, in the first example, did the text not continue into batch 2, but rather wrap within batch 1?

in short:
example 1 is
abc gh
def

example 2 is
abc def
gh

  1. Also, are terms “batch” and “sequence” synonyms?

Thank you!

1 Like

I don’t think “batch” and “sequence” are synonyms, but I do think “batch size” and “number of sequences in a batch” refer to the same thing. In the chapter example, each batch contains 6 sequences, or in other words, the batch size is 6.

seq_len is the number of tokens in each sequence. So we would say that each batch contains 6 sequences of 5 tokens each.

What you have written as “Sequence 1”, “Sequence 2”, and “Sequence 3” are actually “Batch 1”, “Batch 2” and “Batch 3”.

I found it important to look at the following three lines of code in the chapter examples—I recommend making sure you understand what’s happening in the list comprehension:

# from the chapter notebook, where bs, seq_len = 6, 5 and tokens is the 90-token stream
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
#d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])

# rewriting this one to match the pattern in the first two
d_tokens = np.array([tokens[i*15+2*seq_len:i*15+3*seq_len] for i in range(bs)])

In each definition of d_tokens you’ll notice the use of 15: each mini-stream is 15 tokens long, because each sequence is 5 tokens and there are 3 batches.

In each array you skip over 15 tokens at a time and capture 5-token sequences.

d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])

The list comprehension inside np.array captures seq_len (5) tokens at a time, 15 tokens apart.

Comparing all three d_tokens arrays with the original stream visually will hopefully show more clearly how the “wrapping” effect occurs.

1 Like