Fastbook Chapter 10 questionnaire solutions (wiki)

Here are the questions:

  1. What is self-supervised learning?

Training a model without the use of labels. An example is a language model.

  2. What is a language model?

A language model is a self-supervised model that tries to predict the next word of a given passage of text.

  3. Why is a language model considered self-supervised learning?

There are no labels (ex: sentiment) provided during training. Instead, the model learns to predict the next word by reading lots of provided text with no labels.

  4. What are self-supervised models usually used for?

Sometimes, they are used by themselves. For example, a language model can be used for autocomplete algorithms! But often, they are used as a pre-trained model for transfer learning.

  5. Why do we fine-tune language models?

We can fine-tune the language model on the corpus of the desired downstream task, since the original pre-trained language model was trained on a corpus that is slightly different than the one for the current task.

  6. What are the three steps to create a state-of-the-art text classifier?

     1. Train a language model on a large corpus of text (already done for ULMFiT by Sebastian Ruder and Jeremy!)
     2. Fine-tune the language model on the text classification dataset.
     3. Fine-tune the language model as a text classifier instead.
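For concreteness, here is a rough sketch of all three steps with fastai on IMDb. The hyperparameters are placeholders, and `fine_tune` stands in for the longer `fit_one_cycle` + gradual-unfreezing schedule the chapter actually uses:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1 is already done for us: AWD_LSTM comes pretrained on Wikipedia text.

# Step 2: fine-tune the language model on the IMDb corpus (including the unlabeled reviews).
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=partial(get_text_files, folders=['train', 'test', 'unsup']),
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder('finetuned')

# Step 3: fine-tune that language model as a text classifier.
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_items=partial(get_text_files, folders=['train', 'test']),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1, 2e-2)
```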
  7. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?

By learning how to predict the next word of a movie review, the model better understands the language style and structure of the text classification dataset and can, therefore, perform better when fine-tuned as a classifier.

  8. What are the three steps to prepare your data for a language model?

     1. Tokenization
     2. Numericalization
     3. Language model DataLoader
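A minimal end-to-end sketch of those three steps on a made-up two-review corpus (the real pipeline is handled for you by `TextBlock`/`TextDataLoaders`; the batch size and sequence length here are tiny just so it runs on this toy data):

```python
from fastai.text.all import *

# A tiny stand-in corpus; in the chapter this comes from the IMDb reviews.
txts = L(["This movie was amazing , truly amazing .",
          "This movie was terrible ."])

# 1. Tokenization: split each text into tokens, applying fastai's rules.
tok = Tokenizer(WordTokenizer())
toks = txts.map(tok)

# 2. Numericalization: build a vocab and map every token to an integer.
num = Numericalize(min_freq=1)
num.setup(toks)
nums = toks.map(num)

# 3. Language model DataLoader: stream the numericalized texts into
#    (input, target) batches where the target is the input shifted by one token.
dl = LMDataLoader(nums, bs=2, seq_len=4)
x, y = first(dl)
print(x.shape, y.shape)   # both torch.Size([2, 4])
```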
  9. What is tokenization? Why do we need it?

Tokenization is the process of converting text into a list of tokens (often words, but possibly subwords or characters). It is not as simple as splitting on the spaces. Therefore, we need a tokenizer that deals with complicated cases like punctuation, hyphenated words, etc.

  10. Name three different approaches to tokenization.

     1. Word-based tokenization
     2. Subword-based tokenization
     3. Character-based tokenization
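A quick way to compare the three approaches with fastai (the subword tokenizer needs the sentencepiece package and must be fit on a corpus first; the IMDb sample is used here for that, similar to the chapter):

```python
from fastai.text.all import *

txt = "Tokenization isn't as simple as splitting on spaces!"

# Word-based tokenization (spaCy under the hood)
word_tok = WordTokenizer()
print(first(word_tok([txt])))

# Subword-based tokenization (SentencePiece); it first learns a subword
# vocabulary from some corpus, here a few reviews from the IMDb sample.
df = pd.read_csv(untar_data(URLs.IMDB_SAMPLE)/'texts.csv')
sub_tok = SubwordTokenizer(vocab_sz=1000)
sub_tok.setup(L(df['text'])[:200])
print(first(sub_tok([txt])))

# Character-based tokenization: in the simplest case, just the list of characters.
print(list(txt))
```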
  11. What is ‘xxbos’?

This is a special token added by fastai that indicates the beginning of a text (“BOS” stands for beginning of stream).

  12. List 4 rules that fastai applies to text during tokenization.

Here are all the rules:

  • fix_html :: replace special HTML characters by a readable version (IMDb reviews have quite a few of them for instance) ;
  • replace_rep :: replace any character repeated three times or more by a special token for repetition (xxrep), the number of times it’s repeated, then the character ;
  • replace_wrep :: replace any word repeated three times or more by a special token for word repetition (xxwrep), the number of times it’s repeated, then the word ;
  • spec_add_spaces :: add spaces around / and # ;
  • rm_useless_spaces :: remove all repetitions of the space character ;
  • replace_all_caps :: lowercase a word written in all caps and add a special token for all caps (xxup) in front of it ;
  • replace_maj :: lowercase a capitalized word and add a special token for capitalized (xxmaj) in front of it ;
  • lowercase :: lowercase all text and add a special token at the beginning (xxbos) and/or the end (xxeos).
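You can see several of these rules fire at once by tokenizing a contrived string (the exact output may differ slightly between fastai versions, so the expected tokens below are only approximate):

```python
from fastai.text.all import *

tkn = Tokenizer(WordTokenizer())
print(coll_repr(tkn("WOW!!! This movie was sooooo good..."), 40))
# Expect something like: xxbos, xxup, wow, xxrep, 3, !, xxmaj, this, movie,
# was, s, xxrep, 5, o, good, xxrep, 3, .
```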
  13. Why are repeated characters replaced with a token showing the number of repetitions, and the character that’s repeated?

We can expect that repeated characters carry a different meaning than a single character. By replacing them with a special token showing the number of repetitions, the model’s embedding matrix can encode the general concept of “repeated character” rather than requiring a separate token for every number of repetitions of every character.

  14. What is numericalization?

This refers to mapping the tokens to integers so they can be passed into the model.
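A small example of what that mapping looks like with fastai’s Numericalize (the token lists and the lowered min_freq are made up just for illustration):

```python
from fastai.text.all import *

toks = L(["xxbos xxmaj this movie was great .".split(),
          "xxbos xxmaj this movie was awful .".split()])

num = Numericalize(min_freq=1)      # default min_freq is 3; lowered here for the tiny corpus
num.setup(toks)                     # builds the vocab from the token frequencies
print(num.vocab[:12])               # special tokens first, then corpus tokens
print(num(toks[0]))                 # the first text as a tensor of integer indices
# Any token seen fewer than min_freq times would be mapped to xxunk instead.
```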

  15. Why might there be words that are replaced with the “unknown word” token?

If every word in the dataset had its own token, the embedding matrix would be very large, increasing memory usage and slowing down training. Therefore, only words that occur at least min_freq times are assigned a token (and therefore a number), while the rest are replaced with the “unknown word” token (xxunk).

  16. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)

a. The dataset is split into 64 mini-streams (batch size)
b. Each batch has 64 rows (batch size) and 64 columns (sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65-128)
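Here is a toy illustration of that layout in plain PyTorch, with each token’s “id” equal to its position in the corpus. It only mimics the layout fastai’s LMDataLoader produces, it is not fastai’s implementation:

```python
import torch

bs, seq_len = 64, 64
stream = torch.arange(64 * 64 * 2 + 64)        # the whole concatenated corpus; token i has id i

# Split the corpus into bs equal mini-streams (one per row), then read each
# mini-stream seq_len tokens at a time to form successive batches.
ministream_len = len(stream) // bs
ministreams = stream[:bs * ministream_len].view(bs, ministream_len)

batch1 = ministreams[:, 0*seq_len:1*seq_len]
batch2 = ministreams[:, 1*seq_len:2*seq_len]

print(batch1[0, :5])   # first row of batch 1: tokens 0-4 (start of mini-stream 1)
print(batch1[1, :5])   # second row of batch 1: the start of mini-stream 2
print(batch2[0, :5])   # first row of batch 2: tokens 64-68, continuing row 1 of batch 1
```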

  17. Why do we need padding for text classification? Why don’t we need it for language modeling?

Since the documents have variable sizes, padding is needed to collate them into a batch. Other approaches, like cropping or squishing, either negatively affect training or do not make sense in this context, so padding is used. It is not required for language modeling, since there the documents are all concatenated into one long stream.
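A minimal illustration of the collation problem, using plain torch.nn.utils.rnn.pad_sequence rather than fastai’s own padding transform:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three numericalized "documents" of different lengths.
docs = [torch.tensor([5, 8, 2]),
        torch.tensor([7, 3]),
        torch.tensor([9, 4, 6, 1, 2])]

# They cannot be stacked into one rectangular batch as-is, so the shorter ones
# are padded (fastai uses a special xxpad token; id 1 stands in for it here).
batch = pad_sequence(docs, batch_first=True, padding_value=1)
print(batch.shape)   # torch.Size([3, 5])
print(batch)
```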

  18. What does an embedding matrix for NLP contain? What is its shape?

It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size (vocab_size x embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
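In PyTorch terms, this is just the weight of an nn.Embedding layer. The sizes below are examples (fastai’s default max_vocab is 60,000 and AWD_LSTM’s default embedding size is 400):

```python
import torch.nn as nn

vocab_size, emb_size = 60_000, 400
emb = nn.Embedding(vocab_size, emb_size)
print(emb.weight.shape)   # torch.Size([60000, 400]) -> one emb_size-dim vector per token
```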

  19. What is perplexity?

Perplexity is a commonly used metric in NLP for language models. It is the exponential of the (cross-entropy) loss.
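In code, this is what fastai’s Perplexity() metric computes from the running cross-entropy loss:

```python
import torch

loss = torch.tensor(4.0)       # a cross-entropy loss from a language model
perplexity = torch.exp(loss)   # perplexity = exp(loss)
print(perplexity)              # tensor(54.5982)
```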

  20. Why do we have to pass the vocabulary of the language model to the classifier data block?

This is to ensure the classifier uses the same correspondence of tokens to indices, so the model can properly reuse the embeddings learned during language model fine-tuning.

  21. What is gradual unfreezing?

This refers to unfreezing the pretrained model one layer group at a time, starting from the last layers, and fine-tuning after each step, rather than unfreezing everything at once.
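For reference, the unfreezing schedule used in the chapter for the IMDb classifier looks roughly like this (assuming learn is the text classifier Learner with the fine-tuned encoder already loaded):

```python
learn.fit_one_cycle(1, 2e-2)                        # only the new head is trainable at first

learn.freeze_to(-2)                                 # unfreeze the last layer group as well
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

learn.freeze_to(-3)                                 # unfreeze one more layer group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))

learn.unfreeze()                                    # finally, train the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
```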

  22. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

The classification models could themselves be used to improve the text generation algorithms (by training the generator to evade the classifier), so text generation will always stay a step ahead.


Question 16 is a little unclear IMO. Not sure exactly what it is asking.
It doesn’t seem Question 18 is really discussed in the chapter?


Thanks for posting your answers! I’m also looking for the official answers. What is the book’s website though? It seems it should be https://book.fast.ai but it redirects to the course site now. Is there an official place I’m unaware of?

The book and course website have been merged.


In regards to this question:

  9. What is “tokenization”? Why do we need it?

I don’t think the book really dwells on why this is needed, only on some of the challenges. Tokenization is needed because we’re examining texts for patterns. It won’t be possible to find such patterns if we only consider a text at the sentence level, since patterns would only emerge if certain sentences frequently appeared together in sets, and this is a very rare phenomenon. Needless to say, examining texts in their entirety would be even less productive. We only find interesting patterns when we examine texts at the level of individual words, substrings, or individual characters.

  18. What does an embedding matrix for NLP contain? What is its shape?

It contains vector representations of all tokens in the vocabulary. The embedding matrix has the size (vocab_size x embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.

Please let me know if this is wrong or unclear!

  16. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer against the book website.)

a. The batch has 64 rows (batch size) and also 64 columns (sequence length)
b. The second row of the first batch contains the tokens beginning at 1/64th of the dataset
c. The first row of the second batch continues the first row of the first batch (tokens 65-128)

I think the example would be a little clearer if we used different values for batch size and sequence length. For example:

With a batch size of 20, the first row of the tensor representing the first batch contains the first 50 tokens for the dataset.

a. The dataset is split into 20 mini-streams
b. Each batch has 20 rows and 50 columns
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-50)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 51-100)

@ilovescience, any thoughts?


I think this looks fine (I have to review some of this material myself lol) but the main post is a wiki so feel free to edit accordingly and add your answers! :slightly_smiling_face:


Wow, almost 3 years have passed! I’m not sure if this is helpful, but as an additional tip for Q18, I believe if you read the passage below carefully you’ll get a chance to grasp the concept vividly:
“Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors names, for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words we won’t have anything, so we will just initialize the corresponding row with a random vector.”

So, just like almost all (parametric) machine learning algorithms, we initialize our parameters (which can be vectors, matrices, or tensors) with randomly generated values; then, as the learning phase iterates, we hope this set of random numbers eventually converges to a proper set of parameters, giving us a good estimate of our embedding matrix. If that’s still unclear, I encourage the curious reader to check out Chris Manning’s 2021 video lecture, which will definitely provide enough details about the process of language model algorithms.
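A toy sketch of that initialization idea (the vocabularies, sizes, and the 0.1 scale here are made up purely to illustrate the quoted passage):

```python
import torch

emb_size = 400
pretrained_vocab = ['the', 'movie', 'was', 'great']
pretrained_emb = torch.randn(len(pretrained_vocab), emb_size)   # stands in for the real pretrained matrix
old_idx = {w: i for i, w in enumerate(pretrained_vocab)}

new_vocab = ['the', 'movie', 'was', 'great', 'cinematography', 'tarkovsky']
new_emb = torch.zeros(len(new_vocab), emb_size)
for i, w in enumerate(new_vocab):
    if w in old_idx:
        new_emb[i] = pretrained_emb[old_idx[w]]   # known word: copy the pretrained vector
    else:
        new_emb[i] = torch.randn(emb_size) * 0.1  # new word: random initialization

print(new_emb.shape)   # torch.Size([6, 400])
```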

Hi, thanks for the solutions!

Regarding #22, the same argument could be used the other way round: one can always improve the detector by training it on the output of a generator trained to evade detection. It’s really symmetric this way. Training a generator/detector pair together in an arms race is the idea of GANs (generative adversarial networks). In my opinion, the symmetry invalidates the argument for why generation will always be ahead of detection.