Fastbook Chapter 12 questionnaire (wiki)

Here are the questions:

  1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?

Perhaps create the simplest possible dataset that allows for quick and easy prototyping. For example, Jeremy created a “human numbers” dataset.

  2. Why do we concatenate the documents in our dataset before creating a language model?

To create a continuous stream of input/target words, which can then be split into batches of a fixed sequence length.
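Concretely, in the chapter all the documents are joined into one long list of token IDs, nums, which is then sliced into (input, target) pairs (L and tensor come from fastai):

# three consecutive tokens as input, the token that follows them as target
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))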

  3. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?

a. Use the same weight matrix for each of the three layers.
b. Use the first word’s embedding as the activations passed to the first linear layer, add the second word’s embedding to the first layer’s output activations, and continue likewise for the remaining words.

  4. How can we share a weight matrix across multiple layers in PyTorch?

Define the layer once in the PyTorch model class, and use it multiple times in the forward method (as LMModel1 below does with self.h_h).

  5. Write a module which predicts the third word given the previous two words of a sentence, without peeking.

Same code as in the chapter:

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: word embeddings
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden: the shared weight matrix
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: scores over the vocab

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))  # first word's embedding
        h = h + self.i_h(x[:,1])                # add the second word's embedding
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])                # add the third word's embedding
        h = F.relu(self.h_h(h))
        return self.h_o(h)                      # predict the fourth word
  6. What is a recurrent neural network?

A refactoring of a multi-layer neural network as a loop.
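For example, the chapter refactors LMModel1 above into LMModel2, replacing the repeated code with a loop:

class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0                       # hidden state starts at zero
        for i in range(3):          # one iteration per input word
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)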

  7. What is hidden state?

The activations that are updated after each step of a recurrent neural network.

  8. What is the equivalent of hidden state in LMModel1?

It is the variable h in LMModel1.

  9. To maintain the state in an RNN, why is it important to pass the text to the model in order?

Because the state is maintained across all batches regardless of sequence length, which is only useful if the text is passed in order.

  10. What is an unrolled representation of an RNN?

A representation without loops, depicted as a standard multilayer network.

  11. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?

Since the hidden state is maintained through every call of the model, backpropagation would have to compute gradients through all of the past calls as well, which leads to high memory usage and slow training. Therefore, after every call, the detach method is called to discard the gradient history of previous calls while keeping the current values of the hidden state.
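In the chapter this happens at the end of forward in the stateful model (LMModel3); roughly:

def forward(self, x):
    for i in range(3):
        self.h = self.h + self.i_h(x[:,i])
        self.h = F.relu(self.h_h(self.h))
    out = self.h_o(self.h)
    self.h = self.h.detach()   # keep the values, drop the gradient history
    return out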

  12. What is BPTT?

Backpropagation through time: calculating backpropagation only for the given batch, and therefore only doing backprop for the defined sequence length of the batch rather than for the full history of the hidden state.

  13. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <>.
  14. What does the ModelResetter callback do? Why do we need it?

It resets the hidden state of the model before every epoch and before every validation run.
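Conceptually the callback is tiny; a sketch along these lines, with the hook names assumed from fastai’s Callback API:

class ModelResetter(Callback):
    # zero the model's hidden state so state from one run doesn't leak into the next
    def before_train(self):    self.model.reset()
    def before_validate(self): self.model.reset()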

  15. What are the downsides of predicting just one output word for each three input words?

There are words in between that are not being predicted, so information that could be used for training the model is wasted. To solve this, we apply the output layer to every hidden state produced, so the model predicts an output word after each input word (offset by one).
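Concretely, the forward method collects a prediction at every step instead of returning only the last one; roughly as in the chapter’s LMModel4, assuming self.h was initialised to 0 in __init__ and sl is the sequence length defined in the notebook:

def forward(self, x):
    outs = []
    for i in range(sl):
        self.h = self.h + self.i_h(x[:,i])   # fold in the next word's embedding
        self.h = F.relu(self.h_h(self.h))
        outs.append(self.h_o(self.h))        # predict after *every* word
    self.h = self.h.detach()                 # truncate the gradient history
    return torch.stack(outs, dim=1)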

  16. Why do we need a custom loss function for LMModel4?

CrossEntropyLoss expects flattened tensors, but LMModel4 outputs a prediction for every time step, so we must flatten the sequence dimension into the batch dimension first.
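The chapter’s fix flattens both predictions and targets before calling F.cross_entropy:

def loss_func(inp, targ):
    # inp: (batch, seq_len, vocab) -> (batch*seq_len, vocab); targ: (batch, seq_len) -> (batch*seq_len)
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))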

  17. Why is the training of LMModel4 unstable?

Because this network is effectively very deep (one layer per token), which can lead to very small or very large gradients that don’t train well.

  18. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?

Because only one weight matrix is really being used in the unrolled network, no matter how many time steps there are. Stacking RNNs adds genuinely distinct layers, each with its own weight matrices, which can improve on this.
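With PyTorch’s built-in module, stacking is just an argument (the chapter’s LMModel5 uses nn.RNN this way); a minimal sketch, assuming n_hidden is defined:

# two stacked RNN layers, each with its own weight matrices
rnn = nn.RNN(n_hidden, n_hidden, num_layers=2, batch_first=True)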

  19. Draw a representation of a stacked (multilayer) RNN.

  20. Why should we get better results in an RNN if we call detach less often? Why might this not happen in practice with a simple RNN?
  21. Why can a deep network result in very large or very small activations? Why does this matter?

Numbers that are just slightly higher or lower than one can lead to the explosion or disappearance of numbers after repeated multiplications. In deep networks, we have repeated matrix multiplications, so this is a big problem.
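A quick illustration in plain Python:

print(1.01 ** 100)  # ≈ 2.705: slightly above one explodes
print(0.99 ** 100)  # ≈ 0.366: slightly below one vanishes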

  22. In a computer’s floating point representation of numbers, which numbers are the most precise?

Small numbers, as long as they are not too close to zero.

  23. Why do vanishing gradients prevent training?

Gradients that are zero can’t contribute to training because they don’t change any weights.

  24. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?

a. One state remembers what happened earlier in the sentence
b. The other predicts the next token

  25. What are these two states called in an LSTM?

a. Cell state (long short-term memory)
b. Hidden state (predict next token)

  26. What is tanh, and how is it related to sigmoid?

It’s just a sigmoid function rescaled to the range -1 to 1.
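Concretely, tanh(x) = 2 * sigmoid(2x) - 1, which is easy to verify:

import torch
x = torch.linspace(-3, 3, 7)
print(torch.allclose(torch.tanh(x), 2*torch.sigmoid(2*x) - 1))  # True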

  27. What is the purpose of this code in LSTMCell?: h = torch.stack([h, input], dim=1)

This should actually be torch.cat([h, input], dim=1). It joins the hidden state and the new input so both can be processed together.

  28. What does chunk do in PyTorch?

It splits a tensor into a given number of equal-sized pieces.
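For example:

import torch
t = torch.arange(8)
print(t.chunk(2))  # (tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]))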

  29. Study the refactored version of LSTMCell carefully to ensure you understand how and why it does the same thing as the non-refactored version.
  30. Why can we use a higher learning rate for LMModel6?

Because the LSTM provides a partial solution to exploding/vanishing gradients, so training is more stable (?)

  31. What are the three regularisation techniques used in an AWD-LSTM model?

a. Dropout
b. Activation regularization
c. Temporal activation regularization
  32. What is dropout?

Randomly deleting (zeroing out) activations during training.

  33. Why do we scale the weights with dropout? Is this applied during training, inference, or both?

a. The scale of the activations changes if we sum them up: it makes a difference whether all activations are present or whether each is dropped with probability p. To correct the scale, a division by (1-p) is applied.
b. In the implementation in the book, it is applied during training.
c. It should be possible to do it either way: divide by (1-p) during training, or multiply by (1-p) at inference.

  34. What is the purpose of this line from Dropout?: if not self.training: return x

When not in training mode, don’t apply dropout.
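Putting the last two answers together, the chapter’s Dropout module looks roughly like this (written with self.p so it runs standalone):

class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x                 # no dropout at inference
        mask = x.new(*x.shape).bernoulli_(1 - self.p)  # keep each activation with prob 1-p
        return x * mask.div_(1 - self.p)               # rescale to preserve the expected sum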

  35. Experiment with bernoulli_ to understand how it works.
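For example, bernoulli_ fills a tensor in place with zeros and ones drawn with the given probability:

import torch
t = torch.zeros(3, 5)
t.bernoulli_(0.5)  # each element becomes 1. with probability 0.5, else 0.
print(t)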
  36. How do you set your model in training mode in PyTorch? In evaluation mode?

Module.train() for training mode, Module.eval() for evaluation mode.

  37. Write the equation for activation regularization (in maths or code, as you prefer). How is it different to weight decay?

loss += alpha * activations.pow(2).mean()
It differs from weight decay in that it penalizes the activations rather than the weights.

  38. Write the equation for temporal activation regularization (in maths or code, as you prefer). Why wouldn’t we use this for computer vision problems?

This focuses on making the activations of consecutive tokens similar:
loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()

  39. What is “weight tying” in a language model?

The weights of the input-to-hidden layer (the embedding) are set to be the same as the weights of the hidden-to-output layer. This basically means we assume that the mapping from English words to activations and the mapping from activations back to English words can be the same.
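In the chapter’s final model this is a single line in __init__:

self.h_o.weight = self.i_h.weight  # output layer reuses the embedding weight matrix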


Question 5:

We should change the model to remove the third word from training in the forward method if our sequence length is 2, no?

Like this:

class LMModel1_Modified(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))  # first word
        h = h + self.i_h(x[:,1])                # second word
        h = F.relu(self.h_h(h))
        return self.h_o(h)                      # predict the third word

Also, as per the question, the sequences should be changed:

seqs = L((tensor(nums[i:i+2]), nums[i+2]) for i in range(0,len(nums)-4, 2))

Is this right? Or am I missing something?


Question 13:

few_valid_items = dls.valid.items[:5]
for xs, y in few_valid_items:
    print('xs: {} --> y: {}'.format((vocab[xs[0]], vocab[xs[1]], vocab[xs[2]]), vocab[y]))

Question 20

20a: Why should we get better results in an RNN if we call detach less often?
We will get better results if we call detach less often: the model effectively has more layers, which gives our RNN a longer time horizon to learn from and richer features to create.

20b: Why might this not happen in practice with a simple RNN?
Simple RNNs are comparatively shallower than more complex ones, so there may not be any meaningful difference between the hidden states of two time steps. Hence, we may not see any benefit from using detach less often (?)

Not sure on 20b.

Is it right to say that numbers that are too close to zero are less precise in the floating-point storage context?

As I understand it, numbers close to zero are more precise than numbers that are far from zero. Numbers too close to zero only affect our use case in the vanishing-gradient context.

Question 38

I’m not sure but I think this is the answer:

38b: Why wouldn’t we use this for computer vision problems?

TAR is linked to the fact that we are predicting tokens in a sentence. That means it’s likely that the outputs of our LSTMs should somewhat make sense when we read them in order.

This is not true for computer vision problems, as the order of inputs (different images) doesn’t matter.

Does a “shared weight matrix” basically mean a reference to the same set of parameters?

If yes, does that mean that RNNs have far fewer (but “longer-trained”) parameters than the default FCNN? Would it also make RNNs more interpretable, since the parameter space is smaller?

Can somebody clarify this?

I’m not sure if this is the expected answer… but the following makes sense to me:
TAR is “…adding a penalty to the loss to make the difference between two consecutive activations as small as possible…”. In computer vision, two consecutive activations could be two different tiles of the same image. One tile could be background, and one could contain the eye of a dog. If we want to predict, for instance, dog breeds, it would harm our model’s performance to reduce the difference between those activations.