Fastbook Chapter 12 questionnaire (wiki)

Here are the questions:

  1. If the dataset for your project is so big and complicated that working with it takes a significant amount of time, what should you do?

Perhaps create the simplest possible dataset that allows for quick and easy prototyping. For example, Jeremy created a “human numbers” dataset.

  1. Why do we concatenate the documents in our dataset before creating a language model?

To create a continuous stream of input/target words that can be split into batches of a significant size.

  1. To use a standard fully connected network to predict the fourth word given the previous three words, what two tweaks do we need to make?
a. Use the same weight matrix for the three layers.
b. Use the first word’s embeddings as activations to pass to the linear layer, add the second word’s embeddings to the first layer’s output activations, and continue in the same way for the rest of the words.
  1. How can we share a weight matrix across multiple layers in PyTorch?

Define one layer in the PyTorch model class, and use it multiple times in the forward method.
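
A minimal sketch of that idea, assuming plain PyTorch (`torch.nn`) rather than fastai's `Module`. The class name `SharedLinear` and the sizes are made up for illustration. One `nn.Linear` is created once and applied twice, so both applications read and update the same parameters:

```python
import torch
from torch import nn
import torch.nn.functional as F

class SharedLinear(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_hidden):
        super().__init__()
        self.h_h = nn.Linear(n_hidden, n_hidden)  # defined once

    def forward(self, x):
        x = F.relu(self.h_h(x))  # first use
        x = F.relu(self.h_h(x))  # second use of the same weights
        return x

model = SharedLinear(8)
# Only one weight matrix and one bias are registered, however often the layer is called
n_params = sum(p.numel() for p in model.parameters())
out = model(torch.randn(4, 8))
print(n_params, out.shape)
```

The parameter count does not grow with the number of times the layer is applied in `forward`.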

  1. Write a module which predicts the third word given the previous two words of a sentence, without peeking.

Same code as in chapter:

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)
  1. What is a recurrent neural network?

A refactoring of a multi-layer neural network as a loop.

  1. What is hidden state?

The activations updated after each RNN step.

  1. What is the equivalent of hidden state in LMModel1 ?

It is also defined as h in LMModel1.

  1. To maintain the state in an RNN why is it important to pass the text to the model in order?

Because the hidden state is carried from one batch to the next, regardless of sequence length, it only means something if each batch continues the text where the previous one left off.

  1. What is an unrolled representation of an RNN?

A representation without loops, depicted as a standard multilayer network

  1. Why can maintaining the hidden state in an RNN lead to memory and performance problems? How do we fix this problem?

Because the hidden state is carried over from every previous call of the model, backpropagation has to go through the computation history of all those past calls as well, and that history grows with every batch. This makes training slow and memory hungry. The fix is to call the detach method on the hidden state after every call, which keeps its values but drops the gradient history of earlier calls.
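
A small sketch of what `detach` does, using a toy tensor in place of a real hidden state (the names `w` and `h` are illustrative, not taken from the chapter):

```python
import torch

w = torch.ones(3, requires_grad=True)  # stands in for the model weights
h = torch.zeros(3)                     # stands in for the hidden state

h = h + w                   # h now carries gradient history back to w
had_history = h.grad_fn is not None

h = h.detach()              # keep the values, drop the history
has_history = h.requires_grad
print(had_history, has_history)
```

After the call, `h` holds the same numbers but backpropagation can no longer reach anything computed before it.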

  1. What is BPTT?

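Backpropagation through time: treating the unrolled RNN, with one layer per time step, as one big network and computing gradients through it in the usual way. In practice we use truncated BPTT, which only backpropagates through the sequence length of the current batch because the hidden-state history is detached after each batch.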

  1. Write code to print out the first few batches of the validation set, including converting the token IDs back into English strings, as we showed for batches of IMDb data in <>.
  2. What does the ModelResetter callback do? Why do we need it?

It resets the hidden state of the model before every epoch and before every validation run, so that each pass over the text starts from a clean state.

  1. What are the downsides of predicting just one output word for each three input words?

There are words in between that are not being predicted and that is extra information for training the model that is not being used. To solve this, we apply the output layer to every hidden state produced to predict three output words for the three input words (offset by one).

  1. Why do we need a custom loss function for LMModel4 ?

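The model now returns one prediction per token, with shape bs x sl x vocab_sz, and the targets have shape bs x sl. The standard cross-entropy loss can't take them in that layout, so we flatten both before computing the loss.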
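
A sketch of such a loss function, along the lines of the one used in the chapter. The sizes and the random tensors here are illustrative assumptions, and `vocab_sz` stands in for the length of the real vocab:

```python
import torch
import torch.nn.functional as F

bs, sl, vocab_sz = 2, 3, 5                    # illustrative sizes
preds = torch.randn(bs, sl, vocab_sz)         # one prediction per token position
targs = torch.randint(0, vocab_sz, (bs, sl))  # one target token per position

def loss_func(inp, targ):
    # Merge the batch and sequence dimensions before calling cross entropy
    return F.cross_entropy(inp.view(-1, vocab_sz), targ.view(-1))

loss = loss_func(preds, targs)
print(loss)
```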

  1. Why is the training of LMModel4 unstable?

Because this network is effectively very deep and this can lead to very small or very large gradients that don’t train well

  1. In the unrolled representation, we can see that a recurrent neural network actually has many layers. So why do we need to stack RNNs to get better results?

Because the layers in the unrolled representation all share the same weight matrix, which limits what the network can learn. Stacking RNNs adds layers with their own weight matrices.

  1. Draw a representation of a stacked (multilayer) RNN.

  1. Why should we get better results in an RNN if we call detach less often? Why might this not happen in practice with a simple RNN?
  2. Why can a deep network result in very large or very small activations? Why does this matter?

Numbers that are just slightly higher or lower than one can lead to the explosion or disappearance of numbers after repeated multiplications. In deep networks, we have repeated matrix multiplications, so this is a big problem.
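
A quick illustration with plain scalars instead of matrices (the factor and the number of steps are arbitrary choices):

```python
grow = 1.01 ** 1000    # slightly above one, multiplied 1000 times
shrink = 0.99 ** 1000  # slightly below one, multiplied 1000 times
print(grow, shrink)
```

The first value ends up in the tens of thousands and the second is close to zero, even though each individual factor is within 1% of one.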

  1. In a computer’s floating point representation of numbers, which numbers are the most precise?

Small numbers, that are not too close to zero however

  1. Why do vanishing gradients prevent training?

Gradients that are zero can’t contribute to training because they don’t change any weights

  1. Why does it help to have two hidden states in the LSTM architecture? What is the purpose of each one?

a. One state remembers what happened earlier in the sentence
b. The other predicts the next token

  1. What are these two states called in an LSTM?

a. Cell state (long short-term memory)
b. Hidden state (predict next token)

  1. What is tanh, and how is it related to sigmoid?

It’s just a sigmoid function rescaled to the range of -1 to 1
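
More precisely, tanh(x) = 2 * sigmoid(2x) - 1. A short numerical check of that identity, using only the standard library (the sample points are arbitrary):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
# Largest gap between tanh and the rescaled sigmoid over the sample points
max_diff = max(abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) for x in xs)
print(max_diff)
```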

  1. What is the purpose of this code in LSTMCell ?: h = torch.stack([h, input], dim=1)

This should actually be torch.cat([h, input], dim=1). It joins the hidden state and the new input along dimension 1.
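
A small sketch of the difference between the two calls, with made-up shapes (batch of 4, 6 hidden units, 3 input features):

```python
import torch

h = torch.zeros(4, 6)    # hidden state
inp = torch.ones(4, 3)   # new input

# cat joins along an existing dimension, so the widths simply add up
joined = torch.cat([h, inp], dim=1)
print(joined.shape)

# stack inserts a new dimension and needs tensors of identical shape
stacked = torch.stack([h, h], dim=1)
print(stacked.shape)
```

Since `h` and the input generally have different widths, only `cat` can combine them.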

  1. What does chunk do in PyTorch?

Splits a tensor into a given number of equally sized pieces along a dimension.
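
For example, with a small tensor standing in for the combined gate activations (the name `gates` and the sizes are illustrative):

```python
import torch

gates = torch.arange(8).view(2, 4)  # rows: [0, 1, 2, 3] and [4, 5, 6, 7]
a, b = gates.chunk(2, dim=1)        # two pieces along the column dimension
print(a)
print(b)
```

Each piece keeps both rows and gets half of the columns.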

  1. Study the refactored version of LSTMCell carefully to ensure you understand how and why it does the same thing as the non-refactored version.
  2. Why can we use a higher learning rate for LMModel6 ?

Because LSTM provides a partial solution to exploding/vanishing gradients (?)

  1. What are the three regularisation techniques used in an AWD-LSTM model?
a. Dropout
b. Activation regularization
c. Temporal activation regularization
  1. What is dropout?

Randomly setting some activations to zero during training.

  1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?

a. With dropout, each activation is zeroed with probability p, so the sum of the activations reaching the next layer is smaller than when they are all present. To correct the scale, the surviving activations are divided by (1-p).
b. In the implementation in the book, the rescaling is applied during training.
c. Either approach works: divide by (1-p) during training, or multiply by (1-p) at inference instead.
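
A minimal sketch of a dropout layer that rescales during training, written against plain `torch.nn.Module`. It follows the description above rather than reproducing the book's code exactly:

```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p  # probability of zeroing an activation

    def forward(self, x):
        if not self.training:
            return x  # evaluation mode: pass activations through untouched
        # mask entries are 1 with probability 1-p and 0 with probability p
        mask = torch.empty_like(x).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)  # rescale the survivors

drop = Dropout(0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # every entry is either 0 or 1/(1-p) = 2
drop.eval()
eval_out = drop(x)   # identical to x
```

Switching between `train()` and `eval()` is what toggles `self.training`, so the same module behaves differently in the two modes.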

  1. What is the purpose of this line from Dropout ?: if not self.training: return x

When not in training mode, don’t apply dropout

  1. Experiment with bernoulli_ to understand how it works.
  2. How do you set your model in training mode in PyTorch? In evaluation mode?

a. Module.train(), Module.eval()

  1. Write the equation for activation regularization (in maths or code, as you prefer). How is it different to weight decay?

loss += alpha * activations.pow(2).mean()
It differs from weight decay in that it penalizes large activations rather than large weights.

  1. Write the equation for temporal activation regularization (in maths or code, as you prefer). Why wouldn’t we use this for computer vision problems?

This penalizes differences between the activations of consecutive tokens, pushing them to be similar:
loss += alpha * activations.pow(2).mean()

  1. What is “weight tying” in a language model?

The weights of the input-to-hidden (embedding) layer are the same as the weights of the hidden-to-output layer. This basically means we assume that the mapping from English words to activations and the mapping from activations to English words can be the same.
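
A sketch of how the tying can be done in PyTorch, assuming an embedding layer and an output layer named `i_h` and `h_o` as in the models above (the sizes are illustrative):

```python
import torch
from torch import nn

vocab_sz, n_hidden = 10, 4              # illustrative sizes
i_h = nn.Embedding(vocab_sz, n_hidden)  # weight shape: vocab_sz x n_hidden
h_o = nn.Linear(n_hidden, vocab_sz)     # weight shape: vocab_sz x n_hidden
h_o.weight = i_h.weight                 # both layers now share one Parameter

tied = h_o.weight is i_h.weight
out = h_o(torch.randn(2, n_hidden))
print(tied, out.shape)
```

This works because the two weight matrices have the same shape, so a single parameter can serve both layers.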


Question 5:

We should change the model to remove the third word from training in forward method if our sequence length is 2, no?

Like this:

class LMModel1_Modified(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

Also, as per the question, sequence should also be changed:

seqs = L((tensor(nums[i:i+2]), nums[i+2]) for i in range(0,len(nums)-4, 2))

is this right? or am I missing something?


Question 13:

few_valid_items = dls.valid.items[:5]
for xs, y in few_valid_items:
    print('xs: {} --> y: {}'.format((vocab[xs[0]], vocab[xs[1]], vocab[xs[2]]), vocab[y]))

Question 20

20a: Why should we get better results in an RNN if we call detach less often?
We should get better results if we call detach less often, because gradients then flow back through more time steps (effectively more layers). This gives our RNN a longer time horizon to learn from, and richer features to create.

20b: Why might this not happen in practice with a simple RNN?
A simple RNN is comparatively shallower than a complex one, so there may not be any meaningful difference between the hidden states of two time steps. Hence, we may not see any benefit from calling detach less often (?)

Not sure on 20b.

Is it right to say that numbers that are too close to zero are less precise in floating point storage context?

As I understand it, numbers close to zero are more precise than numbers far from it. Numbers too close to zero only affect our use case in the vanishing-gradient context.
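
One way to look at this is the gap between a float and the next representable float, which `math.ulp` reports (available from Python 3.9). This is a sketch for 64-bit Python floats, and the sample values are arbitrary:

```python
import math

values = (1e-6, 1.0, 1e6)
# math.ulp(x) is the gap between x and the next representable float
gaps = [math.ulp(x) for x in values]
for x, gap in zip(values, gaps):
    print(x, gap)
```

The absolute gap grows with the magnitude of the number, so values nearer zero are stored with finer absolute resolution.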

Question 38

I’m not sure but I think this is the answer:

38b: Why wouldn’t we use this for computer vision problems?

TAR is linked to the fact we are predicting tokens in a sentence. That means it’s likely that the outputs of our LSTMs should somewhat make sense when we read them in order.

This is not true for computer vision problems as order of input (different images) doesn’t matter.

Does a “shared weight matrix” basically mean a reference copy to the same set of parameters?

If yes, does that mean that RNNs have much fewer (but “longer trained”) parameters than the default FCNN? Would it also make RNNs more interpretable since the parameter space is smaller?

Can somebody clarify this?

I’m not sure if this is the expected answer… but the following makes sense to me:
TAR is “…adding a penalty to the loss to make the difference between two consecutive activations as small as possible…”. In computer vision two consecutive activations could be two different tiles of the same image. One tile could be background, and one tile could contain the eye of a dog. If we want to predict for instance dog breeds, it would harm our model's performance to reduce the difference between the activations.

Your class is definitely correct now. As for the sequence, I tried out different combinations of numbers to make sense of the original model with 3 independent variables and 1 dependent variable. I found that we subtract 4 from len(nums), which is 63095, so that the sequence can be split into equal parts with no remainder.

So 63095 - 4 - 1 is equal to 63090, which is divisible by 3 (the number of independent variables), giving 21030. We then add back the 1 and get the final length of 21031.

Now for this question, I would assume that the seqs will be:
seqs = L((tensor(nums[i:i+2]), nums[i+2]) for i in range(0,len(nums)-2, 2))

This seq will split into equal parts, with a length of: (63095 - 2 - 1) / 2 = 31546, plus 1, which gives 31547.

EDIT: you can use this intuition and think that you don’t need to subtract anything from len(nums) when taking 2 ind var and 1 dep var. If you do the easy calculation you will get an integer 31548. The problem with this is that this integer will be out of bound since 31548x2 = 63096 which is more than our original token length
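
The counts above can be checked without building the dataset, since the number of sequences is just the number of start indices in each `range`. This sketch only assumes len(nums) = 63095, as quoted above:

```python
n = 63095  # len(nums)

three_in = range(0, n - 4, 3)  # start indices for 3 inputs + 1 target
two_in = range(0, n - 2, 2)    # start indices for 2 inputs + 1 target
no_margin = range(0, n, 2)     # 2 inputs + 1 target, nothing subtracted
print(len(three_in), len(two_in), len(no_margin))

# Index of the last target token; it must stay below n to be valid
last_target_three = three_in[-1] + 3
last_target_two = two_in[-1] + 2
last_target_no_margin = no_margin[-1] + 2
```

Only the version with nothing subtracted produces a target index past the end of `nums`.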

  1. Why do we scale the weights with dropout? Is this applied during training, inference, or both?

My answer: divide weights by (1-p) during training, OR multiply weights by (1-p) during inference, where p is the probability of dropping an activation. It’s either or, not both.

Use dls.valid instead, since the question asks for “batches”, not “items”. You could then write something like:

for k, (xs, ys) in enumerate(dls.valid):
    if k == 5: break  # first five batches only
    # each row of the batch is one (inputs, target) pair
    for x, y in zip(xs, ys):
        print([vocab[int(t)] for t in x], '-->', vocab[int(y)])

And for Question 33: Dropout is only applied during training, not inference. When you run inference on a test set, you do not want to lose information. So although it can be applied during inference, generally it is not done unless there is a good reason to do so.

And for Question 38: I think the equation should be

loss += beta * (activations[:, 1:] - activations[:, :-1]).pow(2).mean()

as the one in the original answer is for activation regularization, not the requested temporal activation regularization.
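
To make the difference concrete, here is a sketch that computes both penalties on random activations. The layout (batch x sequence x hidden) and the values of `alpha` and `beta` are assumptions for illustration:

```python
import torch

alpha, beta = 2.0, 1.0              # illustrative weights for the two penalties
activations = torch.randn(4, 5, 8)  # assumed layout: batch x sequence x hidden

# AR penalizes large activations
ar = alpha * activations.pow(2).mean()
# TAR penalizes changes between consecutive time steps
tar = beta * (activations[:, 1:] - activations[:, :-1]).pow(2).mean()
print(ar, tar)

# Identical activations at every time step: TAR drops to zero
constant = torch.ones(4, 5, 8)
tar_constant = (constant[:, 1:] - constant[:, :-1]).pow(2).mean()
```

With constant activations the AR term would still be positive, while the TAR term vanishes because consecutive steps do not differ.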

Perhaps one place TAR would make sense for Computer Vision is for video processing. E.g. if we break a video into images - aka frames - then we could expect that the embedding representation of each image should be similar or change slightly - thus we would want the distance between output activations to be small

Conceptually what I’m saying is whenever we are predicting sequences and the sequences should make sense consecutively, then we can use TAR. This can be an image as input and predicting a sentence. I think we can still use TAR because it's applied to the prediction - a list of activations, one activation per predicted word.

I think this is incorrect.

We’re applying dropout during training (dropping each neuron with probability p) and then dividing the surviving outputs by (1-p) to account for the dropout we just applied. When we divide by 1-p we are actually increasing the magnitude of the activations, because 1-p is a fraction below one (e.g. 0.8), and dividing by such a fraction gives a bigger number! So we are effectively increasing the activations to ensure they stay at the same scale as if we had not applied dropout.

We do neither (dropout p% neurons or rescale by 1-p) during inference though