Fastbook Chapter 13 questionnaire (wiki)

Here are the questions:

  1. What is a “feature”?

A feature is a transformation of the data which is designed to make it easier to model.

  1. Write out the convolutional kernel matrix for a top edge detector.

\begin{bmatrix} -1 & -1 & -1\\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}

  1. Write out the mathematical operation applied by a 3×3 kernel to a single pixel in an image.
  2. What is the value of a convolutional kernel applied to a 3×3 matrix of zeros?

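Zero. Every term in the sum multiplies a kernel weight by a zero pixel, so the result is 0 whatever the kernel weights are (with padding, the output is a matrix of zeros).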
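
For the two questions above, the operation a 3×3 kernel applies at one pixel can be sketched in plain Python as a sum of element-wise products. The helper name `apply_kernel` and the example patches are hypothetical and only meant as an illustration.

```python
def apply_kernel(patch, kernel):
    """Sum of the element-wise products of a 3x3 patch and a 3x3 kernel."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))

# Top edge detector from the earlier answer
top_edge = [[-1, -1, -1],
            [ 0,  0,  0],
            [ 1,  1,  1]]

# Hypothetical patch: dark rows on top, a bright row at the bottom
edge_patch = [[0, 0, 0],
              [0, 0, 0],
              [1, 1, 1]]
# Patch of zeros: every product is zero, so the sum is zero
zero_patch = [[0] * 3 for _ in range(3)]

edge_value = apply_kernel(edge_patch, top_edge)
zero_value = apply_kernel(zero_patch, top_edge)
print(edge_value, zero_value)
```

The kernel responds strongly to the patch that contains an edge and gives 0 for the patch of zeros.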

  1. What is “padding”?

Padding is the extra pixels added around the outside of the image so that the kernel can also be applied at the edges of the image during a convolution.

  1. What is “stride”?

Stride refers to how many pixels at a time the kernel is moved during the convolution.
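
As a rough sketch of how padding and stride interact, the output size along one dimension can be computed with the formula (n + 2*pad - ks)//stride + 1. The function name `conv_output_size` is hypothetical, and the 28-pixel input is just an MNIST-sized example.

```python
def conv_output_size(n, ks, stride=1, pad=0):
    """Output length along one axis for input length n and kernel size ks."""
    return (n + 2 * pad - ks) // stride + 1

same_size = conv_output_size(28, ks=3, stride=1, pad=1)  # padding keeps the size
halved = conv_output_size(28, ks=3, stride=2, pad=1)     # stride 2 halves it
no_pad = conv_output_size(28, ks=3)                      # no padding shrinks it
print(same_size, halved, no_pad)
```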

  1. Create a nested list comprehension to complete any task that you choose.

To be completed by the reader. Here is an example. If we want to create the following matrix:
\begin{bmatrix} 0 & 1 & 2\\ 3 & 4 & 5 \\ 6 & 7 & 8 \end{bmatrix}
here is one way to do so:

>>> import numpy as np  
>>> x = np.array([[i*3+j for j in range(3)] for i in range(3)])  
>>> print(x)  
[[0 1 2]  
 [3 4 5]  
 [6 7 8]]

  1. What are the shapes of the input and weight parameters to PyTorch’s 2D convolution?

input: input tensor of shape (minibatch, in_channels, iH, iW)
weight: filters of shape (out_channels, in_channels, kH, kW)

  1. What is a “channel”?

The term channel (often used interchangeably with feature) refers to the size of the second axis of a weight matrix, that is, the number of activations per grid cell after a convolution.

  1. What is the relationship between a convolution and a matrix multiplication?

A convolution can be represented as matrix multiplication. This (weight) matrix has two properties:

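  1. Many of its entries are zeros, and these are untrainable, so they stay zero throughout training.
  2. Some of the weights are equal and stay equal as they are trained. These are referred to as shared weights.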
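
To make the two properties concrete, here is a minimal plain-Python sketch that builds the weight matrix for a 2×2 kernel applied to a 3×3 image (no padding, stride 1) and checks that the matrix product matches the direct convolution. The image and kernel values are arbitrary and chosen only for illustration.

```python
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 2],
          [3, 4]]

# Direct convolution: slide the 2x2 kernel over the four valid positions
direct = [sum(image[r + i][c + j] * kernel[i][j]
              for i in range(2) for j in range(2))
          for r in range(2) for c in range(2)]

# Equivalent weight matrix: one row per output, one column per input pixel.
# Entries not touched by the kernel stay zero; the same four kernel values
# reappear in every row (shared weights).
weights = [[0] * 9 for _ in range(4)]
for r in range(2):
    for c in range(2):
        for i in range(2):
            for j in range(2):
                weights[r * 2 + c][(r + i) * 3 + (c + j)] = kernel[i][j]

flat = [p for row in image for p in row]  # flatten the image to length 9
via_matmul = [sum(w * x for w, x in zip(row, flat)) for row in weights]

print(direct == via_matmul)
for row in weights:
    print(row)
```

Printing the rows shows the same four kernel values shifted to different columns, with zeros everywhere else.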
  1. What is a “convolutional neural network”?

When we use convolutions instead of (or in addition to) regular linear layers we create a convolutional neural network (CNN).

  1. What is the benefit of refactoring parts of your neural network definition?

It makes it much less likely you’ll get errors due to inconsistencies in your architectures, and makes it more obvious to the reader which parts of your layers are actually changing.

  1. What is Flatten? Where does it need to be included in the MNIST CNN? Why?

It’s basically the same as PyTorch’s squeeze method, but as a module. It is included at the end of the MNIST CNN to remove the extra 1x1 axes, so that the final activations have shape (batch size, number of classes).

  1. What does “NCHW” mean?

It is an abbreviation for the axes of the model's input: N is the batch size, C the channels, H the height, and W the width.

  1. Why does the third layer of the MNIST CNN have 7*7*(1168-16) multiplications?

That layer has 1168 parameters. Ignoring the 16 bias parameters (one per filter), the remaining (1168-16) weights are applied at each position of the 7x7 grid.
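
The arithmetic can be checked with a few lines of Python. This sketch assumes the layer maps 8 input channels to 16 output channels with a 3x3 kernel, which is what the parameter count of 1168 implies.

```python
ch_in, ch_out, ks = 8, 16, 3          # assumed layer configuration

n_weights = ch_out * ch_in * ks * ks  # kernel weights
n_biases = ch_out                     # one bias per filter
n_params = n_weights + n_biases

# The weights (not the biases) are multiplied at each of the 7x7 positions
multiplications = 7 * 7 * (n_params - n_biases)
print(n_params, multiplications)
```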

  1. What is a “receptive field”?

The receptive field is the area of an image that is involved in the calculation of a layer.

  1. What is the size of the receptive field of an activation after two stride 2 convolutions? Why?

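The size of the receptive field increases the deeper we are in the network. After two stride 2 convolutions with 3x3 kernels, the receptive field is 7x7: each activation after the first convolution sees a 3x3 patch of input pixels, and each activation after the second sees a 3x3 patch of those activations, which are spaced 2 pixels apart and therefore span 7 input pixels in each direction.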
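
The same reasoning can be written as a short calculation. This is a sketch assuming 3x3 kernels; the function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    size, jump = 1, 1  # jump = distance in input pixels between neighboring activations
    for ks, stride in layers:
        size += (ks - 1) * jump
        jump *= stride
    return size

one_layer = receptive_field([(3, 2)])
two_layers = receptive_field([(3, 2), (3, 2)])
print(one_layer, two_layers)
```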

  1. Run conv-example.xlsx yourself and experiment with trace precedents.

The Excel sheet is not available.

  1. Have a look at Jeremy or Sylvain’s list of recent Twitter "like"s, and see if you find any interesting resources or ideas there.

To be completed by the reader! Have fun!

  1. How is a color image represented as a tensor?

It is a rank-3 tensor of shape (3, height, width), with one channel each for red, green, and blue.

  1. How does a convolution work with a color input?

The convolutional kernel has shape (ch_out, ch_in, ks, ks). For example, for a color input (ch_in=3) with a 3x3 kernel and 7 output channels, the shape is (7,3,3,3). Each filter is a tensor of size ch_in x ks x ks: its ch_in=3 slices are applied separately to the 3 color channels, and the results are summed to give a single output channel. Stacking ch_out such filters gives the full kernel tensor of shape (ch_out, ch_in, ks, ks). Additionally, there is a bias of size ch_out.
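
As an illustration, the following plain-Python sketch counts the parameters for the (7,3,3,3) example and computes one activation for one filter at a single position. The patch, filter, and bias values are made up for the example.

```python
ch_out, ch_in, ks = 7, 3, 3

n_weights = ch_out * ch_in * ks * ks  # size of the (7, 3, 3, 3) kernel tensor
n_biases = ch_out                     # one bias per output channel

# Hypothetical 3x3 patch of a color image, shape (ch_in, ks, ks), all ones
patch = [[[1] * ks for _ in range(ks)] for _ in range(ch_in)]
# Hypothetical single filter, shape (ch_in, ks, ks): channel c uses weight c + 1
filt = [[[c + 1] * ks for _ in range(ks)] for c in range(ch_in)]
bias = 0.5

# Multiply per channel, then sum over all channels and kernel positions
activation = sum(patch[c][i][j] * filt[c][i][j]
                 for c in range(ch_in)
                 for i in range(ks)
                 for j in range(ks)) + bias
print(n_weights, n_biases, activation)
```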

  1. What method can we use to see that data in DataLoaders?

The show_batch method (for example, dls.show_batch()).

  1. Why do we double the number of filters after each stride-2 conv?

This is because we’re decreasing the number of activations in the activation map by a factor of 4; we don’t want to decrease the capacity of a layer by too much at a time.
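
A quick count shows the effect. The shapes below (4 channels on a 14x14 grid) are just an example; the helper `n_activations` is hypothetical.

```python
def n_activations(channels, height, width):
    """Number of activations in one activation map."""
    return channels * height * width

before = n_activations(4, 14, 14)       # before the stride-2 conv
same_filters = n_activations(4, 7, 7)   # stride 2, same number of filters
doubled = n_activations(8, 7, 7)        # stride 2, filters doubled
print(before, same_filters, doubled)
```

Without doubling, the activation count drops by a factor of 4; with doubling, it only drops by a factor of 2.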

  1. Why do we use a larger kernel in the first conv with MNIST (with simple_cnn)?

With the first layer, if the kernel size is 3x3 with eight output filters, then nine pixels are being used to produce eight output numbers, so there is not much learning since the input and output sizes are almost the same. Neural networks will only create useful features if they’re forced to do so, that is, if the number of outputs from an operation is significantly smaller than the number of inputs. To fix this, we can use a larger kernel in the first layer.

  1. What information does ActivationStats save for each layer?

It records the mean, standard deviation, and histogram of activations of every trainable layer.

  1. How can we access a learner’s callback after training?

They are available with the Learner object with the same name as the callback class, but in snake_case. For example, the Recorder callback is available through learn.recorder.

  1. What are the three statistics plotted by plot_layer_stats? What does the x-axis represent?

The mean and standard deviation of the activations, as well as the percentage of activations near zero. The x-axis represents the progress of training (batch number).

  1. Why are activations near zero problematic?

Activations near zero are problematic because it means we have computation in the model that’s doing nothing at all (since multiplying by zero gives zero). When you have some zeros in one layer, they will therefore generally carry over to the next layer… which will then create more zeros.

  1. What are the upsides and downsides of training with a larger batch size?

The gradients are more accurate since they’re calculated from more data, but a larger batch size means fewer batches per epoch, which means fewer opportunities for the model to update its weights.

  1. Why should we avoid using a high learning rate at the start of training?

Our initial weights are not well suited to the task we’re trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly.

  1. What is 1cycle training?

1cycle training is a type of learning rate schedule developed by Leslie Smith that combines learning rate warmup and annealing, which allows us to train with higher learning rates.

  1. What are the benefits of training with a high learning rate?

Training with a high learning rate gives two benefits:

  • We train faster, a phenomenon Smith named super-convergence.
  • We overfit less, because we skip over the sharp local minima and end up in a smoother (and therefore more generalizable) part of the loss.
  1. Why do we want to use a low learning rate at the end of training?

A lower learning rate at the end of training allows us to find the best part of the loss landscape and further minimize the loss.

  1. What is “cyclical momentum”?

Cyclical momentum means that the momentum varies in the opposite direction to the learning rate: when we are at high learning rates we use less momentum, and we use more again in the annealing phase.
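
The warmup, annealing, and opposite-direction momentum described in the last few answers can be sketched as a simple schedule function. This is an illustrative sketch and not fastai's implementation; the function names, the cosine shape, and all default values below are assumptions chosen for the example.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) * (1 + math.cos(math.pi * pct)) / 2

def one_cycle(pct, lr_max=1e-2, lr_start=4e-4, lr_end=1e-7,
              mom_high=0.95, mom_low=0.85, warmup=0.25):
    """Return (learning rate, momentum) at training progress pct in [0, 1]."""
    if pct < warmup:
        # Warmup: learning rate rises while momentum falls
        p = pct / warmup
        return cos_anneal(lr_start, lr_max, p), cos_anneal(mom_high, mom_low, p)
    # Annealing: learning rate falls while momentum rises again
    p = (pct - warmup) / (1 - warmup)
    return cos_anneal(lr_max, lr_end, p), cos_anneal(mom_low, mom_high, p)

for pct in (0.0, 0.25, 0.6, 1.0):
    lr, mom = one_cycle(pct)
    print(f"pct={pct:.2f}  lr={lr:.6f}  mom={mom:.3f}")
```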

  1. What callback tracks hyperparameter values during training (along with other information)?

The Recorder callback.

  1. What does one column of pixels in the color_dim plot represent?

It represents the histogram of activations for the specified layer for that batch.

  1. What does “bad training” look like in color_dim? Why?

We would see repeated cycles of dark blue and bright yellow returning at the bottom of the plot. This training is not smooth and effectively starts from scratch during these cycles.

  1. What trainable parameters does a batch normalization layer contain?

There are two learnable parameters, beta and gamma, which are learned during training and allow the model to have any mean and variance for each layer.

  1. What statistics are used to normalize in batch normalization during training? How about during validation?

During training, we use the mean and standard deviation of the batch to normalize the data, while during validation we instead use a running mean of the statistics calculated during training.
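
The difference between the two modes, together with the gamma and beta parameters from the previous answer, can be sketched for a single feature in plain Python. This is a simplified illustration under stated assumptions: the function names, the `running` dictionary, and the `momentum` and `eps` values are hypothetical, and real libraries differ in details such as how the running variance is estimated.

```python
def batchnorm_train(batch, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Normalize with the batch statistics and update the running averages."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

def batchnorm_eval(batch, gamma, beta, running, eps=1e-5):
    """Normalize with the running statistics collected during training."""
    return [gamma * (x - running["mean"]) / (running["var"] + eps) ** 0.5 + beta
            for x in batch]

running = {"mean": 0.0, "var": 1.0}
train_out = batchnorm_train([1.0, 2.0, 3.0], gamma=1.0, beta=0.0, running=running)
eval_out = batchnorm_eval([1.0, 2.0, 3.0], gamma=1.0, beta=0.0, running=running)
print(train_out)
print(eval_out)
print(running)
```

In training mode the output of the batch has mean 0 (before gamma and beta are applied), while in evaluation mode the same inputs are normalized with the running statistics instead.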

  1. Why do models with batch normalization layers generalize better?

Most researchers believe that the reason for this is that batch normalization adds some extra randomness to the training process.