Fastbook Chapter 5 questionnaire solutions (wiki)

Here are the questions:

  1. Why do we first resize to a large size on the CPU, and then to a smaller size on the GPU?

This concept is known as presizing. In fastai, data augmentation is usually performed on the GPU. However, augmentation can introduce degradation and artifacts, especially at the edges of the image. To minimize this data destruction, each image is first resized to a relatively large size on the CPU, the augmentations are applied to that larger image on the GPU, and RandomResizedCrop then produces the final, smaller image size.
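As a sketch, the presizing pattern looks roughly like the pets example in the chapter (the exact sizes here are illustrative; it requires fastai and a dataset to actually run):

```python
from fastai.vision.all import *

# Presizing: resize each item to a generously large size on the CPU first,
# then run the augmentations, including the final RandomResizedCrop down to
# 224 px, as a single batch operation on the GPU.
pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    item_tfms=Resize(460),                                # CPU, per item
    batch_tfms=aug_transforms(size=224, min_scale=0.75),  # GPU, per batch
)
```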

  1. If you are not familiar with regular expressions, find a regular expression tutorial, and some problem sets, and complete them. Have a look on the book website for suggestions.

To be done by the reader.

  1. What are the two ways in which data is most commonly provided, for most deep learning datasets?
  1. Individual files representing items of data, such as text documents or images.
  2. A table of data, such as in CSV format, where each row is an item, and each row may include filenames that provide a connection between the data in the table and data in other formats, such as text documents and images.
  1. Look up the documentation for L and try using a few of the new methods that it adds.

To be done by the reader.

  1. Look up the documentation for the Python pathlib module and try using a few methods of the Path class.

To be done by the reader.

  1. Give two examples of ways that image transformations can degrade the quality of the data.
  1. Rotation can leave empty areas in the final image
  2. Other operations may require interpolation, which is based on the original image's pixels but still produces pixels of lower quality than the originals
  1. What method does fastai provide to view the data in a DataLoader?

The show_batch method (e.g. dls.show_batch()) displays a sample of the data in a DataLoader.

  1. What method does fastai provide to help you debug a DataBlock?

The summary method (e.g. dblock.summary(path)) steps through the whole pipeline on a sample, reporting exactly where it fails.

  1. Should you hold off on training a model until you have thoroughly cleaned your data?

No. It is best to create a baseline model as soon as possible.

  1. What are the two pieces that are combined into cross entropy loss in PyTorch?

Cross Entropy Loss is a combination of a Softmax function and Negative Log Likelihood Loss.
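As an illustrative sketch (plain Python, with hypothetical activations), the two pieces can be applied by hand: softmax turns raw activations into probabilities, and negative log likelihood then takes minus the log of the probability assigned to the correct class. This is, up to numerical details, what PyTorch's nn.CrossEntropyLoss computes from raw logits:

```python
import math

def softmax(xs):
    # exponentiate, then normalize so the outputs sum to one
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, target):
    # negative log likelihood of the correct class
    return -math.log(probs[target])

logits = [0.5, 2.0, -1.0]   # hypothetical raw model outputs
target = 1                  # index of the correct class

loss = nll(softmax(logits), target)
```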

  1. What are the two properties of activations that softmax ensures? Why is this important?

It makes the outputs for the classes add up to one. This means the model can only predict one class. Additionally, it amplifies small changes in the output activations, which is helpful as it means the model will select a label with higher confidence (good for problems with definite labels).
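A small plain-Python sketch (hypothetical activations) makes both properties visible: the outputs sum to one, and because of the exponential, equal gaps of 1.0 between activations become multiplicative jumps of a factor of e between probabilities, pushing confidence toward the largest activation:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # hypothetical activations, gaps of 1.0
# probs sums to one, and each probability is e times the previous one
```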

  1. When might you want your activations to not have these two properties?

When you have multi-label classification problems (more than one label possible).

  1. Calculate the “exp” and “softmax” columns of <


  1. Why can’t we use torch.where to create a loss function for datasets where our label can have more than two categories?

Because torch.where can only select between two possibilities, whereas multi-class classification involves more than two categories.

  1. What is the value of log(-2)? Why?

This value is not defined. The logarithm is the inverse of the exponential function, and the exponential function is always positive no matter what value is passed. So the logarithm is not defined for negative values.
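This can be checked directly; Python's math.log raises a ValueError for negative inputs (PyTorch's torch.log would instead return nan):

```python
import math

try:
    math.log(-2)
    log_defined = True
except ValueError:
    # log(-2) is undefined: exp(x) > 0 for every real x,
    # so no x satisfies exp(x) = -2
    log_defined = False
```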

  1. What are two good rules of thumb for picking a learning rate from the learning rate finder?

Either one of these two points should be selected for the learning rate:

  1. one order of magnitude less than where the minimum loss was achieved (i.e. the minimum divided by 10)

  2. the last point where the loss was clearly decreasing.
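The first rule of thumb can be sketched in plain Python (the loss values below are hypothetical; fastai's actual lr_find trains on a few batches and applies smoothing before suggesting a value):

```python
# Hypothetical (learning rate, loss) pairs, as a learning rate finder
# might record them while the learning rate is increased exponentially.
lrs    = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [2.30, 2.25, 1.90, 1.10, 0.85, 4.00]

# Rule of thumb 1: one order of magnitude below the lr with the minimum loss.
suggested_lr = lrs[losses.index(min(losses))] / 10
```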

  1. What two steps does the fine_tune method do?
  1. Train the new head (with random weights) for one epoch
  2. Unfreeze all the layers and train them all for the requested number of epochs
  1. In Jupyter notebook, how do you get the source code for a method or function?

Use ?? after the function name, e.g.: DataBlock.summary??

  1. What are discriminative learning rates?

Discriminative learning rates refers to the training trick of using different learning rates for different layers of the model. This is commonly used in transfer learning. The idea is that when you fine-tune a pretrained model, you don't want to drastically change the earlier layers, as they contain information regarding simple features like edges and shapes. The later layers can be changed more, as they contain information regarding more complex features, such as facial features, that may not be relevant to your task. Therefore, the earlier layers are given lower learning rates and the later layers higher ones.

  1. How is a Python slice object interpreted when passed as a learning rate to fastai?

The first value of the slice object is the learning rate for the earliest layer, while the second value is the learning rate for the last layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range.
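fastai implements this spacing with a geometric progression in the spirit of its even_mults helper; here is a plain-Python sketch (simplified: fastai actually assigns these values per parameter group, not per individual layer):

```python
def even_mults(start, stop, n):
    # n values spaced multiplicatively (geometrically) from start to stop
    if n == 1:
        return [stop]
    mult = stop / start
    return [start * mult ** (i / (n - 1)) for i in range(n)]

# e.g. slice(1e-6, 1e-4) spread across 3 layer groups
group_lrs = even_mults(1e-6, 1e-4, 3)
# the middle group gets 1e-5: the geometric midpoint of 1e-6 and 1e-4
```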

  1. Why is early stopping a poor choice when using one cycle training?

If early stopping is used, the training may not have time to reach lower learning rate values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found.
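A simplified sketch of the one-cycle schedule helps make this concrete (the warmup fraction and divisors below mirror fastai's defaults, but the function is an illustration, not fastai's implementation):

```python
import math

def one_cycle_lr(t, lr_max=1e-3, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at training fraction t in [0, 1]: cosine warmup from
    lr_max/div up to lr_max, then cosine annealing down to lr_max/div_final."""
    lr_start, lr_end = lr_max / div, lr_max / div_final
    if t < pct_start:
        p, lo, hi = t / pct_start, lr_start, lr_max          # warmup phase
    else:
        p, lo, hi = (t - pct_start) / (1 - pct_start), lr_max, lr_end  # anneal
    # cosine interpolation between lo and hi
    return lo + (hi - lo) * (1 - math.cos(math.pi * p)) / 2
```

Stopping early means cutting off the right-hand tail of this curve, which is exactly where the smallest learning rates (and the final fine-grained improvements) live.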

  1. What is the difference between resnet50 and resnet101?

The numbers 50 and 101 refer to the number of layers in the models. Therefore, ResNet101 is a larger model with more layers than ResNet50. These model variants are commonly used because ImageNet-pretrained weights are available for them.

  1. What does to_fp16 do?

This enables mixed-precision training, in which less precise numbers are used in order to speed up training.
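In fastai this is just chained onto the Learner, e.g. learn.to_fp16(). To see what "less precise" means, Python's struct module can round-trip a number through IEEE 754 half precision (format code 'e'), the 16-bit format used in mixed-precision training:

```python
import struct

def to_half_and_back(x):
    # pack as IEEE 754 binary16 ('e'), then unpack back to a Python float
    return struct.unpack('e', struct.pack('e', x))[0]

halved = to_half_and_back(0.1)
# binary16 has only a 10-bit fraction, so 0.1 comes back slightly off
# (0.0999755859375): less precise, but much faster on modern GPUs
```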


@muellerzr Please wiki-fy! :slight_smile:


Very good. Thank you.

@sgugger The following question is incomplete:

I assume this is due to the formatting that is there for generating the final book. But could you please clarify what columns this question is referring to? Is it this one?:


Can somebody point me to the code where I can see what learning rate was applied to each layer with discriminative learning rates? It didn’t look like they were included in the recorder.

I see even_mults, but couldn’t find it elsewhere in the repo. Thank you

I have a question about early stopping. I understand the reason for the learning rate schedule. Now suppose we retrain from scratch using the same number of epochs at which we got the best result previously. The learning rates will be different: as highlighted in the answer, towards the end the learning rate will get much smaller. How can we guarantee that training from scratch will lead to a better result than the early-stopped one? We might not reach the minimum because the model is undertrained.

> If early stopping is used, the training may not have time to reach lower learning rate values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found.

Hi Yiping,

I am not too sure as well, but probably just share my thoughts to keep the discussion going. In the notebook, we saw that 12 epochs were too many, and that the accuracy plateaued at around 8 epochs.

By retraining with the basis of 8 epochs, could it be that the learning rate schedule can now be properly compressed into the 8 epochs such that the full schedule (LR small to big and back to small) can be completed?

This is based on what I interpreted from the recommended blog post on fit_one_cycle, which mentioned the following:

> …do a cycle with two steps of equal lengths, one going from a lower learning rate to a higher one than go back to the minimum. The maximum should be the value picked with the Learning Rate Finder, and the lower one can be ten times lower. Then, the length of this cycle should be slightly less than the total number of epochs…

Hi @klty0988, thanks for your reply! Based on my understanding, I drew the learning rates in the attached screenshot. The red curve is the early stopping one and the green one is training from scratch. Since the max/min LR is specified by us, they'll be the same for both. The difference is that for the early stopping one, the min LR won't be reached. From the graph, it's obvious that the average learning rate of training from scratch is lower (by looking at the area under the curve). So if we train from scratch, the total amount of update should be smaller than for the early stopping one? Please correct me if I'm wrong. (Screenshot 2020-12-07 at 8.40.57 AM attached)

Hi Yiping,

Thanks for the illustration, it's a good way to visualize our discussion. I think our common understanding seems to be aligned.

If we start by setting 12 epochs and do fit_one_cycle, it will hit the red vertical line at 8 epochs as you illustrated above.

However, when we retrain from scratch with 8 epochs (after we realise from the above experiment that 8 epochs is where it starts to overfit), the entire learning rate schedule will be performed such that the green curve (complete learning) is spaced out nicely for 8 epochs now (instead of 12 epochs earlier).

Does that agree with what you perceive it to be?