Lesson 6 - Official topic

nn.BCEWithLogitsLoss differs from the binary_cross_entropy function in that

  • it is a module and not a function (its functional form is F.binary_cross_entropy_with_logits) and
  • it applies/includes the sigmoid function before doing binary_cross_entropy

From the book:

F.binary_cross_entropy, and its module equivalent nn.BCELoss, calculate cross entropy on a one-hot encoded target, but do not include the initial sigmoid. Normally for one-hot encoded targets you’ll want F.binary_cross_entropy_with_logits (or nn.BCEWithLogitsLoss), which do both sigmoid and binary cross entropy in a single function, as in our example above.
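To make the difference concrete, here is a minimal sketch (plain PyTorch, nothing fastai-specific) showing that nn.BCEWithLogitsLoss on raw logits matches applying the sigmoid yourself and then calling F.binary_cross_entropy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 5)          # raw model outputs, no activation applied
targets = torch.rand(4, 5).round()  # multi-hot targets in {0, 1}

# nn.BCEWithLogitsLoss applies the sigmoid internally...
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# ...so it matches applying the sigmoid yourself, then using F.binary_cross_entropy
loss_manual = F.binary_cross_entropy(torch.sigmoid(logits), targets)

print(loss_with_logits, loss_manual)  # the two values agree up to floating-point error
```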

I think the second and third arguments of torch.where should be interchanged. Also, if I’m not mistaken, there should be a negative sign in front of it. That’s just my guess after looking at your post; I haven’t tried it out yet. Hopefully that helps. I can check it later today, in case that isn’t the reason for the discrepancy.

EDIT: Just realised that the same answer has already been proposed by @hallvagi in https://forums.fast.ai/t/lesson-6-official-topic/69306/355?u=gautam_e

2 Likes

Sylvain addressed it during the class and said it has been updated in the book (are you not seeing the changes?). Those were exactly the changes: the negative sign and interchanging the where arguments :slight_smile:
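For reference, my understanding is that the corrected version looks roughly like this (just a sketch, with the negative sign and the interchanged torch.where arguments being the two changes discussed above):

```python
import torch

def binary_cross_entropy(inputs, targets):
    # apply the sigmoid first, then take the predicted probability of the
    # correct class for each element and average the negative log
    inputs = inputs.sigmoid()
    return -torch.where(targets == 1, inputs, 1 - inputs).log().mean()
```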

Any tips on how to train faster?

I followed the first lectures, scraped some data, and am now trying to build my classifier. Up to the dataloaders, all good. The problem comes when the actual training starts, since my dataset is quite large. First, I am trying to choose a good learning rate:

I am training resnet50 on a V100 (p3.2xlarge on AWS) with 250K images. I resized the images to 500x500 since they were really high quality and I felt it was a pity to throw that information away by downsizing. I am also using mixed-precision training as Jeremy showed in this lecture… but it looks as if it will take forever. So I wondered: is there a way to run fastai2 on multiple GPUs? In PyTorch I would do it like this.

to_fp16() could help.
You could start with smaller image sizes and increase them later (progressive resizing).
Not sure if you have already tried these :slight_smile:

Yes, you can use nn.DataParallel with fastai as shown here.
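Roughly how those two suggestions can be combined (a sketch, assuming a fastai2 Learner built from the dls mentioned above; the plain nn.DataParallel wrap here is the simplest option rather than the only one):

```python
import torch
import torch.nn as nn
from fastai.vision.all import *

# assuming `dls` is the DataLoaders built earlier
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()  # mixed precision

# split each batch across all visible GPUs by wrapping the underlying PyTorch model
if torch.cuda.device_count() > 1:
    learn.model = nn.DataParallel(learn.model)

learn.fine_tune(5)
```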

1 Like

Do you know of any credible (working in production) updates to Smith’s work? It’s difficult to parse the 500 citations to figure out which ones build directly on Smith’s work and which of those actually work. @sgugger @jeremy

A simple question regarding the learning rate finder:

  • Which learning rate should I pick to train a model?

This paragraph in the book confuses me a bit.

If I understood it correctly, the suggested lr_min is actually the learning rate at the loss minimum divided by 10, right?
Moreover, why does Jeremy pick a different value?

So the code suggests automatically computed upper and lower bounds for the learning rate. lr_steep is the learning rate at which the slope of the loss curve is steepest, about 0.04 in this case.

But our loss curve has a weird, kinky shape, with the maximum slope occurring just to the left of the minimum.

So we ask ourselves: should we trust the suggested value?
Is lr_steep = 0.04 a good choice for the learning rate?

After some thought, we realize that it is not, because there would be no room to learn: SGD would immediately take us to the minimum and then get stuck there. So Jeremy pulls back to lr = 0.01, which is still on the steep part of the loss curve but not right next to the minimum, so the model is able to keep learning.
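A short sketch of where those two numbers come from, with the fastai version used in the book (the `dls` variable and the training call are assumptions carried over from the earlier posts):

```python
from fastai.vision.all import *

# assuming `dls` holds the DataLoaders built earlier
learn = cnn_learner(dls, resnet34, metrics=error_rate)

# lr_find returns the two suggestions discussed above:
# lr_min is the minimum-loss learning rate divided by 10,
# lr_steep is the learning rate where the loss curve is steepest
lr_min, lr_steep = learn.lr_find()

# Jeremy steps back from the 0.04 suggestion and trains at 1e-2 instead,
# still on the steep part of the curve but away from the minimum
learn.fit_one_cycle(3, lr_max=1e-2)
```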

2 Likes

Thanks a lot @jcatanza!

(…) the code suggests automatically computed upper and lower bounds for the learning rate.

I would have said lr_steep should be the lower bound and lr_min the upper bound, but since the minimum-loss rate is divided by 10 it turns out the other way around, right?

In addition, from what I understood, you are saying that generally we want to take lr_steep as the learning rate (not lr_min), although in this case we simply ignore the recommendation for the reasons you stated.

Yes. So when do we use lr_min?

lr_min comes into play for transfer learning, which is the process of applying weights from a pretrained model to solve a more specialized problem.

The procedure is, roughly:

  • Import weights from the pre-trained model
  • Chop off the head (last layer) of the pre-trained model, replacing it with a custom head that is appropriate for our specialized problem.
  • Freeze all the weights except the weights in the custom head
  • Train the last layer with the learning rate set to lr_steep.
  • Unfreeze all the weights
  • Separate the neural net into layer groups, from shallow to deep
  • Fine-tune the weights by training the network with differential learning rates that progressively decrease from lr_steep at the last layer to lr_min for the deepest layers.

The reason for progressively decreasing the learning rate towards the deeper layers is this: we expect that the deeper the layer, the closer the pretrained weights are to their optimal values, and the more gently they need to be trained. Whereas we want the last layer (the new head) to be trained hard, i.e., with the highest learning rate possible.
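A sketch of that procedure using fastai's slice syntax (a rough approximation, assuming the `dls` and the lr_find suggestions from earlier; in practice you would usually re-run lr_find after unfreezing):

```python
from fastai.vision.all import *

# assuming `dls` is the DataLoaders built earlier
learn = cnn_learner(dls, resnet50, metrics=error_rate)
lr_min, lr_steep = learn.lr_find()

# train only the custom head (cnn_learner freezes the pretrained body by default)
learn.fit_one_cycle(3, lr_max=lr_steep)

# unfreeze and fine-tune with discriminative learning rates:
# slice(a, b) gives the earliest layer group the rate a, the head the rate b,
# and spaces the groups in between, matching the progressive decrease described above
learn.unfreeze()
learn.fit_one_cycle(6, lr_max=slice(lr_min, lr_steep))
```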

5 Likes

Hi Joseph,

During transfer learning of resnet, I noticed the batchnorm layers don’t become frozen even if you do learn.freeze(). You have to manually turn off the gradients for the batchnorm layers, if you want everything frozen.
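Roughly, this is what I mean by turning the gradients off manually (just a sketch, assuming a fastai Learner named `learn`):

```python
import torch.nn as nn

def freeze_batchnorm(model):
    # learn.freeze() leaves batchnorm layers trainable by default,
    # so walk the model and switch their gradients off explicitly
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            for p in module.parameters():
                p.requires_grad_(False)
            module.eval()  # also stop the running mean/std from updating

freeze_batchnorm(learn.model)
```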

Do you know why the batchnorm layers are not frozen? Is it to get better statistics of the inputs when transfer learning?

Thanks,
Daniel

1 Like

This is interesting! I didn’t know this, and don’t yet know why batchnorm is implemented this way.

@DanielLam, @jcatanza, the short answer is: by leaving batchnorm unfrozen, our model gets better accuracy.

Now the why:
When we use a pretrained model, the batchnorm layers contain the mean, the standard deviation, and the gamma and beta (the two trainable parameters) computed on the pretraining dataset (ImageNet in the case of images).

If we freeze the batchnorm layers, we are feeding the model our data (our images) while normalizing each batch with ImageNet’s mean, standard deviation, gamma, and beta. Those values are off, especially if our images are different from the ImageNet images, so the normalized activations are also off, which leads to less than optimal results.

We keep the batchnorm layers unfrozen because then, while training, for each batch we calculate the mean and standard deviation of the activations of our data (the batch of images), update (train) the corresponding gamma and beta, and use those results to normalize the activations of the current batch. The normalized activations are therefore better aligned with our images (dataset) than those obtained with a frozen batchnorm.

5 Likes

Thanks @farid, excellent explanation!

1 Like

Hi All,
I am trying to use multi-label classification for the bear classification example. When I use MultiCategoryBlock, I get labels with 11 classes (corresponding to the unique characters in the class names) rather than the expected 3.

What am I missing?
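(A guess at the cause, sketched below: if get_y returns a bare string, MultiCategoryBlock splits it into individual characters, which would explain getting one class per unique letter. Wrapping the label in a list should give the expected 3 classes; `path` and the other settings here are just assumptions.)

```python
from fastai.vision.all import *

bears = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    # a bare string here is treated as a sequence of characters;
    # returning a one-element list gives the expected 3 bear classes
    get_y=lambda o: [parent_label(o)],
    item_tfms=Resize(128))

dls = bears.dataloaders(path)  # `path` points at the bear images folder
```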

Hi,

Thanks for the reply. Do you know if this has been published somewhere, or if there is a notebook showing the differences?

I’m quoting Jeremy from lesson 12 about batchnorm:

“Anytime something weird happens to your neural net it’s almost certain it’s because of the batchnorm because batchnorm makes everything weird!”

To answer your question, you might check out the 11a_transfer_learning.ipynb notebook from lesson 12 of the Part 2 2019 course. You can also jump to the portion of the lesson 12 video where Jeremy explains the effect of the mean, the standard deviation, and the batchnorm trainable parameters on training a custom model, which is what I was referring to in my previous post.

Here is a little summary of the experiment he showed in that video:
1 - He created a custom head for his model.
2 - He froze the whole body (including the batchnorm layers) of the pretrained model.
3 - He trained the model for 3 epochs and got 54% accuracy.
4 - He unfroze the whole body, trained the model for 5 epochs, and got 56% accuracy (which was surprisingly low).

Then, he redid the experiment with batchnorm unfrozen from the beginning (i.e., while training the custom head). He showed the following steps:
1 - He froze the whole body except the batchnorm layers of the pretrained model.
2 - He trained the model for 3 epochs and got 58% accuracy (already better than above).
3 - But more importantly, when he unfroze the whole body and trained the model for 5 epochs, he got 70% accuracy (and that’s a huge jump).
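A rough sketch of that “freeze everything except batchnorm” setup in plain PyTorch (my own approximation, not the notebook’s exact code; it assumes the model is an nn.Sequential(body, head) as fastai builds it):

```python
import torch.nn as nn

def freeze_body_except_bn(body):
    # freeze the pretrained body, but keep batchnorm layers trainable so their
    # running statistics and gamma/beta adapt to the new dataset
    for module in body.modules():
        is_bn = isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
        for p in module.parameters(recurse=False):
            p.requires_grad_(is_bn)

freeze_body_except_bn(model[0])  # model[0] = pretrained body, model[1] = custom head
```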

5 Likes

Interesting. Thank you for the response.

1 Like

I believe that yes, this is a binary classification task. You could gather pictures of people wearing eyeglasses and people not wearing eyeglasses; using one of the ImageNet-pretrained models, you should get great results.