Lesson 10 Discussion & Wiki (2019)

The edited video has now been added to the top post.


Arrgh - not clarified yet.

  • Could you show the sequence of functions that turns activations into a loss number, using “binomial loss”?

  • When I run the lesson3-planets multi-label example, the model ends with Linear and the loss function is FlattenedLoss of BCEWithLogitsLoss(). This is defined as sigmoid followed by binary cross entropy loss.
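
For anyone following along, here is a minimal sketch of that sequence in plain PyTorch. The tensors are made up for illustration (they are not from the planets notebook), and I'm assuming the default mean reduction: sigmoid first, then binary cross entropy per label, then the mean over all elements. It should match the fused functional form of BCEWithLogitsLoss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # made-up activations: 4 samples, 3 labels
targets = torch.tensor([[1., 0., 1.],
                        [0., 0., 1.],
                        [1., 1., 0.],
                        [0., 1., 0.]])      # made-up multi-hot labels

# Step by step: sigmoid, then binary cross entropy per element, then the mean.
probs = torch.sigmoid(logits)
per_element = -(targets * probs.log() + (1 - targets) * (1 - probs).log())
manual_loss = per_element.mean()

# Fused version: the same computation in one numerically safer call.
fused_loss = F.binary_cross_entropy_with_logits(logits, targets)
```

The fused call exists mainly for numerical stability, so in practice you would use it rather than the step-by-step version.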

Thanks for sorting this out.

I have some trouble understanding the use of register_buffer().
My questions are:

  1. When should I register a buffer? For what sort of Variables and for which not?
  2. Could someone provide me with a simple example and code snippet of using register_buffer()?

From reading about this in the PyTorch forums, here’s some info for your first question:
"If you have parameters in your model, which should be saved and restored in the state_dict, but not trained by the optimizer, you should register them as buffers.
Buffers won’t be returned in model.parameters(), so that the optimizer won’t have a chance to update them."
I hope to do some work with them tomorrow and, if so, will post a code snippet (assuming someone else doesn’t beat me to it :slight_smile:).


Regarding question 2: here’s the code for batchnorm, where you can see how they register params vs. buffers. Params are learnable (i.e. they receive gradients and are updated by the optimizer), whereas buffers are not, so that’s the main difference:

def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
             track_running_stats=True):
    super(_BatchNorm, self).__init__()
    self.num_features = num_features
    self.eps = eps
    self.momentum = momentum
    self.affine = affine
    self.track_running_stats = track_running_stats
    if self.affine:
        # Learnable: returned by parameters() and updated by the optimizer.
        self.weight = Parameter(torch.Tensor(num_features))
        self.bias = Parameter(torch.Tensor(num_features))
    else:
        self.register_parameter('weight', None)
        self.register_parameter('bias', None)
    if self.track_running_stats:
        # Not learnable: saved in the state_dict but skipped by the optimizer.
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
    else:
        self.register_parameter('running_mean', None)
        self.register_parameter('running_var', None)
        self.register_parameter('num_batches_tracked', None)
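
For a smaller, self-contained example, here is a rough sketch of a toy module. RunningMean and its momentum argument are hypothetical names of mine, not anything from fastai or PyTorch. It has one learnable parameter and one buffer, and you can check that only the parameter shows up in named_parameters() while both end up in the state_dict.

```python
import torch
from torch import nn

class RunningMean(nn.Module):
    """Hypothetical toy module: scales inputs by a learnable weight and
    tracks a running mean of the inputs in a buffer."""
    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # Parameter: seen by the optimizer via parameters().
        self.weight = nn.Parameter(torch.ones(num_features))
        # Buffer: saved in state_dict and moved by .to()/.cuda(), but not trained.
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):
        if self.training:
            with torch.no_grad():
                # In-place update, so the registered buffer object is kept.
                self.running_mean.lerp_(x.mean(dim=0), self.momentum)
        return x * self.weight

m = RunningMean(3)
param_names = [n for n, _ in m.named_parameters()]   # only the weight
buffer_names = [n for n, _ in m.named_buffers()]     # only the running mean
state_keys = list(m.state_dict().keys())             # both
```

So the rule of thumb would be: use a buffer for any tensor that is part of the module's state but should not receive optimizer updates.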

Hope that helps!


Hmm, do you mean multi-label or multi-CLASS? I think so far the multi-class default is categorical cross entropy using softmax, and the multi-label default is binary cross entropy using sigmoid. (These are also the defaults in the fastai library, based on the label class.)

So from my understanding of Jeremy in the lecture, it would often make sense for real-world multi-class problems to not use softmax but rather the binary cross entropy (multi-label) version, and then use thresholds and/or argmax on the results to figure out the single class. That way we also get per-class probabilities, undistorted by softmax, so we can distinguish the given classes from “background”/“no label” when the probabilities are small for all of the classes. Is this what he meant?

This would finally answer my question asked during v3 part 1 :wink: :

from here.


Thanks @deena-b, I’ll look into VS Code soon. I’ve used vim while writing bash scripts, but I kept forgetting the commands and deleting my code. I’ll text you tomorrow.

Yes, those are great takeaways; I’ll write them down.

That is exactly how I understood it, too. I was a bit surprised by framing it as “whether to use softmax or not” rather than emphasizing from the beginning that it is about “single label from multiple classes” vs. “multi-label”, but this might be me not listening closely enough.
What confused me a bit was the “binomial” heading in the rightmost excel table, because I’m more familiar with calling it “multi-label binary classification” (e.g. the Wikipedia article on multi-label classification has some discussion of binary classification being a common approach for multi-label, and of possible alternatives).

Best regards



I was looking at the new version of the Runner class, and I realised that we may have lost the ability for a callback to return True, is that correct?

Since res is set to False at the start, and we are using the ‘and’ operator, this effectively means that no matter what the callbacks return, res will be ultimately False, right?
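
To make the concern concrete, here is a stripped-down sketch of the pattern. These two helper functions are hypothetical stand-ins of mine, not the actual Runner code; they just fold a list of callback return values the same way. Combining with `or` is one possible alternative, and I'm not claiming it is the intended fix.

```python
def run_callbacks_and(results):
    """Mimics the pattern described above: start at False, combine with `and`."""
    res = False
    for r in results:
        res = r and res      # `x and False` can never become True
    return res

def run_callbacks_or(results):
    """One possible alternative: start at False, combine with `or`."""
    res = False
    for r in results:
        res = r or res       # becomes True as soon as any callback returns True
    return res

always_false = run_callbacks_and([True, True, True])
any_true = run_callbacks_or([False, True, False])
```

With boolean return values, the `and` version returns False even when every callback returns True, which matches the reading in the post above.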


I remember Graham Neubig saying that batch size is a hyperparameter. Can someone explain that? What is the difference between a batch size of 32 and one of 128, in addition to the speed?


One way to see this is to consider the “steps per sample” metric. If you halve the batch size, you’re making twice as many steps per sample. This is why doubling the batch size is often considered to have a similar effect to halving the learning rate: if all steps were in the same direction and you used vanilla SGD, you’d be in the same place at the end of an epoch.
One crucial assumption here is that, as is customary, the loss is normalized (by averaging over the samples in the batch). If you sum instead, the learning rate would need to stay the same to get the same result.
What this “back of the envelope” argument neglects is a) normalization and the effect of step size there, as discussed in the lecture, and b) momentum, adaptiveness (in Adam/RMSProp) and co.
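
The back-of-the-envelope argument can be written out as a tiny calculation. This is only a sketch under the idealized assumptions stated above (vanilla SGD, averaged loss, every sample giving the same gradient); the function name and the numbers are made up for illustration.

```python
def displacement_after_epoch(n_samples, batch_size, lr, grad=1.0):
    """Total distance moved in one epoch of vanilla SGD when every sample has
    the same gradient `grad` and the loss is averaged over the batch."""
    n_steps = n_samples // batch_size
    step = lr * grad      # the mean of identical per-sample gradients is just `grad`
    return n_steps * step

base = displacement_after_epoch(n_samples=1024, batch_size=32, lr=0.1)
bigger_batch = displacement_after_epoch(n_samples=1024, batch_size=64, lr=0.1)
smaller_lr = displacement_after_epoch(n_samples=1024, batch_size=32, lr=0.05)
```

Here bigger_batch and smaller_lr come out the same (up to floating point), and both are half of base, which is the sense in which doubling the batch size resembles halving the learning rate.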

Best regards



Might be some basic mistake here. I’m confused by the different behaviors of numpy and torch:

np.array([10, 20]).var()

np.array([10, 20]).std()

torch.tensor([10., 20.]).var()

torch.tensor([10., 20.]).std()

In torch’s case they don’t seem to be taking the mean of the squared deviations for the variance. Is this a bug?

I dug further into this, and it looks like there is an arg called “unbiased”; if I set it to False, it matches numpy.

torch.tensor([10., 20.]).var(unbiased=False)

torch.tensor([10., 20.]).std(unbiased=False)
If unbiased is False , then the standard-deviation will be calculated via the biased estimator. Otherwise, Bessel’s correction will be used.
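
The two conventions differ only in the divisor: n for the biased estimator (numpy's default, ddof=0) and n - 1 with Bessel's correction (torch's default). Here is a small sketch using Python's standard statistics module rather than numpy or torch, just to show both numbers for the same data. Going the other way, passing ddof=1 to numpy's var/std should give the Bessel-corrected values.

```python
import statistics

data = [10.0, 20.0]

# Divide by n (biased / population estimator): what numpy computes by default.
biased_var = statistics.pvariance(data)
biased_std = statistics.pstdev(data)

# Divide by n - 1 (Bessel's correction): what torch computes by default.
unbiased_var = statistics.variance(data)
unbiased_std = statistics.stdev(data)
```

With only two data points the divisor changes from 2 to 1, so the corrected variance is exactly twice the biased one.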


Oh silly me - I meant to say “binary” but wrote “binomial” then just read what was there rather than actually thinking about it! Thanks for pointing this out.


One thing I didn’t quite understand is Jeremy said softmax should not be used, but everyone uses it. What should be used instead? Or did I misunderstand?

Sigmoid and binary log likelihood.


Just check this paper - https://arxiv.org/pdf/1606.02228.pdf

Came here to ask this question after listening to the softmax part of Lesson 10. I would really appreciate any advice on how we can handle “not any of these” classes in single label classification problems. For example, I am doing the TensorFlow Speech Challenge on Kaggle, and there are 10 classes each for a one word spoken command like “yes”, “stop”, “go”, as well as 2 classes for “silence”, and “unknown” for any other word or utterance that doesn’t match.

Up to this point I’ve been using resnet34 with 12 classes as if they were all the same, training “unknown” with words and noises that aren’t silence or any of the other 10 classes. But from what Jeremy is saying, it sounds like it would be better to have 11 classes and, instead of using softmax as my final activation, take the argmax, predicting “unknown” if it doesn’t meet a certain absolute threshold. My concrete questions are:

  • If I do remove “unknown” as a class in the initial stages of training, is there a way to still use my “unknown” data in a useful way?
  • Where in my code do I go to stop using softmax? I looked in learn.model but don’t see it in the final layers, is it there by another name? or am I misunderstanding and softmax isn’t used in resnet34?

Thank you all!


It can be included in the loss function and, therefore, you would not find it in the model.
See for example the cross entropy loss in PyTorch which “combines nn.LogSoftmax() and nn.NLLLoss() in one single class.”
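
A quick way to convince yourself of this is to compare the two routes on made-up data. This is a minimal sketch with random logits and arbitrary targets (not from any lesson notebook), using the default mean reduction for both losses.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 4)                 # made-up activations: 5 samples, 4 classes
targets = torch.tensor([0, 3, 1, 2, 3])    # made-up class indices

# Two explicit steps: log-softmax over the class dimension, then NLL loss.
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# The fused loss, applied directly to the raw model outputs.
fused = nn.CrossEntropyLoss()(logits, targets)
```

Because the softmax lives inside the loss, the model itself ends with a plain Linear layer, which is why it does not appear in learn.model.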


The loss function is not part of the model. You can see the loss function that was automatically chosen by fastai with

To change the loss function, simply reassign it. Take a look at fastai’s BCEWithLogitsFlat for a likely candidate. The function it returns applies sigmoid, then binary cross entropy.

Once you train using BCEWithLogitsFlat, you’ll need to apply sigmoid to the predicted output activations in order to convert them to probabilities. The last time I checked, learn.get_preds outputs activations when it does not recognize your loss function; if it does recognize, it returns probabilities. But to be sure you should check what it is doing by looking at its outputs or by tracing code.
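
As a rough sketch of that last step in plain PyTorch: apply sigmoid to the raw activations, take the most probable class, and fall back to “unknown” when even the best probability is below a cutoff. The class names, the 0.5 threshold and the predict_with_unknown helper are all hypothetical choices of mine for illustration; none of this is a fastai API.

```python
import torch

# Hypothetical class names and threshold, for illustration only.
classes = ['yes', 'stop', 'go']
threshold = 0.5

def predict_with_unknown(logits, classes, threshold):
    """Return one label per row: the argmax class, or 'unknown' if its
    sigmoid probability is below `threshold`."""
    probs = torch.sigmoid(logits)              # independent per-class probabilities
    top_prob, top_idx = probs.max(dim=1)
    return [classes[i] if p >= threshold else 'unknown'
            for p, i in zip(top_prob.tolist(), top_idx.tolist())]

logits = torch.tensor([[ 3.0, -2.0, -1.0],     # one class clearly stands out
                       [-3.0, -2.5, -4.0]])    # nothing scores highly
labels = predict_with_unknown(logits, classes, threshold)
```

The threshold would need tuning on a validation set; 0.5 is just a placeholder.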

HTH, and experts please correct my errors!


If it’s helpful, I covered the question of “which loss function do I use for data that’s multi-class AND multi-label” in my talk on the Human Protein Image Classification Kaggle competition: https://youtu.be/O5eHvucGTk4?t=1150