Lesson 10 Discussion & Wiki (2019)

Hmm, do you mean multi-label or multi-class? As far as I know, the multi-class default is categorical cross-entropy with softmax, and the multi-label default is binary cross-entropy with sigmoid. (These are also the defaults the fastai library picks based on the label class.)

So, from my understanding of Jeremy in the lecture, it often makes sense for real-world multi-class problems to use the binary cross-entropy (multi-label) version instead of softmax, and then apply thresholds and/or argmax to the results to pick the single class. That way we also get per-class probabilities, undistorted by softmax, which lets us tell the given classes apart from “background”/“no label” when the probabilities are small for all of the classes. Is this what he meant?
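To make the “undistorted by softmax” point concrete, here is a minimal sketch in plain Python. The activations are made up for illustration (they stand for an input that matches none of the classes), and the helper functions are my own, not fastai or PyTorch code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical activations: every class scores strongly negative,
# but the first one is slightly less negative than the others.
logits = [-4.0, -6.0, -6.0]

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]

print([round(p, 3) for p in soft])  # softmax only sees relative differences
print([round(p, 3) for p in sig])   # sigmoid keeps every class near zero
```

Softmax must distribute a total probability of 1 across the classes, so the least-bad class looks confident. The per-class sigmoid values all stay small, which is what makes an absolute threshold for “no label” possible.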

This would finally answer the question I asked during v3 part 1 :wink:


Thanks @deena-b, I’ll look into VS Code soon. I’ve used vim while writing bash scripts, but I kept forgetting the commands and deleting my code. I’ll text you tomorrow.

Yes, those are great takeaways; I’ll write them down.

I was looking at the new version of the Runner class, and I realised that we may have lost the ability for a callback to return True, is that correct?

Since res is set to False at the start, and we are using the ‘and’ operator, this effectively means that no matter what the callbacks return, res will ultimately be False, right?
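The behaviour described above can be reproduced with a simplified stand-in for the callback loop. This is a sketch with hypothetical function names, not the actual Runner code; the ‘or’ variant is just one possible way to let a True result through.

```python
# Simplified stand-in for the callback loop; all names here are hypothetical.
def run_callbacks_and(callbacks):
    res = False
    for cb in callbacks:
        res = cb() and res  # anything and-ed with the initial False stays falsy
    return res

def run_callbacks_or(callbacks):
    res = False
    for cb in callbacks:
        res = cb() or res   # cb() is always called; any True makes res True
    return res

def always_true():
    return True

def always_false():
    return False

print(run_callbacks_and([always_true, always_true]))
print(run_callbacks_or([always_false, always_true]))
```

In both variants every callback still runs, because cb() sits on the left of the operator and is evaluated first.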


I remember Graham Neubig saying that batch size is a hyperparameter. Can someone explain that? What difference does a batch size of 32 instead of 128 make, apart from speed?

I may be making a basic mistake here, but I’m confused by the different behaviors of numpy and torch:

np.array([10, 20]).var()

np.array([10, 20]).std()

torch.tensor([10., 20.]).var()

torch.tensor([10., 20.]).std()

In torch’s case the variance does not seem to be the mean of the squared deviations. Is this a bug?

I dug further into this, and it looks like there is an argument called “unbiased”; if I set it to False, the results match numpy.

torch.tensor([10., 20.]).var(unbiased=False)

torch.tensor([10., 20.]).std(unbiased=False)
The PyTorch docs for std say: “If unbiased is False, then the standard-deviation will be calculated via the biased estimator. Otherwise, Bessel’s correction will be used.”
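The two estimators can be compared without numpy or torch. The sketch below uses Python’s statistics module as a stand-in: pvariance/pstdev divide by n (numpy’s default, ddof=0), while variance/stdev divide by n - 1 (Bessel’s correction, torch’s default here).

```python
import statistics

data = [10.0, 20.0]

# Biased (population) estimator: sum of squared deviations divided by n.
biased_var = statistics.pvariance(data)
biased_std = statistics.pstdev(data)

# Unbiased (sample) estimator with Bessel's correction: divided by n - 1.
unbiased_var = statistics.variance(data)
unbiased_std = statistics.stdev(data)

print(biased_var, biased_std)      # 25.0 5.0
print(unbiased_var, unbiased_std)  # 50.0 and roughly 7.07
```

With only two samples, n - 1 is half of n, so the unbiased variance is exactly double the biased one. In numpy, passing ddof=1 to var() or std() gives the Bessel-corrected values instead.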

Oh silly me - I meant to say “binary” but wrote “binomial” then just read what was there rather than actually thinking about it! Thanks for pointing this out.


One thing I didn’t quite understand is Jeremy said softmax should not be used, but everyone uses it. What should be used instead? Or did I misunderstand?

Sigmoid and binary log likelihood.


Just check this paper - https://arxiv.org/pdf/1606.02228.pdf

I came here to ask this question after listening to the softmax part of Lesson 10. I would really appreciate any advice on how to handle “not any of these” classes in single-label classification problems. For example, I am doing the TensorFlow Speech Challenge on Kaggle, which has 10 classes, each a one-word spoken command like “yes”, “stop”, or “go”, plus 2 more classes: “silence”, and “unknown” for any other word or utterance that doesn’t match.

So far I’ve been using resnet34 with 12 classes, treating them all the same and training “unknown” on words and noises that aren’t silence or any of the other 10 classes. From what Jeremy is saying, though, it sounds like it would be better to have 11 classes and, instead of using softmax as my final activation, take the argmax and predict “unknown” if the top value doesn’t meet a certain absolute threshold. My concrete questions are:

  • If I do remove “unknown” as a class in the initial stages of training, is there a way to still use my “unknown” data in a useful way?
  • Where in my code do I go to stop using softmax? I looked in learn.model but don’t see it in the final layers. Is it there under another name, or am I misunderstanding and softmax isn’t used in resnet34?

Thank you all!


It can be included in the loss function and, therefore, you would not find it in the model.
See for example the cross entropy loss in PyTorch which “combines nn.LogSoftmax() and nn.NLLLoss() in one single class.”
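A plain-Python sketch of that combination, to show why the model itself can output raw activations. The function names mirror the PyTorch ones, but these are my own simplified single-example versions, and the logits are made up.

```python
import math

def log_softmax(logits):
    # log(softmax(x)) computed stably via the log-sum-exp trick.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def nll_loss(log_probs, target):
    # Negative log-likelihood of the target class.
    return -log_probs[target]

def cross_entropy(logits, target):
    # The softmax lives inside the loss, so the model outputs raw logits.
    return nll_loss(log_softmax(logits), target)

logits = [2.0, 1.0, 0.1]  # hypothetical raw model outputs
loss = cross_entropy(logits, target=0)
print(round(loss, 4))
```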


The loss function is not part of the model. You can see the loss function that was automatically chosen by fastai with

learn.loss_func

To change the loss function, simply reassign it. Take a look at fastai’s BCEWithLogitsFlat for a likely candidate. The function it returns applies sigmoid, then binary cross entropy.

Once you train using BCEWithLogitsFlat, you’ll need to apply sigmoid to the predicted output activations in order to convert them to probabilities. The last time I checked, learn.get_preds outputs activations when it does not recognize your loss function; if it does recognize it, it returns probabilities. But to be sure, you should check what it is doing by looking at its outputs or by tracing the code.
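For the “unknown” question, the post-processing step might look roughly like this. It is only a sketch: the predict helper, the 0.5 threshold, and the activations are all assumptions for illustration, and the threshold would need tuning on validation data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(activations, classes, threshold=0.5, fallback="unknown"):
    # Convert raw activations to independent per-class probabilities.
    probs = [sigmoid(a) for a in activations]
    # argmax over the classes.
    best = max(range(len(probs)), key=probs.__getitem__)
    # Fall back when even the best class is below the absolute threshold.
    if probs[best] < threshold:
        return fallback, probs[best]
    return classes[best], probs[best]

classes = ["yes", "stop", "go"]
print(predict([3.0, -2.0, -1.0], classes))   # confident in one class
print(predict([-3.0, -2.5, -4.0], classes))  # nothing clears the threshold
```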

HTH, and experts please correct my errors!


If it’s helpful, I covered the question of “which loss function do I use for data that’s multi-class AND multi-label” in my talk on the Human Protein Image Classification Kaggle competition: https://youtu.be/O5eHvucGTk4?t=1150


Hi Stas. I am wondering about the right way to keep fastai updated after installing pytorch-nightly for the course.

And should we keep updating pytorch-nightly using:

conda install -c pytorch pytorch-nightly

My current pytorch is:

pytorch-nightly 1.0.0.dev20190405 py3.7_cuda10.0.130_cudnn7.4.2_0 pytorch
nvidia driver is 418.56

and everything works.

To update fastai I tried both:

conda install -c pytorch -c fastai fastai
conda install -c fastai fastai 


The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::python-graphviz==0.8.4=py36_1
  - pytorch/noarch::torchvision==0.2.1=py_2
  - anaconda/linux-64::py-opencv==3.4.2=py36hb342d67_1
  - defaults/linux-64::psutil==5.4.7=py36h14c3975_0
  - defaults/linux-64::simplegeneric==0.8.1=py36_2
  - defaults/linux-64::qtpy==1.5.2=py36_0
  - fastai/noarch::fastai==1.0.51=1

The former command wants to install:

The following NEW packages will be INSTALLED:

  pytorch            pytorch/linux-64::pytorch-1.0.1-py3.7_cuda10.0.130_cudnn7.4.2_2

The following packages will be UPDATED:

  torchvision                                    0.2.1-py_2 --> 0.2.2-py_3

The latter command wants to install:

The following NEW packages will be INSTALLED:

  cudnn              pkgs/main/linux-64::cudnn-7.3.1-cuda10.0_0
  pytorch            pkgs/main/linux-64::pytorch-1.0.1-cuda100py37he554f03_0

Please advise. BTW, I’m a Linux ignoramus regarding packages. Thanks!

Thanks Malcolm, this helps a lot. I’ll try to get it working and then report back here on what worked and what didn’t. Cheers.

You’re correct, @Pomo.

If you’re using pip, this is just a straightforward:

pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu100/torch_nightly.html
# replace cu100 with cu90 if you’re on cuda-9.0

With conda it’s tricky. fastai depends on the pytorch package, which conda treats as a different package from pytorch-nightly, so when you update fastai, conda may reinstall pytorch (not nightly) and you would lose pytorch-nightly.

There are 3 ways you can go about it:

  1. use a dedicated conda environment for the part2 lessons, which requires no fastai

    conda create -y python=3.7 --name fastai-part2
    conda activate fastai-part2
    conda install -y -c pytorch pytorch-nightly torchvision

    i.e. don’t install fastai in that conda env.

  2. install pytorch-nightly via pip into your normal fastai conda environment - conda won’t know that you did that and won’t overwrite it

    conda install -c pytorch -c fastai fastai
    pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu100/torch_nightly.html
    # replace cu100 with cu90 if you’re on cuda-9.0

  3. always force reinstall pytorch-nightly after pytorch gets installed/updated in conda

    conda install -c pytorch -c fastai fastai
    conda install -c pytorch pytorch-nightly --force-reinstall

I updated the first post with this info. Let me know if you have difficulties with any of them.

And should we keep updating pytorch-nightly using:
conda install -c pytorch pytorch-nightly

You probably won’t need to update it again. pytorch-nightly was required in this lesson because of a pytorch bug that was fixed recently, so chances are that whatever pytorch-nightly you installed a few days ago will be fine for the rest of the course. And pytorch-1.1.0 will be released soon, so fastai will require that instead.

To see how the pytorch and pytorch-nightly packages overwrite each other:

conda uninstall pytorch pytorch-nightly torchvision
conda install -c pytorch pytorch torchvision
python -c 'import sys, torch; print(torch.__version__, sys.modules["torch"])'
1.0.1.post2       <module 'torch' from '.../site-packages/torch/__init__.py'>

Once we install pytorch-nightly, it overwrites pytorch, yet conda thinks they are 2 totally different packages:

conda install -c pytorch pytorch-nightly
python -c 'import sys, torch; print(torch.__version__, sys.modules["torch"])'
1.0.0.dev20190405 <module 'torch' from '.../site-packages/torch/__init__.py'>

I trimmed the output path so that it fits into the width, but the purpose is to show that it loads the exact same path in both cases.


I’m curious, does anyone know which features necessitated use of nightly pytorch for this lesson? I guess I haven’t kept up to date enough on the changelog…

Batch size is the number of examples used to compute each parameter update. The larger the batch size, the less ‘noisy’ the parameter updates are. Batch size is a hyperparameter because it affects model performance. You tune it by varying it until you get the best performance.
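The ‘noise’ part can be seen in a toy simulation. This is a sketch under a strong simplification: each example’s gradient is modelled as a single number drawn from a Gaussian (true mean 1.0), and a batch update is just the average of the sampled values. All names and numbers are made up.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example gradients: true mean 1.0 plus unit Gaussian noise.
per_example_grads = [random.gauss(1.0, 1.0) for _ in range(20000)]

def batch_gradient_std(batch_size, n_batches=500):
    # Spread of the batch-averaged gradient across many random batches.
    estimates = []
    for _ in range(n_batches):
        batch = random.sample(per_example_grads, batch_size)
        estimates.append(sum(batch) / len(batch))
    return statistics.stdev(estimates)

std_32 = batch_gradient_std(32)
std_128 = batch_gradient_std(128)
print(std_32, std_128)
```

The spread of the averaged gradient shrinks roughly with the square root of the batch size, so going from 32 to 128 should about halve it. Whether less noise actually helps the final model is an empirical question, which is why batch size gets tuned.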

Hi @tanyaroosta

Jeremy said that softmax is the correct way to compute class probabilities for “multi-class” problems, where every example belongs to one and only one class.

And Jeremy says that softmax should not be used to compute class probabilities for “multi-label” problems, where an example can belong to more than one class. An alternative for the “multi-label” case is to apply a sigmoid to the output for each label and train with binary cross-entropy. This is the method Jeremy advocates.

My own question is: why is it wrong to use softmax to compute class probabilities in the “multi-label” case?
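One way to look at it, as a small sketch with made-up activations for an example that carries two labels at once (helper functions are my own, in plain Python):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical activations: labels 0 and 1 are both strongly present, label 2 is absent.
logits = [4.0, 4.0, -4.0]

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]

print([round(p, 3) for p in soft])  # the two present labels split the probability mass
print([round(p, 3) for p in sig])   # each present label gets its own high probability
```

Because softmax outputs sum to 1, two labels that are both present can never each get a probability above 0.5, whereas per-label sigmoids are independent of one another.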

It’s really just the var() taking multiple dims. We also hoped to get to some JIT in the lesson, but we’ve pushed that back a little. (Maybe next week.)