Lesson 10 Discussion & Wiki (2019)

Guess it's the opposite:
“Parameters are things we learn; activations are things we calculate.” I.e. we also have activations when doing validation.
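
A tiny PyTorch sketch of the distinction (layer sizes and batch size are arbitrary, just for illustration):

import torch
import torch.nn as nn

lin = nn.Linear(4, 2)                        # weight and bias are parameters: learned by the optimizer
x = torch.randn(3, 4)                        # a batch of inputs
acts = lin(x)                                # activations: calculated from inputs and parameters, not learned
print([p.shape for p in lin.parameters()])   # the learnable parameters
print(acts.shape)                            # activations are computed in training and validation alike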

5 Likes

https://www.youtube.com/watch?v=hcJAWKdawuM We will all win the Kaggle competition :-))

1 Like

For anyone who needs to review batch norm, here is Jeremy’s part 1 video:
https://course.fast.ai/videos/?lesson=6

It was mentioned in the lecture that doing batch norm with an RNN/LSTM is not straightforward, but there are a few TensorFlow implementations and a couple of papers discussing how to do it. Does that mean it’s possible but not advisable to use batch norm with an RNN, or that these papers are not getting it right?

Can you link to the papers?


https://openreview.net/pdf?id=r1VdcHcxx

That’s a student project where it looks like they couldn’t get anything to work.

They say “we recommend using separate statistics for each timestep to preserve information of the initial transient phase in the activations”. That’s the thing we were trying to avoid. It would be OK for short sequences that don’t vary in length too much, but it seems difficult to do that in practice if you need to handle (say) sequences of 3000 words or so and which vary in length a lot (like in IMDb).

2 Likes

Just a proposal! Since I see almost no discussion about study groups despite Jeremy’s post, why don’t we use a Kaggle competition like the one linked above to motivate people to cooperate? Sharing a common goal could be very useful. Moreover, let’s form study groups randomly! But, of course, that last part is up to the administrators, if the idea is widely supported.

That is the post on colorful histograms :wink:

2 Likes

To clarify my understanding…

  • After sigmoid, you’d apply cross entropy to get the loss?

  • Sigmoid + cross entropy loss is what multi-label classification uses by default?

No, sigmoid + cross entropy is binary cross entropy, and it’s not the default for multi-label classification. That default is softmax + cross entropy.

The edited video has now been added to the top post.

6 Likes

Arrgh - not clarified yet.

  • Could you show the sequence of functions that turns activations into a loss number, using “binomial loss”?

  • When I run the lesson3-planets multi-label example, the model ends with Linear and the loss function is FlattenedLoss of BCEWithLogitsLoss(), which is defined as sigmoid followed by binary cross entropy loss (a quick check of that equivalence is sketched below).
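
For reference, a minimal PyTorch check of that definition (random tensors, purely illustrative):

import torch
import torch.nn as nn

logits = torch.randn(8, 5)                                 # raw final-layer activations
targets = torch.randint(0, 2, (8, 5)).float()              # multi-label 0/1 targets

loss_fused = nn.BCEWithLogitsLoss()(logits, targets)       # sigmoid + BCE fused (numerically stable)
loss_split = nn.BCELoss()(torch.sigmoid(logits), targets)  # explicit sigmoid, then BCE
print(torch.allclose(loss_fused, loss_split, atol=1e-6))   # True, up to floating point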

Thanks for sorting this out.

I have some trouble understanding the use of register_buffer().
My questions are:

  1. When should I register a buffer? For what sort of variables should I use it, and for which not?
  2. Could someone provide me with a simple example and code snippet of using register_buffer()?
1 Like

From reading about this in the PyTorch forums, here’s some info for your first question:
"If you have parameters in your model which should be saved and restored in the state_dict, but not trained by the optimizer, you should register them as buffers.
Buffers won’t be returned in model.parameters(), so the optimizer won’t have a chance to update them."
I hope to do some work with them tomorrow and if so will post a code snippet (assuming someone else doesn’t beat me to it :slight_smile:)

5 Likes

Regarding question 2, here’s the code for batchnorm, and you can see how they register params vs. buffers. Params are learnable (i.e. they get gradients) while buffers are not, so that’s the main difference:

def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
             track_running_stats=True):
    super(_BatchNorm, self).__init__()
    self.num_features = num_features
    self.eps = eps
    self.momentum = momentum
    self.affine = affine
    self.track_running_stats = track_running_stats
    if self.affine:
        # learnable affine parameters: returned by model.parameters() and updated by the optimizer
        self.weight = Parameter(torch.Tensor(num_features))
        self.bias = Parameter(torch.Tensor(num_features))
    else:
        self.register_parameter('weight', None)
        self.register_parameter('bias', None)
    if self.track_running_stats:
        # running statistics: buffers, saved in the state_dict but never trained
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
    else:
        self.register_parameter('running_mean', None)
        self.register_parameter('running_var', None)
        self.register_parameter('num_batches_tracked', None)
    self.reset_parameters()
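
To make the difference concrete, here’s a quick check using nn.BatchNorm1d (the feature size is arbitrary): buffers show up in state_dict() but not in parameters():

import torch.nn as nn

bn = nn.BatchNorm1d(4)
print(sorted(name for name, _ in bn.named_parameters()))  # ['bias', 'weight'] -> trained by the optimizer
print(sorted(bn.state_dict().keys()))                     # also includes num_batches_tracked, running_mean, running_var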

Hope that helps!

4 Likes

Hmm, do you mean multi-label or multi-CLASS? I think so far the multi-class default is categorical cross entropy using softmax, and the multi-label default is binary cross entropy using sigmoid. (These are also the defaults in the fastai library, based on the label class.)

So from my understanding of Jeremy in the lecture, it would often make sense for real-world multi-class problems not to use softmax but rather the binary cross entropy (multi-label) version, and then use thresholds and/or argmax on the results to figure out the single class. That way we also get the per-class probabilities, undistorted by softmax, so we can differentiate the given classes from “background”/“no label” when the probabilities are small for all of the classes. Is this what he meant? A sketch of the idea is below.
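
A minimal PyTorch sketch of that approach (the class count, threshold, and tensors are made up for illustration):

import torch
import torch.nn as nn

n_classes, thresh = 5, 0.5                           # illustrative values
logits = torch.randn(8, n_classes)                   # final-layer activations for a batch
targets = torch.randint(0, n_classes, (8,))          # single-label (multi-class) targets

# train with sigmoid + binary cross entropy, treating each class as an independent yes/no
one_hot = nn.functional.one_hot(targets, n_classes).float()
loss = nn.BCEWithLogitsLoss()(logits, one_hot)

# at inference: per-class probabilities, then threshold and/or argmax
probs = torch.sigmoid(logits)                        # each class scored independently, no softmax
pred = probs.argmax(dim=1)                           # pick the single most likely class...
no_label = probs.max(dim=1).values < thresh          # ...or report "no label" if nothing is confident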

This would finally answer my question asked during v3 part 1 :wink: :

from here.

6 Likes

Thanks @deena-b, I’ll look into VS Code soon. I’ve used vim while writing bash scripts, but I kept forgetting the commands and deleting my code. I’ll text you tomorrow.

Yes, those are great takeaways; I’ll write them down.

I was looking at the new version of the Runner class, and I realised that we may have lost the ability for a callback to return True. Is that correct?

Since res is set to False at the start and we are using the ‘and’ operator, this effectively means that no matter what the callbacks return, res will ultimately be False, right?
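
A minimal sketch of the pattern being described (illustrative only, not the actual Runner code):

def run_callbacks(callbacks):
    res = False
    for cb in callbacks:
        # cb() is evaluated first, then and-ed with res; since res starts out False,
        # the combined result is always False, so a True return can never propagate
        res = cb() and res
    return res

print(run_callbacks([lambda: True, lambda: True]))   # prints False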

3 Likes

I remember Graham Neubig saying that batch size is a hyperparameter. Can someone explain that? What is the difference between a batch size of 32 and one of 128, apart from the speed?

1 Like