Lesson 10 Discussion & Wiki (2019)

It was mentioned in the lecture that doing batch norm with RNN/LSTM is not straightforward, but there are a few TensorFlow implementations and a couple of papers discussing how to do it. Does it mean it's possible but not advisable to use batch norm with RNNs, or that these papers are not getting it right?

Can you link to the papers?


https://openreview.net/pdf?id=r1VdcHcxx

That's a student project where it looks like they couldn't get anything to work.

They say "we recommend using separate statistics for each timestep to preserve information of the initial transient phase in the activations". That's the thing we were trying to avoid. It would be OK for short sequences that don't vary in length too much, but it seems difficult to do in practice if you need to handle (say) sequences of around 3000 words that vary in length a lot (like in IMDb).
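To make the scaling issue concrete, here is a rough sketch of what "separate statistics for each timestep" would look like in PyTorch (a made-up toy module, not code from the paper): you need one set of running statistics per position, so a 3000-step sequence needs 3000 BatchNorm layers, and the later positions only ever see the few long sequences in a batch.

import torch
import torch.nn as nn

class PerTimestepBNRNN(nn.Module):
    "Toy RNN with one BatchNorm1d per timestep (illustration only)."
    def __init__(self, input_size, hidden_size, max_len):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        # one set of running statistics for every position up to max_len
        self.bns = nn.ModuleList([nn.BatchNorm1d(hidden_size) for _ in range(max_len)])

    def forward(self, x):                      # x: (seq_len, batch, input_size)
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        outs = []
        for t in range(x.size(0)):
            h = self.cell(x[t], h)
            h = self.bns[t](h)                 # timestep-specific statistics
            outs.append(h)
        return torch.stack(outs)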


Just a proposal! Since I see almost no discussion about study groups despite Jeremy's post, why don't we use a Kaggle competition like the one linked above to motivate people to cooperate? Sharing a common goal could be very useful. Moreover, let's form study groups randomly! But, of course, the latter is up to the administrators, if this idea is widely supported.

That is the post on colorful histograms :wink:


To clarify my understanding…

  • After sigmoid, you'd apply cross entropy to get the loss?

  • Sigmoid + cross entropy loss is what multi-label classification uses by default?

No, sigmoid + binary cross entropy is binary cross entropy, and it's not the default for multi-label classification. That default is softmax + cross entropy.

The edited video has now been added to the top post.


Arrgh - not clarified yet.

  • Could you show the sequence of functions that turns activations into a loss number, using "binomial loss"?

  • When I run the lesson3-planets multi-label example, the model ends with a Linear layer and the loss function is FlattenedLoss of BCEWithLogitsLoss(). This is defined as sigmoid followed by binary cross entropy loss.
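As a quick sanity check of that definition (made-up tensors, not from the notebook), BCEWithLogitsLoss should match applying sigmoid and then binary cross entropy:

import torch
import torch.nn.functional as F

logits  = torch.randn(4, 5)                     # pretend activations from the final Linear layer
targets = torch.randint(0, 2, (4, 5)).float()   # multi-label 0/1 targets

a = F.binary_cross_entropy_with_logits(logits, targets)
b = F.binary_cross_entropy(torch.sigmoid(logits), targets)
print(torch.allclose(a, b))                     # True (up to floating point)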

Thanks for sorting this out.

I have some trouble understanding the use of register_buffer().
My questions are:

  1. When should I register a buffer? For what sort of Variables should I use it, and for which not?
  2. Could someone provide me with a simple example and code snippet of using register_buffer()?

From reading about this in the PyTorch forums, here's some info for your first question:
"If you have parameters in your model which should be saved and restored in the state_dict, but not trained by the optimizer, you should register them as buffers. Buffers won't be returned in model.parameters(), so the optimizer won't have a chance to update them."
I hope to do some work with them tomorrow and if so will post a code snippet (assuming someone else doesn't beat me to it :slight_smile:
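In the meantime, here's a tiny toy sketch of the difference (a made-up module, just to illustrate the quote above):

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3))             # trained by the optimizer
        self.register_buffer('running_mean', torch.zeros(3))   # saved/restored, but never trained

m = Toy()
print([name for name, _ in m.named_parameters()])  # ['weight']  -> what the optimizer sees
print(list(m.state_dict().keys()))                 # ['weight', 'running_mean']  -> what gets saved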


Regarding question 2 - here's the code for batchnorm, and you can see how they register params vs buffers. Params are learnable (i.e. they receive gradients) while buffers are not, so that's the main difference:

def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True,
             track_running_stats=True):
    super(_BatchNorm, self).__init__()
    self.num_features = num_features
    self.eps = eps
    self.momentum = momentum
    self.affine = affine
    self.track_running_stats = track_running_stats
    if self.affine:
        # learnable scale and shift: registered as Parameters, so the optimizer updates them
        self.weight = Parameter(torch.Tensor(num_features))
        self.bias = Parameter(torch.Tensor(num_features))
    else:
        self.register_parameter('weight', None)
        self.register_parameter('bias', None)
    if self.track_running_stats:
        # running statistics: registered as buffers, so they are saved in the state_dict
        # but not returned by parameters() and never touched by the optimizer
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
    else:
        self.register_parameter('running_mean', None)
        self.register_parameter('running_var', None)
        self.register_parameter('num_batches_tracked', None)
    self.reset_parameters()

Hope that helps!


Hmm, do you mean multi-label or multi-CLASS? I think so far the multi-class default is Categorical Cross Entropy using softmax, and the multi-label default is Binary Cross Entropy using sigmoid. (These are also the defaults in the fastai library, based on the label class.)

So, from my understanding of Jeremy in the lecture, it would often make sense for real-world multi-class problems not to use softmax but rather the binary cross entropy (multi-label) version, and then use thresholds and/or argmax on the results to figure out the single class. That way we also get the probabilities for each class, undistorted by softmax, so we can differentiate the given classes vs. "background"/"no label" in case the probabilities are small for all of the classes. Is this what he meant?
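Something like this, maybe (a rough sketch with made-up numbers, not anything Jeremy showed):

import torch

logits = torch.randn(4, 10)                    # 4 samples, 10 classes (pretend activations)
probs  = torch.sigmoid(logits)                 # per-class probabilities, not forced to sum to 1

thresh = 0.5
confident = probs.max(dim=1).values > thresh   # is any class above the threshold?
preds = probs.argmax(dim=1)                    # pick the single best class...
preds[~confident] = -1                         # ...or fall back to "no label" / "background"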

This would finally answer my question asked during v3 part 1 :wink: :



Thanks @deena-b, I'll look into VSCode soon. I've used vim while writing bash scripts, but I kept forgetting the commands all the time and deleting my code. I'll text you tomorrow.

Yes, those are great takeaways. I'll write them down.

I was looking at the new version of the Runner class, and I realised that we may have lost the ability for a callback to return True. Is that correct?

Since res is set to False at the start and we are using the 'and' operator, this effectively means that no matter what the callbacks return, res will ultimately be False, right?
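A minimal sketch of the pattern I mean (not the exact Runner code):

callbacks = [lambda: True, lambda: True]   # pretend callbacks that all return True

res = False
for cb in callbacks:
    res = res and cb()   # res starts False, so `and` short-circuits: cb() is never even
                         # evaluated here and res can never become True

print(res)  # False, regardless of what the callbacks return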


I remember Graham Neubig saying that batch size is a hyperparameter. Can someone explain that? What is the difference between a batch size of 32 and a batch size of 128, apart from the speed?


There might be some basic mistake here. I'm confused by the different behaviors of numpy and torch:

np.array([10, 20]).var()
25.0

np.array([10, 20]).std()
5.0

torch.tensor([10., 20.]).var()
tensor(50.)

torch.tensor([10., 20.]).std()
tensor(7.0711)

In torch's case they don't seem to be taking the mean of the squared deviations for the variance. Is this a bug?

Update:
I dug further into this and it looks like there is an arg called "unbiased"; if I set it to False, it matches numpy.

torch.tensor([10., 20.]).var(unbiased=False)
tensor(25.)

torch.tensor([10., 20.]).std(unbiased=False)
tensor(5.)
From the docs: "If unbiased is False, then the standard deviation will be calculated via the biased estimator. Otherwise, Bessel's correction will be used."
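In other words (a quick manual check with the same numbers):

import torch

x = torch.tensor([10., 20.])
sq_dev = (x - x.mean()) ** 2            # tensor([25., 25.])
print(sq_dev.sum() / len(x))            # tensor(25.) -> biased estimator (divide by n), numpy's default
print(sq_dev.sum() / (len(x) - 1))      # tensor(50.) -> Bessel's correction (divide by n-1), torch's default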


Oh silly me - I meant to say "binary" but wrote "binomial", then just read what was there rather than actually thinking about it! Thanks for pointing this out.


One thing I didn't quite understand: Jeremy said softmax should not be used, but everyone uses it. What should be used instead? Or did I misunderstand?