Lesson 9 Discussion & Wiki (2019)

One of my takeaways from tonight is that if you want to be good at using fast.ai v2, get to love and understand callbacks :slight_smile:

4 Likes

I actually started that book last week, and the Coursera class is very good, too!

1 Like

I would say that programmers haven’t invented anything better than observers (reactive programming, callbacks, events, signals, whatever you want to call it) to build complex software solutions :smile: Different flavors and features, but the same concept as I understand it.

2 Likes

I think it’s definitely worth it :slight_smile:

2 Likes

I find the Feynman method of learning most effective, where instead of writing on paper I write it in code.

5 Likes

But why not zeroize at the beginning?

I went through that course a couple of years back on Coursera when I first started self-learning data science (I figured learning how to learn would probably make sense, since there’s a lot of material I need to cover). First exposure to the Pomodoro technique. :slight_smile:

1 Like

An issue that has mystified me about softmax and cross-entropy…

Suppose image1’s class activations are (x1, x2) and image2’s activations are (x1-k, x2-k).

Now exponentiate and take the softmax. The results are identical. The log_softmax is also identical, and therefore the cross-entropy loss is identical.

What then is the meaning that image2’s activations are less than image1’s? That is, what does this situation look like? And why does it make sense that these two images, with different levels of activation, should contribute the same amount to the total loss?

This question comes out of image segmentation, where two adjacent pixels may have vastly different activations yet identical probability distributions across the classes.
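Here is a quick numerical check of what I mean (a minimal PyTorch sketch; the activation values are made up):

```python
import torch
import torch.nn.functional as F

x1 = torch.tensor([2.0, 5.0])   # image1's class activations
x2 = x1 - 3.0                   # image2's activations, shifted down by k=3

# softmax is invariant to adding a constant to every activation,
# so both images produce the same probabilities...
print(F.softmax(x1, dim=0))     # tensor([0.0474, 0.9526])
print(F.softmax(x2, dim=0))     # tensor([0.0474, 0.9526])

# ...and log_softmax (and hence the cross-entropy loss) is identical too
print(F.log_softmax(x1, dim=0)) # tensor([-3.0486, -0.0486])
print(F.log_softmax(x2, dim=0)) # tensor([-3.0486, -0.0486])
```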

Thanks for any insights that make intuitive sense,

You could also zero the gradients at the beginning of each iteration instead of at the end. The convention is to do it at the end. I think it’s also linked to a common mental model: when you finish the iteration, you want everything to be ready for the next one. Leaving the gradients around would feel a bit, I don’t know, unfinished maybe?

Also, that way the step and the zeroing of the gradients are back to back in the code, and it’s easy to put them under the same conditional (for accumulating gradients, for example).
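Something like this minimal sketch (the model, data, and accumulate_steps value are all made up for illustration):

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = torch.nn.functional.cross_entropy
# hypothetical data: 8 batches of 16 examples each
train_dl = [(torch.randn(16, 10), torch.randint(0, 2, (16,))) for _ in range(8)]

accumulate_steps = 4  # only update the weights every 4 batches

for i, (xb, yb) in enumerate(train_dl):
    loss = loss_func(model(xb), yb)
    loss.backward()                      # gradients accumulate into each param's .grad
    if (i + 1) % accumulate_steps == 0:  # step and zeroing share the same conditional
        opt.step()
        opt.zero_grad()
```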

I’m not sure what you mean. If two images have different activations for the same target, they very likely won’t have the same loss at the end.

Maybe it’s the “LogSumExp Trick” that makes you say that? Remember that in that trick you still would have a “+k”, right? Here it’s the “+a”.

\log \left ( \sum_{j=1}^{n} e^{x_{j}} \right ) = \log \left ( e^{a} \sum_{j=1}^{n} e^{x_{j}-a} \right ) = a + \log \left ( \sum_{j=1}^{n} e^{x_{j}-a} \right )
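In code, the trick looks something like this (a minimal sketch; torch.logsumexp does the same thing for you):

```python
import torch

x = torch.tensor([1000.0, 1001.0])  # large activations: exp() overflows

naive = torch.log(torch.exp(x).sum())            # inf, because exp(1000) overflows
a = x.max()
stable = a + torch.log(torch.exp(x - a).sum())   # 1001.3133, computed safely

print(naive, stable, torch.logsumexp(x, dim=0))  # the built-in gives the same stable value
```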

I found that the Stanford CS231n lecture also explains the weight initialization problem quite well: http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture06.pdf

1 Like

Interesting, I didn’t realize that the convention was to do it at the end - many of the examples I’ve seen look something like this (this one is from a Udacity deep learning course):
[image: training loop that calls zero_grad before the forward and backward passes]
I’ve always thought that we want to be sure our variables and settings are correct before we start using them, so calling zero_grad before running a training pass seems much more aligned with that idea. The only reason I can think of NOT to do it before starting is if we wanted to inherit some values the very first time we enter the training loop - could that be the reason for not zeroizing at the beginning?

It seems that init.kaiming_uniform is deprecated in favor of torch.nn.init.kaiming_uniform_:

Signature:
nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
Source:   
def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    r"""Fills the input `Tensor` with values according to the method
    described in "Delving deep into rectifiers: Surpassing human-level
    performance on ImageNet classification" - He, K. et al. (2015), using a
    uniform distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{U}(-\text{bound}, \text{bound})` where

    .. math::
        \text{bound} = \sqrt{\frac{6}{(1 + a^2) \times \text{fan\_in}}}

    Also known as He initialization.

    Args:
        tensor: an n-dimensional `torch.Tensor`
        a: the negative slope of the rectifier used after this layer (0 for ReLU
            by default)
        mode: either 'fan_in' (default) or 'fan_out'. Choosing `fan_in`
            preserves the magnitude of the variance of the weights in the
            forward pass. Choosing `fan_out` preserves the magnitudes in the
            backwards pass.
        nonlinearity: the non-linear function (`nn.functional` name),
            recommended to use only with 'relu' or 'leaky_relu' (default).

    Examples:
        >>> w = torch.empty(3, 5)
        >>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
    """
    fan = _calculate_correct_fan(tensor, mode)
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
    with torch.no_grad():
        return tensor.uniform_(-bound, bound)
File:      ~/anaconda3/envs/fastaiv3/lib/python3.6/site-packages/torch/nn/init.py
Type:      function

It seems this was already in the 2a notebook; sorry about the repetition. It was a long night :smile:

I’m very far from knowing everything, but I can’t think of any reason to inherit values the first time we enter the training loop.

Otherwise (again, AFAIK) it really doesn’t matter whether it’s done at the beginning or at the end. When you create your model from scratch the gradients start out empty (None in PyTorch), so the first backward pass behaves as if they were at 0. For the following iterations it seems to me that it’s exactly the same whether the zeroing is the last thing you do in the current iteration or the first thing you do in the next one.
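You can check the starting state directly (a minimal sketch):

```python
import torch

model = torch.nn.Linear(2, 1)
print(model.weight.grad)   # None: no gradients exist until the first backward()

loss = model(torch.randn(4, 2)).sum()
loss.backward()
print(model.weight.grad)   # now populated, exactly as if it had started at zero
```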

I have one suggestion: since we all have different backgrounds and knowledge, and we have to teach ourselves in different areas, can we all start sharing the basics? It sometimes feels like a shame to admit you don’t know something, but we have to explore and find answers.

For instance, I wasn’t sure about the definition of a callback, and here it is:

it is basically a function you pass to another function, so that your piece of code gets called back from inside it at the right moment
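For example, a tiny sketch (the names are made up):

```python
def train(n_epochs, on_epoch_end):
    for epoch in range(n_epochs):
        ...                  # the actual training work would happen here
        on_epoch_end(epoch)  # the callback: train() calls back into our code

def report(epoch):
    print(f"finished epoch {epoch}")

train(3, on_epoch_end=report)  # prints "finished epoch 0", 1, 2
```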
As the material gets more technical, I think the worst thing is having something simple hold you up.

Or maybe we can open a topic under “Dummies ML questions”.
What do you think?

2 Likes

Thanks! :slight_smile:
I got a few likes on the reply too, so I’ve added it to the wiki.

I would like to ask about gradient accumulation. The way I understand PyTorch gradients, they come from a chain of mathematical operations that is kept track of; when training, we use them to update the parameters and then clear them for the next iteration.
How exactly do we accumulate them? Do they get mathematically added to the previous iteration’s gradients, or concatenated onto the previous chain of operations?

There are no dumb questions. :slight_smile: Please feel free to ask about anything you’re unsure about here.

4 Likes

They get added when you call backward(); nothing gets concatenated. That call also frees the gradient history (the computation graph) from the tensors.
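A quick way to see the addition happening (a minimal sketch):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
print(w.grad)              # tensor([3., 3.])

(w * 3).sum().backward()   # no zeroing in between...
print(w.grad)              # tensor([6., 6.]): added to the old gradients, not concatenated
```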

1 Like