Lesson 13 official topic

We saw the log rules for quotients and products:
ln(x/y)=ln(x)−ln(y)
ln(x*y)=ln(x)+ln(y)

Just a reminder that ln(x) / ln(y) and ln(x) * ln(y), which look similar, have no such simplification rules.
A tip that helps me remember this: x * y can blow up to a large number, and applying the log converts the multiplication into a summation, which is much smaller than the product.
Hope this helps someone :slight_smile:
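
In case a quick numerical check helps, here is a minimal Python sketch (values chosen arbitrarily) that confirms the two rules and shows the look-alike expressions really are different:

    import math

    x, y = 12.0, 3.0

    # the quotient and product rules hold (up to floating-point error)
    assert math.isclose(math.log(x / y), math.log(x) - math.log(y))
    assert math.isclose(math.log(x * y), math.log(x) + math.log(y))

    # ...but ln(x)/ln(y) and ln(x)*ln(y) are different quantities entirely
    print(math.log(x / y), math.log(x) / math.log(y))   # ~1.386 vs ~2.262
    print(math.log(x * y), math.log(x) * math.log(y))   # ~3.584 vs ~2.730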

1 Like

This was the first of the fastai lessons so far where I got REALLY lost, probably because I’ve never studied calculus and because things move much faster around that material.

I’m going through everything in this week’s notebook really slowly and expanding the explanations to myself in notes wherever a step or progression feels compressed. Same with some of the new terms / shorthands for things we’d been doing in part 1 but that were never referred to by those names (like ‘backpropagation’ etc.). Will get there in the end, I hope, and will try not to get dissuaded by the forward march of the lessons!

7 Likes

Adding to this, say we have a linear layer.
L is our Loss.

linear layer:
input * w + b = output

we need to find:
input.grad (or dL/dinput)

given:
output.grad (or dL/doutput)

applying chain rule, dL/dinput = (dL/doutput) * (doutput/dinput)
or, input.grad = output.grad * (doutput/dinput)

doutput/dinput = d ( input * w + b ) / dinput
doutput/dinput = w

therefore, input.grad = output.grad * w (w.T to take care of sizes)
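
If it helps, here is a tiny PyTorch sketch of the same result (the names and shapes are just ones I made up for illustration), checking the hand-derived input.grad against autograd:

    import torch

    # made-up shapes: a batch of 5 inputs with 4 features, 3 outputs
    inp = torch.randn(5, 4, requires_grad=True)
    w = torch.randn(4, 3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)

    output = inp @ w + b          # the linear layer: input * w + b = output
    L = output.pow(2).mean()      # any scalar loss will do
    L.backward()                  # populates inp.grad, w.grad, b.grad

    # reproduce input.grad by hand: output.grad is dL/doutput
    output_grad = (2 * output / output.numel()).detach()  # dL/doutput for this loss
    manual_inp_grad = output_grad @ w.t()                  # output.grad * doutput/dinput (w.T for shapes)
    print(torch.allclose(manual_inp_grad, inp.grad))       # True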

the crux of backprop is to make use of the local derivative and the global (upstream) derivative

as explained by Karpathy here, timestamp 1:06:51:

4 Likes

I don’t think there is a substitute for debugging in the exact environment the code runs in and was developed for. Anything else is a compromise.

The most important thing to do IMO is to ensure you’ve got the pre-reqs - i.e. watch the 3blue1brown ‘essence of calculus’ series and enough of Khan Academy to cover derivatives and the chain rule. If you didn’t cover that in high school (or did, but have forgotten it), then you’ll need to back-fill that stuff now, and the fast.ai material in lesson 12 won’t make that much sense without it (since we’re starting on the assumption that you’re already comfortable with it).

It’s not any harder than anything else we’ve done up until now in the course, and I’m sure you’ll be able to pick it up with time and practice! I’ll create a new thread for asking calculus questions now - feel free to use it to ask anything you like.

4 Likes

OK here’s the topic:

2 Likes

Folks, in case it proves helpful to anyone, see link below for my rough attempt to connect the code and the math in the 03_backprop.ipynb notebook.

9 Likes

This is a really nice piece of work, thanks for sharing. I have been doing a similar thing, but yours is much better laid out.

Thanks for the encouragement @johnri99!

I was trying to work through some backprop with pen and paper this past week (along with the code) and always got a little stuck when I reached the higher-order Jacobian stuff. I mean the layers in the middle, where you have dY/dX and both are tensors rather than a simple scalar loss. I was expressing some of my confusion in one of the study groups and someone recommended this video. It has some nice heuristics and explanations of how we get those results without computing the entire Jacobian, and it made things make much more sense to me. It’s another example of “getting the shapes to work out”, a theme that keeps coming up (broadcasting etc.). Just sharing the link to the video in case anyone else was stuck on something similar.
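
For anyone who wants to poke at this, here is a small toy example of my own (not from the video) comparing the full-Jacobian route with the “get the shapes to work out” shortcut for Y = X @ W:

    import torch
    from torch.autograd.functional import jacobian

    X = torch.randn(2, 3)
    W = torch.randn(3, 4)

    def f(x):                 # Y = X @ W, a matrix-to-matrix function
        return x @ W

    # the full Jacobian dY/dX has shape (2, 4, 2, 3): one entry per (output, input) pair
    J = jacobian(f, X)

    # pretend the upstream gradient dL/dY is all ones (i.e. L = Y.sum())
    dL_dY = torch.ones(2, 4)

    # route 1: contract the upstream gradient against the full Jacobian
    full = torch.einsum('ij,ijkl->kl', dL_dY, J)

    # route 2: the shortcut backprop actually uses, with no Jacobian materialised
    shortcut = dL_dY @ W.t()

    print(torch.allclose(full, shortcut))   # True

Even for this tiny example the full Jacobian already has 2*4*2*3 = 48 entries, which is why nobody materialises it in practice.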

9 Likes

This was fantastic! Thank you!

Same here, Alex. In addition to the suggestions above, this helped me a lot: Make Your Own Neural Network
Also, I watched the V3 version of the second part (2020) first and then watched this lesson again. V3 is essentially the same as this one, but I believe the two are complementary; a slightly different angle on the same thing is very helpful.

2 Likes

This link clearly explains the result:

http://cs231n.stanford.edu/handouts/linear-backprop.pdf

4 Likes

I read this over the weekend and can confirm that it’s quite useful for this low-level stuff, especially going end to end with some simple networks and showing how the various pieces fit together. It filled in some gaps in my understanding, for sure. Thank you for the recommendation, @nikem!

1 Like

It’s great to see so many good explanations of backprop and the chain rule. Kudos to @sinhak for writing this down for the matrix case! This is not the first time I’ve approached the chain rule and tried to implement it from scratch; however, deriving partial derivatives, and Jacobians especially, for multiple layers has always been a struggle. (And still is, actually…)

This time, I tried to use Jupyter and PyTorch as “copilots”: I started with pen and paper, but figured out the right multiplications using pdb. Here is my small note with somewhat trivial derivations, but it helped me reimplement the forward/backward pass for linear layers “from scratch”, and it was great to see that my implementation is aligned with what was presented during the lectures. (Except that I don’t sum up the last dimension for the bias tensor, and keep it as a “column” vector.)

Essentially, I derived the equations for the scalar case and tried to align the shapes of the gradients with the shapes of the weights, i.e., if W is 4x3, then W.grad should have the same shape in order to do an update. And it worked! So an interactive playground is a great assistant here, as are autograd frameworks that you can use to check whether everything is done correctly.
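
For anyone who wants to reproduce that kind of check, here is a minimal version of what I mean (made-up shapes, and the standard summed-over-the-batch bias gradient rather than my column-vector variant):

    import torch

    X = torch.randn(5, 4)
    W = torch.randn(4, 3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)

    loss = (X @ W + b).sum()
    loss.backward()

    # the gradients must have the same shapes as the parameters they update
    print(W.grad.shape == W.shape, b.grad.shape == b.shape)   # True True

    # and the hand-derived formulas agree with autograd (for L = sum of the outputs)
    ones = torch.ones(5, 3)                                   # dL/dY when L = Y.sum()
    print(torch.allclose(W.grad, X.t() @ ones))               # dL/dW = X.T @ dL/dY
    print(torch.allclose(b.grad, ones.sum(dim=0)))            # dL/db sums over the batch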

I wonder if it is possible to do the same for conv/attention layers? Like starting with somewhat sloppy math and figuring out the right implementation in an interactive playground. It might be an interesting learning experiment.

3 Likes

Yet another presentation about the gradients of transformations and backpropagation: while the concept of a derivative is simple for functions of one argument, making it work for multi-dimensional data requires advanced linear algebra skills, and it is better to trust PyTorch :slight_smile:

In the Quarto blog I recently created to support my learning, I added a post about Derivatives and chain rule.
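
As a tiny illustration of what I mean by trusting PyTorch (my own throwaway example, not from the post): the same few lines work whether the input is a scalar or a matrix, and autograd does all the bookkeeping:

    import torch

    def grad_of_sum_of_squares(x):
        # f(x) = (x ** 2).sum(); the gradient should be 2 * x, whatever shape x has
        x = x.clone().requires_grad_(True)
        (x ** 2).sum().backward()
        return x.grad

    print(grad_of_sum_of_squares(torch.tensor(3.0)))         # tensor(6.)
    print(grad_of_sum_of_squares(torch.randn(2, 3)).shape)   # torch.Size([2, 3])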

2 Likes

By the way, I just realized that I don’t quite understand why the bias is a vector and not a matrix in PyTorch’s implementation. (See the following snippet.)

    # torch.nn.Linear
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
        if bias:
            # bias is a vector?
            self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

Looking at the equation, I expected that, in general, it might be a matrix:

X • W + b
^   ^   ^
|   |   |
|   |   should be of shape (n, k)?
|   |
|   this one might be (m, k), right?
|
say, this one is (n, m)

Does it broadcast the bias vector in cases when linear matrix multiplication returns a matrix?

Hehe. Everything you ever learned in math class is now wrong thanks to broadcasting. Jokes aside, this kind of stuff gets me all the time because I first learned it from a strict math point of view.

1 Like

The bias is a vector whose length equals the number of output features (as you can see in the code). So for each sample (row of X) the bias is the same. If there are many samples in a batch, the bias is broadcast across them. If there is a single output feature, then the bias is a single number.
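
A quick way to see the broadcasting in action (shapes picked arbitrarily):

    import torch

    n, m, k = 4, 3, 2            # batch size, in_features, out_features
    X = torch.randn(n, m)
    W = torch.randn(m, k)
    b = torch.randn(k)           # one bias per output feature, as in nn.Linear

    out = X @ W + b              # (n, k) + (k,): b is broadcast across the n rows
    print(out.shape)             # torch.Size([4, 2])

    # every row of the output receives the identical bias vector
    print(torch.allclose(out - X @ W, b.expand(n, k)))   # True

(nn.Linear stores its weight as (out_features, in_features) and computes X @ W.T + b, but the bias broadcasting works the same way.)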

1 Like

Yeah, I understand this part for the case when we have a system of linear equations, like:

| x11 x12 x13 |    | w1 |   | b1 |   | x11*w1 + x12*w2 + x13*w3 + b1 |
| x21 x22 x23 | x  | w2 | + | b2 | = | x21*w1 + x22*w2 + x23*w3 + b2 |
| x31 x32 x33 |    | w3 |   | b3 |   | x31*w1 + x32*w2 + x33*w3 + b3 |

However, I wasn’t sure if that works the same way when we have a matrix of weights rather than a column vector. So it means we model the bias in such a way that each coefficient corresponds to one output column and is broadcast down that column’s rows, right?

Somehow, I thought it worked more like the following picture shows.

[matmul diagram]

But I guess I just didn’t get the equation right and mistakenly generalized from the lower-dimensional case. Good to know!