We saw the log rules for quotients and products: ln(x/y) = ln(x) − ln(y) and ln(x*y) = ln(x) + ln(y).
Just a reminder that ln(x) / ln(y) and ln(x) * ln(y), which look similar, have no simplification rules.
A tip that helps me remember this: x * y can produce large numbers, and taking the log converts the multiplication into an addition, which grows much more slowly.
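A quick sanity check of both rules in plain Python (the specific numbers here are just an illustration of mine):

```python
import math

x, y = 12.0, 3.0

# Product rule: ln(x*y) == ln(x) + ln(y)
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))

# Quotient rule: ln(x/y) == ln(x) - ln(y)
assert math.isclose(math.log(x / y), math.log(x) - math.log(y))

# The look-alikes do NOT simplify: ln(x)/ln(y) is actually
# log base y of x (change of base), not ln(x/y), and
# ln(x)*ln(y) is just a product of two numbers.
assert not math.isclose(math.log(x) / math.log(y), math.log(x / y))
assert not math.isclose(math.log(x) * math.log(y), math.log(x * y))
```

The change-of-base identity is also why ln(x)/ln(y) shows up so often in code even though it isn't a "log rule" in the simplification sense.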
Hope this helps someone
This was the first of the fastai lessons so far where I got REALLY lost, probably because I’ve never studied calculus and because the pace picks up around that material.
I’m going through everything in this week’s notebook really slowly, expanding the explanations in my notes wherever a step or progression feels compressed. Same with some of the new terms / shorthands for things we’d been doing in part 1 but that were never referred to by those names (like ‘backpropagation’ etc.). I’ll get there in the end, I hope, and will try not to get discouraged by the forward march of the lessons!
I don’t think there is a substitute for debugging in the exact environment the code is running in and which it was developed for. Anything else is a compromise.
The most important things to do IMO is ensure you’ve got the pre-reqs - i.e. watch the 3blue1brown ‘essence of calculus’ series and enough of Khan Academy that you’ve covered derivatives and the chain rule. If you didn’t cover that in high school (or did, but have forgotten it), then you’ll need to back-fill that stuff now, and the fast.ai material in lesson 12 won’t make that much sense without it (since we’re starting on the assumption that you’re already comfortable with that).
It’s not any harder than anything else we’ve done up until now in the course, and I’m sure you’ll be able to pick it up with time and practice! I’ll create a new thread for asking calculus questions now - feel free to use it to ask anything you like.
I was trying to work through some backprop with pen and paper this past week (along with the code) and always got a little stuck when I reached the higher-order Jacobian stuff. I mean the layers in the middle, where you have dY/dX and both Y and X are tensors, not a simple scalar loss. I was expressing some of my confusion in one of the study groups and someone recommended this video. It has some nice heuristics and explanations of how we get those results without computing the entire Jacobian, and it made things make more sense to me. It’s another example of “getting the shapes to work out”, a theme that keeps coming up (broadcasting etc.). Just sharing the link to the video in case anyone else was stuck on a similar thing.
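To make the “shapes” heuristic concrete, here is a small pure-Python sketch of the backward pass for a linear layer Y = X @ W. All names here are mine (not from the lesson notebook); the point is that shape-matching gives the gradients directly, without ever materializing the 4-D Jacobian dY/dW:

```python
def matmul(A, B):
    """Naive matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0, 3.0],          # batch of 2 samples, 3 input features
     [4.0, 5.0, 6.0]]
W = [[0.1, 0.2],               # 3 input features -> 2 output features
     [0.3, 0.4],
     [0.5, 0.6]]

Y = matmul(X, W)               # shape (2, 2)

# Suppose upstream gives us dL/dY, the same shape as Y.
dY = [[1.0, 1.0],
      [1.0, 1.0]]

# Shape-matching heuristic:
#   dL/dX must have X's shape (2, 3)  ->  dY @ W.T
#   dL/dW must have W's shape (3, 2)  ->  X.T @ dY
dX = matmul(dY, transpose(W))
dW = matmul(transpose(X), dY)

print(len(dX), len(dX[0]))     # 2 3, same shape as X
print(len(dW), len(dW[0]))     # 3 2, same shape as W
```

The "which transpose goes where" question is answered almost entirely by asking which arrangement produces the required output shape, which is exactly the heuristic the video leans on.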
Same here Alex. In addition to the suggestions above, this helped me a lot: Make Your Own Neural Network
Also, I watched the V3 version of the second part first (2020) and then watched this lesson again. V3 is essentially the same as this, but I believe the two are complementary: a slightly different angle on the same thing is very helpful.
I read this over the weekend and can confirm that it’s quite useful for this low-level stuff, especially going end to end with some simple networks and showing how the various pieces fit together. It filled in some gaps in my understanding, for sure. Thank you for the recommendation, @nikem!
It’s great to see so many good explanations of backprop and the chain rule. Kudos to @sinhak for writing this down for the matrix case! This is not the first time I’ve approached the chain rule and tried to implement it from scratch. However, deriving partial derivatives, and especially Jacobians for multiple layers, has always been a struggle. (And still is, actually…)
This time, I tried to use Jupyter and PyTorch as “copilots”: I started with pen and paper, but figured out the right multiplications using pdb. Here is my small note with somewhat trivial derivations; it helped me reimplement the forward/backward pass for linear layers “from scratch”, and it was great to see that my implementation matches what was presented during the lectures. (Except that I don’t sum over the last dimension for the bias tensor, and keep it as a “column” vector.)
Essentially, I derived the equations for the scalar case and tried to align the shapes of the gradients with the shapes of the weights, i.e., if W is 4x3, then W.grad should have the same shape in order to do an update. And it worked! So an interactive playground is a great assistant here, as are autograd frameworks that you can use to check whether everything is done correctly.
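Beyond comparing against an autograd framework, you can also check a hand-derived gradient with finite differences, no framework needed. Here is a minimal pure-Python sketch (the loss, data, and names are all illustrative assumptions of mine) that checks the dW = X.T @ dY rule for the loss L = sum(X @ W), where dY is all ones:

```python
def forward_loss(X, W):
    # L = sum of all entries of X @ W
    return sum(sum(x * w for x, w in zip(row, col))
               for row in X for col in zip(*W))

X = [[1.0, 2.0], [3.0, 4.0]]
W = [[0.5, -0.5], [1.5, 2.0]]

# Analytic gradient: dL/dW = X.T @ dY, and with dY all ones each
# column j of dL/dW is just the column sums of X.
analytic = [[sum(col)] * len(W[0]) for col in zip(*X)]  # shape (2, 2)

# Numerical gradient via central differences: nudge each weight
# up and down by eps and measure the change in the loss.
eps = 1e-6
numeric = [[0.0] * len(W[0]) for _ in W]
for i in range(len(W)):
    for j in range(len(W[0])):
        W[i][j] += eps
        up = forward_loss(X, W)
        W[i][j] -= 2 * eps
        down = forward_loss(X, W)
        W[i][j] += eps          # restore the original weight
        numeric[i][j] = (up - down) / (2 * eps)

print(analytic)   # [[4.0, 4.0], [6.0, 6.0]]
print(numeric)    # approximately the same, up to float noise
```

This is slow (one forward pass per weight), so it only makes sense as a spot check on tiny layers, but it catches wrong transposes and missing sums very reliably.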
I wonder if it is possible to do the same for conv/attention layers? Like, starting with somewhat loose math and figuring out the right implementation in an interactive playground. Might be an interesting learning experiment.
Yet another presentation about the gradients of transformations and backpropagation. While the concept of a derivative is simple for functions of one argument, making it work for multi-dimensional data requires advanced linear algebra skills, and it is better to trust PyTorch.
By the way, I just realized that I don’t quite understand why the bias is a vector and not a matrix in PyTorch’s implementation. (See the following snippet.)
Hehe. Everything you ever learned in math class is now wrong thanks to broadcasting. Jokes aside, this kind of thing gets me all the time because I first learned it from a strict math point of view.
The bias is a vector whose length equals the number of output features (as you can see in the code). So for each sample (row of X) the bias is the same; if there are many samples in a batch, the bias is broadcast across them. If there is a single output, then the bias is a single number.
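Here is a tiny hand-rolled version of that broadcast, just to make it concrete (the numbers and names are made up for illustration; PyTorch does this automatically when you write `X @ W + b`):

```python
# Pretend this is X @ W for a batch of 2 samples and 3 output features.
X_at_W = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

# One bias per output feature -- a vector, not a matrix.
b = [10.0, 20.0, 30.0]

# "Broadcasting" by hand: the same bias row is reused for every sample.
out = [[v + bias for v, bias in zip(row, b)] for row in X_at_W]
print(out)  # [[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]
```

Because every row receives the same copy of `b`, a single vector is all the information a bias matrix of shape (batch, n_out) would carry, which is why the stored parameter is just the vector.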
However, I wasn’t sure whether that works the same way when we have a matrix of weights rather than a column vector. So it means we model the bias in such a way that each coefficient is broadcast across the columns, right?
Somehow, I thought that it is more like the following picture shows.
But I guess I just didn’t get the equation right and mistakenly generalized it from the lower dimension. Good to know!