I’m seeing some interesting threads where folks are asking for help with foundational topics that have come up in the first 2 lectures. I’m planning to add some material to the start of the next lecture giving more info about these topics. So please let me know in the replies what you feel you need to better understand in order to follow along with the lessons so far. This could include coding, math, ML, DL, or other topics.

Here’s what’s on my list so far:

What are callbacks, and why do we use them?

What is variance, and why does it matter?

When should we use, or not use, softmax?

Adding these based on discussion below:

All the dunder methods: why do we use them? I understand the necessity for __init__, but don’t understand the others.

Why use a class instead of a function for our own notebooks? If we’re defining a new class, but we’re not inheriting the attributes of another class, why not just use a function?

For software engineering / Python topics like callbacks, decorators, and partial functions, I’d find it helpful to just have a short hint (likely it was there in the lesson and I need to listen again!) about how one would handle that in a much less expressive language than Python (e.g. a simple, ‘overloaded’ loop instead of callbacks?). Just to be sure, I distinguish deep learning conceptual questions from advanced Python & software engineering topics.

We calculate gradients and backpropagate them. Great. But why don’t we need the loss?
Aren’t we using some sort of error to tell gradient descent in which direction it should move?

I’d really like to understand deeply the rationale about why all this works in the first place. (I thought I was clear on that until I saw we didn’t need the MSE during first lecture)… I hope it’s not too easy for everybody else…

Short answer: when we backprop, we always take the derivative of the loss with respect to something. If you start at the loss, the derivative of the loss w.r.t. the loss is always 1. So that doesn’t really tell you anything.

Instead, we start at the derivative of the loss w.r.t. the model’s output. For the MSE loss that is 2 * (pred - true). We have enough information here to compute that gradient, because we know what the prediction is, and we know what the true label is. So this has everything we need to get started with the backward pass.
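To make that starting point concrete, here is a minimal sketch in plain Python. It assumes the loss is the squared error of a single prediction, so the gradient is 2 * (pred - true) as above; when the loss is averaged over n examples, each term is also divided by n. The function names are made up for illustration and are not from the course notebooks.

```python
def squared_error(pred, true):
    # Loss value; only needed here to check the gradient numerically.
    return (pred - true) ** 2

def squared_error_grad(pred, true):
    # Derivative of the loss w.r.t. the prediction: the start of the backward pass.
    return 2 * (pred - true)

def numerical_grad(f, pred, true, eps=1e-6):
    # Central finite difference, used purely as a sanity check.
    return (f(pred + eps, true) - f(pred - eps, true)) / (2 * eps)

pred, true = 3.0, 1.0
analytic = squared_error_grad(pred, true)
numeric = numerical_grad(squared_error, pred, true)
print(analytic, numeric)
```

Note that squared_error_grad never calls squared_error: the gradient only needs the prediction and the label, which is why the loss value itself does not have to be computed for the backward pass.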

In one of the Twitter posts, I saw that regularization was given as a reason why training loss is sometimes greater than validation loss. There was also mention of the degrees of freedom of a model. I would like to hear from you on these.

For me, all the dunder methods: why do we use them? I understand the necessity for __init__, but don’t understand the others.

Also, why use a class instead of a function for our own notebooks? If we’re defining a new class, but we’re not inheriting the attributes of another class, why not just use a function?

Just to say this in a slightly different way from how @machinethink did, the loss \mathcal{L} is in a sense always present: when you want to figure out which direction to wiggle some weight w in the middle of the network, ultimately you need to calculate \partial \mathcal{L}/\partial w, the derivative of the loss with respect to that weight.

We want to do this not just for that particular weight w, but all of the weights in the network; backpropagation is an efficient algorithm for calculating these derivatives (based on the chain rule, plus an algorithmic technique called dynamic programming).
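Here is a small sketch of that idea, assuming a toy scalar network pred = w2 * relu(w1 * x) with a squared-error loss (this network and its names are invented for illustration, not taken from the lectures). Each gradient is built from the one computed just before it, which is the reuse that makes backpropagation efficient.

```python
def forward(x, w1, w2):
    z = w1 * x                      # first layer (pre-activation)
    h = z if z > 0 else 0.0         # relu
    pred = w2 * h                   # second layer
    return z, h, pred

def loss(x, true, w1, w2):
    return (forward(x, w1, w2)[2] - true) ** 2

def backward(x, true, w1, w2):
    z, h, pred = forward(x, w1, w2)
    d_pred = 2 * (pred - true)               # dL/dpred: where the backward pass starts
    d_w2 = d_pred * h                        # dL/dw2 = dL/dpred * dpred/dw2
    d_h = d_pred * w2                        # dL/dh, reused for everything upstream
    d_z = d_h * (1.0 if z > 0 else 0.0)      # relu passes the gradient only where z > 0
    d_w1 = d_z * x                           # dL/dw1 = dL/dz * dz/dw1
    return d_w1, d_w2

x, true, w1, w2 = 2.0, 1.0, 0.5, 3.0
d_w1, d_w2 = backward(x, true, w1, w2)

# Finite-difference check of both gradients.
eps = 1e-6
num_w1 = (loss(x, true, w1 + eps, w2) - loss(x, true, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss(x, true, w1, w2 + eps) - loss(x, true, w1, w2 - eps)) / (2 * eps)
print(d_w1, num_w1, d_w2, num_w2)
```

The loss function is only called in the numerical check; backward gets every derivative from the chain rule and the values saved in the forward pass.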

[lr_find+fit_one_cycle]
I find that one of the most powerful elements of fastai is the pairing of lr_find + fit_one_cycle, because it saves a lot of time. Having a precise explanation of the code and the principles would be super interesting. And what could be improved in these two techniques?

Why is having a mean of 0 and a standard deviation of 1 so important? What happens when both are off, when one is good and the other is off, and when both are pretty spot on?

In the 02_a_why_sqrt5 notebook we are deriving conclusions about initialization parameters based on MNIST … BUT, how do we know that is a sufficient dataset from which to draw conclusions? In other words, we didn’t test with a 3-channel dataset and/or images of anything other than grayscale pictures of the digits 0 through 9. When testing anything, how do we determine what is a good enough dataset from which to draw conclusions?

I think you probably have enough info to have a first go at answering this yourself - for this particular issue. We’re trying to answer: does multiplying by this weight matrix (then maybe doing a relu) result in a variance of 1, if the input has a variance of 1. Other folks are welcome to jump in too (both to answer this, or any previous question).
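One way to have a first go at this without any dataset at all is a sketch like the following, in plain Python with random inputs. The layer sizes, the sample count, and the two weight scales (sqrt(1 / n_in) and the He/Kaiming-style sqrt(2 / n_in)) are my assumptions for illustration. It reports the mean of the squared outputs, which equals the variance only when the mean is 0, and that is no longer the case after a relu.

```python
import math
import random

random.seed(0)

def second_moment(values):
    # Mean of squares; equal to the variance only for zero-mean values.
    return sum(v * v for v in values) / len(values)

def layer_outputs(n_in, n_out, weight_std, n_samples, use_relu):
    # Random weight matrix with zero mean and the given standard deviation.
    weights = [[random.gauss(0.0, weight_std) for _ in range(n_in)]
               for _ in range(n_out)]
    outputs = []
    for _ in range(n_samples):
        x = [random.gauss(0.0, 1.0) for _ in range(n_in)]   # input with variance 1
        for row in weights:
            y = sum(w * xi for w, xi in zip(row, x))
            outputs.append(max(y, 0.0) if use_relu else y)
    return outputs

n_in, n_out, n_samples = 128, 128, 20
linear = second_moment(
    layer_outputs(n_in, n_out, math.sqrt(1 / n_in), n_samples, use_relu=False))
relu_naive = second_moment(
    layer_outputs(n_in, n_out, math.sqrt(1 / n_in), n_samples, use_relu=True))
relu_scaled = second_moment(
    layer_outputs(n_in, n_out, math.sqrt(2 / n_in), n_samples, use_relu=True))
print(linear, relu_naive, relu_scaled)
```

With these settings the linear case should land near 1, the relu case with the same weights near 0.5, and the relu case with the larger weights near 1 again. Nothing here depends on what the inputs depict, only on their mean and variance, which is one angle on the dataset question.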

You already have the exact code for fit_one_cycle in last week’s notebook. We have enough in the notebooks now to implement lr_find - want to give it a try?

One of the issues I was wondering about: if inputs are normalized around a mean of 0, wouldn’t relu kill a lot of outputs? In that case, a mean of 0.5 instead of 0 may be better, right?
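A quick way to put numbers on that is the sketch below. It assumes the values reaching the relu are normally distributed with standard deviation 1, which is an idealization, and it only measures how many values get zeroed, not whether training ends up better or worse.

```python
from statistics import NormalDist

def fraction_zeroed(mean, std=1.0):
    # relu zeroes every value below 0, so this is P(value < 0).
    return NormalDist(mean, std).cdf(0.0)

frac_mean_0 = fraction_zeroed(0.0)
frac_mean_half = fraction_zeroed(0.5)
print(frac_mean_0, frac_mean_half)
```

Under this assumption half of the values are zeroed at mean 0 and roughly 31% at mean 0.5, so shifting the mean does let more values through, at the cost of moving the mean further from 0.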

Sure, but I won’t be able to get to it until next week (taxes, preparing a couple of presentations for a conference, trying to figure out who is feeding the dogs while I’m gone, etc…).

Could you give me an idea of what kind of experiments would be helpful?

I was thinking of building a simple ConvNet and using MNIST and maybe a small subset of ImageNet for my datasets, and then more or less following your 02a/b notebooks to manually adjust layer initialization so as to force various scenarios (e.g., good mean and bad std, bad mean and good std, good mean and std) … and print out how these things change over the course of training and their effects on training time, accuracy, etc…

I understand that sklearn will take care of this for you. However, is that done for our mini-batches? If it is, I think I’m missing it, because a small batch size (like 32) occasionally raised errors when only 1 class appeared. However, this never occurred with large sizes (1024).

It could be interesting to further discuss initialization methods and practices, following the discussion started in Lecture 2. For example, as was mentioned, BatchNorm layers are usually skipped during the initialization process. Also, there is a difference between the initialization schemes for CNNs, GANs, and RNNs.

Is there any generic heuristic for choosing proper distributions and parameters? Which layers should be manually initialized, and which should not? How do the authors of new methods and architectures choose the initialization strategy? Is it mostly a trial-and-error, grid/randomized search process?

As far as I understand, it is mostly about keeping the mean and variance within the same range of values from layer to layer, right? I guess many of these questions were already discussed during the lecture and are covered in papers but, like some other listeners, I still feel a bit uncertain about this topic.