Lesson 4 - Official Topic

Here is a basic intuition for why.

Let's take one neuron as an example. If the input to the neuron's ReLU is zero or negative, the neuron will output a 0. On its own this is not a big issue, but it becomes a problem if it happens to this neuron for every one of the input samples (different images). Our images have a lot of pixels that are 0.

If this happens, this neuron will always produce 0 during the forward propagation, and then the gradient flowing through this neuron will forever be zero irrespective of the input.

In other words, the weights of this neuron will never be updated again. Such a neuron can be considered a dead neuron, a kind of permanent "brain damage" in biological parlance.
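A minimal sketch of this (my own toy example, not from the lesson): a ReLU neuron whose pre-activation is negative for every input always outputs 0 and receives a zero gradient, so SGD never changes its weights.

```python
import torch

# Two input samples; weights chosen so the pre-activation is always negative.
x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
w = torch.tensor([[-1.0], [-1.0]], requires_grad=True)
b = torch.zeros(1, requires_grad=True)

act = torch.relu(x @ w + b)   # pre-activations are -3 and -7, so ReLU outputs 0
act.sum().backward()

print(act)     # tensor([[0.], [0.]])
print(w.grad)  # tensor([[0.], [0.]]) -> this neuron's weights never get updated
```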

That’s my understanding. Maybe someone more knowledgeable can confirm this or give a better explanation.


Hello! I learned a lot from this chapter.

The only thing I am confused about is the following lines of code:
nn.Linear(28*28,30),
nn.ReLU(),
nn.Linear(30,1)

The “30” in the first line means 30 neurons (30 different models), so the first layer has 30 neurons and the second layer has 1 neuron (the output).

What if the first layer were nn.Linear(28*28,1) (like the model we trained before) instead of having 30 outputs, and we then passed that through ReLU and another linear layer?

What is the purpose of the out_features parameter in nn.Linear? What does it do, and why is it 30, or any number > 1?

Hello

Actually, the 30 stands for 30 different features. Giving the first layer 30 neurons means we are allowing our model to represent an image using 30 different features. This usually results in greater performance because each neuron learns a different set of weights to represent different functions over the input data.

That means that the first layer can construct 30 different features, each representing some different mix of pixels. You can change that 30 to anything you like, to make the model more or less complex.
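To make that concrete, here is a small sketch (my own, not from the notebook) of what out_features controls: nn.Linear(in_features, out_features) holds one row of weights per output neuron, so out_features=30 gives 30 sets of 784 weights and 30 activations per image.

```python
import torch
from torch import nn

layer = nn.Linear(28*28, 30)        # 30 neurons, each looking at all 784 pixels
print(layer.weight.shape)           # torch.Size([30, 784])
print(layer.bias.shape)             # torch.Size([30])

xb = torch.randn(64, 28*28)         # a batch of 64 flattened images
print(layer(xb).shape)              # torch.Size([64, 30]) -> 30 features per image
```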

Hi all,

I am having trouble with the loss function and I’m wondering if anyone can help!

In notebook 5 there is a table that represents the probability of an image being a 3 or a 7:

| 3 | 7 | targ | idx | loss |
|---|---|---|---|---|
| 0.602469 | 0.397531 | 0 | 0 | -0.602469 |
| 0.502065 | 0.497935 | 1 | 1 | -0.497935 |
| 0.133188 | 0.866811 | 0 | 2 | -0.133188 |
| 0.99664 | 0.00336017 | 1 | 3 | -0.00336017 |
| 0.595949 | 0.404051 | 1 | 4 | -0.404051 |
| 0.366118 | 0.633882 | 0 | 5 | -0.366118 |

From my interpretation, for the loss of predicting a 3, which is where an image has target = 1, you take the probability of the image being a 7. So for the second row, the image is actually a 3, with target = 1, and the loss is given by -Pr(7) = -0.497935. This is the same as -(1 - Pr(3)).

We want to maximise this to get the best possible performance, so that it ends up equal to zero (maximise because it's a negative number).
Or better to say, we want to minimise the probability that the model predicts 7 in this case, so we use SGD to find the parameters that minimise this.
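To make the indexing concrete, the loss column of that table can be reproduced by picking, for each row, the activation in the column given by targ and negating it (a small sketch using the numbers above):

```python
import torch

sm_acts = torch.tensor([[0.602469, 0.397531],
                        [0.502065, 0.497935],
                        [0.133188, 0.866811],
                        [0.996640, 0.003360],
                        [0.595949, 0.404051],
                        [0.366118, 0.633882]])
targ = torch.tensor([0, 1, 0, 1, 1, 0])
idx = torch.arange(6)

loss = -sm_acts[idx, targ]   # pick the column given by targ for each row, then negate
print(loss)  # tensor([-0.6025, -0.4979, -0.1332, -0.0034, -0.4041, -0.3661])
```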

My question arises from the multiclass case. The notebook says:

To see this, consider what would happen if we added an activation column for every digit (0 through 9), and then targ contained a number from 0 to 9. As long as the activation columns sum to 1 (as they will, if we use softmax), then we’ll have a loss function that shows how well we’re predicting each digit. We’re only picking the loss from the column containing the correct label. We don’t need to consider the other columns, because by the definition of softmax, they add up to 1 minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible must mean we’re also decreasing the activations of the remaining columns.

I interpret this like this: if we have the digits 0 through 9, with the input image being an actual 3, we use the result of the softmax, which gives the probability of it being a 3 (and of course the probability of it being every other digit), as the loss function. Therefore we're looking to maximise this value, and hence minimise 1 - Pr(3). My question is: is this loss function, in the example explained here, 1 - Pr(image = 3)? There's no mention of 1 - Pr(3) in the notebook, or of the "1 -" operation in general, so that is confusing me!
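For what it's worth, here is a small 10-class sketch of what the quoted passage describes (my own example, not from the notebook): the loss only looks at the softmax column of the correct digit, and because each row sums to 1, pushing that column up necessarily pushes everything else down, so no explicit "1 -" is needed.

```python
import torch

acts = torch.randn(4, 10)             # raw activations for 4 images, digits 0-9
sm = torch.softmax(acts, dim=1)       # each row now sums to 1
targ = torch.tensor([3, 0, 7, 3])     # the correct digit for each image

picked = sm[torch.arange(4), targ]    # Pr(correct digit) for each image
print(picked)
print(sm.sum(dim=1))                  # all 1.0
print(1 - picked)                     # total probability given to the wrong digits
```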

Thanks!

Possible bug! I loaded the Paperspace + Fast.AI notebook on https://console.paperspace.com/ and ran all of the cells in the full book .ipynb file. I get this, which means something went very wrong somewhere:

I did not change any of the default settings when creating the notebook. All the notebook cells are unchanged.

Many thanks!

Try a higher learning rate


Thank you! I can see a change when running this cell the first time, but a different problem arises when I run it several more times:

It seems to be stuck on these three parameters:

tensor([ 0.1753, -0.6854,  1.3738], requires_grad=True)

When looking at the speed variable initialisation:

speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1

Ignoring the added randn noise, I find the parameters to be (please correct me if I made an error; I am error prone):

0.75 * (time - 9.5)**2 + 1
= 0.75 * (time**2 - 19*time + 90.25) + 1
= 0.75*time**2 - 14.25*time + 68.6875

so the underlying parameters (if I did not make any errors) are a = 0.75, b = -14.25, and c = 68.6875, which are different from the inferred a = 0.1753, b = -0.6854, and c = 1.3738.
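A quick numerical check of that expansion (my own sketch):

```python
import torch

time = torch.arange(20).float()
original = 0.75*(time - 9.5)**2 + 1
expanded = 0.75*time**2 - 14.25*time + 68.6875
print(torch.allclose(original, expanded))   # True
```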

Also, I expected the embedded outputs shown in the notebooks to match the outputs one gets after running the unchanged notebooks top to bottom. Without working code and expected output, great confusion arises. I have opened an issue about this: https://github.com/fastai/fastbook/issues/500.


Actually, the inferred points of the embedded output seem to be reaching the same values as the inferred points of my output.

I am still confused. Why do they not converge?

The main point of this section of the book/notebook is to demonstrate how SGD works so you understand the concept. It shows you that after running several epochs the parameters are improving, which demonstrates it's working. At the end of this you should not expect the learned parameters to exactly match the values used in the original equation, just that they are closer to them after training than the random ones that were generated.

In the example in the book, you also should not expect the parameters to converge to the literal numbers in the speed expression, because the quadratic is written in a different form (a*time**2 + b*time + c vs 0.75*(time - 9.5)**2 + 1): the values being learned are the expanded ones, a = 0.75, b = -14.25, c ≈ 68.7, and the added random noise means even those will only be approached, not matched exactly.

If you want to see the learned parameters converge much closer to the factors in the speed equation, you can remove the noise and change the speed equation so its factors match the quadratic directly (e.g. speed = 0.75*(time)**2 + 4). I also added extra items to the print statement in the apply_step function, if prn: print(loss.item(), params.data, params.grad.data), and moved it above the params.grad = None line so you can easily watch the parameters changing. The last thing you can do is play around with increasing the learning rate (if you go too high it will not converge, so change it slowly) and run a lot more iterations, which you will need for the values to converge. See the sketch below and the screenshot.
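A rough sketch of the loop with those modifications (the structure follows the notebook's apply_step; the noiseless speed, learning rate, and iteration count here are just illustrative placeholders):

```python
import torch

time = torch.arange(0, 20).float()
speed = 0.75*time**2 + 4                 # noise removed; same form as the quadratic below

def f(t, params):
    a, b, c = params
    return a*(t**2) + (b*t) + c

def mse(preds, targets):
    return ((preds - targets)**2).mean()

params = torch.randn(3).requires_grad_()
lr = 1e-5                                # placeholder; raise it slowly to speed up convergence

def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    # print before zeroing the gradient so params.grad.data is still visible
    if prn: print(loss.item(), params.data, params.grad.data)
    params.grad = None
    return preds

for _ in range(10):                      # run many more steps to approach a=0.75, b=0, c=4
    apply_step(params)
```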

There are a lot of tricks for improvements all around which are discussed later in the course. The main thing is just having a basic concept of how SGD works.


In the example, though, it seems like the bias is being added to the entire image, not to individual pixels? I could be missing something, though.
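For context, a small sketch of what I think is happening (assuming the chapter's linear1-style setup with a (28*28, 1) weight matrix and a single bias): each image is reduced to one activation by xb @ weights, and the single bias is then added to that per-image activation rather than to individual pixels.

```python
import torch

xb = torch.randn(4, 28*28)          # a mini-batch of 4 flattened images
weights = torch.randn(28*28, 1)
bias = torch.randn(1)

acts = xb @ weights                 # shape (4, 1): one activation per image
preds = acts + bias                 # the single bias broadcasts across the 4 activations
print(acts.shape, preds.shape)      # torch.Size([4, 1]) torch.Size([4, 1])
```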