MNIST number prediction LESSON 4

In lesson 04 of part 1, the functions that recognize a three or a seven are:

def linear1(xb): return xb@weights + bias
corrects = (preds>0.0).float() == train_y

How is it possible to declare a three or a seven based on whether the sum is greater than or less than zero (>0 a three, <0 a seven)?

This is not intuitive at first, but once you go through the entire lesson and understand SGD and the sigmoid activation, it should start making sense.

  1. linear1 is a function that takes raw input and converts it to a number (a ‘logit’) between -inf and +inf.
  2. For binary classification (is it a 3 or a 7?), this logit will later go through a sigmoid, which squeezes the output to be between 0 and 1.
  3. The loss function, together with SGD, helps us adjust the parameters so that predictions for one class are close to 0 (linear1 below zero, ideally close to -inf) and for the other class close to 1 (linear1 above zero, ideally close to +inf).

I am not sure I am correct, but I think that since the randomized weights can be negative numbers, we just use positive or negative to define whether it’s a 3 or a 7.
We could also reverse it, with negative meaning 3 and positive meaning 7; the choice is arbitrary.


Basically, we will adjust our weights and bias in such a manner that when the input is a three, the output is greater than 0, and negative for a seven (or vice versa, whichever way you formulate the problem).

Our assumption is this: hopefully there is **some** inherent difference between our threes and sevens, and we hope we can multiply our input pixel values by weights and add them up. If our weights are properly adjusted, we can differentiate between threes and sevens. For example, here we differentiate our threes and sevens based on whether the sum is greater than or less than zero.

For example, imagine this:
a picture of a seven is less likely to be inked on the bottom right of the picture, whereas a bottom-right pixel is more likely to be inked in the case of a three. This means that, in the case of sevens, the bottom-right pixel is most probably white, i.e., has a higher pixel value, and the corresponding pixel for a three more likely has a lower (black) pixel value. Now if you multiply it by a proper weight, you can make sure that the output is positive or negative, depending on whether it’s a three or a seven. And you can do this for all pixels. Of course, not all pixels will have a heavy influence on the decision. For example, the middle-most pixel is as likely to be inked for a three as for a seven, so we can’t make a decision based on that pixel alone. But that’s fine: we’ll assign a lower weight to that pixel, and it won’t matter much in the final decision.
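A tiny numeric sketch of the idea above, with entirely made-up pixel values, weight, and bias (the real model learns these): a single pixel whose value differs between the two classes, combined with one weight and a bias, is enough to push the output's sign either way.

```python
# Invented numbers following the convention in the post:
# inked (black) pixel -> low value, un-inked (white) pixel -> high value.
pixel_for_three = 0.1   # bottom-right pixel of a three: inked, low value
pixel_for_seven = 0.9   # bottom-right pixel of a seven: un-inked, high value

weight = -1.0           # hypothetical learned weight for this pixel
bias = 0.5              # hypothetical bias

out_three = pixel_for_three * weight + bias   # 0.4  -> positive: "three"
out_seven = pixel_for_seven * weight + bias   # -0.4 -> negative: "seven"
print(out_three > 0, out_seven < 0)           # True True
```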

Hope it makes it clearer. Cheers.


Thank you so much @PalaashAgrawal. But I have one more question: initially, the weights are randomly generated. So how can we confidently declare that a three gives a value greater than zero and a seven a value less than zero?


Thank you @darek.kleczek. Makes sense now :slight_smile:

@naveenperera @yat626
The threes need not necessarily be greater than 0, and the sevens need not be less than zero. We can interchange them, or even set the threshold at any other number.
The idea is to create a mathematical distinction.
Our loss function is a function that has a higher value whenever the network gives a wrong prediction (for example, in this case, if the network predicts the value of a three as less than zero), and a lower value whenever the network gives a correct prediction (for example, if it predicts a three as greater than zero, or a seven as less than zero).
Now, we use an algorithm called gradient descent, which is designed to MINIMIZE the loss function. This basically means adjusting the weights to give the most correct predictions (because correct predictions give a lower loss value).

So basically, we randomly initialize the weights and later adjust them according to our criterion. Initially the neural network knows nothing about our data, so it gives horrible predictions, but we continuously improve it by adjusting the weights so that the loss is reduced (or, in other words, so that the network gives better predictions). This is also a great point to understand why we need a loss function: the loss function is the mathematical tool that tells us how well the neural network is working. When we give the network the task of minimizing the loss, we are in fact making it “learn” and give better predictions.
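The whole loop above can be sketched in plain PyTorch. This is a toy, self-contained version (random separable data, an invented sigmoid-based loss in the spirit of the lesson, not the actual MNIST code): random weights start out giving near-chance predictions, and repeatedly stepping against the gradient of the loss makes them much better.

```python
import torch

torch.manual_seed(42)
x = torch.randn(200, 5)               # fake inputs, 5 "pixels" each
true_w = torch.randn(5, 1)
y = (x @ true_w > 0).float()          # fake labels: 1 for "three", 0 for "seven"

weights = torch.randn(5, 1, requires_grad=True)   # random initialization
bias = torch.zeros(1, requires_grad=True)

def loss_fn(preds, targets):
    # low when label-1 predictions are high and label-0 predictions are low
    p = torch.sigmoid(preds)
    return torch.where(targets == 1, 1 - p, p).mean()

for _ in range(500):                  # gradient descent steps
    loss = loss_fn(x @ weights + bias, y)
    loss.backward()
    with torch.no_grad():
        weights -= weights.grad       # learning rate of 1.0, for simplicity
        bias -= bias.grad
        weights.grad.zero_()
        bias.grad.zero_()

acc = (((x @ weights + bias) > 0).float() == y).float().mean().item()
print(acc)                            # well above chance after training
```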

To make it clearer: even if you had set up the loss function so that threes gave a value less than zero and sevens a value greater than zero, gradient descent would adjust the weights accordingly so that the loss value is minimized (meaning, of course, that the weights would be adjusted by the gradient descent algorithm so that the best predictions are given, which are now negative for our threes and positive for our sevens).

You can even try setting the threshold to a non-zero value and find that it still works.

Hope it makes it clearer


Loud and clear @PalaashAgrawal. Thank you so much :raised_hands:. The point I was missing was what the loss function gives us. As you said, we can make the distinction based on that. I totally understand it now. Thanks again. Cheers!

Agrawal: Let me see if I have this right. To reiterate, the question is about
corrects = (preds>0.0).float() == train_y

The variable “corrects” is a bit misleading, because the predictions at this point are neither correct nor incorrect. We do know that about half the predictions will be below 0, because the weights were initialized with randn, which draws random values from a normal distribution with a mean of 0.

Consequently, all that “corrects.float().mean().item()” tells us is that roughly half the values are negative and half positive, which we basically already knew. It is the loss function that will push the predictions in whatever direction we want and give the results meaning.
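The point above can be demonstrated with a quick sketch (fake pixel data and random labels standing in for the threes and sevens): with randn-initialized weights and no training, “corrects” sits around chance level, roughly 0.5.

```python
import torch

torch.manual_seed(0)
weights = torch.randn(28*28, 1)    # untrained, randomly initialized
bias = torch.randn(1)

train_x = torch.rand(2000, 28*28)                  # fake pixel values in [0, 1]
train_y = torch.randint(0, 2, (2000, 1)).float()   # fake random labels

preds = train_x @ weights + bias
corrects = (preds > 0.0).float() == train_y
frac = corrects.float().mean().item()
print(frac)    # roughly 0.5: the untrained model is at chance level
```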

Does this make sense?


That is correct! And it will keep pushing the model until it gets the kind of results we are satisfied with. Of course, it’s not always possible to get 100% accurate results.

Thanks for the confirmation.

Here is something that has been bugging me for several weeks; maybe you or someone else can shed light on it. I understand how we use weights and biases to lower the loss, etc. What I don’t understand is: when we are finished reducing our loss, how do we actually use this for inference? What exactly is the model? Is it the set of weights? How does the model use the weights to make a prediction? I assume the learner does that. What does the learner do, exactly? Is the concept too complicated to explain in simple terms? I looked at the section on how to create your own learner, and it’s far above my knowledge at this point. However, the whole thing would make more sense to me if I had an idea of how inference is done from what we create in lesson 4.

The concept is not at all complicated, and you are, knowingly or unknowingly, totally spot on.
The machine learning model is nothing but the values of the weights. We basically store away the values of all the weights and biases. These optimized parameters “know” how to differentiate between a cat and a dog, or a three and a seven, for example.
Learner is nothing but a fastai class that contains your data, model, optimizer, and other settings. So don’t be confused about what a learner is: it’s basically a class that contains all the functionality we need for training the model. We don’t need to separately define a model class, an optimizer class, etc. Everything is under one roof. This is what makes fastai so simple and easy to use.

For example, fit() is nothing but a function that runs the optimizer over the model with respect to the data we pass to it.
What we are concerned with is this: we have a model with weights and biases, we have some labelled data that we pass through it, and the optimizer adjusts the weights and biases so that the best results are obtained. This is all the learner does; there’s no rocket science behind it.

We save the model after training (in other words, save the weights and biases) and later load them for inference. When you pass new, unseen data to the model during inference, you’re basically multiplying the pixel values by the weights and adding them up, and this results in the prediction. Because the weights are optimized, we won’t get random results but good ones: if you pass a picture of a handwritten three that the model has never trained on before, it will still predict correctly. That is what deep learning is, in essence!
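Here is a minimal sketch of that save-then-infer idea in plain PyTorch (in fastai you would typically use learn.export() and load_learner() instead). The weights here are just random stand-ins for trained parameters, and the file path and input are invented: the point is that "the model" is the saved tensors, and inference is re-applying the same multiply-and-add to new data.

```python
import os
import tempfile
import torch

torch.manual_seed(1)
weights = torch.randn(28*28, 1)   # pretend these were already optimized
bias = torch.randn(1)

# "Saving the model" is just saving the parameter tensors.
path = os.path.join(tempfile.gettempdir(), "toy_mnist_model.pt")
torch.save({"weights": weights, "bias": bias}, path)

# Later, for inference: load the parameters and apply them to unseen data.
params = torch.load(path)
new_image = torch.rand(1, 28*28)                 # a new, unseen "image"
pred = new_image @ params["weights"] + params["bias"]
label = "3" if pred.item() > 0 else "7"
print(label)
```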

Hope it is clearer.
If you really want to understand how the learner is coded, you can go to part 2 of the 2019 course. However, only go through it once you’ve finished part 1 of the course and you’re very comfortable with it and with broader deep learning concepts. You don’t want to go so deep that you don’t understand anything, and it all becomes more of a burden than a help.




Thanks for your past help on this. Since you seem to be active on the forum and have an admirable interest in helping others, I thought I would share this link with you so that you can share it with others when appropriate. It is a fantastic and easily understood explanation of exactly how backpropagation works. I find the simple partial derivatives to be magical in their effect.

Explanation of Back Propagation
Andrej Karpathy is so good that Tesla hired him away from Stanford.

I would like to contact you privately. Can you email me when you get a chance?