Why do we add the bias to the entire image rather than to each pixel?

Hello, as the title says, I was reading through part 1 of the course and I've reached the MNIST notebook. The only thing I can't quite grasp is why the bias is added to the weighted sum of each picture and not to each pixel.

An example is:

(train_x[0]*weights.T).sum() + bias

Why is this correct and not this:

(train_x[0]*weights.T + bias).sum()

Lastly, I don't get how the dot product helps us calculate the weighted sum better than a for loop.

Also, I would like some further clarification on why we choose the function y=a*x + b to predict images. Is that something standard? Can we use other functions?

Thanks in advance!

Hi elemos,

Try to run the code on Google Colab or Kaggle and inspect the shapes of train_x[0], weights.T and bias. Can you figure out why (train_x[0]*weights.T + bias).sum() won't work the way you expect?
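As a hint, here is a quick way to see those shapes. The tensors below are random stand-ins with the same shapes the notebook's init_params produces; only the shapes matter for the point being made:

```python
import torch

# Stand-ins with the notebook's shapes (values are random):
train_x = torch.randn(100, 28*28)   # 100 flattened 28x28 images
weights = torch.randn(28*28, 1)     # one weight per pixel
bias    = torch.randn(1)            # a single number

print(train_x[0].shape)   # torch.Size([784])
print(weights.T.shape)    # torch.Size([1, 784])
print(bias.shape)         # torch.Size([1])

# Broadcasting makes this run, but it adds the bias to all 784 weighted
# pixels, so the sum ends up containing 784*bias instead of bias:
a = (train_x[0]*weights.T + bias).sum()
b = (train_x[0]*weights.T).sum() + 784*bias
print(a, b)  # same value, up to float rounding
```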

The answer is in the notebook:

"While we could use a Python for loop to calculate the prediction for each image, that would be very slow. Because Python loops don't run on the GPU, and because Python is a slow language for loops in general, we need to represent as much of the computation in a model as possible using higher-level functions. In this case, […] matrix multiplication"
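To make that concrete, here is a small sketch (random stand-in tensors again) showing that a Python loop and a single matrix multiplication compute the same predictions, the latter in one batched, GPU-friendly operation:

```python
import torch

train_x = torch.randn(1000, 28*28)
weights = torch.randn(28*28, 1)
bias    = torch.randn(1)

# A Python loop: one weighted sum per image, interpreted step by step
loop_preds = torch.stack([(x*weights.T).sum() + bias for x in train_x])

# One matrix multiplication: the whole batch in a single op
matmul_preds = train_x@weights + bias

print(torch.allclose(loop_preds, matmul_preds, atol=1e-4))  # True
```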

This is how a standard linear layer is computed. You can see later in the notebook that nn.Linear (from PyTorch) is used instead of linear1. There are many different PyTorch models/layers that can be used; convolutions, for instance, are used a lot in computer vision. But you'll get there as soon as you dive deeper into the book.
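If it helps, you can check that nn.Linear really is the same computation as linear1. nn.Linear(784, 1) stores its own weight (shape (1, 784)) and bias (shape (1,)):

```python
import torch
from torch import nn

# nn.Linear computes x @ weight.T + bias, just like linear1 does
linear = nn.Linear(28*28, 1)

x = torch.randn(5, 28*28)
manual = x @ linear.weight.T + linear.bias
print(torch.allclose(manual, linear(x)))  # should print True
```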


Hello and thanks for the answer. Regarding the first part, I realised I meant to write

((train_x[0]*weights.T) + bias).sum()

so that the bias gets added to each weighted pixel.

I managed to figure out the second part about the matrix multiplication after I understood the shape of the tensor and how matrix multiplication works.

As for the third part, I get that it's a standard thing that was chosen for this example. However, I can't quite grasp how summing the weighted pixels gives us a prediction. Meaning, if we multiply each pixel by its respective weight and then sum the results from all the pixels, how does this constitute a prediction? That's what I really had in mind; I just phrased it really wrong.

Thank you for your answers on the rest, however. :grin:

Hello again :slight_smile:

The bias is created as a single number, bias = init_params(1)… Hm, at first I thought (train_x[0]*weights.T + bias).sum() would not work because the shapes don't match, but it does run because of broadcasting. The formula just ends up different: the bias gets added to every one of the 784 weighted pixels, so the sum contains 784*bias instead of bias.

The w*x in the equation y=w*x+b is a matrix multiplication. A matrix multiplication in Python is done with the @ operator, as seen in def linear1(xb): return xb@weights + bias. The Python code (train_x[0]*weights.T).sum() is mimicking what happens if you use the matrix multiply train_x[0]@weights. But this only works for the first image, train_x[0]. If you want to do it for the whole dataset you would have to use a for loop, whereas with matrix multiplication you can just write train_x@weights and get a prediction for all images at once.
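Here is a small sketch of that equivalence (random stand-in tensors with the notebook's shapes):

```python
import torch

train_x = torch.randn(100, 28*28)
weights = torch.randn(28*28, 1)
bias    = torch.randn(1)

# Elementwise multiply + sum, one image at a time...
single = (train_x[0]*weights.T).sum() + bias

# ...versus one matrix multiply that covers every image at once:
all_preds = train_x@weights + bias      # shape (100, 1)

print(single, all_preds[0])  # same number, up to float rounding
```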

Try to reread the chapter and rewatch the videos… hopefully you will find sense in my words then :smiley: (train_x[0]*weights.T).sum() + bias gives you a number, which I understand is far from being a prediction at first glance. What does it mean?! The example is a binary classification: is it a 3, yes or no? So how does 23.4543 help you with that?! It doesn't (really). That's why sigmoid is introduced: to push this number between 0 (false) and 1 (true).
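To see what sigmoid does to such a raw output (the values below are just examples, including the 23.4543 from above):

```python
import torch

raw = torch.tensor([23.4543, -3.2, 0.0])

# sigmoid squashes any real number into (0, 1), so the result can be
# read as "how confident is the model that this image is a 3":
print(torch.sigmoid(raw))  # tensor([1.0000, 0.0392, 0.5000])
```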

Is that right? I think in the example in the book, predictions are already made before introducing the sigmoid function, in this cell: Google Colab

It says that “To decide if an output represents a 3 or a 7, we can just check whether it’s greater than 0.0”. Why would 0 be the cutoff at which a prediction says it’s a 3 or a 7?

UPDATE: I dug around in the notebook and I think I have a clearer idea of what's happening: 0 was just an arbitrary cutoff for converting the output to a prediction. But if we had (somehow) optimised based on 0 being the cutoff, the model would have adjusted the weights so that 0 really does separate the two classes.
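One more way to see why 0 is a natural cutoff: thresholding the raw output at 0 is exactly the same decision as thresholding sigmoid(output) at 0.5, because sigmoid(0) = 0.5 and sigmoid is increasing. A quick sketch with made-up raw outputs:

```python
import torch

preds = torch.randn(10, 1)  # stand-ins for raw linear outputs

# preds > 0.0 and sigmoid(preds) > 0.5 always agree:
print(((preds > 0.0) == (torch.sigmoid(preds) > 0.5)).all())  # tensor(True)
```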