MNIST loss function

In the MNIST Loss Function section, the authors use this to calculate the prediction:

(train_x[0]*weights.T).sum() + bias

Why are the weights transposed in this case?

(train_x[0]*weights).sum() + bias also returns a number.
What does that number signify?


This answer may be a year late, but I’ve just reached this step and spent 2 days trying to figure this out (blame my lack of foundation for the amount of time it took :slight_smile:)

So here goes

We need to do a matrix multiplication, which in Python is done using the @ operator. In the example above we’re not using the @ operator, just a simple element-wise multiplication, hence we additionally need to transpose. If you reviewed the Khan Academy matrix multiplication topic, you would know that the objective of a matrix multiplication is to take dot products.

So when you simply multiply
train_x[0], which has shape [784],
with
weights, which has shape [784, 1],

you are going to get a different result than multiplying with
weights.T, which has shape [1, 784].

In my understanding:
With weights.T, each pixel of the input image train_x[0] is multiplied by its corresponding value in weights.

With weights, each pixel of train_x[0] gets multiplied by all of the weights (because of broadcasting), which is not the result we want.

Not the same results. Hence we use the transpose when we are not using matrix multiplication, so we still end up with the same result as a dot product.
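
Here’s a quick sketch you can run to check this (random stand-ins for the image and weights, not the book’s exact data):

import torch

# stand-ins with the same shapes as the notebook's variables
x = torch.rand(784)            # like train_x[0], shape [784]
weights = torch.randn(784, 1)  # shape [784, 1]
bias = torch.randn(1)

print((x * weights.T).sum() + bias)  # element-wise with the transpose, then sum
print(x @ weights + bias)            # matrix multiplication gives the same number
print((x * weights).sum() + bias)    # no transpose: sums a 784x784 broadcast, a different number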


Thanks for answering it a year later, because I’ve had the same question since yesterday. :smiley:

Despite your excellent answer, I still didn’t feel like I understood what the heck was going on. I think I’ve figured it out in a way that makes sense to me!

Ultimately, what helped me move past this was finally realizing that multiplying tensors with * performs an operation I was not clear on at all.

I understood from math that matrix multiplication, at least for 2D matrices, is only possible when the number of columns of the first matrix matches the number of rows of the second matrix (i.e. you can multiply an Mx2 by a 2xN to yield an MxN matrix, but multiplying an Mx1 by an Nx1 is not possible since 1 != N).
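
For example, with random matrices in PyTorch, that rule looks like this:

import torch

A = torch.randn(5, 2)   # M x 2
B = torch.randn(2, 3)   # 2 x N
print((A @ B).shape)    # torch.Size([5, 3]): the inner dimensions (2 and 2) match

C = torch.randn(5, 1)   # M x 1
D = torch.randn(4, 1)   # N x 1
try:
    C @ D               # inner dimensions are 1 and 4, so this fails
except RuntimeError as err:
    print(err)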

I also understood from the previous fastai lessons that there is such a thing in PyTorch and NumPy called broadcasting, where the libraries will stretch vectors/matrices/tensors to make the computations possible even when the shapes don’t match.

Despite having that knowledge, I was seeing results I did not expect when performing multiplication on tensors.

Your answer, getting to understand the goal of the calculation in the first place, and trying at least 2982938 different queries in Jupyter, Google, YouTube, and ChatGPT finally got me to some level of understanding.

Goal

In order to get a prediction from our randomized initial weights, we need to multiply every pixel in our image by its corresponding weight. When we state our goal this way, it is easier to understand which approach is the right one.
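
In code, the goal looks roughly like this (made-up names; I’m giving the weights shape [784] here just to show the idea, while in the notebook weights has shape [784, 1], which is exactly what causes the confusion below):

import torch

pixels = torch.rand(784)   # one flattened 28x28 image
w = torch.randn(784)       # one weight per pixel
bias = torch.randn(1)

# multiply each pixel by its own weight, then add everything up
prediction = (pixels * w).sum() + bias
print(prediction)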

Multiplication in PyTorch

In addition to understanding our goal, we also need to understand how PyTorch works when multiplying tensors. When you multiply two tensors (i.e. use the * operator), PyTorch will, if necessary, first perform broadcasting. Then it will perform element-wise multiplication. That was the missing information for me, and I only understood it after comparing various outputs of tensor operations that didn’t match my expectations. Element-wise multiplication is different from matrix multiplication and will yield different results.
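
A tiny example of what * actually does (made-up numbers):

import torch

a = torch.tensor([1., 2., 3.])    # shape [3]
b = torch.tensor([[10.], [20.]])  # shape [2, 1]

# the shapes don't match, so * first broadcasts both tensors to [2, 3],
# then multiplies element by element
print(a * b)
# tensor([[10., 20., 30.],
#         [20., 40., 60.]])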

Answer

Because our goal is to multiply every pixel in our image by its corresponding weight, we know we want back 784 products, one for each pixel (a 1x784 tensor).

If we simply compare the shapes of train_x[0], weights, and weights.T, then our answer becomes clear.

train_x[0] is a vector with 784 elements; its shape is [784], so under broadcasting it behaves like a 1x784 row vector.

weights is a 2D tensor (a column vector) with shape 784x1.

weights.T, the transpose of weights, has the opposite shape, 1x784.
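
You can check these shapes with random stand-ins of the same sizes (I don’t have the notebook’s exact tensors in this snippet):

import torch

x = torch.rand(784)            # stand-in for train_x[0]
weights = torch.randn(784, 1)  # stand-in for weights

print(x.shape)          # torch.Size([784])
print(weights.shape)    # torch.Size([784, 1])
print(weights.T.shape)  # torch.Size([1, 784])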

train_x[0] * weights

If you use * between two tensors shaped 1x784 and 784x1, PyTorch will attempt element-wise multiplication. Element-wise multiplication requires the two tensors to be the same shape, so broadcasting happens first: the 1x784 tensor is stretched into a 784x784 tensor, and the 784x1 tensor also becomes a 784x784 tensor. Then PyTorch multiplies the elements at the same row and column index in both tensors by one another. The final result is a 784x784 tensor. This is not what we want.
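
With the same random stand-ins as above, that looks like:

import torch

x = torch.rand(784)            # stand-in for train_x[0]
weights = torch.randn(784, 1)

result = x * weights           # [784] and [784, 1] both broadcast up to [784, 784]
print(result.shape)            # torch.Size([784, 784]): every pixel times every weight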

train_x[0] * weights.T

Since both tensors line up as 1x784, nothing needs to be stretched. Element-wise multiplication happens and you get back another 1x784 tensor: the image’s pixels scaled (multiplied) by their weights.
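
And the case we do want; note that summing it gives the same number as a real matrix multiplication (again with random stand-ins):

import torch

x = torch.rand(784)            # stand-in for train_x[0]
weights = torch.randn(784, 1)
bias = torch.randn(1)

result = x * weights.T         # x broadcasts from [784] to [1, 784] to match weights.T
print(result.shape)            # torch.Size([1, 784]): each pixel scaled by its own weight

print(result.sum() + bias)     # the prediction
print(x @ weights + bias)      # same number via matrix multiplication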

This was a frustrating discovery, as this nuance was not covered anywhere and is quite an important distinction. This whole time I thought * was performing matrix multiplication. Instead, if you want matrix multiplication, you need to either use Python’s @ operator (which is equivalent to torch.matmul) or PyTorch’s torch.mm. Note that torch.matmul will perform broadcasting, whereas torch.mm only works on plain 2D matrices and doesn’t broadcast.
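
A quick way to see that difference with small random tensors:

import torch

A = torch.randn(2, 3, 4)   # a batch of two 3x4 matrices
B = torch.randn(4, 5)

print(torch.matmul(A, B).shape)  # torch.Size([2, 3, 5]): matmul broadcasts over the batch dimension
print((A @ B).shape)             # @ behaves like torch.matmul

try:
    torch.mm(A, B)               # torch.mm only accepts plain 2D matrices
except RuntimeError as err:
    print(err)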

Try it yourself

Runnable Jupyter examples are worth a million bucks! Run this in your notebook for additional clarity on the difference between element-wise (*) and matrix (@) multiplication.

import torch

tensor_a = torch.Tensor([1, 2, 3])  # shape (3,), a 1D tensor
tensor_b = torch.Tensor([           # shape (3, 1), a column vector
    [1],
    [2],
    [3]
])
print(tensor_a @ tensor_b)  # matrix multiplication, yields a single number: 1*1 + 2*2 + 3*3 = 14
print(tensor_a * tensor_b)  # element-wise multiplication, broadcasts, yields a 3x3 matrix

# tensor_c & tensor_d are effectively what broadcasting does to tensor_a and tensor_b
# when you multiply (*) them.
tensor_c = torch.Tensor([
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3]
])
tensor_d = torch.Tensor([
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3]
])

# confirms tensor_a * tensor_b is the same as tensor_c * tensor_d (element-wise multiplication)
print(tensor_a * tensor_b == tensor_c * tensor_d)

# performs actual matrix multiplication
print(torch.mm(tensor_c, tensor_d))

Results

tensor([14.])
tensor([[1., 2., 3.],
        [2., 4., 6.],
        [3., 6., 9.]])
tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])
tensor([[14., 14., 14.],
        [14., 14., 14.],
        [14., 14., 14.]])