Lesson 4 - Official Topic

@spike @gautam_e
Good question!
I had to think about this.

Based on my understanding, having a bias for each weight would still be equivalent to having a single value for the entire neuron, since you could just sum all the biases for each pixel and end up with a single value.

Basically, each neuron is wired up to all the input values (i.e. each pixel in the image), and each input is amplified/attenuated by the weight for that pixel before everything is added up. We then add a single bias to the whole thing. If we had a bias for each input value (e.g. pixel), we could just total them up into an equivalent single value - does this make sense?
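A quick numerical sketch of that point (my own illustration, not from the lesson):

import torch

x = torch.randn(28*28)            # one flattened image
w = torch.randn(28*28)            # one weight per pixel
b_per_pixel = torch.randn(28*28)  # hypothetical: one bias per pixel

per_pixel = (w*x + b_per_pixel).sum()        # neuron with a bias per pixel
single    = (w*x).sum() + b_per_pixel.sum()  # same thing with one summed bias
print(torch.allclose(per_pixel, single))     # True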

Best regards,
Butch


Hey @butchland, yeah I think your answer is the right explanation. Thanks, it didn't occur to me earlier that it all boils down to one bias term effectively.

Hi,

I'm trying to replicate all the steps for the full MNIST dataset and wanted to sense check if the loss function is correct.

def cross_entropy_loss(predictions, targets):
    # log of the softmax over the class activations
    sm_acts = predictions.log_softmax(dim=1)
    # pick out the log-probability of the correct class for each row
    idx = range(len(predictions))
    res = -sm_acts[idx, targets].mean()
    return res
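One possible sanity check would be to compare it against PyTorch's built-in F.cross_entropy on random data, since both use a mean reduction:

import torch
import torch.nn.functional as F

preds = torch.randn(16, 10)               # 16 fake samples, 10 classes
targets = torch.randint(0, 10, (16,))

print(cross_entropy_loss(preds, targets))  # the function above
print(F.cross_entropy(preds, targets))     # built-in; should give the same value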

Any help will be much appreciated!

Thanks!

Hi!

I've made a solid attempt at the full MNIST from scratch and learnt a lot.

It looks like it's working fine when I define the neural net as a function.

import torch

def init_params(size, std=1.0):
    # random init, tagged with requires_grad_() so PyTorch tracks gradients
    return (torch.randn(size)*std).requires_grad_()

# initialise weights and biases for each of the linear layers
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,10)) # 10 final activations
b2 = init_params(10)      # 10 final activations
params = w1,b1,w2,b2

def simple_net(xb):
    res = xb@w1 + b1                  # first linear layer
    res = res.max(torch.tensor(0.0))  # ReLU
    res = res@w2 + b2                 # second linear layer
    return res

The output I get looks like this

When I try to refactor the code and use this alternative neural net built with nn.Module

s_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,10)
)

The output looks a bit odd: the training and validation losses decrease, but the batch accuracy is always 1.00 for each epoch, which is very strange.

The link to my Google Colab is here - Mnist from scratch

Would love some help from the community to figure out what's going on. :sweat_smile:

Thank you!

Adi

@muellerzr, wondering if you could lend a hand here? :sweat_smile: Thank you so much!


Hey @spike,

Hereā€™s how I understand it.

In y = w*x + b, x is just a single input or pixel. You can rewrite this as y = w1*x1 + b.

If there are more inputs (e.g. 3), the equation might look like y = w1*x1 + w2*x2 + w3*x3 + b.

You need one weight for each of the inputs or pixels, but only one bias no matter how many inputs there are.

In the MNIST example, there are 28x28 pixels to start with, and therefore we need to initialise and train 28x28 weights and 1 bias for a single linear layer with 1 activation.
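A tiny PyTorch sketch of that (my own example, not the lesson's code):

from torch import nn

layer = nn.Linear(28*28, 1)   # single linear layer with 1 activation
print(layer.weight.shape)     # torch.Size([1, 784]) -> one weight per pixel
print(layer.bias.shape)       # torch.Size([1])      -> a single bias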

Cheers,
Adi


I'm working through the questions for Chapter 4 and wanted to understand more about question 31: Why do we have to zero the gradients?

Specifically, why would one want to NOT zero out the gradients?

I think I understand why we want to zero them out in this implementation, but don't see why it's implemented in this way.

Thank you

Hey @misham

Great question. Got me thinking as well.

From the fastai book:

The reason for this is that loss.backward actually adds the gradients of loss to any gradients that are currently stored. So, we have to set the current gradients to 0 first.
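A small standalone demo of that accumulating behaviour (my own example, not from the book):

import torch

w = torch.tensor(3.0, requires_grad=True)

(w * 2).backward()
print(w.grad)      # tensor(2.)

(w * 2).backward()
print(w.grad)      # tensor(4.) -- the new gradient was added to the stored one

w.grad.zero_()     # reset, as the training loop does
(w * 2).backward()
print(w.grad)      # tensor(2.) again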

In terms of why it is implemented this way, these are the steps involved in SGD; stepping the parameters is the final step before we calculate the loss once again or end training.

Cheers,
Adi

Hi Adi! I wanted to look into your code, but I can't get it to run without errors. Maybe it's something about the new fastai and PyTorch versions on Colab?

Regarding the actual problem, it looks like training is running fine, it's just the metric that seems off, right? Did you try the out-of-the-box accuracy function?


Hey @johannesstutz,

Thanks for looking into this. Yes, the errors seem to be due to some compatibility issues with the fastai and fastcore versions in Colab. I got past these errors when I downgraded fastcore to an older version, and similarly with PyTorch.

When I built a neural net as a function, the training works fine and the metrics print as expected.

When I try it with nn.Sequential, the training and validation loss decreases (which is great) but the metrics print funny.

I tried the out-of-the-box accuracy metric and it still prints 1.00, which is very odd.

My goal is to build the full MNIST trainer from scratch and map it to the refactored or convenient version to strengthen my understanding. Really appreciate the help!

Cheers,
Adi

Thanks @aditya.swami

If I understand correctly, you can take multiple steps, checking the gradient, before deciding to end the training for that epoch?

Hi @misham,

There are 7 steps in SGD:

  1. Initialize the weights.
  2. Make a prediction.
  3. Calculate the loss.
  4. Calculate the gradient, which measures, for each weight, how changing that weight would change the loss.
  5. Step (that is, change) all the weights based on that calculation.
  6. Go back to step 2 and repeat the process.
  7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don't want to wait any longer).

PyTorch provides us with a handy way to calculate the gradient.

  1. We "tag" the initial randomised weights using requires_grad_(). Check out the init_params function in the lesson.

  2. Make the prediction. This is applying our inputs (the xb mini-batch from the training dataloader) to the model.

  3. Calculate the loss using our loss function -> we define this as mnist_loss in the lesson.

  4. Calculate the gradients with the loss.backward() method.

  5. We then step the weights by the gradient * learning rate.

  6. We zero out the gradients, because loss.backward actually adds the gradients of the loss to any gradients that are already stored. If we don't zero them out, the gradients used for the next step will be wrong.

  7. We repeat training for however many epochs we want. Usually this is until we reach a satisfactory accuracy (using the validation dataloader).
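Putting those steps together, here is a minimal sketch of the loop (my own toy paraphrase, not the lesson's exact code; the data is random, just to show the steps):

import torch

weights = (torch.randn(28*28, 1) * 0.1).requires_grad_()   # 1. init and "tag"
bias = torch.zeros(1).requires_grad_()
lr = 0.1

xb = torch.randn(64, 28*28)                 # stand-in mini-batch of inputs
yb = torch.randint(0, 2, (64, 1)).float()   # stand-in 0/1 targets

for epoch in range(5):
    preds = torch.sigmoid(xb @ weights + bias)             # 2. predict
    loss = torch.where(yb == 1, 1 - preds, preds).mean()   # 3. loss (mnist_loss style)
    loss.backward()                                        # 4. gradients
    with torch.no_grad():
        for p in (weights, bias):
            p -= p.grad * lr                               # 5. step the weights
            p.grad.zero_()                                 # 6. zero the gradients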

Hope this helps!

Cheers,
Adi

Right, that all makes sense, thank you for clarifying.

I think I'm not asking the question well, let me try it this way.

Since we're zeroing out the gradients in step 6, what is the purpose of adding the gradients back into the loss if we don't zero them out?

I was reading through the PyTorch docs but they don't explain the reasoning behind that functionality or its applications. I spent some time googling around but didn't find any discussions about it either - probably not googling for the right terms, though.

Thanks again for answering my questions

Misha

Hey @misham,

Glad to help. :slight_smile:

I don't follow what you mean by

Since we're zeroing out the gradients in step 6, what is the purpose of adding the gradients back into the loss if we don't zero them out?

We aren't adding the gradients back into the loss as far as I understand. For every epoch we:

  1. Calculate the loss
  2. Calculate the gradients
  3. Step the weights
  4. Zero the gradients
  5. Train another epoch

Can you help me understand your question better?

Cheers,
Adi

Hey Adi, I tried again today but can't run the notebook without errors. I installed the older versions of the libraries but I can't run the training loop:


Does it work for you? Running without errors?

I did find an interesting error though. Check your dataset - all the targets are zero! The error is in the create_xy function: count=0 should not be inside the for loop but outside it.
That explains why the accuracy was always 1.00 - the model correctly predicted everything as a zero, because that's what all the images were labeled as :smiley:


Sorry, I didn't make that clear at all. I'm asking about loss.backward() specifically. That is, why does <object>.backward() support adding gradients back in?

Nice catch @johannesstutz! I updated the code to make sure the labels are correct. :sweat_smile:

Tested it by indexing into some random values and it appears to be working ok. That's the good news.

When I ran the rest of the notebook, the accuracy was very low and the training loss did not appear to improve. I think I need to rework some parts of the code and debug.

Thanks for taking the time!


Hey @misham,

Your question is unfortunately beyond my current understanding. Perhaps someone else can help us!

I'm following 04_mnist_basics.ipynb and wondering what the accuracy function would be if we wanted to classify all digits rather than just 3s and 7s.

correct = (preds>0.5) == yb

is used when we are classifying between 2 classes.

Hey @misham, I found this StackOverflow answer regarding the default PyTorch behavior of accumulating gradients. Apparently, this behavior is useful for training other architectures, like RNNs. When you don't need the functionality, you just have to manually zero out the gradients.
I have just started learning about RNNs so I can't give you further details at the moment; maybe that's just a thing you have to accept as a default :slight_smile:
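For what it's worth, one concrete use of the accumulating behaviour I've come across (my own sketch, not from the linked answer) is simulating a larger batch size by accumulating gradients over several mini-batches before taking a single optimizer step:

import torch
from torch import nn, optim

model = nn.Linear(10, 1)
opt = optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

opt.zero_grad()
for i in range(accum_steps):
    xb, yb = torch.randn(8, 10), torch.randn(8, 1)       # stand-in mini-batches
    loss = ((model(xb) - yb) ** 2).mean() / accum_steps
    loss.backward()            # gradients add up across the mini-batches
opt.step()                     # one step using the accumulated gradients
opt.zero_grad()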


Hey @jellybrain, basically, when you have more than two classes your prediction is not a single number; every class has its own prediction. In the MNIST example, the model assigns a probability to every digit. The accuracy is then calculated by picking the class with the highest probability and comparing it to the label. Chapter 5 goes into detail on what a loss function for problems with more than 2 classes looks like!
(You can also check out the source code for accuracy, the default fastai accuracy function)
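A rough sketch of that calculation (my own example, not fastai's exact implementation):

import torch

preds = torch.randn(16, 10)            # one activation per digit, per sample
yb = torch.randint(0, 10, (16,))       # the labels
acc = (preds.argmax(dim=1) == yb).float().mean()
print(acc)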
