Lesson 4 - What happens inside nn.Sequential?

Hi everyone! I’m on lesson 4 and I’ve been implementing everything from scratch following Jeremy’s suggestion in the “further research”.

I’d like to continue building the whole neural network from scratch (e.g. implement my own nn.Sequential), but I’m having trouble wrapping my head around how the outputs/activations of intermediate layers are treated, and how to keep track of gradients across multiple layers.

For example, let’s say I want to create the following simple neural net:

  • Layer 1: 784 inputs, 30 outputs
  • ReLU
  • Layer 2: 30 inputs, 10 outputs (the labels)

Individually, each layer is quite simple:

  • Layer 1 has 30 parameters for each pixel, right? i.e. a parameter set of weights (784, 30) & biases (30).
  • ReLU is straightforward enough, as I just need to take max(0, layer1).
  • Layer 2 is similar to Layer 1, just with a different shape, i.e. weights (30, 10) & biases (10). (A rough sketch of what I mean is just below this list.)
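For reference, here is a minimal sketch of how I’m picturing the parameter setup (init_params and the 0.1 scale are placeholders I made up, not something from the lesson):

import torch

def init_params(size, std=0.1):
    # random init, scaled down, with autograd tracking enabled
    return (torch.randn(size) * std).requires_grad_()

weights1 = init_params((784, 30))   # Layer 1: one weight per pixel, per output
bias1    = init_params(30)
weights2 = init_params((30, 10))    # Layer 2: 30 inputs -> 10 outputs (the labels)
bias2    = init_params(10)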

After Layer 2, I know I’ll need to use cross-entropy loss on the final activations.

But now comes the part I don’t understand: what happens before Layer 2, in each layer, during the forward and backward steps? What do I need to do with the 30 outputs from Layer 1 before rectifying them? Is the loss only measured at the last layer? If so, how do I treat the 30 intermediate activations? Do I need to do anything with them?

Another issue I have is that the gradients for Layer 1 seem to get “lost” after I rectify its outputs and pass them through to Layer 2, so calling .backward() on the loss doesn’t work properly: Layer 1’s parameters end up with .grad equal to None, which leads to the following error when I try to access the underlying .data attribute.

# how I am currently treating each layer: 
  def _forward(self, xb):
      self.preds = self.layer1.model(xb)          # Layer 1: (batch, 784) -> (batch, 30)
      self.preds = self._ReLU(self.preds)         # rectify the 30 activations
      self.preds = self.layer2.model(self.preds)  # Layer 2: (batch, 30) -> (batch, 10)

When I try to optimize,

     42   def _optimize(self, *args): # my own optimizer based on lesson 4
---> 43     params = [self.layer1.weights.grad.data, 
     44               self.layer1.bias.grad.data,
     45               self.layer2.weights.grad.data,
     46   ...

AttributeError: 'NoneType' object has no attribute 'data'

What’s the correct way to do this?

Any help is appreciated! My goal is to implement a simple 2-layer net from scratch, without the torch helpers like nn.Sequential and nn.Linear. I’m almost there! If anyone can help me understand the above, I’d be extremely grateful!

@rek
That is great! Implementing your own nn.Sequential is a really good way to understand PyTorch and its functionality. It’s actually not that difficult.
Are you running your _optimize function before passing some data through your model and calling the .backward() method on your loss?
Each parameter’s .grad will be None unless .backward() has been run at least once.
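Roughly, the order of a training step looks like this (model, loss_func, params and lr are placeholder names, since I don’t know your exact code):

preds = model(xb)             # 1. forward pass
loss = loss_func(preds, yb)   # 2. loss is computed on the final activations only
loss.backward()               # 3. fills in .grad for every parameter in the graph

for p in params:              # 4. only now is p.grad a tensor instead of None
    p.data -= p.grad * lr     #    take a gradient step
    p.grad.zero_()            #    reset the gradients before the next batch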
Happy to help you further if needed!
Cheers!

I’ve now found that the error is due to NaN gradients. I tried using

from torch import autograd
autograd.set_detect_anomaly(True)

and it’s telling me that LogBackward is returning NaNs: Function 'LogBackward' returned nan values in its 0th output.

I realized that computing torch.log of torch.softmax was the culprit, so I just switched to torch.log_softmax and it solved everything! Strange.
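In case anyone else hits this: my understanding (which could be off) is that torch.softmax can underflow to exactly 0 for very negative logits, and torch.log of 0 is -inf, which then poisons the backward pass with NaNs, while torch.log_softmax computes the same quantity with the log-sum-exp trick and stays finite. A tiny sketch (the input values are made up just to force the underflow):

import torch

x = torch.tensor([[100.0, 0.0, -100.0]], requires_grad=True)

unstable = torch.log(torch.softmax(x, dim=1))  # softmax underflows to 0, log(0) = -inf
stable   = torch.log_softmax(x, dim=1)         # stays finite: roughly [0., -100., -200.]

unstable.sum().backward()
print(x.grad)                                  # full of NaNs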