How does backpropagation update w1 and b1 (layer 1) through layer 2 (w2, b2) and the ReLU activation?

This is about chapter 4's more complex network (the 2-layer, 31-neuron model built from scratch).
The code explains it better.

See the commented-out lines in the code; uncommenting them throws an error.
Does PyTorch's backward pass propagate the gradient to the earlier layers as well?
I don't think the parameters of w1 are changing: when I log w1 the values are always the same, and w1.grad is [0., 0., …].

TIA <3


# One hidden layer with 30 neurons and a ReLU activation,
# followed by an output layer with 1 neuron; the sigmoid is
# applied later, inside the loss function.
# The second dimension of each weight matrix denotes the number of neurons.
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
def complex_net(x): 
    res = x@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res

def calc_grad(x, y, model):
  pred = model(x)
  loss = mnist_loss(pred, y*1.0)
  loss.backward()

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

def train_a__complex_net_epoch(x, y, model, lr = 0.0001):
  
  calc_grad(x, y, model)
  w2.data -= lr * w2.grad
  b2.data -= lr * b2.grad
  # w2.backward()
  # b2.backward() 
  
  w1.data -= lr * w1.grad
  b1.data -= lr * b1.grad

  w2.grad.zero_()
  w1.grad.zero_()
  b2.grad.zero_()
  b1.grad.zero_()

Is the snippet you provided above your own implementation or an implementation taken from the chapter?

Hi, thanks for the response.
It is a modified version of the open-ended question at the end of chapter 4, where a non-linearity was added.

The chapter does not show a from-scratch way to train a model with more than one layer and more than one set of weights.

So far I came up with the implementation below, but whenever I print w1 in the logs I see that it is never updated; that is, the loss only seems to propagate to w2 and not from w2 back to w1.

# One hidden layer with 30 neurons and a ReLU activation,
# followed by an output layer with 1 neuron; the sigmoid is
# applied later, inside the loss function.
# The second dimension of each weight matrix denotes the number of neurons.

## A non-linear net with more than one activation function and a sigmoid on the last layer.
## This is the kind of neural network used in production today.


w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
def complex_net(x):
    res = x@w1 + b1
    res = torch.relu(res) # use a torch operation for the activation so autograd can track it and propagate the gradient back through this layer
    res = res@w2 + b2
    return res

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

def calc_grad(x, y, model):
  pred = model(x)
  loss = mnist_loss(pred, y*1.0)
  loss.backward()


def batch_accuracy(x, y, model):
    pred = model(x)
    # The loss function is not the same thing as accuracy: the loss still penalizes
    # a 0.99 probability a little, but for accuracy a 0.99 probability simply counts
    # as the correct class, so we measure that separately.
    acc1 = (pred > 0.5) == y
    return acc1.float().mean()
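# A small worked example of the comment above (hypothetical numbers):
#   take two images with target 1 and predicted probabilities 0.99 and 0.40.
#   mnist_loss averages (1 - 0.99) = 0.01 and (1 - 0.40) = 0.60, giving 0.305,
#   so even the confident correct prediction still contributes a little loss.
#   batch_accuracy only asks which side of 0.5 each prediction falls on,
#   so the same two predictions give an accuracy of exactly 0.5.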


def train_a__complex_net_epoch(x, y, model, lr = 0.0001):

  calc_grad(x, y, model)
  w2.data -= lr * w2.grad.data
  b2.data -= lr * b2.grad.data
  # print("W2 DATA", w2.data, w2.grad.data)
  
  
  # w2.backward()
  # b2.backward()

  w1.data -= lr * w1.grad.data
  b1.data -= lr * b1.grad.data

  print("W1 DATA", w1.data, w1.grad.data)
  
  w2.grad.zero_()
  b2.grad.zero_()
  w1.grad.zero_()
  b1.grad.zero_()

for i in range(3):
  train_a__complex_net_epoch(train_x, train_y, complex_net, lr=0.1)
  print(batch_accuracy(valid_x, valid_y, complex_net), end = '\n')

The output looks like this:

W1 DATA tensor([[-1.8053, -0.8775,  0.6938,  ...,  0.9554,  0.2526, -1.1141],
        [ 1.6942,  1.7738, -0.6138,  ..., -1.8349, -0.7304,  0.0407],
        [-1.0632, -0.8974, -0.7251,  ..., -1.3899,  0.6839,  1.1826],
        ...,
        [ 1.7208,  0.9560,  0.2970,  ..., -1.0092, -0.8471, -2.4497],
        [-0.1379, -0.1619, -0.6100,  ...,  0.8664, -0.5725,  1.0455],
        [-0.7145, -0.8339,  0.4436,  ..., -1.8590,  0.4673,  1.4599]]) tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
tensor(0.5505)
W1 DATA tensor([[-1.8053, -0.8775,  0.6938,  ...,  0.9554,  0.2526, -1.1141],
        [ 1.6942,  1.7738, -0.6138,  ..., -1.8349, -0.7304,  0.0407],
        [-1.0632, -0.8974, -0.7251,  ..., -1.3899,  0.6839,  1.1826],
        ...,
        [ 1.7208,  0.9560,  0.2970,  ..., -1.0092, -0.8471, -2.4497],
        [-0.1379, -0.1619, -0.6100,  ...,  0.8664, -0.5725,  1.0455],
        [-0.7145, -0.8339,  0.4436,  ..., -1.8590,  0.4673,  1.4599]]) tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
tensor(0.5515)
W1 DATA tensor([[-1.8053, -0.8775,  0.6938,  ...,  0.9554,  0.2526, -1.1141],
        [ 1.6942,  1.7738, -0.6138,  ..., -1.8349, -0.7304,  0.0407],
        [-1.0632, -0.8974, -0.7251,  ..., -1.3899,  0.6839,  1.1826],
        ...,
        [ 1.7208,  0.9560,  0.2970,  ..., -1.0092, -0.8471, -2.4497],
        [-0.1379, -0.1619, -0.6100,  ...,  0.8664, -0.5725,  1.0455],
        [-0.7145, -0.8339,  0.4436,  ..., -1.8590,  0.4673,  1.4599]]) tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
tensor(0.5535)

Notice the zero gradients in w1.

Right, I see. Could you also please share the code in the calc_grad function in your original snippet?


Updated the post.

Thanks!

One piece of code that I see missing from the original snippet is a with torch.no_grad(): block. Anything inside that block is ignored by autograd, so it is excluded from the graph that the .backward() method uses to compute the gradients.

So a possible cause of the error is that autograd is tracking the code you use to update the weights.
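For instance, here is a toy check (standalone tensors, not from your code) showing that a result computed inside torch.no_grad() is invisible to autograd:

import torch

w = torch.randn(3, requires_grad=True)

tracked = w * 2
print(tracked.requires_grad)      # True: this op is part of the graph

with torch.no_grad():
    untracked = w * 2
print(untracked.requires_grad)    # False: backward() will never see this op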

So what you would want to do is the following.

def train_a__complex_net_epoch(x, y, model, lr = 0.0001):
  calc_grad(x, y, model)     # run the forward and backward pass outside no_grad
  with torch.no_grad():      # only the weight updates and grad resets go inside
    w2.data -= lr * w2.grad
    b2.data -= lr * b2.grad

    w1.data -= lr * w1.grad
    b1.data -= lr * b1.grad

    w2.grad.zero_()
    w1.grad.zero_()
    b2.grad.zero_()
    b1.grad.zero_()

Let me know if this works.

You also don't want to call the .backward() method on the weights and biases themselves, which I see you have commented out in the original snippet. .backward() has to be called on the final result, the loss in this case, so that the chain rule can compute a gradient for every parameter that contributed to it.
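Here is a minimal sketch of that point (standalone toy tensors, not the ones from your snippet): calling .backward() on the scalar loss fills in .grad for the weights of both layers, whereas calling it on a weight matrix directly does not make sense.

import torch

w1 = torch.randn(4, 3, requires_grad=True)
w2 = torch.randn(3, 1, requires_grad=True)
x  = torch.randn(5, 4)

out  = torch.relu(x @ w1) @ w2        # two layers, shaped like complex_net
loss = out.sigmoid().mean()           # reduce to a single scalar "loss"

loss.backward()                       # chain rule runs back through both layers
print(w2.grad.shape, w1.grad.shape)   # torch.Size([3, 1]) torch.Size([4, 3])

# w2.backward() would raise an error: w2 is not a scalar, and it is a leaf
# tensor with no computation behind it, so there is nothing to backpropagate.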


I think everything is fine after all.
Most of the input pixels are 0, and the gradient of the first-layer weights is proportional to the input, so most of the entries of w1.grad are 0. The corners of the matrix that the truncated print shows correspond to border pixels, which are blank in every image, so those rows of w1 genuinely never change. The code is working; thanks for the analysis.
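A quick way to check that explanation (a sketch assuming train_x, train_y and the functions above are still in scope): the rows of w1.grad that correspond to pixels that are zero in every image of the batch must be exactly zero, because the gradient of the first matmul with respect to w1 is x transposed times the upstream gradient.

always_zero = (train_x == 0).all(dim=0)     # pixels that are blank in every image
print(always_zero.sum())                     # a large share of the 784 inputs

calc_grad(train_x, train_y, complex_net)     # repopulate the .grad fields
print(w1.grad[always_zero].abs().sum())      # exactly 0 for those rows
print(w1.grad[~always_zero].abs().sum())     # typically non-zero for the rest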