Understanding Michael Nielsen's backpropagation code (Chapter 2)

mcintyre1994 · November 5, 2017, 8:25pm

I’m trying to understand/run the code in Michael Nielsen’s Neural Networks and Deep Learning chapter 2, on backpropagation: http://neuralnetworksanddeeplearning.com/chap2.html#the_code_for_backpropagation

At the start of the backward pass, it has:

delta = self.cost_derivative(activations[-1], y) * \
    sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())

The forward pass creates the activations list, where activations[i] contains a vector of the activations of the neurons in layer i. So activations[-1] is the last layer. y is the desired output.

cost_derivative is defined as:

def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations-y)

So the first line of the backward pass produces a vector with the same shape as the output layer. My question is: how is the np.dot on the 4th line supposed to work? My understanding is that activations[-2] is a vector of the activations of the neurons in the 2nd-to-last layer, which can have any number of neurons, so I’m not sure how we can take the dot product of delta, which has the shape of the output layer, with it (or its transpose).

I ran the code (https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py) with some added debug lines to try to understand this, and it doesn’t seem to work:

>>> from network import *; net = Network([2,1,2])
>>> net.backprop([1,2], [3,4])

Activations[0]
[1, 2]

Activations[1]
[[ 0.33579893]]

Activations[2]
[[ 0.37944698]
 [ 0.45005939]]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<snip>/neural-networks-and-deep-learning/src/network.py", line 117, in backprop
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
ValueError: shapes (2,2) and (1,1) not aligned: 2 (dim 1) != 1 (dim 0)

activations looks exactly as I’d expect - 2 activations, then 1, then 2. The failure is on the line I’m unclear about, and fails as I’d expect. But, presumably the code in this book is tested (the book is excellent) and I must be doing something wrong. I was writing an independent implementation and hit the same issue, so I was expecting to be able to take this code apart to figure it out - but I can’t figure out how this is supposed to work, or why it works for the author.

I’d appreciate any insight on what I’m missing here. Thanks!

hetelek · November 6, 2017, 12:03am

Michael Nielsen’s code expects column vectors for both the input and the desired output. Try passing it column vectors and it should work.

import numpy as np
from network import *; net = Network([2,1,2])

# Input and desired output as column vectors of shape (2, 1)
x = np.array([[1], [2]])
y = np.array([[3], [4]])
net.backprop(x, y)
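
To see why the shapes matter, here is a minimal sketch of just the output-layer step for a [2, 1, 2] network. It is not taken from the book's code: the random values and the names a_prev, z_out, a_out, y_col and y_flat are made up for illustration, and sigmoid/sigmoid_prime are re-defined locally so the snippet runs on its own. With column vectors, delta has shape (2, 1) and activations[-2].transpose() has shape (1, 1), so np.dot gives a (2, 1) matrix, the same shape as the last weight matrix. With a flat y, NumPy broadcasting seems to be what turns delta into the (2, 2) array seen in the traceback above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical stand-ins for the forward-pass results of a [2, 1, 2] network.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal((1, 1))  # plays the role of activations[-2], shape (1, 1)
z_out = rng.standard_normal((2, 1))   # plays the role of zs[-1], shape (2, 1)
a_out = sigmoid(z_out)                # plays the role of activations[-1], shape (2, 1)

# Column-vector target: elementwise ops keep the (2, 1) shape.
y_col = np.array([[3.0], [4.0]])
delta = (a_out - y_col) * sigmoid_prime(z_out)  # shape (2, 1)
nabla_w = np.dot(delta, a_prev.transpose())     # (2, 1) dot (1, 1) -> (2, 1)

# Flat target of shape (2,): broadcasting against (2, 1) yields (2, 2).
y_flat = np.array([3.0, 4.0])
bad_delta = (a_out - y_flat) * sigmoid_prime(z_out)  # shape (2, 2)

print(delta.shape, nabla_w.shape, bad_delta.shape)
```

Calling np.dot(bad_delta, a_prev.transpose()) at this point raises a ValueError, because a (2, 2) array cannot be multiplied by a (1, 1) array. In general delta is (n_out, 1) and the transposed previous activations are (1, n_prev), so the product is an (n_out, n_prev) matrix with one entry per weight.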

yashk · March 30, 2020, 7:04am

I have a question about Michael Nielsen’s code from chapter 1. I have read many people say that they get lower accuracy with TensorFlow than with Michael Nielsen’s code.
I’m not sure whether this is true, and if it is, what might be the reason behind it?

Thank you in advance!!