I’m trying to understand and run the code in Michael Nielsen’s Neural Networks and Deep Learning, chapter 2, on backpropagation: http://neuralnetworksanddeeplearning.com/chap2.html#the_code_for_backpropagation

At the start of the backward pass, it has:

```
delta = self.cost_derivative(activations[-1], y) * \
    sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
```

The forward pass creates the `activations` list, where `activations[i]` contains a vector of the activations of the neurons in layer i. So `activations[-1]` is the last layer, and `y` is the desired output.

`cost_derivative` is defined as:

```
def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations-y)
```

So that first line produces a vector with the same shape as the output layer. My question is: how is the `np.dot` on the 4th line supposed to work? My understanding is that `activations[-2]` is a vector of the activations of the neurons in the 2nd-to-last layer, which can have any number of neurons, so I’m not sure how we can take the dot product of it (or its transpose) with `delta`, which has the shape of the output layer.
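For reference, here is a minimal sketch of how `np.dot` behaves when both arguments are 2-D column vectors. The layer sizes (3 and 2), the values, and the names `a_prev` and `nabla_w_last` are made up for illustration, and the sketch assumes each layer's activations are stored as a NumPy array of shape (n, 1), which is an assumption rather than something established above.

```python
import numpy as np

# Hypothetical sizes: 3 neurons in the 2nd-to-last layer, 2 in the output layer.
# Both vectors are assumed to be 2-D column arrays of shape (n, 1).
delta = np.array([[0.1], [-0.2]])          # shape (2, 1)
a_prev = np.array([[0.5], [0.6], [0.7]])   # shape (3, 1)

# (2, 1) times (1, 3) is an ordinary matrix product, giving a (2, 3) matrix.
nabla_w_last = np.dot(delta, a_prev.transpose())

print(delta.shape, a_prev.transpose().shape, nabla_w_last.shape)
```

Under that assumption the call is a matrix product of a column with a row (an outer product), so the two layers do not need to have the same number of neurons.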

I ran the code (https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py) with some added debug lines to try to understand this, and it doesn’t seem to work:

```
>>> from network import *; net = Network([2,1,2])
>>> net.backprop([1,2], [3,4])
Activations[0]
[1, 2]
Activations[1]
[[ 0.33579893]]
Activations[2]
[[ 0.37944698]
 [ 0.45005939]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<snip>/neural-networks-and-deep-learning/src/network.py", line 117, in backprop
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
ValueError: shapes (2,2) and (1,1) not aligned: 2 (dim 1) != 1 (dim 0)
```
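A minimal sketch that isolates the shapes in that error, outside the `Network` class. It reuses the output activations printed above and the `[3, 4]` target from the call; the names `output`, `y_flat`, and `y_col` are made up for illustration. It only shows how NumPy broadcasting treats a flat list versus a column vector, which may be where the `(2,2)` in the error message comes from.

```python
import numpy as np

# Output-layer activations as printed in the debug output: shape (2, 1).
output = np.array([[0.37944698], [0.45005939]])

y_flat = [3, 4]                        # plain list, treated as shape (2,)
y_col = np.array([[3.0], [4.0]])       # column vector, shape (2, 1)

# Subtracting a (2,) list from a (2, 1) array broadcasts to (2, 2).
diff_flat = output - y_flat
# Subtracting a (2, 1) column keeps the (2, 1) shape.
diff_col = output - y_col

print(diff_flat.shape, diff_col.shape)
```

If the inputs and targets are expected to be column vectors, a flat list can be converted with something like `np.array([3, 4]).reshape(-1, 1)`.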

`activations` looks exactly as I’d expect: 2 activations, then 1, then 2. The failure is on the line I’m unclear about, and it fails the way I’d expect. But presumably the code in this book is tested (the book is excellent), so I must be doing something wrong. I was writing an independent implementation and hit the same issue, so I was expecting to be able to take this code apart to figure it out, but I can’t work out how this is supposed to work, or why it works for the author.

I’d appreciate any insight on what I’m missing here. Thanks!