Backpropagation Explained using English Words*

*Most words are in English.

Important

This guide assumes a basic understanding of derivatives and matrices.


You can view a more neatly formatted version of this post on my blog.

Backpropagation sounds and looks daunting. It doesn’t need to be. In fact, backpropagation is really just a fancy word for the chain rule. Implementing a backpropagation algorithm is simply implementing one big fat chain rule equation.

Let’s remind ourselves of the chain rule. The chain rule lets us figure out how much a given variable indirectly changes with respect to another variable. Take the example below.

\begin{align} y &= 3u \\ u &= 7 + x^2 \end{align}

We want to figure out how much y changes with each increment in x. The problem is that x doesn’t directly change y. Rather, x changes u, which in turn changes y.

The chain rule allows us to solve this problem. In this case, the chain rule tells us that we can figure out how much x indirectly changes y by multiplying the derivative of y with respect to u by the derivative of u with respect to x.

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
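Working the example through: the derivative of y with respect to u is 3, and the derivative of u with respect to x is 2x, so

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 3 \cdot 2x = 6x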

Aaand I’ve just described backpropagation in a nutshell. That’s all there really is to it. The only difference is that in a neural network there are many more intermediate variables and functions, and that we want to find out how the weights indirectly change the loss.

Let’s see this tangibly in action.

We have the following neural network comprised of two layers: the first layer contains the affine function[1] together with the ReLU, while the second layer contains only the affine function. The loss, which is MSE (Mean Squared Error), will then be calculated from the output of the second layer.

[Diagram: the two-layer network described above, with the MSE loss computed from the output of the second layer]

Mathematically speaking, the first layer with a single sample x looks like this.

\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1)

The second layer looks like this.

\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2

And the loss function looks like this.

\frac{(y - (\text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{2}

Note

MSE in its most basic form looks like this.

\text{MSE} = \frac{(y - x)^2}{1}

If we have multiple data points, then it looks like this.

\text{MSE} = \frac{(y_1 - x_1)^2+(y_2 - x_2)^2+(y_3 - x_3)^2}{3}
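For example, with targets y = (3, 5, 7) and predictions x = (2, 5, 9) (values picked purely for illustration), we get

\text{MSE} = \frac{(3-2)^2 + (5-5)^2 + (7-9)^2}{3} = \frac{1 + 0 + 4}{3} = \frac{5}{3}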

However, when working with multiple samples, the mean squared error comes out looking like this, where N represents the total number of samples.

\frac{(\vec{\rm{y}}_1 - (\text{max}(0, \vec{\rm{x}}_1 \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 + (\vec{\rm{y}}_2 - (\text{max}(0, \vec{\rm{x}}_2 \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 + \cdots + (\vec{\rm{y}}_N - (\text{max}(0, \vec{\rm{x}}_N \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{N}

Or more simply…[2]

\frac{\sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2}{N}

…or even more simply.

\frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2

Our goal for the rest of this guide is to derive the gradients of \vec{\rm{w}}_1.

The equation above looks quite the mouthful though. One might even say scary. How would you even apply the chain rule here? How would you use the chain rule to derive the gradients of the weights and biases?

Let’s simplify things by introducing a bunch of intermediate variables. We’ll begin by substituting the innermost pieces of the equation, and then gradually make our way out.

\begin{align} u_1 &= \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 \\ u_2 &= \text{max}(0, u_1) \\ u_3 &= u_2 \cdot \vec{\rm{w}}_2 + b_2 \\ u_4 &= \vec{\rm{y}}_i - u_3 \\ u_5 &= u_4^2 \end{align}

The menacing equation above now gradually simplifies into the cute equation below.

\begin{align} \text{MSE} &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, u_1) \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (u_2 \cdot \vec{\rm{w}}_2 + b_2))^2 \\ &= \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - u_3)^2 \\ &= \frac{1}{N} \sum^N_{i=1} (u_4)^2 \\ &= \frac{1}{N} \sum^N_{i=1} u_5 \end{align}

Very cute, hey?

In this cuter version of the equation, we can see that incrementing \vec{\rm{w}}_1 does not directly change the MSE. Rather, incrementing \vec{\rm{w}}_1 changes u_1, which changes u_2, which changes u_3, which changes u_4, which in turn changes u_5.

\frac{\partial}{\partial \vec{\rm{w}}_1} \text{MSE} = \frac{\partial}{\partial \vec{\rm{w}}_1} \frac{1}{N} \sum^N_{i=1} u_5 = \frac{1}{N} \sum^N_{i=1} \frac{\partial u_5}{\partial \vec{\rm{w}}_1} = \frac{1}{N} \sum^N_{i=1} \frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1} \cdot \frac{\partial u_1}{\partial \vec{\rm{w}}_1}

See? Just a big, fat, and simple chain rule problem.
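If it helps to see those substitutions as code, here is a sketch of the forward pass with every intermediate stored. The shapes, values, and tensor names below are my own stand-ins, not something from this guide.

import torch

# Dummy shapes purely for illustration: 5 samples, 3 input features,
# 4 hidden units, 1 output.
x, y = torch.randn(5, 3), torch.randn(5, 1)
w1, b1 = torch.randn(3, 4), torch.zeros(4)
w2, b2 = torch.randn(4, 1), torch.zeros(1)

u1 = x @ w1 + b1             # affine function of the first layer
u2 = torch.clamp(u1, min=0)  # ReLU, i.e. max(0, u1)
u3 = u2 @ w2 + b2            # affine function of the second layer
u4 = y - u3                  # error
u5 = u4 ** 2                 # squared error
mse = u5.mean()              # average over the N samples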

Note

\partial is a curly “d” and can be read as “curly d”, or simply as “d”. \partial notation is used here because we are taking partial derivatives. We will not go into this concept here; however, this is a great brief rundown on partial derivatives.

Now we can tackle finding the gradients for \vec{\rm{w}}_1. To do so, let’s find the gradients of each intermediate variable.[3] [4]

\begin{alignat}{3} \text{gradient of } u_4 &= \frac{\partial u_5}{\partial u_4} &&= \frac{\partial}{\partial u_4} u_4^2 &&&= 2u_4 \\ \text{gradient of } u_3 &= \frac{\partial u_4}{\partial u_3} &&= \frac{\partial}{\partial u_3} \vec{\rm{y}}_i - u_3 &&&= -1 \\ \text{gradient of } u_2 &= \frac{\partial u_3}{\partial u_2} &&= \frac{\partial}{\partial u_2} u_2 \cdot \vec{\rm{w}}_2 + b_2 &&&= \vec{\rm{w}}^T_2 \\ \text{gradient of } u_1 &= \frac{\partial u_2}{\partial u_1} &&= \frac{\partial}{\partial u_1} \text{max}(0, u_1) &&&= \begin{cases} 0 & u_1 ≤ 0 \\ 1 & u_1 > 0 \end{cases} \\ \text{gradient of } \vec{\rm{w}}_1 &= \frac{\partial u_1}{\partial \vec{\rm{w}}_1} &&= \frac{\partial}{\partial \vec{\rm{w}}_1} \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 &&&= \vec{\rm{x}}^T_i \end{alignat}

Now we multiply everything together.

\frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{\partial u_5}{\partial \vec{\rm{w}}_1} = \frac{1}{N} \sum^N_{i=1} (2u_4) \cdot (-1) \cdot \left(\vec{\rm{w}}^T_2\right) \cdot \left(\begin{cases} 0 & u_1 ≤ 0 \\ 1 & u_1 > 0 \end{cases}\right) \cdot \left(\vec{\rm{x}}^T_i\right)

And it all eventually expands out to the following.

Expansion

\begin{align} \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{\partial u_5}{\partial \vec{\rm{w}}_1} &= \frac{1}{N}\sum^N_{i=1} (2u_4) \cdot (-1) \cdot (\vec{\rm{w}}^T_2) \cdot \begin{cases}0 & u_1 ≤ 0 \\ 1 & u_1 > 0\end{cases} \cdot \vec{\rm{x}}_i^T \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2u_4 \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - u_3) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (u_2 \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & u_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (\text{max}(0, u_1) \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & u_1 > 0\end{cases} \\ &= \frac{1}{N}\sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}_2^T \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0\end{cases} \end{align}
\frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{1}{N} \sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ -2(\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2)) \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases}

We can simplify further by factoring out the -1 and the 2: the 2 moves out in front of the sum, and the -1 flips the order of the subtraction inside the parentheses.

\frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{2}{N} \sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases}

We can simplify even further, by letting e_i = \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i. The e stands for “error”.

\frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_1} = \frac{2}{N} \sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ e_i \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{x}}_i^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases}

And there you go! We’ve derived the formula that will allow us to calculate the gradients of \vec{\rm{w}}_1.
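If you would like to convince yourself that the formula is correct, here is a quick sanity check against a finite-difference approximation (the check, and all the values in it, are my own additions rather than part of the derivation). With a single sample (N = 1) and scalar stand-ins for x, y, w_1, b_1, w_2, and b_2, the derived gradient reduces to 2 \cdot (\text{max}(0, xw_1 + b_1) \cdot w_2 + b_2 - y) \cdot w_2 \cdot x whenever xw_1 + b_1 > 0.

x, y = 1.5, 4.0
w1, b1, w2, b2 = 0.8, 0.1, -1.2, 0.3   # chosen so that x*w1 + b1 > 0

def loss(w1):
    # MSE with a single sample, matching the loss defined above.
    return (y - (max(0.0, x * w1 + b1) * w2 + b2)) ** 2

analytic = 2 * (max(0.0, x * w1 + b1) * w2 + b2 - y) * w2 * x

eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print(analytic, numeric)  # the two values should agree to several decimal places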


When implementing backpropagation in a program, it is often better to implement the equation in pieces, as opposed to a single line of code, by storing the result of each intermediate gradient. These intermediate gradients can then be reused to calculate the gradients of another variable, such as the bias b_1.

Instead of implementing the following in a single line of code…

\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1} \cdot \frac{\partial u_1}{\partial \vec{w}_1}

…we can first calculate the gradients of u_4.

u_{4_g} = \frac{\partial u_5}{\partial u_4}

Then calculate the gradients of u_3 and multiply them with the gradients of u_4.

u_{3_g} = u_{4_g} \cdot \frac{\partial u_4}{\partial u_3} = \left(\frac{\partial u_5}{\partial u_4}\right) \cdot \frac{\partial u_4}{\partial u_3}

Then multiply the product above with the gradients of u_2.

u_{2_g} = u_{3_g} \cdot \frac{\partial u_3}{\partial u_2} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3}\right) \cdot \frac{\partial u_3}{\partial u_2}

Then multiply the product above with the gradients of u_1.

u_{1_g} = u_{2_g} \cdot \frac{\partial u_2}{\partial u_1} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2}\right) \cdot \frac{\partial u_2}{\partial u_1}

And finally multiply the product above with the gradients of \vec{\rm{w}}_1.

\vec{\rm{w}}_{1_g} = u_{1_g} \cdot \frac{\partial u_1}{\partial \vec{w}_1} = \left(\frac{\partial u_5}{\partial u_4} \cdot \frac{\partial u_4}{\partial u_3} \cdot \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1}\right) \cdot \frac{\partial u_1}{\partial \vec{w}_1}

Let’s see this using Python instead.

The following is our neural network, where trn_x holds the training inputs and trn_y holds the corresponding targets.

l1 = relu(lin(trn_x, w1, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y)

Line 1
l1 is the first layer, \text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1).

Line 2
l2 is the second layer, \text{max}(0, x \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 = l_1 \cdot \vec{\rm{w}}_2 + b_2.

Line 3
loss is the MSE, \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2))^2 = \frac{1}{N} \sum^N_{i=1} (\vec{\rm{y}}_i - l_2)^2.
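The helpers lin, relu, and mse are not defined in the snippet above; their exact definitions are my assumption, but a minimal sketch might look like this.

def lin(x, w, b):
    # Affine function: x·w + b
    return x @ w + b

def relu(x):
    # Elementwise max(0, x)
    return x.clamp_min(0.)

def mse(pred, targ):
    # Mean squared error: (1/N) * sum((y - pred)^2)
    return ((targ - pred) ** 2).mean()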

Note

Substitutions

\begin{align} u_1 &= \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 \\ u_2 &= \text{max}(0, u_1) \\ u_3 &= u_2 \cdot \vec{\rm{w}}_2 + b_2 \\ u_4 &= \vec{\rm{y}}_i - u_3 \\ u_5 &= u_4^2 \end{align}

Derivatives of the Substitutions

\begin{alignat}{3} \text{gradient of } u_4 &= \frac{\partial u_5}{\partial u_4} &&= \frac{\partial}{\partial u_4} u_4^2 &&&= 2u_4 \\ \text{gradient of } u_3 &= \frac{\partial u_4}{\partial u_3} &&= \frac{\partial}{\partial u_3} \vec{\rm{y}}_i - u_3 &&&= -1 \\ \text{gradient of } u_2 &= \frac{\partial u_3}{\partial u_2} &&= \frac{\partial}{\partial u_2} u_2 \cdot \vec{\rm{w}}_2 + b_2 &&&= \vec{\rm{w}}^T_2 \\ \text{gradient of } u_1 &= \frac{\partial u_2}{\partial u_1} &&= \frac{\partial}{\partial u_1} \text{max}(0, u_1) &&&= \begin{cases} 0 & u_1 ≤ 0 \\ 1 & u_1 > 0 \end{cases} \\ \text{gradient of } \vec{\rm{w}}_1 &= \frac{\partial u_1}{\partial \vec{\rm{w}}_1} &&= \frac{\partial}{\partial \vec{\rm{w}}_1} \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 &&&= \vec{\rm{x}}^T_i \end{alignat}

First we need to calculate the gradients of u_4.

diff = (trn_y - l2)
loss.g = (2/trn_x.shape[0]) * diff

Line 2
trn_x.shape[0], in this case, returns the total number of samples.

Next are the gradients of u_3.

diff.g = loss.g * -1

Then the gradients of u_2.

l2.g = diff.g @ w2.T

Then the gradients of u_1.

l1.g = l2.g * (l1 > 0).float()

And finally the gradients of \vec{\rm{w}}_1.

w1.g = (l1.g * trn_x).sum()

The equation for the gradient of b_1 is almost the same as the equation for the gradients of \vec{\rm{w}}_1, save for the last line, where we do not have to matrix multiply with \vec{\rm{x}}_i. Therefore, we can reuse all previous gradient calculations to find the gradient of b_1.

b1.g = (l1.g * 1).sum()

Important

When multiplying various tensors together, make sure their shapes are compatible. Shape manipulations have been omitted above for simplicity.
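One way to double-check the hand-derived gradients (this check is my own addition, not part of the original walkthrough) is to redo the forward pass with PyTorch’s autograd enabled and compare its gradients against w1.g and b1.g.

import torch

# Clone the weights with autograd enabled; the "a"-suffixed names are my own.
w1a = w1.clone().requires_grad_(True)
b1a = b1.clone().requires_grad_(True)

l1a = relu(lin(trn_x, w1a, b1a))
l2a = lin(l1a, w2, b2)
loss_a = mse(l2a, trn_y)
loss_a.backward()

# With the shape handling that was omitted above in place, these should print True.
print(torch.allclose(w1a.grad, w1.g))
print(torch.allclose(b1a.grad, b1.g))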

Conclusion

And that’s all there really is to backpropagation; think of it as one big chain rule problem.

To make sure you’ve got it hammered down, get out a pen and paper and derive the equations that would compute the gradients of the MSE with respect to \vec{\rm{x}}_i, b_1, \vec{\rm{w}}_2, and b_2.

And if you really want to hammer down your understanding of what’s happening, then I highly recommend reading The Matrix Calculus You Need For Deep Learning. I’ve also compiled backpropagation practice questions from this paper!

Answers

\begin{align} \frac{\partial \text{MSE}}{\partial b_1} &= \frac{2}{N} \sum^N_{i=1} \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}_2^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \\ \frac{\partial \text{MSE}}{\partial \vec{\rm{x}}_i} &= \begin{cases} 0 & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 ≤ 0 \\ \frac{2}{N} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \vec{\rm{w}}^T_2 \cdot \vec{\rm{w}}_1^T & \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1 > 0 \end{cases} \\ \frac{\partial \text{MSE}}{\partial \vec{\rm{w}}_2} &= \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \cdot \text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1)^T \\ \frac{\partial \text{MSE}}{\partial b_2} &= \frac{2}{N} \sum^N_{i=1} (\text{max}(0, \vec{\rm{x}}_i \cdot \vec{\rm{w}}_1 + b_1) \cdot \vec{\rm{w}}_2 + b_2 - \vec{\rm{y}}_i) \end{align}

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do let me know!


  1. Affine function is a fancy name for a linear function with a bias (constant) term added ↩︎

  2. \sum is known as the summation or sigma operator. If we have the equation \sum^4_{x=1}2x, it means sum the equation 2x for all values of x from 1 to 4. Find out more here. ↩︎

  3. If needed, get a refresher of the derivative rules here. ↩︎

  4. \begin{cases} \end{cases} denotes a piecewise function. The simplest piecewise function returns one calculation if a condition is met, and another calculation if the condition is not met. It can be thought of as an if-else statement in programming. Find out more here. ↩︎


Amazing blog post. Thank you so much!!!
Just one question, please: I don’t understand why we use sum() to compute w1.g and b1.g but not when calculating the other gradients?


Thanks so much for writing and sharing such an elegant explanation! I’ll reference this for sure when I get to the backpropagation lesson in Part 2 of the course.

One question: why do you transpose the matrix after taking the derivative (for example in gradient of u_2)? Is that just part of taking derivatives involving matrices?


:smiley:

sum() is used when computing w1.g and b1.g since we want to matrix multiply.

Matrix multiplication involves taking the dot product of each row in the first matrix with each column in the second matrix (great guide here). The dot product itself involves multiplying each corresponding element in a row and column together, and summing all products.

Mathematically, the dot product looks like this, where \vec{\rm{a}} and \vec{\rm{b}} are vectors, N is the total number of elements in a vector, and i accesses the corresponding element in both vectors.

\sum^N_{i=1} \vec{\rm{a}}_i\vec{\rm{b}}_i

Programmatically, we can write that like this.
w1.g = (l1.g * trn_x).sum()

While PyTorch does have the built-in matrix multiplication operator @, which can be used like this…
w1.g = l1.g @ trn_x
…it does not give us the flexibility to multiply tensors of different shapes and to sum over a dimension of our choosing, which is exactly what broadcasting together with an explicit sum() gives us.
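To make the elementwise-multiply-then-sum idea concrete, here is a toy example with two small vectors (the numbers are my own).

import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])

print((a * b).sum())  # tensor(32.): elementwise multiply, then sum
print(a @ b)          # tensor(32.): the same dot product via @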


Glad to know! :smiley:

Yes, it’s part of taking derivatives involving matrix multiplication.

If we have the following equation involving scalars, y = xw + b, then \frac{dy}{dx} = w.

If we instead have the following equation using matrices instead, y = \vec{\rm{x}} \cdot \vec{\rm{w}} + b, the derivative will still be \vec{\rm{w}}, except transposed: \frac{dy}{d\vec{\rm{x}}} = \vec{\rm{w}}^T.

When you solve for the derivative of the matrix version of the equation by hand, you’ll find that the answer ends up being \vec{\rm{w}}, but transposed.

If you want to learn more about doing it by hand, I highly recommend going through The Matrix Calculus You Need For Deep Learning paper — it may seem overwhelming, but it really isn’t; just a little bit tedious, that’s all. If you want a reference on the various derivatives of matrices, view this section of the same paper.

That said, I think knowing that the derivative of a matrix multiplication results in the same answer, but transposed, is the only identity you really need to know.
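A quick way to see this with autograd, if you want to try it yourself (the shapes and values here are my own, not from the discussion above):

import torch

x = torch.randn(3, 2, requires_grad=True)
w = torch.randn(2, 4)
y = x @ w

upstream = torch.randn(3, 4)   # pretend this gradient arrived from the loss
y.backward(upstream)

# The gradient that flows back to x is the upstream gradient times w transposed.
print(torch.allclose(x.grad, upstream @ w.T))  # True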


Wonderful, thanks! Will def go through that paper.


Thank you for the explanation; my bad, I thought that l1.g * trn_x was already a matrix multiplication.

l1.g * trn_x is elementwise multiplication. :slightly_smiling_face:

Now that you mention this, the dot product can more simply be described as taking the elementwise product between two vectors, and then summing the products. :thinking:


I guess that would be true but only if the two matrices are of the same shape (to make the element-wise product possible).

Yes. The same condition holds for the dot product too so it works out.
