Preview release: "Matrix Calculus for Deep Learning"

It is quite funny because I read it not too long ago and it seemed fine to me… I read your comment now and yeah, this is an obvious mistake…

But! I think that here we are just looking at the vector sum reduction and y no longer equals sum(zx). It is just to demonstrate the principle of what happens to the scalar value.

If y = sum(x<vec> * z<scalar>), then the partial derivative with respect to z is going to be mhmm … the sum of the elements of x, sum(x_i).

But as here y = sum(x<vec> + z<scalar>), we just get n.
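For what it's worth, here is the algebra for the two cases as I read them (just a sketch of my own, not a quote from the paper):

$$
y = \sum_{i=1}^{n} x_i z \;\Rightarrow\; \frac{\partial y}{\partial z} = \sum_{i=1}^{n} x_i,
\qquad\qquad
y = \sum_{i=1}^{n} (x_i + z) \;\Rightarrow\; \frac{\partial y}{\partial z} = \sum_{i=1}^{n} 1 = n.
$$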

Thanks a ton @parrt and @jeremy! This is a great resource, all the moving parts in a single page! I spent the weekend poring over this document and I thought I'd add my two cents here. Overall I found the document very, very useful. It took a couple of iterations to understand how the Jacobians and partials are derived for different kinds of functions. The effort does pay off well in the later part. The application of these rules to the activation and loss functions of a neuron is straightforward.

  • Paying attention to notation is quite important, especially if one is used to glossing over material. It took a while to recognize which values are scalars (x in italics) and which are vectors (x in bold). f(x) is different from f(**x**), and from **f**(**x**). The gradient vector (of partials) is arranged horizontally, but otherwise vectors are pretty much arranged tall. The term w·x is a dot product. These are all well explained in the document, but I still tripped on a couple of them in the first iteration.

  • The foreshadowing of how the concepts will be used in the later part was very helpful. The two main assumptions were the element-wise diagonal property, and the fact that the number of scalar-valued functions m equals the number of elements n in x (see the sketch just after this list). These assumptions may seem a tad arbitrary, until you keep in mind how they relate to the activation (and loss) function of a neuron.
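To make those two assumptions concrete, this is the shape the Jacobian takes under them (my own sketch, assuming each f_i depends only on x_i and m = n):

$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1(x_1)}{\partial x_1} & 0 & \cdots & 0 \\
0 & \frac{\partial f_2(x_2)}{\partial x_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\partial f_n(x_n)}{\partial x_n}
\end{bmatrix}
= \operatorname{diag}\!\left(\frac{\partial f_1(x_1)}{\partial x_1}, \ldots, \frac{\partial f_n(x_n)}{\partial x_n}\right)
$$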

Anyway, I've tried to document my notes (in plain English) as I read this paper, and I've shared them here in case they are helpful. Do let me know if there are any corrections.


I just wanted to report that I went through the paper :slight_smile: First read is done and it surprisingly already makes a lot of sense :slight_smile:

Now I will continue to read it, as probably only now do I understand the notation somewhat better, and it is easier to see where the parts fit when you have an idea of the whole :slight_smile:

Just wanted to say thank you very much again for putting this together and making it available to us!!! :beers:


@radek, Awesome, it does get better every time you go through it!

Revisiting the part 1 lectures after this will ground many assumptions better. I used to wonder why the w's for CNN layers were arranged as a tall vector in the math, when logically we thought of them as 256x256 etc.

yeah, the bold doesn’t always look bold enough to stand out but…

I’m still confused by notation. In the paper it says each fi function within f returns a scalar. So:

f = [f1(x), f2(x), … fm(x)], where x is [x1, x2, …xn]

So here it suggests that f1 is a function of all of the parameters in x… i.e. [x1, x2, …xn]

However, in @beecoder 's post it says that “the ith scalar function in f(x) is a function of (only) the ith term in vector x”.

This suggests that f1 is a function of only the first parameter (x1) in vector x.

What am I missing?

EDIT: I think I figured it out. For element-wise operations, it only makes sense if f1 only applies to x1, f2 to x2, and so on.

Usually, f = [f1(x), f2(x), … fm(x)], where x is [x1, x2, …xn]. Here, f1 is a function of [x1, x2, …xn].

However, if we want to do element-wise operations of f with g = [g1(w), g2(w), … gm(w)], we need functions that are only sensitive to a single element of their respective vectors. For element-wise, it doesn’t make any sense to consider f1([x1, x2, … xn]), since that would not be element-wise! Instead for element-wise, f1 is only defined for x1, and so on.

Therefore, the equation y1 = f1(x) ○ g1(w) is the same as y1 = f1(x1) ○ g1(w1).
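A quick numerical way to see the resulting diagonal structure (a rough sketch of my own, not code from the paper), using an element-wise product and a finite-difference Jacobian:

```python
import numpy as np

# Sketch: for an element-wise operation y = f(w) * g(x) with f_i(w) = w_i and
# g_i(x) = x_i, the Jacobian with respect to w should be diagonal, i.e.
# dy_i/dw_j = 0 whenever i != j.

def y(w, x):
    return w * x  # element-wise: y_i = f_i(w_i) * g_i(x_i)

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
eps = 1e-6

n = len(w)
J_w = np.zeros((n, n))
for j in range(n):
    dw = np.zeros(n)
    dw[j] = eps
    # central-difference approximation of the j-th column of the Jacobian
    J_w[:, j] = (y(w + dw, x) - y(w - dw, x)) / (2 * eps)

print(np.round(J_w, 6))  # ~ diag(x) = diag(4, 5, 6); off-diagonal entries are 0
```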

Yes you did! :slight_smile:

@DavidBressler, seeing this now. Yes, that's right, the description of f was more general in the beginning, but later the element-wise diagonal restriction was introduced. This is good enough for modeling the neuron and also makes the math easier. This is how I've understood it…

Jeremy and Terence – thanks for this excellent and useful article, which provides a clear and logical foundation for the mathematics of vector differentiation.


Hi @jeremy @parrt, first of all thanks for your article. I'm reading it and have a couple of questions about an assumption you made in the 5th paragraph of the "Derivatives of vector element-wise binary operators" section. You said that 0 ○ 0 = 0 no matter what ○ is. I imagine you are implicitly restricting the kind of binary operation, or maybe I'm missing something. For example, ○: x, y → x + y + 1 has 0 ○ 0 = 1.

Hello! If that is a zero, then 0 is a constant and won't change, so there is no derivative.

Thanks for your answer… probably I should re-read slowly to better understand the context… I'm reading "Regardless of the operator … 0 ○ 0 = 0 no matter what" as "for each ○, if ○ is a binary operator then 0 ○ 0 = 0", and I do not see the connection with the derivative there… I mean, the truth of the statement seems to me independent of the derivative.

Ok, I just looked at it. Basically we are saying that if the partial derivatives go to zero, then clearly any operator such as 0 − 0 will give you another 0, right? Anything times 0 is zero… etc.

What about the operator ○: x, y → x + y + 1?

The partial derivative of x is not zero, so that statement does not apply.

Ok, now I see… what was confusing me is that the statement "if those partial derivatives go to zero" seemed to me to refer to the derivatives with respect to f and g rather than to the operator, since the last statement was referring to those two partial derivatives… thanks for helping with the clarification.
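For anyone else who trips on the same sentence, here is how I ended up reading it (my own sketch, using the ○ = × case and the x + y + 1 example above): the claim is about the partial derivatives that appear in the off-diagonal entries of the Jacobian, not about the value of 0 ○ 0 itself. Under the element-wise diagonal condition, for $i \neq j$:

$$
\frac{\partial}{\partial w_j}\bigl(f_i(\mathbf{w}) \times g_i(\mathbf{x})\bigr)
= \frac{\partial f_i}{\partial w_j}\, g_i(\mathbf{x}) + f_i(\mathbf{w})\, \frac{\partial g_i}{\partial w_j}
= 0 \cdot g_i(\mathbf{x}) + f_i(\mathbf{w}) \cdot 0 = 0,
$$

and even for $x \circ y = x + y + 1$ the constant disappears under differentiation, so the off-diagonal partial is still $0 + 0 = 0$.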

Jeremy and Terence, thanks a lot for a wonderful, lucid article. It was my 1st time reading a paper and it was a good experience. There is one point that I would like to clarify:

In the section: “The gradient of the neural network loss function”, in the formula for C,
We have used activation(x_i) = max(0, w·x_i + b) and then eventually found dC/dw. As per the notation, w is a vector. So here, does the neural network we are dealing with have only one layer?

{ I presume so, because otherwise the calculations might become lengthy (and here in the paper we are trying to generalize it), and also because x_i would itself be a function of some other terms (like the weights in previous layers), yet we deal with it as a constant. }
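For what it's worth, here is a minimal single-layer sketch of how I read that section (my own code, assuming the squared-error cost used there, not code from the paper): one neuron with activation(x_i) = max(0, w·x_i + b), and a finite-difference check of dC/dw against the hand-derived gradient.

```python
import numpy as np

# Rough sketch (my own reading of that section, not the paper's code):
# one neuron with activation(x_i) = max(0, w.x_i + b), squared-error cost C,
# and a finite-difference check of dC/dw against the hand-derived gradient.

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.normal(size=(N, d))   # the inputs x_i, one per row
y = rng.normal(size=N)        # the targets y_i
w = rng.normal(size=d)
b = 0.1

def cost(w, b):
    a = np.maximum(0.0, X @ w + b)   # ReLU activation for every example
    return np.mean((y - a) ** 2)     # squared-error cost C(w, b)

# Hand-derived gradient:
# dC/dw = (2/N) * sum_i (a_i - y_i) * 1[w.x_i + b > 0] * x_i
z = X @ w + b
a = np.maximum(0.0, z)
grad_w = (2 / N) * ((a - y) * (z > 0)) @ X

# Finite-difference check of the same gradient
eps = 1e-6
fd = np.array([(cost(w + eps * e, b) - cost(w - eps * e, b)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(grad_w, fd, atol=1e-5))  # expect True (away from the ReLU kink)
```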


@jeremy @parrt Awesome paper! I'm now working through the 4th section and I'm a bit confused by the Jacobian matrices. Initially, it is said that the Jacobian matrix is a matrix where each row is the gradient of a function. The gradient of a function is defined as a vector that contains the partial derivatives with respect to each variable of the function.

However, in section 4.2, in the Jacobian matrix J_w it looks like each row contains partial derivatives for the \textbf{w} parameter only, whereas it is also supposed to contain partial derivatives for the \textbf{x} parameter, since the function we are taking a gradient of is y_i = f_i(\textbf{w}) \bigcirc g_i(\textbf{x}), so a gradient would have to contain partial derivatives with respect to all of the (\textbf{w}, \textbf{x}) variables and not just \textbf{w}. Now, I think that's what the subscript w in J_w means, however my question is: doesn't it contradict the definition of a gradient, and by extension of the Jacobian matrix too?
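For reference, this is how I picture the two matrices (my own sketch of the notation, not a quote from the paper): $J_{\mathbf{w}}$ collects only the partials with respect to $\mathbf{w}$, $J_{\mathbf{x}}$ only those with respect to $\mathbf{x}$, and a "full" Jacobian with respect to both vectors stacked together would just be the two blocks side by side:

$$
J_{\mathbf{w}} = \frac{\partial \mathbf{y}}{\partial \mathbf{w}} =
\begin{bmatrix}
\frac{\partial y_1}{\partial w_1} & \cdots & \frac{\partial y_1}{\partial w_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial w_1} & \cdots & \frac{\partial y_m}{\partial w_n}
\end{bmatrix},
\qquad
\frac{\partial \mathbf{y}}{\partial (\mathbf{w}, \mathbf{x})} =
\bigl[\, J_{\mathbf{w}} \;\; J_{\mathbf{x}} \,\bigr].
$$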

Finished the article as well – it has been super valuable. It's a pity the article only explains the weights of a single neuron, i.e. a vector, and doesn't cover the derivative for multiple neurons, where the weights form a matrix. I guess I should read the linked paper to understand how this can be calculated: https://atmos.washington.edu/~dennis/MatrixCalculus.pdf

The paper is just great and starts exactly where I needed it to start.
This video may be helpful to people in combination with the paper:

Now what I didn't get is why g(x) pops up here:


It's the first time a second function in addition to f(x) is used in the paper. Why do we need g(x), and why is it added at this particular point?