Error in matrix calculus for DL paper?

louis · June 10, 2020, 7:31pm

Hello,

I am going through the Jeremy’s paper about the algebra behind back propagation and I am struggling to get his point on the derivation of element-wise operations. I assume I am missing something, rather than the paper being confused, but I cannot put my hand on what.

So in the paragraph “Derivatives of vector element-wise binary operators”, once the jacobian (J) of the element wise operation of 2 vectors F and G, we look for a condition where the J is diagonal.
The argument comes as:

" We know that element-wise operations imply that fi is purely a function of wi and gi is purely a function of xi. For example, sums . Consequently, reduces to "

What puzzles me is that actually here the element-wise operation is made between F and G, not between W and X which are the parameters of F and G. So the element wise operation between F and G does not imply at all that fi depends only on xi. if Y = F O G then yi depends only on fi and gi ok. But that can not be repercuted to F or G’s paremeters.
Actually seems to be an abuse of notation since the element-wise operator involves two scalars. Those scalars can of course depend on the vectors W and X as long as we respect the fact the element wise operation between F and G involves only Fi with Gi.

Finally, the fact that we ask under which condition do we have a diagonal Jacobian matrix, and that “by definition of the elemnt_wise operation” we actually have directly this diagonal, means that actually there are no condition needed. So the logic is weird.

Has anyone understood what this is supposed to mean ?

echon · June 24, 2020, 4:48am

The wording is a bit muddled. It is expressed more clearly below, in the reference (section 8.2):

Given the constraint (element-wise diagonal condition) that fi(w) and gi(x) access at most wi and
xi, respectively, the Jacobian simplifies to a diagonal matrix

That is, if we assume that fi only depends on wi and gi only depends on xi for all i then
J is diagonal. It is an assumption, a condition on every fi and gi that is sufficient for J to be diagonal. As you mention, it is not necessary at all that this is the case in general and this assumption is absolutely independent of the fact that O is an element-wise operator.
If this “element-wise diagonal condition” does not hold, there’s no guarantee that J will be diagonal [in fact, it can’t be, the condition is also necessary], even if O itself is a binary element-wise operator.
Another way of thinking about this is that for the Jacobian of F O G = f(w) O g(x) to be diagonal, not only must O be a [binary] element-wise operator but f and g must also be element-wise operators (unitary in this case, rather than binary).
An example of a case in which the above condition holds is when fi simply returns wi and gi simply returns xi, which is in fact what most (all?) subsequent examples in the paper use. In that case F O G is simply equal to w O x.
Another example may be that fi(w) returns -wi and gi(x) returns xi/3, that is if F O G = -w O x/3, or if fi(w) = sqrt(wi) and gi(x) = exp(xi), F O G = sqrt(w) O exp(x)
One more example is when both functions simply return a constant in all cases [Notice that the condition is that they access at most wi/xi, they may not use any valuel], e.g. F O G = 3 O 1. In this case, in which both F and G are constant, J would be the zero matrix, which is diagonal. Any combination of the above cases would also work.

A case in which the condition would not hold is if, e.g., fi(w) = w7 + wi and/or gi(x) = x(i+3), or fi(w) = sum across w, etc. In those cases J is not diagonal.
Cases such as the latter are possible in principle, which is why fi(w), gi(x) are written like that. It is not an abuse of notation, it simply denotes the general case, in which f and/or g are not element-wise.