I understand that we need non-linear functions to represent a wider range of functions, since a composition of linear functions is itself a linear function. That is why we put activation functions such as ReLU between the matrix multiplications. My question is: do we need these activations after each and every layer? If so, why?

For example, what would happen if we have ReLU every 2 layers in a 10-layer network instead of every layer? It would still be non-linear. But I would think that it won’t have the same representation power as having a ReLU in every layer. Consequently, wouldn’t this serve as some kind of regularizer by reducing overfitting (as the representation power is reduced)?

I think the layer without the ReLU would be somewhat redundant, because a linear combination of a linear combination is still a linear combination, so the NN would treat the two layers as one, as follows:

layer 1: W1X1+b1

layer 2: W2X2+b2

If we pass the output of layer 1 into layer 2 without a ReLU (so X2 = W1X1+b1), we get

W2(W1X1+b1)+b2

=W2W1X1 + (W2b1+b2)

=W3X1 + Constant (if we let W2W1 be W3 and W2b1+b2 be the constant)

So the NN would effectively just learn W3 and the constant, treating the two layers as one.
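To make the algebra above concrete, here is a minimal NumPy sketch. The layer sizes, the random weights, and the function names are made up for illustration; they are not from any particular network. It checks that two stacked linear layers give the same output as the single collapsed layer, and that this stops being true once a ReLU sits between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs -> 5 hidden units -> 3 outputs, batch of 8.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
X = rng.normal(size=(8, 4))  # one sample per row


def two_linear_layers(x):
    """Layer 1 followed by layer 2, with no activation in between."""
    h = x @ W1.T + b1
    return h @ W2.T + b2


# Collapsed layer: W3 = W2 W1, constant = W2 b1 + b2.
W3 = W2 @ W1
b3 = W2 @ b1 + b2


def one_linear_layer(x):
    """The single layer that the two linear layers reduce to."""
    return x @ W3.T + b3


def two_layers_with_relu(x):
    """Same weights, but with a ReLU between the layers."""
    h = np.maximum(x @ W1.T + b1, 0.0)
    return h @ W2.T + b2


same_without_relu = np.allclose(two_linear_layers(X), one_linear_layer(X))
same_with_relu = np.allclose(two_layers_with_relu(X), one_linear_layer(X))
print("no ReLU, collapses to one layer:", same_without_relu)
print("with ReLU, collapses to one layer:", same_with_relu)
```

One caveat on "redundant": the collapsed W3 is a product of two matrices, so its rank is at most the width of the layer in between. If that middle layer is narrower than the input and output, the pair acts as a low-rank constraint rather than a completely free single layer.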

But don’t ask me why something so similar works for batchnorm.

Hi, understanding why non-linear functions are so important is a key step in learning to use neural nets. Please check out TensorFlow Playground; it will help you answer your questions.