I understand that we need non-linear functions to represent a wider range of functions, since a composition of linear functions is itself just a linear function. That's why we place activation functions such as ReLU between the matrix multiplications. My question is: do we need these activations after each and every layer? If so, why?
For example, what would happen if we applied ReLU every 2 layers in a 10-layer network instead of after every layer? The network would still be non-linear, but I would think it wouldn't have the same representational power as one with a ReLU after every layer. Consequently, wouldn't this serve as a kind of regularizer, reducing overfitting (since the representational power is reduced)?
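To make the premise concrete, here is a small NumPy sketch (the layer widths and random weights are made up for illustration) showing that two consecutive linear layers with no activation between them collapse into a single equivalent linear layer. In the 10-layer example above, each activation-free pair would therefore behave like one (wider-parameterized) layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical widths: 4 -> 8 -> 3, with no activation between the layers.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

x = rng.normal(size=4)

# Apply the two linear layers one after the other.
two_layers = W2 @ (W1 @ x + b1) + b2

# Collapse them into a single linear layer: y = (W2 W1) x + (W2 b1 + b2).
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: the pair is one linear map
```

So, if I understand correctly, a 10-layer network with ReLU every 2 layers computes the same class of functions as a 5-layer network with ReLU after every layer, which is what makes me suspect the representational power is reduced.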