In a CNN, can a large number of activation function layers easily lead to exploding/vanishing gradients?

Shirui · October 28, 2019, 11:59am

When updating the weights, we need the partial derivative of the loss with respect to the convolution kernel, $\frac{\partial L}{\partial k_{pq}^{(l)}}$.
The formula is:
$\frac{\partial L}{\partial k_{pq}^{(l)}} = \sum_i \sum_j \left(\frac{\partial L}{\partial x_{ij}^{(l)}} f'(u_{ij}^{(l)}) x_{i+p-1, j+q-1}^{(l-1)}\right)$
My question: because the formula contains the derivative of the activation function, $f'(u_{ij}^{(l)})$,
a model with many activation function layers will raise that derivative to a higher and higher power in the expression, especially for nodes near the front of the model. (In fact, the growing exponent is reflected in $x_{i+p-1, j+q-1}^{(l-1)}$.)
As a result, the gradient can easily explode or vanish.
(This assumes the activation function derivative $f'(u_{ij}^{(l)}) \ne 1$.)
Is my understanding correct?
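
For anyone who wants to check this numerically, here is a minimal sketch. It is not the CNN formula above: it assumes a scalar chain $x^{(l)} = f(w \cdot x^{(l-1)})$ with a single shared weight, so each layer contributes one factor $w \cdot f'(u^{(l)})$ to the gradient. The names `chain_gradient` and `scaled_linear`, and the values `w=1.0` and `x0=0.5`, are illustrative assumptions rather than part of any library.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_prime(u):
    s = sigmoid(u)
    return s * (1.0 - s)  # never exceeds 0.25


def scaled_linear(u):
    # Hypothetical activation with a constant derivative greater than 1.
    return 1.5 * u


def scaled_linear_prime(u):
    return 1.5


def chain_gradient(f, f_prime, depth, w=1.0, x0=0.5):
    """Return d x^(depth) / d x^(0) for the scalar chain x^(l) = f(w * x^(l-1))."""
    x, grad = x0, 1.0
    for _ in range(depth):
        u = w * x
        grad *= w * f_prime(u)  # chain rule: one factor w * f'(u) per layer
        x = f(u)
    return grad


# Print the gradient magnitude at several depths for both activations.
for depth in (1, 5, 10, 20):
    shrinking = chain_gradient(sigmoid, sigmoid_prime, depth)
    growing = chain_gradient(scaled_linear, scaled_linear_prime, depth)
    print(depth, shrinking, growing)
```

In this toy setting the sigmoid column shrinks with depth because every factor is at most 0.25, while the scaled-linear column grows as $1.5^{\text{depth}}$. Whether a real CNN behaves the same way also depends on the weights, not only on the activation derivative.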