In a CNN, can a large number of activation function layers easily lead to exploding/vanishing gradients?

(Shirui Zhang) #1

When updating the weights, we need the partial derivative of the loss with respect to the convolution kernel [formula].
The formula looks like this:
\frac{\partial L}{\partial k_{pq}^{(l)}} = \sum_i \sum_j \left( \frac{\partial L}{\partial x_{ij}^{(l)}} f'(u_{ij}^{(l)}) x_{i+p-1, j+q-1}^{(l-1)} \right)
My question is about the derivative of the activation function [formula] that appears in this expression.
If a model has many activation function layers, the exponent of the activation function derivative in the expression grows, especially for nodes near the front of the model. (In fact, the growth in the exponent enters through x_{i+p-1, j+q-1}^{(l-1)}.)
Eventually this can easily cause exploding/vanishing gradients.
(Assuming the activation function derivative [formula])
Is my understanding correct?
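
To make the concern concrete, here is a minimal numerical sketch. It rests on assumptions that are not in the post: a toy chain of scalar layers instead of real convolutions, a sigmoid activation, the same pre-activation value `u` at every layer, and a single hypothetical weight `weight` shared by all layers. Under those assumptions, the factor that backpropagation accumulates over `num_layers` layers is the product of `weight * f'(u)` per layer, so whether it shrinks or grows depends on both the activation derivative and the weights.

```python
import math


def sigmoid(u):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_prime(u):
    """Derivative of the sigmoid; its maximum value is 0.25 at u = 0."""
    s = sigmoid(u)
    return s * (1.0 - s)


def gradient_scale(num_layers, weight=1.0, u=0.0):
    """Accumulated backprop factor for a toy scalar chain.

    Hypothetical simplification: every layer has the same weight and the
    same pre-activation u, so each layer contributes weight * f'(u).
    """
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_prime(u)
    return scale


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        # weight=1.0 shrinks the factor; weight=8.0 grows it.
        print(n, gradient_scale(n, weight=1.0), gradient_scale(n, weight=8.0))
```

In this toy setting the per-layer factor is `0.25 * weight`, so the product shrinks toward zero when that factor is below 1 and grows without bound when it is above 1. A real CNN sums over many kernel positions and channels, so this only illustrates the multiplicative effect, not the exact gradient.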
