Residual Summation - Why Before ReLU?

Hello,

In Identity Mappings in Deep Residual Networks, He et al. investigate the role of the residual summation in ResNet and show that keeping the identity branch "clean" is critical for optimization. They therefore propose a series of pre-activation networks that follow a batch normalization → ReLU → convolution pattern and, regarding your question, apply no operations after the residual summation. The result is a family of models that outperform the original post-activation ResNets and can be trained at far greater depths. Additionally, state-of-the-art architectures like EfficientNet and ConvNeXt likewise place no activation function after the residual summation, further demonstrating the benefit of letting information flow through each block unchanged.
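For illustration, here is a minimal sketch of such a pre-activation block in PyTorch (assuming equal input/output channels and stride 1, so the identity shortcut needs no projection); it is not the exact block from the paper, just the pattern described above:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, repeated twice,
    with nothing applied after the residual summation."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        # The identity branch is added untouched, and no ReLU follows the sum.
        return x + out
```

Because the block ends with the bare summation, stacking these blocks lets the input of any block reach the output of any later block through pure identity mappings, which is exactly the property the paper argues for.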
