Residual Summation - Why Before ReLU?

In ResNet, the basic structure of a bottleneck block is conv → bn → relu → conv → bn → relu → conv → bn → residual summation → relu. Applying the residual summation before the relu seems odd to me. Isn't the point of ResNet to let information pass through a block unchanged? What is the rationale behind this ordering?


In Identity Mappings in Deep Residual Networks, He et al. investigate the role of the shortcut connection in ResNet and show that keeping the identity branch "clean" (free of any transformation, including the final ReLU) is crucial for optimization. They therefore propose a family of pre-activation networks that follow a batch normalization → ReLU → convolution ordering inside the residual branch and, regarding your question, apply no operation after the residual summation. The resulting models outperform plain ResNets and can be trained at much greater depths. Later architectures like EfficientNet and ConvNeXt likewise apply no activation after the residual summation, further supporting the value of letting information flow through each block unchanged.
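To make the difference concrete, here is a minimal sketch of the two orderings. The `conv`, `bn`, and `relu` functions below are hypothetical scalar stand-ins (not real layers), and a two-conv basic block is used instead of a three-conv bottleneck for brevity; the point is only where the final ReLU sits relative to the summation.

```python
# Hypothetical scalar stand-ins for the real layers, chosen only to
# illustrate operation ordering (not actual convolutions or batch norm).
def conv(x):
    return 0.5 * x   # stand-in for a convolution

def bn(x):
    return x         # stand-in for batch norm (identity here)

def relu(x):
    return max(0.0, x)

def post_activation_block(x):
    """Original ResNet ordering: ReLU is applied AFTER the residual sum,
    so the identity path is no longer a pure identity mapping."""
    out = bn(conv(relu(bn(conv(x)))))
    return relu(x + out)

def pre_activation_block(x):
    """Pre-activation ordering (Identity Mappings paper): bn -> relu -> conv
    inside the branch, and NOTHING after the residual sum, so the
    identity path passes through the block untouched."""
    out = conv(relu(bn(conv(relu(bn(x))))))
    return x + out

# A negative input exposes the difference: the post-activation block
# clips the identity signal at zero, while the pre-activation block
# passes it through unchanged.
print(post_activation_block(-2.0))  # the final relu zeroes the output
print(pre_activation_block(-2.0))   # the identity value -2.0 survives
```

With these stand-ins, the post-activation block destroys the negative identity signal (the final ReLU clips `x + out` to zero), while the pre-activation block returns `x` plus the residual, exactly the "clean" information flow the paper argues for.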
