There was a recent paper that discussed the role of scale / multi-scale in image processing; the authors modified ResNet into Res2Net and showed improved accuracy:
I have taken @fgfm's implementation of it, modified the stem to match our XResNet, and added in the Mish activation function.
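For anyone who hasn't seen it, Mish is simple to express: x * tanh(softplus(x)). A minimal NumPy sketch (readable, not overflow-hardened; the real version would be a PyTorch module):

```python
import numpy as np

def mish(x):
    # Mish activation: x * tanh(softplus(x)),
    # where softplus(x) = log(1 + exp(x)).
    # log1p(exp(x)) is fine for a sketch but can overflow for large x.
    return x * np.tanh(np.log1p(np.exp(x)))

# Smooth, non-monotonic near zero: mish(0) = 0, mish(1) ≈ 0.865
```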
I wanted to post it here in the hope that others will test it out and figure out how to improve on it. It is currently doing amazingly well on training loss… but only so-so on accuracy.
So something is still not quite right, but it holds promise.
Here’s some quick data from the paper:
I’m going to continue rewriting it to bring it closer in spirit to how XResNet is written (very compact), but building multi-scale handling directly into the block seems to be a promising avenue for improving various vision architectures.
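To make the multi-scale idea concrete: inside each bottleneck, Res2Net splits the channels into `s` groups, passes the first group through untouched, and runs each later group through a 3x3 conv after adding the previous group's output, so deeper groups see progressively larger receptive fields. A minimal NumPy sketch of just that split-and-cascade (here `transform` is a stand-in for the per-group conv, and the shapes/names are mine, not from the paper's code):

```python
import numpy as np

def res2net_split(x, scales=4, transform=lambda y: y):
    # x: feature map of shape (channels, H, W); channels must be divisible by scales.
    # transform stands in for the per-group 3x3 conv (identity in this sketch).
    groups = np.split(x, scales, axis=0)
    outs = [groups[0]]          # first group: identity path
    prev = None
    for g in groups[1:]:
        inp = g if prev is None else g + prev  # cascade previous group's output
        prev = transform(inp)
        outs.append(prev)
    return np.concatenate(outs, axis=0)        # same shape as the input
```

With the identity transform you can see the cascade directly: later channel groups accumulate the outputs of earlier ones, which is where the "multi-scale within one block" behavior comes from.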