I’ve been playing around trying to beat the Imagewoof leaderboards and ran into something odd regarding the activations in ResNet. Figured I’d share what I have so far.
TL;DR: Using DoubleMish instead of Mish after the residual connection improves accuracy by 1%. Other tweaks to the activation functions improve accuracy by 2%. At least for 128px, 5-epoch XResNet-18.
Basically I was looking at the default XResNet-18 architecture and found that the way activations are applied along the residual path is inconsistent between scaling and non-scaling blocks. In a non-scaling block the residual connection adds non-activated values to activated values, whereas in a scaling block both sides are non-activated because the residual side passes through a pooling layer, conv, and batchnorm first.
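To make that concrete, here's roughly what the two idpath variants look like in an XResNet-style block (paraphrased, not fastai's exact code):

```python
import torch.nn as nn

ni, nf = 64, 128  # example channel counts

# Non-scaling block: idpath is the identity. The incoming x was already activated
# by the previous block's final activation, while convpath ends in a BatchNorm,
# so the add combines an activated tensor with a non-activated one.
idpath_nonscaling = nn.Identity()

# Scaling block: idpath is pool -> 1x1 conv -> BatchNorm, so here *both* sides
# of the add are non-activated.
idpath_scaling = nn.Sequential(
    nn.AvgPool2d(2, ceil_mode=True),
    nn.Conv2d(ni, nf, 1, bias=False),
    nn.BatchNorm2d(nf),
)
```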
So I was messing around trying to fix that and noticed that, in general, adding activations along the residual paths tended to improve accuracy. I tried something crazy: I went back to the default XResNet-18, but with DoubleMish instead of Mish after every residual connection (i.e. here: `return self.act(self.convpath(x) + self.idpath(x))`). DoubleMish is just `mish(mish(x))`.
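A minimal PyTorch version looks something like this (the module in my gist may be wrapped slightly differently):

```python
import torch.nn as nn
import torch.nn.functional as F

class DoubleMish(nn.Module):
    """Mish applied twice: mish(mish(x))."""
    def forward(self, x):
        return F.mish(F.mish(x))
```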
Testing against Imagewoof at 128px, 5 epochs, 20 runs each, I saw a mean accuracy of 67.9% for the default XResNet-18 and 69.1% with the above tweak.
I’ve since rigged up a custom XResNet-18 model where the activation functions can be specified in a variety of places. Running evolutionary search over that I’ve gotten mean accuracy up to 69.9%. The discovered configuration for that:

- adds Mish after the stem,
- adds DoubleMish before the final AdaptiveAvgPool,
- adds DoubleMish before the convolutional path in each resblock,
- adds Mish before the pooling along idpath,
- adds Mish after the convolution in idpath,
- and uses DoubleMish instead of Mish after residual connections.
Bit of a laundry list of changes. I’m still experimenting to see which of those are important. The DoubleMish after the residual connection alone is worth +1%, so that’s at least significant. Using DoubleMish instead of Mish everywhere does not provide a benefit, so it’s only useful in key locations. I find that quite odd.
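To give a clearer picture of where those activation slots live, here's a simplified sketch of the configurable block (slot names and exact layer ordering are approximate, not the code from the gist):

```python
import torch.nn as nn
import torch.nn.functional as F

class DoubleMish(nn.Module):
    def forward(self, x):
        return F.mish(F.mish(x))

def maybe(act_cls):
    # Each slot can be None (no activation), nn.Mish, or DoubleMish.
    return act_cls() if act_cls is not None else nn.Identity()

class ConfigurableResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1,
                 act_pre_conv=None,      # before the convolutional path
                 act_id_pre_pool=None,   # before the pooling along idpath
                 act_id_post_conv=None,  # after the convolution in idpath
                 act_post_res=nn.Mish):  # after the residual connection
        super().__init__()
        self.convpath = nn.Sequential(
            maybe(act_pre_conv),
            nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(nf), nn.Mish(),
            nn.Conv2d(nf, nf, 3, padding=1, bias=False),
            nn.BatchNorm2d(nf),
        )
        if stride != 1 or ni != nf:
            # Scaling block: pool -> 1x1 conv -> bn on idpath, plus optional acts.
            self.idpath = nn.Sequential(
                maybe(act_id_pre_pool),
                nn.AvgPool2d(2, ceil_mode=True) if stride != 1 else nn.Identity(),
                nn.Conv2d(ni, nf, 1, bias=False),
                maybe(act_id_post_conv),
                nn.BatchNorm2d(nf),
            )
        else:
            # Non-scaling block: idpath is just the identity.
            self.idpath = nn.Identity()
        self.act = maybe(act_post_res)

    def forward(self, x):
        return self.act(self.convpath(x) + self.idpath(x))
```

The stem and pre-AdaptiveAvgPool activations sit outside the blocks, so they're not shown here.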
Some things to note:
- I was never able to match the accuracies on the leaderboard, even when I attempted to replicate the same configurations, so I just run all my experiments against the default Imagenette example code as a reference. XResNet-18 is the simplest to mess with.
- All of this is at 128px, 5 epochs, 20 runs each. The evolutionary algorithm uses None, Mish, or DoubleMish in all of the configurable activation locations. Ranger optimizer with fit_flat_cos, batch size 64, 1e-2 LR for all runs (rough setup sketch after this list).
- Obviously this might be overfitting the hyperparameters to Imagewoof. I figure that after all my experimentation is done I can try the resulting architecture against other datasets to see if the improvement is consistent.
- Since I’m not varying the LR it’s possible that plays a role in the differences, but Ranger tends to be forgiving. Again, I can always fine-tune the LR once I have an architecture I want to validate.
- Given the limited training it’s possible this doesn’t scale to higher epochs. I’ve only got a measly 970 to experiment on right now, so I’m working with what I’ve got.
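For reference, here's roughly the training setup, based on the fastai Imagenette example (the dataset variant, crop settings, and loss are my best recollection of the example defaults, so treat this as a sketch rather than the exact script):

```python
from fastai.vision.all import *

# Imagewoof 160px source images, trained at 128px (assumed dataset variant).
path = untar_data(URLs.IMAGEWOOF_160)
dls = ImageDataLoaders.from_folder(
    path, valid='val', bs=64,
    item_tfms=RandomResizedCrop(128, min_scale=0.35),
    batch_tfms=Normalize.from_stats(*imagenet_stats))

# XResNet-18 with Mish activations, Ranger + fit_flat_cos,
# 5 epochs at a flat 1e-2 LR.
learn = Learner(dls, xresnet18(n_out=dls.c, act_cls=Mish),
                opt_func=ranger,
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_flat_cos(5, 1e-2)
```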
Here’s a gist with the core bits of the code: https://gist.github.com/fpgaminer/57232ab085b8be1decb0906c4eb03356 I’ll try to release a more complete notebook once I’ve finished experimenting. Right now the experimental notebook is … a nightmare of cells.
Here’s a CSV of all the experimental runs thus far: https://gist.github.com/fpgaminer/04be013f894997f92bb33a89bc39fc76
That’s all I’ve got for now. It’s just really strange that more aggressive activation specifically along the residual path leads to improved accuracy. Then again, Mish itself is quite surprising: it’s so similar to Swish, and yet whatever subtle difference there is turns out to be significant. So maybe it’s not so odd that DoubleMish is useful in certain places.
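For reference, the two functions are nearly the same shape (Swish here is the β = 1 / SiLU variant):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, steps=201)

swish = x * torch.sigmoid(x)           # Swish / SiLU: x * sigmoid(x)
mish  = x * torch.tanh(F.softplus(x))  # Mish: x * tanh(softplus(x))

print((swish - mish).abs().max())      # close, but not identical
```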