Super interesting thread and discussion!
If I understood the previous posts correctly, this was the best model setup so far:
- Mish activation function
- change the order to Conv-Mish-BN (this order already seems to be included in the ConvLayer class of fastai v2 dev; see the sketch below)
- hyperparameters, optimizer, and training schedule as in the notebook
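For clarity, this is roughly what the Conv-Mish-BN order looks like as a plain PyTorch sketch (fastai v2's ConvLayer can be configured to produce the same order, if I read the source correctly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish: x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def conv_mish_bn(ni, nf, ks=3, stride=1):
    # Conv -> Mish -> BN, i.e. activation *before* the norm layer
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
        Mish(),
        nn.BatchNorm2d(nf),
    )
```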
I tried my Squeeze-Excite-XResNet implementation (based on the fastai XResNet) with AdaptiveConcatPool2d and Mish, and got the following results after 5 runs:
0.638, 0.624, 0.668, 0.700, 0.686 → average: 0.663
… so this did not really improve on the SimpleSelfAttention approach from the notebook.
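For reference, my SE block is the standard squeeze-and-excitation gate, roughly like this sketch (the reduction factor of 16 is just the usual default; pass Mish as the activation to stay consistent with the rest of the net):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global avg pool -> FC bottleneck -> sigmoid channel gate
    def __init__(self, nf, reduction=16, act_cls=nn.ReLU):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(nf, nf // reduction, bias=False),
            act_cls(),                        # e.g. Mish instead of the usual ReLU
            nn.Linear(nf // reduction, nf, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # rescale the channels
```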
Then I combined the MXResNet from LessW2020 with the SE block to get the SEMXResNet, and got the following results (+ Ranger + SimpleSelfAttention + Flatten Anneal):
0.748, 0.748, 0.746, 0.772, 0.718 → average: 0.746
And with BN after the activation:
0.728, 0.768, 0.774, 0.738, 0.752 → average: 0.752
And with AdaptiveConcatPool2d:
0.620, 0.628, 0.684, 0.714, 0.700 → average: 0.669
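Roughly, the SE gate sits on the residual branch right before the skip connection is added (simplified sketch reusing the SEBlock from above; the real MXResNet block of course also has the downsample/identity-path details):

```python
import torch.nn as nn

class SEResBlock(nn.Module):
    # Simplified: convpath is the block's existing Conv-Mish-BN stack,
    # and the SE gate rescales its output channels before the skip connection.
    def __init__(self, convpath, nf, reduction=16):
        super().__init__()
        self.convpath = convpath
        self.se = SEBlock(nf, reduction)   # SEBlock from the sketch above

    def forward(self, x):
        return x + self.se(self.convpath(x))
```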
With the increased parameter count (e.g., from the SE block and/or the doubled input to the FC head after AdaptiveConcatPool2d), 5 epochs are very likely not enough to fairly compare it to the models with fewer parameters, since it needs more time to train (as mentioned above).
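For anyone wondering where the doubled head input comes from: AdaptiveConcatPool2d concatenates max and average pooling, so the first Linear layer of the head sees 2*nf features instead of nf. Roughly (this mirrors the fastai implementation):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    # Concatenate adaptive max pool and adaptive avg pool along the channel dim,
    # which is why the following FC head gets twice as many input features.
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)
```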
I also saw the thread from LessW2020 about Res2Net. Has anybody already tried it in a similar fashion and gotten some (preliminary) results?