Meet Mish: New Activation function, possible successor to ReLU?

In my quick experiments, Mish felt very stable (it converges under a wide range of hyperparameters) and smooth (reasonable-looking shapes) for classification. Regression, on the other hand, was quite fragile; when it did work, the fit was smooth but slightly fuzzy around the borders.

This is surprisingly similar to my experience with Mish. Once it starts to converge, it usually reaches a good point. However, getting it to converge can sometimes be difficult. Has anyone found practical tips for this?

Mish is quite sensitive to the LR in some cases. I haven’t done thorough hyperparameter tuning to find what works best for Mish, but it has worked in all the tasks I’ve used it in.

3 Likes

Hello there,

I was wondering whether there is a good way to initialize convolution and FC layers for this non-linearity. So far I’ve been initializing the layers just as if it were a ReLU, but perhaps someone has a better idea?
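
For concreteness, what I’ve been doing is plain Kaiming init with the ReLU gain. One thing I’ve been meaning to try (just my own guess, not an established recipe) is estimating the gain for Mish empirically, the same way sqrt(2) is derived for ReLU, and plugging it into a fan-in init. A rough PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

def empirical_gain(act, n=10_000_000):
    # Estimate the init gain the same way sqrt(2) arises for ReLU:
    # gain = 1 / sqrt(E[act(z)^2]) for z ~ N(0, 1).
    z = torch.randn(n)
    return (1.0 / act(z).pow(2).mean().sqrt()).item()

def init_for_mish(module, gain):
    # Kaiming-style fan-in init, but with the empirical Mish gain
    # instead of the built-in ReLU gain.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        w = module.weight
        fan_in = w.size(1) * w[0][0].numel()  # in_channels * kernel area (1 for Linear)
        with torch.no_grad():
            w.normal_(0, gain / fan_in ** 0.5)
            if module.bias is not None:
                module.bias.zero_()

gain = empirical_gain(mish)        # compare with sqrt(2) ≈ 1.414 for ReLU
layer = nn.Conv2d(3, 64, kernel_size=3)
init_for_mish(layer, gain)
```

No idea yet whether this actually beats treating it as a ReLU, hence the question.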

cc @Diganta

Cheers

4 Likes

There have been a couple of papers and projects that have explored mixed-activation training. In one, the model was trained for 20 epochs with ReLU and then for 15 more with Mish. Another project used different activations for different layers. This is all very empirical and doesn’t have any universal justification, so there’s a lot you can experiment with: weight initialization, optimizers, better LR policies.
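
If anyone wants to try the staged approach, swapping activations mid-training is easy to do in PyTorch. A minimal sketch (the 20/15 epoch split is the one from that project, not a recommendation, and `train_one_epoch` is a hypothetical training loop):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish(x) = x * tanh(softplus(x)); recent PyTorch versions also ship nn.Mish.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_relu_for_mish(model: nn.Module):
    # Recursively replace every nn.ReLU submodule with Mish, in place.
    # Mish has no parameters, so the existing optimizer keeps working.
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, Mish())
        else:
            swap_relu_for_mish(child)

# Staged schedule: train with ReLU for the first 20 epochs,
# then switch to Mish for the remaining 15.
# for epoch in range(35):
#     if epoch == 20:
#         swap_relu_for_mish(model)
#     train_one_epoch(model, ...)
```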

1 Like

Some updates - faster approximations for Mish and H-Mish - https://github.com/YashasSamaga/ConvolutionBuildingBlocks

And a cool implementation of different tricks with MXResNet to push accuracy on CIFAR-10: https://github.com/iamVarunAnand/image_classification

6 Likes


My poster for Mish got selected for DLRLSS, hosted by MILA, CIFAR, Vector and AMII. I will get to interact with and present my work to 250 other fellows from around the world, and hopefully also to the invited expert panel and speakers, including Chelsea Finn, Andrew Saxe and many more. I also hope to feature our collaboration at Landskape with Javier Ideami on loss landscape visualizations for Mish.

9 Likes

With the massive help of @iyaja and Javier Ideami, we were finally able to obtain the loss landscapes of Mish, Swish and ReLU. Compared to both Swish and ReLU, Mish gives an overall lower loss, better accuracy, and a better-conditioned, smoother landscape, which makes optimization easier.
Link to tweet
Javier also discussed this in depth in his talk last night at the Synthetic Intelligence Forum, Montreal, along with many other great projects and studies in the loss landscape space. YouTube link
Here are the visualizations:


For the visualizations that are not labelled, it's ReLU -> Mish -> Swish (from left to right).
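
For anyone who wants to reproduce this kind of plot, the usual recipe is the filter-normalized 2D slice from Li et al. (2018), "Visualizing the Loss Landscape of Neural Nets". Here is a rough PyTorch sketch of that idea (not necessarily our exact pipeline):

```python
import torch

def filter_normalized_direction(model):
    # A random direction in weight space, rescaled filter-wise so each filter
    # of the direction has the same norm as the corresponding filter of the
    # model (Li et al. 2018).
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:          # conv / linear weights: normalize per output filter
            for di, pi in zip(d, p):
                di.mul_(pi.norm() / (di.norm() + 1e-10))
        else:                    # zero the direction so biases / norm params stay fixed
            d.zero_()
        direction.append(d)
    return direction

@torch.no_grad()
def loss_surface(model, loss_fn, loader, steps=21, span=1.0, device="cpu"):
    # Evaluate the loss on a 2D grid spanned by two random filter-normalized
    # directions around the trained weights. Use a small fixed loader; this
    # runs steps * steps full evaluations.
    model.eval()
    base = [p.detach().clone() for p in model.parameters()]
    dx, dy = filter_normalized_direction(model), filter_normalized_direction(model)
    alphas = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            for p, p0, x, y in zip(model.parameters(), base, dx, dy):
                p.copy_(p0 + a * x + b * y)
            total, n = 0.0, 0
            for inputs, targets in loader:
                out = model(inputs.to(device))
                total += loss_fn(out, targets.to(device)).item() * len(targets)
                n += len(targets)
            surface[i, j] = total / n
    for p, p0 in zip(model.parameters(), base):
        p.copy_(p0)              # restore the original weights
    return surface
```

Train the same architecture with ReLU, Swish and Mish, compute `loss_surface` for each, and plot the three grids to get side-by-side comparisons like the ones above.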

12 Likes

Fantastic graphics - congrats @Diganta.
Even more data on Mish’s excellent performance (and why it works so well) is great to see!

3 Likes

Many more findings are on the way. :slight_smile:

4 Likes

I checked your repo yesterday - and of course I admire your work ^^

2 Likes

Thank you :slight_smile:

1 Like


Some more views

3 Likes

The Mish paper has been accepted to BMVC 2020. It wouldn’t have been possible without the collaborative effort of everyone in this community. Big thanks to all here for the support, the advice and, most importantly, for trying Mish out.
Thanks!!

13 Likes

Woohoo, congrats and well done for the consistent work and perseverance!

2 Likes

Congrats! Great to hear that your work (which has already been cited 30 times! wow!) is finally being presented at a conference. It sounds like a great and prestigious conference as well! Looking forward to hearing about your experiences.

2 Likes

Thank you! There’s a lot more effort still to put into this. I’m very much interested in understanding the dynamics behind the performance of non-monotonic functions, but this news is definitely a step in the right direction for me.

1 Like

Thank you!
The true citation count is 45+. I maintain the list here - https://github.com/digantamisra98/Mish/blob/master/Citations.md
No idea why Google Scholar doesn’t reflect all of them. arXiv shows 34, Scholar shows 31 and ADS shows 17.
Also, yes, BMVC is a very prestigious conference. Two days ago I was reading a paper by Google Brain on MixNet, which @LessW2020 had covered earlier in an article; that work was also published at BMVC last year. As a single author, I was quite anxious about making the cut when researchers from such reputed places would also be submitting their papers. So far I have come across one more paper accepted at BMVC 2020, from the Carnegie Mellon Vision Group, so I’m stoked that my work will be presented alongside other great work from such prestigious research groups.

11 Likes

Super congrats Diganta, and here is the YouTube link to the video of the visualizations we put together with Diganta Misra, Ajay Uppili Arasanipalai and Trikay Nalamada :wink:

6 Likes

Wow! So much exciting news Diganta - congratulations :clap: :partying_face:

While experimenting with Mish on the NN Playground, I found convergence to be surprisingly difficult. It’s possible that these toy tasks are just not good for evaluating Mish, but I suspect the problem is more likely my implementation. Would someone here mind double-checking my code here? It’s 6 lines of JavaScript, although I also implemented softplus by hand just above the activation functions on line 115, which is 10 more lines.
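
For reference, the math I’m trying to match is Mish(x) = x · tanh(softplus(x)), and the part that most easily goes wrong in a hand-rolled version is softplus: a naive log(1 + exp(x)) overflows for large inputs. A small Python sketch of the numerically stable form I’m checking my JavaScript against (the stable identity, not my Playground code):

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)) rewritten as
    # max(x, 0) + log1p(exp(-|x|)), which never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

# Quick sanity checks against known behaviour:
assert abs(mish(0.0)) < 1e-12            # Mish(0) = 0
assert abs(mish(100.0) - 100.0) < 1e-6   # ~identity for large positive x
assert abs(mish(-100.0)) < 1e-6          # ~0 for large negative x
```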

1 Like

Just saw this on Twitter:

First of all, that’s awesome and more than deserved! Hope there are more to come!

I reported earlier (along with one other person, if I remember correctly) that I had issues with convergence. I am using a very similar architecture to the one in the mentioned paper. I am going to inspect it closely to figure out what they are doing differently from me, run some experiments, and hopefully come back with some constructive input.

1 Like