Meet Mish: New Activation function, possible successor to ReLU?

Thanks a lot for the quick response. Will that implementation work in production (if used in the future)?

I did try your MishCuda implementation and it definitely solved my issue. Now I’m able to train my model at an average of 47 seconds per epoch. However, I did a sanity check on its execution speed and observed something strange:

Please correct me if I’m doing anything wrong. You can check the Colab notebook here.
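For context, this is roughly the kind of timing check I mean (a simplified sketch, not the exact notebook code; the input shape and iteration count are arbitrary, and I’m assuming the mish-cuda package exposes MishCuda this way):

```python
# Simplified timing sketch (not the exact notebook code).
# Assumes `from mish_cuda import MishCuda` works as in the package README.
import time
import torch
from mish_cuda import MishCuda

def time_activation(act, device, shape=(64, 128, 56, 56), iters=20):
    x = torch.randn(shape, device=device, requires_grad=True)
    for _ in range(5):                       # warm-up passes
        act(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()             # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        act(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("GPU per pass:", time_activation(MishCuda(), "cuda"))
print("CPU per pass:", time_activation(MishCuda(), "cpu"))
```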

Can someone help me reproduce the code for the paper https://arxiv.org/pdf/1711.04735.pdf by @SuryaGanguli and @sschoenholz?
I’m interested in reproducing the order-chaos transition graph for a non-linear activation function, as shown in Figure 1 of the paper.
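As far as I understand it (and I may be misreading the paper), that figure comes from the mean-field recursion of Poole et al. 2016 / Schoenholz et al. 2017: iterate the variance map to its fixed point q*, then find where chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2] crosses 1. Here is a rough NumPy sketch of that boundary computation for Mish (my own attempt, not the authors’ code; the grid ranges, quadrature order, and the cap on q are arbitrary choices):

```python
# Rough sketch only - my own attempt at the mean-field order/chaos boundary,
# not the authors' code. Boundary: chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2] = 1.
import numpy as np

# Gauss-Hermite_e nodes/weights so that E[f(Z)] for Z~N(0,1) ~= sum(w_i * f(z_i))
z, w = np.polynomial.hermite_e.hermegauss(101)
w = w / np.sqrt(2.0 * np.pi)

def softplus(x):
    return np.logaddexp(0.0, x)            # numerically stable log(1 + e^x)

def sigmoid(x):
    return np.exp(-np.logaddexp(0.0, -x))  # numerically stable

def mish(x):
    return x * np.tanh(softplus(x))

def mish_grad(x):
    tsp = np.tanh(softplus(x))
    return tsp + x * sigmoid(x) * (1.0 - tsp ** 2)

def q_star(sw2, sb2, iters=200):
    q = 1.0
    for _ in range(iters):
        # cap q: the variance map can diverge for an unbounded activation
        q = min(sw2 * np.sum(w * mish(np.sqrt(q) * z) ** 2) + sb2, 1e9)
    return q

def chi1(sw2, sb2):
    q = q_star(sw2, sb2)
    return sw2 * np.sum(w * mish_grad(np.sqrt(q) * z) ** 2)

# For each sigma_b, locate the sigma_w where chi_1 crosses 1 (the phase boundary)
for sb in np.linspace(0.0, 0.3, 7):
    sws = np.linspace(0.5, 3.0, 251)
    chis = np.array([chi1(sw ** 2, sb ** 2) for sw in sws])
    print(f"sigma_b={sb:.2f}  boundary sigma_w ~ {sws[np.argmin(np.abs(chis - 1.0))]:.3f}")
```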

3 Likes

For production you may be better off with the JIT Autograd function implementation. The MishCuda implementation requires the CUDA SDK to be installed (as it compiles locally), so it’s probably best to avoid that dependency. As they implement the same function, there should of course be no issue switching between implementations.
Also, as you found in your benchmarking, CPU performance is not at all optimised in the MishCuda implementation. This is partly because it doesn’t store any intermediates, in order to minimise memory usage and achieve maximum performance (calculating any extra stored values would slow it down a fair bit on GPU). CPU is supported largely for ease of testing and completeness, and I don’t really intend to optimise it (you’d be better off optimising a JITed Autograd function, which should perform fairly well). So if you’re deploying with CPU inference, you are better off with the Autograd version.
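For anyone curious, a JITed Autograd function along these lines is a reasonable starting point (a simplified sketch with my own names, not the exact implementation from the repo); it saves only the input and recomputes the gradient in the backward pass rather than storing intermediates:

```python
# Sketch of a JIT-scripted Autograd Mish; names are my own, not from the repo.
import torch
import torch.nn.functional as F

@torch.jit.script
def mish_jit_fwd(x):
    # mish(x) = x * tanh(softplus(x))
    return x.mul(torch.tanh(F.softplus(x)))

@torch.jit.script
def mish_jit_bwd(x, grad_output):
    # recompute everything from the saved input instead of storing intermediates
    sig = torch.sigmoid(x)
    tsp = torch.tanh(F.softplus(x))
    return grad_output.mul(tsp + x * sig * (1.0 - tsp * tsp))

class MishJitFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return mish_jit_fwd(x)

    @staticmethod
    def backward(ctx, grad_output):
        x = ctx.saved_tensors[0]
        return mish_jit_bwd(x, grad_output)

class MishJit(torch.nn.Module):
    def forward(self, x):
        return MishJitFn.apply(x)
```

Since it computes the same function, swapping a module like this in for MishCuda (or back again) shouldn’t require any other changes.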

2 Likes

Mish and Ranger were recently used in a CVPR conference paper on 3D Human Pose Reconstruction. Links - Paper, Code, Project Page
Congratulations to @LessW2020 on this small achievement. I am soon going to upload an improved preprint for Mish on arXiv with better results and insights.

10 Likes

Isn’t Mish used in YOLOv4 though? I think that’s even more impressive since YOLOv4 will likely be used in research and industry for years to come.

Yes, it is the default activation in YOLOv4.
Also, Mish has been made much more numerically stable and faster here - https://github.com/opencv/opencv/pull/17540
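The usual trick for making Mish overflow-safe is to rewrite tanh(softplus(x)) in terms of e^x and fall back to the identity for large inputs. A minimal NumPy sketch of that kind of reformulation (my own illustration, not the OpenCV code - see the PR for the exact version):

```python
# Illustrative overflow-safe Mish (not the OpenCV implementation).
# Uses tanh(softplus(x)) = n / (n + 2) with n = e^{2x} + 2*e^x, and the fact
# that the factor saturates to 1 for large x, so mish(x) ~= x there.
import numpy as np

def mish_stable(x, threshold=20.0):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(np.minimum(x, threshold))   # clamp so exp() can't overflow
    n = e * (e + 2.0)
    return np.where(x >= threshold, x, x * n / (n + 2.0))
```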

3 Likes

Thanks @Diganta for the paper link and congrats on another Mish use win!

I meant to post earlier, but as @ilovescience noted, Mish is the default activation for YOLOv4, which I think is a great accomplishment for Mish and @Diganta.

It’s also great to hear about an updated arXiv paper - can you post it here when it’s live so we can read it?

Great to see Mish continuing to evolve and improve. I grimace every time I see code with ReLU btw, which is still pretty common.

5 Likes

Thanks!
I will surely post an update here when I release the next version, for which I got a lot of insightful comments on the writing from @Redknight (I guess I’m still inexperienced at writing academic papers). It should be up in the next month at the earliest.

4 Likes


This paper recently verified that Mish performs better than ReLU in stereo matching. I haven’t read the paper completely and I don’t have much experience in this domain; however, I thought it might be helpful for some folks here. It’s good to see people trying out Mish in other tasks as well.

4 Likes

This paper also uses the Mish activation. The authors don’t mention it, but you can find it in their GitHub repo.

3 Likes

Yes, I’m aware of this. I saw it in an issue on the @rwightman repository. There are a couple of papers that use Mish but don’t mention it. Still, it’s good to see it being validated on varied tasks.

2 Likes

Faster and more accurate Mish in OpenCV - https://github.com/opencv/opencv/pull/17621

2 Likes

Mish also provides good adversarial robustness, as verified by the authors of a recent Google paper on the ability of smooth activations to provide higher adversarial robustness.
Link to the tweet - https://twitter.com/cihangxie/status/1278053759197872129?s=19
Thanks to @morgan for bringing this to my attention.

1 Like

That’s great! I was wondering why they didn’t show results for Mish in that paper… Nice to hear that the authors actually did have results for it…

1 Like

It seems that Mish gets the highest standard accuracy and comes pretty close to GELU in terms of adversarial robustness. I guess since Mish isn’t published at a conference, they chose not to put its results in the paper.

2 Likes

Folks here might be interested in this popular NN Playground environment which I forked to add both Sine and Mish activation functions.

4 Likes

This is awesome. Thanks!

1 Like

In my quick experiments, Mish felt very stable (it converges with plenty of hyperparameters) and smooth (reasonable-looking shapes) for classification. For regression, on the other hand, it was quite fragile, yet smooth and slightly fuzzy around the borders when it did work.

This is surprisingly similar to my experience with Mish. Once it starts to converge, it usually reaches a good point. However, getting it to converge can sometimes be difficult. Has anyone found practical tips for this?

Mish is quite sensitive to the learning rate in some cases. I haven’t done thorough hyperparameter tuning to find what’s best for Mish, but it usually works in all the tasks I’ve used it in.

3 Likes