Meet Mish: New Activation function, possible successor to ReLU?

Thanks a lot for the quick response. Will that implementation work in production (if used in the future)?

I did try your MishCuda implementation and it definitely solved my issue. Now I’m able to train my model at an average of 47 seconds per epoch. However, I did a sanity check on its execution speed and observed something strange:

Please correct me if I’m doing anything wrong. You can check the Colab notebook here.
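For context, this is roughly the kind of timing check I mean (a simplified sketch, not the exact notebook code; the input shape and iteration count are arbitrary, and I’m assuming the mish-cuda package exposes MishCuda this way):

```python
# Simplified timing sketch (not the exact notebook code).
# Assumes `from mish_cuda import MishCuda` works as in the package README.
import time
import torch
from mish_cuda import MishCuda

def time_activation(act, device, shape=(64, 128, 56, 56), iters=20):
    x = torch.randn(shape, device=device, requires_grad=True)
    for _ in range(5):                       # warm-up passes
        act(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()             # wait for queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        act(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("GPU per pass:", time_activation(MishCuda(), "cuda"))
print("CPU per pass:", time_activation(MishCuda(), "cpu"))
```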

Can someone help me reproduce the code for the paper https://arxiv.org/pdf/1711.04735.pdf by @SuryaGanguli and @sschoenholz?
I’m interested in reproducing the order-chaos transition graph for a non-linear activation function, as shown in Figure 1 of the paper.
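As far as I understand it (and I may be misreading the paper), that figure comes from the mean-field recursion of Poole et al. 2016 / Schoenholz et al. 2017: iterate the variance map to its fixed point q*, then find where chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2] crosses 1. Here is a rough NumPy sketch of that boundary computation for Mish (my own attempt, not the authors’ code; the grid ranges, quadrature order, and the cap on q are arbitrary choices):

```python
# Rough sketch only - my own attempt at the mean-field order/chaos boundary,
# not the authors' code. Boundary: chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2] = 1.
import numpy as np

# Gauss-Hermite_e nodes/weights so that E[f(Z)] for Z~N(0,1) ~= sum(w_i * f(z_i))
z, w = np.polynomial.hermite_e.hermegauss(101)
w = w / np.sqrt(2.0 * np.pi)

def softplus(x):
    return np.logaddexp(0.0, x)            # numerically stable log(1 + e^x)

def sigmoid(x):
    return np.exp(-np.logaddexp(0.0, -x))  # numerically stable

def mish(x):
    return x * np.tanh(softplus(x))

def mish_grad(x):
    tsp = np.tanh(softplus(x))
    return tsp + x * sigmoid(x) * (1.0 - tsp ** 2)

def q_star(sw2, sb2, iters=200):
    q = 1.0
    for _ in range(iters):
        # cap q: the variance map can diverge for an unbounded activation
        q = min(sw2 * np.sum(w * mish(np.sqrt(q) * z) ** 2) + sb2, 1e9)
    return q

def chi1(sw2, sb2):
    q = q_star(sw2, sb2)
    return sw2 * np.sum(w * mish_grad(np.sqrt(q) * z) ** 2)

# For each sigma_b, locate the sigma_w where chi_1 crosses 1 (the phase boundary)
for sb in np.linspace(0.0, 0.3, 7):
    sws = np.linspace(0.5, 3.0, 251)
    chis = np.array([chi1(sw ** 2, sb ** 2) for sw in sws])
    print(f"sigma_b={sb:.2f}  boundary sigma_w ~ {sws[np.argmin(np.abs(chis - 1.0))]:.3f}")
```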

3 Likes

For production you may be better off with the JIT Autograd function implementation. The MishCuda implementation requires the CUDA SDK to be installed (as it compiles locally), so it’s probably best to avoid that dependency. As they implement the same function, there should of course be no issue switching between implementations.
Also, as you found in your benchmarking, CPU performance is not at all optimised in the MishCuda implementation. This is partly because it doesn’t store any intermediates, in order to minimise memory usage and achieve maximum performance (calculating any extra stored values would slow it down a fair bit on GPU). CPU is supported largely for ease of testing and completeness, and I don’t really intend to optimise it (you’d be better off optimising a JITed Autograd function, which should perform fairly well). So if you’re deploying with CPU inference, you are better off with the Autograd version.
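For anyone curious, a JITed Autograd function along these lines is a reasonable starting point (a simplified sketch with my own names, not the exact implementation from the repo); it saves only the input and recomputes the gradient in the backward pass rather than storing intermediates:

```python
# Sketch of a JIT-scripted Autograd Mish; names are my own, not from the repo.
import torch
import torch.nn.functional as F

@torch.jit.script
def mish_jit_fwd(x):
    # mish(x) = x * tanh(softplus(x))
    return x.mul(torch.tanh(F.softplus(x)))

@torch.jit.script
def mish_jit_bwd(x, grad_output):
    # recompute everything from the saved input instead of storing intermediates
    sig = torch.sigmoid(x)
    tsp = torch.tanh(F.softplus(x))
    return grad_output.mul(tsp + x * sig * (1.0 - tsp * tsp))

class MishJitFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return mish_jit_fwd(x)

    @staticmethod
    def backward(ctx, grad_output):
        x = ctx.saved_tensors[0]
        return mish_jit_bwd(x, grad_output)

class MishJit(torch.nn.Module):
    def forward(self, x):
        return MishJitFn.apply(x)
```

Since it computes the same function, swapping a module like this in for MishCuda (or back again) shouldn’t require any other changes.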

2 Likes

Mish and Ranger were recently used in a CVPR conference paper on 3D Human Pose Reconstruction. Links - Paper, Code, Project Page
Congratulations to @LessW2020 on this small achievement. I am soon going to upload an improved preprint for Mish on arXiv with better results and insights.

10 Likes

Isn’t Mish used in YOLOv4 though? I think that’s even more impressive since YOLOv4 will likely be used in research and industry for years to come.

Yes, it is the default activation in YOLOv4.
Also, Mish has been made much more numerically stable and faster here - https://github.com/opencv/opencv/pull/17540
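The usual trick for making Mish overflow-safe is to rewrite tanh(softplus(x)) in terms of e^x and fall back to the identity for large inputs. A minimal NumPy sketch of that kind of reformulation (my own illustration, not the OpenCV code - see the PR for the exact version):

```python
# Illustrative overflow-safe Mish (not the OpenCV implementation).
# Uses tanh(softplus(x)) = n / (n + 2) with n = e^{2x} + 2*e^x, and the fact
# that the factor saturates to 1 for large x, so mish(x) ~= x there.
import numpy as np

def mish_stable(x, threshold=20.0):
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(np.minimum(x, threshold))   # clamp so exp() can't overflow
    n = e * (e + 2.0)
    return np.where(x >= threshold, x, x * n / (n + 2.0))
```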

3 Likes

Thanks @Diganta for the paper link and congrats on another Mish use win!

I meant to post earlier, but as @ilovescience noted, Mish is the default activation for YOLOv4, which I think is a great accomplishment for Mish and @Diganta.

It’s also great to hear about an updated arXiv paper - can you post it here when it’s live so we can read it?

Great to see Mish continuing to evolve and improve. I grimace every time I see code with ReLU btw, which is still pretty common.

5 Likes

Thanks!
I will surely post an update here when I release the next version, for which I got a lot of insightful comments on the writing from @Redknight (I guess I’m still inexperienced at writing academic papers). It should be up in the next month at the earliest.

4 Likes


This paper recently verified that Mish performs better than ReLU in stereo matching. I haven’t read the paper completely and I don’t have much experience in this domain; however, I thought it might be helpful for some folks here. It’s good to see people trying out Mish in other tasks as well.

4 Likes

This paper also uses the Mish activation. The authors don’t mention it, but you can find it in their GitHub repo.

3 Likes

Yes, I’m aware of this. I saw it in an issue on the @rwightman repository. There are a couple of papers that use Mish but don’t mention it. Still, it’s good to see it being validated on varied tasks.

2 Likes

Faster and more accurate Mish in OpenCV - https://github.com/opencv/opencv/pull/17621

2 Likes

Mish also provides good adversarial robustness, as verified by the authors of a recent Google paper on the ability of smooth activations to provide higher adversarial robustness.
Link to the tweet - https://twitter.com/cihangxie/status/1278053759197872129?s=19
Thanks to @morgan for bringing this to my attention.

1 Like

That’s great! I was wondering why they didn’t show results for Mish in that paper… Nice to hear that the authors actually did have results for it…

1 Like

It seems that Mish gets the highest standard accuracy and comes pretty close to GELU in terms of adversarial robustness. I guess since Mish isn’t published at a conference, they chose not to put its results in the paper.

2 Likes

Folks here might be interested in this popular NN Playground environment which I forked to add both Sine and Mish activation functions.

4 Likes

This is awesome. Thanks!

1 Like

In my quick experiments, Mish felt very stable (it converges with plenty of hyperparameters) and smooth (reasonable-looking shapes) for classification. For regression, on the other hand, it was quite fragile, yet smooth and slightly fuzzy around the borders when it did work.

This is surprisingly similar to my experience with Mish. Once it starts to converge, it usually reaches a good point. However, getting it to converge can sometimes be difficult. Has anyone found practical tips for this?

Mish is quite sensitive to the learning rate in some cases. I haven’t done thorough hyperparameter tuning to find what’s best for Mish, but it usually works in all the tasks I’ve used it in.

3 Likes