Meet Mish: New Activation function, possible successor to ReLU?

Hey all.
Today (December 8, 5 pm PT) I’ll be speaking at the Weights & Biases Salon along with Maithra Raghu from Google Brain. My talk will be about Smooth Activations, Robustness and Catastrophic Forgetting where I will present a new hypothesis for lifelong learning. Maithra will talk about her paper “Do Wide and Deep Networks learn the same thing?”
RSVP on zoom
YouTube Live
It’s my first time giving a talk alongside one of my inspirations in this field, and I would really appreciate it if you all could hop in.
Thank you!


Sorry, I read this too late to join :frowning: but I watched it afterward ^^

All the talks are recorded. Let me know if you have any questions regarding my presentation once you go through it. Thanks!


Well, I went through your whole presentation today at dawn, 02:50 (UTC+01:00), when I wrote my previous reply, and then I went to sleep (you know I’m in the EU), which is why I’m only replying again now ^^.

I don’t really have questions because I know your work in great detail: I read your paper when you published it on arXiv, I checked your GitHub repo right when the code was released there, and I also follow your posts here on the forum :slight_smile:
(So I’m also familiar with your Triplet Attention network for other reasons, ofc it’s offtopic - the video was about the Mish activation function, its properties and implications)

I can follow the explanations/conversations in the video, but I had to really concentrate on what’s going on :smiley: (I found it easier to read your paper at my own pace back then)
I only have 1 suggestion:
You could add YouTube subtitles to the video, which would make it easier to follow for those who don’t already know your work or your main points. (That’s not me, but I’m thinking of other people.)

Btw, keep up the good work! :wink:


Hey, thanks for the appreciation and also your suggestion.
The talk was not primarily about Mish, though. I wanted to connect the dots and present this new dilemma/trilemma, which can arise and can potentially be solved by smooth activations.

Yes I guess from next time onwards I’ll make subtitles available. Also just for everyone here, this is the link to my presentation slides.
This is the final video uploaded.

Also, since GitHub has now made Discussions available to all public repos, please feel free to use the discussion forum on my repository to discuss anything about Mish, Activations or Non-Linear Dynamics in general.

Discussion Forum Link


It would be awesome if you all would consider taking part in this. :slight_smile:


I hit 1k stars on my repo. It’s insane how much the project has grown. I also hit 100 citations on Google Scholar (I still don’t know why it doesn’t show the actual count, which is 121).
Wanna sincerely thank everyone here for all that they have done to support me throughout this project, I feel blessed and honoured.


Well done :smiley: :smiley: :clap: :clap:


Thank you! :sweat_smile:

Hey guys,

It seems there is growing interest in the community in getting Mish added to core PyTorch, just as SiLU and Hard Swish have been. Link to the issue -

Mish was previously added to several experimental branches of PyTorch by internal PyTorch members.

If you believe it would be useful to have it in PyTorch, it would be awesome if you could leave a comment on that issue thread.

Thanks! :slight_smile:


20k views :dizzy_face:


Some exciting updates:
I will soon be releasing new benchmarks with Mish for Object Detection and Instance Segmentation models. It will be a whole suite based on MMDetection and powered by Weights & Biases.


First model in the series: Mask RCNN with a ResNet-50 + Mish. Links (includes log files and weights):
Weights & Biases Dashboard:

@muellerzr it says I can’t reply to this thread more than six consecutive times, and that someone needs to reply to my last post before I can create a new reply. Is this expected? If so, is there a workaround other than editing old posts? Thanks!


Second model in the series: Faster RCNN with a ResNet-50 + Mish. A whopping 1.3% AP boost over vanilla Faster RCNN. Links (include log files and weights):
Weights & Biases Dashboard:
Per Epoch Performance Dashboard on WandB:

Some results:

@muellerzr works now


Thanks to the massive efforts of Javier Ideami, there is now an interactive web visualizer of the loss landscapes of a ResNet with Mish, Swish and ReLU. Link.


In 2019, I started a small research group called Landskape with the vision of fostering inter-disciplinary research into the “whys” and “hows” of deep neural networks. Today I’m so excited to launch our Twitter page for Landskape. We have, and have had, researchers and students from UIUC, IIT-G, MILA, KAIST, HKUST and Imperial College, with collaborators from Continual AI and Google, and with backgrounds in Math, Physics, Computer Science and even Design. As of now, we are working on projects in the domains of Super Resolution and Continual Learning. To stay updated with our research, consider following our Twitter channel or visiting our page.


Heads up: Mish is now added to PyTorch and will be included in the 1.9 release. View the merged PR here.
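
For anyone who wants to sanity-check an implementation against the definition, here is a minimal pure-Python sketch of Mish, `mish(x) = x * tanh(softplus(x))`. It is only a scalar reference for illustration, not the PyTorch kernel; the helper names `softplus`, `mish` and `relu` are my own. In PyTorch 1.9 and later the native version is exposed as `torch.nn.Mish` and `torch.nn.functional.mish`.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    # Rewritten as max(x, 0) + log1p(exp(-|x|)) so exp() never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x: float) -> float:
    """ReLU, shown only for comparison."""
    return max(x, 0.0)


if __name__ == "__main__":
    # Unlike ReLU, Mish is smooth and lets small negative values through.
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  mish={mish(x):+.4f}  relu={relu(x):+.4f}")
```

For large positive inputs Mish approaches the identity, and for large negative inputs it approaches zero, which is why the stable form of softplus matters at the extremes.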


I ran some benchmarks with the PyTorch native Mish implementation, other Mish implementations, and a few other activation functions. For the most part it’s quite good.

On a Tesla V100, native Mish was faster than native ReLU, and for float16 it was the fastest on the forward pass and second fastest on the backward pass. On a Tesla P100, native Mish was faster than or tied with all the other Mish implementations, including MishCuda, but lagged behind ReLU.

The only sore spot was CPU performance, where native Mish was significantly slower than both a TorchScript version and raw PyTorch during the forward pass.
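
If you want to run a quick comparison of your own, a timing harness along these lines is one way to start. This is a rough sketch using only the standard library and scalar Python functions, so it is not the setup behind the numbers above and says nothing about GPU or PyTorch performance; `best_time` is a hypothetical helper name. With PyTorch you would pass tensor functions such as `torch.nn.functional.mish` instead, and on GPU you would call `torch.cuda.synchronize()` before reading the clock.

```python
import math
import timeit


def mish(x: float) -> float:
    """Scalar Mish: x * tanh(softplus(x)), with a stable softplus."""
    sp = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(sp)


def relu(x: float) -> float:
    """Scalar ReLU."""
    return max(x, 0.0)


def best_time(fn, inputs, repeats=5, number=10):
    """Best wall-clock time in seconds for one pass of fn over inputs."""
    timer = timeit.Timer(lambda: [fn(v) for v in inputs])
    # Taking the minimum over repeats reduces noise from other processes.
    return min(timer.repeat(repeat=repeats, number=number)) / number


if __name__ == "__main__":
    inputs = [i / 100.0 - 5.0 for i in range(1000)]  # values in [-5, 5)
    for name, fn in (("relu", relu), ("mish", mish)):
        print(f"{name}: {best_time(fn, inputs) * 1e6:.1f} us per pass")
```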


Interesting, the benchmarks link leads me to a 404. Can you share the correct link?

I copied an old link with the wrong date. Should be fixed now.
