Meet Mish: New Activation function, possible successor to ReLU?

@rwightman Do you have any ImageNet benchmarks for Mish which you’d be okay sharing with me? I’d love to include them in the camera-ready version of the Mish paper accepted to BMVC and give you appropriate credit.
The deadline is 13th August. Let me know. Thanks!

1 Like

@Diganta unfortunately not much of interest. In the experiments I’ve done, I often get good results with Mish when focusing on good results in minimal epochs or with minimal tuning of other hparams. I’d say those aren’t too interesting as they mirror your findings.

When I’m pushing the long-epoch, heavy-aug training on ImageNet, etc., I haven’t seen a consistent and statistically significant enough boost to call it a clear improvement over ReLU or Swish. That’s not to say better results aren’t possible, but it’s tough to commit the time/compute to dig in further, given the cost of those training sessions, when my augmentation/training-recipe changes are yielding larger gains.

For instance, I recently put together a PyTorch impl of CSPResNeXt/CSPResNet/CSPDarkNet53, which I noticed you did some experiments on. Your results were a clear improvement over the darknet impl baselines, but using the default leaky ReLU and different training hparams yielded results beyond either. I managed a top-1 of 80.04 for CSPResNeXt50 (at 224x224, not the normal 256), 80.06 for CSPDarkNet53 at 256x256, and 79.5 for CSPResNet50 at 256x256.

I’d love to rerun those experiments with Mish, but given how long they took the first time, and the performance penalty, it’s not at the top of the priority list.

3 Likes

Ah, no worries. Obviously there’s a lot of room for hyper-parameter tuning to find the settings that maximize the gain.
Additionally, I need some help with EvoNorm. Here’s my implementation. Could you please check what the potential issue might be? I’m seeing very high memory consumption, and the B0 variant doesn’t perform on par at batch sizes of 1 or 2. Thanks!

The BMVC version of the Mish paper is now out on arXiv with more details, insights, and results. Link - https://arxiv.org/abs/1908.08681v3
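For anyone who wants to try it out quickly, the function itself is just x * tanh(softplus(x)); here’s a minimal PyTorch module (the naive version, with no attempt at memory optimization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```

It’s a drop-in replacement anywhere you’d otherwise use nn.ReLU().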

3 Likes

I made an attempt at EvoNorm a few months back. It is indeed very slow and definitely a memory pig: quite a few ops over lots of elements, so it’s very hard to compete with the cuDNN impl of batch norm. The heavy memory usage kind of defeats the point of it all, as it forces the batch size down towards 1 very quickly just by switching to it. My impl also doesn’t make much use of inplace ops, which could speed things up further when jit script is applied.
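For reference, the S0 variant boils down to roughly the following (a from-memory sketch with illustrative names and defaults, not my exact timm code). Every step of the group std, plus the swish-like numerator, materializes another full-size intermediate tensor, which is where the memory goes:

```python
import torch
import torch.nn as nn

class EvoNormSample2d(nn.Module):
    # Rough sketch of EvoNorm-S0: y = x * sigmoid(v * x) / group_std(x) * w + b.
    # The groups/eps defaults here are illustrative.
    def __init__(self, num_features: int, groups: int = 8, eps: float = 1e-5):
        super().__init__()
        self.groups = groups
        self.eps = eps
        self.v = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):
        B, C = x.size(0), x.size(1)
        # group std: each reduction/elementwise step allocates an intermediate
        xg = x.reshape(B, self.groups, C // self.groups, x.size(2), x.size(3))
        mu = xg.mean(dim=[2, 3, 4], keepdim=True)
        var = ((xg - mu) ** 2).mean(dim=[2, 3, 4], keepdim=True)
        std = (var + self.eps).sqrt().expand_as(xg).reshape_as(x)
        # the swish-like numerator adds two more full passes over the tensor
        return x * torch.sigmoid(self.v * x) / std * self.weight + self.bias
```

Compare that to batch norm, where cuDNN does the whole thing in one fused kernel.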

I did find that jit scripting the modules helped. I created a helper factory that wraps each instance with `self.norm = torch.jit.script(EvoNormSample2d(chs))`, and it made things a little quicker. Even then, I didn’t have the patience to do any real training with it, but I did verify it was converging on a smaller dataset. It may be one of those ‘good in theory, bad in practice’ things…
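The factory is nothing fancy, basically just this (a sketch; `evo_norm_jit` and `ConvBlock` are made-up names to show where it slots in, and EvoNormSample2d is as sketched above):

```python
import torch
import torch.nn as nn

def evo_norm_jit(chs: int):
    # Wrap each instance in torch.jit.script so the chains of pointwise ops
    # inside it can be fused (see the PyTorch version caveat in the EDIT below).
    return torch.jit.script(EvoNormSample2d(chs))

class ConvBlock(nn.Module):
    # Made-up example block, just to show where the scripted norm slots in.
    def __init__(self, in_chs: int, out_chs: int):
        super().__init__()
        self.conv = nn.Conv2d(in_chs, out_chs, 3, padding=1, bias=False)
        self.norm = evo_norm_jit(out_chs)

    def forward(self, x):
        return self.norm(self.conv(x))
```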

My impl is here if you want to give it a try and see if it behaves any differently… https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/evo_norm.py

Funny thing, I just found a really silly bug in my S version after revisiting it just now. It actually works now, or at least converges quickly. A break helps you see things in a different light.

EDIT: Also, don’t try the jit script thing in PyTorch 1.5.x; use 1.4 or 1.6. 1.5 killed the jit op fusion, which was the whole point…

6 Likes

Mish is currently part of the 5th best model on the MS-COCO test-dev benchmark on paperswithcode (CSP-p6 + Mish, to be precise). This model is also SOTA on the APS, APM, and APL metrics, beating EfficientDet-D7x.
Additionally, I did a major revamp of my repository, so if anyone wants to give feedback, you’re more than welcome to do so.
Thanks!

Quick update: the CSP-p7 detector with Mish is currently SOTA for object detection on MS-COCO (test-dev). Additional details are on paperswithcode.

2 Likes

Niiiice! Quick one: have you had time to explore how Mish performs in transformer architectures?

1 Like

Hi Morgan. Unfortunately no, I haven’t played around with Transformers much myself, so I haven’t tried Mish on them yet. I’d love for someone to try it out, though. Recently I have been seeing many nice results using Mish in different tasks like 3D shape descriptors, volumetric occlusion prediction, scene flow, and segmentation (credits to @muellerzr).
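If anyone does want to give it a spin, a quick (hacky) way is to recursively swap out a model’s existing activation modules; a rough sketch, where nn.GELU is just the usual transformer default and `my_transformer` is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Same minimal module as earlier in the thread.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_activations(module: nn.Module, old_cls=nn.GELU, new_cls=Mish) -> nn.Module:
    # Recursively replace every activation of type `old_cls` with `new_cls()`.
    # A quick experiment hack, not a polished API.
    for name, child in module.named_children():
        if isinstance(child, old_cls):
            setattr(module, name, new_cls())
        else:
            swap_activations(child, old_cls, new_cls)
    return module

# e.g. model = swap_activations(my_transformer)
```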
Currently, I am working on something more exciting that has shown much promise compared to Mish. Let’s see how that goes.

5 Likes

wuuuut! Very excited for this. Will let you know if I manage to do some proper testing with Mish in Transformers

1 Like

Great, keep me posted.
I’ll post updates here as they come by.

2 Likes


This is our new work (not the one I mentioned in this thread earlier). Would love some feedback. @muellerzr @morgan @LessW2020 @ilovescience

6 Likes

Small personal update: I have now joined Weights & Biases full-time as a Machine Learning Engineer. There are a lot of exciting features ahead in the pipeline, and I hope to make the WandB integration in fastai more seamless than it is now.
Mish really brought me to this position, so I wanna thank all of you for supporting me along the way!!
Cheers!

12 Likes

Good luck then ^^ :slight_smile:

1 Like

Thank you!! :slightly_smiling_face:

Hey all.
Today (December 8, 5 pm PT) I’ll be speaking at the Weights & Biases Salon along with Maithra Raghu from Google Brain. My talk will be about Smooth Activations, Robustness and Catastrophic Forgetting, where I will present a new hypothesis for lifelong learning. Maithra will talk about her paper “Do Wide and Deep Networks learn the same thing?”
RSVP on zoom
YouTube Live
It’s my first time giving a talk with one of my inspirations in this field, and I would really appreciate it if you all hopped in.
Thank you!

4 Likes

Sorry, I read this too late to join :frowning: but I’ve watched it now afterward ^^

All the talks are recorded. Let me know if you have any questions regarding my presentation once you go through it. Thanks!

2 Likes

Well, I went through your whole presentation today at dawn, 02:50 (UTC+01:00), when I wrote my previous reply, and then I went to sleep (you know I’m in the EU) - that’s why I’m only replying again now ^^.

I don’t really have questions because I know your work in great detail: I read your paper when you published it on arXiv, I checked your GitHub repo right when the code was released, and I also follow your posts here on the fast.ai forum :slight_smile:
(So I’m also familiar with your Triplet Attention network for other reasons; of course that’s off-topic - the video was about the Mish activation function, its properties, and its implications.)

I can follow the explanations/conversations in the video, but I had to really concentrate on what was going on :smiley: (back then I could read your paper more easily at my own tempo).
I only have one suggestion:
You could add YouTube subtitles to the video; that would make it easier to follow for those who don’t already know your work or your main points. (It’s not for me, but I’m thinking of other people.)

Btw, keep up the good work! :wink:

1 Like

Hey, thanks for the appreciation and also for your suggestion.
The talk was not primarily on Mish, though; I wanted to connect the dots and present this new dilemma/trilemma which can arise and can potentially be solved by smooth activations.


Yes, from next time onwards I’ll make subtitles available. Also, just for everyone here, this is the link to my presentation slides.
This is the final uploaded video.

Also, since GitHub has now made Discussions available to all public repos, please feel free to use the discussion forum on my repository to discuss anything about Mish, activations, or non-linear dynamics in general.

Discussion Forum Link

4 Likes