That’s some great work though. I’m extremely occupied with the TensorFlow Addons implementation of Mish (the discussion you are a part of).
No worries, feel free to use Mish at your own discretion.
I will look forward to the results.
I’m working on the ImageNet results now.
This was a fairly interesting test: classification on a Kaggle challenge involving low-level features (actually segmentation, but I was pre-training the encoder on classification and might use it for prediction). I tested EfficientNet-B0 (the RWightman version, as I liked it a little more than lukemelas’), Swish vs. Mish, with and without the pre-trained weights. Accuracy (threshold at the default 0.5):
So pre-training helped a lot in spite of the activation change; it basically matched Swish, but was slower to start.
Training loss (with mixup so flattens out more quickly than normal):
So, you can see the slower start here as well, and in fact the pre-trained Mish was slower to start than the non-pre-trained one but then quickly caught up. This also doesn’t show the first epoch, as the first data point is the end of it (I should look at adding an initial eval for nicer logging).
That wasn’t with any particular care taken either. I was using differential learning rates, but the same as for Swish (except a slightly lower LR for Mish based on the graph; if repeating I’d skip that). I’d also look at trying to settle the network into Mish a little first: something like an epoch with everything frozen except the BatchNorm layers (fastai won’t freeze those by default), then maybe a higher differential rate. This was also with a default-initialised classifier, so I’m quite impressed it didn’t go off the rails.
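For anyone wanting to try that warm-up idea, this is roughly what I mean. It’s only a sketch using fastai v1, and it assumes your model has been split into layer groups so that freezing and sliced learning rates apply; `data` and `mish_model` here are placeholders for whatever you’re actually training:

```python
from fastai.vision import *

# `data` (a DataBunch) and `mish_model` are placeholders, not real objects from this thread.
learn = Learner(data, mish_model, metrics=[accuracy])   # train_bn=True is the fastai default,
                                                         # so BatchNorm keeps training while "frozen"

# learn.split(...)  # split the model into layer groups first, so freeze()/differential LRs work

learn.freeze()                               # freeze all but the last layer group (the head)
learn.fit_one_cycle(1, 1e-3)                 # one epoch to let the head and BN stats settle into Mish
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-5, 1e-3))    # then differential LRs: lower for earlier layers
```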
It could be a lack of optimisation of batch size and effective LR. Some initial tests (I’ll fill in more; this is just what I managed with free GPU time): Imagewoof accuracy across 5 runs, with otherwise the original script defaults (minus the distributed training, which was messing up on one card), so xresnet50/adam/one_cycle:
| lr | bs | mean | std |
|---|---|---|---|
| 0.001 | 64 | 0.5752 | 0.028657 |
| 0.001 | 128 | 0.5392 | 0.017297 |
| 0.001 | 256 | 0.5040 | 0.013416 |
| 0.002 | 64 | 0.6148 | 0.009445 |
So, larger batch sizes hurt quite a bit, as does a lower LR. I’m not quite sure how the distributed training works (or what the different options mean), but I think it accumulates losses across cards, so the effective batch size is larger than the 64 per card.
More generally, I don’t think Jeremy spent that much time getting good scores; I think he was more filling out some baselines (I think he said in the lecture they should be pretty easy to beat).
Thanks for your benchmarks as well. This makes sense. However, it means Ranger is not as impressive an optimizer as I thought. Were you able to try out Adam on Imagenette?
I only tried Imagenette in some initial tests, nothing saved. I can run some more tomorrow, so I’ll do that then.
Hi, I changed the activation function to Mish in the TPU tutorial that was released together with PyTorch 1.3.
This is my first try at any TPU code, but I thought it might be interesting for you to see.
I could not tell whether it is better than ReLU based on this small tutorial, but the nice thing is that a ResNet-18 can be trained from scratch in a few minutes on a TPU.
All the code is from the tutorial; the only change is to the activation function:
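In case it’s useful, the swap itself looks roughly like this. This is only a sketch of the idea, not the exact notebook cell; Mish is `x * tanh(softplus(x))`, and here it replaces every `nn.ReLU` in a torchvision ResNet-18:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def replace_relu_with_mish(module):
    # Recursively swap every nn.ReLU child module for Mish.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Mish())
        else:
            replace_relu_with_mish(child)

model = resnet18(pretrained=False)   # training from scratch, as in the tutorial
replace_relu_with_mish(model)
```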
Great work. Thanks for the interactive notebook. Everything seems fine.
@Diganta Are you a member of kagglenoobs.slack.com? If not, I recommend you join. There was a discussion about Mish, and some members are somewhat critical of it. I don’t know your work well enough, so it would be better for you to respond to the criticisms and gather some feedback.
@ilovescience Thank you for redirecting me to the discussion. It’s indeed very demoralizing, and I would really appreciate taking some time off after going through that. I understand there is always criticism, but outright discarding more than 70 benchmarks (with all the notebooks present in the repository) just because I didn’t release one kernel on Kaggle, and tagging it as “made up”, is extremely disturbing. I acknowledge and respect him as one of the most highly praised people in the Kaggle community, but disrespecting and discarding someone’s effort in that manner is something I won’t accept or respect.

I think people have misinterpreted this as me trying to promote my work, while all I have done is ask the community to try it and give feedback, and I don’t see any crime in that. This community has been amazing; everyone here has been supportive and helpful, and even the constructive criticism here has been extremely useful, but that thread was really heart-breaking and unforgiving. We really need to think about how we use this community to uplift others rather than slam-dunk their efforts, because I’m not researching for my own benefit here. Mish is an activation function that I hope others will try; I have no personal agenda or monetary advantage in it, and I’m sorry to see how well-reputed people react to new research from a comparatively new individual.
I understand this is your work that you have put a lot of time into, but I think you shouldn’t take the criticism too seriously. Regarding Abhishek’s criticism, I think he just wasn’t aware of your repository. For some of the other criticisms, I think they are putting your paper to the level of scrutiny that one would have for a paper coming out of a university research lab with like 10 authors. However, I feel that is definitely not fair in this case.
Anyway, I understand you would get demoralized after reading that conversation, but I hope that they will be able to have a constructive conversation with you regarding how you might be able to improve your work.
This happens due to the bias present in research these days. If the same paper had been published by Google Brain, I bet it would have been low-key celebrated. I don’t understand why there are fewer people who actually appreciate and more people who actually criticize, and maybe that’s the reason so many are even scared to put out their results (just like me: Mish had been there since April, and I only made it public in August).
Yes, it is demoralizing (getting laughed at just because I have a logo is shambolic and absolutely inappropriate; it’s basically disrespecting someone’s personal preference, much like me criticizing someone’s user name on a public forum, which makes no sense at all).
It wasn’t just demoralizing; it was beyond comprehension to be labelled as shady and “not trustworthy” before even being given a chance. I can understand that I didn’t share one kernel out of the 75 (which again is hardly justified grounds for criticism), and if there was an impression that I was making up the results, they could easily have tried it out for themselves.
I don’t have much to add except keep your chin up @Diganta! I’ve really appreciated all of the effort and hard work you have put in to benchmark and demonstrate the usefulness of Mish!!
@morgan Thank you!
@Diganta
From what I’ve seen you’re honest, do all your work openly and are very receptive to feedback. People should focus on constructive criticism.
Thank you! I hope the community makes some amendments within itself and reduces the amount of toxicity it has; sometimes the comments being passed are not well thought out in terms of how seriously they might affect an individual.
@Diganta: I’m so sorry - I really don’t know what to say. Sadly, there are a lot of toxic people in the machine learning community. Toxic things can, by definition, poison you. The only solutions are:
- Try to stay away from them
- If you come into contact with them, detoxify as best as you can.
In this case, that means, for instance, recognizing where the toxicity is coming from (generally it means people feel threatened, or it comes from hubris / arrogance / ignorance, etc), and remembering that the negative feedback is a tiny fraction of the positive feedback (e.g. in this case, I’ve not seen a single negative reaction to your work IIRC, but I’ve seen literally hundreds of likes, retweets, etc).
Please do take some time away if you need to. But then do come back and keep doing your great work!
BTW, I get this kind of reaction to my work on a regular basis. I find it heart-breaking and energy-sapping. It often makes me feel like giving up. So far, I haven’t done so, because in the end, I know that for every toxic comment, there’s hundreds or thousands of positive reactions.
I don’t know why our brains so love to highlight the negativity they come across - but we have to try to fight back against our silly brains as best as we can when that happens!
Man, ignore them. You did absolutely amazing work, and your attitude to the experiments is much more truly scholarly than that of many PhDs.
Yes, Mish is “made up”. So what? The most cited computer vision paper ever - SIFT - was also made up. It was also rejected 2 or 3 times before finally getting published.
Now, regarding the criticism: would the Mish paper be rejected from an ML conference? Does it look like “not a proper paper”? Probably.
But the main reason for this is easily fixable: appearance. You just need to take a proper LaTeX template and grab some good matplotlib code snippets to make the thing look nice.
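For example (just a hedged illustration, the exact values don’t matter), a few rcParams tweaks and saving to PDF already go a long way towards paper-ready figures:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({
    "figure.figsize": (5, 3.5),   # fits a single LaTeX column
    "figure.dpi": 150,
    "font.size": 9,
    "axes.grid": True,
    "grid.alpha": 0.3,
    "savefig.bbox": "tight",
})

epochs = np.arange(1, 11)
loss = 2.0 * np.exp(-0.3 * epochs)   # dummy curve, only to show the styling

fig, ax = plt.subplots()
ax.plot(epochs, loss, label="Mish (dummy data)")
ax.set_xlabel("Epoch")
ax.set_ylabel("Training loss")
ax.legend(frameon=False)
fig.savefig("loss_curve.pdf")        # vector output drops straight into a LaTeX figure
```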
Lots of submissions from top universities, on the other hand, are in the opposite situation: a great-looking paper with vague or weak results. That is much harder to fix.
Hi Diganta,
This is the first thread that I have really followed, and it is also the starting point for me trying to improve networks. Your activation function feels correct, and that is also supported by the data. So what I am doing now is trying to apply ResNets with Mish instead of ReLU.
Since I have only read about Mish on this forum, I had just assumed that Mish was a huge success everywhere. Sorry to hear about the toxicity; I have also run into it in the MachineLearning community on Reddit.
What would be really good is if we could download pretrained ResNets with Mish instead of ReLU for better performance.
Thank you so much for the support @jeremy. It means a lot. I should also apologize for having taken the criticism as personally as I did; maybe because I have invested a lot in this project, I was affected by how the work was regarded by someone as prominent as him in the community. Criticism and questioning are far removed from toxicity, and nobody should shy away from them, but judging a book by its cover is not the same thing. I will be more professional and less personal about any negative criticism from now on and will try my best to improve from this. As you know, I’m just an undergrad taking turtle steps to reach where I aspire to be; it’s a long and treacherous road in many ways, but I’m up for the challenge.
I don’t know about the level of toxicity that exists, but we can definitely all be more mature in how we question someone’s effort. Nobody here is claiming their work to be SOTA or flawless, and questioning will only help it improve.
For instance, I got to know about the focal loss first-derivative comparison from your Twitter thread yesterday, and I thank you for that.
Lastly, I apologize to everyone for having been less mature in handling criticism; I shouldn’t have taken things personally. I don’t want to be at the center of an escalated discussion that is not fruitful.