Meet Mish: New Activation function, possible successor to ReLU?

@LessW2020 I am trying to run Running Batch Norm as described in this post: Running batch norm tweaks. When it runs update_stats I get the following:

--> 24         s  = x    .sum(dims, keepdim=True)
     25         ss = (x*x).sum(dims, keepdim=True)
     26         c = self.count.new_tensor(x.numel()/nc)

IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)

I believe this has to do with how tabular tensors are passed in as two separate entities (continuous and categorical variables). How should I go about fixing this?
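In case it helps: the update_stats code from that post presumably reduces over dims = (0, 2, 3), which assumes 4D conv activations, while the continuous part of a tabular batch is 2D (batch, features), so dim 2 is out of range. Here is a minimal sketch of making the reduction dims depend on the tensor rank (the helper name is hypothetical, not from the notebook):

```python
import torch

def update_stats_sums(x):
    # Hypothetical helper: choose reduction dims by rank, so the same
    # running-BN sums work for 4D conv activations (N,C,H,W) and for the
    # 2D continuous part of a tabular batch (N,F).
    dims = (0, 2, 3) if x.dim() == 4 else (0,)
    s  = x.sum(dims, keepdim=True)
    ss = (x * x).sum(dims, keepdim=True)
    return s, ss

s, ss = update_stats_sums(torch.randn(64, 10))  # 2D tabular batch
print(s.shape, ss.shape)                        # torch.Size([1, 10]) torch.Size([1, 10])
```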

Super interesting thread and discussion! :smiley:

If I understood the previous posts correctly, this was the best model setup so far:

  • Mish activation function
  • change the order to Conv-Mish-BN (this order already seems to be included in the ConvLayer class of fastai v2 dev.; see the sketch after this list)
  • hyperparameters, optimizer, and training schedule as in the notebook
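For concreteness, here is a plain-PyTorch sketch of that Conv-Mish-BN ordering (an illustration only, not the fastai ConvLayer itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish: x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def conv_mish_bn(ni, nf, ks=3, stride=1):
    # Conv -> Mish -> BN, i.e. the activation placed before the norm.
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
        Mish(),
        nn.BatchNorm2d(nf),
    )

block = conv_mish_bn(3, 64)
print(block(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```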

I tried my Squeeze-Excite-XResNet implementation (with AdaptiveConcatPool2d and Mish), based on the fastai XResNet, and got the following results after 5 runs:
0.638, 0.624, 0.668, 0.700, 0.686 --> average: 0.663
… so this did not really improve on the SimpleSelfAttention approach from the notebook.
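For context, the squeeze-and-excitation block is along these lines (a minimal sketch; the reduction ratio of 16 is the common default and an assumption here, not taken from the notebook):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-excitation: global-average-pool to per-channel statistics,
    # pass them through a small bottleneck, then rescale the input channels.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

x = torch.randn(8, 64, 16, 16)
print(SEBlock(64)(x).shape)  # torch.Size([8, 64, 16, 16])
```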

Then I combined the MXResNet from LessW2020 with the SE block to get the SEMXResNet ( :wink: ) and got the following results (+ Ranger + SimpleSelfAttention + Flatten Anneal):
0.748, 0.748, 0.746, 0.772, 0.718 --> average: 0.746

And with BN after the activation:
0.728, 0.768, 0.774, 0.738, 0.752 --> average: 0.752

And with AdaptiveConcatPool2d:
0.620, 0.628, 0.684, 0.714, 0.700 --> average: 0.669

With the increased parameter count (e.g., from the SE block and/or the doubled FC head input after AdaptiveConcatPool2d), 5 epochs are very likely not enough for a fair comparison with the models that have fewer parameters, since the larger models will need more time to train (as mentioned above).
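For reference, AdaptiveConcatPool2d concatenates adaptive average and max pooling, which is why the FC head's input width doubles. A minimal sketch of the idea (not fastai's implementation verbatim):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    # Concatenate adaptive max- and average-pooling, so the head that follows
    # sees 2*C features per example (hence the doubled FC input mentioned above).
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)

x = torch.randn(8, 512, 7, 7)           # e.g. the last conv feature map
print(AdaptiveConcatPool2d()(x).shape)  # torch.Size([8, 1024, 1, 1])
```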

I have also seen the thread from LessW2020 about Res2Net. Has somebody already tried it in a similar fashion and gotten some (preliminary) results?

4 Likes

Just completed a thorough benchmark on SE-ResNet-50 on CIFAR-10.
Parameters:
epochs = 100
batch_size = 128
learning_rate = 0.001
Optimizer = Adam

Observation notes: Mish's loss is extremely stable compared to the others, especially E-Swish and ReLU. Mish is also faster than GELU. Mish was the only activation function to cross 91% in the 2 runs, while the other activation functions reached a maximum of 90.7%.
Mish's highest test top-1 accuracy was 91.248%. SELU performed the worst throughout.

(Please click on the picture to view it at a larger size.)

4 Likes

I think you mixed up the axis labels on the accuracy and loss curve graphs. Epochs should be on the x-axis, not the y-axis.

1 Like

Corrected! :slight_smile:

Not yet – I was going to wait until I am finished with the project, since this was just a small part of something much bigger. But I’ll definitely share it here when I do!

1 Like

I just want to note that I reran my tabular models, and here is one thing I noticed. The increase in accuracy appeared specifically when I included information via time-based feature engineering. When the paper is published I will be able to share more details, but I definitely noticed the change specifically there. In all other instances there was no radical difference. The moment that was introduced, accuracy shot from 92% to 97% with negligible error. This, I believe, is what was missing. I will note that this time-based feature engineering is not like Rossmann’s. I will try plugging in a Mish activation, as I believe it could shine here.

4 Likes

I would really like some more in-depth details on this. Are you working on a paper, as you wrote? If yes, do send the arXiv link once it’s up.

1 Like

I am, which is why I can’t say too much on here right now! :slight_smile: But I will definitely send the arXiv link once it’s up :slight_smile:

1 Like

Awesome. Do provide the Mish scores once you have them.

The reason I was hoping for Mish is that I have been thoroughly testing the new optimizer + Mish on a variety of datasets with no changes whatsoever. But with this dataset I saw direct improvements from the new optimizer + scheduler. Sadly, the only piece that did not make a difference was Mish :frowning: My model was set up with two hidden layers of 1000 and 500. I ran it in three separate areas where I had seen various improvements (with the optimizer and in general), but to no avail :frowning: Sorry @Diganta! The tabular mystery is still a thing! (This whole experiment has gotten me re-thinking tabular models as a whole.)
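For reference, the network shape described above is roughly the following (a plain-PyTorch sketch; only the 1000/500 hidden sizes come from the post, while the BatchNorm placement and the absence of embeddings/dropout are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def tabular_net(n_cont, n_out, hidden=(1000, 500), act=Mish):
    # Two hidden layers of 1000 and 500 as mentioned above; everything else
    # here is assumed for illustration, not the exact fastai model.
    layers, sizes = [], (n_cont, *hidden)
    for ni, nf in zip(sizes, sizes[1:]):
        layers += [nn.Linear(ni, nf), act(), nn.BatchNorm1d(nf)]
    layers.append(nn.Linear(sizes[-1], n_out))
    return nn.Sequential(*layers)

model = tabular_net(n_cont=20, n_out=2)   # swap act=nn.ReLU to compare
print(model(torch.randn(64, 20)).shape)   # torch.Size([64, 2])
```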

1 Like

I do agree that tabular models are a mystery of their own. But I would really like to see your progress.

I’ll send you a DM :slight_smile:

1 Like

I’ve been out of the loop here for a bit as I have a lot of consulting work in progress, but @muellerzr, I’m happy to see you might have a paper out soon!

Re: tabular data - here’s a recent paper and apparently open source code that may be of great interest:

I want to test it out but too swamped atm.

Also, I wanted to add that I’m having really good results with the Res2Net Plus architecture and the Ranger beta optimizer. I’ll have an article out soon on Res2Net Plus, and then on the Ranger beta and the paper it’s based on. It trains really well… not fast enough for leaderboards, but excellent for production work.

Hope you guys are doing great!

5 Likes

@LessW2020 very good find! I’ll try it out and report back :slight_smile: Hope you are doing well too (and not too too swamped! :slight_smile: )

Only bit that concerns me: “Our implementation is memory inefficient and may require a lot of GPU memory to converge”

1 Like

Small Update:


Will do this for all 21 activation functions I have been benchmarking.

7 Likes

@Diganta Congratulations on your work!
Now it is clear that the Mish activation boosts accuracy as well as model stability on SqueezeNet.
I have been following this thread and the Imagenette/Imagewoof one and was super excited about MXResNet, where Mish also boosted the model.
Do you know if there is any benchmark of EfficientNet replacing the Swish activation with Mish?

Exciting times we are living in; you never know if the current SOTA will survive the next week.
Thanks again for continuing to push this forward.

3 Likes

@LessW2020 I’ve been running some trials with Ranger913A over the past few days with EfficientNet-b3 + Mish on Stanford-Cars. Overall I found it matched the accuracy of “vanilla” Ranger but didn’t beat it. It also needed a higher lr (10x) than Ranger and was about 15% slower to run.

If there are 1 or 2 other parameter configurations you feel are worth exploring, I can probably make time to trial them. However, in general with Ranger913A and Ranger I’ve found 93.8% to be a hard ceiling with b3, although I have seen 93.9% after adding Mixup.

All results are in the “Ranger913A” notebook here: https://github.com/morganmcg1/stanford-cars

Results

Accuracy, last 10e (full 40e plots in the notebook)

Validation loss, last 10e

@mmauri have a look at my repo above, and at my other results with EfficientNet + Mish + Ranger: [Project] Stanford-Cars with fastai v1

It’s not on Imagenette/Imagewoof, but I’ve found Mish + Ranger to be super helpful. Also, @ttsantos and I are working on a PR for the efficientnet_pytorch library to allow you to toggle the activation function between Mish, Swish and ReLU…
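Until that PR lands, a stopgap for anyone who wants to try it is to recursively swap the library's Swish modules for Mish. A minimal sketch, assuming the Swish/MemoryEfficientSwish classes are importable from efficientnet_pytorch.utils (true for recent versions, but check your install):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from efficientnet_pytorch import EfficientNet
from efficientnet_pytorch.utils import MemoryEfficientSwish, Swish

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def swap_swish_for_mish(module):
    # Recursively replace every Swish/MemoryEfficientSwish child with Mish.
    for name, child in module.named_children():
        if isinstance(child, (MemoryEfficientSwish, Swish)):
            setattr(module, name, Mish())
        else:
            swap_swish_for_mish(child)
    return module

model = swap_swish_for_mish(EfficientNet.from_name('efficientnet-b3'))
```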

6 Likes

Hey Marc. Thanks for the appreciation. Yes, it demonstrates the consistency and superiority of Mish. @morgan here used Mish with EfficientNet and beat the accuracy of Google’s paper on the Stanford Cars dataset with the B3 model.
For further comparison, I used the B0 model to get scores for various activations on the CIFAR-10 dataset.

5 Likes

It’s very nice that you are doing all these benchmarks and also including statistical significance.

I have two suggestions you could explore:

  1. Does Mish beat ReLU when constrained by run time? For example, if 50 epochs with Mish take as long as 80 epochs with ReLU (I just made up those numbers), you could compare those results (see the sketch after this list).

  2. Does Mish increase the capacity of a network when replacing ReLU? I’m not sure exactly how you would test that, but I assume that there’s a certain number of epochs after which networks won’t improve their accuracy by much. I assume it’s higher than 50 epochs.
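Regarding suggestion 1, one way to make the comparison is to give each activation the same wall-clock budget rather than the same epoch count. A toy sketch of that protocol on random data (nothing here is taken from the benchmarks above; it only illustrates the loop):

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def train_for_budget(act_cls, budget_s=10.0):
    # Toy data and a tiny MLP, purely to illustrate "equal wall-clock budget".
    torch.manual_seed(0)
    x, y = torch.randn(2048, 20), torch.randint(0, 2, (2048,))
    model = nn.Sequential(nn.Linear(20, 64), act_cls(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start, steps = time.time(), 0
    while time.time() - start < budget_s:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        steps += 1
    acc = (model(x).argmax(dim=1) == y).float().mean().item()
    return steps, acc

for name, act_cls in [("ReLU", nn.ReLU), ("Mish", Mish)]:
    steps, acc = train_for_budget(act_cls)
    print(f"{name}: {steps} steps within the budget, train accuracy {acc:.3f}")
```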

Thanks for all your work!

1 Like