Meet Mish: New Activation function, possible successor to ReLU?

I made it work. In the constructor of Res2Net, replace this

    self.avgpool = nn.AvgPool2d(7, stride=1)

with this

    self.avgpool = nn.AdaptiveAvgPool2d(1)
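
This also makes the head input-size agnostic: AvgPool2d(7) assumes the final feature map is exactly 7x7 (i.e. 224px inputs), while AdaptiveAvgPool2d(1) always pools down to 1x1. A quick shape check (illustrative sizes, not from this thread):

    import torch
    import torch.nn as nn

    x = torch.randn(2, 512, 4, 4)  # e.g. the final feature map from a smaller input
    print(nn.AdaptiveAvgPool2d(1)(x).shape)  # torch.Size([2, 512, 1, 1])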

@Diganta sorry, wrong paper; that one is good, but the real gem is this one: “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” https://arxiv.org/abs/1803.03635

Thanks a ton @grankin! Now we can test it out and see how/if it helps.
Much appreciated.

@Redknight I have the arrays of sigma_b (standard deviation of the bias initializer), sigma_w (standard deviation of the weight initializer), and q (the EOC phase-plane boundary values). How do I plot the EOC curve?
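
(A minimal plotting sketch, assuming each (sigma_w, sigma_b) pair lies on the order/chaos boundary; the values below are placeholders, not data from this thread:)

    import matplotlib.pyplot as plt
    import numpy as np

    # Placeholder boundary values -- substitute the real sigma_w / sigma_b arrays
    sigma_w = np.linspace(0.5, 3.0, 50)
    sigma_b = np.sqrt(np.maximum(0.0, 2.0 - 0.5 * sigma_w ** 2))

    # The EOC curve is conventionally drawn in the (sigma_w^2, sigma_b^2) plane
    plt.plot(sigma_w ** 2, sigma_b ** 2)
    plt.xlabel("sigma_w^2 (weight variance)")
    plt.ylabel("sigma_b^2 (bias variance)")
    plt.title("Edge of Chaos (EOC) boundary")
    plt.show()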

I’m going to run some tabular tests today and tomorrow on the following datasets:

  • Adult -> Binary
  • Rossmann -> Regression
  • PUC-Rio -> Multi-class (non-binary)
  • Brazil Air Pollution -> Time Series

With the following setup:

  • Base
  • Flatten and Anneal
  • Mish
  • RangerLars and Ranger + LookAhead

Will fill in the results when they are done.
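
For reference, the Mish in the setup above is the activation x * tanh(softplus(x)); a minimal PyTorch version (a sketch, not necessarily the exact implementation used in these runs):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mish(nn.Module):
        # Mish: x * tanh(softplus(x)) -- smooth and non-monotonic, like Swish
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))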

Hey Zach. Can you direct me to a repository where you are maintaining the logs or these results in general? That would be useful.

Sure :slight_smile: I keep a little side “play” repository here. The notebooks will get uploaded to their own folder soon; I’ll name it “Tabular Optimizer Experiments”.

If I find anything substantial, I’ll also make a separate repository.

Adult results are in and, well… they surprised me to say the least. Essentially: Flatten + Ranger wins again, but Flatten + Ranger + Mish loses. I was surprised by this :frowning: Perhaps we need to rethink the implementation somehow, @Diganta? I still need to run on the other datasets, but so far so good!

Except that when trained for more epochs (10 instead of 5), the advantage vanishes.

Hi @Diganta, thanks for the notice. I have responded to all the comments, including yours, and updated the article to add some of our results from here as well as the benchmark testing results.
And of course I now link to your GitHub :slight_smile:

I’ll take a look. But, just for clarity, would it be possible for you to post the standard deviation of the results across the various runs you obtained?
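
(To illustrate the ask with made-up numbers:)

    import numpy as np

    # Hypothetical accuracies from five runs of a single configuration
    runs = np.array([0.843, 0.851, 0.847, 0.839, 0.849])
    print(f"{runs.mean():.3f} +/- {runs.std(ddof=1):.3f}")  # mean +/- sample std dev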

Thanks @LessW2020 :slight_smile:

Those should be included in the table :slight_smile:

See the readme in the folder. Otherwise I’ll post everything once it’s all done.

@grankin @LessW2020 if you guys want to try Res2Net, I would suggest gavsn’s implementation!
I have been experimenting with Res2Net (unrelated to this topic), and the performance boost is indeed quite welcome. Also, if you need something more in the fashion of the torchvision ResNet, you can check my personal modifications: https://github.com/frgfm/Holocron/blob/master/holocron/models/res2net.py (which I’m using extensively for object detection).

Glad to see that you are still going after the next low-hanging fruit to climb that leaderboard :wink:

Cheers

Thanks very much @fgfm. I’m going to work with yours today. I also greatly appreciate how you have comments throughout the code. One thing I don’t like about XResNet is that, while it’s very compact, it has few to no comments, and I found it hard to modify as a result.

Also, regarding object detection: I want to test RepPoints, so maybe we can compare notes on a different thread in the future.

Anyway, thanks again for the link to your Res2Net; I will run it today.

Don’t take results from 5/10/20-epoch tests and assume they extrapolate. When optimizing a 5-epoch metric, you get hparams, model architectures, and optimizer choices that are good at producing the highest results in 5 epochs, and not necessarily any more or less. That’s always the danger of picking a metric and optimizing for it.

I’d argue that the optimizer, model, and hparams you need to pick to get the best 5-epoch score are going to be detrimental (especially the hparams) to the maximum achievable validation score on a given dataset, and even more so to generalization on a holdout or test set from a slightly differing distribution. This is especially true when you have to start disabling augmentation or loosening regularization to hit the high score, but need them on to hit a maximum score in a much longer training session.

I’ve done some longer tests to compare against baselines for good ImageNet results. Mish, like Swish, does look quite promising despite the heavy performance drag. I need a lot more data before making any decisive “+x%”-type statements though.

In terms of this basket of new optimizers and their various combinations, I haven’t really seen a decisive win in the results. They can produce good results faster and are less sensitive to LR. However, there haven’t been any SOTA wins for me yet. Generalization, as with all adaptive optimizers, can be really problematic if you’re not careful (best to increase beta1 by default on that count, just as with Adam).
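
(A hypothetical illustration of that beta1 tweak; the values are made up, not a recommendation from this thread:)

    import torch

    model = torch.nn.Linear(10, 2)  # stand-in model
    # Raising beta1 above its 0.9 default smooths the first-moment estimate,
    # which can help with the generalization issues mentioned above
    opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.95, 0.999))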

I agree. The reasoning behind this is that with tabular data, finding a “sweet spot” number of epochs generally doesn’t take many at all. If we examine the Rossmann example, Jeremy found a maximum at 5 epochs that was never matched again. I believe this is a bit different for tabular, hence the epoch counts. I’m not doing 5, as I agree that is unreasonably small, and am instead doing 10 in hopes of supporting that extrapolation as best I can. If you find that is still not sufficient, let me know :slight_smile: I greatly value your input.

Yes, 5/10 epochs is more relevant for tabular data. I was pointing it out with respect to a lot of the past Imagewoof/Imagenette results and leaderboard wins too, though.

So across the board, the results were nil: no real gain or improvement regardless of optimizer or activation function. Which, again, I find strange, as personal research showed improvement. I think fundamentally we may (as a whole) be missing something in terms of what’s happening and how we can improve it, but for now it looks like it’s still a hefty black box (in regards to tabular). I’ll post the code later today for anyone to look at.

Can you share your ImageNet results?

It’s not just the results that showcase why a particular algorithm tends to work better than another. There are underlying mathematical proofs, which themselves hide the secrets of how to make it perform best, and that is what I’m currently working on in regards to Mish. Maybe, and hopefully, it will provide some definitive answers.
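
(For anyone following along, the kind of math in question: Mish and its first derivative, which follow from the published definition by standard calculus.)

    f(x)  = x \tanh(\mathrm{sp}(x)), \qquad \mathrm{sp}(x) = \ln(1 + e^x)
    f'(x) = \tanh(\mathrm{sp}(x)) + x \, \sigma(x) \, \mathrm{sech}^2(\mathrm{sp}(x)), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}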
