Meet Mish: New Activation function, possible successor to ReLU?

Yes, I agree on that, but there’s no real way to prove it. Just seeing scores exceed a threshold doesn’t indicate increased learning capacity.

Yes, he’s talking about practical significance. So a few questions would be:

  1. What’s the practical significance for us? And how do we confirm it?
  2. Should I also report the 95% CI for the results?

This is a very good blogpost - https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/

I believe I should increase my sample size. Will 10 work? That’s about the maximum I can do with the resources I have.

Agreed that 3 runs is a bit low, but I have seen consistent performance from Mish on other tests which required 5 runs, so I doubt it’s just that I got lucky. I can certainly do 5 runs, no issues. What makes more sense to me is that CIFAR 10 is a low-variance dataset. But then how would you prove otherwise, and what dataset should I use? ImageNette and ImageWoof aren’t standardized enough to convince a reviewer with results based on them, and I have no computational resources for multiple runs on ImageNet. So what exactly should I do?

Also, it’s about the comparison. It isn’t plausible that even over 3 runs Mish got lucky while every other activation function somehow performed far worse than it otherwise would have. So I don’t think it’s a case of being lucky (it makes little mathematical sense), but I do agree that CIFAR 10 may be a low-variance dataset.

Hi Diganta,
You may want to take a look at this blog post. The concept of effect size is also used in medicine to report the practical clinical significance of a drug in a clinical trial.
You could easily calculate Cohen’s d (there’s code in the blog post to do it) and report on it.

Cohen’s d measures the difference between the means of two Gaussian-distributed variables. It is a standard score that summarizes the difference in terms of the number of standard deviations.

And the score is standardized:

  • Small Effect Size : d=0.20
  • Medium Effect Size : d=0.50
  • Large Effect Size : d=0.80

Based on your data, the effect size between Mish and ReLU seems to be large (1.6 if I’m not mistaken), which is pretty good!
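In case it helps, here’s a minimal sketch of the pooled-standard-deviation form of Cohen’s d (NumPy assumed; the run accuracies are illustrative placeholders, not your actual results):

    import numpy as np

    def cohens_d(a, b):
        # Cohen's d for two independent samples, using the pooled standard deviation
        a, b = np.asarray(a, float), np.asarray(b, float)
        n1, n2 = len(a), len(b)
        pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
        return (a.mean() - b.mean()) / pooled_sd

    # illustrative accuracies (%) from repeated runs of each activation
    print(cohens_d([87.5, 87.9, 87.2], [86.8, 86.6, 87.0]))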


This is really helpful, thanks. I am currently increasing my sample size from 3 to 23 for the p-value calculation. Should I compute the Cohen’s d score on the 3 samples or on the 23-sample set?

Thank you Diganta.
I think you can use it with both, but if you can claim the same results with 23 samples, it would be much more credible IMO.


Alright. It would take some time though. For each activation, 20 more samples will take 4 to 4.5 hours. There are 24 activations in total, so that’s around 96-108 hours, which will be like a week at most.

Maybe you can run the additional 20 against the activations that are closer to Mish in your previous results to see if they still look similar.

That would be unfair in my opinion. I want my results to be as consistent as possible. Though yes, I’ll first run the ones which are significantly closer to Mish, but eventually I’ll do it for all of them.


Why would it be unfair? The calculations are based on your given sample size; that’s kind of the point: they allow accurate estimation of the true (population) statistics given your sample size.
You might want separate tables, one evaluating all of them with a smaller sample size and then a more detailed analysis of the top performers with more runs, to simplify reporting, but I think a comparison across different sample sizes isn’t unfair.

My suggestion would be that you might be better off including confidence intervals rather than significance tests. The CI gives information on both the statistical significance and the effect size, while being less liable to misinterpretation by those not familiar with statistical inference. For instance, people might think a lower p-value indicates better performance, or that a high p-value means more runs are needed, both of which are misguided views. There is a move away from significance testing in various areas of academia (mostly psychology and the social sciences), with various journals no longer reporting p-values.
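As a rough illustration (NumPy/SciPy assumed; the accuracy lists below are made-up placeholders, not anyone’s actual results), a Welch-style 95% CI for the difference in mean accuracy could be computed like this:

    import numpy as np
    from scipy import stats

    def mean_diff_ci(a, b, confidence=0.95):
        # Welch-style CI for the difference in means (unequal variances allowed)
        a, b = np.asarray(a, float), np.asarray(b, float)
        va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        se = np.sqrt(va + vb)
        # Welch-Satterthwaite degrees of freedom
        df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
        t = stats.t.ppf((1 + confidence) / 2, df)
        diff = a.mean() - b.mean()
        return diff - t * se, diff + t * se

    # illustrative accuracies (%) from repeated runs
    print(mean_diff_ci([87.5, 87.9, 87.2], [86.8, 86.6, 87.0]))

If the interval excludes 0, the difference is significant at the corresponding level, and its width directly conveys the effect size in accuracy points.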


I think more runs will clarify things! Thanks for doing all that work.

I’d like to see Mish, ReLU, Swish when you have them, as a start.

Also I agree with moving to confidence intervals. I believe that would help with the practical significance issues.


I have a question about the MXResNet implementation.
In the MXResNet class we have

blocks = [self._make_layer(expansion, block_szs[i], block_szs[i+1], l, 1 if i==0 else 2,
                           sa=sa if i in [len(layers)-4] else False, sym=sym)
          for i, l in enumerate(layers)]

So, my question is about

sa = sa if i in [len(layers)-4] else False

Since len(layers) is always 4, do we have the SelfAttention module only in the first layer group? And only in its third ResBlock?
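To check my understanding, tracing the condition with an example layers list (assuming sa=True; the [3, 4, 6, 3] configuration is just an illustration):

    layers = [3, 4, 6, 3]  # illustrative ResNet-50-style configuration
    for i, l in enumerate(layers):
        sa_here = True if i in [len(layers) - 4] else False  # len(layers) - 4 == 0
        print(i, sa_here)
    # 0 True
    # 1 False
    # 2 False
    # 3 False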


hi Andrei,

I’m the author of the SimpleSelfAttention module and of the placement in the xresnet. Note that it’s still very experimental.

I believe you are right about the placement of the SelfAttention module. Feel free to experiment with other placements!

Let me know if you have other questions.

Got it.
I just wanted to clarify that you put it in one place, at the end of one block.
Ok, I’ll try variants.

The input size to the layer will change if you move it, so it should have an impact on speed.

Also, I’ve found that you need to place the attention layer after a conv_layer that uses “zero_bn=True”. Seems to help a lot with convergence.
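For illustration, a rough self-contained sketch of that ordering (the conv_bn helper below mirrors the notebook-style conv_layer with a zero_bn option; nn.Identity() stands in for SimpleSelfAttention from the repo, and the channel sizes are made up):

    import torch
    import torch.nn as nn

    def conv_bn(ni, nf, stride=1, zero_bn=False, act=True):
        # conv -> batchnorm -> optional activation; zero_bn=True initializes the BN
        # scale to 0 so the branch starts out close to identity
        bn = nn.BatchNorm2d(nf)
        nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
        layers = [nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False), bn]
        if act: layers.append(nn.ReLU(inplace=True))
        return nn.Sequential(*layers)

    # hypothetical placement: attention right after the conv layer that uses zero_bn=True
    block = nn.Sequential(
        conv_bn(64, 64),
        conv_bn(64, 64, zero_bn=True, act=False),
        nn.Identity(),  # <- SimpleSelfAttention(64) would go here
    )
    print(block(torch.randn(2, 64, 8, 8)).shape)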


@TomB I completely agree with you, but I personally want all activations to be tested with all the same parameters, which includes sample size, even if they are not close to Mish.
@Seb Thanks for all the suggestions.

Just finished the 23-run batch for Mish and Swish. Mish so far has the lowest mean loss, the lowest standard deviation, and the highest mean accuracy.

@TomB @Seb - would be glad if you could provide your thoughts on this.
@oguiza as of right now, Mish against Swish falls in the small-to-medium effect size range for Cohen’s d.

Variance has gone up for both (which shows that something was off with the variance estimate at sample size = 3), so it gets harder to tell them apart in terms of accuracy.

I plugged your values into https://www.medcalc.org/calc/comparison_of_means.php and get very slightly different values (could just be rounding).

Assuming everything stays the same (it won’t), you’d need more than 50 samples of each to get to p < 0.05 and a 95% CI that does not cross 0. That’s a lot of work!

Possibly it will be easier to tell Mish and ReLU apart.

Also, one thing with the comparison-of-means test is that you don’t need the same sample size for both groups. So you could have 23 for Mish and 10 for e.g. PReLU.
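For example (SciPy assumed; the numbers are random placeholders just to show that the test accepts unequal sample sizes):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    mish_acc = rng.normal(87.5, 0.3, size=23)   # 23 runs
    prelu_acc = rng.normal(87.2, 0.3, size=10)  # 10 runs
    # Welch's t-test assumes neither equal variances nor equal sample sizes
    t, p = stats.ttest_ind(mish_acc, prelu_acc, equal_var=False)
    print(t, p)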

Is mean loss based on your loss function? When working with ImageWoof, I found that loss has less variance and therefore gives better p-values and tighter confidence intervals. IMO it’s worth looking at loss if you find that the accuracies are a bit close to each other.


50 samples will probably take me an eternity, haha. I can do it for loss too; I have all the data saved, so that won’t be an issue. I will make another dedicated table for the loss tests. Thanks!
And yes, the slight difference in the values you got is because of rounding in the table. All the values were calculated to full precision; I just rounded them to keep the table compact.

Off to bed, will have a better look tomorrow. But that Cohen’s d seems odd (though I’m not really familiar with it and my inferential stats are a bit rusty); I wouldn’t expect a non-significant result to be considered a medium effect. Are you mixing the mean/sd between percentages and fractions (i.e. 0.8748 vs 87.48)?
[image]

That seems better for a non-significant result.

@TomB Calculated the Cohen’s D score using this tool - https://www.socscistatistics.com/effectsize/default3.aspx

I didn’t completely get you on this:

Are you mixing the mean/sd between percentages and fractions (i.e. 0.8748 vs 87.48)?