Meet Mish: New Activation function, possible successor to ReLU?

Thanks @fgfm for confirming this! At least now we know what was going on with that.

Re cosine annealing, I was using a flat run and then cosine annealing (credit to @grankin for that improvement).
What I meant was: by waiting until 0.75 of the run was complete before starting the cosine decay, I got better results than when starting the cosine at 0.5 of the run.
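For reference, the flat-then-cosine schedule can be sketched in plain Python. This is a minimal sketch of the idea, not fastai’s actual implementation; the function name and defaults are mine:

```python
import math

def flat_cos_lr(step, total_steps, max_lr, pct_start=0.75, min_lr=0.0):
    """Hold the learning rate flat for the first pct_start of training,
    then cosine-anneal it down to min_lr."""
    flat_steps = int(total_steps * pct_start)
    if step < flat_steps:
        return max_lr
    # fraction of the way through the cosine phase, in [0, 1]
    frac = (step - flat_steps) / max(1, total_steps - flat_steps)
    return min_lr + (max_lr - min_lr) * (1 + math.cos(math.pi * frac)) / 2
```

With pct_start=0.75 the LR stays at max_lr for the first 75% of steps and then follows a half-cosine down to min_lr, versus pct_start=0.5 which starts the decay at the halfway point.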

I’m going to make a quick repo so everyone can try the new setup with Ranger, Mish, and the flat-cosine decay, and put in the pull request to add our new high score for 5 epochs.

@LessW2020 interested to hear about AutoOpt! Just looked at the code. I may try it myself (just for practice, as it seems very hefty for the callback system).

Also just saw your fix… trying a small experiment… large improvements on first epoch


The reason behind it might be obvious, but it’s quite a significant difference! I wonder if it’s a common mistake. I ran some tests on Imagewoof and will start a new thread.

I’m going to set up a quick GitHub repo and do a pull request on the ImageWoof leaderboards to get the first Mish score into the record books.

I’m listing the following as contributors on this, let me know if I missed anyone:
@LessW2020 / @muellerzr / @Seb / @grankin / @Redknight / @oguiza

Here’s the summarized results (Ranger + Mish, 5 epochs 128px ImageWoof):

and current leaderboard (as it stands publicly):

*Side note, but I think I can improve Mish’s perf a bit by inlining the tanh… I’ll test it today.
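For anyone following along, Mish itself is just x · tanh(softplus(x)). Here is a scalar sketch in plain Python (an actual training implementation would wrap this as a PyTorch module operating on tensors):

```python
import math

def mish(x):
    """Mish activation: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)."""
    return x * math.tanh(math.log1p(math.exp(x)))
```

Like ReLU it passes large positive inputs through nearly unchanged, but it is smooth everywhere and allows small negative values through near zero.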


Great! I’m still surprised at the jump!

The true baseline would be 61.25% or 63.89% (using Imagewoof-160 or full-size Imagewoof); still a very impressive jump.

I forget: were those runs done with the full-size dataset? If not, that would be a way to improve even more.

If you manage to get equivalent run times by improving Mish’s perf, that would be great!

One quick question, more of a conceptual one: why are we examining top_k overall? I understand it’s a good baseline for how the model is generally fitting (e.g. is it ‘close’ to the right answer), but we generally only care about the final accuracy, correct?

Top-k (k=5) accuracy just happens to be in our code because it is used for ImageNet (1,000 classes). Here we have only 10 classes, so I would ignore it or remove it.
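In case it helps anyone reading along, top-k accuracy just checks whether the true class is among the k highest-scoring predictions. A minimal plain-Python sketch (the function name is mine):

```python
def top_k_accuracy(scores, targets, k=5):
    """Fraction of samples whose true class appears among the
    k highest-scoring predicted classes."""
    hits = 0
    for row, y in zip(scores, targets):
        # indices of the k largest scores for this sample
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += y in top_k
    return hits / len(targets)
```

With only 10 classes, k=5 means a random model already scores ~50%, which is why it is not very informative here.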


Ah, got it :slight_smile: I may keep it in with k=2 or 3 perhaps. But I hear your point. Thanks! I also tried adjusting my betas and a new eps, and I noticed less volatility in the small-scale tests (train_loss and valid_loss were much, much closer).

By the way @LessW2020

May wanna wait on that PR :wink: (I’ll have full results of new test here in the next 15 minutes)

New results:

Avg: 72.96%
Std: 1.83%
Maximum: 76.2%

I still need to play around with things and I want to do a few more tests. I’m rerunning this for another 5 due to that high variance as well.


I’ve got the new repo just about all set up so we can add the link.
I ran one more 5-run set to make it an even 20 and to make sure the repo setup works for anyone to test with:

[0.734 0.75 0.748 0.732 0.738]
0.74039996
0.0073102703
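For anyone reproducing these figures, the two numbers are just numpy’s float32 mean and standard deviation over the five run accuracies (note that np.std defaults to ddof=0, the population std):

```python
import numpy as np

# accuracies from the five runs above
accs = np.array([0.734, 0.75, 0.748, 0.732, 0.738], dtype=np.float32)
print(accs.mean())  # ~0.7404
print(accs.std())   # ~0.00731 (population std, ddof=0)
```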

so total results:


Oh nice job on the new max high! This is with RangerLars/Over9k?

btw, inlining tanh blows up the gradient tracker, so it doesn’t look like I can make Mish any faster.


It is! I saw your sneaky modification with the betas and an eps of 1e-8, and tried that (even though it’s a different optimizer, why not?). It wound up showing some potential at the very least.

That’s why I said I may need a little bit. I’m playing around with the right ranges for each. I’m currently running a Bayesian optimization, so perhaps it can help find them more easily. If you want some code to do this as well, let me know; I have fastai starter code.
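If it’s useful as a placeholder in the meantime, here is what a simple hyperparameter search loop can look like. This is plain random search, not Bayesian optimization, and the ranges are my own illustrative guesses, not the ones used in this thread:

```python
import random

def sample_hparams(rng=random):
    """Draw one random optimizer hyperparameter configuration.
    Ranges are illustrative guesses, not the thread's actual ranges."""
    return {
        "beta2": rng.uniform(0.99, 0.9999),
        "eps": 10 ** rng.uniform(-8, -5),  # sample eps on a log scale
        "alpha": rng.uniform(0.5, 0.99),
    }

def random_search(objective, n_trials=20):
    """Evaluate n_trials random configs; return the best (score, config)."""
    trials = [(objective(hp), hp) for hp in (sample_hparams() for _ in range(n_trials))]
    return max(trials, key=lambda t: t[0])
```

In practice `objective` would train for 5 epochs with the given hyperparameters and return the validation accuracy; a Bayesian optimizer would replace the uniform sampling with a model-guided proposal.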


On your repo, might want to comment out
#from radam import *
and other such imports in train.py


Here’s the results after 10:

Mean: 72.92
Std: 1.34
Max: 76.2
Individual:

[array(0.738, dtype=float32),
 array(0.716, dtype=float32),
 array(0.762, dtype=float32),
 array(0.714, dtype=float32),
 array(0.718, dtype=float32),
 array(0.732, dtype=float32),
 array(0.728, dtype=float32),
 array(0.722, dtype=float32),
 array(0.736, dtype=float32),
 array(0.726, dtype=float32)]

From here I’ll update the results if the Bayesian optimization wound up achieving anything noticeable, but that probably won’t be until later tonight.

Done - thanks @Seb!

I figured I’d try adding SimpleSelfAttention since it did well in my previous tests with Imagewoof128 (better accuracy at ~100 epochs when constrained by run time - edit: that was with xresnet18). For some reason it didn’t help on Imagewoof256, but that’s another story.
I imagine it might like the different lr scheduler.
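For readers who haven’t seen SimpleSelfAttention, the core trick (as I read the repo linked at the bottom of this post; this single-sample numpy paraphrase is mine, not the actual PyTorch module) is to form attention over channels via x·xᵀ, which costs O(C²N) instead of the O(N²C) of standard spatial self-attention:

```python
import numpy as np

def simple_self_attention(x, w, gamma=0.1):
    """Single-sample sketch of the SimpleSelfAttention idea (my paraphrase).
    x: (C, N) flattened feature map; w: (C, C) weight of a 1x1 conv;
    gamma: learned residual scale, initialized near 0 in the real module."""
    convx = w @ x                     # 1x1 conv == channel-mixing matmul
    xxT = x @ x.T                     # channel-affinity matrix, (C, C)
    return gamma * (xxT @ convx) + x  # scaled residual connection
```

Because the quadratic term is over channels rather than spatial positions, the cost stays modest even on large feature maps, which fits the small per-epoch slowdown reported below.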

Results
[0.752 0.768 0.75 0.756 0.732 0.748 0.754 0.752 0.74 0.75 ]
mean: 0.7502
stdev: 0.0090088835

Epochs are 24-25s vs 23s, so not that much slower.

edit: more info about ssa here: https://github.com/sdoria/SimpleSelfAttention


Wow, that’s fantastic - a new high on both average and max (76.8)! Nice job @Seb!
A second or two of extra compute is completely worth it.

Only problem is now you’ve messed up my halfway-complete post about how we beat the leaderboards… let me run a set as well, and I’ll update everything; hopefully you can expand on the SA aspect.
(Jeremy had asked that I make a post explaining what was going on with our new records with Ranger/Mish/FlatCosine anneal for people to catch up with).


@Seb are those 100 epoch or 5 epoch tests?

This is 5 epochs for a 75% average.

The ~100 epoch run is a previous test (link at the bottom of my other post). I had run 94 epochs for xresnet18+ssa and 100 for xresnet18 to get the same duration, and beat xresnet18 by 0.75%.


I may try with RangerLars, if that’s alright? Or would you rather do it? I found some hyperparameters that show stable, decent results if you want to try them (unsure how they’ll back-apply for you):

betas = (0.9,0.998802922279857), eps=1e-5, alpha=0.8999824249083723

This is for RangerLars; I’m getting > 72% on all of the tries so far (7).

*I don’t necessarily want to rule it out, given how close the margins are.

Go for it. I might not be around much for a few days.
My only issue with Over9000 is run time. Eventually I’d like to sort through all those additions and see which ones get to high accuracy the fastest. Maybe we could have a leaderboard for the fastest run to 87% on Imagewoof128 with 1 GPU.
