Meet Mish: New Activation function, possible successor to ReLU?

LessW2020 · August 27, 2019, 5:30am

Hi all,
After testing a lot of new activation functions this year, I’m excited to introduce you to one that has delivered in testing - Mish.

Per the paper, Mish outperformed ReLU by 1.67% in their testing (final accuracy) and was tested in 70 different architectures.
I tested against ImageWoof using XResNet50 and a variety of optimizers to put Mish through its paces, and saw improved training curves and accuracy jumps of 1-3.6% merely by dropping in Mish instead of ReLU.
The overhead vs ReLU is minimal (+1 second per epoch) and so far well worth it for the accuracy gains.
I wrote a full article on Mish here:

and have a PyTorch/FastAI drop in (mish.py) and Mish XResNet here:

and here’s the paper link:

Please give Mish a try and see how it performs for you versus ReLU as I think you’ll see a nice win from it.

muellerzr · August 27, 2019, 3:45pm

@LessW2020 thanks for this work! Very excited to try it. One quick note: in the forward, I believe you’re missing a closing parenthesis:

x = x *( torch.tanh(F.softplus(x)))
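
For readers who want to see what that forward line computes, here is a minimal scalar sketch of the same formula, mish(x) = x * tanh(softplus(x)), in plain Python. It is not the mish.py file from the repo; the helper names `softplus`, `mish` and `relu` are made up for illustration.

```python
import math

def softplus(x: float) -> float:
    # log(1 + exp(x)), rearranged so large positive x does not overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # Same expression as the forward line above, for a single float
    return x * math.tanh(softplus(x))

def relu(x: float) -> float:
    return max(x, 0.0)

# Compare the two activations at a few sample points
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Unlike ReLU, Mish lets small negative values through (mish(-1) is about -0.303) and approaches the identity for large positive inputs.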

dmangla3 · August 27, 2019, 3:52pm

I have been following this paper by Digant for some days and eagerly waiting for someone to implement it. Thanks.
This changes the internal layers of the model. Did you train the whole XResNet50 and MXResNet50 again to compare accuracy?

LessW2020 · August 27, 2019, 4:54pm

Hi @dmangla3,
Yes, that’s exactly what I did - swapped out the internal activation function of XResNet from ReLU to Mish.
That thus becomes MXResNet.
You can also just change the internals of it via act_fn, but I found it was easier for testing to have two separate networks, with different names, to avoid any confusion.
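
A rough sketch of a third way to do the swap on an arbitrary PyTorch model: walk the module tree and replace every `nn.ReLU` with a Mish module. This is not how MXResNet or the act_fn route is implemented; `Mish` here is a minimal stand-in and `replace_relu_with_mish` is a hypothetical helper. It only catches ReLU used as a module, not functional `F.relu` calls inside a `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Minimal stand-in: x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def replace_relu_with_mish(module: nn.Module) -> nn.Module:
    # Hypothetical helper: recursively swap nn.ReLU children for Mish, in place
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, Mish())
        else:
            replace_relu_with_mish(child)
    return module

# Toy model with one top-level and one nested ReLU
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU(inplace=True)),
    nn.Linear(8, 2),
)
replace_relu_with_mish(model)
print(model)
```

The swapped model still has to be retrained from scratch to compare accuracy, exactly as described above.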

LessW2020 · August 27, 2019, 5:00pm

Thanks @muellerzr! I’ve fixed it and lesson learned - cut and paste from my test code instead of writing it up by hand “b/c it’s so simple” lol.
Please post once you get a chance to use it…I’m very curious how it performs on a variety of datasets.

muellerzr · August 27, 2019, 5:07pm

@LessW2020 no problem! I’ve felt that pain many times. I will take a look at it in a little bit! I plan on trying it with the Adults dataset and Rossmann, and possibly pets.

I will say, I’m trying to prepare my material for my study group this semester. The way you keep showing up with new state-of-the-art implementations is frustrating!

Keep up the good work!

Seb · August 27, 2019, 5:25pm

As a reminder, @grankin reran the baseline for Imagewoof, 5 epochs, (https://github.com/mgrankin/over9000) and got 61.25% averaged over 20 runs, which is higher than what you got with mish.
The true baseline for 20 epochs is most likely higher than on the leaderboard as well.

I’ve explained the issue with the leaderboard here: ImageNette/Woof Leaderboards - guidelines for proving new high scores?

There is also an issue of high variance in accuracy from run to run on Imagewoof/nette, so I wouldn’t rush to making a conclusion with a single run that is furthermore compared to a wrongly measured baseline.

I’ve made those points in the past in the other SOTA threads, but I still see the same method being used of running things once and comparing to underestimated baselines…

LessW2020 · August 27, 2019, 6:47pm

Hi @Seb,
I always value your input - Let me clarify some incorrect assumptions you are making:

Agreed about variance. But in this case, you don’t know what I got for Imagewoof in 5-epoch-only testing, b/c I didn’t show it in the article.
I showed a visual in my article of a single run, one of many I tested, b/c I wanted to show a sample 20-epoch run with the epoch-to-epoch details, both for the improved training stability from Mish and for the final 20-epoch result. That run also just happened to exceed the posted leaderboard results at 5 epochs, so I highlighted that, but it was not representative of a proper 5-epoch, 5-run setup.

Here’s what I got with Ranger + Mish over 5 runs of 5 epochs each: 69.20%. That readily beats the 61.25% you are referring to, as well as the posted leaderboard score of 55.25%.

Re: leaderboards - the leaderboards really need to be updated, but they haven’t been… I can’t show a visual in an article of someone posting about a score… as readers will generally consider the github leaderboards as the ‘official’ score.
In this case, it seems Mish + Ranger handily beat the leaderboard and @grankin’s baseline results, so I’ll have to see about getting these submitted for the leaderboard.
But ultimately, if the leaderboards aren’t updated with your results or others, then it’s not clear if there was an issue regarding reproducibility or what…so all we can really go by is what is posted on github.

Hope that helps clarify, and let’s see about getting the leaderboards updated!

Seb · August 27, 2019, 6:58pm

Thanks for running those. I did think that your 5-epoch result in the article was likely to be underestimated because it was taken from within a longer 20-epoch run.

Now, you are changing 3 variables at a time compared to the baseline (optimizer, activation, and lr scheduler), but this is ok because we have a result for Ranger + ReLU + flat_and_anneal: 59.46%.

Thanks for the extra data, you got me curious enough to take a look at this!

LessW2020 · August 27, 2019, 7:04pm

Hi @Seb,
Thanks for the feedback!
I have ReLU results for all the same variables from earlier testing (optimizer, anneal, lr), so even though 3 are changing against the leaderboard, I was really testing only one change for myself and the article: replacing ReLU with Mish.
I haven’t really dug into testing the optimal lr with Mish - I suspect there’s a lot more gains to be had with that.
I just wanted to verify it could outperform ReLU and wrote the article based on that. I spent a lot of time before on activations that ultimately couldn’t beat ReLU even though they did great in the papers.
Mish looks like a winner though, and I’m happy to hear you are going to do some testing with it!

muellerzr · August 27, 2019, 7:04pm

@Seb @LessW2020 I want to test drive Less’ activation along with the Over9000 (or is it Ralamb at this point? It’s getting a little confusing for me) optimizer. I’m running it now but I want to be sure it’s up to ‘par’:

5 runs (I have class later but I want to experiment) for five epochs (20 later on). Along with the anneal scheduler. How best should I report my results so they are ‘up to par’? (As @LessW2020 's intuition is right, results do seem… interesting… so far)

Current one is as of 12 hrs ago for my small test, I need to use the newest version

LessW2020 · August 27, 2019, 7:07pm

Re: over9000 / RangerLars - RangerKnight just posted 2 fixes to his core code, so that should be integrated into an update for RangerLars.
Give me about 20 minutes and I’ll integrate his fixes and give you a new drop if you’d like to test with that. That would be great actually, so we can see what effect the fixes have, as they are untested!

Seb · August 27, 2019, 7:25pm

@muellerzr
I’d like to see the accuracy for each run, from which we can compute average and standard deviation.
Might have to rerun the same thing with Adam if there are some fixes to LARS. Otherwise, we have a baseline here: https://github.com/mgrankin/over9000

5 runs might be enough. More could be needed if results are close.

muellerzr · August 27, 2019, 7:28pm

@Seb they just finished:

Average: 73.2%
Std: 0.946%

Note: these results predate the fixes in the most recent version. So only up from here?

LR: 1e-2

Accuracies:
71.8%
73%
73.8%
74.6%
72.8%

I will rerun with the fixes before posting the official notebook etc.
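
As a quick check of the summary figures, the five accuracies listed above can be fed through Python's standard `statistics` module. This is only a sketch for reproducing the average and standard deviation; it is not the code from the notebook.

```python
import statistics

# Per-run accuracies (percent) as posted above
accuracies = [71.8, 73.0, 73.8, 74.6, 72.8]

mean = statistics.mean(accuracies)
pop_std = statistics.pstdev(accuracies)    # divides by n
sample_std = statistics.stdev(accuracies)  # divides by n - 1

print(f"mean={mean:.2f}%  population std={pop_std:.3f}%  sample std={sample_std:.3f}%")
```

The posted 0.946% matches the population standard deviation (dividing by n, which is also NumPy's default). The sample standard deviation for the same five runs is about 1.058%, so it helps to say which one is being reported when comparing across posts.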

Seb · August 27, 2019, 7:31pm

This is surprisingly good! Did you use mgrankin’s code? Can you add the parameters you ran “%run train.py etc”?

LessW2020 · August 27, 2019, 7:33pm

Awesome! The fixes are done, btw, but I can’t get a server to test them:

https://github.com/lessw2020/mish/blob/master/rangerlars.py

#RangerLars / Over9000  -
#credit to Federico (https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20) and
#@mgrankin  (https://github.com/mgrankin/over9000) for adding Lars into Ranger (Lookahead + RAdam)

#this version integrates several improvements from @oquiza and Yaroslav Geraskin, added by Federico.
#8/27/19

import torch, math
from torch.optim.optimizer import Optimizer
import itertools as it

class RangerLars(Optimizer):

    def __init__(self, params, lr=1e-3, alpha=.5, k= 5,, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        if not 0.0 <= alpha <= 1.0:
            raise ValueError(f'Invalid slow update rate: {alpha}')
        if not 1 <= k:
            raise ValueError(f'Invalid lookahead steps: {k}')
            
        defaults = dict(lr=lr, alpha=alpha, k=k, betas=betas, eps=eps, weight_decay=weight_decay)

This file has been truncated.
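
Since the preview cuts off before the step logic, here is a toy sketch of what the `alpha` and `k` arguments control in the general Lookahead scheme: every `k` inner steps, the slow weights move a fraction `alpha` toward the fast weights, and the fast weights restart from there. It does not reproduce rangerlars.py. Plain SGD stands in for the inner optimizer (the real file uses RAdam with LARS), and `lookahead_sync`, `train` and `grad_fn` are hypothetical names.

```python
def lookahead_sync(slow, fast, alpha):
    # Move each slow weight a fraction alpha toward its fast counterpart
    return [s + alpha * (f - s) for s, f in zip(slow, fast)]

def train(weights, grad_fn, lr=0.1, alpha=0.5, k=5, steps=20):
    slow = list(weights)
    fast = list(weights)
    for step in range(1, steps + 1):
        grads = grad_fn(fast)
        # Inner optimizer step: plain SGD as a stand-in
        fast = [w - lr * g for w, g in zip(fast, grads)]
        if step % k == 0:
            # Every k steps: update slow weights, then reset fast to slow
            slow = lookahead_sync(slow, fast, alpha)
            fast = list(slow)
    return slow

# Toy problem: minimize w**2, whose gradient is 2*w
final = train([1.0], lambda ws: [2.0 * w for w in ws])
print(final)
```

With `alpha=1.0` the slow weights simply copy the fast ones, so the scheme reduces to the inner optimizer. That is why the constructor above only accepts `alpha` in [0, 1] and `k` of at least 1.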

muellerzr · August 27, 2019, 7:33pm

@Seb since I use colab I just did straight notebook. I did use mgrankin’s code however. I can share a very messy notebook momentarily.

@LessW2020 thanks! I will give this a boot up in a moment.

muellerzr · August 27, 2019, 7:39pm

@Seb VERY messy notebook. I will clean and do a new one with the new code:

https://github.com/muellerzr/fastai-Experiments-and-tips/blob/master/ImageWoofTests/initial.ipynb


muellerzr · August 27, 2019, 7:42pm

@LessW2020 slight extra comma after k in the init for rangerlars

Seb · August 27, 2019, 7:45pm

I’ve seen messier
Thanks for sharing! Is there a way you can see the detailed runs in the future?