Meet Mish: New Activation function, possible successor to ReLU?

Yes, from what I remember t-based CI’s take into account the fact that you are estimating the standard deviation, whereas z-based CI’s assume you know the standard deviation. They are therefore wider than z-based CI’s because there is more uncertainty. As the sample size increases, t-values converge to z-values.

1 Like

This is a very handy notebook. Thanks for all the work on it. Regarding the autograd part, to write the backward pass from scratch: someone opened an issue on my repository about exactly this and wrote this code for Mish:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)
        return i * torch.tanh(F.softplus(i))

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        # closed-form derivative of x * tanh(softplus(x)): exp(x) * w / d**2
        w = 4*(i+1) + 4*torch.exp(2*i) + torch.exp(3*i) + torch.exp(i)*(4*i+6)
        d = 2*torch.exp(i) + torch.exp(2*i) + 2
        return grad_output * torch.exp(i) * w / d**2

class CustomMish(nn.Module):
    def forward(self, input_tensor):
        return Mish.apply(input_tensor)

And he said that this passed the “Autograd” function check but it wasn’t good. I don’t know exactly what he meant by “wasn’t good”.
Link to the issue - https://github.com/digantamisra98/Mish/issues/11
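For anyone wanting to run that check themselves, `torch.autograd.gradcheck` compares a hand-written backward against numerical gradients. A sketch (the function is repeated here so the snippet is self-contained, and renamed `MishFn` to avoid clashing with the class above; gradcheck wants small double-precision inputs):

```python
import torch
import torch.nn.functional as F

class MishFn(torch.autograd.Function):
    # Same forward/backward as the Mish autograd Function above.
    @staticmethod
    def forward(ctx, i):
        ctx.save_for_backward(i)
        return i * torch.tanh(F.softplus(i))

    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        w = 4*(i+1) + 4*torch.exp(2*i) + torch.exp(3*i) + torch.exp(i)*(4*i+6)
        d = 2*torch.exp(i) + torch.exp(2*i) + 2
        return grad_output * torch.exp(i) * w / d**2

# gradcheck returns True when the analytic backward matches numerical gradients.
x = torch.randn(8, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(MishFn.apply, (x,)))
```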

2 Likes

Sorry for the delay; yep, that’s the library.

1 Like

Let me know when you put up a PR.

Thanks for the great post.

As you’ve said, there are two main things I would worry about (ignoring inference time, etc. for now). I’ll take Mish as an example, but I’m going through the same process and questioning with SimpleSelfAttention.

1) Will this help me train for example Imagenet to 94% faster than before?
(replace Imagenet by whatever dataset you care about)

I think that even with additional computational overhead, you can have better performance for the same amount of time.

Now if we look at the 50 epoch runs on cifar-10, swish and mish beat ReLU, but I wonder if I could just have run ReLU for 10 more epochs (or whatever it takes to get equivalent timing) and gotten a better score.

Or maybe mish/swish get a better score than ReLU for the same amount of time, and that would be appealing.

So at this point the 50 epoch results won’t lead me to switch from ReLU. But there is something interesting going on, in that they get a better score after having seen the same amount of data.

Which leads us to:

2) Will this help me beat SoTA accuracy on my dataset?

I wouldn’t want to train EfficientNet from scratch, but it beats Resnet by a wide margin in terms of accuracy and it does have some applications in transfer learning (see the latest diabetic retinopathy challenge on Kaggle).

Mish is similar to Swish, so maybe this is where we will see the most interesting results.

Resnets get 93-96% on cifar-10, so we’re not that close to convergence with the 50 epoch test.

But as we’ve discussed, this is harder and more computationally expensive. Maybe we can look at models that are more adapted to cifar10?

From DAWNBench, we have David Page’s Resnet9, which trains to 94% in a few minutes. Also, wideresnet is a bit slower but maybe more recognized by people in the field, and should reach 96%+.

The authors of WideResnet got 96.11% accuracy (median of 5 runs). Beat that and you’ll get a lot more attention IMO.

Though you might have to rerun the baseline too, as training methods must have improved since 2016. There’s a leaderboard with more wideresnet implementations here.

4 Likes

Thanks, I really need to get onboard the Tensorboard train. I do have an issue with time, as sometimes the machines I rent have inconsistent speed (from machine to machine, and within the same machine too). Is there a way to count floating point operations (?) instead?

It’s the population mean you’re estimating rather than the SD, but that’s exactly the rationale.

No. You could try recording metrics with relative wall time as in Tensorboard, but then scale the relative wall time based on a benchmark of the machine’s speed. You should be able to pull data out of Tensorboard to do this as well, though I haven’t yet looked at extracting it easily; that’s on my todo list.
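To make that concrete: one rough way to normalize timings across machines of different speed is to divide each run’s wall time by the machine’s time on a fixed reference workload. A minimal sketch (the matmul benchmark here is just an illustrative choice, not a standard one):

```python
import time
import numpy as np

def machine_speed_benchmark(n=2000, repeats=3):
    # Time a fixed matmul workload; the best of a few repeats is a rough
    # proxy for this machine's compute speed.
    a = np.random.rand(n, n).astype(np.float32)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ a
        best = min(best, time.perf_counter() - t0)
    return best

# Record wall_time / machine_speed_benchmark() alongside each training run,
# so a 2x slower machine doesn't make the model look 2x slower.
```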

1 Like

Thanks, should’ve searched back through the thread. Or at least remembered to check out your repo where you have the nice gradient equations.

1 Like

Did you try Mish with the diabetic retinopathy dataset? Because I tried it without success. Granted, I did try it with EfficientNet, so maybe applying it to ResNet would have been more successful.

I think pretrained models did better? Which we don’t have with Mish.

By estimating the SD, I meant we are using the sample standard deviation rather than the population standard deviation. z CIs need the latter which we don’t have.

Yes, but I think it’s the population mean you estimate not the population SD. From here:

The test statistic is calculated as:

  t = (x̄ − µ) / √(s²/n)

  • where x̄ is the sample mean, s² is the sample variance, n is the sample size, µ is the specified population mean and t is a Student t quantile with n−1 degrees of freedom.

You don’t actually use the population SD, just the population mean (estimated or known).

I think we’re just looking at things from a different perspective.

I agree our objective is to have an estimate and CI for the population mean.

But in the process, we use both the sample mean x bar, and the sample standard deviation s which are point estimates of the population mean and population standard deviation respectively.

The reason I brought this up is that using s (which is an estimate of the population SD sigma) rather than sigma itself is why we use t-based CI’s and not z-based CI’s: there is a chance we are underestimating sigma by using s, so t-based CI’s are always a bit wider.
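A quick illustration of that difference with made-up accuracy numbers (scipy assumed for the quantiles):

```python
import math
from statistics import mean, stdev
from scipy import stats

scores = [0.910, 0.905, 0.913, 0.908, 0.911]  # hypothetical accuracies from 5 runs
n = len(scores)
m, s = mean(scores), stdev(scores)  # x bar, and s (the estimate of sigma)
se = s / math.sqrt(n)

t_crit = stats.t.ppf(0.975, df=n - 1)  # t-based: accounts for estimating sigma
z_crit = stats.norm.ppf(0.975)         # z-based: assumes sigma is known

t_ci = (m - t_crit * se, m + t_crit * se)
z_ci = (m - z_crit * se, m + z_crit * se)
# t_crit > z_crit for any finite n, so the t interval is always a bit wider.
```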

1 Like

Mmm, was just editing as I realised we were talking about slightly different things. Nicely explained.

1 Like

I tried replacing the ReLU with Mish in a pretrained model. A colleague of mine tried it with, I think, a VGG16 or some similarly small model, replaced the activations with Mish, tested on CIFAR10, and got a small accuracy boost. But on the diabetic retinopathy dataset I didn’t see any significant change.
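For reference, one way to do that swap in PyTorch is to walk the module tree and replace every nn.ReLU in place, leaving the pretrained weights untouched. A sketch (the Mish module here is the plain forward-only version, not the custom-backward one):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def replace_relu_with_mish(model: nn.Module) -> nn.Module:
    # Recursively swap every nn.ReLU child for Mish; weights are not touched.
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, Mish())
        else:
            replace_relu_with_mish(child)
    return model

# e.g. model = replace_relu_with_mish(torchvision.models.vgg16(pretrained=True))
```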

Oh that’s interesting. I would have assumed replacing ReLU by Mish in a pretrained model would have broken down all the pretraining.

Maybe it could work with Efficientnet given that swish and mish are somewhat close?

Well also we aren’t changing the weights.

I think I remember finding someone else did a similar experiment. I will try to look for it.

I used the pretrained models from @lukemelas’s EfficientNet-Pytorch github and got a nice bump in accuracy, even beating the EfficientNet paper’s b3 result on Stanford Cars: [Project] Stanford-Cars with fastai v1

Looking super promising using it with pretrained b7 too

1 Like

Great result!

1 Like

So, let’s say you ran two models, model A (baseline) and model B (modified), for 100 epochs, 15 times each, and get the following 95% t-based CIs:

Model B: M = 0.911, 95% CI [0.9077, 0.9142]

Model A: M = 0.908, 95% CI [0.9046, 0.9114]

Where do you go from there? Do you publish? Do you increase sample size?

My previous intuition would have been to increase sample size so that we differentiate the models better, but this seems akin to p-hacking.
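One thing you can already do with the numbers as reported, before deciding whether to collect more runs, is back out the standard errors from the CI half-widths and test the difference in means directly. A sketch using the figures above (Welch-style approximation, scipy assumed):

```python
import math
from scipy import stats

n = 15  # runs per model, as stated above
t_crit = stats.t.ppf(0.975, df=n - 1)

def se_from_ci(lo, hi):
    # 95% t CI half-width = t_crit * SE, so SE = half-width / t_crit
    return (hi - lo) / 2 / t_crit

se_a = se_from_ci(0.9046, 0.9114)  # Model A (baseline)
se_b = se_from_ci(0.9077, 0.9142)  # Model B (modified)

diff = 0.911 - 0.908
se_diff = math.sqrt(se_a**2 + se_b**2)
t_stat = diff / se_diff
p = 2 * stats.t.sf(abs(t_stat), df=2 * (n - 1))  # rough df; Welch df would be slightly lower
print(t_stat, p)  # with these numbers the difference is not significant at 0.05
```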