Meet Mish: New Activation function, possible successor to ReLU?

I think we’re just looking at things from a different perspective.

I agree our objective is to have an estimate and CI for the population mean.

But in the process we use both the sample mean x bar and the sample standard deviation s, which are point estimates of the population mean and population standard deviation respectively.

The reason I brought this up is that using s (which is an estimate of the population SD sigma) rather than sigma itself is why we use t-based CIs and not z-based CIs. This accounts for the chance that we are underestimating sigma by using s (t-based CIs are always a bit wider).
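For concreteness, here is a minimal sketch of the difference in Python (the accuracy numbers are made up; only scipy is needed):

```python
# Minimal sketch: t-based vs z-based 95% CI for a mean, on made-up accuracies.
import numpy as np
from scipy import stats

acc = np.array([0.905, 0.911, 0.908, 0.913, 0.907])  # hypothetical run results
n = len(acc)
xbar, s = acc.mean(), acc.std(ddof=1)  # sample mean and sample SD (estimates of mu and sigma)

t_crit = stats.t.ppf(0.975, df=n - 1)  # wider: accounts for estimating sigma with s
z_crit = stats.norm.ppf(0.975)         # narrower: assumes sigma is known exactly

half_t = t_crit * s / np.sqrt(n)
half_z = z_crit * s / np.sqrt(n)
print("t-based CI:", (xbar - half_t, xbar + half_t))
print("z-based CI:", (xbar - half_z, xbar + half_z))
```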


Mmm, I was just editing as I realised we were talking about slightly different things. Nicely explained.


I tried replacing ReLU with Mish in a pretrained model. A colleague of mine tried it with, I think, a VGG16 or some other small model, replaced ReLU with Mish, tested on CIFAR-10, and got a small accuracy boost. But on a diabetic retinopathy dataset I didn't see any significant change.
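For anyone wanting to try the same swap, it looks roughly like this (a sketch, not the exact code either of us used; VGG16 is just an example, and the pretrained-weights argument may differ across torchvision versions):

```python
# Sketch: define Mish and swap it in for every nn.ReLU in a pretrained model.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Mish(nn.Module):
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))  # mish(x) = x * tanh(softplus(x))

def replace_relu_with_mish(module):
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Mish())
        else:
            replace_relu_with_mish(child)  # recurse into submodules

model = models.vgg16(pretrained=True)  # weights are untouched, only the activations change
replace_relu_with_mish(model)
```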

Oh that’s interesting. I would have assumed replacing ReLU by Mish in a pretrained model would have broken down all the pretraining.

Maybe it could work with Efficientnet given that swish and mish are somewhat close?

Well also we aren’t changing the weights.

I think I remember finding someone else did a similar experiment. I will try to look for it.

I used the pretrained models from @lukemelas's EfficientNet-PyTorch GitHub repo and got a nice bump in accuracy, even beating the EfficientNet paper's B3 result on Stanford Cars: [Project] Stanford-Cars with fastai v1

It's looking super promising with the pretrained B7 too.


Great result!


So, let’s say you ran two models, model A (baseline) and model B (modified), for 100 epochs, 15 times each, and got the following 95% t-based CIs:

Model B: M = 0.911, 95% CI [0.9077, 0.9142]

Model A: M = 0.908, 95% CI [0.9046, 0.9114]

Where do you go from there? Do you publish? Do you increase sample size?

My previous intuition would have been to increase sample size so that we differentiate the models better, but this seems akin to p-hacking.
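If you did want something more formal than eyeballing the overlap, one option is a two-sample t-test; here's a sketch that back-computes the per-run SDs from the CI half-widths quoted above:

```python
# Sketch: recover per-run SDs from the 95% CI half-widths, then run a Welch t-test.
# The means, CI bounds and n = 15 are the summary stats quoted above.
import math
from scipy import stats

n = 15
t_crit = stats.t.ppf(0.975, df=n - 1)

def sd_from_ci(half_width):
    # half_width = t_crit * s / sqrt(n)  =>  s = half_width * sqrt(n) / t_crit
    return half_width * math.sqrt(n) / t_crit

m_b, s_b = 0.911, sd_from_ci((0.9142 - 0.9077) / 2)
m_a, s_a = 0.908, sd_from_ci((0.9114 - 0.9046) / 2)

t_stat, p_value = stats.ttest_ind_from_stats(m_b, s_b, n, m_a, s_a, n, equal_var=False)
print(t_stat, p_value)
```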

Well, it depends on the field.
In psychology, where I'm more familiar(-ish), probably not: that's a pretty small effect size that's likely not significant (or, if we're eschewing t-tests entirely, one you can't be very confident in). In medicine, maybe; I'd guess it would depend on practical significance (other treatment options and so on), but still probably not, though even that small a margin could mean lots of lives in certain cases.

In ML you'd probably delete the analysis, since no one even publishes SDs, make up some elaborate explanation accompanied by a few pages of incomprehensible formulas, and claim a SOTA.
Joking of course, but I'm not sure; the bar for practical significance isn't that clear, and 0.003 is still pretty small, what's that, like 1 SD? It probably depends a bit on the theoretical side. If you just tweaked something, I'd think there's not much chance of publishing (but I could be wrong).


And while it can be a slippery slope, increasing the sample size isn't necessarily wrong. Doing a small run (in some areas you'd even publish that as a pilot study) and then a full-scale study would certainly be common. If you're including the previous runs, that also helps: it's harder to get a significant result just by chance then. Then in some areas you'd regularly remove things from your sample (usually participants there), which can get dicey. It's also more relevant when doing lots of comparisons; with a single t-test there's less room for manipulation (and similarly for a one-to-many comparison like with Mish, where you're looking for Mish to beat all the alternatives, not just any one: it's easy to manipulate things to beat one or two, but not across many). But there are published reports that do something like 20 independent comparisons, say, does living near a factory increase any of these 20 things? Then you're basically statistically guaranteed to find something significant.
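To put a rough number on that last point, here's a quick sketch of the chance of a spurious hit when you run several independent tests at alpha = 0.05:

```python
# Sketch: probability of at least one "significant" result across k independent
# null comparisons at alpha = 0.05 (the family-wise error rate).
alpha = 0.05
for k in (1, 5, 20):
    print(k, 1 - (1 - alpha) ** k)
# 1  -> 0.05
# 5  -> ~0.23
# 20 -> ~0.64, i.e. more likely than not to "find" something by chance alone
```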


So, I put together a CUDA version of Mish; mixed results so far. It's at https://github.com/thomasbrandon/mish-cuda (@Diganta - check the attributions to you are all good; also happy to hand it over to you).
First issue: while it builds, it causes an import error because it can't find a PyTorch function needed for the nice easy way I implemented it. I've posted to the PyTorch forums but no reply yet (if there's still no reply I might raise an issue; I'm partly scared they'll take away the really easy way to write a kernel, as I'm not sure it's meant to be public given the error). I've added a pretty nasty hack that lets it import, on the nasty-hack branch. The missing function is one for potentially overlapping tensors, so it doesn't get called on contiguous tensors, which is what you usually have. But the hack may break all non-contiguous tensors, as it just provides a version of the needed function that raises an exception (ideally only from within the extension, but maybe any time it's imported).

Second issue is that it doesn't train: the loss is all NaN. I've got tests that check the gradients against the standard PyTorch implementation and they pass, except for torch.autograd.gradgradcheck (second derivative), which I think only matters when you do funky stuff (sending requires_grad=True tensors through the backward pass). So I'm not sure what's happening here. I guess it may be the same issue as with the torch.autograd.Function implementation above. I haven't delved into this much; I've only just tried an actual training run.
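For reference, the gradient checks are along these lines (a sketch rather than the repo's actual test file; the MishCuda import name is the one mentioned later in the thread):

```python
# Sketch of the kind of gradient checks described above (not the repo's actual tests).
import torch
from torch.autograd import gradcheck, gradgradcheck
from mish_cuda import MishCuda  # assumed import, per the name used later in the thread

f = MishCuda()
x = torch.randn(8, 4, dtype=torch.double, device='cuda', requires_grad=True)

print(gradcheck(f, (x,)))      # first derivative vs a numerical estimate: passes
print(gradgradcheck(f, (x,)))  # second derivative: the check reported failing above
```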

Third issue is performance is mixed at the moment:

Profiling over 100 runs after 10 warmup runs.
Profiling on GeForce RTX 2070
relu_fwd:      248.4µs ± 1.573µs (235.3µs - 253.7µs)
relu_bwd:      423.6µs ± 59.06µs (416.1µs - 1.011ms)
softplus_fwd:  275.1µs ± 28.11µs (254.7µs - 324.2µs)
softplus_bwd:  423.2µs ± 5.204µs (418.6µs - 434.3µs)
mish_pt_fwd:   797.6µs ± 1.826µs (783.3µs - 803.6µs)
mish_pt_bwd:   1.690ms ± 964.0ns (1.688ms - 1.695ms)
mish_cuda_fwd: 280.6µs ± 2.585µs (260.6µs - 294.7µs)
mish_cuda_bwd: 7.871ms ± 1.251µs (7.867ms - 7.876ms)

mish_pt being the standard PyTorch implementation, here written simply as:

mish_pt = lambda x: x.mul(torch.tanh(F.softplus(x)))

So forward is good: faster than the PyTorch version, around the same as softplus and near ReLU. But backward is horrible. I think that's because I just used the calculations from the autograd function above (straight from the paper), unchanged apart from converting to C++. So something like 6 exps, which are quite slow in CUDA. There is a fast exp instruction in CUDA, but it has limited accuracy (and an especially limited range in which it's accurate). Enabling it globally through a compiler option made all the gradient checks blow up, but I may be able to use it more judiciously. And I should at least be able to use exp(c * inp) == exp(inp)**c to reduce the number of exps. There are also some other things to try to increase the performance of both forward and backward (though not much to optimise without optimising the PyTorch function I use; there is a note in there about an optimisation that might help).
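The exp reduction is just that all the exp(2x)/exp(3x) terms can be built from a single exp(x). A sketch in Python for clarity (the real change would be in the CUDA kernel; w and d are written out here from the paper's closed-form gradient, so worth double-checking against it):

```python
# Sketch: compute exp(x) once and reuse its powers, using exp(c*x) == exp(x)**c.
import math

def mish_grad(x):
    e = math.exp(x)   # single slow exp call
    e2 = e * e        # exp(2x)
    e3 = e2 * e       # exp(3x)
    # w and d as in the backward quoted below (grad_inp = grad_out * exp(inp) * w / d**2);
    # these are my transcription of the paper's formula.
    w = 4 * (x + 1) + 4 * e2 + e3 + e * (4 * x + 6)
    d = 2 * e + e2 + 2
    return e * w / (d * d)
```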

As noted in the repo README, you will need everything set up to compile PyTorch extensions, which can be tricky. There is a Dockerfile if you can run GPU dockers (you need a newish Docker and the NVIDIA Container Toolkit). Or you might be able to build in the Docker image and pull out a wheel (it's Ubuntu based, so there may be issues using wheels from it in conda or on non-Ubuntu systems). If I can resolve the other issues I'll look at a conda package. You should be able to just pip install git+https://github.com/thomasbrandon/mish-cuda@nasty-hack or pip install . from a clone (noting the branch needed for the hack).

EDIT: Just tried it in Colab and it doesn’t work in PyTorch 1.1. You can:

!pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html 
!pip -v install git+https://github.com/thomasbrandon/mish-cuda@nasty-hack

and it works. I’ll look into getting it to work with 1.1. Created a quick notebook.


Thank you @TomB for this. Amazing work. The positive thing is that the CUDA version could bring the forward-pass compute time down to be comparable with ReLU and Softplus. I will take a look at the second issue, related to autograd. I really suggest opening it as an issue in their repository, and could you please point me to the post in the forums?

Here’s the post: https://discuss.pytorch.org/t/cuda-tensor-apply-in-extension-gives-undefined-symbol/56736
Linker errors like this tend to be bad configuration by inexperienced C++ devs (or compiler issues), but I don't think that's the case here.
It's tied to their quite complicated dispatch and declaration-generation system, so it's hard for me to figure out exactly what's wrong and how to fix it. Plus, as noted, I'm not quite sure whether the functions I'm using are actually meant to be available to non-internal code. Yeah, I should raise an issue; a core dev should be able to address it quickly, it just likely hasn't been seen on the forum given the traffic there.

Optimisation might bring down the backward time a little. It's also worth looking into the backwards of tanh/softplus, as presumably, similar to forward, the time should be mainly limited by memory access/CUDA launch overhead rather than computational complexity (as suggested by the forward running about as fast as softplus/tanh, and presumably mul, do individually, while combining all three). The only possible issue is that increased register usage could mean the kernel can't achieve the occupancy of those functions, which would reduce performance.


OK, looking into the training issues, it seems that sometimes it's returning inf/nan and that then throws the whole model off. Here's a semi-successful training run that gets through 2 epochs, then falls apart: https://gist.github.com/thomasbrandon/16915d8a01bcdd9d4b74abbc7cf6638b

At a guess, since you have grad_inp = grad_out * exp(inp) * w / pow(d, 2), if d == 0.0 (within float32 limits) then you'll be dividing by 0. Might need an epsilon in there. Does that seem reasonable? Is there a more sensible correction? (Unlike with tensor ops I should be able to do an if d == 0.0: d = ... or some such fix with minimal performance impact, though a non-conditional add would be slightly faster; either way it's likely dwarfed by memory access, at least once backward is optimised. Forward is highly memory bound; I haven't profiled backward.) Are there any other funky points in exp, like log(0) == inf (showing my very limited maths)?

UPDATE: Adding an epsilon didn't seem to help (assuming I did it right). Debugging it, the issue is in backward: it's giving infinite input gradients. If anyone else is playing around, I made a callback that stops training and keeps the last input/output/loss, so it should hopefully be possible to recreate the problem from that. It doesn't do the detection until the end of backward though (so I'm not retrieving a GPU tensor every batch), so by that point the network state might be screwed; but maybe not, it's hopefully before the weight update, so just bad gradients. I may need to do something about that to properly recreate it (you can return modified gradients from a callback, so I could probably do the detection through GPU-only ops).
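The detection itself doesn't need anything fastai-specific; in plain PyTorch it's roughly this (a sketch, assuming a model and a standard training loop):

```python
# Sketch: after loss.backward(), check for non-finite gradients and bail out,
# so the offending batch's inputs/outputs/loss can be saved for later recreation.
import torch

def has_bad_grads(model):
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
            return True
    return False

# inside the training loop, before optimizer.step():
#   loss.backward()
#   if has_bad_grads(model):
#       ...save inputs/outputs/loss and stop training...
```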


Hey guys, lots to catch up on. I'll finally have some time to play with things this weekend. Do we need to run some 80-epoch tests with the current setup we had before, to measure long training? I can do anywhere between 5 and 10 runs.

I haven't looked at the backward issues any further, but I've now pushed a CPU version (nice and easy with the PyTorch functions I'm using, basically just changing CUDA to CPU in a few places, though at least with my limited C++, and not wanting to put everything in defines, there's a fair bit of copy-pasting). It's still called MishCuda, but it works on CPU tensors now. This should help a little with debugging, not needing to do everything on the GPU, and going forward it means you could run the CPU tests in CI (the tests just use cuda:0 if it's there). Interestingly, it even seems to do half precision on CPU. However, I had to disable the tests for that, because none of the PyTorch ops support half on CPU and I compare against those, so I'd need to compute expected values on GPU and then compare against the actual CPU results.

On the gradients, I'm thinking I might just look at reworking the gradient calculation. Given the performance this would be needed anyway, so it seems sensible to try and borrow from the presumably optimised and numerically stable backwards in PyTorch. I just need to learn some calculus (the guide by Jeremy and Terence Parr is nice if others are looking).
Having a play in sympy I get:
[image: sympy output for the derivative of x * tanh(softplus(x))]
@Diganta - does this seem OK? I think this would perform better: only an exp(x), a log and a tanh, versus all the various exps in the one you give (those sorts of ops I think being the key cost, as they use either slow functions or fast but inaccurate CUDA ops, whereas the other maths is all fast).

The tanh part seems to be along the lines of torch's backward, which uses 1 - tanh(x)**2 (well, 1 - Y**2 where Y is the cached forward output), which is what sympy gives. Though for softplus, torch does 1 - exp(-Y). From sympy:
[image: sympy output for the derivative of softplus]
So I don't quite get what's going on there. Sympy doesn't think they're equal, but then it doesn't think diff(Y) == 1/(1+exp(-x)) either, which I get, so it's only handling basic rearranging.
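If anyone wants to reproduce the sympy side, it's roughly this (a sketch, not the exact session from the screenshots above):

```python
# Sketch of the sympy check described above.
from sympy import symbols, exp, log, tanh, diff, simplify

x = symbols('x', real=True)
sp = log(1 + exp(x))            # softplus
mish = x * tanh(sp)

print(simplify(diff(mish, x)))  # Mish gradient: one exp, a log and a tanh
print(simplify(diff(sp, x)))    # softplus gradient: expect something like exp(x)/(exp(x) + 1),
                                # which equals 1/(1 + exp(-x)) even if simplify
                                # doesn't recognise the two forms as equal
```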


This is fantastic work! Thanks to all of you for this!

I have applied some of the techniques on an autoencoder (Ranger optimizer, flat LR during 70% of the run, Mish) and I got much faster training as well as a lower loss.
In blue, AdamW with OneCycle; in red, Ranger + Mish + flat cycle (it's a 3h training; the x-axis is time and the y-axis MSE).
[image: training curves, AdamW + OneCycle (blue) vs Ranger + Mish + flat cycle (red)]
The best part is the much faster convergence!
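For anyone curious about the schedule shape, a flat-then-anneal LR can be sketched in plain PyTorch like this (just an illustration with made-up names and step counts; the actual runs used the fastai/Ranger setup discussed above):

```python
# Sketch: LR held flat for the first 70% of steps, then cosine-annealed towards 0.
import math
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

total_steps, flat_pct = 1000, 0.7

def flat_then_cosine(step):
    if step < flat_pct * total_steps:
        return 1.0  # keep the base LR flat
    frac = (step - flat_pct * total_steps) / ((1 - flat_pct) * total_steps)
    return 0.5 * (1 + math.cos(math.pi * frac))  # cosine anneal to ~0

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=flat_then_cosine)
# per training step: opt.step(); sched.step()
```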

Next week, I will play with the SimpleSelfAttention layer to see how it helps!
Did you compare the "classical" self-attention to the simple version on Imagenette? I have read the whole thread but I didn't find why you use a simplified version (is it for a better result, or for quicker training and fewer parameters)?


Thank you so much for this. Yes, it seems correct. I'll double-check that again. But it's weird to see sympy taking them to be not equal.

Amazing Work. Can you please share the repository?

Hi Tom,

I’m the author of SimpleSelfAttention, thank you for your interest. You might be interested in the results below:

| Model | Dataset | Image Size | Epochs | Learning Rate | # of runs | Avg (Max Accuracy) | Stdev (Max Accuracy) | Avg Wall Time (# of obs) |
|---|---|---|---|---|---|---|---|---|
| xresnet18 | Imagewoof | 128 | 50 | 8e-3 | 20 | 0.8498 | 0.00782 | 9:37 (4) |
| xresnet18 + simple sa | Imagewoof | 128 | 47 | 8e-3 | 20 | 0.8567 | 0.00937 | 9:28 (4) |
| xresnet18 + original sa | Imagewoof | 128 | 47 | 8e-3 | 20 | 0.8547 | 0.00652 | 11:20 (1) |

So yes, it's definitely quicker as a layer (this is easy to test), with potentially similar results (as shown anecdotally in the table above; my intuition says it should be equivalent or better).

Even though I see a small improvement on Imagewoof/nette for the same run time, I am still not convinced that it is useful for image classification. More work to be done!

But if you are using self-attention somewhere and finding it useful (e.g. SAGAN), I recommend trying the simplified version.

One important thing is that SimpleSelfAttention is O(C^2 * H * W), and therefore not as sensitive to increasing image size as the original layer in terms of complexity. Scroll down to "How does this compare to the original Self Attention layer" in this link.

E.g. for a (minibatch, C, H, W) = (64, 32, 64, 64) input, self-attention takes 2min 29s while SimpleSelfAttention takes 296ms (I might have a faster version coming up).
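As a rough back-of-the-envelope for that example (my reading of the complexities, so worth double-checking against the linked post): the original layer builds an (H*W) x (H*W) attention map, roughly O((H*W)^2 * C) work, versus O(C^2 * H*W) for SimpleSelfAttention:

```python
# Rough operation-count comparison for the (64, 32, 64, 64) example above.
C, H, W = 32, 64, 64
hw = H * W

original = hw * hw * C   # ~O((H*W)^2 * C): (H*W) x (H*W) attention map
simple = C * C * hw      # ~O(C^2 * H*W): C x C attention map
print(original / simple) # ~128x fewer ops per image; the observed wall-clock gap is
                         # even larger, since constants and memory traffic matter too
```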

Let me know if you have any questions and how your tests go!
