It works!!! Pushed a working version (still on the nasty-hack branch).
I think the main issue was that exp quickly overflows:
>>> import numpy as np
>>> for dtype in [np.float16, np.float32, np.float64]:
...     max_x = np.nonzero(np.exp(np.arange(0, 1000, 10, dtype=dtype)) == np.inf)[0][0] * 10
...     print(f"{dtype.__name__}: exp({max_x}) == inf")
float16: exp(20) == inf
float32: exp(90) == inf
float64: exp(710) == inf
It now uses the tricks from PyTorch’s softplus to maintain stability: computing the derivative as 1-exp(-Y) and just passing the input through when it’s above a threshold (20 by default). There’s a Derivatives notebook in extras that goes through it if anyone with more maths than me wants to verify.
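For anyone following along, a minimal NumPy sketch of that kind of trick (the threshold value and function names here are mine, not the actual kernel code):

import numpy as np

def stable_softplus(x, threshold=20.0):
    # Above the threshold, softplus(x) = log(1 + exp(x)) equals x to within float
    # precision, so just pass the input through instead of letting exp overflow.
    safe = np.minimum(x, threshold)  # keep exp() in range on the other branch
    return np.where(x > threshold, x, np.log1p(np.exp(safe)))

def stable_softplus_grad(y, x, threshold=20.0):
    # The derivative is sigmoid(x), rewritten as 1 - exp(-Y) in terms of the
    # output Y = softplus(x); exp(-Y) can't overflow because Y >= 0.
    return np.where(x > threshold, 1.0, 1.0 - np.exp(-y))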
Passes all tests except the second-derivative one, which still fails. I’m thinking I’m not following something needed for that to pass; the autograd-function version built from PyTorch ops passes.
A stability check that previously failed within the first iteration or two now runs 1000 iterations fine (random inputs each iteration).
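Roughly what those checks look like (my own sketch using torch.autograd, not the actual test file; the shapes, iteration count and helper names are placeholders):

import torch
from torch.autograd import gradcheck, gradgradcheck

def stability_check(fn, iters=1000, shape=(64, 128), device="cuda"):
    # Fresh random inputs each iteration; forward and backward must stay finite.
    for _ in range(iters):
        x = torch.randn(*shape, device=device, requires_grad=True)
        y = fn(x)
        y.sum().backward()
        assert torch.isfinite(y).all(), "forward produced inf/nan"
        assert torch.isfinite(x.grad).all(), "backward produced inf/nan"

def derivative_checks(fn, device="cuda"):
    # gradcheck exercises the first derivative, gradgradcheck the second
    # (the one still failing); both want double-precision inputs.
    x = torch.randn(4, 8, device=device, dtype=torch.float64, requires_grad=True)
    assert gradcheck(fn, (x,))
    assert gradgradcheck(fn, (x,))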
And as a bonus the performance is MUCH better:
relu_fwd: 248.6µs ± 1.617µs (234.4µs - 251.1µs)
relu_bwd: 421.8µs ± 44.19µs (416.2µs - 861.5µs)
softplus_fwd: 305.4µs ± 26.03µs (254.5µs - 321.4µs)
softplus_bwd: 427.0µs ± 4.278µs (419.1µs - 433.9µs)
mish_pt_fwd: 795.8µs ± 1.882µs (780.4µs - 801.0µs)
mish_pt_bwd: 1.691ms ± 808.9ns (1.689ms - 1.692ms)
mish_cuda_fwd: 281.2µs ± 2.849µs (260.0µs - 292.4µs)
mish_cuda_bwd: 494.3µs ± 1.470µs (491.4µs - 497.4µs)
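(For anyone wanting to reproduce numbers like these, here is a rough way to time a single op with CUDA events; this is my own sketch, not the benchmark script used above.)

import torch

def time_op(fn, x, iters=100, warmup=10):
    # Per-call mean in microseconds, measured with CUDA events and explicit
    # synchronisation so GPU kernel time is what actually gets measured.
    for _ in range(warmup):
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time() is in ms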
Real-world performance on the network I was using before (7 layers of conv/actn with more features and less aggressive strides than typical to emphasise activation performance; a rough sketch is below the results) is the same as ReLU, just going by epoch time. And while I wasn’t at all designing that network for accuracy (notably no BN), it topped that too. One run, no SD etc., so definitely more tests needed, but final topk:
ReLU: 91.92%
Mish PyTorch: 94.40%
Mish CUDA: 95.03%
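Purely for illustration, the sort of network described above; the widths, strides and head here are my guesses, not the actual model:

import torch.nn as nn

def conv_act_net(act_cls, num_classes=10,
                 widths=(64, 128, 128, 256, 256, 512, 512)):
    # 7 conv/activation blocks, wider and with gentler strides than typical so
    # the activation dominates runtime; deliberately no BatchNorm.
    layers, in_c = [], 3
    for i, out_c in enumerate(widths):
        stride = 2 if i in (1, 3, 5) else 1
        layers += [nn.Conv2d(in_c, out_c, 3, stride=stride, padding=1), act_cls()]
        in_c = out_c
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_c, num_classes)]
    return nn.Sequential(*layers)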
That’s without any optimisation. It’s likely largely bound by memory access, so there may not be much scope for improvement, but we’ll see.
Challenge accepted
Musing on optimisation:
Currently on backward I recalculate the forward from the input. Many ops in PyTorch calculate based on the output, which may be better; or it may be better to stash an intermediate and calculate the gradient from that. One issue I have there is that I can’t see how to numerically stabilise the exp(inp)+1 that pops up in possible simplifications. I currently just use the stable 1-exp(-Softplus(inp)) trick from PyTorch, but avoiding calculating softplus again might help performance, e.g. taking that as my intermediate to calculate from, but then I don’t know how to get exp(inp)+1 stably while maintaining the derivative. Maybe someone who actually knows the maths can help here.
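For concreteness, here is the current recompute-from-input backward in NumPy terms (a sketch of the maths only, not the CUDA kernel; threshold and names are mine):

import numpy as np

THRESHOLD = 20.0

def softplus(x):
    # Stable softplus: pass the input through above the threshold.
    return np.where(x > THRESHOLD, x, np.log1p(np.exp(np.minimum(x, THRESHOLD))))

def mish_fwd(x):
    # mish(x) = x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

def mish_bwd(grad_out, x):
    # Recompute everything from the input, as the kernel currently does.
    sp = softplus(x)
    tsp = np.tanh(sp)
    sig = np.where(x > THRESHOLD, 1.0, 1.0 - np.exp(-sp))  # sigmoid(x) = 1 - exp(-softplus(x))
    # d/dx [x * tanh(sp)] = tanh(sp) + x * (1 - tanh(sp)^2) * sigmoid(x)
    return grad_out * (tsp + x * sig * (1.0 - tsp * tsp))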