So, I put together a CUDA version of Mish, with mixed results so far. It’s at https://github.com/thomasbrandon/mish-cuda (@Diganta - please check the attributions to you are all good; I’m also happy to hand it over to you).
The first issue is that, while it builds, it causes an import error: it can’t find a PyTorch function needed for the nice easy way I implemented the kernel. I’ve posted to the PyTorch forums but no reply yet (if there’s no reply I might raise an issue; I’m partly scared they’ll take away the really easy way to do a kernel, as I’m not sure it’s meant to be public given the error). I’ve added a pretty nasty hack that lets it import, in the nasty-hack branch. The missing function handles potentially overlapping tensors, so it doesn’t get called on contiguous tensors, which is what you usually have. But the hack may break all non-contiguous tensors, as it just provides a version of the needed function that raises an exception (ideally only from within the extension, but maybe any time it’s imported).
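To illustrate what “potentially overlapping” means here, a minimal example (expand() is just one way to get an internally overlapping, non-contiguous tensor; typical contiguous tensors never hit that code path):

```python
import torch

# Contiguous tensors (the common case) never trigger the overlap handling.
x = torch.randn(10, 5)
print(x.is_contiguous())  # True

# expand() yields a view whose elements share memory (stride 0), i.e. an
# internally overlapping, non-contiguous tensor.
y = torch.randn(10, 1).expand(10, 5)
print(y.is_contiguous())  # False
print(y.contiguous().is_contiguous())  # True: a copy is safe again
```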
The second issue is that it doesn’t train: the loss is all NaN. I’ve got tests that check the gradients against the standard PyTorch implementation, and they pass except for torch.autograd.gradgradcheck (second derivative), which I think only matters when you do funky stuff (sending requires_grad=True tensors through the backward pass). So I’m not sure what’s happening here. I guess it may be the same issue as with the torch.autograd.Function implementation above. I haven’t delved into this much; I’ve only just tried an actual training run.
The third issue is that performance is mixed at the moment:
Profiling over 100 runs after 10 warmup runs.
Profiling on GeForce RTX 2070
relu_fwd: 248.4µs ± 1.573µs (235.3µs - 253.7µs)
relu_bwd: 423.6µs ± 59.06µs (416.1µs - 1.011ms)
softplus_fwd: 275.1µs ± 28.11µs (254.7µs - 324.2µs)
softplus_bwd: 423.2µs ± 5.204µs (418.6µs - 434.3µs)
mish_pt_fwd: 797.6µs ± 1.826µs (783.3µs - 803.6µs)
mish_pt_bwd: 1.690ms ± 964.0ns (1.688ms - 1.695ms)
mish_cuda_fwd: 280.6µs ± 2.585µs (260.6µs - 294.7µs)
mish_cuda_bwd: 7.871ms ± 1.251µs (7.867ms - 7.876ms)
Here mish_pt is the standard PyTorch implementation, though just as:

import torch
import torch.nn.functional as F

mish_pt = lambda x: x.mul(torch.tanh(F.softplus(x)))
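For context, timings like the above can be collected with CUDA events; a rough sketch of the methodology (not the actual harness in the repo):

```python
import torch

def profile_fn(fn, inp, warmup=10, runs=100):
    # Warm up, then time each run with CUDA events so the GPU work is
    # actually measured rather than just the kernel launch.
    for _ in range(warmup):
        fn(inp)
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(inp)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return times
```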
So forward is good: faster than PyTorch, around the same as softplus, and near relu. But backward is horrible. I think that’s because I just used the calculations from the autograd function above (direct from the paper) unchanged, except for converting them to C++. So there are something like six exps, which are quite slow in CUDA. There is a fast exp instruction in CUDA, but it has limited accuracy (and an especially limited range in which it’s accurate). Enabling it globally through a compiler option made all the gradient checks blow up, but I may be able to use it more judiciously. And I should at least be able to use exp(c * inp) == exp(inp)**c to reduce the number of calls. There are some other things to try to increase the performance of both forward and backward (not much to optimise, though, without optimising the PyTorch function I use; there is a note in there about an optimisation that might help).
As noted in the repo readme, you will need to have everything set up to compile PyTorch extensions, which can be a trick. There is a Dockerfile if you can run GPU dockers (you need a newish Docker and the NVIDIA Container Toolkit). Or you might be able to build in the docker and pull out a wheel (it’s an Ubuntu-based docker, so there may be issues using wheels from it in conda or on non-Ubuntu systems). If I can resolve the other issues I’ll look at a conda package. You should be able to just pip install git+https://github.com/thomasbrandon/mish-cuda@nasty-hack, or pip install . from a clone (noting the needed branch for the hack).
EDIT: Just tried it in Colab and it doesn’t work in PyTorch 1.1. You can:
!pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
!pip -v install git+https://github.com/thomasbrandon/mish-cuda@nasty-hack
and it works. I’ll look into getting it to work with 1.1. Created a quick notebook.