Meet Mish: New Activation function, possible successor to ReLU?

Those simple implementations use more memory because intermediates need to be saved for the backward pass. Either the autograd function implementation as in fastai2 here or my CUDA implementations will use similar memory to ReLU (possibly a bit more than the in-place version of ReLU, but it shouldn't make much of a difference).
The comparison notebook I made for Swish shows the difference in memory usage between the various implementations.
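For anyone curious what the custom autograd approach looks like, here is a minimal sketch of a memory-lean Mish. This is not the fastai2 or CUDA code referenced above; `MishFunction` and `mish` are illustrative names, and the idea shown is simply saving only the input and recomputing the softplus/tanh intermediates in the backward pass instead of storing them:

```python
import torch
import torch.nn.functional as F

class MishFunction(torch.autograd.Function):
    """Mish as a custom autograd Function: only the input tensor is saved
    for backward (similar to ReLU); intermediates are recomputed."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)              # save only x, not softplus/tanh
        return x * torch.tanh(F.softplus(x))  # mish(x) = x * tanh(softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        sp = F.softplus(x)                    # recomputed rather than stored
        tsp = torch.tanh(sp)
        # d/dx [x * tanh(softplus(x))] = tanh(sp) + x * sigmoid(x) * (1 - tanh(sp)^2)
        grad = tsp + x * torch.sigmoid(x) * (1 - tsp * tsp)
        return grad_output * grad

def mish(x):
    return MishFunction.apply(x)
```

The trade-off is the usual one: a naive `x * torch.tanh(F.softplus(x))` keeps the intermediate tensors alive for backward, while this version spends a little extra compute in the backward pass to recompute them and keep peak memory close to ReLU's.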
