I recently read the MobileNetV3 paper, which uses an approximation of swish called hard-swish: x * ReLU6(x + 3) / 6. I extended this to mish and found that x * ReLU5(x + 3) / 5 also seems to be a good approximation of mish. I guess hard-mish would be a good name. Here’s the notebook with my workings.
It also made me wonder whether the following activation would be any good: x * ReLU4(x + 3) / 4.
I haven’t had time to experiment with neural nets, but maybe someone here can try them out.
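In case it helps anyone who wants to try these out, here is a minimal plain-Python sketch of the family described above. It assumes ReLUk(z) means min(max(z, 0), k), by analogy with ReLU6; the names `relu_k` and `hard_family` are made up for illustration and do not come from the notebook.

```python
import math


def relu_k(z, k):
    """ReLU capped at k: min(max(z, 0), k). With k=6 this is the usual ReLU6."""
    return min(max(z, 0.0), k)


def hard_family(x, k):
    """x * ReLU_k(x + 3) / k. k=6 is hard-swish, k=5 the proposed hard-mish."""
    return x * relu_k(x + 3.0, k) / k


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


# Print the smooth functions next to their piecewise approximations.
for x in (-3.0, -1.5, 0.0, 1.0, 3.0):
    print(
        f"x={x:+.1f}  swish={swish(x):+.4f}  hard-swish={hard_family(x, 6.0):+.4f}  "
        f"mish={mish(x):+.4f}  hard-mish={hard_family(x, 5.0):+.4f}"
    )
```

Passing k=4 to `hard_family` gives the unnamed x * ReLU4(x + 3) / 4 variant.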
Thanks for sharing! Any results to share yet?
Can you shift the minimum to -1? Also, for the unnamed function (the yellow curve) in the 2nd figure, I see a discontinuity in the slope at x = 1. Intuitively, we want the curve to be as smooth as possible. Can the function be made completely smooth in the positive domain, for instance by keeping the yellow curve only for negative inputs and using plain ReLU in the positive domain?
For instance if f(x) = yellow line, then maybe we can write it as:
return (x < 0).float() * f(x) + (x > 0).float() * F.relu(x)
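A scalar, plain-Python sketch of the same splice, assuming the yellow line is x * ReLU4(x + 3) / 4 and that ReLU4(z) means min(max(z, 0), 4). The names `yellow` and `spliced` are hypothetical, used only for illustration.

```python
def relu_k(z, k):
    """ReLU capped at k: min(max(z, 0), k)."""
    return min(max(z, 0.0), k)


def yellow(x):
    """Assumed form of the yellow curve: x * ReLU4(x + 3) / 4."""
    return x * relu_k(x + 3.0, 4.0) / 4.0


def spliced(x):
    """Yellow curve for negative inputs, identity (ReLU) for the rest."""
    return yellow(x) if x < 0 else max(x, 0.0)


# Compare the original curve with the spliced version at a few points.
for x in (-3.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x={x:+.1f}  yellow={yellow(x):+.4f}  spliced={spliced(x):+.4f}")
```

Both pieces are 0 at x = 0, so the value does not jump there. The slope of x * (x + 3) / 4 at 0 is 3/4 while ReLU has slope 1, so this splice moves the change in slope from x = 1 to x = 0 rather than removing it.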
Thanks. I think the closest to those comments is x * ReLU3(x + 3) / 3. It’s pretty smooth and its minimum goes further towards -1. All of the other functions (including hard-swish) actually have a slight discontinuity in slope at the point where the ReLU part equals the divisor. The equivalent point for the ReLU3 function is x = 0, but there the slope doesn’t seem to jump.
I appreciate the comments. My interest is in making faster object detection models, but I haven’t had time to test it yet.
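The smoothness claim about the ReLU3 variant can be checked numerically. This is a rough sketch using one-sided finite differences at the point where the capped ReLU saturates, again assuming ReLUk(z) = min(max(z, 0), k); `hard_family` and `one_sided_slopes` are illustrative names, not anything from the notebook.

```python
def relu_k(z, k):
    """ReLU capped at k: min(max(z, 0), k)."""
    return min(max(z, 0.0), k)


def hard_family(x, k):
    """x * ReLU_k(x + 3) / k."""
    return x * relu_k(x + 3.0, k) / k


def one_sided_slopes(k, h=1e-6):
    """Finite-difference slopes just left and right of the saturation point."""
    x0 = k - 3.0  # ReLU_k(x + 3) reaches its cap k here
    f0 = hard_family(x0, k)
    left = (f0 - hard_family(x0 - h, k)) / h
    right = (hard_family(x0 + h, k) - f0) / h
    return left, right


for k in (6.0, 5.0, 4.0, 3.0):
    left, right = one_sided_slopes(k)
    # The quadratic piece x * (x + 3) / k has its vertex at x = -1.5.
    minimum = hard_family(-1.5, k)
    print(f"k={k:.0f}  left slope={left:.4f}  right slope={right:.4f}  min={minimum:+.4f}")
```

Analytically, the left slope at the saturation point is 2 - 3/k and the right slope is 1, so the two agree only for k = 3. The minimum is -2.25/k, which gives -0.75 for k = 3: closer to -1 than the others, but not all the way there.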
I like your “Mish” activation function, but isn’t it more expensive to compute than ReLU or ReLU6?
Yes, it is. However, with CUDA optimization we were able to make it faster, bringing its speed much closer to ReLU. You can find the CUDA-optimized version of Mish here: https://github.com/thomasbrandon/mish-cuda