Hard-mish activation function

I recently read the MobileNet v3 paper, where they used an approximation of swish called hard-swish: x * ReLU6(x+3)/6. I extended this to mish and found that x * ReLU5(x+3)/5 also seems to be a good approximation of mish. I guess hard-mish would be a good name. Here's the notebook with my workings.

It also made me wonder whether the following activation would be any good: x * ReLU4(x+3)/4.
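All three fit the same pattern, so here is a rough PyTorch sketch of the family (the function names and the generic k parameter are just my own working labels, not anything official; the k = 4 variant above is just hard_act(x, 4)):

import torch

def hard_act(x, k):
    # x * ReLUk(x + 3) / k, where ReLUk clamps its input to [0, k]
    return x * torch.clamp(x + 3, min=0.0, max=float(k)) / k

def hard_swish(x):  # MobileNet v3's hard-swish, k = 6
    return hard_act(x, 6)

def hard_mish(x):   # the proposed approximation of mish, k = 5
    return hard_act(x, 5)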

I haven’t had time to experiment with neural nets, but maybe someone here can try them out.

8 Likes

Thanks for sharing! Any results to share yet?

1 Like

Can you shift the minimum to -1? Also, for the unnamed function curve (marked in yellow) in the 2nd figure, I see a discontinuity at x = 1. Intuitively we want the curve to be as continuous as possible; could the function be made purely smooth, for instance by keeping the yellow-line behaviour for negative x and plain ReLU in the positive domain?
For instance, if f(x) is the yellow line, then maybe we can write it as:

import torch.nn.functional as F

def unnamed(x):
    # f is the yellow-line function above: use it for x < 0, plain ReLU for x >= 0
    return (x < 0).float() * f(x) + F.relu(x)

Thanks. I think the closest to those comments is x * ReLU3(x+3)/3. It's pretty smooth and goes further towards -1. All of the other functions (including hard-swish) actually have a slight kink in the gradient where the ReLU part saturates at the divisor. The equivalent spot for the ReLU3 function is x = 0, but there the gradient doesn't seem to have a discontinuity.
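Concretely, something like this (a quick sketch; the name is just a placeholder), with a numerical check of the gradient either side of x = 0, where the clamp saturates:

import torch

def hard_mish3(x):
    # x * ReLU3(x + 3) / 3, i.e. (x + 3) clamped to [0, 3]
    return x * torch.clamp(x + 3, min=0.0, max=3.0) / 3

# left derivative is (2x + 3)/3 -> 1 and right derivative is 1, so both values should be ~1.0
x = torch.tensor([-1e-3, 1e-3], requires_grad=True)
hard_mish3(x).sum().backward()
print(x.grad)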

I appreciate the comments. My interest is in making faster object detection models, but I haven’t had time to test it yet.

1 Like

I like this "Mish" activation function of yours, but isn't it more expensive to compute than ReLU or ReLU6?

Yes, it is; however, with CUDA optimization we were able to get it faster and closer to ReLU. You can find the CUDA-optimized version of Mish here: https://github.com/thomasbrandon/mish-cuda

2 Likes

I was thinking that SWISH should be more efficient to compute than MISH, and I found this piecewise function based on SWISH which follows MISH closely:

approx_mish(x) =
x * sigmoid(x) for x <= 0 (SWISH with β = 1)
x * sigmoid(2x) for x >= 0 (SWISH with β = 2)
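In PyTorch that piecewise function could look something like this (just a sketch; approx_mish is my own name for it):

import torch

def approx_mish(x):
    # swish with beta = 1 for x <= 0, swish with beta = 2 for x >= 0
    # (both pieces are 0 at x = 0, so the join is continuous)
    return torch.where(x <= 0, x * torch.sigmoid(x), x * torch.sigmoid(2 * x))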

I noticed this after reading the unpublished APTx paper. Their proposed APTx function to approximate MISH turns out to be just SWISH with β = 2; it's not a distinct function, and it's not very close to MISH for negative x, but it was thought-provoking.

The CUDA MISH or hard-mish would likely be a better option.

I also had an idea that it might be possible to change the activation functions of a model after training, by running some epochs while gradually lerping from the old to the new activation function. This might make it possible to do the bulk of training with an expensive activation function but perform inference with a cheaper one such as ReLU or hard-mish. I don't know if such fine-tuning to change activation functions has been tried. We could also try fine-tuning to prune a model down to use fewer neurons.
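Roughly what I mean, as a sketch (LerpActivation is just a made-up name, and it assumes a recent PyTorch that has F.mish):

import torch.nn as nn
import torch.nn.functional as F

class LerpActivation(nn.Module):
    # blends from old_act to new_act as t goes from 0 to 1
    def __init__(self, old_act=F.mish, new_act=F.relu):
        super().__init__()
        self.old_act, self.new_act = old_act, new_act
        self.t = 0.0  # nudge this towards 1.0 over the fine-tuning epochs

    def forward(self, x):
        return (1 - self.t) * self.old_act(x) + self.t * self.new_act(x)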

CUDA MISH is only about 20% slower than ReLU, so I guess it's okay as it is, but perhaps it would be cheaper to use ReLU when running inference on a CPU or a weaker GPU.

Plot of that piecewise SWISH vs MISH:

Plot of the derivatives:

I thought maybe it would be good for an activation function to be linear for all positive x, like ReLU. The following one is quite close to MISH, but linear for positive x. The gradient at the origin is 1, though the function is not smooth there. The minimum is near (-1.023, -0.446), south-east of MISH's minimum near (-1.192, -0.309).

linear_swish(x) =
2x*sigmoid(1.25x) for x <= 0
x for x >= 0
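As a PyTorch sketch (again, linear_swish is just my working name for it):

import torch

def linear_swish(x):
    # 2x * sigmoid(1.25x) for x <= 0, identity for x >= 0
    return torch.where(x <= 0, 2 * x * torch.sigmoid(1.25 * x), x)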

(I'm a noob, so if this all happens to be ignorant nonsense, I apologise!)

1 Like

@Diganta - you might be interested in the above idea from @sswam

Hi,

APTx Activation Function: [i.e. short for Alpha Plus Tanh Times]

φ(x) = (α + tanh(βx)) * γx

The derivative of APTx:

φ'(x) = γ(α + tanh(βx) + βx * sech²(βx))

We can also convert sech² to tanh form [using tanh²(x) + sech²(x) = 1] to further reduce the computation required for the derivative during backpropagation.
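As a rough PyTorch sketch (just illustrative, not the code from the paper), the function and its derivative written in that tanh-only form:

import torch

def aptx(x, alpha=1.0, beta=1.0, gamma=0.5):
    # phi(x) = (alpha + tanh(beta * x)) * gamma * x
    return (alpha + torch.tanh(beta * x)) * gamma * x

def aptx_grad(x, alpha=1.0, beta=1.0, gamma=0.5):
    # phi'(x) = gamma * (alpha + t + beta * x * (1 - t^2)), with t = tanh(beta * x)
    # reuses tanh instead of computing sech^2 separately
    t = torch.tanh(beta * x)
    return gamma * (alpha + t + beta * x * (1 - t * t))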

Published Paper Link: https://doi.org/10.51483/IJAIML.2.2.2022.56-61

The paper shows a positive-part approximation of MISH at α = 1, β = 1 and γ = 1/2, as at that time we were focused on the positive gradients only.
The paper also mentions that even closer overlap between the MISH and APTx derivatives can be obtained by varying the α, β and γ parameters.

One can vary α, β and γ to generate the complete mapping for MISH, i.e. both the positive and negative parts. I think it might work for values α = 1, β = 1/2, and γ = 1/1.9 or γ = 1/1.95.
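For example, a quick comparison sketch with those values (F.mish needs a recent PyTorch; this just prints the two curves at a few points):

import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)
aptx_vals = (1.0 + torch.tanh(0.5 * x)) * (1 / 1.9) * x  # alpha = 1, beta = 1/2, gamma = 1/1.9
print(aptx_vals)
print(F.mish(x))  # reference MISH values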

The advantages/objectives of creating the APTx activation function were:
The function and derivative of APTx require less computation than the expressions used for SWISH and MISH.

  1. Using the expression of APTx speeds up the computation.
  2. Other mappings of APTx can be used by varying α, β and γ for the positive and negative parts [these are still faster than MISH and SWISH to compute].
  3. APTx at α = 1, β = 1 and γ = 1/2 is the same as MISH for the positive part, but its derivative requires less computation than SWISH(x, 2).
  4. The objectives of APTx were: a) lower computing requirements, and b) a generalised function able to approximate MISH and SWISH by varying α, β and γ.
  5. Other mappings can be generated by varying α, β and γ.

By the way, good finding!

APTx with α = 1, β = 1/2 and γ = 1/2 approximates the negative domain of MISH. Its output is the same as SWISH(x, 1) for these values of α, β and γ, but its derivative requires less computation than SWISH(x, 1).

Using APTx one can also generate the SWISH(x, ρ) activation function with parameters α = 1, β = ρ/2 and γ = 1/2, but with less computation in the forward and backward passes.
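This follows from the identity sigmoid(z) = (1 + tanh(z/2)) / 2, which gives:

x * sigmoid(ρx) = x * (1 + tanh(ρx/2)) / 2 = (1 + tanh((ρ/2)x)) * (1/2)x

which is APTx with α = 1, β = ρ/2 and γ = 1/2.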

Thanks @jeremy for tagging me. It's been a while since I last followed this thread.

@sswam Regarding the run-time comparison, you might be interested in this evaluation.
In the end, I feel that while per-epoch run-time / FLOP cost is important, I am more concerned with the total compute budget, i.e. epochs to convergence × run-time per epoch. Even though Mish might be 20% slower per epoch, in most cases it converges earlier than ReLU, usually with better performance. So in my opinion, incremental changes like improving Mish's per-epoch run-time can certainly be beneficial on some hardware; it's just not as significant, since any smooth function built from a composition of functions is by design more expensive than ReLU. But the avenue is certainly open to explore: if you could demonstrate results with your proposed approximation on ImageNet or equivalent large-scale benchmarks, it could definitely be a significant contribution. :slight_smile:

1 Like