Meet Mish: New Activation function, possible successor to ReLU?

There have been a couple of papers and projects exploring mixed-activation training. In one, the model was trained for 20 epochs with ReLU and then for 15 more with Mish; another project used different activations for different layers. This is all very empirical and doesn’t have any universal justification, but there’s a lot you can experiment with: weight initialization, optimizers, better LR policies. A rough sketch of the ReLU-to-Mish swap is below.
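Here’s a minimal PyTorch sketch of that two-phase schedule. The toy model, data, and training loop are placeholders, not anything from the papers mentioned; the only real mechanics are swapping `nn.ReLU` modules for `nn.Mish` (available in PyTorch 1.9+) between phases. Since activations are stateless, the learned weights carry over unchanged.

```python
import torch
import torch.nn as nn

# Placeholder model and data -- substitute whatever you actually train.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()
x, y = torch.randn(256, 10), torch.randn(256, 1)

def train(model, epochs):
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

def swap_relu_for_mish(module):
    # Recursively replace every ReLU child with Mish in place.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Mish())
        else:
            swap_relu_for_mish(child)

train(model, 20)          # phase 1: ReLU
swap_relu_for_mish(model)
train(model, 15)          # phase 2: Mish
```

The per-layer variant is even simpler: just mix `nn.ReLU()`, `nn.Mish()`, etc. in the model definition itself, one per layer. Note that the optimizer doesn’t need to be rebuilt after the swap, since ReLU and Mish have no learnable parameters.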
