Meet Mish: New Activation function, possible successor to ReLU?

I tried this by mistake; at least for Mish, the performance drop is quite drastic.

2 Likes

What did you try exactly? I might be missing something.

I trained with Mish and then ran inference with ReLU activation instead.

1 Like

Right. Okay

Found this post about a 4th-place solution to Kannada MNIST using Over9000 and Mish:

3 Likes

I was recently trying to train ImageNet on a p3.16xlarge instance (8 V100 GPUs) using a ShuffleNet, but I have no clue why it’s so slow. Fifteen minutes into training, not even a single epoch has completed, and the connection times out because the instance is flagged as idle. Any clue what I might be doing wrong?

I updated CUDA to 10.2 and increased the idle-instance timeout.

Still the same result.

Repository - https://github.com/LandskapeAI/ImageNet

Somewhat big news: a custom ResNet-9 using the Mish activation function beat the fastest CIFAR-10 training time to 94% accuracy on the Stanford DAWN benchmark. Previous best time - 28 seconds. New time - 10.7 seconds. Find the PR here - https://github.com/stanford-futuredata/dawn-bench-entries/pull/124 . Official results (not updated yet) - https://dawn.cs.stanford.edu/benchmark/#cifar10-train-time

10 Likes

Hi, I’m the one who made that submission. I’m surprised someone found it so quickly; I only submitted the PR yesterday.

Mish was super helpful. It helped shave approximately 2 seconds off the final training time, bringing my 4-GPU timing close to the 8-GPU timing submitted by Apple (which was 9.x seconds, I think). So if both Apple’s PR and mine get merged, mine won’t be the fastest overall, but it will be the fastest on 4 GPUs. Unfortunately, I don’t have access to 8-GPU nodes on HAL (the computing cluster I used for training), but I’m trying out a few more things to speed it up, like the CUDA version of Mish and maybe a different optimizer.

Note: most of the work was done by David Page and Apple. I tweaked a few hyperparameters and ran the training on different hardware.

5 Likes

Hey,
First off, thanks for giving Mish a try; I’m glad you got good results with it. I wasn’t aware of the Apple submission’s timing since it hadn’t been updated on the DAWN website, but being the fastest on 4 GPUs is remarkable in itself.
Did you use the CUDA implementation of Mish by @TomB?

Nope, not in this submission. I used the JIT version that’s in fastai. I’ll probably try @TomB’s CUDA implementation sometime soon, though.
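
For anyone curious, the jit-scripted version looks roughly like this (a sketch in the spirit of the fastai code, not necessarily the exact implementation):

```python
import torch
from torch import nn
import torch.nn.functional as F

# Sketch of a jit-scripted Mish; not necessarily the exact code in fastai.
@torch.jit.script
def mish_jit(x):
    # mish(x) = x * tanh(softplus(x))
    return x.mul(torch.tanh(F.softplus(x)))

class MishJit(nn.Module):
    def forward(self, x):
        return mish_jit(x)
```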

1 Like

@Diganta It got merged. https://dawn.cs.stanford.edu/benchmark/#cifar10-train-time

3 Likes

A ResNet-50 with Mish trained on ImageNet is now available to download from my repository. Find the link here - https://github.com/digantamisra98/Mish#imagenet-scores

4 Likes

DarkNet 53 + Mish on ImageNet is now available to download - https://github.com/digantamisra98/Mish#imagenet-scores

4 Likes

Is there a version of mxresnet that can work with unet_learner in fastai v1?

Faster Approximation for Mish? https://github.com/digantamisra98/Mish/issues/22

2 Likes

As tmassingham-ont points out in the linked thread, it is an identity rather than an approximation. But of course, whether the results are indeed “identical” is hard to tell, considering all the floating-point operations and the different platforms involved.

Anyway, preliminary results with naive code show it is faster on the CPU, but slower than the original code and the CUDA version on the GPU.

However, tmassingham-ont pointed out that the code might be accelerated with @torch.jit.script.

Could anybody suggest any further potential optimizations to the code?
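
For concreteness, if the equivalent form in question is the usual algebraic rewrite of tanh(softplus(x)), a @torch.jit.script version of it might look like this (my own sketch, not the exact code from the issue):

```python
import torch

# Sketch of the algebraically equivalent form, wrapped in @torch.jit.script as
# suggested. It assumes the identity
#   tanh(softplus(x)) = ((1 + e^x)^2 - 1) / ((1 + e^x)^2 + 1),
# which follows from tanh(y) = (e^(2y) - 1) / (e^(2y) + 1) with y = ln(1 + e^x).
@torch.jit.script
def mish_equivalent(x):
    n = (torch.exp(x) + 1.0) ** 2  # (1 + e^x)^2
    return x * (n - 1.0) / (n + 1.0)

# Caveat: unlike F.softplus (which switches to a linear branch above a threshold),
# exp(x) overflows for large positive x in low precision, so a production version
# would need a similar threshold trick.
```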

1 Like

I missed that it was an identity rather than an approximation. My bad. I think fastai v2 uses the torch.jit decorator for the function, so maybe @muellerzr can comment more on that?

2 Likes

Yes, using @torch.jit.script (similar to how Mish is defined) in the v2 library is how you’d want to go about that :slight_smile: (I wish I had the time to test it out, but I don’t). Look here if you want to see how it’s done in fastai2:
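
For reference, the pattern there is a jit-scripted forward and backward wrapped in a custom autograd Function. A sketch in that spirit (not the exact fastai2 code):

```python
import torch
import torch.nn.functional as F

@torch.jit.script
def _mish_jit_fwd(x):
    return x.mul(torch.tanh(F.softplus(x)))

@torch.jit.script
def _mish_jit_bwd(x, grad_output):
    x_sigmoid = torch.sigmoid(x)
    x_tanh_sp = torch.tanh(F.softplus(x))
    # d/dx [x * tanh(softplus(x))] = tanh_sp + x * sigmoid(x) * (1 - tanh_sp^2)
    return grad_output.mul(x_tanh_sp + x * x_sigmoid * (1 - x_tanh_sp * x_tanh_sp))

class MishJitAutoFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Only the input is saved; intermediates are recomputed in backward.
        ctx.save_for_backward(x)
        return _mish_jit_fwd(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return _mish_jit_bwd(x, grad_output)
```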

2 Likes

Hi @TomB, I’m trying to use the Mish activation with MobileNetV2. The model implementation is borrowed from pytorchcv; it works fine with the ReLU6 activation but throws a
CUDA: Out of memory
error for Mish, Swish, and H-Swish (I haven’t tried any others). I’ve created an issue about this on my repository.
I’m using Google Colab with fastai 1.0.61dev0, and I have tried torch.cuda.empty_cache() and restarted the runtime as well, but the error still persists for everything except ReLU6.

The code producing this error can be found in experiments.ipynb

Please help me in this regard.

Those simple implementations use more memory because intermediates need to be saved for the backward pass. Either the autograd Function implementations as in fastai2 here, or my CUDA implementations, will use a similar amount of memory to ReLU (possibly a bit more than the inplace version of ReLU, but this shouldn’t make much difference).
The comparison notebook I made for Swish shows the difference in memory usage between the various implementations.
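
If you want to check the memory usage yourself, a rough probe along these lines works (an illustrative helper only; it assumes CUDA is available, and act_fn can be a plain lambda for the naive version or something like the MishJitAutoFn sketch above):

```python
import torch
import torch.nn.functional as F

def peak_activation_memory(act_fn, size=(64, 256, 56, 56)):
    """Rough peak-memory probe for an activation function (illustrative only)."""
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(size, device="cuda", requires_grad=True)
    act_fn(x).sum().backward()  # forward + backward pass
    return torch.cuda.max_memory_allocated() / 1024 ** 2  # MiB

# e.g. compare:
# peak_activation_memory(lambda x: x * torch.tanh(F.softplus(x)))  # naive Mish
# peak_activation_memory(MishJitAutoFn.apply)                      # autograd Function version
```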

3 Likes