Meet Mish: New Activation function, possible successor to ReLU?

Sure! I’ll just run each run in a separate cell. Simple enough :slight_smile:

Maybe you could reproduce mgrankin’s loop to run multiple runs in one cell (less work for you)? Or does it not work with colab?

It seems to cut it off for some odd reason :confused: Also @LessW2020, I was getting tensor size mismatch errors… I switched to using Ralamb’s direct code :frowning:

  • The cut-off is probably due to learn.validate() being called as well
1 Like

Thanks @muellerzr, I’ve got a server now at last and just fixed that comma. Let me see what’s up with the tensor mismatch, but running the direct code sounds faster for now!

@LessW2020 I’ll leave it to you to verify, but I called it early. I may have done something wrong, but with the new fixes it’s actually worse (~59%).

I may have missed something as I’m half rushing before class right now. Let me know how you wind up doing.

1 Like

Thanks @LessW2020 for getting this together so quickly! However, I’m also getting the tensor size mismatch:

    134                     continue
    135                 #at k interval: take the difference of (RAdam params - LookAhead params) * LookAhead alpha param
--> 136                 q.data.add_(self.alpha,p_data_fp32 - q.data)
    137                 #update novo's weights with the interpolated weights
    138                 p.data.copy_(q.data)

RuntimeError: The size of tensor a (512) must match the size of tensor b (3) at non-singleton dimension 3
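
For context, that line is the Lookahead interpolation step. Here’s a minimal sketch of what it’s doing (my own names, not the actual Ranger code); the RuntimeError just means a fast/slow weight pair ended up with mismatched shapes:

    import torch

    def lookahead_update(fast_params, slow_params, alpha=0.5):
        # Sketch of the Lookahead slow-weight step (hypothetical names, not the repo's code).
        # Every k inner steps: slow <- slow + alpha * (fast - slow), then the fast
        # weights are reset to the interpolated slow weights. The error above means
        # a fast/slow pair had different shapes, so the subtraction can't broadcast.
        for p, q in zip(fast_params, slow_params):
            assert p.data.shape == q.data.shape, "fast/slow buffers must line up"
            q.data.add_(p.data - q.data, alpha=alpha)  # interpolate toward fast weights
            p.data.copy_(q.data)                       # reset fast weights to slow weights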

Hi @jwuphysics,
sorry, it’s fixed now - please sync one more time :slight_smile:
That said, I’m running now and not getting impressive results with the changes…super slow, and nothing better than baseline so far.
May revert back to pre-changes.

2 Likes

@LessW2020 @Seb here is the notebook:

Let me know if you see anything inherently wrong (I know it’s not much to go on; I may have made an obvious mistake).

@LessW2020 I saw the same thing!

1 Like

I chose to stay with vanilla Adam+oneCycle because I am not convinced by the other optimizers yet (“Over9000”/RangerLars does better on 5 epochs, but is slower).

Imagewoof 128, 5 epochs:
--bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3

Mish:
[0.658 0.67 0.656 0.644 0.642 0.652 0.65 0.668 0.648 0.632] (10 runs)
mean: 0.6512
stdev: 0.011027233

ReLU: (baseline by mgrankin)
[0.66 0.61 0.616 0.606 0.614 0.628 0.628 0.626 0.62 0.576 0.61 0.608 0.57 0.588 0.628 0.634 0.616 0.584 0.6 0.628] (20 runs)
mean: 0.6125
stdev: 0.020889001

Mish beats ReLU at a high significance level (P < 0.0001).
Obviously we have to see what happens with more epochs.
Another concern is that Mish is a bit slower than ReLU (31 vs. 26 seconds/epoch). Eventually I’d like to see “same runtime” comparisons. Or maybe Mish’s implementation can be improved?
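
For anyone who wants to reproduce that check, here is a quick sketch using Welch’s t-test on the accuracy lists above (the post doesn’t say which test was used, so treat this as one reasonable choice rather than the exact calculation):

    from scipy import stats

    # Accuracy lists copied from the post above.
    mish = [0.658, 0.67, 0.656, 0.644, 0.642, 0.652, 0.65, 0.668, 0.648, 0.632]
    relu = [0.66, 0.61, 0.616, 0.606, 0.614, 0.628, 0.628, 0.626, 0.62, 0.576,
            0.61, 0.608, 0.57, 0.588, 0.628, 0.634, 0.616, 0.584, 0.6, 0.628]

    # Welch's t-test (does not assume equal variances across the two sets of runs).
    t, p = stats.ttest_ind(mish, relu, equal_var=False)
    print(f"t = {t:.3f}, p = {p:.2e}")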

3 Likes

Cool, thanks for the testing @Seb!
I did not test OneCycle with Mish, btw - so far I keep seeing worse results with everything but Adam when using OneCycle.
The flat + anneal schedule outperforms OneCycle in my testing.
Also, I only saw about a 1-second change in epoch time with Mish… but it’s possible the implementation could be done in place, for example, and that should speed it up.
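
For reference, Mish is just x * tanh(softplus(x)), so a naive PyTorch module is a one-liner; the speed gap vs. ReLU likely comes from the extra intermediate tensors and the costlier backward pass, which is why a fused or in-place variant could help (this is a sketch, not necessarily the version used in the repo):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mish(nn.Module):
        # Naive Mish: allocates intermediates and has a pricier backward than ReLU,
        # which is probably where most of the extra per-epoch time comes from.
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))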

1 Like

Just wanted to highlight this p-value :slight_smile: - thanks for putting better stats into our testing here!

Ya, in this case the gap is so big that it was clearly significant…

2 Likes

So the new test (for SOTA) is my vanilla run for 20 epochs, correct? Just so I know what to test tonight :slight_smile: If you’re already doing it, let me know!

Along with the 5-epoch run 10 times? Or do we want 20?

I’ll post whichever into the proper thread

I’m running Adam + Mish for 80 epochs (Imagewoof 128), 3 times. I might have to rerun a baseline too…

You could do 20 epochs. I don’t think we have a baseline result for that either… 5 times might be a good start! If it’s too close we’ll run more.

1 Like

I’ll run it along with a baseline once I’m out of class tonight!

Baseline is just native Adam at 1e-3?

Adam + ReLU, yes, but at 3e-3:
--epochs 20 --bs 64 --lr 3e-3 --mixup 0 --woof 1 --size 128

That’s assuming you’re doing Adam+Mish?

I can do Adam + Mish.

I was originally just doing vanilla Adam with the setup I had before.

Oh I see, I misunderstood.

My suggestion is to have a baseline that is whatever you ran, with ReLU instead of Mish, so that we can isolate the effect of Mish.

1 Like

Got it! I can run that :slight_smile:

Thanks @LessW2020 for the incredible work you’ve been doing this week!

Regarding your runs, I was wondering two things:

  • have you only been training from scratch?
  • if not, had you already replaced ReLU with Mish in the frozen layers before unfreezing all layers?

I’m actually wondering about the potential effects of heterogeneous activations inside a single network. My point being that, if it doesn’t hurt performance, it drastically reduces the need to retrain previous architectures!

Using pretrained models from torchvision and only training the late layers, with the activation replaced in those unfrozen layers, we could benefit from all the pretrained models without having to retrain everything the torchvision team has :sweat_smile:

I will be experimenting on that myself, but just in case you had already walked down that road :slight_smile:
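
In case it’s useful, here’s one way to do that swap: recursively replace nn.ReLU modules, but only inside the blocks you plan to unfreeze (shown with torchvision’s resnet34 and its last stage purely as an example; the Mish class is the naive sketch from earlier in the thread):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    def replace_relu_with_mish(module: nn.Module):
        # Recursively swap every nn.ReLU inside `module` for Mish, in place.
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU):
                setattr(module, name, Mish())
            else:
                replace_relu_with_mish(child)

    model = models.resnet34(pretrained=True)
    # e.g. only swap activations in the last residual stage, which we plan to unfreeze
    replace_relu_with_mish(model.layer4)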

2 Likes