Meet Mish: New Activation function, possible successor to ReLU?

Sure! I’ll just run each run in a separate cell. Simple enough :slight_smile:

Maybe you could reproduce mgrankin’s loop to run multiple runs in one cell (less work for you)? Or does it not work with colab?

It seems to cut it off for some odd reason :confused: Also @LessW2020, I was getting tensor size mismatch errors… I switched to using Ralamb’s direct code :frowning:

  • The cut-off is probably due to learn.validate() being called as well
1 Like

Thanks @muellerzr, I’ve got a server now at last and just fixed that comma. Let me see what’s up with the tensor mismatch, but running the direct code sounds faster for now!

@LessW2020 I’ll leave it to you to verify, but I called it early. I may have done something wrong, but with the new fixes it’s actually worse (~59%).

I may have missed something as I’m half rushing before class right now. Let me know how you wind up doing.

1 Like

Thanks @LessW2020 for getting this together so quickly! However, I’m also getting the tensor size mismatch:

    134                     continue
    135                 #at k interval: take the difference of (RAdam params - LookAhead params) * LookAhead alpha param
--> 136                 q.data.add_(self.alpha,p_data_fp32 - q.data)
    137                 #update novo's weights with the interpolated weights
    138                 p.data.copy_(q.data)

RuntimeError: The size of tensor a (512) must match the size of tensor b (3) at non-singleton dimension 3
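
For context, that line is the Lookahead interpolation step. Here’s a minimal sketch of what it’s doing (my own names, not the actual Ranger code); the RuntimeError just means a fast/slow weight pair ended up with mismatched shapes:

    import torch

    def lookahead_update(fast_params, slow_params, alpha=0.5):
        # Sketch of the Lookahead slow-weight step (hypothetical names, not the repo's code).
        # Every k inner steps: slow <- slow + alpha * (fast - slow), then the fast
        # weights are reset to the interpolated slow weights. The error above means
        # a fast/slow pair had different shapes, so the subtraction can't broadcast.
        for p, q in zip(fast_params, slow_params):
            assert p.data.shape == q.data.shape, "fast/slow buffers must line up"
            q.data.add_(p.data - q.data, alpha=alpha)  # interpolate toward fast weights
            p.data.copy_(q.data)                       # reset fast weights to slow weights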

Hi @jwuphysics,
sorry, it’s fixed now - please sync one more time :slight_smile:
That said, I’m running now and not getting impressive results with the changes…super slow, and nothing better than baseline so far.
May revert back to pre-changes.

2 Likes

@LessW2020 @Seb here is the notebook:

Let me know if you see anything inherently wrong (I know it’s not much to go on; I may have made an obvious mistake).

@LessW2020 I saw the same thing!

1 Like

I chose to stay with vanilla Adam+oneCycle because I am not convinced by the other optimizers yet (“Over9000”/RangerLars does better on 5 epochs, but is slower).

Imagewoof 128, 5 epochs:
--bs 64 --mixup 0 --sa 0 --epoch 5 --lr 3e-3

Mish:
[0.658 0.67 0.656 0.644 0.642 0.652 0.65 0.668 0.648 0.632] (10 runs)
mean: 0.6512
stdev: 0.011027233

ReLU: (baseline by mgrankin)
[0.66 0.61 0.616 0.606 0.614 0.628 0.628 0.626 0.62 0.576 0.61 0.608 0.57 0.588 0.628 0.634 0.616 0.584 0.6 0.628] (20 runs)
mean: 0.6125
stdev: 0.020889001

Mish beats ReLU at a high significance level (P < 0.0001).
Obviously we have to see what happens with more epochs.
Another concern is that Mish is a bit slower than ReLU (31 vs. 26 seconds/epoch). Eventually I’d like to see “same runtime” comparisons. Or maybe Mish’s implementation can be improved?
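
For anyone who wants to reproduce that check, here is a quick sketch using Welch’s t-test on the accuracy lists above (the post doesn’t say which test was used, so treat this as one reasonable choice rather than the exact calculation):

    from scipy import stats

    # Accuracy lists copied from the post above.
    mish = [0.658, 0.67, 0.656, 0.644, 0.642, 0.652, 0.65, 0.668, 0.648, 0.632]
    relu = [0.66, 0.61, 0.616, 0.606, 0.614, 0.628, 0.628, 0.626, 0.62, 0.576,
            0.61, 0.608, 0.57, 0.588, 0.628, 0.634, 0.616, 0.584, 0.6, 0.628]

    # Welch's t-test (does not assume equal variances across the two sets of runs).
    t, p = stats.ttest_ind(mish, relu, equal_var=False)
    print(f"t = {t:.3f}, p = {p:.2e}")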

3 Likes

Cool, thanks for the testing @Seb!
I did not test OneCycle with Mish, btw - so far I keep seeing worse results with everything but Adam when using OneCycle.
The flat + anneal schedule outperforms OneCycle in my testing.
Also, I only saw about a 1-second change in epoch time with Mish… but it’s possible the implementation could be done in place, for example, and that should speed it up.
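
For reference, Mish is just x * tanh(softplus(x)), so a naive PyTorch module is a one-liner; the speed gap vs. ReLU likely comes from the extra intermediate tensors and the costlier backward pass, which is why a fused or in-place variant could help (this is a sketch, not necessarily the version used in the repo):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mish(nn.Module):
        # Naive Mish: allocates intermediates and has a pricier backward than ReLU,
        # which is probably where most of the extra per-epoch time comes from.
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))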

1 Like

Just wanted to highlight this p-value :slight_smile: - thanks for putting better stats into our testing here!

Ya, in this case the gap is so big that it was clearly significant…

2 Likes

So the new test (for SOTA) is my vanilla run for 20 epochs, correct? Just so I know what to test tonight :slight_smile: If you’re already doing it, let me know!

Along with the 5-epoch run 10 times? Or do we want 20?

I’ll post whichever into the proper thread

I’m running Adam + Mish for 80 epochs (Imagewoof 128), 3 times. I might have to rerun a baseline too…

You could do 20 epochs. I don’t think we have a baseline result for that either… 5 times might be a good start! If it’s too close we’ll run more.

1 Like

I’ll run it along with a baseline once I’m out of class tonight!

Baseline is just native Adam at 1e-3?

Adam + ReLU, yes, but at 3e-3:
--epochs 20 --bs 64 --lr 3e-3 --mixup 0 --woof 1 --size 128

That’s assuming you’re doing Adam+Mish?

I can do Adam + Mish.

I was originally just doing vanilla Adam with the setup I had before.

Oh I see, I misunderstood.

My suggestion is to have a baseline that is whatever you ran, with ReLU instead of Mish, so that we can isolate the effect of Mish.

1 Like

Got it! I can run that :slight_smile:

Thanks @LessW2020 for the incredible work you’ve been doing this week!

Regarding your runs, I was wondering two things:

  • have you only been training from scratch?
  • if not, had you already replaced ReLU with Mish in the frozen layers before unfreezing all layers?

I’m actually wondering about the potential effects of heterogeneous activations inside a single network. My point being that, if it doesn’t hurt performance, it drastically reduces the need to retrain previous architectures!

Using pretrained models from torchvision and only training the late layers, with the activation replaced in those unfrozen layers, we could benefit from all the pretrained models without having to retrain everything the torchvision team has :sweat_smile:

I will be experimenting on that myself, but just in case you had already walked down that road :slight_smile:
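
In case it’s useful, here’s one way to do that swap: recursively replace nn.ReLU modules, but only inside the blocks you plan to unfreeze (shown with torchvision’s resnet34 and its last stage purely as an example; the Mish class is the naive sketch from earlier in the thread):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))

    def replace_relu_with_mish(module: nn.Module):
        # Recursively swap every nn.ReLU inside `module` for Mish, in place.
        for name, child in module.named_children():
            if isinstance(child, nn.ReLU):
                setattr(module, name, Mish())
            else:
                replace_relu_with_mish(child)

    model = models.resnet34(pretrained=True)
    # e.g. only swap activations in the last residual stage, which we plan to unfreeze
    replace_relu_with_mish(model.layer4)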

2 Likes