Meet Mish: New Activation function, possible successor to ReLU?

@LessW2020 this was a question asked in the other forum post: did we try LAMB yet? Or LAMB + LookAhead? (Or can we run that test briefly?)

1 Like

Generally you’ll need 100+ epochs for those augmentations to help much, in my experience.

5 Likes

Hi - we didn’t test LAMB yet. That was because I thought LAMB was just LARS optimized for larger batches…however, it seems to be a slight improvement over LARS (a rough sketch of the trust-ratio logic follows the list below):

  1. If the numerator (r₁ below) or denominator (r₂ below) of the trust ratio is 0, then use 1 instead.
  2. Fixing weight decay: in LARS, the denominator of the trust ratio is |∇L| + β |w|, whereas in LAMB it’s |∇L + β w|. This preserves more information.
  3. Instead of using the SGD update rule, they use the Adam update rule.
  4. Clip the trust ratio at 10.
    Source for the above:
    https://towardsdatascience.com/an-intuitive-understanding-of-the-lamb-optimizer-46f8c0ae4866
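To make those four points concrete, here’s a rough sketch of a single LAMB-style parameter step. This is my own illustration, not the paper’s reference code; `adam_update` stands in for the usual Adam term m_hat / (sqrt(v_hat) + eps), and `beta` is the weight-decay coefficient:

```python
import torch

def lamb_style_step(p, adam_update, lr, beta=0.01, clip=10.0):
    # Point 2: weight decay goes *inside* the norm: |adam_update + beta*w|,
    # rather than |adam_update| + beta*|w| as in LARS.
    update = adam_update + beta * p.data
    r1 = p.data.norm()        # numerator of the trust ratio (weight norm)
    r2 = update.norm()        # denominator of the trust ratio (update norm)
    if r1 == 0 or r2 == 0:    # Point 1: fall back to a trust ratio of 1
        trust_ratio = 1.0
    else:
        trust_ratio = min((r1 / r2).item(), clip)   # Point 4: clip at 10
    # Point 3: the update direction comes from Adam, scaled by the trust ratio.
    p.data.add_(update, alpha=-lr * trust_ratio)
```

A real implementation would loop over param groups and maintain the Adam moments itself, but this shows where each of the four tweaks lands.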

So let me get it up and running and will test in an hour or so!

4 Likes

Thanks Less!!!

I see that you two have kept up with the awesome work, congratulations!

Regarding the clipping, I recall we had a discussion about whether we should clip only the numerator r1 or the full trust ratio (empirically, if I remember correctly, clipping the numerator performed better). But regarding the weight decay fix, I disagree that it was new in LAMB.

In the LARS paper, the authors mention it; they just give a simple illustration with SGD, and |∇L + β w| is bounded above by |∇L| + β |w| (with β positive, equality is only possible if the two tensors are colinear). But I can confirm it performed better on my training runs :slight_smile:
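Just to make the norm comparison concrete, here’s a quick toy check (tensors and numbers are made up, purely illustrative) that the LAMB-style norm is always bounded by the LARS-style sum, with equality only when the tensors are colinear and point the same way:

```python
import torch

torch.manual_seed(0)
g, w = torch.randn(1000), torch.randn(1000)   # stand-ins for grad and weights
beta = 0.01

lamb_norm = (g + beta * w).norm()             # |grad + beta*w|
lars_norm = g.norm() + beta * w.norm()        # |grad| + beta*|w|
print(bool(lamb_norm <= lars_norm))           # True, by the triangle inequality

# Equality only when the tensors are colinear (same direction):
w_colinear = 3.0 * g
print(torch.isclose((g + beta * w_colinear).norm(),
                    g.norm() + beta * w_colinear.norm()).item())  # True
```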

Keep up the good work!

EDIT: typo on the tensor condition for norm equality

2 Likes

I ran quick tests on Lamb, RangerLamb (Lookahead + Lamb), and RangerAdam (Lookahead + Adam)…at least for five epochs, none were competitive. It’s a small sample (one 5-epoch run each), but the big ding on Lamb is that it’s roughly 2-3x as slow as Ranger…so for the same wall-clock time, Ranger could run 2x+ the epochs and win on that basis alone.

Regardless, here are the runs and their loss curves…basically, what appears to really set Ranger ahead is that it continues on a fairly aggressive path in the middle of the run. I believe this comes from the solid launch pad that RAdam provides at the start. Lamb and its variants all start off OK but then flatten quite a bit in the middle. I suppose this might be the trust ratio clamping down too hard there, but I’m not sure.

Ranger curve:

Lamb curve:

RangerLamb:

RangerAdam (it’s so slow because it’s Lamb running as Adam…same calcs, but it ditches the trust ratio):

4 Likes

I was the one asking about using LAMB. Thanks for doing the experiments!

3 Likes

FYI, I’ve rewritten parts of Ranger to provide much tighter integration of the slow_weights…

New version 9.3.19

* Refactored slow_weights into the state dictionary, leveraging a single step counter and allowing single-pass updates for all param tensors (rough sketch below).
* Much improved efficiency, though no change in seconds-per-epoch time.
* This refactor should eliminate any random issues with save/load, as everything is self-contained in the state dict now.
* Group learning rates are now supported (thanks to GitHub @SHolderbach). This should be a big help as we move from R&D to more production-style use, a la freeze/unfreeze with other models.

Passes verification testing :slight_smile:
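For anyone curious what the refactor means in practice, here’s a minimal sketch of the idea (illustrative only, not the actual Ranger code), assuming a Lookahead-style wrapper: the slow weights live in the optimizer’s per-parameter `state`, so `state_dict()` / `load_state_dict()` carry them automatically, and a single step counter drives the sync. A plain SGD step stands in for the real inner (RAdam) update.

```python
import torch
from torch.optim import Optimizer

class LookaheadSketch(Optimizer):
    "Illustrative only: slow weights live in self.state, so save/load just works."
    def __init__(self, params, lr=1e-3, alpha=0.5, k=6):
        super().__init__(params, dict(lr=lr, alpha=alpha, k=k))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:            # per-group lr supports freeze/unfreeze
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:                 # lazily create the slow buffer
                    state['step'] = 0
                    state['slow_buffer'] = p.data.clone()
                # ... the fast (inner) update would go here; plain SGD as a placeholder:
                p.data.add_(p.grad, alpha=-group['lr'])
                state['step'] += 1
                if state['step'] % group['k'] == 0:  # every k steps, sync toward fast weights
                    slow = state['slow_buffer']
                    slow.add_(p.data - slow, alpha=group['alpha'])
                    p.data.copy_(slow)
```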

6 Likes

Great work @LessW2020!!!

1 Like

Hey guys, I need some help. I’m having a bit of trouble replicating the outputs of this paper - https://drive.google.com/file/d/11K9Fi0n0BTq22dEl4V6YSCpJ4yLxVPBQ/view?usp=drivesdk
In particular Figure 3, where the authors have plotted the Edge of Chaos for 3 activation functions. Can someone provide any helper or boilerplate code? Thanks!

1 Like

@LessW2020 We haven’t dug much into the Ralamb thing yet, but apparently a ‘coding mistake’ appears to be the source of Ralamb’s high performance against a properly implemented RAdam + LARS. @grankin’s repo has the latest iteration (mistake included): https://github.com/mgrankin/over9000 Those numbers are vanilla - pre-improvements - with the only addition being the annealing schedule.

You should try that implementation, as it appears to be much more aggressive at the start and throughout the optimization, and it still works.

For details of what we know of the error so far: https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20#gistcomment-3012275

3 Likes

Here are my results:
Woof, 5 epochs.
XResNet50, Adam as base, only without fp16.
Selu, a=1, lr=3e-3
[0.602, 0.616, 0.628, 0.608, 0.626, 0.6, 0.590, 0.632, 0.630, 0.608]
mean: 0.6140, std: 0.0138

Selu, lr 3e-3
[0.62, 0.636, 0.616, 0.606, 0.578, 0.63, 0.618, 0.626, 0.62, 0.626]
mean: 0.6176, std: 0.0153

Celu
[0.65, 0.616, 0.598, 0.622, 0.610, 0.596, 0.590, 0.636, 0.626, 0.622]
mean: 0.6166, std: 0.0178

Mish
[0.586, 0.590, 0.612, 0.608, 0.636, 0.616, 0.6, 0.616, 0.62, 0.646]
mean: 0.6130, std: 0.0177
So - it’s close; I need to check at 20 epochs.
Actually, I thought it was better than the baseline (ReLU), which was ~0.59 on my machine.
But now I’ve rerun the baseline and it is:
lr=3e-3
[0.628, 0.614, 0.630, 0.610, 0.618, 0.610, 0.626, 0.624, 0.618, 0.592]
mean: 0.6170, std: 0.0107
So - I’m confused, they’re all about equal.
Notebooks:

@diganta That’s a very interesting paper indeed. I took the time to read it after work. It would be great to find the edge of chaos for Mish. If you have code, I can take a look; I’m already using Mish for my work with interesting results, so anything that can help make it better would be good for me. The math behind these papers isn’t my strong suit, but it would serve me well to help.

2 Likes

I have to run, but here’s something I want to add into MXResNet that I think will help us boost things further:

Basically, replace the 1-3-1 bottleneck with a series of convs to achieve multi-scale resolution (kind of like Seb’s self-attention, to some degree).
The code is on their GitHub, but I couldn’t get it working yet…(spent about two hours on it, constant tensor size mismatches).

Anyway, it provides about a 2% boost on ImageNet (with similar complexity to the regular ResNet bottleneck), and I’m hoping it can do the same for us.
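While I sort out their repo, here’s a rough sketch of the multi-scale bottleneck idea as I understand it (my own naming and channel handling, with BN/activation omitted for brevity, so not their code): the middle 3x3 stage is split into channel chunks, and each chunk’s conv also sees the previous chunk’s output.

```python
import torch
import torch.nn as nn

class MultiScaleBottleneck(nn.Module):
    "Sketch of a multi-scale bottleneck: the 3x3 stage is split into cascading chunks."
    def __init__(self, c_in, c_mid=64, scales=4):
        super().__init__()
        assert c_mid % scales == 0
        self.width = c_mid // scales
        self.conv_in  = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.convs    = nn.ModuleList(
            nn.Conv2d(self.width, self.width, 3, padding=1, bias=False)
            for _ in range(scales - 1))
        self.conv_out = nn.Conv2d(c_mid, c_in, 1, bias=False)

    def forward(self, x):
        chunks = torch.split(self.conv_in(x), self.width, dim=1)
        ys = [chunks[0]]                                # first chunk passes through as-is
        for i, conv in enumerate(self.convs):
            inp = chunks[i + 1] if i == 0 else chunks[i + 1] + ys[-1]
            ys.append(conv(inp))                        # each scale also sees the previous output
        return self.conv_out(torch.cat(ys, dim=1)) + x  # residual add (BN/activation omitted)

# e.g. block = MultiScaleBottleneck(256, c_mid=64); y = block(torch.randn(2, 256, 32, 32))
```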

4 Likes

Good find! I’ll try to take a look at it and fumble around later this week.

@Redknight and @Diganta let me know how that goes, I definitely want to learn how to visualize those techniques!

1 Like

@Redknight It is indeed a very interesting paper. The notion of the EOC was proposed in this paper - https://arxiv.org/pdf/1611.01232.pdf - and was also explained excellently in this paper - https://arxiv.org/pdf/1711.04735.pdf. Their findings, if replicated for Mish, might just confirm the mathematical superiority of Mish over other activation functions. However, I’m facing some issues building up the algorithm they used to generate those plots. I have no doubts about the mathematical aspect of it; however, I need help with the coding part.

1 Like

You can find their ICML slides here: https://icml.cc/media/Slides/icml/2019/104(12-11-00)-12-11-35-4383-on_the_impact.pdf

1 Like

@diganta LOL, you just hit send a few seconds before I was going to post the link. From what I could gather, the plot is actually the array of outputs of a random initialization under those conditions when feeding a constant (I am ‘guessing’ that part based on similar stuff I have done that looks like that), which is then graphed as a 2D function.

@Redknight The issue is the integral term in the fixed-point dynamics equation for q*, which generates the phase-plane boundary separating the Ordered and Chaotic phases. For Mish, this integral doesn’t seem to have a closed form. I tried obtaining the integral from Mathematica and it said “No integral found within scope”. I’m not sure if I’m doing it correctly or not. I have emailed the authors. Hopefully, they reply.

I just tried for Swish and Wolfram gives me the same result. I guess I’m doing something incorrectly.
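In case it helps in the meantime: that Gaussian integral generally has no closed form for activations like Swish or Mish, so one workaround is to evaluate the expectation numerically (e.g. Gauss-Hermite quadrature) and iterate the variance map q_{t+1} = σ_w² · E_{z∼N(0,1)}[φ(√q_t · z)²] + σ_b² to its fixed point; the ordered/chaotic boundary is then the contour where χ₁ = σ_w² · E[φ′(√q* · z)²] = 1. A minimal sketch (function names are mine, not from the paper):

```python
import numpy as np

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))   # x * tanh(softplus(x))

def gauss_expect(f, n_quad=101):
    # E_{z~N(0,1)}[f(z)] via Gauss-Hermite: sum_i w_i * f(sqrt(2)*x_i) / sqrt(pi)
    xs, ws = np.polynomial.hermite.hermgauss(n_quad)
    return np.sum(ws * f(np.sqrt(2.0) * xs)) / np.sqrt(np.pi)

def q_fixed_point(phi, sigma_w, sigma_b, q0=1.0, iters=500):
    # Iterate the length (variance) map to its fixed point q*.
    q = q0
    for _ in range(iters):
        q = sigma_w**2 * gauss_expect(lambda z: phi(np.sqrt(q) * z) ** 2) + sigma_b**2
    return q

# Example: q* for Mish at one (sigma_w, sigma_b); sweep these to trace the phase plane.
sigma_w, sigma_b = 1.0, 0.1
q_star = q_fixed_point(mish, sigma_w, sigma_b)

# chi_1 = sigma_w^2 * E[phi'(sqrt(q*) z)^2]; the edge of chaos is the contour chi_1 = 1.
eps = 1e-5
phi_prime = lambda x: (mish(x + eps) - mish(x - eps)) / (2 * eps)   # numerical derivative
chi_1 = sigma_w**2 * gauss_expect(lambda z: phi_prime(np.sqrt(q_star) * z) ** 2)
print(q_star, chi_1)
```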