With the inlining I get 22 seconds per epoch with ReLU and 24 seconds per epoch with Mish, so Mish is roughly 9% slower. I’m not sure about the memory difference though.
As mentioned, the Imagewoof/Imagenette setup is not great for measuring timing. The dataset is so small that the per-epoch dataloader transitions from train to test and back take up a lot of relative time, adding proportionally large overhead and variability. Train or validate on a bigger dataset like ImageNet itself to get a better measure.
A comparative measure taken right before the end of a longer validation run. The numbers in brackets are cumulative averages and quite stable at this stage. GPU utilization is 99%.
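As a rough way to sanity-check the activation overhead outside a full training run, something like the micro-benchmark below works. This is only a sketch of mine, not the setup used for the numbers above; it uses a naive, unfused Mish, so the absolute timings won’t match the inlined version being discussed.

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMish(nn.Module):
    # mish(x) = x * tanh(softplus(x)), composed from standard ops (not inlined/fused)
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def bench(act, iters=100, shape=(64, 128, 56, 56)):
    # time forward + backward of one activation over a conv-sized tensor
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.randn(shape, device=device, requires_grad=True)
    if device == 'cuda': torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        act(x).sum().backward()
        x.grad = None
    if device == 'cuda': torch.cuda.synchronize()
    return (time.time() - start) / iters

print('relu:', bench(nn.ReLU()))
print('mish:', bench(NaiveMish()))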
Isn’t GPU memory usage directly linked to the number of parameters?
Parameters are just part of it. The input size, the parameters, and a whole lot of little details regarding the forward and backward mechanics and the caching allocator determine the practical memory usage for a given task. Even changing the arguments of a given conv can have a significant impact on the cuDNN workspace size for the forward and/or backward pass, as cuDNN may select a different algorithm (Winograd vs. GEMM vs. others), each of which uses a different tensor data layout and requires different allocations.
The EfficientNets are a great example: even at roughly 1/10 the parameter count, they actually use as much or more GPU memory for a given performance range (accuracy). The increased input size and the conv algorithm selection seem to result in larger workspace sizes, and activations like Swish are (currently) implemented as sequences of Python ops, which means more operations and more intermediate tensors kept around.
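To make that concrete, here is a small sketch (my own, assuming a CUDA GPU and PyTorch’s memory stats API) showing that the same conv layer, with the exact same parameter count, can have very different peak memory depending only on the input size:

import torch
import torch.nn as nn

def peak_mem_mb(model, input_shape):
    # peak allocated memory for one forward + backward pass, in MB
    torch.cuda.reset_max_memory_allocated()
    x = torch.randn(input_shape, device='cuda')
    model(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 1024**2

conv = nn.Conv2d(64, 64, 3, padding=1).cuda()   # same parameter count in both runs
print(peak_mem_mb(conv, (32, 64, 64, 64)))      # smaller input
print(peak_mem_mb(conv, (32, 64, 224, 224)))    # larger input -> much higher peak memory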
@LessW2020 @Seb I am writing a walkthrough notebook on this for my meetup in two weeks, and I’m trying to go through and gather all of the papers that were used. Where did the flat + anneal schedule originate from?
Otherwise here is what I’ve gathered so far (I’ll update this post here in case anyone wants a quick reference to the papers):
@grankin came up with flat+anneal I believe.
simpleselfattention is inspired by Self-Attention GANs; I heavily modified their layer and came up with the positioning in xresnet that we used. Maybe I’ll write a paper if I find a good use for it. https://github.com/sdoria/SimpleSelfAttention (I need to improve that readme)
(I should add that @grankin implemented a “symmetrical” version, which we didn’t use here, and participated in the testing)
A big jump from the leaderboard was fixing that learning rate oversight in learn.py
Another one was from using the full size dataset rather than Imagewoof160
Another smaller one was adding more channels in xresnet.
I think you got the rest.
Thanks Seb! I appreciate the double check. @LessW2020 thanks for the post! I believe you missed the LARS paper though
I will certainly reference him twice then. Once I have the notebook written up, I can post it on your forum if you think it would be nice, Less. I won’t do the 5-for-5 runs like we have been; it’s Colab, so it’ll just be 1 run of 5 for each effort.
I also just put in an “Other Equally Important Notables” section for the non-papers.
I made the promised post to try and provide an overview for everyone on the new techniques we’ve been using here:
re: missed paper - Good spot @muellerzr - I’ve added the LARS link!
Re: notebook - yes, please add it to the github repo, that will be a nice add for sure. Thanks for making that list of papers, that’s a big help for anyone to delve into more details.
@Seb - I had to stretch to summarize the self-attention aspect in my post, so I’ve referenced you in that thread for people to ask for a tutorial about it. It does look promising though, after seeing the results here and a quick read of the paper.
Thanks Less! I will work on it sometime this week, as converting the scheduler to a callback is… a welcome challenge. I’m following this for the scheduler, but I think if I follow the fit_one_cycle code I should be fine.
If I see results that show we can train Imagenette/woof to convergence faster with ssa on different image sizes, then a paper would make sense. So far I’ve only seen that on Imagewoof128 and it didn’t work (equal results) on Imagewoof256. Weird!
Re: Mixup. I’ve just gone by Jeremy’s intuition on the leaderboard. He uses Mixup for 80 epochs and more. When I did runs on 80 epochs, I used it.
I did briefly test with it but similar to @Seb, I figured if Jeremy wasn’t using it then it wasn’t a high priority.
That said, I did see consistently better short-term validation results with it (i.e. the validation curve sat well below the training curve) versus without, but at the same time, at least with OneCycle, I didn’t end up any more accurate.
So I think it’s worth testing now that we have the new lr schedule and Ranger…and for that matter, I think progressive sprinkles is another thing to test as I had really good luck with that (better than cutmix usually).
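If anyone wants to run that mixup test, here is a minimal sketch assuming fastai v1. The dataset and architecture below are just placeholder choices for illustration, not the exact setup from the runs above; the mixup part itself is the one-line .mixup() call on the Learner.

from fastai.vision import *

# illustrative setup only: Imagewoof-160 at size 128 with an unpretrained resnet50
path = untar_data(URLs.IMAGEWOOF_160)
data = ImageDataBunch.from_folder(path, valid='val', ds_tfms=get_transforms(),
                                  size=128, bs=64).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, pretrained=False, metrics=accuracy)
learn = learn.mixup(alpha=0.4)   # alpha is the Beta-distribution parameter
learn.fit_one_cycle(5, 4e-3)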
I made a new simple(r) setup for flat+cosine, fcfit:
#flat and cosine annealer - @mgrankin invented
#let's make it fast and easy - @lessw2020
from fastai.callbacks.general_sched import TrainingPhase, GeneralScheduler
from fastai.callback import annealing_cos

def fcfit(learn, num_epoch=2, lr=4e-3, start_pct=.72, f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)
    anneal_start = int(n*num_epoch*start_pct)  #batch at which annealing starts
    batch_finish = n*num_epoch - anneal_start  #batches spent annealing
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    #save the setup
    learn.callbacks.append(sched)
    #start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then cosine anneal for {batch_finish} batches")
    learn.fit(num_epoch)
    #bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
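Usage is then a one-liner; for example (the Learner setup is whatever you already have, e.g. the xresnet + Ranger combination used in this thread):

fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72)  #flat at 4e-3 for ~72% of batches, then cosine anneal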