Meet Mish: New Activation Function, Possible Successor to ReLU?

With the inlining, I get 22 seconds per epoch with ReLU and 24 seconds per epoch with Mish, so Mish is about 9% slower. I’m not sure about the memory difference, though.

As mentioned, the Imagewoof/Imagenette setup is not great for measuring timing. The datasets are so small that the per-epoch dataloader transitions from train to test and back take up a lot of relative time, adding proportionally large overhead and variability. Train or validate on a bigger dataset like ImageNet itself to get a better measure.

A comparative measure, taken right before the end of a longer validation run. The numbers in brackets are cumulative averages and are quite stable at this stage. GPU utilization is 99%.

ResNet50-D batch size 512, FP32
Mish - Test: [ 90/98] Time: 0.739s (0.801s, 639.35/s) GPU mem: 11812MiB / 24220MiB
ReLU - Test: [ 90/98] Time: 0.549s (0.613s, 834.73/s) GPU mem: 9264MiB / 24220MiB
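
For context, here is a minimal sketch of the two kinds of Mish implementation being compared; “inlining” refers to jit-scripting the elementwise chain so it can be fused. This is illustrative only, not necessarily the exact code behind the numbers above.

import torch
import torch.nn.functional as F

def mish_naive(x):
    # Mish(x) = x * tanh(softplus(x)), built from separate elementwise ops
    return x * torch.tanh(F.softplus(x))

@torch.jit.script
def mish_jit(x):
    # same math, but scripted so the pointwise ops can be fused/inlined
    return x * torch.tanh(F.softplus(x))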

4 Likes

Excellent, thanks @rwightman - this is great info to learn from!

Isn’t GPU memory usage directly linked to the number of parameters?

Parameters are just part of it. The input size, the parameters, and a whole lot of little details regarding the forward and backward mechanics and the caching allocator determine the practical memory usage for a given task. Even changing the arguments of a given conv can have a significant impact on the cuDNN workspace size for the forward and/or backward pass, as it may select a different algorithm (Winograd vs. GEMM vs. others), each with a different layout for the tensor data and different allocation requirements.
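
A rough way to see this empirically (a hedged sketch; the helper name and model choice are just for illustration) is to run one forward/backward pass and read PyTorch’s peak-allocation counter:

import torch
import torchvision

def peak_mem_mb(model, batch_size=64, img_size=224, device="cuda"):
    # one forward + backward pass, then read the caching allocator's peak;
    # parameter count alone will not predict this number
    model = model.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    model(x).mean().backward()
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2

# e.g. peak_mem_mb(torchvision.models.resnet50(), batch_size=64)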

The EfficientNets are a great example: even at roughly 1/10 the parameter count, they actually use as much or more GPU memory for a given performance range (accuracy). Contributing factors are the increased input size, conv algorithm selection that seems to result in larger workspace sizes, and activations like Swish that are (currently) implemented as sequences of Python ops, which means more operations and more stored intermediates.
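
One common mitigation for composed-op activations (a sketch of the general recompute-in-backward trick, not necessarily what any particular library does) is a custom autograd Function that saves only the input:

import torch

class SwishFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # save only the input; recompute sigmoid in backward rather than
        # keeping the intermediates of x * sigmoid(x) alive
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        sx = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_out * (sx * (1 + x * (1 - sx)))

swish = SwishFn.apply  # usage: y = swish(x)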

4 Likes

Wow, no wonder that thing was so slow when I tested it (and gave up on it). Thanks again for the info!

@LessW2020 @Seb I am writing a walkthrough notebook on this for my meetup in two weeks, and I’m trying to go through and gather all of the papers that were used. Where did the Flatten + Anneal schedule originate?

Otherwise, here is what I’ve gathered so far (I’ll update this post in case anyone wants a quick reference to the papers):

Papers Referenced:

Other Equally Important Notables:

3 Likes

@grankin came up with flat+anneal, I believe.
SimpleSelfAttention is inspired by Self-Attention GANs; I heavily modified their layer and came up with the positioning in xresnet that we used. Maybe I’ll write a paper if I find a good use for it. https://github.com/sdoria/SimpleSelfAttention (I need to improve that readme.)
(I should add that @grankin implemented a “symmetrical” version, which we didn’t use here, and participated in the testing.)
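
For readers following along, here is a heavily hedged sketch of the general idea: a channel-affinity attention with a zero-initialized residual gate, in the spirit of SAGAN. The actual layer and its placement in xresnet are in the linked repo.

import torch
import torch.nn as nn

class SimpleSelfAttentionSketch(nn.Module):
    def __init__(self, n_channels: int):
        super().__init__()
        # 1x1 conv over flattened spatial positions, plus a residual gate
        # that starts at zero so the layer is initially an identity
        self.conv = nn.Conv1d(n_channels, n_channels, kernel_size=1, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                          # (B, C, N)
        affinity = torch.bmm(flat, flat.transpose(1, 2))    # (B, C, C)
        out = torch.bmm(affinity, self.conv(flat))          # re-weight projected features
        return (self.gamma * out + flat).view(b, c, h, w)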

A big jump over the leaderboard came from fixing that learning rate oversight in learn.py.
Another came from using the full-size dataset rather than Imagewoof160.
A smaller one came from adding more channels in xresnet.
I think you got the rest.

2 Likes

@grankin invented that :slight_smile:

2 Likes

Thanks, Seb! I appreciate the double-check. @LessW2020, thanks for the post! I believe you missed the LARS paper, though :wink:

I will certainly reference him twice, then. Once I have the notebook written up, I can post it on your forum if you think it would be useful, Less. I won’t do the 5-for-5 runs like we have been; it’s Colab, so it’ll just be one run of 5 for each configuration.

I also added an “Other Equally Important Notables” section for the non-papers.

I made the promised post to try to provide an overview for everyone of the new techniques we’ve been using here:

re: missed paper - Good spot @muellerzr - I’ve added the LARS link!

Re: notebook - yes, please add it to the GitHub repo; that will be a nice addition for sure. Thanks for making that list of papers, that’s a big help for anyone who wants to dig into more detail.

@Seb - I had to stretch to summarize the self-attention aspect in my post, so I’ve referenced you in that thread for people to ask for a tutorial about it :slight_smile: It does look promising, though, after seeing the results here and a quick read of the paper.

1 Like

Still calling it with six lines - a one-line function call for it would be awesome!

1 Like

Thanks, Less! I will work on it at some point this week, as converting the scheduler to a callback is… a welcome challenge. I’m following this for the scheduler, but I think if I follow the fit_one_cycle code I should be fine.

1 Like

I would vote for you to write a paper on it, especially if we can continue to show its advantages by testing on more datasets, etc.

1 Like

Seconded!

Just food for thought: did we ever attempt MixUp with this? If not, why not? (As in, did we just not get to it, or is there a reason?)

If I see results that show we can train Imagenette/Imagewoof to convergence faster with SSA at different image sizes, then a paper would make sense. So far I’ve only seen that on Imagewoof128, and it didn’t work (equal results) on Imagewoof256. Weird!

Re: MixUp - I’ve just gone by Jeremy’s intuition on the leaderboard. He uses MixUp for 80 epochs and more, so when I did 80-epoch runs, I used it.

1 Like

I did briefly test with it, but similar to @Seb, I figured that if Jeremy wasn’t using it then it wasn’t a high priority.

That said, I did see consistently better short-term validation results with it (i.e., the validation curve sat well below the training curve) versus without, but at the same time, at least with OneCycle, I didn’t end up any more accurate.

So I think it’s worth testing now that we have the new LR schedule and Ranger… and for that matter, I think progressive sprinkles is another thing to test, as I had really good luck with that (usually better than CutMix).
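
For reference, here is a rough sketch of the sprinkles idea: cutout-style occlusion with many small patches, where the “progressive” part ramps the patch count/size up over training. This is just an illustration, not @LessW2020’s exact implementation.

import torch

def sprinkle_(imgs: torch.Tensor, n_patches: int = 8, size: int = 8) -> torch.Tensor:
    # zero out n_patches random size x size squares per image, in place
    b, c, h, w = imgs.shape
    for i in range(b):
        for _ in range(n_patches):
            y = int(torch.randint(0, h - size + 1, (1,)))
            x = int(torch.randint(0, w - size + 1, (1,)))
            imgs[i, :, y:y + size, x:x + size] = 0.
    return imgs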

1 Like

Thanks for the input, guys!

One last bit, then I think I’m good to go. Did I miss any of our runs that were especially important? Here’s what I have:

  • Baseline (Adam + xResnet50) + OneCycle
  • Ranger (RAdam + LookAhead) + OneCycle
  • Ranger + Flatten Anneal
  • Ranger + MXResnet (xResnet50 + Mish) + Flatten Anneal
  • RangerLars (Ralamb + LARS + Ranger) + MXResnet + Flatten Anneal
  • RangerLars + xResnet50 + Flatten Anneal
  • Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal

There were quite a few, so I’m trying to pick out the most important ones (fewer than ten).

*Edit: I forgot to add Ranger to the SSA run.

1 Like

@LessW2020 I put in the PR. Let me know when it’s good and I’ll post on the other forum!

1 Like

It’s in, thanks!

1 Like

I made a new, simple(r) setup for flat+cosine, fcfit:

#flat and cosine annealer - @mgrankin invented
#let's make it fast and easy - @lessw2020

#imports assume fastai v1
from fastai.callback import annealing_cos
from fastai.callbacks.general_sched import GeneralScheduler, TrainingPhase

def fcfit(learn, num_epoch=2, lr=4e-3, start_pct=.72, f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)
    anneal_start = int(n * num_epoch * start_pct)  #batch at which annealing starts
    batch_finish = n * num_epoch - anneal_start    #number of batches spent annealing
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    #save the setup
    learn.callbacks.append(sched)
    #start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then cosine anneal for {batch_finish} batches")
    learn.fit(num_epoch)
    #bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
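
A quick usage sketch (assuming an already-built fastai v1 Learner called learn; the hyperparameters here are just illustrative):

#hypothetical usage with an existing fastai v1 Learner named `learn`
fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72, f_show_curve=True)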

The resulting LR curve looks like this (flat at lr, then a cosine anneal down): [plot not shown]

I added it as fcfit.py to the repo.

Let’s improve it and then see about getting it into FastAI so we can just call learn.fcfit() :slight_smile:

2 Likes