Meet Mish: New Activation function, possible successor to ReLU?

wow, no wonder that thing was so slow when I tested it (and gave up on it). Thanks again for the info!

@LessW2020 @Seb I am writing a walkthrough notebook on all of this for my meetup in two weeks, and I’m trying to go through and gather all of the papers that were used. Where did Flatten and Anneal originate?

Otherwise here is what I’ve gathered so far (I’ll update this post here in case anyone wants a quick reference to the papers):

Papers Referenced:

Other Equally Important Notables:

3 Likes

@grankin came up with flat+anneal, I believe.
SimpleSelfAttention is inspired by Self-Attention GANs; I heavily modified their layer and came up with the positioning in xresnet that we used. Maybe I’ll write a paper if I find a good use for it. https://github.com/sdoria/SimpleSelfAttention (I need to improve that readme.)
(I should add that @grankin implemented a “symmetrical” version, which we didn’t use here, and participated in the testing.)
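For anyone who wants to picture it, here’s a rough sketch of the general idea in PyTorch. This is not the exact layer from the repo above (see the link for the real code and the positioning details), just a SAGAN-inspired simplification:

import torch
import torch.nn as nn

class SimpleSelfAttentionSketch(nn.Module):
    "Rough sketch only - see the SimpleSelfAttention repo for the actual layer."
    def __init__(self, n_channels):
        super().__init__()
        #1x1 conv mixes channels before re-weighting by the attention map
        self.conv = nn.Conv1d(n_channels, n_channels, kernel_size=1, bias=False)
        #gamma starts at 0 so the block is an identity at init (as in SAGAN)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                       #(B, C, N) with N = H*W
        attn = torch.bmm(flat, flat.transpose(1, 2))  #(B, C, C) channel affinities
        out = torch.bmm(attn, self.conv(flat))        #re-weight the mixed features
        return self.gamma * out.view(b, c, h, w) + x  #residual connection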

A big jump over the leaderboard came from fixing that learning rate oversight in learn.py.
Another came from using the full-size dataset rather than Imagewoof160.
Another, smaller one came from adding more channels in xresnet.
I think you got the rest.

2 Likes

@grankin invented that :slight_smile:

2 Likes

Thanks Seb! I appreciate the double check. @LessW2020 thanks for the post! I believe you missed the LARS paper though :wink:

I will certainly reference him twice then. Once I have the notebook written up, I can post it on your forum if you think it would be nice, Less. I won’t do the 5-for-5 runs like we have been; it’s Colab, so it’ll just be one run of 5 for each approach.

I also just added an “Other Equally Important Notables” section for the non-papers.

I made the promised post to try and provide an overview for everyone on the new techniques we’ve been using here:

re: missed paper - Good spot @muellerzr - I’ve added the LARS link!

Re: notebook - yes, please add it to the GitHub repo; that will be a nice addition for sure. Thanks for making that list of papers, that’s a big help for anyone wanting to delve into more details.

@Seb - I had to stretch to summarize the self-attention aspect in my post, so I’ve referenced you in that thread in case people want to ask for a tutorial on it :slight_smile: It does look promising, though, after seeing the results here and a quick read of the paper.

1 Like

Still calling it with 6 lines - a one-line function call for it would be awesome!
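For context, the “6 lines” are presumably something like the GeneralScheduler snippet below (paraphrased, not verbatim from the notebooks; it assumes a fastai v1 Learner named learn, 5 epochs, and a ~72% flat portion):

from fastai.callback import annealing_cos
from fastai.callbacks.general_sched import GeneralScheduler, TrainingPhase

n = len(learn.data.train_dl)        #batches per epoch
anneal_start = int(n * 5 * 0.72)    #stay flat for ~72% of training
phase0 = TrainingPhase(anneal_start).schedule_hp('lr', 4e-3)
phase1 = TrainingPhase(n * 5 - anneal_start).schedule_hp('lr', 4e-3, anneal=annealing_cos)
sched = GeneralScheduler(learn, [phase0, phase1])
learn.fit(5, callbacks=[sched])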

1 Like

Thanks Less! I will work on it sometime this week, as converting the scheduler to a callback is…a welcome challenge. I’m following this for the scheduler, but I think if I follow the fit_one_cycle code I should be fine.

1 Like

I would vote for you to write a paper on it. Especially if we can continue to show it’s advantages testing on more datasets, etc.

1 Like

I would vote for you to write a paper on it. Especially if we can continue to show it’s advantages testing on more datasets, etc.

Seconded!

Just FFT, did we ever attempt MixUp with this? If not, why not? (As in, did we just not get to it, or was there a reason?)

If I see results showing we can train Imagenette/woof to convergence faster with SSA at different image sizes, then a paper would make sense. So far I’ve only seen that on Imagewoof128, and it didn’t work (equal results) on Imagewoof256. Weird!

Re: Mixup. I’ve just gone by Jeremy’s intuition on the leaderboard. He uses Mixup for 80 epochs and more. When I did runs on 80 epochs, I used it.

1 Like

I did briefly test with it, but similar to @Seb, I figured that if Jeremy wasn’t using it then it wasn’t a high priority.

That said, I did see consistently better short-term validation results with it (i.e. the validation curve sat well below the training curve) versus without, but at least with OneCycle I didn’t end up any more accurate.

So I think it’s worth testing now that we have the new LR schedule and Ranger…and for that matter, I think progressive sprinkles is another thing to test, as I had really good luck with that (usually better than CutMix).
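For anyone following along: the core of MixUp is just blending pairs of examples and their targets. Here’s a rough sketch of the idea (not fastai’s actual callback, which also handles the loss side):

import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    "Sketch of one MixUp step: blend each example/target with a shuffled partner."
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)    #keep the larger weight on the original example
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix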

1 Like

Thanks for the input guys!

Last little bit then I think I’m good to go. Did I miss any of our runs that were especially important here:

  • Baseline (Adam + xResnet50) + OneCycle
  • Ranger (RAdam + LookAhead) + OneCycle
  • Ranger + Flatten Anneal
  • Ranger + MXResnet (xResnet50 + Mish; see the quick Mish sketch below) + Flatten Anneal
  • RangerLars (Ralamb + LARS + Ranger) + MXResnet + Flatten Anneal
  • RangerLars + xResnet50 + Flatten Anneal
  • Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal

There were quite a few, so I’m trying to narrow it down to the most important <10.

*Edit: forgot to add Ranger to the SSA
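For reference (and for anyone landing here from the title), Mish is just x * tanh(softplus(x)), and MXResnet is xResnet50 with Mish swapped in for ReLU. A minimal sketch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    "Mish activation (Misra, 2019): f(x) = x * tanh(softplus(x))."
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))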

1 Like

@LessW2020 I put in the PR. Let me know when it’s good and I’ll post on the other forum!

1 Like

It’s in, thanks!

1 Like

I made a new simple(r) setup for flat+cosine, fcfit:

#flat and cosine annealer - @mgrankin invented
#let's make it fast and easy - @lessw2020
from fastai.callback import annealing_cos
from fastai.callbacks.general_sched import GeneralScheduler, TrainingPhase

def fcfit(learn, num_epoch=2, lr=4e-3, start_pct=.72, f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)                   #batches per epoch
    anneal_start = int(n*num_epoch*start_pct)      #batch at which to start annealing
    batch_finish = n*num_epoch - anneal_start      #batches spent annealing
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    #save the setup
    learn.callbacks.append(sched)
    #start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then cosine anneal for {batch_finish} batches")
    learn.fit(num_epoch)
    #bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
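For example (the dataset path and architecture below are just placeholders; any fastai v1 Learner works):

from fastai.vision import *    #cnn_learner, ImageDataBunch, accuracy, etc.

#placeholder setup - point `path` at your own dataset (e.g. Imagewoof)
path = Path('data/imagewoof')
data = ImageDataBunch.from_folder(path, size=128, bs=64).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, metrics=accuracy)
fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72)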

looks like this:

I added it as fcfit.py to the repo.

Let’s improve it and then see about getting it into FastAI so we can just call learn.fcfit() :slight_smile:

2 Likes

Great job @LessW2020! I’ve been busy busting out that notebook so haven’t gotten there yet… I still have a few hours at work, let me see what I can do :wink:

1 Like

Awesome, hope you can tweak and improve it. Clearly, the whole flat + cosine has had a big impact with the new optimizers.

I’m also wondering about how the accuracy variance jumps around epochs 8-15 (on average) in the 20-epoch runs, and whether we need another bump up or down in the middle to help with that.
Then again, I also think that time is probably better spent working on getting AutoOpt up and running so we don’t need to hand-tune lol.

1 Like

I was thinking the same - FFT for others, perhaps. I’ll mention it at my meetup and maybe get some other students who want to explore this.

If AutoOpt lives up to its name, perhaps it would be better :slight_smile:

1 Like