Meet Mish: New Activation function, possible successor to ReLU?

Thanks @a_yasyrev! I like that very much! I think I fixed the code… @LessW2020, could you run it on a setup whose results you’re comfortable with, just to be sure? I ran it with Ranger + SSA + MX and the results were pretty much as expected, but I’d like a double check before opening a PR :slight_smile:

@grankin can you check as well? It is your brainchild, after all :slight_smile:

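# fastai v1 imports needed by the callback and fit_fc below
from fastai.basics import *
from fastai.callbacks.general_sched import TrainingPhase

class FlatCosAnnealScheduler(LearnerCallback):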
    """
    Manage FCFit training as found in the ImageNette experiments. 
    Code format is based on OneCycleScheduler
    Based on idea by Mikhail Grankin
    """
    def __init__(self, learn:Learner, lr:float=4e-3, moms:Floats=(0.95,0.999),
                 start_pct:float=0.72, start_epoch:int=None, tot_epochs:int=None,
                 curve='cosine'):
        super().__init__(learn)
        n = len(learn.data.train_dl)
        self.anneal_start = int(n * tot_epochs * start_pct)
        self.batch_finish = (n * tot_epochs - self.anneal_start)
        if curve=="cosine":
            curve_type=annealing_cos
        elif curve=="linear":
            curve_type=annealing_linear
        elif curve=="exponential":
            curve_type=annealing_exp
        else:
            raise ValueError(f"annealing type not supported {curve}")
        phase0 = TrainingPhase(self.anneal_start).schedule_hp('lr', lr).schedule_hp('mom', moms[0])
        phase1 = TrainingPhase(self.batch_finish).schedule_hp('lr', lr, anneal=curve_type).schedule_hp('mom', moms[1])
        phases = [phase0, phase1]
        self.phases,self.start_epoch = phases,start_epoch

        
    def on_train_begin(self, epoch:int, **kwargs:Any)->None:
        "Initialize the schedulers for training."
        res = {'epoch':self.start_epoch} if self.start_epoch is not None else None
        self.start_epoch = ifnone(self.start_epoch, epoch)
        self.scheds = [p.scheds for p in self.phases]
        self.opt = self.learn.opt
        for k,v in self.scheds[0].items(): 
            v.restart()
            self.opt.set_stat(k, v.start)
        self.idx_s = 0
        return res
    
    
    def jump_to_epoch(self, epoch:int)->None:
        for _ in range(len(self.learn.data.train_dl) * epoch):
            self.on_batch_end(True)

            
    def on_batch_end(self, train, **kwargs:Any)->None:
        "Take a step in lr,mom sched, start next stepper when the current one is complete."
        if train:
            if self.idx_s >= len(self.scheds): return {'stop_training': True, 'stop_epoch': True}
            sched = self.scheds[self.idx_s]
            for k,v in sched.items(): self.opt.set_stat(k, v.step())
            if list(sched.values())[0].is_done: self.idx_s += 1


def fit_fc(learn:Learner, tot_epochs:int=None, lr:float=defaults.lr, moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,
           wd:float=None, callbacks:Optional[CallbackList]=None, show_curve:bool=False)->None:
    "Fit a model with Flat Cosine Annealing"
    max_lr = learn.lr_range(lr)
    callbacks = listify(callbacks)
    callbacks.append(FlatCosAnnealScheduler(learn, lr, moms=moms, start_pct=start_pct, tot_epochs=tot_epochs))
    learn.fit(tot_epochs, max_lr, wd=wd, callbacks=callbacks)

Edit: I believe I have working momentum
Edit x2: I do not… reverted to original
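
For anyone who wants to see the shape of the schedule without running fastai, here is a minimal pure-Python sketch of the flat-then-cosine learning rate that the callback above is meant to produce. The function name `flat_cos_lr` and the `end_lr=0.0` floor are assumptions made for illustration; the post does not state what value the real annealing ends at.

```python
import math

def flat_cos_lr(step, total_steps, lr=4e-3, start_pct=0.72, end_lr=0.0):
    """Learning rate at batch index `step`: flat first, then cosine-annealed.

    Hypothetical helper, not part of fastai. `end_lr` is an assumed floor.
    """
    # Same split as the callback: the first start_pct of all batches are flat.
    anneal_start = int(total_steps * start_pct)
    if step < anneal_start:
        return lr
    anneal_steps = total_steps - anneal_start
    # Fraction of the annealing phase completed, from 0.0 to 1.0.
    pct = (step - anneal_start) / max(1, anneal_steps - 1)
    return end_lr + (lr - end_lr) * (1 + math.cos(math.pi * pct)) / 2

# Hypothetical run: 5 epochs of 100 batches each.
schedule = [flat_cos_lr(s, 500) for s in range(500)]
print(schedule[0], schedule[400], schedule[-1])
```

The learning rate stays at `lr` for roughly the first 72% of batches and then follows half a cosine wave down to `end_lr`.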

I definitely know it’s doing something right… first run: 78%! second run: 75% (running for 5)

Also, should we PR with Ranger et al for the main lib?

Looks to fit up to what we had before. Above note is still true. I’ll wait for a day in case we come up with any more discoveries on how to improve this functionality

Hi @muellerzr, @LessW2020 (sorry, I’m a new user and can only tag 2). It’s a very long thread and I’m still going through it. I created the Mish activation function with a view to improving on the SOTA in testing. I acknowledge that it has a comparatively slower run time, and I’m working on optimizing it further. Thanks for giving it a try. If you have any specific questions regarding Mish, feel free to ask.
Best.
Diganta

Hi Diganta - great to see you here on the boards! And congrats again on developing Mish!
Definitely let us know if you are able to further improve the performance of Mish.
I hope you’ll stick around as you have time here on the boards, it’s great to have you here :slight_smile:
I’m hoping we’ll be able to get AutoOpt up and running and can use that with Mish in the near future to further explore its potential.

Nice job @muellerzr - it’s resulting in the same accuracy and the expected curve.
I have one question though - I’m confused by:
cyc_len and tot_epochs

I put cyc_len =2 thinking I would get 2 cycles in one epoch…instead I got 2 epochs. :slight_smile:

I think that needs to be clarified for users and me, and I also think total_epochs should be right up front after learner instead of second to last.
I think in most cases people will just use cyc_len = 1 (assuming that means spread the whole curve over total epochs) so that should have a default value of 1 and be put way in the back since it’s not often used?
edit update: I tested this:
fit_fc(learn,2, 4e-3, tot_epochs=1,show_curve=True)
so I expressly said “1 epoch” but apparently cyc_len =2 is overriding and it then ran 2 epochs… :slight_smile:
Anyway, that’s very confusing so I think we should make it simpler/easier to use.
Otherwise as noted, curves and accuracy look great!

Congrats on those results @LessW2020 @muellerzr !

Hopefully I may have something for you to get slightly better results. In case you guys are using Lookahead (even combined version), right before evaluation, there is a decision that should be made:

  • At the end of an epoch, most likely nb_batches % k != 0, which means that you are evaluating your model on the fast weights (before the next synchronization).

  • The difference might be slim but positive as there are two choices right before evaluation: copy slow weights to fast weights (walking a few steps back), or perform synchronization even though you haven’t yet performed k fast steps since last sync.
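
To make the first bullet concrete, here is a small arithmetic sketch. It assumes the optimizer's step counter keeps running across epochs rather than resetting, and the helper name and batch counts are hypothetical.

```python
def pending_fast_steps(nb_batches, k, nb_epochs=1):
    """Fast steps taken since the last sync after `nb_epochs` full epochs.

    Assumes the step counter is not reset between epochs.
    """
    return (nb_batches * nb_epochs) % k

# Hypothetical setup: 148 batches per epoch, synchronization every k=6 steps.
print(pending_fast_steps(148, 6))     # 4 unsynchronized steps after epoch 1
print(pending_fast_steps(148, 6, 2))  # 2 after epoch 2
print(pending_fast_steps(148, 6, 3))  # 0 after epoch 3: fast and slow agree
```

Whenever the result is non-zero, evaluation sees fast weights that have drifted from the slow ones.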

I’m still investigating which option is giving the best results but at least, it’s better to have the choice. You can find the method I implemented in commit, that could be used as follows:

from torch.optim import Adam
optimizer = Adam(model_params)
optimizer = Lookahead(optimizer, sync_rate=0.5, sync_period=6)

for _ in range(nb_epochs):
    # Train here
    optimizer.sync_params()
    # not specifying sync_rate means model params <- slow params
    # otherwise optimizer.sync_params(0.5) will force early synchronization
    # Evaluate here

Hope this helps!
Cheers
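
For readers who want to see the two choices side by side, here is a toy sketch on plain Python floats. It is not @fgfm's implementation: the class `ToyLookahead` is hypothetical, the inner optimizer is replaced by a simple additive update, and the `sync_rate=0.0` default of `sync_params` is an assumption taken from the comments in the snippet above.

```python
class ToyLookahead:
    """Toy Lookahead over a list of floats; for illustration only."""

    def __init__(self, params, sync_rate=0.5, sync_period=6):
        self.fast = list(params)  # weights the inner optimizer updates
        self.slow = list(params)  # weights changed only at synchronization
        self.sync_rate, self.sync_period = sync_rate, sync_period
        self.steps = 0

    def step(self, updates):
        # Stand-in for the inner optimizer: one update to the fast weights.
        self.fast = [w + u for w, u in zip(self.fast, updates)]
        self.steps += 1
        if self.steps % self.sync_period == 0:
            self.sync_params(self.sync_rate)

    def sync_params(self, sync_rate=0.0):
        # slow <- slow + sync_rate * (fast - slow), then fast <- slow.
        # With sync_rate=0.0 the slow weights are untouched, so the fast
        # weights are simply rolled back to the last synchronized point.
        self.slow = [s + sync_rate * (f - s) for s, f in zip(self.slow, self.fast)]
        self.fast = list(self.slow)

rollback = ToyLookahead([0.0])
early = ToyLookahead([0.0])
for _ in range(4):            # 4 fast steps, fewer than sync_period=6
    rollback.step([1.0])
    early.step([1.0])
rollback.sync_params()        # choice 1: walk back to the slow weights
early.sync_params(0.5)        # choice 2: force an early synchronization
print(rollback.fast, early.fast)
```

In this toy run the rollback ends at the old slow weights while the early sync lands halfway between slow and fast, which is the difference being compared before evaluation.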

@LessW2020 yes, I need to adjust that. It’s really epochs not cycles. Will adjust soon. My apologies!!!

I’ll simplify its structure a lot. Glad to see it’s working as expected though :slight_smile:

When we’re all good and ready I have the basic structure for like one cycle done and we can just PR it.

Hi @fgfm - thanks for pointing this out! I see your point and it’s a good one.
We are using Ranger, so yes lookahead is in there…that’s a really good option to control so we can test and see what works better.
Your code there looks really good btw.

I tried to quickly load up @fgfm’s new impl but I’m hitting an issue with the momentum settings:

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in on_batch_begin(self, train, **kwargs)
    473         if train:
    474             self.lrs.append(self.opt.lr)
--> 475             self.moms.append(self.opt.mom)
    476
    477     def on_backward_begin(self, smooth_loss:Tensor, **kwargs:Any)->None:

~/anaconda3/lib/python3.7/site-packages/fastai/callback.py in mom(self)
     82
     83     @property
---> 84     def mom(self)->float:return self._mom[-1]
     85     @mom.setter
     86     def mom(self, val:float)->None:

TypeError: 'NoneType' object is not subscriptable

I set it up with this:
def Ranger(params, alpha=0.5, k=6, *args, **kwargs):
    radam = RAdam(params, *args, **kwargs)
    return Lookahead(radam, alpha, k)

optar=partial(Ranger, betas = (0.95,0.99), eps=1e-6)

I also tried removing the betas to see if just using defaults would work but didn’t change.

Anyway, I have to run for a bit so if anyone can debug that would be great. Otherwise I think I’ll rewrite Ranger to use @fgfm’s improvements and see if I can isolate it while doing that.

Happy to check but do you have a URL to your code or notebook?

I am posting on the new v2 forum about how to implement the callbacks for our fit method

I’ll work on getting two versions of it for us for both libraries.

@LessW2020 I updated the function earlier with the proper format. One question I do have for this, how are we hoping to schedule the momentum? Eg should it be:

phase0.schedule_hp('mom', moms[1])
phase1.schedule_hp('mom', moms[0])

Or what would be the better way to do that (closer to how it really is now). Thanks!

Looking at how one_cycle works I believe that’s the proper way to do it

Momentum is done in the function above.

Anything else needed for the callback? If not, I’ll put in the PR so it’s easier for others to go ahead and use it? And then I’ll leave either of you, @LessW2020 or @Seb for the optimizers as those were your collaborations? :slight_smile: (Or you may choose to wait for AutoOpt)

Just give me the go-ahead :slight_smile:

Awesome, that is great work @muellerzr!

Please go ahead :slight_smile:

Re: optimizers - Ranger was my idea, but @rwightman and @fgfm have both coded up improved implementations. I’m hoping to have time to go through theirs and update mine and we can leverage the sync control that @fgfm mentioned earlier. The reason for integrating is of course to get ready for AutoOpt integration in the future.

That said…I spent several hours today on AutoOpt and hit issues with params computations between GPU and CPU…so the researcher/inventor behind it (Selçuk) is working on it now. But at least I’m getting a lot more familiar with the code behind it and how it works in more detail.

Note that Jeremy mentioned I should turn the summary post into an article on Medium, so I’m working on that now. Hopefully we’ll get some movie and book deals from that lol.

Got it! I’ll send in the PR and leave the optimizers to you guys :wink:

Good to hear that you’re making progress with AutoOpt! (I wonder if we should try the autoLR that’s floating on the forums… AutoLRFinder ) some more FFT!

Thanks a bunch @fgfm! I don’t have the notebook public right now, but I’m going to try and rewrite from scratch tomorrow anyway to integrate both your and @rwightman’s impl and maybe the issue will go away as part of that.

hmm, good find - I think I’m going to fire up a server and try that out right now with everything else the same. I’ll let you know in an hour!

Sounds good! Also, don’t forget to write docs for our PRs. I’ll work on that tomorrow as best I can.

PR for the code base posted :slight_smile:

well, the lr finder didn’t really help.
In general it was suggesting learning rates that were way too aggressive, though I can’t fault the auto aspect. I think it’s more that our setup doesn’t go well with the lr finder concept in general.

I tried running the finder after each epoch, thinking it might let us tune epoch by epoch…but just ended up with .11 accuracy b/c it was nearly blowing it up.
The only time it did alright was on the very first analysis (i.e. clean network) as it suggested .005, which isn’t too far from our .004.
Otherwise, way too aggressive. As noted, I don’t think that’s a chart reading issue; rather, Ranger etc. just don’t work that well with the lr find concept.

Anyway, thanks for the link to this - always good to keep trying new things and we’ll hit some winners from it.

Related - I read a paper last night on KSAC (Kernel Something Atrous Convolution). The concept was very cool and reminded me a bit of SSA. Basically, in each layer, instead of having the kernel run just one pass (i.e. 3x3, step 1), it runs several (e.g. 3x3 at step 1, step 4, and step 8).
So basically it’s scanning at a tight density and then a wider and wider perspective.
As a result it captures more long range interdependencies and they set a new SOTA for segmentation with it.
The code isn’t out yet and it will be in TF, but I think it’s worth checking out as it is similar in spirit to @Seb’s SSA.
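
Since the paper's code isn't available, here is only a toy 1D sketch of the idea as described above: one kernel reused at several spacings, with the results combined. It reads "step" as the dilation rate and sums the branches, and both of those are assumptions rather than details confirmed by the paper. The function names are hypothetical.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-length 1D cross-correlation with zero padding and a dilation rate."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * rate  # taps are spaced `rate` apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def shared_kernel_atrous(signal, kernel, rates=(1, 4, 8)):
    """Apply the SAME kernel at each rate and sum the branches (assumed merge)."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [sum(vals) for vals in zip(*branches)]

# An impulse in the middle shows how far each rate reaches.
impulse = [0.0] * 17
impulse[8] = 1.0
print(shared_kernel_atrous(impulse, [1.0, 2.0, 3.0]))
```

The single impulse produces responses at distances 1, 4, and 8 from the center, which is the "tight, then wider and wider" scan with only one set of kernel weights.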

Hi @LessW2020, writer of the auto lr finder here. Gotta say your Mish activation function looks very promising and results seem great!

I wanted to comment about the aggressive LR that you’re getting. Usually when the LR you’re getting from the finder is too high, it may be a good idea to increase the lr_diff param when running the model (increments of 5 might be good to try). This should generalize a little depending on the model and hopefully give some better results throughout training.

It would also be interesting to see the loss and learning rate plots to understand why such aggressive learning rates were given in the first place.

Please let me know if you have any questions or suggestions/feedback on any of this!
