Great job @LessW2020! I’ve been busy busting out that notebook so haven’t gotten there yet… I still have a few hours at work, let me see what I can do
Awesome, hope you can tweak and improve it. Clearly, the whole flat + cosine has had a big impact with the new optimizers.
I’m also wondering about how the accuracy variance jumps around epochs 8-15 (on average) with the 20-epoch runs, and whether we need another bump up or down in the middle to help with that.
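Just to sketch what I mean by a bump in the middle (the 40/30/30 phase split and the 1.5x factor are completely made up, this isn't something I've tested):
n = len(learn.data.train_dl) * num_epoch
flat, bump = int(n * 0.4), int(n * 0.3)
# hypothetical 3-phase schedule: flat, cosine bump up, then cosine anneal down to ~zero
phases = [TrainingPhase(flat).schedule_hp('lr', lr),
          TrainingPhase(bump).schedule_hp('lr', (lr, lr * 1.5), anneal=annealing_cos),
          TrainingPhase(n - flat - bump).schedule_hp('lr', (lr * 1.5, lr / 1e5), anneal=annealing_cos)]
learn.callbacks.append(GeneralScheduler(learn, phases))
learn.fit(num_epoch)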
Then I also think that time is probably better spent working on getting AutoOpt up and running so we don’t need to hand tune lol.
I was thinking the same. FFT for others perhaps. I’ll mention it at my meetup and perhaps get some other students who want to explore this.
If AutoOpt lives up to its name, perhaps it would be better
Yes, probably. I went through the code one more time tonight, so it’s becoming clearer. What concerns me most, though, is that they are computing a Hessian for every layer, so I’m wondering if the calculations may become huge on ResNet50+ such that it’s super slow.
They only used an MNIST toy example, so it’s really hard to tell how it will perform on a bigger NN…
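Just to put a rough number on that worry: this is my own back-of-the-envelope, and it assumes a dense per-layer Hessian (AutoOpt may well use a diagonal or block approximation instead).
# a dense Hessian for a tensor with p params has p^2 entries; summing p^2 over every
# parameter tensor of torchvision's resnet50 (used purely for illustration):
import torchvision

model = torchvision.models.resnet50()
entries = sum(p.numel() ** 2 for p in model.parameters())
print(f"{entries:,} entries")  # on the order of 1e13, so dense per-layer Hessians would be brutal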
But it’s the most promising thing I’ve seen in a while, and it would solve so many problems related to constant hand tuning, so it’s likely worth pursuing.
I’m not sure what 6 lines you’re referring to but couldn’t you just do this:
learn.fit(epochs=1, callbacks=OneCycleScheduler(learn, lr_max=0.003, div_factor=1, final_div=1e5, pct_start=0.7))
The div_factor=1 ensures the first phase is flat, and the large final_div gets the annealing_cos to go down close to zero (lr_max/final_div).
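Spelling out the numbers, assuming the usual OneCycleScheduler convention of start = lr_max/div_factor and end = lr_max/final_div:
lr_max, div_factor, final_div, pct_start = 0.003, 1, 1e5, 0.7
start_lr = lr_max / div_factor  # 0.003 -> phase 1 anneals 0.003 -> 0.003, i.e. flat for 70% of training
final_lr = lr_max / final_div   # 3e-08 -> the cosine phase effectively ends at zero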
Nice job @sgebrial - that indeed works perfectly!
I think I like the full function better so we can auto-call the curve plot and show the details (batch break point, etc.), but for functionality, this one-liner definitely works exactly the same!
Thanks again - this is really nice to have!
I added in some different anneals for the second half - might be interesting to compare linear vs exponential vs cosine:
def fcfit(learn, num_epoch=1, lr=4e-3, start_pct=.72, curve="cosine", f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)
    anneal_start = int(n*num_epoch*start_pct)  # compute what batch to start annealing at
    batch_finish = (n*num_epoch - anneal_start)
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    if curve == "cosine":
        curve_type = annealing_cos
    elif curve == "linear":
        curve_type = annealing_linear
    elif curve == "exponential":
        curve_type = annealing_exp
    else:
        raise ValueError(f"annealing type not supported {curve}")
    phase1 = TrainingPhase(n*5 - anneal_start).schedule_hp('lr', lr, anneal=curve_type)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    # save the setup
    learn.callbacks.append(sched)
    # start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then {curve} anneal for {batch_finish}")
    learn.fit(num_epoch)
    # bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
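For reference, a call looks like this (5 epochs at 4e-3 is just what I've been running):
fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72, curve="cosine", f_show_curve=True)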
Here is what I have so far:
class FlatCosAnnealScheduler(LearnerCallback):
    """
    Manage FCFit training as found in the ImageNette experiments.
    Code format is based on OneCycleScheduler
    """
    def __init__(self, learn:Learner, lr:float=4e-3, moms:Floats=(0.95, 0.99), start_pct:float=0.72,
                 tot_epochs:int=2, start_epoch:int=None):
        super().__init__(learn)
        self.lr, self.start_pct = lr, start_pct
        self.moms = tuple(listify(moms, 2))
        self.start_epoch, self.tot_epochs = start_epoch, tot_epochs

    def steps(self, *steps_cfg:StartOptEnd):
        "Build anneal schedule for parameters"
        return [Scheduler(step, n_iter, func=func)
                for (step, (n_iter, func)) in zip(steps_cfg, self.phases)]

    def on_train_begin(self, n_epochs:int, **kwargs:Any)->None:
        "Initialize optimization parameters based on schedule"
        res = {'epoch':self.start_epoch} if self.start_epoch is not None else None
        self.start_epoch = 0
        self.tot_epochs = ifnone(self.tot_epochs, n_epochs)
        n = len(self.learn.data.train_dl)
        anneal_start = int(n * self.tot_epochs * self.start_pct)
        batch_finish = (n * self.tot_epochs - anneal_start)
        self.phases = ((anneal_start, _), (batch_finish, annealing_cos))
        self.mom_scheds = self.steps(self.moms, (self.moms[1], self.moms[0]))
        self.opt = self.learn.opt
        self.opt.lr, self.opt.mom = self.lr, self.mom_scheds[0].start
        self.idx_s = 0
        return res

    def on_batch_end(self, train, **kwargs:Any)->None:
        "Take one step forward on annealing schedule"
        if train:
            if self.idx_s >= len(self.mom_scheds): return {'stop_training': True, 'stop_epoch': True}
            self.opt.mom = self.mom_scheds[self.idx_s].step()
            self.idx_s += 1

    def on_epoch_end(self, epoch, **kwargs:Any)->None:
        "Tell Learner to stop if the cycle is finished."
        if epoch > self.tot_epochs: return {'stop_training': True}


def fcfit(learn:Learner, cyc_len:int, lr:float=defaults.lr,
          moms:Tuple[float,float]=(0.95, 0.85), div_factor:float=25., start_pct:float=0.72,
          wd:float=None, callbacks:Optional[CallbackList]=None, tot_epochs:int=None, show_curve:bool=False)->None:
    "Fit a model with a flat LR followed by cosine annealing."
    max_lr = learn.lr_range(lr)
    callbacks = listify(callbacks)
    callbacks.append(FlatCosAnnealScheduler(learn, lr, moms=moms, start_pct=start_pct, tot_epochs=cyc_len))
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
    if show_curve: learn.recorder.plot_lr()
I’m currently debugging the mom_scheds, as I get a ‘float’ object is not callable error. Stack trace:
<ipython-input-85-8536cad3848a> in on_batch_end(self, train, **kwargs)
     35         if train:
     36             print(self.mom_scheds[self.idx_s])
---> 37             self.opt.mom = self.mom_scheds[self.idx_s].step()
     38             self.idx_s+=1
     39

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in step(self)
    388         "Return next value along annealed schedule."
    389         self.n += 1
--> 390         return self.func(self.start, self.end, self.n/self.n_iter)
    391
    392     @property

TypeError: 'float' object is not callable
oh that is nice - now we can also control the momentum with this in one function!
That’s the goal! I’m just unsure where this bug is coming from and how to deal with it… if it persists till morning I’ll post on a separate thread and hopefully get sgugger or someone’s help.
Probably won’t finish tonight. If you or someone else sees something noticeable, let me know!
It’s probably worth replacing n*5 - anneal_start with batch_finish.
This type of schedule was the first thing that I tried. I believe there is a HUGE opportunity in finding an optimal schedule for the new optimizers. For instance, I don’t use any schedule for momentum; that could be improved. I’m going to explore this a bit in the next couple of days.
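As a sketch, a momentum schedule could be tacked onto the same two phases (the 0.95 -> 0.85 values are arbitrary, and anneal_start / batch_finish are the ones from the fcfit code above):
phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr).schedule_hp('mom', 0.95)
phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos).schedule_hp('mom', (0.95, 0.85), anneal=annealing_cos)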
Name proposal: instead of fcfit, use fit_fc, like fit_one_cycle.
Thanks @a_yasyrev! I like that very much! I think I fixed the code… @LessW2020 do you want to run it on a setup where you know what results to expect, just to be sure? I ran it with Ranger + SSA + MX and the results were pretty much as expected, but before a PR and such I’d like a double-check.
@grankin can you check as well, as it is your brainchild?
class FlatCosAnnealScheduler(LearnerCallback):
    """
    Manage FCFit training as found in the ImageNette experiments.
    Code format is based on OneCycleScheduler
    Based on idea by Mikhail Grankin
    """
    def __init__(self, learn:Learner, lr:float=4e-3, moms:Floats=(0.95, 0.999),
                 start_pct:float=0.72, start_epoch:int=None, tot_epochs:int=None,
                 curve='cosine'):
        super().__init__(learn)
        n = len(learn.data.train_dl)
        self.anneal_start = int(n * tot_epochs * start_pct)
        self.batch_finish = (n * tot_epochs - self.anneal_start)
        if curve == "cosine":
            curve_type = annealing_cos
        elif curve == "linear":
            curve_type = annealing_linear
        elif curve == "exponential":
            curve_type = annealing_exp
        else:
            raise ValueError(f"annealing type not supported {curve}")
        phase0 = TrainingPhase(self.anneal_start).schedule_hp('lr', lr).schedule_hp('mom', moms[0])
        phase1 = TrainingPhase(self.batch_finish).schedule_hp('lr', lr, anneal=curve_type).schedule_hp('mom', moms[1])
        phases = [phase0, phase1]
        self.phases, self.start_epoch = phases, start_epoch

    def on_train_begin(self, epoch:int, **kwargs:Any)->None:
        "Initialize the schedulers for training."
        res = {'epoch':self.start_epoch} if self.start_epoch is not None else None
        self.start_epoch = ifnone(self.start_epoch, epoch)
        self.scheds = [p.scheds for p in self.phases]
        self.opt = self.learn.opt
        for k,v in self.scheds[0].items():
            v.restart()
            self.opt.set_stat(k, v.start)
        self.idx_s = 0
        return res

    def jump_to_epoch(self, epoch:int)->None:
        for _ in range(len(self.learn.data.train_dl) * epoch):
            self.on_batch_end(True)

    def on_batch_end(self, train, **kwargs:Any)->None:
        "Take a step in lr,mom sched, start next stepper when the current one is complete."
        if train:
            if self.idx_s >= len(self.scheds): return {'stop_training': True, 'stop_epoch': True}
            sched = self.scheds[self.idx_s]
            for k,v in sched.items(): self.opt.set_stat(k, v.step())
            if list(sched.values())[0].is_done: self.idx_s += 1


def fit_fc(learn:Learner, tot_epochs:int=None, lr:float=defaults.lr, moms:Tuple[float,float]=(0.95, 0.85), start_pct:float=0.72,
           wd:float=None, callbacks:Optional[CallbackList]=None, show_curve:bool=False)->None:
    "Fit a model with Flat Cosine Annealing"
    max_lr = learn.lr_range(lr)
    callbacks = listify(callbacks)
    callbacks.append(FlatCosAnnealScheduler(learn, lr, moms=moms, start_pct=start_pct, tot_epochs=tot_epochs))
    learn.fit(tot_epochs, max_lr, wd=wd, callbacks=callbacks)
    if show_curve: learn.recorder.plot_lr()
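For reference, this is how I've been calling it (5 epochs at 4e-3, same as the earlier runs):
fit_fc(learn, tot_epochs=5, lr=4e-3, start_pct=0.72, show_curve=True)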
Edit: I believe I have working momentum
Edit x2: I do not… reverted to original
I definitely know it’s doing something right… first run: 78%! Second run: 75% (running for 5 epochs).
Also, should we PR this along with Ranger et al. for the main lib?
It looks to match up to what we had before. The above note is still true. I’ll wait for a day in case we come up with any more discoveries on how to improve this functionality.
Hi @muellerzr, @LessW2020 (sorry, I’m a new user and can only tag 2), it’s a very long thread and I’m still going through it. I created the Mish activation function with a view to improving SOTA test performance, and I do acknowledge that it has a comparatively slower run time; I’m working on optimizing it further. Thanks for giving it a try. If you have any specific questions regarding Mish, feel free to ask.
Best.
Diganta
Hi Diganta - great to see you here on the boards! And congrats again on developing Mish!
Definitely let us know if you are able to further improve the performance of Mish.
I hope you’ll stick around here on the boards as you have time; it’s great to have you here.
I’m hoping we’ll be able to get AutoOpt up and running and can use that with Mish in the near future to further explore its potential.
Nice job @muellerzr - it’s resulting in the same accuracy and the expected curve.
I have one question though - I’m confused by cyc_len and tot_epochs.
I put cyc_len=2 thinking I would get 2 cycles in one epoch… instead I got 2 epochs.
I think that needs to be clarified for users and me, and I also think total_epochs should be right up front after learner instead of second to last.
I think in most cases people will just use cyc_len = 1 (assuming that means spread the whole curve over total epochs) so that should have a default value of 1 and be put way in the back since it’s not often used?
edit update: I tested this:
fit_fc(learn,2, 4e-3, tot_epochs=1,show_curve=True)
so I expressly said “1 epoch”, but apparently cyc_len=2 overrides that and it then ran 2 epochs…
Anyway, that’s very confusing so I think we should make it simpler/easier to use.
Otherwise as noted, curves and accuracy look great!
Congrats on those results @LessW2020 @muellerzr !
Hopefully I have something for you to get slightly better results. If you guys are using Lookahead (even the combined version), right before evaluation there is a decision that should be made:
- At the end of an epoch, most likely nb_batches % k != 0, which means you are evaluating your model on your fast weights (before the next synchronization).
- The difference might be slim but positive, as there are two choices right before evaluation: copy the slow weights to the fast weights (walking a few steps back), or perform a synchronization even though you haven’t yet performed k fast steps since the last sync.
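To make that concrete with some made-up numbers:
nb_batches, k = 187, 6        # batches per epoch and Lookahead sync period, both made up
total_steps = nb_batches * 2  # after 2 epochs of training
print(total_steps % k)        # 2 -> evaluation would happen 2 fast steps after the last sync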
I’m still investigating which option gives the best results, but at least it’s better to have the choice. You can find the method I implemented in my commit; it could be used as follows:
from torch.optim import Adam

optimizer = Adam(model_params)
optimizer = Lookahead(optimizer, sync_rate=0.5, sync_period=6)

for _ in range(nb_epochs):
    # Train here
    optimizer.sync_params()
    # not specifying sync_rate means model params <- slow params
    # otherwise optimizer.sync_params(0.5) will force early synchronization
    # Evaluate here
Hope this helps!
Cheers
@LessW2020 yes, I need to adjust that. It’s really epochs not cycles. Will adjust soon. My apologies!!!
I’ll simplify its structure a lot. Glad to see it’s working as expected, though.
When we’re all good and ready, I have the basic structure done (like one cycle’s) and we can just PR it.
Hi @fgfm - thanks for pointing this out! I see your point and it’s a good one.
We are using Ranger, so yes, Lookahead is in there… that’s a really good option to control, so we can test and see what works better.
Your code there looks really good, btw.