Great job @LessW2020! I’ve been busy busting out that notebook so haven’t gotten there yet… I still have a few hours at work, let me see what I can do
Awesome, hope you can tweak and improve it. Clearly, the whole flat + cosine has had a big impact with the new optimizers.
I’m also wondering about how the accuracy variance jumps around epochs 8-15 (on average) with the 20-epoch runs, and whether we need another bump up or down in the middle to help with that.
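Just to sketch what I mean by a bump in the middle (the 40/30/30 phase split and the 1.5x factor are completely made up, this isn't something I've tested):
n = len(learn.data.train_dl) * num_epoch
flat, bump = int(n * 0.4), int(n * 0.3)
# hypothetical 3-phase schedule: flat, cosine bump up, then cosine anneal down to ~zero
phases = [TrainingPhase(flat).schedule_hp('lr', lr),
          TrainingPhase(bump).schedule_hp('lr', (lr, lr * 1.5), anneal=annealing_cos),
          TrainingPhase(n - flat - bump).schedule_hp('lr', (lr * 1.5, lr / 1e5), anneal=annealing_cos)]
learn.callbacks.append(GeneralScheduler(learn, phases))
learn.fit(num_epoch)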
Then I also think that time is probably better spent working on getting AutoOpt up and running so we don’t need to hand tune lol.
I was thinking the same. FFT for others perhaps. I’ll mention it at my meetup and perhaps get some other students who want to explore this.
If AutoOpt lives up to its name, perhaps it would be better
Yes, probably. I went through the code one more time tonight, so it’s becoming clearer. What concerns me most, though, is that they are computing a Hessian for every layer, so I’m wondering if the calculations may become huge on ResNet50+ such that it’s super slow.
They only used an MNIST toy example, so it’s really hard to tell how it will perform on a bigger NN…
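Just to put a rough number on that worry: this is my own back-of-the-envelope, and it assumes a dense per-layer Hessian (AutoOpt may well use a diagonal or block approximation instead).
# a dense Hessian for a tensor with p params has p^2 entries; summing p^2 over every
# parameter tensor of torchvision's resnet50 (used purely for illustration):
import torchvision

model = torchvision.models.resnet50()
entries = sum(p.numel() ** 2 for p in model.parameters())
print(f"{entries:,} entries")  # on the order of 1e13, so dense per-layer Hessians would be brutal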
But it’s the most promising thing I’ve seen in a while, and it would solve so many problems related to constant hand tuning, so it’s likely worth pursuing.
I’m not sure what 6 lines you’re referring to but couldn’t you just do this:
learn.fit(epochs=1, callbacks=OneCycleScheduler(learn, lr_max=0.003, div_factor=1, final_div=1e5, pct_start=0.7))
The div_factor=1 ensures the first phase is flat, and the large final_div gets the annealing_cos to go down close to zero (lr_max/final_div).
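Spelling out the numbers, assuming the usual OneCycleScheduler convention of start = lr_max/div_factor and end = lr_max/final_div:
lr_max, div_factor, final_div, pct_start = 0.003, 1, 1e5, 0.7
start_lr = lr_max / div_factor  # 0.003 -> phase 1 anneals 0.003 -> 0.003, i.e. flat for 70% of training
final_lr = lr_max / final_div   # 3e-08 -> the cosine phase effectively ends at zero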
Nice job @sgebrial - that indeed works perfectly!
I think I like the full function better so we can auto-call the curve plot and show the details (batch break point, etc.), but for functionality, this one-liner definitely works exactly the same!
Thanks again - this is really nice to have!
I added in some different anneals for the second half - might be interesting to compare linear vs exponential vs cosine:
def fcfit(learn, num_epoch=1, lr=4e-3, start_pct=.72, curve="cosine", f_show_curve=True):
    if num_epoch < 1:
        raise ValueError("num_epoch must be 1 or higher")
    n = len(learn.data.train_dl)
    anneal_start = int(n*num_epoch*start_pct)  # compute what batch to start annealing at
    batch_finish = (n*num_epoch - anneal_start)
    phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr)
    if curve == "cosine":
        curve_type = annealing_cos
    elif curve == "linear":
        curve_type = annealing_linear
    elif curve == "exponential":
        curve_type = annealing_exp
    else:
        raise ValueError(f"annealing type not supported {curve}")
    phase1 = TrainingPhase(n*5 - anneal_start).schedule_hp('lr', lr, anneal=curve_type)
    phases = [phase0, phase1]
    sched = GeneralScheduler(learn, phases)
    # save the setup
    learn.callbacks.append(sched)
    # start the training
    print(f"fcfit: num_epochs: {num_epoch}, lr = {lr}")
    print(f"Flat for {anneal_start} batches, then {curve} anneal for {batch_finish}")
    learn.fit(num_epoch)
    # bonus - show lr curve?
    if f_show_curve:
        learn.recorder.plot_lr()
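For reference, a call looks like this (5 epochs at 4e-3 is just what I've been running):
fcfit(learn, num_epoch=5, lr=4e-3, start_pct=0.72, curve="cosine", f_show_curve=True)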
Here is what I have so far:
class FlatCosAnnealScheduler(LearnerCallback):
    """
    Manage FCFit training as found in the ImageNette experiments.
    Code format is based on OneCycleScheduler
    """
    def __init__(self, learn:Learner, lr:float=4e-3, moms:Floats=(0.95, 0.99), start_pct:float=0.72,
                 tot_epochs:int=2, start_epoch:int=None):
        super().__init__(learn)
        self.lr, self.start_pct = lr, start_pct
        self.moms = tuple(listify(moms, 2))
        self.start_epoch, self.tot_epochs = start_epoch, tot_epochs

    def steps(self, *steps_cfg:StartOptEnd):
        "Build anneal schedule for parameters"
        return [Scheduler(step, n_iter, func=func)
                for (step, (n_iter, func)) in zip(steps_cfg, self.phases)]

    def on_train_begin(self, n_epochs:int, **kwargs:Any)->None:
        "Initialize optimization parameters based on schedule"
        res = {'epoch':self.start_epoch} if self.start_epoch is not None else None
        self.start_epoch = 0
        self.tot_epochs = ifnone(self.tot_epochs, n_epochs)
        n = len(self.learn.data.train_dl)
        anneal_start = int(n * self.tot_epochs * self.start_pct)
        batch_finish = (n * self.tot_epochs - anneal_start)
        self.phases = ((anneal_start, _), (batch_finish, annealing_cos))
        self.mom_scheds = self.steps(self.moms, (self.moms[1], self.moms[0]))
        self.opt = self.learn.opt
        self.opt.lr, self.opt.mom = self.lr, self.mom_scheds[0].start
        self.idx_s = 0
        return res

    def on_batch_end(self, train, **kwargs:Any)->None:
        "Take one step forward on annealing schedule"
        if train:
            if self.idx_s >= len(self.mom_scheds): return {'stop_training': True, 'stop_epoch': True}
            self.opt.mom = self.mom_scheds[self.idx_s].step()
            self.idx_s += 1

    def on_epoch_end(self, epoch, **kwargs:Any)->None:
        "Tell Learner to stop if the cycle is finished."
        if epoch > self.tot_epochs: return {'stop_training': True}


def fcfit(learn:Learner, cyc_len:int, lr:float=defaults.lr,
          moms:Tuple[float,float]=(0.95, 0.85), div_factor:float=25., start_pct:float=0.72,
          wd:float=None, callbacks:Optional[CallbackList]=None, tot_epochs:int=None, show_curve:bool=False)->None:
    "Fit a model with a flat LR followed by cosine annealing."
    max_lr = learn.lr_range(lr)
    callbacks = listify(callbacks)
    callbacks.append(FlatCosAnnealScheduler(learn, lr, moms=moms, start_pct=start_pct, tot_epochs=cyc_len))
    learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
    if show_curve: learn.recorder.plot_lr()
I’m currently debugging the mom_scheds, as I get a ‘float’ object is not callable error. Stack trace:
<ipython-input-85-8536cad3848a> in on_batch_end(self, train, **kwargs)
     35         if train:
     36             print(self.mom_scheds[self.idx_s])
---> 37             self.opt.mom = self.mom_scheds[self.idx_s].step()
     38             self.idx_s+=1
     39

/usr/local/lib/python3.6/dist-packages/fastai/callback.py in step(self)
    388         "Return next value along annealed schedule."
    389         self.n += 1
--> 390         return self.func(self.start, self.end, self.n/self.n_iter)
    391
    392     @property

TypeError: 'float' object is not callable
oh that is nice - now we can also control the momentum with this in one function!
That’s the goal! I’m just unsure where this bug is coming from and how to deal with it… if it persists till morning I’ll post on a separate thread and hopefully get sgugger or someone’s help.
Probably won’t finish tonight. If you or someone else sees something noticeable, let me know!
It’s probably worth replacing n*5 - anneal_start with batch_finish.
This type of schedule was the first thing that I tried. I believe there is a HUGE opportunity in finding an optimal schedule for the new optimizers. For instance, I don’t use any schedule for momentum; that could be improved. I’m going to explore this a bit in the next couple of days.
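As a sketch, a momentum schedule could be tacked onto the same two phases (the 0.95 -> 0.85 values are arbitrary, and anneal_start / batch_finish are the ones from the fcfit code above):
phase0 = TrainingPhase(anneal_start).schedule_hp('lr', lr).schedule_hp('mom', 0.95)
phase1 = TrainingPhase(batch_finish).schedule_hp('lr', lr, anneal=annealing_cos).schedule_hp('mom', (0.95, 0.85), anneal=annealing_cos)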
Name proposal: instead of fcfit, use fit_fc, like fit_one_cycle.
Thanks @a_yasyrev! I like that very much! I think I fixed the code… @LessW2020 do you want to run it on a setup where you know what results to expect, just to be sure? I ran it with Ranger + SSA + MX and the results were pretty much as expected, but before a PR and such I’d like a double-check.
@grankin can you check as well, as it is your brainchild?
class FlatCosAnnealScheduler(LearnerCallback):
    """
    Manage FCFit training as found in the ImageNette experiments.
    Code format is based on OneCycleScheduler
    Based on idea by Mikhail Grankin
    """
    def __init__(self, learn:Learner, lr:float=4e-3, moms:Floats=(0.95, 0.999),
                 start_pct:float=0.72, start_epoch:int=None, tot_epochs:int=None,
                 curve='cosine'):
        super().__init__(learn)
        n = len(learn.data.train_dl)
        self.anneal_start = int(n * tot_epochs * start_pct)
        self.batch_finish = (n * tot_epochs - self.anneal_start)
        if curve == "cosine":
            curve_type = annealing_cos
        elif curve == "linear":
            curve_type = annealing_linear
        elif curve == "exponential":
            curve_type = annealing_exp
        else:
            raise ValueError(f"annealing type not supported {curve}")
        phase0 = TrainingPhase(self.anneal_start).schedule_hp('lr', lr).schedule_hp('mom', moms[0])
        phase1 = TrainingPhase(self.batch_finish).schedule_hp('lr', lr, anneal=curve_type).schedule_hp('mom', moms[1])
        phases = [phase0, phase1]
        self.phases, self.start_epoch = phases, start_epoch

    def on_train_begin(self, epoch:int, **kwargs:Any)->None:
        "Initialize the schedulers for training."
        res = {'epoch':self.start_epoch} if self.start_epoch is not None else None
        self.start_epoch = ifnone(self.start_epoch, epoch)
        self.scheds = [p.scheds for p in self.phases]
        self.opt = self.learn.opt
        for k,v in self.scheds[0].items():
            v.restart()
            self.opt.set_stat(k, v.start)
        self.idx_s = 0
        return res

    def jump_to_epoch(self, epoch:int)->None:
        for _ in range(len(self.learn.data.train_dl) * epoch):
            self.on_batch_end(True)

    def on_batch_end(self, train, **kwargs:Any)->None:
        "Take a step in lr,mom sched, start next stepper when the current one is complete."
        if train:
            if self.idx_s >= len(self.scheds): return {'stop_training': True, 'stop_epoch': True}
            sched = self.scheds[self.idx_s]
            for k,v in sched.items(): self.opt.set_stat(k, v.step())
            if list(sched.values())[0].is_done: self.idx_s += 1


def fit_fc(learn:Learner, tot_epochs:int=None, lr:float=defaults.lr, moms:Tuple[float,float]=(0.95, 0.85), start_pct:float=0.72,
           wd:float=None, callbacks:Optional[CallbackList]=None, show_curve:bool=False)->None:
    "Fit a model with Flat Cosine Annealing"
    max_lr = learn.lr_range(lr)
    callbacks = listify(callbacks)
    callbacks.append(FlatCosAnnealScheduler(learn, lr, moms=moms, start_pct=start_pct, tot_epochs=tot_epochs))
    learn.fit(tot_epochs, max_lr, wd=wd, callbacks=callbacks)
    if show_curve: learn.recorder.plot_lr()
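For reference, this is how I've been calling it (5 epochs at 4e-3, same as the earlier runs):
fit_fc(learn, tot_epochs=5, lr=4e-3, start_pct=0.72, show_curve=True)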
Edit: I believe I have working momentum
Edit x2: I do not… reverted to original
I definitely know it’s doing something right… first run: 78%! Second run: 75% (running for 5 epochs).
Also, should we PR this along with Ranger et al. for the main lib?
It looks to match up to what we had before. The above note is still true. I’ll wait for a day in case we come up with any more discoveries on how to improve this functionality.
Hi @muellerzr, @LessW2020 (sorry, I’m a new user and can only tag 2), it’s a very long thread and I’m still going through it. I created the Mish activation function with a view to improving SOTA test performance, and I do acknowledge that it has a comparatively slower run time; I’m working on optimizing it further. Thanks for giving it a try. If you have any specific questions regarding Mish, feel free to ask.
Best.
Diganta
Hi Diganta - great to see you here on the boards! And congrats again on developing Mish!
Definitely let us know if you are able to further improve the performance of Mish.
I hope you’ll stick around here on the boards as you have time; it’s great to have you here.
I’m hoping we’ll be able to get AutoOpt up and running and can use that with Mish in the near future to further explore its potential.
Nice job @muellerzr - it’s resulting in the same accuracy and the expected curve.
I have one question though - I’m confused by cyc_len and tot_epochs.
I put cyc_len=2 thinking I would get 2 cycles in one epoch… instead I got 2 epochs.
I think that needs to be clarified for users and me, and I also think total_epochs should be right up front after learner instead of second to last.
I think in most cases people will just use cyc_len = 1 (assuming that means spread the whole curve over total epochs) so that should have a default value of 1 and be put way in the back since it’s not often used?
edit update: I tested this:
fit_fc(learn,2, 4e-3, tot_epochs=1,show_curve=True)
so I expressly said “1 epoch”, but apparently cyc_len=2 overrides that and it then ran 2 epochs…
Anyway, that’s very confusing so I think we should make it simpler/easier to use.
Otherwise as noted, curves and accuracy look great!
Congrats on those results @LessW2020 @muellerzr !
Hopefully I have something for you to get slightly better results. If you guys are using Lookahead (even the combined version), right before evaluation there is a decision that should be made:
- At the end of an epoch, most likely nb_batches % k != 0, which means you are evaluating your model on your fast weights (before the next synchronization).
- The difference might be slim but positive, as there are two choices right before evaluation: copy the slow weights to the fast weights (walking a few steps back), or perform a synchronization even though you haven’t yet performed k fast steps since the last sync.
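To make that concrete with some made-up numbers:
nb_batches, k = 187, 6        # batches per epoch and Lookahead sync period, both made up
total_steps = nb_batches * 2  # after 2 epochs of training
print(total_steps % k)        # 2 -> evaluation would happen 2 fast steps after the last sync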
I’m still investigating which option gives the best results, but at least it’s better to have the choice. You can find the method I implemented in my commit; it could be used as follows:
from torch.optim import Adam

optimizer = Adam(model_params)
optimizer = Lookahead(optimizer, sync_rate=0.5, sync_period=6)

for _ in range(nb_epochs):
    # Train here
    optimizer.sync_params()
    # not specifying sync_rate means model params <- slow params
    # otherwise optimizer.sync_params(0.5) will force early synchronization
    # Evaluate here
Hope this helps!
Cheers
@LessW2020 yes, I need to adjust that. It’s really epochs not cycles. Will adjust soon. My apologies!!!
I’ll simplify its structure a lot. Glad to see it’s working as expected, though.
When we’re all good and ready, I have the basic structure done (like one cycle’s) and we can just PR it.
Hi @fgfm - thanks for pointing this out! I see your point and it’s a good one.
We are using Ranger, so yes, Lookahead is in there… that’s a really good option to control, so we can test and see what works better.
Your code there looks really good, btw.