Jeremy notes that Imagenette was created “…to quickly see if my algorithm ideas might have a chance of working. They normally don’t…”

Thought I’d share one such idea that doesn’t work (as far as I can tell): adding a large-amplitude, high-frequency oscillation on top of the `one_cycle` schedule. So far, nothing I’ve tried with this idea seems to ‘help’ (or ‘hurt’).

The motivation was, given that a too-large learning rate can lead to poor convergence and even divergent behavior if it’s kept going for long, what if there are only quick “bursts” where the learning rate briefly exceeds its (‘recommended’) max value?
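To put rough numbers on “briefly exceeds”: with the amplitude and offset defaults used in the code below (`amp=0.3`, `offset=0.8`), the sine term multiplies the underlying cosine schedule by a factor oscillating between about 0.94× and 1.54×, so the LR repeatedly spikes up to roughly 1.5× its nominal value:

```python
import math

# The spike multiplier in the schedule below is 1 + amp*(sin(...) + offset).
# With the defaults amp=0.3, offset=0.8, the sine sweeps -1..+1, giving:
amp, offset = 0.3, 0.8
lo = 1 + amp * (offset - 1)  # sine at -1: ~0.94x the base cosine schedule
hi = 1 + amp * (offset + 1)  # sine at +1: ~1.54x the base cosine schedule
print(lo, hi)
```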

Programmatically, this looked like (`fastai2` code)…

```
import math
import numpy as np
from fastai2.basics import *             # Learner, patch, L, defaults, ...
from fastai2.callback.schedule import *  # annealer, SchedCos, combine_scheds, combined_cos, ParamScheduler

@annealer
def SchedCosSpike(start, end, pos, amp=0.3, freq=50, offset=.8):
    "Cosine anneal `start`→`end`, modulated by a high-frequency sine 'spike'"
    return (start + (1 + math.cos(math.pi*(1-pos))) * (end-start) / 2) * (1+amp*(math.sin(freq*math.pi*(1-pos))+offset))

def combined_cos_spike(pct, start, middle, end):
    "Return a scheduler with cosine annealing and spikes from `start`→`middle` & `middle`→`end`"
    return combine_scheds([pct,1-pct], [SchedCos(start, middle), SchedCosSpike(middle, end)])

@patch
def godziLR(self:Learner, n_epoch, lr_max=None, div=25., div_final=1e5, pct_start=0.25, wd=defaults.wd,
            moms=(0.95, 0.35, 0.95), cbs=None, reset_opt=False):
    "Fit `self.model` for `n_epoch` using the 1cycle policy."
    if self.opt is None: self.create_opt()
    self.opt.set_hyper('lr', self.lr if lr_max is None else lr_max)
    lr_max = np.array([h['lr'] for h in self.opt.hypers])
    scheds = {'lr': combined_cos_spike(pct_start, lr_max/div, lr_max, lr_max/div_final),
              'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
```

The name `godziLR` was a visual pun on how the LR schedule looks when graphed (depending on how you choose `amp`, `offset` and `freq`), one example being like this:
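For a standalone look at that shape, the same formula can be evaluated outside fastai (the `start`/`end` values here are illustrative stand-ins for `lr_max` and `lr_max/div_final`):

```python
import math

def sched_cos_spike(start, end, pos, amp=0.3, freq=50, offset=.8):
    # Same expression as SchedCosSpike above, minus the @annealer wrapper
    base = start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
    return base * (1 + amp * (math.sin(freq * math.pi * (1 - pos)) + offset))

start, end = 1e-2, 1e-7  # stand-ins for lr_max and lr_max/div_final
vals = [sched_cos_spike(start, end, i / 1000) for i in range(1001)]
peak = max(vals)  # the early spikes push the LR above the nominal lr_max
print(f"peak LR {peak:.4f} vs nominal max {start}")
```

Plotting `vals` (e.g. with matplotlib) reproduces the spiky, tapering tail that earned the Godzilla name.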

Initially I tried leaving the momentum the same as in `fit_one_cycle`, then later tried (as in the code above) dropping the momentum way down (so that we don’t ‘shoot’ the optimizer off on some fast trajectory). I tried varying the `amp`, `offset` and `freq` parameters quite a bit.
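For concreteness on the momentum change: `combined_cos(pct, m0, m1, m2)` in fastai2 anneals m0→m1 over the first `pct` of training and m1→m2 over the rest. A plain-Python sketch of that schedule, assuming the `moms=(0.95, 0.35, 0.95)` and `pct_start=0.25` from the code above:

```python
import math

def cos_anneal(start, end, pos):
    # Standard cosine interpolation: start at pos=0, end at pos=1
    return start + (1 - math.cos(math.pi * pos)) * (end - start) / 2

def mom_sched(pos, moms=(0.95, 0.35, 0.95), pct_start=0.25):
    # Mirrors combined_cos(pct_start, *moms): two cosine phases back to back
    if pos < pct_start:
        return cos_anneal(moms[0], moms[1], pos / pct_start)
    return cos_anneal(moms[1], moms[2], (pos - pct_start) / (1 - pct_start))

print(mom_sched(0.0), mom_sched(0.25), mom_sched(1.0))  # anneals 0.95 → 0.35 → 0.95
```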

…So far nothing has produced any appreciable change in final accuracy (compared to the default LR schedule), using models from xresnet18 up through xse_resnext101, trained for 5 to 20 epochs.

This is only 24 hours’ work, and its motivation may have been naive, but I often like to answer questions of “would this work?” by actually trying it. Nice to have @jeremy’s Imagenette to try ideas out on!

If anyone has any thoughts on this – whether theoretically why this was a waste of time, or ideas that might actually improve on this – let me know!