Fun little experiment: Optimizer Benchmarks
Main conclusions from my project page/first blog:
- OneCycle LR > Constant LR
- Making a new optimizer vs. preserving state and re-using the same optimizer both achieve very similar performance, i.e. discarding an optimizer's state didn't really hurt the model's performance, with or without an LR scheduler. Maybe the state is re-learned quickly. (A sketch of the two setups is below, after the note.)
Note: Conclusions are based on the Adam optimizer and OneCycle LR scheduler. I haven't experimented with other optimizers to see if dropping their state is more impactful.
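A minimal PyTorch sketch (a toy model and data of my own, not the actual benchmark code) contrasting the two settings compared above: re-creating the Adam optimizer for each training stage vs. keeping one optimizer (and its moment estimates) alive across stages.

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
# Toy data: 100 batches of random regression examples (assumption, for illustration)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

def train_stage(model, optimizer, scheduler):
    # One pass over the data with the given optimizer + OneCycle schedule
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()

# Setting A: discard the optimizer (and its state) at every stage boundary.
model_a = nn.Linear(10, 1)
for stage in range(3):
    optimizer = torch.optim.Adam(model_a.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, total_steps=len(data))
    train_stage(model_a, optimizer, scheduler)

# Setting B: create the optimizer once and reuse it (state preserved) across stages.
model_b = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model_b.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=len(data) * 3)
for stage in range(3):
    train_stage(model_b, optimizer, scheduler)
```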
Edit Note: I'm not proposing to always throw optimizers away; I still believe the general guideline is to keep the same optimizer and its history. Please share resources if anyone has found results showing the importance of reusing the same optimizer.