Importance of Maintaining Optimizer State

Fun little experiment: Optimizer Benchmarks

Main conclusions from my project page/first blog:

  • OneCycle LR > Constant LR
  • Creating a brand-new optimizer and preserving state by re-using the same optimizer achieve very similar performance, i.e. discarding the optimizer’s state didn’t noticeably hurt the model’s performance, with or without an LR scheduler. Maybe the state is re-learned quickly. (See the sketch after this list for what the two setups look like.)
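
A minimal PyTorch sketch of the two setups being compared, assuming Adam + OneCycle LR as in the experiment. The toy model, data, and hyperparameters are placeholders of my own, not the original benchmark code:

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration (not the original benchmark).
model = nn.Linear(10, 1)
data = torch.randn(64, 10)
target = torch.randn(64, 1)
loss_fn = nn.MSELoss()

def train_stage(model, optimizer, scheduler, steps):
    """Run one short training stage with the given optimizer/scheduler."""
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped per batch

steps_per_stage = 100

# Option A: preserve state — one optimizer/scheduler across both stages,
# so Adam's running moment estimates and the OneCycle schedule carry over.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=2 * steps_per_stage
)
train_stage(model, optimizer, scheduler, steps_per_stage)
train_stage(model, optimizer, scheduler, steps_per_stage)  # same state re-used

# Option B: discard state — re-create the optimizer and scheduler for each
# stage, so Adam's moment estimates start from zero again.
model_b = nn.Linear(10, 1)
for stage in range(2):
    optimizer = torch.optim.Adam(model_b.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-2, total_steps=steps_per_stage
    )
    train_stage(model_b, optimizer, scheduler, steps_per_stage)
```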

Note: These conclusions are based on the Adam optimizer and the OneCycle LR scheduler. I haven’t experimented with other optimizers to see whether dropping their state is more impactful.

Edit note: I’m not proposing that optimizers should always be thrown away; I still believe the general guideline is to keep the same optimizer and its history. Kindly share resources if anyone has found results showing the importance of re-using the same optimizer :slight_smile: