Jupyter notebook explaining four papers by Leslie N. Smith

The following papers by Leslie N. Smith are covered in this notebook:

  1. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay
  2. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
  3. Exploring loss function topology with cyclical learning rates
  4. Cyclical Learning Rates for Training Neural Networks

This notebook covers all of these topics, with the theory as well as the fastai implementations of the relevant techniques (a minimal usage sketch follows the table of contents below).

Table of Contents:

  1. Summary of hyper-parameters
  2. Hyper-params not discussed
  3. Things to remember
  4. Underfitting vs Overfitting
  5. Deep Dive into Underfitting and Overfitting
    1. Underfitting
    2. Overfitting
  6. Choosing Learning Rate
    1. Cyclic Learning Rate (CLR) and the Learning Rate Range Test
    2. ResNet-56
    3. Cyclic Learning Rate
    4. Differences from the original paper
    5. One-cycle policy summary
    6. Learning rate finder test
  7. Introducing Super-Convergence
    1. Testing Linear Interpolation
    2. How was it found in the first place?
    3. Coding Linear Interpolation
  8. Explanation behind Super-Convergence
  9. Choosing Momentum
    1. Some good values of momentum to test
  10. Choosing Weight Decay
    1. How to set the value
  11. Train a final classifier model with the above hyper-param values
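
To give a taste of the fastai side before diving into the notebook, here is a minimal sketch of the learning rate range test from the CLR paper. The fastai v1 API and the small MNIST_SAMPLE demo dataset are assumed here; this is illustrative, not the notebook's exact code.

```python
from fastai.vision import *

# Small demo dataset that ships with fastai (v1 API assumed).
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=64)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# LR range test: train briefly while the learning rate grows
# exponentially, recording the loss at each iteration.
learn.lr_find()
# Plot loss vs. learning rate and pick a max_lr a bit before the
# point where the loss starts to blow up.
learn.recorder.plot()
```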

Cyclic momentum and weight decay are still left to cover, but since most of the material from the papers is already covered, I decided to share the notebook now.

Notebook link: Reproducing Leslie N. Smith’s papers using fastai

Excellent notebook, thank you!

I have updated the notebook with code for cyclical momentum and weight decay as well, so almost everything from the papers is now covered in the notebook.
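
For reference, the gist of the cyclical momentum and weight decay settings with fastai's 1cycle API looks like this (fastai v1 names assumed; a sketch, not the notebook's exact code):

```python
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=64)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

# The 1cycle policy: the learning rate ramps up to max_lr and anneals
# back down, while momentum is cycled in the opposite direction
# between moms[1] (low) and moms[0] (high); wd sets the weight decay.
learn.fit_one_cycle(4, max_lr=3e-3, moms=(0.95, 0.85), wd=1e-4)
```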

One problem I faced while implementing this was that I could not reproduce the interpolation result. The loss-function-topology paper discusses linearly interpolating between the weights found by different CLR cycles and reports that each cycle settles in a different minimum, so the loss should show a peak somewhere along the interpolation path. When I tested my code, I found the opposite: there was no visible peak in the interpolation figure.
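
For anyone who wants to try reproducing this, here is a minimal PyTorch sketch of the interpolation test (my own names and arguments, not the paper's or the notebook's exact code). It assumes `state_a` and `state_b` are `state_dict()` snapshots saved at the end of two cycles and `(xb, yb)` is a representative batch:

```python
# Evaluate the loss at theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b
# for alpha in [0, 1]. A peak in the loss between the endpoints suggests
# the two cycles found different minima separated by a loss barrier.
import copy
import torch

def interpolation_losses(model, state_a, state_b, loss_fn, xb, yb, steps=21):
    probe = copy.deepcopy(model)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        mixed = {}
        for k, v in state_a.items():
            if v.is_floating_point():
                mixed[k] = (1 - alpha) * v + alpha * state_b[k]
            else:
                mixed[k] = v  # keep integer buffers (e.g. BN counters) as-is
        probe.load_state_dict(mixed)
        probe.eval()
        with torch.no_grad():
            losses.append(loss_fn(probe(xb), yb).item())
    return losses
```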