The 1cycle policy - an experiment that investigates the super-convergence phenomenon described in Leslie Smith's research

(Cedric Chee) #1

This is an interesting experiment, conducted by a fellow under the International Fellowship 2018, that digs into Leslie Smith’s work on the super-convergence phenomenon, which he describes in the paper “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 - Learning Rate, Batch Size, Momentum, and Weight Decay”.

Results from the experiments:

By training with high learning rates we can reach a model that gets 93% accuracy in 70 epochs which is less than 7k iterations (as opposed to the 64k iterations which made roughly 360 epochs in the original paper).

This cyclical learning rate and momentums notebook contains all the experiments.

IMO, it’s too early to tell how well this technique works in general; more work is needed to evaluate it. Nevertheless, I think it is an interesting and promising technique.

Note: everything that follows is unofficial.

The bleeding-edge (beta) version of the fastai library supports this technique. We can try it out by doing a git pull from the fastai repo. What follows is a high-level summary of the fastai library changes for this feature, plus some quick documentation:

1. New cyclical momentum

To use it, pass the use_clr_beta parameter, which controls the 1cycle policy, to the fit function. For example: …, 1, cycle_len=95, use_clr_beta=(10, 13.68, 0.95, 0.85), wds=1e-4)

The elements of the use_clr_beta=(div, pct, max_mom, min_mom) tuple mean:

  • div: the amount to divide the passed learning rate by to get the minimum learning rate. E.g.: with div=10, the minimum learning rate is 1/10th of the maximum learning rate.
  • pct: the part of the cycle (in percent) that will be devoted to the LR annealing after the triangular cycle. E.g.: dedicate 13.68% of the cycle to the annealing at the end (that’s 13 epochs over 95).
  • max_mom: maximum momentum. E.g.: 0.95.
  • min_mom: minimum momentum. E.g.: 0.85.

Note: the last two arguments can be skipped if you don’t want to use cyclical momentum.
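To make the tuple concrete, here is a minimal sketch of how such a schedule could be computed per iteration. It is not the fastai implementation: the function name one_cycle, the linear ramps, and the final_div parameter (how far the learning rate drops during the final annealing) are assumptions made for illustration.

```python
def one_cycle(it, n_iter, lr_max, div=10, pct=13.68,
              max_mom=0.95, min_mom=0.85, final_div=100):
    """Return (learning_rate, momentum) for iteration `it` in [0, n_iter).

    Illustrative sketch only; `final_div` is a hypothetical parameter.
    """
    lr_min = lr_max / div
    # Each side of the triangle takes half of the non-annealing part.
    half = int(n_iter * (1 - pct / 100) / 2)
    if it < half:
        # Ramp up: LR rises linearly while momentum falls.
        t = it / half
        return lr_min + t * (lr_max - lr_min), max_mom - t * (max_mom - min_mom)
    if it < 2 * half:
        # Ramp down: LR falls back to lr_min while momentum rises.
        t = (it - half) / half
        return lr_max - t * (lr_max - lr_min), min_mom + t * (max_mom - min_mom)
    # Annealing: LR keeps decreasing below lr_min, momentum stays at max_mom.
    t = (it - 2 * half) / (n_iter - 2 * half)
    return lr_min - t * (lr_min - lr_min / final_div), max_mom


# Print the schedule at a few points, treating one epoch as one step.
for step in (0, 41, 82, 94):
    lr, mom = one_cycle(step, n_iter=95, lr_max=1.0)
    print(step, round(lr, 4), round(mom, 3))
```

With the defaults and n_iter=95 (one step per epoch), each side of the triangle is 41 steps and the last 13 steps are annealing, which matches the 13 epochs over 95 mentioned above.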

2. New learning rate finder function, lr_find2

This is a variant of lr_find. Instead of running for one epoch, it runs for a fixed number of iterations (which may be more or less than an epoch depending on your data). At each step of the training loop, it computes the validation loss and the metrics on the next batch of the validation data, so it’s slower than lr_find.

An example from the notebook, under the “Tuning weight decay” section:

learn.lr_find2(wds=1e-2, start_lr=0.01, end_lr=100, num_it=100)

The arguments of lr_find2(start_lr, end_lr, num_it, wds, linear, stop_dv):

  • start_lr: the starting learning rate(s) for a learner’s layer_groups.
  • end_lr: the maximum learning rate to try.
  • num_it: the number of iterations you want it to run.
  • wds: the weight decays.
  • stop_dv: whether to stop when the losses start to explode.
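As a rough illustration of what a learning rate range test does with these arguments, here is a minimal sketch. It is not the fastai code: lr_range and should_stop are hypothetical helpers, the meaning given to linear (linear instead of exponential spacing) is an assumption, and the divergence factor of 4 is an arbitrary choice.

```python
import math


def lr_range(start_lr, end_lr, num_it, linear=False):
    """Learning rates to try, one per iteration, from start_lr to end_lr."""
    if num_it < 2:
        raise ValueError("num_it must be at least 2")
    lrs = []
    for i in range(num_it):
        t = i / (num_it - 1)  # progress from 0 to 1
        if linear:
            lrs.append(start_lr + t * (end_lr - start_lr))
        else:
            # Exponential spacing: equal multiplicative steps.
            lrs.append(start_lr * (end_lr / start_lr) ** t)
    return lrs


def should_stop(loss, best_loss, factor=4.0):
    """Stop early once the loss is NaN or far above the best loss seen."""
    return math.isnan(loss) or loss > factor * best_loss


# Five exponentially spaced rates between 0.01 and 100.
print([round(lr, 4) for lr in lr_range(0.01, 100, 5)])
```

In a real run, each learning rate would be used for one mini-batch, the loss recorded, and the loop ended early when should_stop returns True.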

3. New plots

With lr_find2(), validation losses and metrics are saved each time they are computed (whether in normal training or LR find), so we can plot them afterwards if we want.

(Charm) #2

Thanks, it is very useful to me.

(Cedric Chee) #3

Update: Jeremy wrote about this in a blog post titled “Training Imagenet in 3 hours for $25; and CIFAR10 for $0.26”.

Congrats to + students team! Great work. Good to see these great results.

I think the blog post should also link to the source code of the ImageNet model for the DAWNBench entries when it’s ready.

To add to the list of findings in the blog post, here’s what I found by looking at the code developed in PyTorch:

  • Distributed training that replaces PyTorch’s DataParallel and DistributedDataParallel modules with a custom DistributedDataParallel. This is important for speed.
    • What is the difference between DataParallel and DistributedDataParallel?
      • DataParallel is for performing training on multiple GPUs, single machine.
      • DistributedDataParallel is useful when you want to use multiple machines.
    • Writing distributed applications with PyTorch tutorial.
    • The custom DistributedDataParallel is a PyTorch extension based on NVIDIA’s Apex contributions.
  • Prefetching data using a DataPrefetcher class, a custom wrapper for the PyTorch data loader. It seems to speed up training by ~2%.
    • I can’t confirm whether this prefetches directly onto the GPU or the CPU.
  • Setting cudnn.benchmark = True, a well-known cuDNN setting for performance.
  • Methods taught in lessons 1 & 2 of the deep learning course, such as cyclical learning rates, progressive image resizing, data augmentation, Test Time Augmentation (TTA), and so on.
  • Using the latest methods from Leslie Smith’s work implemented in the fastai library, like the 1cycle policy above.
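To illustrate the general idea behind the prefetching item above, here is a minimal sketch of a prefetching wrapper around any iterable. It is not the DataPrefetcher class from the DAWNBench code: this hypothetical Prefetcher uses only a background thread and a bounded queue, and does not say anything about where the real class places its data.

```python
import queue
import threading


class Prefetcher:
    """Iterate over `iterable`, loading up to `depth` items ahead of the consumer."""

    _DONE = object()  # sentinel marking the end of the data

    def __init__(self, iterable, depth=2):
        self.iterable = iterable
        self.depth = depth

    def __iter__(self):
        q = queue.Queue(maxsize=self.depth)

        def worker():
            try:
                for item in self.iterable:
                    q.put(item)  # blocks while the queue is full
            finally:
                q.put(self._DONE)  # always signal the end, even on error

        # Daemon thread: it will not keep the program alive if the
        # consumer stops iterating early.
        threading.Thread(target=worker, daemon=True).start()
        while True:
            item = q.get()
            if item is self._DONE:
                return
            yield item


# The consumer sees the same items in the same order, while the next
# ones are being loaded in the background.
for batch in Prefetcher(range(3)):
    print(batch)
```

The benefit comes from overlapping data loading with computation, so the training loop spends less time waiting for its next batch.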

(Ammar Ahmad Awan) #4

Hello guys, I am new to this forum, so kindly forgive me if I am asking in the wrong thread. I am trying to reproduce the fast-imagenet work you guys have done with PyTorch, but I am getting errors. I have filed an issue, but this forum seems more active than the issue tracker.

The link to the issue is:

Any help is much appreciated.