The 1cycle policy - an experiment that investigate super-convergence phenomenon described in Leslie Smith's research

cedric · April 10, 2018, 1:30pm

This is an interesting experiment conducted by a fellow under fast.ai’s International Fellowship 2018 that dig into Leslie Smith’s work that Leslie describes the super-convergence phenomenon in this paper, “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 - Learning Rate, Batch Size, Momentum, and Weight Decay”.

Results from the experiments:

By training with high learning rates we can reach a model that gets 93% accuracy in 70 epochs which is less than 7k iterations (as opposed to the 64k iterations which made roughly 360 epochs in the original paper).

This cyclical learning rate and momentums notebook contains all the experiments.

IMO, I think it’s too early to tell how well this technique works in general until we do more work to evaluate this. Nevertheless, I think this is an interesting and promising technique.

Note: everything that follows is unofficial.

The bleeding edge version (beta) of fastai library supports this technique. We can try it out by doing a git pull from fastai repo. Next is a high level summary of fastai library changes for this feature and some quick documentations:

1. New cyclical momentum

To use, add use_clr_beta parameter in the fit function that controls the 1cycle policy. For example:

learn.fit(0.8, 1, cycle_len=95, use_clr_beta=(10, 13.68, 0.95, 0.85), wds=1e-4)

The arguments of the use_clr_beta=(div, pct, max_mom, min_mom) tuples mean:

div: the amount to divide the passed learning rate to get the minimum learning rate. E.g.: pick 1/10th of the maximum learning rate for the minimum learning rate.

pct: the part of the cycle (in percent) that will be devoted to the LR annealing after the triangular cycle. E.g.: dedicate 13.68% of the cycle to the annealing at the end (that’s 13 epochs over 95).

max_mom: maximum momentum. E.g.: 0.95.

min_mom: minimum momentum. E.g.: 0.85.

Note, the two last args can be skipped if you don’t want to use cyclical momentum.

2. New learning rate finder function, lr_find2

This is a variant of lr_find. It doesn’t do an epoch but a fixed number of iterations (which may be more or less than an epoch depending on your data). At each step, it computes the validation loss and the metrics on the next batch in the training loop for the next batch of the validation data, so it’s slower than lr_find.

An example from the notebook under “Tuning weight decay” section:

learn.lr_find2(wds=1e-2, start_lr=0.01, end_lr=100, num_it=100)

The arguments of lr_find2(start_lr, end_lr, num_it, wds, linear, stop_dv)

start_lr: learning rate(s) for a learner’s layer_groups.

end_lr: the maximum learning rate to try.

num_it: the number of iterations you want it to run weight decays, wds.

stop_dv: stops (or not) when the losses starts to explode.

3. New plots

With lr_find2(), validation losses and metrics are saved each time they are computed (whether in normal training or LR find) so we can plot them after if we want.

charming · May 1, 2018, 3:42pm

thanks , it is very useful to me

cedric · May 3, 2018, 5:10am

Update: Jeremy wrote about this in a blog post titled “Training Imagenet in 3 hours for $25; and CIFAR10 for $0.26”.

Congrats to fast.ai + students team! Great work. Good to see these great results.

I think the blog post should also link to the source code of the ImageNet model for the DAWNBench entries when it’s ready.

To add to the list of findings in the blog post, here’s what I found by looking at the code developed in PyTorch:

Distributed training using a modified version of PyTorch DataParallel and DistributedDataParallel modules to a custom DistributedDataParallel. This is important for speed.
- What is the difference between DataParallel and DistributedDataParallel?
  - DataParallel is for performing training on multiple GPUs, single machine.
  - DistributedDataParallel is useful when you want to use multiple machines.
- Writing distributed applications with PyTorch tutorial.
- DistributedDataParallel is a PyTorch extension based on NVIDIA’s APEx contributions.
Prefetch data using DataPrefetcher class, a custom wrapper for PyTorch data loader. Seems to speed up training by ~2%.
- Can’t confirm whether this prefetch directly onto the GPU or CPU.
set cudnn.benchmark = True. A well know CUDA setting for performance.
method teach in lesson 1 & 2 of the deep learning course such as cyclical learning rate, progressive image resizing technique, data augmentation, Test Time Augmentation (TTA), and so on.
using the latest methods from Leslie Smith’s work implemented in the fastai library like the 1cycle policy above.

ammarawan · May 31, 2018, 8:57pm

Hello guys, I am new to this forum so kindly forgive me if I am asking about it in a wrong thread. I am trying to reproduce the fast-imagenet work you guys have done with PyTorch but I am getting errors. I have filed an issue but it seems like this is a more active forum than the issue.

The link to the issue is: https://github.com/fastai/imagenet-fast/issues/9

Any help is much appreciated.

cahya · September 24, 2018, 6:05am

Hi, I am trying to use the function learner.lr_find2(num_it=500) in Language Modeling, but I got the error message “RuntimeError: cannot call .data on a torch.Tensor: did you intend to use autograd.Variable?” as shown below, anyone know what might be wrong?
Thanks

 0%|          | 0/25633 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-80-8f831a14af26> in <module>()
----> 1 learner.lr_find2(num_it=500)

/data/courses/fastai/learner.py in lr_find2(self, start_lr, end_lr, num_it, wds, linear, stop_dv, **kwargs)
    364         layer_opt = self.get_layer_opt(start_lr, wds)
    365         self.sched = LR_Finder2(layer_opt, num_it, end_lr, linear=linear, metrics=self.metrics, stop_dv=stop_dv)
--> 366         self.fit_gen(self.model, self.data, layer_opt, num_it//len(self.data.trn_dl) + 1, all_val=True, **kwargs)
    367         self.load('tmp')
    368 

/data/courses/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    247             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    248             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 249             swa_eval_freq=swa_eval_freq, **kwargs)
    250 
    251     def get_layer_groups(self): return self.models.get_layer_groups()

/data/courses/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, visualize, **kwargs)
    144             t.set_postfix(loss=debias_loss, refresh=False)
    145             stop=False
--> 146             los = debias_loss if not all_val else [debias_loss] + validate_next(model_stepper,metrics, val_iter)
    147             for cb in callbacks: stop = stop or cb.on_batch_end(los)
    148             if stop: return

/data/courses/fastai/model.py in validate_next(stepper, metrics, val_iter)
    220         preds,l = stepper.evaluate(VV(x), VV(y))
    221         res = [delistify(to_np(l))]
--> 222         res += [f(datafy(preds), datafy(y)) for f in metrics]
    223     stepper.reset(True)
    224     return res

/data/courses/fastai/model.py in <listcomp>(.0)
    220         preds,l = stepper.evaluate(VV(x), VV(y))
    221         res = [delistify(to_np(l))]
--> 222         res += [f(datafy(preds), datafy(y)) for f in metrics]
    223     stepper.reset(True)
    224     return res

/data/courses/fastai/core.py in datafy(x)
     17 def datafy(x):
     18     if is_listy(x): return [o.data for o in x]
---> 19     else:           return x.data
     20 
     21 conv_dict = {np.dtype('int8'): torch.LongTensor, np.dtype('int16'): torch.LongTensor,

/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in data(self)
    393     @property
    394     def data(self):
--> 395         raise RuntimeError('cannot call .data on a torch.Tensor: did you intend to use autograd.Variable?')
    396 
    397     # Numpy array interface, to support `numpy.asarray(tensor) -> ndarray`

RuntimeError: cannot call .data on a torch.Tensor: did you intend to use autograd.Variable?

cedric · September 24, 2018, 7:59am

Hi, I have checked.

# model.py in fastai library

preds,l = stepper.evaluate(VV(x), VV(y))
res = [delistify(to_np(l))]
res += [f(datafy(preds), datafy(y)) for f in metrics]

The error means preds is NOT in the correct data type which is supposed to be torch.Tensor. Thus, when PyTorch call the .data getter on it, it throws that runtime error.

Looks like the problem lies in the stepper.evaluate() function. It works for me and I am not so sure what have been changed in fastai library that caused this problem. I am using a slightly older version of fastai library. When is the last time you git clone and git pull from https://github.com/fastai/fastai?

Could you try to debug your notebook/code using PDB?

I will continue to check that stepper.evaluate function to see if I can spot any potential problem.

Update 1: I have checked the Stepper class and found nothing. So, the next thing to check is val_iter.next(), which is an iterator from the validation data loader object. The debugging scope here is beyond me.

cedric · September 24, 2018, 8:36am

Hi guys. As I can’t edit my original post. Just to put this out there. My original post at this point is outdated as fastai library changed quite a bit since. Please refer to Sylvain Gugger’s posts from here on. Sylvain implemented the 1cycle policy and cyclical momentum (use_clr_beta) method in fastai library. I suggest you start looking here:

cahya · September 24, 2018, 12:15pm

Hi Cedric, thanks a lot for your information and suggestion. I pulled the fastai last week when I had this issue, the pull from yesterday don’t solve the issue. Actually I tried also your notebook for building the malay LM, but I got the same issue. So, I will try PDB as suggested when I have time and let you know the result.

cedric · September 24, 2018, 1:41pm

OK. That’s strange.

Well then, I think I can try pulling the latest fastai lib to replicate the issue and from there I can debug my notebook to see what actually happened.

cahya · September 24, 2018, 2:53pm

I use PyCharm to debug it, here is the screenshot on the line where it has the issue:

I think the issue is not in preds, but y. the type of preds is Variable, but y is LongTensor, and y.data is a string and not a Tensor like preds.data.

cahya · September 24, 2018, 3:21pm

Ok, just for my curiosity, I replace the line
res += [f(datafy(preds), datafy(y)) for f in metrics]
with
res += [f(datafy(preds), datafy(VV(y))) for f in metrics]

and it seems that learner.lr_find2(num_it=500) is running without any problem, but I don’t know the impact what I changed at least it shows the culprit is really the variable y

cedric · September 24, 2018, 3:36pm

PyTorch Tensors don’t have a data attribute. Variables do. Variable is a wrapper around tensor that supports automatic differentiation. Variable.data is the underlying tensor. So, to go from Variable to tensor, you just use: Variable.data.

VV is a utility function that creates a single or a list of PyTorch Variable.

Oh, good. Not sure why in the first place the variable y which is the dataset labels type is not PyTorch Variable. The change you made is in fastai lib. Yeah, have to be careful. I think if that change is covered by Python unit test and we run the test suite, we can access the impact of that change, but I don’t think we should even run the test suite as we are not the library developer.

wyquek · October 11, 2018, 11:26am

For the 1 cycle policy, isn’t a difference of 10 times from max to min considered kind of a constrained range to change the LR in? it’s like going from a max of 1e-3 to a min of 1e-4.

digitalspecialists · October 11, 2018, 11:36am

The default in the new fastai v1 fit_one_cycle function is 25. In practise I use anywhere from 10 to 100.