How many epochs to train for using OneCycle Policy

benihime91 · August 7, 2020, 2:16pm

Hello.

Does anyone know what the optimal number of epochs is when training with the OneCyclePolicy?
In the course I saw that the number of epochs used was around 25-30 (correct me if I am wrong), but more commonly I have seen Jeremy use just 5-6 epochs with learn.fit_one_cycle.

Let’s say I want to train for 100 epochs. Is it still feasible to use the OneCyclePolicy?

Should I break my training into smaller cycles (let’s say 20 epochs each) instead of doing 100 epochs in a single run?
Will my results be worse if I train for 100 epochs at a stretch with the OneCyclePolicy?

Finally, I just saw that PyTorch now has torch.optim.lr_scheduler.OneCycleLR.
Is it similar to the way fast.ai does one-cycle scheduling?

NB: I am a beginner, so I apologize if my questions sound funny or I get my terminology mixed up.

stefan-ai · August 7, 2020, 3:12pm

Hi @benihime91

There isn’t really a number of epochs that can be judged as optimal in general. It depends on many factors, e.g. the prediction task, how much training data you have, which architecture and hyperparameters you choose, whether you train from scratch or fine-tune, etc.

There are different approaches, but in the upcoming DL course Jeremy mentions that you should generally train until your metric of interest, e.g. accuracy, starts getting worse. If you’re fine-tuning a pre-trained model on a small dataset, this can already be the case after 2-4 epochs. On the other hand, if you train a very deep neural net from scratch on a large dataset, you will need to train for dozens of epochs until your model fits the data well.
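To make the "train until the metric starts getting worse" idea concrete, here is a minimal sketch in plain Python. The function name `best_epoch`, the `patience` parameter, and the accuracy values are all made up for illustration; this is not how fastai implements it, just the general logic.

```python
def best_epoch(metric_per_epoch, patience=2):
    """Return the index of the best epoch seen before stopping.

    Stops scanning once the metric (higher is better) has failed to
    improve for `patience` consecutive epochs. Hypothetical helper.
    """
    best_idx, best_val = 0, float("-inf")
    for epoch, val in enumerate(metric_per_epoch):
        if val > best_val:
            best_idx, best_val = epoch, val
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop here
    return best_idx


# Made-up validation accuracies, one value per epoch.
accs = [0.71, 0.78, 0.82, 0.81, 0.80, 0.83]
stop_at = best_epoch(accs, patience=2)
print(f"keep the model from epoch index {stop_at}")
```

Note that a small `patience` can stop before a later improvement (the last value in the list above is never reached), so the right value depends on how noisy your metric is.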

I’m not aware of any limits regarding the number of epochs for fit_one_cycle. I read in another thread that there definitely is a difference between splitting your epochs into smaller chunks and training all epochs in one go, and that training in one go is the better option. Unfortunately I cannot find that thread anymore.

benihime91 · August 7, 2020, 3:27pm

Hello @stefan-ai

Thank You for replying… And yes I do understand this:

My question was more along the lines of training from scratch, in which case one would expect to train a model for dozens of epochs. I haven’t really seen anyone using the OneCyclePolicy while training from scratch. In transfer learning, the OneCyclePolicy undoubtedly lets the model converge quickly in fewer epochs (at least in my case).
But will this still work when training from scratch?

Do you remember anything about what was written in that thread? Even the general gist would help me a ton. It’s a bummer you couldn’t find it.

Thanks for replying …

stefan-ai · August 7, 2020, 4:00pm

Here are some examples of using OneCyclePolicy when training from scratch:

Seq2seq with attention: even though it uses pre-trained embeddings, the rest of the model is trained from scratch. It has only been trained for 15 epochs for illustration but could certainly be trained longer to improve the resulting translations.
Pre-training of Vietnamese ULMFiT: also trained from scratch with fit_one_cycle. Even though the language model here is only trained for 10 epochs, I see no reason why this wouldn’t work for more epochs (maybe on a larger dataset).
Lesson 6 - Rossmann: here a tabular learner is trained from scratch with fit_one_cycle. Interestingly, the model here is trained in 3 “parts” using 5 epochs each.

I think the main intuition is that if you train for more epochs in one go, the one cycle policy is scheduled over the entire training run. If you split training into smaller chunks of epochs, you launch a separate one cycle schedule in each of these chunks.
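Here is a rough sketch of that difference in plain Python. It is a simplified stand-in for a one-cycle schedule (cosine warm-up to a peak learning rate, then cosine decay), not fastai’s or PyTorch’s actual code. The function names and the values for lr_max, div, div_final and pct_start are assumptions picked for illustration, and it uses one step per epoch to keep things short.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))


def one_cycle_lrs(total_steps, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Simplified one-cycle schedule: warm up to lr_max, then decay.

    All default values are illustrative assumptions.
    """
    warm = int(total_steps * pct_start)
    lrs = []
    for step in range(total_steps):
        if step < warm:
            # warm-up phase: lr_max / div -> lr_max
            lrs.append(cos_anneal(lr_max / div, lr_max, step / warm))
        else:
            # decay phase: lr_max -> lr_max / div_final
            pct = (step - warm) / (total_steps - warm)
            lrs.append(cos_anneal(lr_max, lr_max / div_final, pct))
    return lrs


def count_peaks(lrs):
    """Count strict local maxima in a learning-rate sequence."""
    return sum(
        1 for i in range(1, len(lrs) - 1)
        if lrs[i] > lrs[i - 1] and lrs[i] > lrs[i + 1]
    )


# One schedule stretched over 100 epochs (one step per epoch for simplicity).
single = one_cycle_lrs(100)

# Five separate 20-epoch schedules run back to back.
chunked = [lr for _ in range(5) for lr in one_cycle_lrs(20)]

print("peaks in a single 100-epoch cycle:", count_peaks(single))
print("peaks in five 20-epoch cycles:", count_peaks(chunked))
```

In the single run the learning rate rises once and then decays for the rest of training. In the chunked run it jumps back up to a high learning rate at the start of every chunk, so the model is repeatedly pushed around after it has already started settling.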