Hi, I’m trying to resume training with fit_one_cycle after a machine crash. I have
every_epoch=True, but I can’t figure out whether I can resume
fit_one_cycle from a specific epoch. The
start_epoch argument seems to have gone away. Has anyone figured out how to do this?
Let’s say your fit_one_cycle command is:
learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7))
Let’s say you interrupted your training at epoch 25. Each epoch was saved by your callback, so your last backup ends in XXX_24, since epochs are numbered from 0, not 1.
Here is your command to resume at the next epoch:
learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7), start_epoch=26)
For this to work, make sure your last backup XXX_24 is still in your models folder. Don’t move it elsewhere.
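The two numbering schemes (0-indexed backup suffixes, epochs counted from 1 in the resume call) make this easy to get wrong, so here is a tiny plain-Python helper (my own naming, not a fastai function) encoding the arithmetic from the example above:

```python
def checkpoint_to_start_epoch(last_suffix):
    """Map the numeric suffix of the last backup written by
    SaveModelCallback(every='epoch') to the start_epoch used above.
    Backups are numbered from 0, so a last backup ending in _24 means
    25 epochs completed; we resume at the next (26th) epoch."""
    epochs_completed = last_suffix + 1
    return epochs_completed + 1

print(checkpoint_to_start_epoch(24))  # 26
```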
I hope it helps.
Let me know.
Here is what I did to add my callback to my learner :
I defined my callback this way :
# Add a save-model callback
learn.callback_fns.append(partial(callbacks.SaveModelCallback, every='epoch', monitor='accuracy', name='classifier-model-saved-by-callback'))
Note: at this point in the code, my learner is already defined and called learn.
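As context (a plain-Python sketch, not fastai code): callback_fns holds callback factories rather than instances; when you fit, fastai v1 calls each entry with the Learner to build the actual callback, which is why partial is used above.

```python
from functools import partial

# Toy stand-in (NOT fastai's class) to show the factory pattern.
class SaveModelCallback:
    def __init__(self, learn, every='improvement', monitor='valid_loss',
                 name='bestmodel'):
        self.learn, self.every = learn, every
        self.monitor, self.name = monitor, name

callback_fns = []
callback_fns.append(partial(SaveModelCallback, every='epoch',
                            monitor='accuracy',
                            name='classifier-model-saved-by-callback'))

# At fit time, each factory is called with the learner object:
fake_learner = object()
cb = callback_fns[0](fake_learner)
print(cb.every, cb.name)  # epoch classifier-model-saved-by-callback
```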
Then I can fit one cycle as usual:
learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7))
In my “models” subfolder, I’ll get one backup file at the end of each epoch, named classifier-model-saved-by-callback_0.pth, classifier-model-saved-by-callback_1.pth, and so on.
I hope it’s less ambiguous.
Thanks for the response @Alexandre_DIEUL! Is what you’re saying applicable to fastai2? I know that you could pass start_epoch to
fit_one_cycle in fastai v1, but it doesn’t seem to be possible in v2: https://github.com/fastai/fastai2/blob/2c9cb629c436731352fce0ebfe3dcf3e6d6b87a3/fastai2/callback/schedule.py#L92
SaveModelCallback does work in fastai2 as well.
Well, I’m using the latest version of fastai, and I’m using this callback as we speak.
EDIT: I’m using the latest version of fastai v1.
Could you please tell me the command you used to install your version of fastai? Mine, which I got from github.com/fastai/fastai2, does not seem to support the
start_epoch argument, and my reading of the code suggests it doesn’t exist.
Of course. I use Colab; here are the cells used at the beginning of the notebook:
!curl -s https://course.fast.ai/setup/colab | bash
!pip install git+https://github.com/fastai/fastai --upgrade
!pip install git+https://github.com/fastai/fastprogress --upgrade
Tell me if you need more.
Yeah, so as I suspected, you’re not using
fastai2. You are using the public version of
fastai from github.com/fastai/fastai, not github.com/fastai/fastai2. My question (and this forum) is about the rewritten version, which is still in development.
@sgugger I’m also finding some difficulty with this. It wouldn’t fall under
pct_start, right, because that’s a different hyperparameter? How would we do this? (Or maybe it would, because of where we want to start; i.e., if we made it through 75% of my 12 epochs, fit for 4 epochs with
pct_start at .75?)
I’m not sure if I’m saying something wrong, but as I understand it,
.save and .load are used to save a model you will continue training, and then to continue training it.
So I guess there is no problem with the epoch you are at: you just load your last saved model and continue from there?
@tyoc213 - this has to do with an interrupted fit_one_cycle run.
fit_one_cycle follows a specific schedule for modifying the learning rate and momentum over the epochs. If it is interrupted, you can resume training (assuming you have the most recent model, perhaps because you’re using
SaveModelCallback), but not from where you left off: if you run
fit_one_cycle again, it will start a new schedule for lr and momentum.
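To see why, here is a simplified sketch of a one-cycle learning-rate schedule in plain Python (not fastai’s actual implementation; the default values roughly mirror fastai v2’s): the LR warms up over the first pct_start of the run and then anneals back down, so calling fit_one_cycle again always begins a fresh warm-up from a low LR.

```python
import math

def one_cycle_lr(frac, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Simplified one-cycle LR at training fraction `frac` in [0, 1].
    Cosine-interpolates lr_max/div -> lr_max over the first pct_start
    of training, then lr_max -> lr_max/div_final over the rest."""
    def cos_interp(start, end, p):
        return start + (end - start) * (1 - math.cos(math.pi * p)) / 2
    if frac < pct_start:
        return cos_interp(lr_max / div, lr_max, frac / pct_start)
    return cos_interp(lr_max, lr_max / div_final,
                      (frac - pct_start) / (1 - pct_start))

# Restarting fit_one_cycle resets frac to 0, so the LR drops back to
# lr_max/div and the whole warm-up/anneal cycle begins again.
print(one_cycle_lr(0.0, 1e-2))   # low warm-up LR, lr_max/25
print(one_cycle_lr(0.25, 1e-2))  # peak LR, lr_max
```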
In fastai v1, there was a way to achieve the resumption by passing start_epoch to
fit_one_cycle; it’s not clear how to do that in v2.
The functionality to resume a 1cycle training has not been ported to v2 yet.
What do I do if training with fit_one_cycle was interrupted halfway through?
class SkipToEpoch(Callback):
    def __init__(self, s_epoch): self.s_epoch = s_epoch
    def begin_train(self):
        if self.epoch < self.s_epoch: raise CancelEpochException
    def begin_validate(self):
        if self.epoch < self.s_epoch: raise CancelValidException
Then add it to your callbacks when fitting; of course, load the saved model first.
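To make the control flow concrete without a fastai install, here is a minimal stand-in training loop (all names hypothetical) showing how raising an exception in a pre-train hook skips the early epochs:

```python
class CancelEpochException(Exception):
    """Stand-in for fastai's control-flow exception."""

class SkipToEpoch:
    def __init__(self, s_epoch): self.s_epoch = s_epoch
    def begin_train(self, epoch):
        # Cancel any epoch before the one we want to resume at.
        if epoch < self.s_epoch: raise CancelEpochException

def fit(n_epochs, cb):
    trained = []
    for epoch in range(n_epochs):
        try:
            cb.begin_train(epoch)
        except CancelEpochException:
            continue  # skip this epoch entirely
        trained.append(epoch)  # stand-in for the real train/validate work
    return trained

print(fit(5, SkipToEpoch(3)))  # [3, 4]
```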
Hi, if I wanted to implement this functionality, where should I look? Does it have to do with ParamScheduler?
This did not work for me with fastai version 2.0.15
Do you know a tweak to make this work again?
Hi. Sorry for posting in an outdated topic, but it looks like the most appropriate place to discuss this. In case someone needs it, here is a version of the callback that works for me with fastai 2.6.3:
class SkipToEpoch(Callback):
    order = ProgressCallback.order + 1
    def __init__(self, epoch:int):
        self._skip_to = epoch
    def before_train(self):
        if self.epoch < self._skip_to:
            raise CancelTrainException
    def before_validate(self):
        if self.epoch < self._skip_to:
            raise CancelValidException
The only issue is that the table log in a Jupyter notebook will have empty rows for the skipped epochs.
Even though this is a simple callback, and I’ve looked at LR plots to make sure the learning rate is scheduled correctly, I still feel some “friction” resuming training with custom code. Should we add native support for resuming from a checkpoint?
I’m thinking along these lines:
- Add “epoch” to the saved checkpoint; optionally we can include “base_lr” as well. Or rather, save it as a separate learn_state.json to avoid breaking changes.
- Add a resume_from_ckpt argument to fit_one_cycle etc. We will then be able to resume training simply by calling:
# <define `learn` same way as for run that generated checkpoint>
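As an aside, if the state were saved as a separate learn_state.json as suggested above, it might be as small as the following sketch; the field names here are only a guess, not an agreed format:

```json
{
  "epoch": 25,
  "base_lr": 0.01
}
```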
I can start with a PR for this if it looks fine.