Hi, I’m trying to resume training with fit_one_cycle after a machine crash. I have SaveModelCallback with every_epoch=True, but I can’t figure out whether I can resume fit_one_cycle from a specific epoch. The start_epoch argument seems to have gone away. Has anyone figured out how to do this?
Let’s say your training was interrupted at epoch 25. Each epoch was saved by your callback, and since epochs are numbered from 0, not 1, your last backup’s name ends in _24.
Here is the command to resume at the next epoch:
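A sketch of the resume call, assuming fastai v1 (where fit_one_cycle still accepted start_epoch; it was removed in fastai2) and the checkpoint names produced by the callback below:

```python
# fastai v1 only: load the last checkpoint (epoch 24), then resume the
# same one-cycle run at epoch 25. `learn` is the already-defined Learner.
learn.load('classifier-model-saved-by-callback_24')
learn.fit_one_cycle(30, 1e-2, moms=(0.8, 0.7), start_epoch=25)
```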
Here is what I did to add my callback to my learner.
I defined my callback this way:
def addSaveCallbackClassifier(learner):
    # Add a save-model callback (fastai v1; assumes `from functools import partial`
    # and `from fastai import callbacks`)
    learner.callback_fns.append(partial(callbacks.SaveModelCallback, every='epoch',
                                        monitor='accuracy', name='classifier-model-saved-by-callback'))
Note: at this point in the code, my learner is already defined and is named learn.
Then I call the function I just defined:
addSaveCallbackClassifier(learn)
I can now fit one cycle:
learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7))
In my “models” subfolder, I get these backup files, one at the end of each epoch:
classifier-model-saved-by-callback_0.pth
classifier-model-saved-by-callback_1.pth
classifier-model-saved-by-callback_2.pth
…
until …
…
classifier-model-saved-by-callback_29.pth
Could you please tell me the command you used to install the version of fastai you have? Mine, which I got from github.com/fastai/fastai2, does not seem to support the start_epoch argument and my reading of the code suggests it doesn’t exist.
Yeah, so as I suspected, you’re not using fastai2. You are using the public version of fastai from github.com/fastai/fastai, not github.com/fastai/fastai2. My question (and this forum) is discussing the rewritten version, in development, fastai2.
@sgugger I’m also finding some difficulty with this. It wouldn’t fall under pct_start, right, because that’s a different hyperparameter. How would we do this? (Or maybe it would, because of where we want to start? I.e., if we made it through 75% of my 12 epochs, fit for 4 epochs with pct_start at 0.75?)
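For intuition about why a plain restart differs from resuming: fit_one_cycle’s schedule depends only on the fraction of the (new) run completed. A minimal pure-Python sketch of the one-cycle LR shape, with assumed defaults (an approximation, not fastai’s actual code):

```python
import math

def one_cycle_lr(frac, lr_max=1e-2, pct_start=0.25, div=25.0, final_div=1e5):
    """Approximate one-cycle LR at training fraction frac in [0, 1]:
    cosine warmup from lr_max/div to lr_max over pct_start of the run,
    then cosine annealing down to lr_max/final_div (a sketch, not fastai's code)."""
    def cos_interp(start, end, pos):
        return start + (end - start) * (1 - math.cos(math.pi * pos)) / 2
    if frac < pct_start:
        return cos_interp(lr_max / div, lr_max, frac / pct_start)
    return cos_interp(lr_max, lr_max / final_div, (frac - pct_start) / (1 - pct_start))

# 9 of 12 epochs done means frac = 0.75: the LR is already annealing down.
lr_before_crash = one_cycle_lr(0.75)
# Restarting fit_one_cycle for the remaining 3 epochs begins a fresh
# schedule at frac = 0, i.e. back in warmup at a much smaller LR:
lr_after_restart = one_cycle_lr(0.0)
assert lr_after_restart < lr_before_crash
```

So tweaking pct_start only changes where the peak falls in the new run; it can’t put you back at frac = 0.75 of the original schedule.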
I’m not sure if I’m saying something wrong, but as I understand it, .save and .load are used to save a model and then continue training it.
So I guess there is no problem with the epoch you’re at: you just load your last saved model and continue from there?
@tyoc213 - this has to do with an interrupted fit_one_cycle() run. fit_one_cycle follows a specific schedule for modifying learning rate and momentum over the different epochs. If it is interrupted, you can resume training (assuming you have the most recent model, perhaps because you’re using SaveModelCallback), but not from where you left off – if you run fit_one_cycle again, it will start a new schedule for lr and momentum.
In fastai v1, there was a way to achieve this resumption by passing start_epoch to fit_one_cycle; it’s not clear how to do that in fastai v2.
Hi. Sorry for posting in an outdated topic, but it looks like the most appropriate place to discuss related stuff. In case someone needs it, here is a version of the callback that works for me on fastai 2.6.3:
# Assumes the usual fastai star import (e.g. `from fastai.vision.all import *`),
# which provides Callback, ProgressCallback and CancelEpochException.
class SkipToEpoch(Callback):
    order = ProgressCallback.order + 1
    def __init__(self, epoch: int):
        self._skip_to = epoch
    def before_epoch(self):
        # Cancel epochs before the target so the run fast-forwards past them
        if self.epoch < self._skip_to:
            raise CancelEpochException
The only issue is that the table log in a Jupyter notebook will have empty rows for the skipped epochs.
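To see the effect of the callback without building a Learner, here is a dependency-free simulation of the skip logic (these classes are simplified stand-ins for fastai’s machinery, assuming CancelEpochException only aborts the current epoch):

```python
# Simplified stand-ins, not fastai's real Callback machinery.
class CancelEpochException(Exception):
    pass

class SkipToEpoch:
    def __init__(self, epoch):
        self._skip_to = epoch
    def before_epoch(self, epoch):
        if epoch < self._skip_to:
            raise CancelEpochException

def run(n_epoch, cb):
    executed = []
    for epoch in range(n_epoch):
        try:
            cb.before_epoch(epoch)
        except CancelEpochException:
            continue  # epoch body is skipped, but the epoch counter still advances
        executed.append(epoch)
    return executed

print(run(30, SkipToEpoch(25)))  # [25, 26, 27, 28, 29]
```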
Even though this is a simple callback, and I’ve looked at LR plots to make sure the learning rate is scheduled correctly, I still feel some “friction” resuming training with custom stuff. Should we add native support for resuming from a checkpoint?
I’m thinking along these lines:
- add “epoch” to the saved checkpoint; optionally we could include “base_lr” as well. Or rather save it as a separate learn_state.json to avoid breaking changes.
- add a resume_from_ckpt argument to fit, fit_one_cycle, etc., so we could resume training simply by calling:
# <define `learn` same way as for run that generated checkpoint>
learn.fit_one_cycle(..., resume_from_ckpt="model_24.pth")
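A sketch of what such a learn_state.json sidecar might contain and how the resume path could read it back; the file name and keys are hypothetical, not an existing fastai format:

```python
import json

# Hypothetical sidecar written next to model_24.pth when the checkpoint is saved:
with open("learn_state.json", "w") as f:
    json.dump({"epoch": 24, "base_lr": 1e-2}, f)

# At resume time, fit_one_cycle(..., resume_from_ckpt=...) could read it back
# and fast-forward the schedule to the next epoch:
with open("learn_state.json") as f:
    state = json.load(f)
resume_epoch = state["epoch"] + 1
print(resume_epoch)  # 25
```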