Resuming fit_one_cycle with model from SaveModelCallback?

Hi, I’m trying to resume training with fit_one_cycle after a machine crash. I have SaveModelCallback with every_epoch=True, but I can’t figure out whether fit_one_cycle can be resumed from a specific epoch. The start_epoch argument seems to have gone away. Has anyone figured out how to do this?

Thanks!

Hi,

Let’s say your command to fit one cycle is:

learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7))

Let’s say you interrupted your training at epoch 25. Each epoch was saved by your callback, so your last backup ends with XXX_24, as epochs are numbered from 0, not 1.

Here is your command to resume at the next epoch:

learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7), start_epoch=26)

For this to work, you have to make sure your last backup XXX_24 is still in your models folder. Don’t move it elsewhere.
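Putting the pieces together, here is a minimal sketch of the whole resume under fastai v1, assuming the learner has been recreated after the crash with the same SaveModelCallback attached (the checkpoint name XXX is just the placeholder from above):

learn = learn.load('XXX_24')  # reload the weights from the last completed epoch
learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7), start_epoch=26)  # replay the 1cycle schedule from there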

I hope it helps.
Let me know.

Greetings,
Alexandre.

4 Likes

Hi again,

Here is what I did to add my callback to my learner:

I defined my callback this way:

from functools import partial
from fastai import callbacks

def addSaveCallbackClassifier(learner):
    # Add a save-model callback that writes a checkpoint at the end of every epoch
    learner.callback_fns.append(partial(callbacks.SaveModelCallback, every='epoch', monitor='accuracy', name='classifier-model-saved-by-callback'))

Note: at this point in the code, my learner is already defined and is called learn.

Then I call the function I just defined:

addSaveCallbackClassifier(learn)

I can now fit one cycle:

learn.fit_one_cycle(30, 1e-2, moms=(0.8,0.7))

In my “models” subfolder, I’ll get these backup files, one saved at the end of each epoch:

classifier-model-saved-by-callback_0.pth
classifier-model-saved-by-callback_1.pth
classifier-model-saved-by-callback_2.pth

until …

classifier-model-saved-by-callback_29.pth
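As a side note, any of these checkpoints can also be reloaded by hand; learn.load takes the file name without the .pth extension and looks in that same subfolder:

learn = learn.load('classifier-model-saved-by-callback_24')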

I hope it’s less ambiguous.

Greetings,
Alexandre.

Thanks for the response @Alexandre_DIEUL! Is what you’re saying applicable to fastai2? I know that you could pass start_epoch to fit_one_cycle in fastai v1, but it doesn’t seem to be possible in v2: https://github.com/fastai/fastai2/blob/2c9cb629c436731352fce0ebfe3dcf3e6d6b87a3/fastai2/callback/schedule.py#L92

SaveModelCallback does work in fastai2 as well.

Well, I’m using the latest version of fastai.
And I’m using this callback as we speak :wink:

EDIT: I’m using the latest version of fastai v1.
Thanks indigoviolet

Could you please tell me the command you used to install the version of fastai you have? Mine, which I got from github.com/fastai/fastai2, does not seem to support the start_epoch argument and my reading of the code suggests it doesn’t exist.

Of course. I use Colab; here are the cells used at the beginning of the notebook:

!curl -s https://course.fast.ai/setup/colab | bash

pip install git+https://github.com/fastai/fastai --upgrade

pip install git+https://github.com/fastai/fastprogress --upgrade
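(A quick way to check which library those cells actually installed, since fastai v1 installs as the fastai package while the fastai2 repo installs as fastai2:)

import fastai
print(fastai.__version__)  # prints 1.x for fastai v1; for the rewrite you would import fastai2 instead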

Tell me if you need more.

Yeah, so as I suspected, you’re not using fastai2. You are using the public version of fastai from github.com/fastai/fastai, not github.com/fastai/fastai2. My question (and this forum) is about the rewritten, in-development version, fastai2.

@sgugger I’m also finding some difficulty with this. It wouldn’t fall under pct_start, right, because that’s a different hyperparameter. How would we do this? (Or maybe it would, because of where we want to start? I.e., if we made it through 75% of my 12 epochs, fit for the remaining 3 epochs with pct_start at 0.75?)

1 Like

My guess was that we’d want to make it possible to set pct_train here: https://github.com/fastai/fastai2/blob/2c9cb629c436731352fce0ebfe3dcf3e6d6b87a3/fastai2/callback/schedule.py#L72

I’m not sure if I’m getting something wrong, but as I understand it, .save and .load are used to save a model you want to keep training, and then to continue training it.

So I guess there is no problem with the epoch you are on: you just load your last saved model and continue from there?

@tyoc213 - this has to do with an interrupted fit_one_cycle() run. fit_one_cycle follows a specific schedule for modifying learning rate and momentum over the different epochs. If it is interrupted, you can resume training (assuming you have the most recent model, perhaps because you’re using SaveModelCallback), but not from where you left off – if you run fit_one_cycle again, it will start a new schedule for lr and momentum.
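For example (the checkpoint name is illustrative):

learn = learn.load('classifier-model-saved-by-callback_24')  # the weights load fine
learn.fit_one_cycle(30, 1e-2)  # but this starts a brand-new lr/momentum schedule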

In fastai v1, there was a way to achieve this resumption by passing start_epoch to fit_one_cycle; it’s not clear how to do that in fastai v2.

Good to know, thanks!

The functionality to resume a 1cycle training has not been ported to v2 yet.

2 Likes

What do I do if a training run with fit_one_cycle was interrupted halfway through?

Try this:

class SkipToEpoch(Callback):
    def __init__(self, s_epoch): self.s_epoch = s_epoch
    def begin_train(self):
        if self.epoch < self.s_epoch: raise CancelEpochException
    def begin_validate(self):
        if self.epoch < self.s_epoch: raise CancelValidException

Then add it to your callbacks, passing the same total number of epochs as the original run so the schedule lines up:
learn.fit_one_cycle(total_epochs, 3e-3, cbs=cbs+[SkipToEpoch(11)])

Of course, load the model before.
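For example, if the callback’s last save before the crash was model_10 (the name is illustrative):

learn = learn.load('model_10')  # weights from the last completed epoch, then fit as above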

4 Likes

Any update???

Hi, if I wanted to implement this functionality, where should I look? Does it have to do with the ParamScheduler?

This did not work for me with fastai version 2.0.15.
Do you know a tweak to make it work again?

2 Likes

Hi. Sorry for posting in an outdated topic, but it looks like the most appropriate place to discuss related stuff. In case someone needs it, here is a version of the callback that works for me with fastai 2.6.3:

from fastai.vision.all import *  # assumed star import; brings in Callback, ProgressCallback and CancelEpochException

class SkipToEpoch(Callback):
    # run after ProgressCallback, so skipped epochs still show in the log
    order = ProgressCallback.order + 1
    def __init__(self, epoch:int):
        self._skip_to = epoch

    def before_epoch(self):
        if self.epoch < self._skip_to:
            raise CancelEpochException

The only issue is that the table log in a Jupyter notebook will have empty rows for the skipped epochs.
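For what it’s worth, a hypothetical resume with this version, where the original 30-epoch run crashed after the checkpoint for epoch 24 was written (names and hyperparameters are illustrative), would look like:

learn = learn.load('model_24')                      # weights from the last completed epoch
learn.fit_one_cycle(30, 1e-2, cbs=SkipToEpoch(25))  # same schedule, epochs 0-24 skipped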

Even though this is a simple callback, and I’ve looked at LR plots to make sure the learning rate is scheduled correctly, I feel some “friction” resuming training with custom stuff. Should we add native support for resuming from a checkpoint?
I’m thinking along these lines:
add “epoch” to the saved checkpoint; optionally we could include “base_lr” as well. Or rather save it as a separate learn_state.json to avoid breaking changes.
add a resume_from_ckpt argument to fit, fit_one_cycle, etc. So we would be able to resume training simply by calling:

# <define `learn` same way as for run that generated checkpoint>
learn.fit_one_cycle(..., resume_from_ckpt="model_24.pth")

I can start with a PR for this if it looks fine :slight_smile: