learn.fit_fc() stops before completing the given number of epochs, without raising any error

Hi everyone,

I’m doing image classification with the cnn_learner method. Architecture: resnext50_32x4d; optimizer: Ranger by @LessW2020 (hi LessW2020), inspired by his amazing work on the ImageWoof challenges; plus the OverSamplingCallback from the official fastai repo.

I used the fit_fc() method from the official fastai repo, which implements the “flat cosine annealing” lr schedule. I keep running into the same issue, with various choices of total epochs and on different platforms (Google Colab, Kaggle): the notebook cell running learn.fit_fc() stops before completing the given total epochs, but no error is returned.

What can I do about this? Thanks for reading.

Hi, when I wrote fit_fc I noticed this issue occurring every once in a while. I believe it was the optimizer itself, but I’ll have to double check. Are you retraining the same learner when this happens? E.g. train one model for 20 epochs and then train it again? That’s the exact scenario where I saw some trouble.

Hi, I often have the issue with the first run of fit_fc. You might be right about the optimizer, as I have only used fit_fc with Ranger. Have you found any workaround?

Are you running lr_find() first? IIRC that’s how I got this bug to show up. There’s no reason it should be the scheduler (that I can think of).

If so, the workaround Less described is restarting the kernel after the lr_find. The bug comes from how the optimizer uses LookAhead.

Yeah, I also found that calling lr_find() first meant the learner didn’t reset its weights and started fitting with too large an lr in fit_fc, but that is nothing like the issue here. Also, the epoch at which it actually stops (9/20 in this case) is very random.

Could you try running it again with Adam, with the same steps (fit_fc)? I just want to be sure where the problem is first :slight_smile:

I did the test and can confirm that with the same total epochs (20) and the same other settings (run in the same notebook), fit_fc with AdamW runs into the same issue. Maybe you can check it out? Thank you.

Sure! I’ll take a look at it here in a minute. Thanks for investigating @MichaelScofield :slight_smile:

@MichaelScofield could you show exactly the steps you’re doing from start to finish? I just ran it start to finish for 20 and then 30 epochs and it worked fine (maybe share the kernel if you can).

Edit: AHA! I found the bug. It has to do with the callbacks. Let me see which one.

It’s the oversampling callback, and I know exactly why. fit_fc looks at the total number of training batches to determine when to start annealing, and oversampling changes that on the fly, causing the bug: it thinks it’s “finished” much earlier than it actually is.

Great to hear that. I suspected as much, since I added the oversampling callback because of class imbalance in my problem. How dumb of me not to include it :rofl:. But it was amazing how quickly you identified it. Do you have any suggestions for using this callback with fit_fc?

Considering they both start in on_train_begin, I’m not quite sure how that would work, my apologies :slight_smile: Perhaps try merging the two together, starting from the OverSampling callback source code?

I believe the current issue is that the end point is set first and the oversampling happens afterwards, so we finish much sooner than anticipated.

Perhaps look into doing the oversampling beforehand?

I will have a deeper look and compare fit_fc with LessW2020’s flattenAnneal function in his ImageWoof notebook, since I also ran a test with the latter and didn’t hit the bug. Anyway, thank you for your support @muellerzr.

They’re the same, as I wrote them :slight_smile: fit_fc simply makes the flattenAnneal callback easier to call. You can find flattenAnnealCosine in callbacks.py.

Ah. I didn’t know that you also wrote the flattenAnneal function in LessW2020’s notebook; I thought it wasn’t official.

Nope it is. At the time it was not (I was working on getting it merged). Also wrote the championship notebook :wink:

Back to the issue: I can try turning it into an actual callback and adjusting it to come up with something. Give me a few moments (a scheduler merged with flattenAnneal to get it working).

Definitely looking forward to it :grinning:

Hmmmm… pinging @ilovescience as this is your callback (IIRC). FlatCosAnnealing works by taking n, the number of training batches, to work out when to begin annealing before building the schedule:

n = len(self.data.train_dl)
anneal_start = int(n*tot_epochs*start_pct)
batch_finish = ((n * tot_epochs) - anneal_start)

Then the phases are added, and training ends earlier than expected because more batches are being added (due to oversampling). Any ideas on how to go about this?
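
To make the arithmetic above concrete, here is a minimal pure-Python sketch (no fastai calls). It assumes, as a simplification, that training stops as soon the precomputed schedule runs out of steps; the function names, the batch counts and the 0.72 default for start_pct are illustrative assumptions, not taken from the library source.

```python
import math

def flat_cos_lrs(n, tot_epochs, lr=1e-3, start_pct=0.72):
    """One learning rate per batch: flat first, then cosine annealing.

    n is the number of training batches per epoch measured BEFORE training.
    """
    total = n * tot_epochs
    anneal_start = int(total * start_pct)   # same formula as in the post above
    batch_finish = total - anneal_start
    flat = [lr] * anneal_start
    cos = [lr * (1 + math.cos(math.pi * i / batch_finish)) / 2
           for i in range(batch_finish)]
    return flat + cos

def completed_epochs(schedule_len, actual_batches_per_epoch):
    # Assumption: the run ends when the schedule has no steps left.
    return schedule_len // actual_batches_per_epoch

# Illustrative numbers: 100 batches per epoch measured up front,
# 220 batches per epoch once oversampling has replaced the train_dl.
schedule = flat_cos_lrs(n=100, tot_epochs=20)
print(len(schedule))                         # steps planned for 20 epochs
print(completed_epochs(len(schedule), 100))  # epochs finished without oversampling
print(completed_epochs(len(schedule), 220))  # epochs finished with oversampling
```

Under that assumption the schedule is sized for the original dataloader, so a longer oversampled epoch uses it up well before the requested epoch count.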

So right now, it is creating a new train_dl with the correct oversampled length, so I am not sure why the problem is arising. Could it be due to the order in which the callbacks are applied? If so, isn’t there an _order attribute to control this?
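
The ordering idea can be sketched with toy classes. Everything below is hypothetical stand-in code, not the fastai implementation: it only assumes that callbacks are sorted by an _order attribute before on_train_begin runs, and that the schedule length is read inside that hook. If the length is instead read before training starts, reordering callbacks would not help.

```python
class Callback:
    _order = 0  # lower values run first in this toy handler

    def on_train_begin(self, state):
        pass

class OverSample(Callback):
    """Stand-in for an oversampling callback that lengthens the train_dl."""
    _order = -10  # run before the scheduler

    def on_train_begin(self, state):
        state["n_batches"] *= 2

class LateOverSample(OverSample):
    _order = 10  # run after the scheduler, reproducing the mismatch

class FlatCosSchedule(Callback):
    """Stand-in scheduler that sizes itself from the current batch count."""

    def on_train_begin(self, state):
        state["schedule_len"] = state["n_batches"] * state["epochs"]

def run_train_begin(callbacks, n_batches, epochs):
    state = {"n_batches": n_batches, "epochs": epochs}
    for cb in sorted(callbacks, key=lambda cb: cb._order):
        cb.on_train_begin(state)
    return state

early = run_train_begin([FlatCosSchedule(), OverSample()], n_batches=100, epochs=20)
late = run_train_begin([FlatCosSchedule(), LateOverSample()], n_batches=100, epochs=20)
print(early["schedule_len"], early["n_batches"] * early["epochs"])
print(late["schedule_len"], late["n_batches"] * late["epochs"])
```

When the oversampling stand-in runs first, the schedule length matches the batches that will actually be seen; when it runs second, the schedule covers only half of them.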

Thanks! It’s working, I just need help with one thing: how do we access the number of epochs? (Is it self.n_epochs?)

Otherwise I believe I’ve got it figured out

Ok so you were able to figure it all out? Is a fix needed for OverSamplingCallback or for fit_fc?