learn.fit_fc() stops before completing the given number of epochs, without raising any error

MichaelScofield · October 27, 2019, 1:29pm

Hi everyone,

I’m doing an image classification with cnn_learner method, architecture: resnext50_32x4d, optimizer: Ranger from here by @LessW2020 (Hi LessW2020), inspired from LessW2020’s amazing work with ImageWoof challenges, and additional OverSamplingCallback from official fastai repo.

I used the fit_fc() method from the official fastai repo, which implements the “flat cosine annealing” lr scheduler. I keep running into the same issue, with various choices of total epochs and on different platforms (google colab, kaggle): the notebook cell calling learn.fit_fc() stops before completing the given number of epochs, but no error is returned.

What can I do about this? Thanks for reading.

muellerzr · October 27, 2019, 2:00pm

Hi, when I made fit_fc I noticed this issue occurring every once in a while. I believe that it was the optimizer itself but I’ll have to double check. Are you retraining the same learner when it has this issue? E.g. train one model for 20 epochs and then retrain again? That’s the exact scenario where I saw some troubles.

MichaelScofield · October 27, 2019, 2:06pm

Hi, I often have the issue with the first run of fit_fc. You might be right about the optimizer, as I have only used fit_fc with Ranger. Have you found any workaround?

muellerzr · October 27, 2019, 2:14pm

Are you running lr_find() first? IIRC that’s how I got this bug to show. There’s no reason it should be the scheduler (I can think of)

If so, the workaround Less described is restarting the kernel after the lr_find. The bug comes from how the optimizer uses LookAhead.

MichaelScofield · October 27, 2019, 2:27pm

Yeah, I also found an issue where calling lr_find() first means the learner doesn’t reset its weights and starts fitting with too big an lr in fit_fc, but nothing like the issue here. Also, the epoch at which it actually stops (9/20 in this case) is very random.

muellerzr · October 27, 2019, 2:29pm

Try running it again with Adam with the same steps for me (fit_fc)? Just want to be sure where the problem is first.

MichaelScofield · October 27, 2019, 11:49pm

I did the test and can confirm that with the same total epochs (20) and the same other settings (run in the same notebook), fit_fc with AdamW ran into the same issue. Maybe you can check it out? Thank you.

muellerzr · October 27, 2019, 11:52pm

Sure! I’ll give it a look here in a minute. Thanks for investigating @MichaelScofield

muellerzr · October 28, 2019, 12:36am

@MichaelScofield could you show exactly the steps you’re doing from start to finish? I just ran it start to finish for 20 and then 30 epochs and it worked fine. (Maybe share the kernel if you can.)

Edit: AHA! I found the bug. It has to do with the callbacks. Let me see which one

It’s the oversampling callback and I know exactly why. fit_fc looks at the total number of training batches to determine when to start annealing, and oversampling changes that number on the fly, causing the bug: the scheduler thinks it’s “finished” much earlier than it actually is.
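To make the mechanism concrete, here is a minimal toy sketch of it. It does not use fastai at all: `epochs_completed` is a hypothetical helper, and the batch counts are made-up numbers. It assumes the schedule gets a fixed step budget computed from the batch count before oversampling, and that training simply ends once that budget is used up.

```python
def epochs_completed(planned_batches, actual_batches, tot_epochs):
    """Count full epochs that run before a fixed step budget is used up."""
    # Step budget fixed up front from the pre-oversampling batch count.
    budget = planned_batches * tot_epochs
    steps = 0
    for epoch in range(tot_epochs):
        for _ in range(actual_batches):
            if steps >= budget:
                # Schedule exhausted: training ends here without an error.
                return epoch
            steps += 1
    return tot_epochs


print(epochs_completed(100, 100, 20))  # loader length matches the plan
print(epochs_completed(100, 220, 20))  # oversampled loader is 2.2x longer
```

With these illustrative numbers the first call completes all 20 epochs, while the second stops after 9 full epochs because the 2000-step budget runs out partway through the next one. The stopping epoch depends on how much longer the oversampled loader is, which would explain why it looks random from run to run.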

MichaelScofield · October 28, 2019, 1:04am

Great to hear that. I suspected as much, since I added the oversampling callback because of class imbalance in my problem; how dumb of me not to mention it. But how quickly you identified it was amazing. So do you have any suggestions for using this callback with fit_fc?

muellerzr · October 28, 2019, 1:14am

Considering they both begin in on_train_begin, I’m not quite sure how that would wind up working, my apologies. Perhaps try merging the two together, starting from the OverSampling callback source code?

The current issue, I believe, is that the end point is being set first, then the oversampling occurs, and so we finish much sooner than anticipated.

Perhaps look into doing the oversampling beforehand?

MichaelScofield · October 28, 2019, 2:14am

I will have a deeper look and compare fit_fc and LessW2020’s flattenAnneal function in his ImageWoof notebook since I also did a test on the latter and didn’t find the bug. Anyway, thank you for your support @muellerzr.

muellerzr · October 28, 2019, 2:17am

They’re the same, as I wrote them. fit_fc simply makes it easier to call the flattenAnneal callback. You can find flattenAnnealCosine in the callbacks.py

MichaelScofield · October 28, 2019, 2:21am

Ah. Didn’t know that you also wrote the flattenAnneal function in LessW2020’s notebook, I thought that it wasn’t official.

muellerzr · October 28, 2019, 2:23am

Nope it is. At the time it was not (I was working on getting it merged). Also wrote the championship notebook

Back to the issue, I can try to modify the callback into an actual callback and adjust to try to come up with something. Give me a few moments (a merged scheduler with flattenAnneal to get it working)

MichaelScofield · October 28, 2019, 2:35am

Definitely looking forward to it

muellerzr · October 28, 2019, 2:47am

Hmmmm… pinging @ilovescience as this is your callback (IIRC). FlatCosAnnealing works by taking an n to begin annealing before making the schedule:

n = len(self.data.train_dl)
anneal_start = int(n*tot_epochs*start_pct)
batch_finish = ((n * tot_epochs) - anneal_start)

Then the phases are added and it’s ending earlier than expected because there’s more being added (due to oversampling). Ideas on how to go about this?
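As a rough illustration of what those three lines feed into, the sketch below turns them into a per-iteration learning rate: flat until `anneal_start`, then cosine-annealed to zero over `batch_finish` steps. This is a plain-Python approximation under those assumptions, not the fastai implementation; `flat_cos_lr` is a hypothetical name, and `n` is passed in explicitly rather than read from a learner.

```python
import math


def flat_cos_lr(i, n, tot_epochs, lr, start_pct):
    """Learning rate at iteration i for a flat-then-cosine schedule."""
    # Same arithmetic as the snippet above, with n given by the caller.
    anneal_start = int(n * tot_epochs * start_pct)
    batch_finish = (n * tot_epochs) - anneal_start
    if i < anneal_start:
        return lr  # flat phase
    # Fraction of the annealing phase completed, clamped to [0, 1].
    pct = min((i - anneal_start) / max(batch_finish, 1), 1.0)
    return lr * (1 + math.cos(math.pi * pct)) / 2


# Illustrative values only: 100 batches per epoch, 20 epochs, anneal from 50%.
for i in (0, 999, 1500, 2000):
    print(i, flat_cos_lr(i, n=100, tot_epochs=20, lr=1e-3, start_pct=0.5))
```

The schedule is only correct if `n` is the number of batches the training loop will really iterate over. If `n` is read before the oversampled train_dl is in place, both `anneal_start` and the end of the schedule land too early.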

ilovescience · October 28, 2019, 2:51am

So right now, it is creating a new train_dl with the correct oversampled length. So I am not sure why the problem is arising. Could it be due to the order in which the callbacks are applied? If so, isn’t there an _order attribute to control this?
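The ordering idea can be sketched with a toy model. Everything here is hypothetical: `ToyOverSampler`, `ToyScheduler`, and `run_on_train_begin` are stand-ins rather than fastai classes, and the sketch assumes a handler that sorts callbacks by an `_order` attribute before calling `on_train_begin`. It also assumes the schedule length is computed inside `on_train_begin`; if it is computed before training starts, ordering alone would not help.

```python
class ToyOverSampler:
    _order = -10  # lower value runs first in this toy handler

    def on_train_begin(self, state):
        # Stand-in for swapping in a longer, oversampled train_dl.
        state["n_batches"] *= 2


class ToyScheduler:
    _order = 0

    def on_train_begin(self, state):
        # Plan the schedule from whatever batch count is visible now.
        state["planned_steps"] = state["n_batches"] * state["epochs"]


def run_on_train_begin(callbacks, state):
    """Call on_train_begin on each callback, sorted by _order (default 0)."""
    for cb in sorted(callbacks, key=lambda cb: getattr(cb, "_order", 0)):
        cb.on_train_begin(state)
    return state


state = run_on_train_begin(
    [ToyScheduler(), ToyOverSampler()],  # list order is deliberately "wrong"
    {"n_batches": 100, "epochs": 20},
)
print(state["planned_steps"], state["n_batches"] * state["epochs"])
```

Because the oversampler has the lower `_order`, the scheduler sees the doubled batch count regardless of the order in which the callbacks were passed in, so the planned and actual step counts agree.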

muellerzr · October 28, 2019, 3:01am

Thanks! ~~It’s working I just need help with one thing. How do we access the number of epochs? (is it self.n_epochs)?~~

Otherwise I believe I’ve got it figured out

ilovescience · October 28, 2019, 3:10am

Ok so you were able to figure it all out? Is a fix needed for OverSamplingCallback or for fit_fc?