Lesson 1: High error_rate for Resnet50

hamede · February 12, 2019, 5:08pm

Hi

I recently ran the Lesson 1 pets notebook, and when training the Resnet50 model I get very high error rates; see the screenshot below.

Does anybody have an idea why this may be happening?

bam098 · February 12, 2019, 5:21pm

Hi,

I have the same problem. I have also seen rather strange plots after running the learning rate finder for Resnet34:

I assume that the default learning rate is already quite good? Did the default learning rate change compared to the version used in the in-person class? I am not sure whether these two issues are related.

Does anyone have an idea what the problem is?

Update:
I installed fastai via Anaconda, which turned out to be version 1.0.43. I then ran the lesson 1 notebook with fastai version 1.0.42 to see whether there were any differences. It turns out that the problems mentioned above disappeared.

The plot after running the learning rate finder looks like this now (Resnet34).

Training with Resnet50 looks like this now.

So it seems this change was introduced in fastai version 1.0.43. I don’t know the fastai library well yet, so I can’t really tell whether this is a bug or expected behaviour that now requires, e.g., changing the hyper-parameters. I assume it might be the latter. I will try to figure out what exactly caused this, although I am not sure I can with my limited knowledge of fastai at the moment. Let’s see.

tschoy · February 12, 2019, 7:24pm

I have a similar problem with the unet example. If I redefine learn after learn.recorder.plot(), it trains just like in older versions.
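The workaround described in this post (build a new learner after using the LR finder) can be sketched roughly as follows. This is only an illustration: `find_lr_then_train` and `make_learner` are hypothetical names, and `make_learner` is assumed to wrap whatever call your notebook uses to construct the learner. The only fastai calls used are `lr_find`, `recorder.plot` and `fit_one_cycle`.

```python
def find_lr_then_train(make_learner, epochs=5, max_lr=None):
    """Run the LR finder on a throwaway learner, then train a fresh one.

    make_learner is a hypothetical zero-argument factory that returns a new
    learner each time it is called (assumption, not part of fastai).
    """
    probe = make_learner()
    probe.lr_find()          # run the learning rate finder
    probe.recorder.plot()    # plot loss against learning rate

    # Re-create the learner so nothing left over from lr_find affects training.
    learn = make_learner()
    if max_lr is None:
        learn.fit_one_cycle(epochs)                 # library default max_lr
    else:
        learn.fit_one_cycle(epochs, max_lr=max_lr)
    return learn
```

In a notebook this amounts to re-running the cell that defines learn after the plot cell and before the fit cell.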

bam098 · February 12, 2019, 9:01pm

@tschoy Thank you for the hint. This also fixed the problem in the lesson 1 notebook. I just wonder why. Is it a bug, or is this the required way from version 1.0.43 onward?

tschoy · February 12, 2019, 10:36pm

It seems that in v1.0.43, lr_find and plot now add a red dot at the point where the slope is most negative. It may be related to the issue, but I’m not sure.
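To make "most negative slope" concrete, here is a small pure-Python sketch of the idea. It is not fastai's own code, and the library may compute the point differently; `steepest_descent_lr` is a hypothetical helper and the sample numbers are made up.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which loss falls fastest per unit log10(lr)."""
    if len(lrs) != len(losses) or len(lrs) < 2:
        raise ValueError("need at least two matching (lr, loss) points")
    best_lr, best_slope = None, float("inf")
    for i in range(1, len(lrs)):
        # Finite-difference slope on a log-scaled learning rate axis.
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_slope, best_lr = slope, lrs[i]
    return best_lr

# Made-up LR finder readings, for illustration only.
sample_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
sample_losses = [3.0, 2.9, 2.0, 1.8, 4.0]
print(steepest_descent_lr(sample_lrs, sample_losses))
```

With these sample values the sharpest drop is between 1e-4 and 1e-3, so the helper returns the learning rate at the end of that step.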

bam098 · February 12, 2019, 11:03pm

It seems like it was just fixed:

It works for me now without redefining learn.

Thanks

I’m not sure I understand it correctly, but I think that when the LRFinder finished, it loaded the model but also purged the learner, which we still need for fitting the model. Now the learner is no longer purged, and so it works.

sgugger · February 12, 2019, 11:19pm

It’s more like there is a bug in purge, which I’ll try to fix tomorrow.

bam098 · February 12, 2019, 11:23pm

@sgugger Ah, I see. I was looking into the code, but there are many things I don’t understand yet. I just started the course. Thanks for looking into the issue.

EinAeffchen · February 13, 2019, 11:18am

I have the exact same problem, but after I downgraded to 1.0.42 I suddenly get
AttributeError: "NoneType" object has no attribute "group"
when trying to create the databunch object

Edit: After uninstalling 1.0.42 completely and reinstalling 1.0.43 with conda, instead of with pip as I did before, now everything seems to work as it should.

hamede · February 13, 2019, 3:44pm

I am not sure whether the fastai library has been updated to fix this. Does running git pull within the fastai directory update it? I tried, but I still get the same results.

Instead, I took the hint from the discussion above and re-initialized the learner after running lr_find, and then ran the training again. This pretty much solved it! Thanks!
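When it is unclear whether an update (via pip, conda or a git checkout) actually took effect, one way to check is to look at what Python really imports. The sketch below uses only the standard library; `describe_install` is a hypothetical helper, and it assumes the package exposes a `__version__` attribute, falling back to None if it does not.

```python
import importlib

def describe_install(package="fastai"):
    """Return (version, file path) of the package Python imports, or None if absent."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return None
    # __file__ shows which copy is loaded (conda env, pip site-packages, or a source checkout).
    return getattr(module, "__version__", None), getattr(module, "__file__", None)

if __name__ == "__main__":
    print(describe_install("fastai"))
```

If the printed path points somewhere other than the directory you updated, the notebook is still using a different copy of the library. Remember to restart the notebook kernel after reinstalling.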

bam098 · February 13, 2019, 4:20pm

I built from source yesterday from this commit: https://github.com/fastai/fastai/commit/34499e1b8e12d3731f44a3e220134966bf944918

As far as I understand it, this commit doesn’t fix the bug, but it makes the lesson 1 notebook work. I assume that is because purge (where the bug apparently is) is set to False.

Well, after building from source, I installed it via pip and then it worked for me without re-initializing the learner. Did you do the same and it didn’t work? Or did you install it in a different way?

Anshuman · February 7, 2020, 2:00am

Hi,

I am getting the same problem. The error_rate is too high, and it stays high even after redefining learn after learn.recorder.plot(). Is there anything else I can try to fix it?

I also have one question: for ResNet50, the lr_find graph shows the loss decreasing in the range 1e-03 to 1e-01. So should the max_lr in fit_one_cycle be in the range 1e-03 to 1e-01 in order to reduce the loss? Am I correct? I cannot check this because of the high error_rate.
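As a rough illustration of reading such a range off the LR finder curve, the sketch below finds the stretch of learning rates over which the loss keeps falling until it reaches its minimum. `descending_range` is a hypothetical helper, the readings are made up, and taking a value about ten times below the minimum-loss learning rate is a commonly used rule of thumb rather than something established in this thread.

```python
def descending_range(lrs, losses):
    """Return (low_lr, high_lr) bounding the run of falling loss that ends at the minimum."""
    if len(lrs) != len(losses) or not lrs:
        raise ValueError("need matching, non-empty lr and loss lists")
    # Index of the lowest recorded loss.
    end = min(range(len(losses)), key=losses.__getitem__)
    # Walk backwards while the loss was still strictly decreasing.
    start = end
    while start > 0 and losses[start - 1] > losses[start]:
        start -= 1
    return lrs[start], lrs[end]

# Made-up LR finder readings, for illustration only.
demo_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
demo_losses = [2.5, 2.6, 2.6, 1.6, 1.1, 5.0]
low, high = descending_range(demo_lrs, demo_losses)
candidate = high / 10  # heuristic: stay well below the LR where loss bottoms out
print(low, high, candidate)
```

A value picked this way could then be passed as max_lr to fit_one_cycle. The upper end of the range sits right where the loss is about to blow up, so it is usually a risky choice.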