Using OptimWrapper gives me higher loss

I came across OptimWrapper while slowly working through @muellerzr’s PyTorch-to-fastai tutorial. Does it do anything other than delegate calls to the PyTorch optimizer it wraps? I’m trying to replace the code from Jeremy’s PyTorch tutorial, and I get weird behavior when comparing learn.fit(2) to running fit() (the manual training loop from the tutorial):

  • If I first call learn.fit(2), I get a loss of about 2.2 (the same as in the first tutorial), and the loss stays in that region even if I run fit() later.
  • If I first call fit(), I get a loss of about 0.2 (an order of magnitude lower), which stays in that region even after running learn.fit(2).
  • If I use learn.fit(2) but with a fastai.optimizer.SGD, I also get a loss of about 0.2.

This makes me think that perhaps I’m misunderstanding OptimWrapper. Any ideas?

The code I’m running can be found here (the interesting comparisons are in the last three cells, but the setup code is there too): https://colab.research.google.com/drive/1gbTysz2FISa5mv6dnyHHixl1sw1by3bN?usp=sharing
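In short, the comparison boils down to something like this (a sketch, not the literal notebook code: dls, model, loss_func, opt_func and the manual fit() loop all come from the tutorials / the notebook above):

learn = Learner(dls, model, loss_func=loss_func, opt_func=opt_func)  # opt_func wraps torch.optim.SGD via OptimWrapper

learn.fit(2)   # fastai loop: loss settles around 2.2
fit()          # manual PyTorch loop from Jeremy's tutorial: loss drops to about 0.2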

More debugging has revealed the source of my problem: the learning rate given to learner.fit() overrides the one stored by the optimizer. I was passing None to fit and expecting the learning rate the optimizer was initialized with to be enough.

Is this the expected behavior? Sounds like I’m missing something.

Further notes: the lr parameter cannot be omitted when initializing a PyTorch SGD, but its value is never actually used. For other parameters, however, e.g. momentum, specifying them through the OptimWrapper (and not through the fit method) seems to be the only option. That seems pretty inconsistent, and I thought this “migrate from PyTorch” use case was pretty standard.
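To make that concrete, a hypothetical illustration (assuming learn was built with an OptimWrapper around torch.optim.SGD(..., lr=0.1, momentum=0.9)):

learn.fit(2)           # lr defaults to None, so fit falls back to learn.lr (1e-3); SGD's lr=0.1 is ignored
learn.fit(2, lr=0.1)   # an lr passed explicitly to fit *is* used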

What version of fastai are you using? With fastai 2.3, you can just do

opt_func = partial(OptimWrapper, opt=torch.optim.SGD)

and you don’t need to pass the learning rate parameter.
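A fuller sketch of that pattern (the dls, model and loss_func here are placeholders, not from this thread):

from functools import partial

import torch
from fastai.learner import Learner
from fastai.optimizer import OptimWrapper

# fastai builds the torch optimizer itself, so no lr has to be handed to
# torch.optim.SGD at all; hyperparameters come from the fastai side instead.
opt_func = partial(OptimWrapper, opt=torch.optim.SGD)
learn = Learner(dls, model, loss_func=loss_func, opt_func=opt_func)
learn.fit(2, lr=0.1)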


Well, when I wrote it I was using 2.2.6, but the API did change in 2.3, and that does seem like the correct way now.
I’m still not very happy with how easy it is to misuse it, and with the fact that the linked tutorial hides behind the scenes that the supplied LR is never used, but your answer now seems to be the standard way according to the docs as well.

Thanks!

I guess the idea is that if something is an argument to Learner, then that is what gets used rather than the optimizer’s argument. This includes the learning rate and momentum arguments. Is there somewhere specific you think could be improved in terms of documentation?

Well, what I needed was a minimal, official “migrating from PyTorch” guide. One option is to update the migrating-from-PyTorch guide in the docs to show the use of a PyTorch optimizer (a minimal example should not force a switch to a fast.ai optimizer, as it currently does), but the “migrating_pytorch” module used there no longer exists.

Another option is to update the guide I linked to in the first post, but that is not an official source.


Over time that tutorial will be in the documentation (I’ll be putting in that PR when I can). In the meantime I can update that article, as ideally it is the “official” migration guide for right now.


Any idea when that is going to happen? As far as I can see, the loss in the guide is currently still ~2, which means the learning rate passed to SGD is not being used.

From my experiments, the LR passed via OptimWrapper is not used when training the model. Training uses the default LR inside Learner, which is 0.001.

Can anyone verify this?

@yiftachbeer From your code, even though you passed lr=0.1 via OptimWrapper, the Learner uses lr=0.001. My temporary workaround is to pass the LR directly to Learner.

Modified notebook here.
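Roughly, the workaround looks like this (fastai ≥ 2.3 syntax; the surrounding setup is assumed):

from functools import partial

import torch
from fastai.learner import Learner
from fastai.optimizer import OptimWrapper

# Give the lr to Learner itself rather than relying on the value baked
# into the wrapped torch optimizer.
learn = Learner(dls, model, loss_func=loss_func,
                opt_func=partial(OptimWrapper, opt=torch.optim.SGD),
                lr=0.1)
learn.fit(2)  # now trains with lr=0.1 instead of the Learner default of 1e-3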

I completely agree. I originally thought it was a genuine bug, but then realized this is just us using the library in a way that wasn’t intended (OptimWrapper is kind of a patch), so this thread has become more about: if, despite fastai having its own optimizers, the learning rate is treated as a property of the Learner (or of a specific fit), that behavior should at least be clarified in the docs.