New AdamW optimizer now available

BTW @anandsaha I feel like the adjustment of weight decay based on # epochs isn’t actually helpful, since I tend to interactively decide how many more epochs to run, as I go. What do you think? Maybe this should be pulled out into a separate option that can be enabled with a param?

Jeremy, I tried it on plant-seedlings and got slightly worse results as well, though without any intense tuning:

resnet_50_224 0.976985454158
resnet_50_224_wd 0.975192157381

I tried it on dog breeds and it was also a little worse…?

I tried on plant seeds, got worse results and slower convergence.

I tested it as well and am similarly getting slightly worse results.

However, my feeling is that we aren’t yet taking full advantage of it because currently choosing the weight_decay values is a blind guessing game. Perhaps wd_find() or something like that might help?

Thanks all for trying it out! Seems we have to figure out how to best use this technique at large.

My observation was that the technique works better at recovering from overfitting than at preventing it. But more trials like these will tell us what actually works (the results so far are not so encouraging).

Also, I am thinking, we can tinker with 3 aspects of the paper separately:

  1. Detaching the weight regularization from gradient update
  2. Decaying the weight regularizing factor
  3. Setting the schedule multiplier

For example, switch off (2) and (3) and see how (1) performs alone. I also need to broaden my testing, so I will take up some of the Kaggle challenges mentioned above.
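To make aspect (1) concrete, here is a minimal sketch of a single Adam step on one scalar weight, with the weight penalty either folded into the gradient (classic L2) or applied directly to the weight (decoupled). This is not the fastai implementation: adam_step and new_state are hypothetical names, and scaling the decoupled decay by lr is an assumption (the same convention as torch.optim.AdamW), whereas the paper scales it by the schedule multiplier.

```python
import math

def new_state():
    # Hypothetical helper: fresh Adam state for one scalar weight.
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              wd=0.0, decoupled=False):
    """One Adam update on a scalar weight; returns the new weight."""
    b1, b2 = betas
    if not decoupled:
        # Classic L2: the penalty gradient wd * w is added to the loss
        # gradient, so it passes through the momentum and g^2 estimates.
        grad = grad + wd * w
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected momentum
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected g^2 term
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled: shrink the weight directly, bypassing m and v.
        # Scaling by lr is an assumption of this sketch.
        w_new -= lr * wd * w
    return w_new

# With a zero loss gradient the two variants behave very differently.
w_l2 = adam_step(1.0, 0.0, new_state(), lr=0.1, wd=0.1, decoupled=False)
w_dec = adam_step(1.0, 0.0, new_state(), lr=0.1, wd=0.1, decoupled=True)
```

In this toy case the L2 variant moves the weight by roughly lr whatever the value of wd, because the g^2 normalization rescales the penalty, while the decoupled variant shrinks the weight by exactly lr * wd * w.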

Sure, I guess until we have a good understanding of how to use this technique, we should label it as experimental. I believe what you are saying is that we remove it from the fit() function and provide a switch like learner.use_wd_sched = True? Or we can generalize this and have a mechanism in fastai for experimental features that have to be deliberately enabled. What do you say?

That’s a great idea! I will give this a shot.

I am not sure that this is possible. Maybe I am not seeing something, but it seems to me that the only way we can tell whether we used a better or worse amount of regularization vs the previous run is to look at the losses once our model has been fully trained. Not sure if there is any info available to us earlier or any experiment we could perform (like with the lr_finder) to get a better idea of what the okay amount might be.

Yes, you are right. The approach taken with lr_finder may not work. For the wd case, we may want to find out, for example, which of n candidate wds works best before venturing into full training. Also, the criterion here is not the loss per se, but the correlation between train loss and val loss.
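One possible reading of that idea, as a rough sketch only: run a short trial for each candidate wd, record the train and val loss curves, and rank the candidates by how closely the val loss tracks the train loss. pearson and rank_wds are hypothetical helpers, not fastai functions, and the loss values below are made up purely to exercise the code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_wds(curves):
    """curves maps wd -> (train_losses, val_losses) from a short trial run.

    Returns the candidate wds ordered from highest to lowest
    train/val correlation.
    """
    scored = [(pearson(tr, va), wd) for wd, (tr, va) in curves.items()]
    return [wd for _, wd in sorted(scored, reverse=True)]

# Made-up curves, for illustration only.
trial_curves = {
    1e-4: ([1.0, 0.8, 0.6, 0.4], [1.0, 0.9, 1.0, 1.1]),    # val drifts up
    1e-2: ([1.0, 0.85, 0.7, 0.6], [1.05, 0.9, 0.78, 0.7]),  # val follows train
}
ranking = rank_wds(trial_curves)
```

Whether a short trial predicts the outcome of full training is exactly the open question raised above, so this is at most a way to structure the experiment.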

Thanks for the food for thought. I will tinker with this idea a bit to see what I get.

@anandsaha, just to add to others’ input, I also played a bit with it and got worse performance.

My impression, in case this helps, is that the new version is more prone to overfitting. So it somehow “tapers” the regularization effect of wd. I don’t know if this behaviour is a hint about what is going on…

Would be great to find the key… Anyway, it was an awesome experience trying and testing an experimental tool, thanks to you! :grinning:

Actually all I was saying was what you said (better) in the previous section: we should have a way to try each of the three things separately. Specifically, I’m most interested in (1), decoupling regularization from the momentum and g^2 terms in optimizers.

Aah ok, got it!

Hi @anandsaha,

I get the error message AttributeError: 'NoneType' object has no attribute 'on_train_begin' when I run :

Any thoughts about the reason(s) ?

There is a bug in the current version of the file learner.py (might be my fault).
In the fit_generator function find the line

elif use_clr is not None:

and replace elif with if.
I just created a PR to correct this in the library.

Thanks for letting me know - merged now.

BTW since this thread is popping up again, it would be a good time to mention that no-one AFAIK has actually gotten this working right yet. Would be a great project for someone to figure out how to actually get better results training a real model using use_wd_sched.

I’ll have to read the article, but from my first experiments with the 1cycle policy and Adam, it absolutely won’t work without setting use_wd_sched=True.

OK but let’s forget Adam for a moment - what about just using it for momentum? (And note that use_wd_sched doesn’t do anything except do “proper weight decay” - the other stuff in the paper was split out by @anandsaha into other params.)

Sorry @sgugger, I just realized I misread your “without” as “with”. So my reply only partly makes sense… :wink:

@jeremy I was watching the lesson 5 video, in which AdamW is discussed, and it prompted me to try it. I got worse results on MovieLens, as others have reported. I did a little research and found this February 2018 paper from the originators of AdamW: https://arxiv.org/pdf/1711.05101.pdf

In it they propose weight decay normalization, i.e. shrinking weight decay as training progresses. Unfortunately, I don’t believe that changing the weight decay per epoch is currently supported in fastai. If it were, I believe the following function would return the normalized weight decay per epoch according to the formula in the paper: https://pastebin.com/hDqcM6ek
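For reference, a minimal sketch of the normalization as I read it in the paper: wd = wd_norm * sqrt(b / (B * T)), where b is the batch size, B the number of training points, and T the number of epochs in the current run (or restart cycle). This is not the contents of the pastebin link; normalized_wd and its parameter names are hypothetical and not part of fastai.

```python
import math

def normalized_wd(wd_norm, batch_size, n_train, n_epochs):
    """Raw weight decay from a normalized value.

    wd_norm:    normalized weight decay (the hyperparameter being tuned)
    batch_size: b, the number of samples per batch
    n_train:    B, the number of training points
    n_epochs:   T, the number of epochs in the current run or cycle
    """
    return wd_norm * math.sqrt(batch_size / (n_train * n_epochs))

# With restart cycles of growing length, the raw weight decay shrinks
# from one cycle to the next. The cycle lengths here are illustrative.
cycle_lengths = [1, 2, 4]
wds = [normalized_wd(1.0, 100, 10000, t) for t in cycle_lengths]
```

Under this reading the value changes per cycle rather than per epoch within a cycle, so it would only need to be recomputed whenever the cycle length changes.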

If you have any tips on how I would be able to modify the fast.ai library in order to make this parameter available to model fitting, I would give it a shot. It wasn’t immediately obvious to me.

@zachl would be interested to hear what you find - you can probably hack something in easily enough using this new functionality from @sgugger : https://github.com/fastai/fastai/blob/master/courses/dl2/training_phase.ipynb .

I suspect that momentum (beta1) shrinking might work better still. Perhaps if you do some experiments you could share your notebook as a gist?

BTW MovieLens may be a less good fit for this than problems that require deeper resnet-style models.

(These are all guesses - little research has been done in this area, so even negative results are very useful.)

That does look like it will work! I will start working on it.

What is the formula for momentum shrinking? I wasn’t able to find a description of it via google but would be interested to try it and compare to the suggestion in the paper.