AdamW and Super-convergence blog

I had some questions about the recent fast.ai blog post, “AdamW and Super-convergence is now the fastest way to train neural nets”, and in the absence of a comments topic this Deep Learning one seemed most appropriate.

Mostly I am curious about the conclusion that amsgrad is just noise. @sgugger it appears that this is true for the image classification tasks, but for the NLP tasks it seems like there was a substantial improvement.

Would you mind elaborating? I am also curious if you did any comparisons with vanilla SGD.

For the NLP tasks, the metric used is perplexity, which is the exponential of the validation/test loss. So lower is better, and this is where amsgrad actually hurts training the most (there is a substantial spike in perplexity).
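For reference, the conversion is just the exponential of the average per-token cross-entropy; a tiny sketch (the loss value here is made up):

```python
import math

val_loss = 4.6                   # example: average per-token cross-entropy on the validation set
perplexity = math.exp(val_loss)  # ~99.5 -- lower perplexity means a lower loss
print(perplexity)
```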

For the tests with SGD, I need 150 epochs to get to the same perplexities on wikitext-2 (I had posted a notebook about this earlier in this thread), but someone may find a better set of hyper-parameters that gets there as fast as AdamW.
On CIFAR-10, SGD gets to the same accuracy in 30 epochs, but needs 20 epochs to reliably get to >94% with TTA (instead of 18 for AdamW). I haven’t tried vanilla SGD on Stanford cars, so I have no comparison for that third task.


Different question but related to super-convergence.

I see it achieves great results when training models from scratch with no pretrained weights. My question is: should/can we use super-convergence when we have pretrained weights, i.e. in the case of fine-tuning and/or transfer learning, or are CLR or Cosine Annealing more practical in those cases?

You should always give it a try :wink:
The Stanford cars task mentioned in the blog post is a fine-tuning task and 1cycle did great. On the planets dataset, I also get better results with 1cycle than with Cosine Annealing. However, the gain should be less impressive compared to CLR, since 1cycle is almost the same thing.

Aha silly mistake on my part. I’ll see what I can do with SGD, I had very good luck on the Rakuten challenge :slight_smile:

Hi @sgugger

Thanks for the great blog post. A couple questions about its implementation in fastai if you don’t mind:

The following line from the Stepper class is confusing me.

p.data = p.data.add(-wd * lr, p.data)

It’s as if two things are being added to p.data: the first being -wd * lr and the second being p.data itself. Also, p.data represents the parameter weights, right? If so, why wouldn’t the formula be:

p.data.add_(-wd * lr * p.data)

…since this aligns with the following formula from the blog post:

w = w - lr * w.grad - lr * wd * w

I can’t even get the p.data.add(-wd * lr, p.data) paradigm to work in pytorch. In the code below, -wd * lr is replaced by y and p.data by x:

x = T(5.0)
y = T(1.0)
x = x.add(y, x)

pytorch throws a TypeError, but this is essentially the same thing that’s happening in Stepper, right?

Thanks for any insight you can share.

That thing also confused me at first. It turns out it is just an overloaded pytorch function; take a look at https://pytorch.org/docs/stable/torch.html#torch.add. If the first argument is a scalar, the output is computed as input + scalar * other, so it is computationally equivalent.

x = T(5.0)
x.add(2, x) # 15

Exactly. And the code

x = T(5.0)
y = T(1.0)
x = x.add(y, x)

throws an error because y isn’t a scalar.

I don’t know if it’s faster; I copied the way the update is done at the end of the SGD optimizer in the pytorch source code.
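For anyone checking this on a more recent PyTorch release, the positional-scalar overload has since been deprecated in favour of the alpha keyword, but it expresses the same update; a quick sanity check (the numbers are made up):

```python
import torch

lr, wd = 1e-2, 1e-4
p = torch.randn(3)

# the Stepper-style update p.add(-wd * lr, p), written with the alpha keyword
a = p.add(p, alpha=-wd * lr)

# the explicit form from the blog post: w - lr * wd * w
b = p - lr * wd * p

print(torch.allclose(a, b))  # True
```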


Got it. Thanks for the explanation on that.

So if we create a pytorch optimizer without using fastai and pass a non-zero value to the weight_decay argument (e.g. optim.Adam(m.parameters(), lr = 1e-3, weight_decay = 5e-4)), then the weight decay contribution will be in the gradients, and that would be less than ideal if we were using momentum? To ask it another way: the code you added in Stepper for AdamW would have no effect?

What I don’t understand is when the optimizer will have a param_group key equal to wd, since wd is what the code in Stepper is looking for in this line:

 if 'wd' in self.opt.param_groups[0] and self.opt.param_groups[0]['wd'] != 0:

I ask because it appears as if the layer_optimizer class in fastai has an attribute called wds but not wd.

Yes, if you use an optimizer without the fastai library, it will be treated as L2 regularization and the weight decay will be done in the gradients.
I couldn’t use the parameter named ‘weight_decay’ in any of the param_groups, since if there is a value there, the pytorch optimizer will use it as L2 reg. That’s why I added a new parameter named ‘wd’: the line you mentioned in the Stepper detects if it exists, then performs the update as weight decay.

This ‘wd’ key is introduced in the param group inside the layer_optimizer class, when you use the function set_wds_out. This is called by the WdScheduler (the old way of implementing true weight decay) during on_batch_begin, and by the new API during the phase_begin event (if the wd_loss option is set to False).
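To make the distinction concrete, here is a rough sketch of what the decoupled update amounts to inside a training step (simplified for illustration, not the actual fastai code; the function and its arguments are made up):

```python
def training_step(model, loss_fn, opt, xb, yb, lr, wd):
    # forward / backward as usual: the gradients never contain the decay term
    loss = loss_fn(model(xb), yb)
    loss.backward()

    # decoupled ("true") weight decay: shrink the weights directly,
    # so momentum / Adam statistics are computed on the raw gradients only
    if wd != 0:
        for group in opt.param_groups:
            for p in group['params']:
                p.data.mul_(1 - lr * wd)

    opt.step()       # optimizer update with the unmodified gradients
    opt.zero_grad()
    return loss.item()
```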

Since L2 regularization and weight decay lead to the same result in the absence of momentum, and in the presence of momentum weight decay does the “correct” thing, would it be possible to back out the weight decay contribution from the gradients before we take the step()? Something like the following (not sure if this would be quite right)?

for group in opt.param_groups:
    for p in group['params']:
        if p.grad is not None:
            # back out the L2 contribution from the gradients...
            p.grad.data = p.grad.data.add(-wd * p.data)
            # ...and apply the decay directly to the weights instead
            p.data = p.data.add(-wd * lr, p.data)

If it were done this way, then I think we would be backing out L2 regularization in the gradients and inserting it into the parameter weights and so this method would work for any pytorch optimizer.
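For what it’s worth, wrapped around a stock optimizer it might look roughly like this (just a sketch of the idea in current PyTorch syntax; the helper name and the tiny training step are made up, and it assumes opt was built with weight_decay=wd so that its internal step re-adds the term subtracted here):

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)
lr, wd = 1e-2, 1e-4
# stock optimizer: its weight_decay argument is really L2 reg added to the gradients
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=wd)

def step_with_true_wd(opt, lr, wd):
    for group in opt.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.data.add_(p.data, alpha=-wd)  # cancel the L2 term the step will add back
                p.data.mul_(1 - lr * wd)             # apply true weight decay to the weights
    opt.step()

x, y = torch.randn(4, 10), torch.randn(4, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
step_with_true_wd(opt, lr, wd)
opt.zero_grad()
```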

Am I thinking about this right?

It would have to be tested, but this should also work, yes. Though you’re adding two ‘useless’ operations (subtracting the weight decay from the gradients, only for the optimizer to add it back), so it’s a bit less efficient than the way it’s currently implemented.

Yeah, definitely not as efficient. The only value it would add to the library is that SGD with momentum would be executed properly in the presence of a weight_decay parameter for any type of pytorch optimizer. Personally, I think it’s fine to always create a layer_optimizer instance when using weight decay and SGD with momentum, but if other users feel differently, I could take a look at implementing this.