AdamW tuning of weight decay

Has anyone come up with any good heuristics for choosing weight decay?


An approach was posted with a notebook here: Fastai_v1, adding features

Kind regards

That paper was written before the AdamW paper, so I’m wondering whether it is still relevant.

If I’m interpreting it correctly, though, the approach is to first choose a learning rate without weight decay in the usual way, and then pick the maximum weight decay that doesn’t degrade performance at that learning rate. Is that right?

Yes :slight_smile: That’s the approach Leslie Smith describes in
A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay
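To make the heuristic concrete, here is a minimal sketch of that sweep on a toy 1-D problem, using a hand-rolled AdamW-style update (decoupled weight decay, as in the AdamW paper) rather than any fastai API. The loss, learning rate, and weight-decay grid are all illustrative assumptions, not values from the thread:

```python
import math

def adamw_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    # Decoupled weight decay (AdamW): the decay term wd * w is applied
    # directly to the weight, not folded into the gradient as in Adam + L2.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)          # bias-corrected first moment
    vhat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * (mhat / (math.sqrt(vhat) + eps) + wd * w)
    return w, m, v

def minimize(wd, steps=500):
    # Toy problem: loss = (w - 3)^2, so the gradient is 2 * (w - 3)
    # and the wd=0 optimum is w = 3.
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)
        w, m, v = adamw_step(w, g, m, v, t, wd=wd)
    return w

# Fix the learning rate (chosen first, with wd=0), then sweep weight decay
# and keep the largest value that doesn't noticeably hurt the objective.
for wd in (0.0, 1e-2, 1e-1, 1.0):
    print(f"wd={wd}: final w = {minimize(wd):.3f}")
```

Large weight decay visibly pulls the solution away from the wd=0 optimum; in a real training run you would compare validation loss instead of a toy objective, but the selection logic is the same.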