Has anyone come up with any good heuristics for choosing weight decay?
That paper was written before the AdamW paper, so I’m wondering whether it is still relevant.
If I’m interpreting it right, though, the approach would be to first choose a learning rate in the usual way with no weight decay, and then choose the largest weight decay that doesn’t degrade results at that learning rate. Is that right?
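
To make sure I'm reading the procedure correctly, here is a rough sketch of how I understand it. Everything named here is my own assumption: `evaluate` is a hypothetical stand-in for a full training run that returns validation loss, `tolerance` is a made-up knob for what counts as "degradation", and `toy_val_loss` is fake data just so the snippet runs. It also says nothing about whether the recipe still holds with AdamW.

```python
from typing import Callable, Sequence, Tuple


def pick_lr_then_weight_decay(
    evaluate: Callable[[float, float], float],
    learning_rates: Sequence[float],
    weight_decays: Sequence[float],
    tolerance: float = 0.0,
) -> Tuple[float, float]:
    """Two-stage search: tune lr with no weight decay, then raise weight decay.

    `evaluate(lr, wd)` is a hypothetical callback that trains a model and
    returns its validation loss (lower is better).
    """
    # Stage 1: pick the learning rate with weight decay switched off.
    best_lr = min(learning_rates, key=lambda lr: evaluate(lr, 0.0))
    baseline = evaluate(best_lr, 0.0)

    # Stage 2: keep the largest weight decay whose loss stays within
    # `tolerance` of the no-decay baseline at that learning rate.
    best_wd = 0.0
    for wd in sorted(weight_decays):
        if evaluate(best_lr, wd) <= baseline + tolerance:
            best_wd = wd
    return best_lr, best_wd


def toy_val_loss(lr: float, wd: float) -> float:
    # Fake loss surface for illustration only: best lr is 1e-3, and
    # weight decay starts to hurt once it exceeds 1e-2.
    return abs(lr - 1e-3) + max(0.0, wd - 1e-2)


if __name__ == "__main__":
    lr, wd = pick_lr_then_weight_decay(
        toy_val_loss,
        learning_rates=[1e-4, 1e-3, 1e-2],
        weight_decays=[1e-4, 1e-3, 1e-2, 1e-1],
    )
    print(lr, wd)
```

In practice each `evaluate` call is a training run, so you would probably cache results and use a coarse, log-spaced grid for both sweeps.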