We Don’t Need To Worry About Overfitting Anymore

rbunn80130 · March 10, 2021, 10:03pm

According to this blog post…

Has anyone else checked this out or tried to implement it in fastai?

ilovescience · March 11, 2021, 2:32am

I created an implementation back in December and have put it in a gist for now:

https://gist.github.com/tmabraham/62cc1839e1dbb280cb80a79df856ec81

SAM.py

class SAM(Callback):
    "Sharpness-Aware Minimization"
    def __init__(self, zero_grad=True, rho=0.05, eps=1e-12, **kwargs): 
        assert rho >= 0.0, f"Invalid rho, should be non-negative: {rho}"
        self.state = defaultdict(dict)
        store_attr()

    def params(self): return self.learn.opt.all_params(with_grad=True)
    def _grad_norm(self): return torch.norm(torch.stack([p.grad.norm(p=2) for p,*_ in self.params()]), p=2)

This file has been truncated.

Note that it might not play nice with other callbacks (for now, will look into this further)…

I will likely write a blog post about it as well.

I observe improved performance with SAM(Ranger) on Imagenette using the same number of epochs. But we could argue it’s not comparable because one step of SAM actually takes two steps of the base optimizer. In that case, comparing SAM(Ranger) to Ranger with half the number of epochs (so it’s comparable with the number of steps that the base optimizer is taking) shows slightly less accuracy on Imagenette.

I am also going to test SAM on Noisy Imagenette. In fact I was inspired to generate the noisy Imagenette dataset in order to test the noise-robustness capabilities of SAM.

I hope this helps! And I’ll post more information once I run more experiments. Let me know if you observe anything interesting as well!

rbunn80130 · March 11, 2021, 3:37am

That’s awesome! Good thing I asked before working on it myself. I’ll let you know what results I get.

Pomo · March 14, 2021, 7:46pm

Hi to both of you. This looks like a promising technique! I read the paper, and to be honest, the math was not comprehensible to me.

Could you explain, using intuitive words, how epsilon is computed and what it means?

I am asking because I could fabricate a model whose loss function is flat in one combination of dimensions and steep in the others. How then do they take the loss and gradient at a single weights+epsilon and determine whether the loss is inside a flat basin?

Thanks for clarifying.

drscotthawley · April 3, 2021, 6:40pm

@Pomo I don’t fully follow every part of their derivation, but it looks like they are evaluating the gradient at two points instead of just one, so the combination of these two gradients could amount to a rough approximation of the part of the Hessian one would care about (or so they seem to argue).
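
As a rough sketch of why two gradient evaluations can carry Hessian information: assuming the 2-norm version of the perturbation and a first-order Taylor expansion of the gradient, the second evaluation picks up a Hessian-vector product along the normalized gradient direction. Here H(w) denotes the Hessian of the loss L at w, and rho is the neighborhood radius (the `rho` argument in the gist). This is an illustrative approximation, not a restatement of the paper's full derivation.

```latex
\hat{\epsilon}(w) = \rho \, \frac{\nabla L(w)}{\lVert \nabla L(w) \rVert_2},
\qquad
\nabla L\bigl(w + \hat{\epsilon}(w)\bigr)
  \approx \nabla L(w) + H(w)\,\hat{\epsilon}(w)
  = \nabla L(w) + \rho \, \frac{H(w)\,\nabla L(w)}{\lVert \nabla L(w) \rVert_2}
```

So the method never forms the Hessian; it only probes curvature along one direction, the current gradient direction.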

@ilovescience Thanks for sharing that. I think the code comment
# climb to the local maximum "w + e(w)"
could be confusing. There’s no guarantee that there’s a local maximum there, just that the loss is greater there than at w. You might phrase it as something like…
# climb "upwards" in the direction e(w)
or
# climb to "w + e(w)" where loss is higher
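
To make the two-point procedure discussed in this thread concrete, here is a minimal pure-Python sketch of one SAM-style update on a toy loss. It is not the code from the gist: the quadratic loss, the `CURVATURE` values, the function names, and the plain SGD update are all assumptions chosen for illustration. Only the defaults `rho=0.05` and `eps=1e-12` are taken from the gist excerpt above.

```python
import math

RHO = 0.05   # neighborhood radius (same default as rho in the gist)
EPS = 1e-12  # guards against division by zero (same default as eps in the gist)

# Hypothetical toy loss: nearly flat along w[0], steep along w[1].
CURVATURE = (0.01, 10.0)

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURE, w)]

def sam_gradient(w, rho=RHO, eps=EPS):
    """Return the perturbation e(w) and the gradient evaluated at w + e(w)."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    # climb to "w + e(w)" where loss is higher
    e = [rho * x / (norm + eps) for x in g]
    w_adv = [x + d for x, d in zip(w, e)]
    # second gradient evaluation, taken at the perturbed point
    return e, grad(w_adv)

def sam_sgd_step(w, lr=0.05):
    """Apply the perturbed-point gradient at the ORIGINAL weights w."""
    _, g_adv = sam_gradient(w)
    return [x - lr * gx for x, gx in zip(w, g_adv)]

w = [1.0, 1.0]
e, g_adv = sam_gradient(w)
w_next = sam_sgd_step(w)
```

In this toy case the perturbation `e` has length `rho` and points almost entirely along the steep coordinate, because it follows the gradient direction. That is one way to read the flat-in-some-directions question above: the probe is taken along the locally steepest ascent direction rather than along an arbitrary one.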