Accumulating Gradients

It seems BN (batch norm) is the issue here. If we merge the PR for this callback as it is but end up with lower accuracy, I don't think it would be worth it. Maybe we can find a way to remedy this BN issue. But it seems we would then have to change the BN layers in the architecture?

Most articles that show how to do gradient accumulation do not solve the BN issue, and they do not even mention it. I think that's because they haven't checked whether the accuracy they get with accumulation is the same as the accuracy they get without it.
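
To make the BN issue concrete, here is a minimal sketch in plain Python (no framework, so it is not the callback's code). The `batch_norm` helper and the toy values are hypothetical and only for illustration. It normalizes the same four activations once as a full batch and once as two micro-batches of 2, which is what a BN layer in training mode would see during accumulation. Since the normalized activations differ, the gradients accumulated from the micro-batches would not, in general, add up to the full-batch gradient.

```python
from statistics import fmean

def batch_norm(xs, eps=1e-5):
    """Normalize scalars using the batch's own mean and biased variance
    (hypothetical helper: no learnable scale/shift, no running stats)."""
    mean = fmean(xs)
    var = fmean([(x - mean) ** 2 for x in xs])
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

# Toy activations, made up for illustration.
batch = [1.0, 2.0, 3.0, 10.0]

# One pass over the full batch of 4: statistics come from all samples.
full = batch_norm(batch)

# Two accumulation steps with micro-batches of 2:
# each step normalizes with its own statistics only.
micro = batch_norm(batch[:2]) + batch_norm(batch[2:])

# Largest per-sample difference between the two normalizations.
max_diff = max(abs(a - b) for a, b in zip(full, micro))

print("full batch :", full)
print("micro-batch:", micro)
print("max diff   :", max_diff)
```

A layer without batch statistics (a plain linear layer, say) would not show this gap, which is probably why accumulation looks exact in examples that skip BN.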

related:
