For implementing BatchNorm in Lesson 10, Jeremy used `register_buffer` to define `vars` and `means`, as shown in the code below, and I could follow his explanation of that part fine. However, it made me wonder why we do NOT have to use `register_buffer` for `self.eps` as well, and I would like help understanding why.
At 1:43:12 of the Lesson 10 lecture video, Jeremy says:

> If we move the model to the GPU, anything registered as a buffer will be moved to the GPU as well. If we didn't do that, then when it tries to do the calculation down here, `vars` and `means` are not on the GPU but everything else is, and we get an error.
If this is the case, don't we have to define `eps` as a buffer as well, since it is also involved in the calculation inside `forward`? Because `self.eps` is defined as `self.eps = eps`, it will NOT automatically be moved to the GPU when the model is, if I understand correctly. Then we should get an error when `x = (x-m) / (v+self.eps).sqrt()` is executed, since we would be combining a thing on the GPU (`x`) with a thing NOT on the GPU (`eps`).
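For concreteness, here is a CPU-only sketch of the arithmetic I am asking about (my own example, not from the lesson), with a plain Python float standing in for `self.eps`; note that PyTorch promotes the Python scalar into the tensor op on the fly:

```python
import torch

# Mimic the forward-pass line, with eps as a plain Python float
x = torch.randn(2, 3, 4, 4)
m = x.mean((0, 2, 3), keepdim=True)
v = x.var((0, 2, 3), keepdim=True)
eps = 1e-5  # a plain float, not a tensor, so it lives on no device

# PyTorch wraps the scalar during the op rather than requiring a device match
out = (x - m) / (v + eps).sqrt()
print(out.shape)  # torch.Size([2, 3, 4, 4])
```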
```python
class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        # NB: pytorch bn mom is opposite of what you'd expect
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('vars',  torch.ones(1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))

    def update_stats(self, x):
        m = x.mean((0,2,3), keepdim=True)
        v = x.var ((0,2,3), keepdim=True)
        self.means.lerp_(m, self.mom)
        self.vars.lerp_ (v, self.mom)
        return m,v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m,v = self.update_stats(x)
        else: m,v = self.means,self.vars
        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds
```
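To make the comparison concrete, here is a minimal sketch (my own hypothetical `Tiny` module, not from the lesson) showing that a registered buffer becomes part of the module's state, and is therefore moved by `.to(device)`, whereas a plain float attribute is not tracked at all:

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps  # plain Python float: not tracked by the module
        self.register_buffer('vars', torch.ones(1))  # tracked: moves with .to()

m = Tiny()
print(type(m.eps))               # <class 'float'>
print('vars' in m.state_dict())  # True  -- buffers are part of module state
print('eps' in m.state_dict())   # False -- plain attributes are not
```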
Thanks