Why not use "register_buffer" for "self.eps" in `BatchNorm`?

For implementing BatchNorm in Lesson 10, Jeremy used register_buffer as below to define vars and means, and I could follow his explanation of that part fine.

However, it made me wonder why we do NOT also have to use register_buffer when defining self.eps, and I would like help understanding why.

At 1:43:12 of the lecture video for Lesson 10, Jeremy says:

If we move the model to the GPU, anything registered as a buffer will be moved to the GPU as well.
If we didn’t do that, then it tries to do the calculation down here, and vars and means are not on the GPU but everything else is on the GPU, so we get an error.

If this is the case, don’t we have to define eps with a buffer as well, since it is also involved in the calculation inside forward?

Since self.eps is defined as self.eps = eps, i.e. a plain Python float, it will NOT automatically be moved to the GPU when the model is moved to the GPU, if I understand correctly.

Then we should get an error when x = (x-m) / (v+self.eps).sqrt() is executed, since we are trying to combine something on the GPU (x) with something NOT on the GPU (eps). (See the small check after the class below.)

import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        # NB: pytorch bn mom is opposite of what you'd expect
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        # buffers move with the module on .cuda()/.to() and are saved in the state_dict
        self.register_buffer('vars',  torch.ones(1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))

    def update_stats(self, x):
        # per-channel batch statistics
        m = x.mean((0,2,3), keepdim=True)
        v = x.var ((0,2,3), keepdim=True)
        # update the running statistics with an exponential moving average
        self.means.lerp_(m, self.mom)
        self.vars.lerp_ (v, self.mom)
        return m,v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m,v = self.update_stats(x)
        else: m,v = self.means,self.vars
        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds
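
To check my own reasoning, this is the kind of minimal test I had in mind (just a sketch with made-up names and shapes, assuming a CUDA device is available). Based on the above, I would expect the (v+self.eps).sqrt() line to complain about a device mismatch:

x_gpu = torch.randn(4, 8, 16, 16, device='cuda')  # a batch already on the GPU
bn = BatchNorm(nf=8).cuda()                        # parameters and registered buffers move to the GPU
out = bn(x_gpu)                                    # does this fail because self.eps stayed behind as a CPU-side float?
print(out.shape, out.device)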

Thanks

Hi shun. register_buffer is intended for tensors; you could use it for eps, but then you’d need to wrap the scalar inside a tensor first.
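
For example, a sketch of what the __init__ could look like in that case (hypothetical class name; update_stats and forward would stay exactly as in your post):

import torch
from torch import nn

class BatchNormEpsBuffer(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom = mom
        # wrap the Python float in a 0-dim tensor so it can be registered;
        # it then follows the module on .cuda()/.to() just like vars and means do
        self.register_buffer('eps', torch.tensor(eps))
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('vars',  torch.ones(1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))

One side effect to be aware of: a registered eps also gets saved in the model’s state_dict, which you may or may not want.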

I think scalars are treated differently by PyTorch. I haven’t looked at the implementation, but I suspect they end up being passed as plain arguments to the low-level CUDA kernels that actually do the computation. Keep in mind that the tensor operations in this Python code ultimately run on the GPU as compiled CUDA kernels; I imagine Python scalars are taken into account in that translation, and thus automatically “sent” to the GPU as part of each kernel call.
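
For instance, something like the following should run without an error and without any explicit transfer (a minimal check, assuming a CUDA device is available):

import torch

v   = torch.ones(1, 8, 1, 1, device='cuda')  # stand-in for the batch variance, on the GPU
eps = 1e-5                                   # a plain Python float on the host
print((v + eps).sqrt().device)               # no device-mismatch error; the scalar is passed along with the kernel call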

That’s just my rationalization, because the same question came up the first time I realized I had to use register_buffer in my own code. It would be great to get an authoritative explanation from someone more versed in that aspect of PyTorch 🙂


Pedro,
Thank you for your quick response.

I imagine that scalars will be considered in this transformation, and thus automatically “sent” to the GPU.
I see, that makes sense. Thank you for sharing your great insight!

I will dig into this a bit more myself and come back here when I find a more detailed explanation.

Thanks
