BatchNorm3d gives different result than manual calculation

Hi everyone. I went down a deep rabbit hole implementing a manual version of BatchNorm3d to solve a completely different problem. In the process, I found that PyTorch's BatchNorm3d gives a different result than the manual calculation. Can anyone offer an explanation, or even a speculation?

The situation is reduced to its simplest form: eval mode, no training or momentum, no weight or bias, eps=0.0. In this setting BatchNorm3d should merely subtract the running mean and divide by the square root of the running variance. But that same calculation done manually yields a result that differs by roughly 1e-7, with the exact discrepancy depending on the variance used.
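For reference, the eval-mode formula from the BatchNorm docs, with eps set to 0, is all that should be happening here:

y = (x - running_mean) / sqrt(running_var + eps)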

Here is the code. You can change varfactor to see different discrepancies.

import torch
import torch.nn as nn

varfactor = 0.9  # 0.9 gives a different result; 1.0 matches
nfeatures = 64

lin = torch.rand((2, nfeatures, 5, 8, 9))  # batch size = 2
bn1 = nn.BatchNorm3d(nfeatures, eps=0.0, momentum=0.0, affine=False, track_running_stats=False)  # the simplest possible BatchNorm

# Set the mean to zero and the variance to varfactor. With track_running_stats=False
# the module's own buffers are None, so the statistics are attached manually here;
# in eval mode the forward pass then uses them instead of the batch statistics.
bn1.running_mean = nn.Parameter(torch.zeros((nfeatures,)), requires_grad=False)
bn1.running_var = nn.Parameter(torch.full((nfeatures,), varfactor), requires_grad=False)
bn1.eval()

# Manual calculation: normalize with the same statistics (eps is 0.0, so it is omitted)
y = (lin - bn1.running_mean[None, :, None, None, None]) / torch.sqrt(bn1.running_var[None, :, None, None, None])

# Print the largest elementwise difference between the two results
print((bn1(lin) - y).abs().max().item())

In this example, the discrepancy is 1.1920928955078125e-07, which happens to be exactly 2**-23, the float32 machine epsilon.
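One speculation, purely a guess about the internals: if the built-in kernel precomputes the reciprocal of the standard deviation and multiplies by it, while the manual version divides by the standard deviation, the two results round differently and can legitimately differ by about one ULP. Whether or not that is what BatchNorm3d actually does, the effect itself is easy to demonstrate with the same variance of 0.9:

import torch

# Two mathematically equivalent normalizations that round differently in float32
x = torch.rand(100000)
v = torch.tensor(0.9)

a = x / torch.sqrt(v)   # one rounding after the sqrt
b = x * torch.rsqrt(v)  # two roundings: the reciprocal sqrt, then the multiply

print((a - b).abs().max().item())  # typically ~1e-7 (about one ULP), not 0.0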

Things I have tried: 1) moving all calculations to the CPU; 2) converting all tensors from float32 to float64.
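For the second experiment, continuing from the snippet above, the float64 version looked roughly like this (a sketch, not the exact script I ran):

# Repeat the comparison in float64 (reuses lin and bn1 from the snippet above)
lin64 = lin.double()
bn1.double()  # also casts the attached running_mean and running_var

y64 = (lin64 - bn1.running_mean[None, :, None, None, None]) / torch.sqrt(bn1.running_var[None, :, None, None, None])
print((bn1(lin64) - y64).abs().max().item())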

PyTorch's calculation is buried inside C++ code. I am interested and obsessed enough to study it, but I cannot find it. Does anyone know where the relevant C++ source lives?

Final question: who even cares?

Thanks for your attention!
:exploding_head: