Output of frozen layers

I am running into some behavior I can’t explain, and was hoping for some help. I am using fastai 1.0.49 and pytorch 1.0.1 on linux.
I created a notebook gist to explain my issue here: gist

To start, let me define a simple model that reproduces the issue.

from fastai.vision import *   # fastai 1.0: provides create_body, create_head, models, Learner, nn

class WTTest(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.base = create_body(models.alexnet)                      # AlexNet feature extractor
        for p in self.base.parameters(): p.requires_grad = False     # freeze the base
        self.head = create_head(256*2, num_classes)                  # 256 = AlexNet body output channels

    def forward(self, x):
        return self.head(self.base(x))

learn = Learner(data, WTTest(data.c))

The issue is this: if I grab a batch x and call learn.model.base(x) to get the activations of the base network, train with learn.fit_one_cycle(1,1e-3) for any length of time, and then call learn.model.base(x) again, the two sets of activations are different, despite the fact that the base layers are frozen. However, if I compare the base layer weights before and after training, they are identical. I can’t figure out how the activations can change when both the input and the weights stay the same.
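
For reference, the comparison roughly looks like this (a minimal sketch; act_before/act_after are just names used for this example):

x, y = next(iter(data.valid_dl))                  # one fixed batch, reused for both passes
act_before = learn.model.base(x).detach().clone()

learn.fit_one_cycle(1, 1e-3)

act_after = learn.model.base(x).detach().clone()

print(torch.allclose(act_before, act_after))      # comes out False for me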

The activations from before and after training don’t differ all that much (usually on the order of 0.01), but the more epochs I train, the more they diverge.

I originally thought it was an issue with batch norm statistics, but Alexnet has no batchnorm layers.
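
One way to confirm that is to scan the base for batchnorm modules, along these lines:

has_bn = any(isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
             for m in learn.model.base.modules())
print(has_bn)   # False for the AlexNet body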

Did you put your model in eval mode both times? I don’t know what the reason could be (I initially thought of BatchNorm too).

I had been (though not in the gist); I just added calls to learn.model.eval() in both locations and the result is the same – the activations change slightly.
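
Concretely, I did something like this before each of the two forward passes:

learn.model.eval()
with torch.no_grad():
    act = learn.model.base(x)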

How slightly? Are you on GPU?

I am on GPU. For example, after training one epoch a single feature map in the output goes from:
tensor([[0.6377, 0.0000, 0.0000],
        [1.7095, 1.3126, 0.0102],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')

to:

tensor([[0.6377, 0.0000, 0.0000],
        [1.7083, 1.3123, 0.0083],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')

Edit: and after training for 10 epochs, the resulting output becomes:

tensor([[0.6370, 0.0000, 0.0000],
        [1.7006, 1.3142, 0.0000],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')
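
To put a number on the drift (rather than eyeballing a single feature map), one can compare the full tensors, e.g. with the act_before/act_after names from the sketch above:

diff = (act_before - act_after).abs()
print(diff.max().item(), diff.mean().item())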

Did you check your weights are exactly the same as before? With a save of the model, then a load, and a check with torch.allclose?

I checked by saving the state dict of the base before and after, and running:

for key in prev_sd.keys():
    if not torch.equal(prev_sd[key],post_sd[key]):
        print(key)

No keys were printed.
I just tried again using torch.allclose instead, and got the same result.
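That is, essentially the same loop with torch.allclose swapped in:

for key in prev_sd.keys():
    if not torch.allclose(prev_sd[key], post_sd[key]):
        print(key)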

Are you sure you don’t have any kind of augmentation on your data?

Not using augmentation. In addition, I am only grabbing a batch of data once, with x,y = next(iter(data.valid_dl)), and using the same batch for both tests.

I’ve done some more tests, and this only seems to occur when using AlexNet or VGG16/19 as the base network. I tried ResNets and DenseNets, and it doesn’t happen in those cases.
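
For these tests I only swapped the body constructor and matched the head input to the body’s output channel count; roughly like this (a sketch, with WTTestArch and nf as placeholder names; nf is 256 for the AlexNet body and 512 for a resnet34 body, for example):

class WTTestArch(nn.Module):
    def __init__(self, num_classes, arch=models.resnet34, nf=512):
        super().__init__()
        self.base = create_body(arch)
        for p in self.base.parameters(): p.requires_grad = False
        self.head = create_head(nf*2, num_classes)   # create_head concatenates avg+max pooling, hence nf*2

    def forward(self, x):
        return self.head(self.base(x))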

And if you do the predictions ten times in a row without doing training in the middle, do you always get the same exact result?

Yes, without training in the middle I always get identical results.
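
The check itself was just repeated forward passes on the same batch, something along these lines:

learn.model.eval()
with torch.no_grad():
    outs = [learn.model.base(x) for _ in range(10)]
print(all(torch.equal(outs[0], o) for o in outs))   # True when no training happens in between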