Output of frozen layers

I am running into some behavior I can’t explain, and was hoping for some help. I am using fastai 1.0.49 and pytorch 1.0.1 on linux.
I created a notebook gist to explain my issue here: gist

To start, let me define a simple model that reproduces the issue.

from fastai.vision import *   # fastai 1.0: provides create_body, create_head, models, Learner, nn

class WTTest(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.base = create_body(models.alexnet)                      # AlexNet feature extractor
        for p in self.base.parameters(): p.requires_grad = False     # freeze the base
        self.head = create_head(256*2, num_classes)                  # 256 = AlexNet body output channels

    def forward(self, x):
        return self.head(self.base(x))

learn = Learner(data, WTTest(data.c))

The issue is this: if I grab a batch x and call learn.model.base(x) to get the activations of the base network, train with learn.fit_one_cycle(1,1e-3) for any length of time, and then call learn.model.base(x) again, the two sets of activations are different, despite the fact that the base layers are frozen. However, if I compare the base layer weights before and after training, they are identical. I can’t figure out how the activations can change when both the input and the weights stay the same.
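
For reference, the comparison roughly looks like this (a minimal sketch; act_before/act_after are just names used for this example):

x, y = next(iter(data.valid_dl))                  # one fixed batch, reused for both passes
act_before = learn.model.base(x).detach().clone()

learn.fit_one_cycle(1, 1e-3)

act_after = learn.model.base(x).detach().clone()

print(torch.allclose(act_before, act_after))      # comes out False for me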

The activations from before and after training don’t differ all that much (usually on the order of 0.01), but the more epochs I train, the more they diverge.

I originally thought it was an issue with batch norm statistics, but Alexnet has no batchnorm layers.
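
One way to confirm that is to scan the base for batchnorm modules, along these lines:

has_bn = any(isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
             for m in learn.model.base.modules())
print(has_bn)   # False for the AlexNet body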

Did you put your model in eval mode both times? I don’t know what the reason could be (I initially thought of BatchNorm too).

I had been (though not in the gist); I just added calls to learn.model.eval() in both locations and the result is the same – the activations change slightly.
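
Concretely, I did something like this before each of the two forward passes:

learn.model.eval()
with torch.no_grad():
    act = learn.model.base(x)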

How slightly? Are you on GPU?

I am on GPU. For example, after training one epoch a single feature map in the output goes from:
tensor([[0.6377, 0.0000, 0.0000],
        [1.7095, 1.3126, 0.0102],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')

to:

tensor([[0.6377, 0.0000, 0.0000],
        [1.7083, 1.3123, 0.0083],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')

Edit: and after training for 10 epochs, the resulting output becomes:

tensor([[0.6370, 0.0000, 0.0000],
        [1.7006, 1.3142, 0.0000],
        [0.0000, 0.0000, 0.0000]], device='cuda:0')
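
To put a number on the drift (rather than eyeballing a single feature map), one can compare the full tensors, e.g. with the act_before/act_after names from the sketch above:

diff = (act_before - act_after).abs()
print(diff.max().item(), diff.mean().item())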

Did you check your weights are exactly the same as before? With a save of the model, then a load, and a check with torch.allclose?

I checked by saving the state dict of the base before and after, and running:

for key in prev_sd.keys():
    if not torch.equal(prev_sd[key],post_sd[key]):
        print(key)

No keys were printed.
I just tried again using torch.allclose instead, and got the same result.
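That is, essentially the same loop with torch.allclose swapped in:

for key in prev_sd.keys():
    if not torch.allclose(prev_sd[key], post_sd[key]):
        print(key)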

Are you sure you don’t have any kind of augmentation on your data?

Not using augmentation. In addition, I am only grabbing a batch of data once, with x,y = next(iter(data.valid_dl)), and using the same batch for both tests.

I’ve done some more tests, and this only seems to occur when using AlexNet or VGG16/19 as the base network. I tried ResNets and DenseNets, and it doesn’t happen in those cases.
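
For these tests I only swapped the body constructor and matched the head input to the body’s output channel count; roughly like this (a sketch, with WTTestArch and nf as placeholder names; nf is 256 for the AlexNet body and 512 for a resnet34 body, for example):

class WTTestArch(nn.Module):
    def __init__(self, num_classes, arch=models.resnet34, nf=512):
        super().__init__()
        self.base = create_body(arch)
        for p in self.base.parameters(): p.requires_grad = False
        self.head = create_head(nf*2, num_classes)   # create_head concatenates avg+max pooling, hence nf*2

    def forward(self, x):
        return self.head(self.base(x))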

And if you do the predictions ten times in a row without doing training in the middle, do you always get the same exact result?

Yes, without training in the middle I always get identical results.
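
The check itself was just repeated forward passes on the same batch, something along these lines:

learn.model.eval()
with torch.no_grad():
    outs = [learn.model.base(x) for _ in range(10)]
print(all(torch.equal(outs[0], o) for o in outs))   # True when no training happens in between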