Adaptive pooling output

Hi everyone,

This is a crosspost from this SO post; I was hoping to get your thoughts on the question. For context, I was wondering whether we can skip flipping images as an augmentation.

I am looking at image embeddings and wondering why flipping images changes the output. Consider, for example, resnet18 with the classification head removed:

import torch
import torch.nn as nn
import torchvision.models as models

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = models.resnet18(pretrained=True)
model.fc = nn.Identity()  # drop the classification head, keep the pooled embedding
model = model.to(device)
model.eval()

x = torch.randn(20, 3, 128, 128).to(device)
with torch.no_grad():
    y1 = model(x)           # original batch
    y2 = model(x.flip(-1))  # horizontally flipped batch
    y3 = model(x.flip(-2))  # vertically flipped batch

The tail of the network looks like this; most importantly, it ends in an AdaptiveAvgPool2d layer, where the feature maps are pooled down to a single pixel:
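The pooling layer can also be inspected directly; `avgpool` is the attribute name used in torchvision's ResNet implementation:

    print(model.avgpool)
    # AdaptiveAvgPool2d(output_size=(1, 1))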

The way I'm thinking about it: since the network is just convolutions stacked on convolutions before the pooling, all that should happen is that the feature maps flip the same way the image is flipped. The average pooling then simply averages the final feature map (along each channel) and is invariant to its orientation. AdaptiveMaxPool should behave the same way.
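That flip-equivariance assumption can be sanity-checked in isolation. A minimal sketch (added for illustration, not from the original post), with a single randomly initialized convolution standing in for ResNet's layers:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
    x = torch.randn(1, 3, 32, 32)

    with torch.no_grad():
        a = conv(x.flip(-1)).flip(-1)  # flip the input, then un-flip the feature map
        b = conv(x)                    # plain forward pass

    # If convolution commuted with horizontal flips, these would match.
    print(torch.allclose(a, b, atol=1e-6))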

The key difference from 'normal' convnets is that here we pool/average all the way down to a single pixel.
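The pooling half of the argument does hold on its own; a quick sketch (again, added for illustration) showing that pooling to 1×1 ignores the orientation of its input:

    import torch
    import torch.nn as nn

    pool = torch.nn.AdaptiveAvgPool2d(1)
    fmap = torch.randn(1, 512, 4, 4)  # a stand-in for the final feature map

    # Averaging over all spatial positions ignores their arrangement,
    # so flipping the feature map leaves the pooled output unchanged.
    print(torch.allclose(pool(fmap), pool(fmap.flip(-1))))  # True
    print(torch.allclose(pool(fmap), pool(fmap.flip(-2))))  # True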

However, when I look at y1 - y2, y1 - y3, and y2 - y3, the values differ significantly from zero. What am I getting wrong?
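To make "significantly different" concrete, the gaps can be printed like this (illustrative only; exact magnitudes will vary with the weights and the random input):

    for name, diff in [("y1-y2", y1 - y2), ("y1-y3", y1 - y3), ("y2-y3", y2 - y3)]:
        print(name, "max:", diff.abs().max().item(), "mean:", diff.abs().mean().item())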

This is a great "thought experiment" question, and I appreciated it. 🙂

(It looks like you got an answer to the mystery posted at SO.)