This is a crosspost from this SO post. I was hoping to get your thoughts on the question. For context, I was wondering whether we can skip flipping images as an augmentation.
I am looking at image embeddings and wondering why flipping the input images changes the output. Consider, for example, resnet18 with the classification head removed:
```python
import torch
import torch.nn as nn
import torchvision.models as models

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = models.resnet18(pretrained=True)
model.fc = nn.Identity()
model = model.to(device)
model.eval()

x = torch.randn(20, 3, 128, 128).to(device)
with torch.no_grad():
    y1 = model(x)
    y2 = model(x.flip(-1))
    y3 = model(x.flip(-2))
```
My reasoning: before the pooling, the network is just convolutions stacked on convolutions, so all that should happen when the image is flipped is that the feature maps flip the same way. The global average pooling then averages the last feature map over all spatial positions (per channel), and that average is invariant to the map's orientation.
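The "feature maps just flip" assumption can be tested in isolation on a single convolution (a sketch; the layer shape and seed here are arbitrary, not from the model above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 3, 16, 16)

with torch.no_grad():
    # If conv were flip-equivariant, flipping the input would just flip the output.
    diff = (conv(x.flip(-1)).flip(-1) - conv(x)).abs().max()

    # With a horizontally symmetric kernel (and stride 1, symmetric padding)
    # the equivariance does hold: symmetrize the weights and re-check.
    conv.weight.copy_((conv.weight + conv.weight.flip(-1)) / 2)
    diff_sym = (conv(x.flip(-1)).flip(-1) - conv(x)).abs().max()

print(diff.item(), diff_sym.item())  # generic kernel: large; symmetric kernel: ~0
```

So for a single layer the equivariance only holds when the kernel itself is symmetric along the flipped axis, which a generic trained kernel is not.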
AdaptiveMaxPool should behave the same way. The key difference from 'normal' convnets is that here we pool/average all the way down to a single pixel.
However, when I look at y2 - y3, the values are significantly different from zero. Where is my reasoning going wrong?