that’s why it’s a thought experiment, it’s not mirroring the same image that matters, I originally hope you can see the reason behind that.
let’s go further, suppose we concat every 100 cats into 1 image, and do the same to the dogs, I’d argue we can still get good result.
reason is that cnn begin locally and build a feature hierarchy step by step, absolute location does not matter, relative location(or composed feature) make the final difference.