Maybe someone with a deeper understanding can unravel this situation.
I am using a ResNet with images of very different widths, with the distribution highly skewed toward the smaller widths. The training loop runs through the entire epoch (the whole set of images) and then does the weight update. By first sorting the images by width and grouping them into minibatches of similar width, running an epoch is much faster, because each minibatch only needs to be padded to its own widest image. For example, n minibatches with an average padded width of 200 instead of n minibatches all padded to width 1000.
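For reference, here is a minimal sketch of the bucketing scheme I mean (names like `make_width_buckets` and the `(image_id, width)` pairs are just for illustration, not my actual code):

```python
import random

def make_width_buckets(samples, batch_size):
    """Group (image_id, width) samples into minibatches of similar width.

    Sorting by width first means each minibatch only has to be padded
    up to the widest image it contains, not the global maximum width.
    """
    ordered = sorted(samples, key=lambda s: s[1])  # sort by width
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # shuffle batch order between epochs
    return batches

# Widths skewed toward small values, with a few very wide images
widths = [120, 950, 130, 140, 1000, 125, 135, 128]
samples = [("img%d" % i, w) for i, w in enumerate(widths)]
batches = make_width_buckets(samples, batch_size=4)

for b in batches:
    # each batch pads only to its own max width, not to 1000
    print([w for _, w in b], "-> pad to", max(w for _, w in b))
```

The narrow images end up together in one batch (padded to ~130) while the few wide ones share another, which is where the speedup comes from.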
However, this method does not seem to play well with batchnorm, and I don't understand why. When I do not sort the minibatches by image width, training works, at the expense of much more time. With minibatch sorting, training also works if I put all the batchnorm layers into eval mode. If I don't, the training and validation losses are very different.
Can anyone explain what is going on? And is there any hope of using this idea with a ResNet?
Also, I can't resize all images to the same width, because width represents time and the time scale matters for classification.
Thanks for any clarification!