Variations of pooling

I am trying to understand MaxPooling.
I get that we’re trying to reduce dimensionality by taking the maximum element in each subset (window) of a given matrix and then replacing that subset with its maximum. My question (possibly quite trivial) is - how do we know that the most information is contained in the largest element of each subset? Why not take the average, so every element is accounted for?

An example would be the MNIST dataset. We know that those digits were written in black ink on a white background. So for each block, if we took the max, it would give us the maximum info. What if the digits were written in white ink on a black background?

Are there cases in which we may be able to combine the pooling methods - average and max? When would we do that and why?


That’s a great question. Whilst that could be an issue for the input layer, it’s not likely to be an issue for any of the intermediate layers, since they are all the results of convolution operations, which you can think of as feature detectors, and a large activation indicates that the feature has been found.

We don’t do max-pooling on the input layer, so this isn’t a problem in practice (if we want to down-sample the input, we generally use some kind of interpolation or averaging).

Having said all that, using average pooling can work well in some situations (especially for localization), and there are other techniques such as fractional max pooling that have had a lot of success.
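To make the difference between the two concrete, here is a minimal NumPy sketch. The helper `pool2x2` and the toy `feature_map` are made up for illustration (they are not from any library), and the sketch assumes non-overlapping 2x2 windows on an array with even height and width.

```python
import numpy as np

def pool2x2(x, reduce_fn):
    """Non-overlapping 2x2 pooling over a 2-D array with even height and width."""
    h, w = x.shape
    # Split into (row_block, row_in_block, col_block, col_in_block),
    # then reduce over the two "in_block" axes.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

# Toy activation map: one strong, isolated detection (9.0) in the top-left
# window and a weak but uniform response (1.0) in the bottom-right window.
feature_map = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

max_pooled = pool2x2(feature_map, np.max)   # [[9.0, 0.0], [0.0, 1.0]]
avg_pooled = pool2x2(feature_map, np.mean)  # [[2.25, 0.0], [0.0, 1.0]]

print(max_pooled)
print(avg_pooled)
```

Max pooling keeps the strong detection at full strength, while averaging dilutes it with the zeros around it. Where the response is uniform across the window, the two agree.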

Thanks. I will look into other pooling techniques as well.

I’m still curious about what would happen in the hypothetical situation where MNIST’s colors were swapped – would Max Pooling still return the same downsampled image? I’m pretty sure it would, but I’m not sure exactly why or how.

I feel like I haven’t accepted the decoupling of channel information (color) and ‘semantic representation’ yet – I intuitively can see color as a value in a matrix, but I don’t know how to interpret semantic depth. Which leads me to be unsure of why semantic value appears to us in the form of a large value.

We don’t maxpool the input, so it doesn’t impact the output of the model. If the colors were swapped, the first conv layer would learn to use negative weights, and the first layer of activations therefore would still be much the same regardless of whether it’s white-on-black or black-on-white.
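A rough way to check this numerically is sketched below. It assumes pixel values in [0, 1], so swapping colors means `1 - image`, and it uses a hand-rolled `conv2d_valid` helper (hypothetical, not a library function) with a ReLU. Note one assumption beyond what is said above: to get exactly the same activations, the bias has to shift by the sum of the kernel weights as well as the weights changing sign. A trained network would only need to land somewhere close to this.

```python
import numpy as np

def conv2d_valid(image, kernel, bias):
    """Single-channel 'valid' cross-correlation followed by ReLU (illustrative helper)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
image = rng.random((6, 6))         # stand-in for an input with pixels in [0, 1]
kernel = rng.normal(size=(3, 3))   # stand-in for a learned filter
bias = 0.1

# Color-swapped input, with the filter negated and the bias shifted to compensate:
# sum(-w * (1 - x)) + b + sum(w) == sum(w * x) + b
inverted = 1.0 - image
flipped_kernel = -kernel
shifted_bias = bias + kernel.sum()

original_act = conv2d_valid(image, kernel, bias)
inverted_act = conv2d_valid(inverted, flipped_kernel, shifted_bias)

same = np.allclose(original_act, inverted_act)
print(same)
```

Since the first layer's activations match, any max pooling applied after that layer sees the same values in both cases.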