Average Pooling instead of Dense layers

Jeremy discussed this topic when talking about ResNet, and I also read the two papers that first introduced the idea. The reasoning behind it makes sense, and saving millions of parameters is never something one would complain about. So I went ahead and replaced the dense layers in several of my CNN models with average pooling, and the results were always terrible. My tested architectures had between 5 and 15 conv layers; the "style" was similar to VGG or ResNet itself.

Does somebody know why this is? Do I need some minimum depth to make average pooling work? Does training with average pooling just take longer?


Have a look at how I use avg pooling in the heatmap models in lesson 7. Be sure that your penultimate layer has a reasonable number (7x7 is common) of activations. You may also want to try a single dense layer after the average pooling layer.
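To make the suggestion above concrete, here is a minimal sketch of such a head in PyTorch (my own illustration, not the lesson 7 code): the conv stack ends with a small spatial map such as 7x7, global average pooling collapses it to one value per channel, and a single dense layer maps channels to classes. The channel count (512) and class count are placeholder assumptions.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Global average pooling followed by a single dense layer."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Collapses any (N, C, H, W) map to (N, C, 1, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One small dense layer after pooling, as suggested above
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(x).flatten(1)  # (N, C)
        return self.fc(x)

head = GAPHead(in_channels=512, num_classes=176)
features = torch.randn(2, 512, 7, 7)  # stand-in for the penultimate conv activations
logits = head(features)
print(logits.shape)  # torch.Size([2, 176])
```

The dense layer here only sees one number per channel, so it stays tiny compared to a dense layer on the flattened 7x7 map.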

Thanks Jeremy.

I tested it a bit further and noticed that average pooling trains a little slower, but it also helps a lot to prevent overfitting with very little data. My main problem was that I didn't use enough channels in one of my architectures: with 176 different classes I obviously need more than the 64 channels coming out of average pooling :slight_smile:. I'm not yet on par with my fully connected architecture, but the gap is getting smaller.
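The parameter savings mentioned earlier are easy to check with back-of-the-envelope arithmetic. Here is a rough comparison, assuming a hypothetical 512-channel 7x7 final conv map (the 176 classes come from the post above; the other numbers are placeholders):

```python
classes = 176
channels = 512
spatial = 7 * 7

# Flatten the 7x7x512 map into a dense layer: every activation gets a weight per class.
dense_params = channels * spatial * classes + classes

# Global average pooling first: only one weight per channel per class.
gap_params = channels * classes + classes

print(dense_params)  # 4415664
print(gap_params)    # 90288
```

Under these assumptions the pooled head is roughly 49x smaller, which is where the "millions of parameters saved" claim comes from, and also hints at why the pooled head needs enough channels to stay expressive.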