@metachi (FYI… for your attempts to port this in Keras)
The code’s full of sneaky little surprises (I mean that in a good way, of course!)
During my attempt to integrate VGG-16, I had to examine the code very carefully and spend a lot of time understanding what this Module:
```python
import torch
from torch import nn

class AdaptiveConcatPool2d(nn.Module):  # name as in the fastai source
    def __init__(self, sz=None):
        super().__init__()  # missing from the snippet as originally quoted
        sz = sz or (1, 1)
        self.ap = nn.AdaptiveAvgPool2d(sz)
        self.mp = nn.AdaptiveMaxPool2d(sz)
    def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], 1)
```
was supposed to do. I think I now have a good answer in terms of what it is actually doing: it is simply the transition step that connects the last convolutional layer to the fully connected layers. I am actually going to write a blog post on it, partly as a reference for my future self.
But correct me if I am wrong: the standard ResNet34 (and VGG-16) simply average-pools the activations from the last convolutional layer before passing them to the fully connected layers. In the fastai lib, we preserve both the max and avg activations from the last conv layer. How does this help? I feel like it should, because we’re keeping more information… but does it really help?
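One quick way to see what concat pooling buys is to compare output shapes in plain PyTorch. This is just a sketch; the batch/channel/spatial sizes below are made up to be ResNet34-ish, not taken from any particular model:

```python
import torch
from torch import nn

# Fake activations from a last conv layer:
# batch=2, channels=512, spatial 7x7 (illustrative sizes only)
x = torch.randn(2, 512, 7, 7)

ap = nn.AdaptiveAvgPool2d((1, 1))
mp = nn.AdaptiveMaxPool2d((1, 1))

avg_only = ap(x)                     # shape (2, 512, 1, 1) - what plain avg pooling gives
both = torch.cat([mp(x), ap(x)], 1)  # shape (2, 1024, 1, 1) - channels doubled

print(avg_only.shape)  # torch.Size([2, 512, 1, 1])
print(both.shape)      # torch.Size([2, 1024, 1, 1])
```

So the “cost” is that the first fully connected layer has to accept twice as many input features.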
So I think this is one way we’re doing transfer learning differently from the traditional approach. In the traditional approach (including in fastai v1), we only replaced the last classification layer. Here, we’re completely recreating the fully connected layers and training them during the fit.
Exactly that. Yes it does help. I came up with it during the Planet comp, then 2 weeks later a paper came out that mentioned it in an appendix. Sorry I don’t remember the paper.
It would make for an interesting blog post - you could test using concat pooling vs avg pooling vs max pooling for various datasets.
Imagine, for instance, the Planet satellite comp. You don’t want to know on average whether the pre-pooling cells have, say, a river, but whether any of them have a river - i.e. you want the max. But as to whether the image is ‘hazy’ (another of the labels in this comp), you really want to know whether it’s hazy on average. So by including both, we have access to both types of info.
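The intuition is easy to see with numbers. Here’s a toy sketch (the feature maps and values are made up purely for illustration): a “river” detector that fires strongly in one cell and a “haze” detector that is mildly active everywhere end up with the same mean, but very different maxes:

```python
import torch

# One detector fires strongly in a single cell (a river in one corner).
river = torch.tensor([[0., 0., 0., 0.],
                      [0., 0., 0., 0.],
                      [0., 0., 0., 0.],
                      [0., 0., 0., 8.]])

# Another is mildly active everywhere (overall haziness).
haze = torch.full((4, 4), 0.5)

print(river.max().item(), river.mean().item())  # 8.0 0.5 -> max flags the river; mean washes it out
print(haze.max().item(), haze.mean().item())    # 0.5 0.5 -> mean is exactly the "haziness" signal
```

Both maps have a mean of 0.5, so avg pooling alone can’t tell them apart; concatenating max and avg keeps both signals.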
Right - in fact we’re creating multiple fully connected layers, with dropout and batchnorm before each.
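A sketch of what such a head might look like in plain PyTorch (the layer widths, dropout rates, and class count here are all illustrative, not the exact fastai defaults):

```python
import torch
from torch import nn

class ConcatPool(nn.Module):
    """Concat of global max and avg pooling, as in the snippet above."""
    def forward(self, x):
        return torch.cat([nn.functional.adaptive_max_pool2d(x, 1),
                          nn.functional.adaptive_avg_pool2d(x, 1)], dim=1)

head = nn.Sequential(
    ConcatPool(),          # 512 conv channels -> 1024 pooled features
    nn.Flatten(),
    nn.BatchNorm1d(1024),  # batchnorm + dropout before each linear layer
    nn.Dropout(0.25),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.BatchNorm1d(512),
    nn.Dropout(0.5),
    nn.Linear(512, 10),    # 10 classes, purely illustrative
)

x = torch.randn(4, 512, 7, 7)  # fake activations from the conv body
print(head(x).shape)           # torch.Size([4, 10])
```

The whole head is freshly initialized and trained during the fit, while the conv body comes from the pretrained model.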
Max vs Avg vs Adaptive [Max, Avg] pooling - it’s still kind of confusing. I understand that it adds additional information to the FC layer. Would it be possible to explain in which use cases you’d use Max vs Avg vs Adaptive?
PyTorch doesn’t have GlobalMaxPool2D or GlobalAvgPool2D, so I think AdaptiveMaxPool2d and AdaptiveAvgPool2d with an output size of 1 basically adapt to the input H and W and operate like global avg and global max pooling layers.
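That matches my understanding: with an output size of 1, adaptive pooling collapses any H×W down to 1×1, which is exactly global pooling. A quick check (sketch) against a hand-rolled version:

```python
import torch
from torch import nn

x = torch.randn(2, 8, 5, 7)  # arbitrary H and W

global_avg = x.mean(dim=(2, 3), keepdim=True)  # "global average pooling" by hand
adaptive_avg = nn.AdaptiveAvgPool2d(1)(x)      # adapts to any H, W -> 1x1 output
print(torch.allclose(global_avg, adaptive_avg))  # True

global_max = x.amax(dim=(2, 3), keepdim=True)  # "global max pooling" by hand
adaptive_max = nn.AdaptiveMaxPool2d(1)(x)
print(torch.allclose(global_max, adaptive_max))  # True
```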
With regards to when to use Avg vs Max: sometimes the max value of the HxW feature map of the last layer works better than the avg, and vice versa. By adding both to the final layer, you’re letting the neural net choose what works without having to experiment yourself. That’s how I understand it, and I think Jeremy talked about it in one of the lecture videos.
Was this in a lesson?
I’m trying to learn how to use this now but this comment isn’t in a specific lesson’s thread.