Bilinear CNNs


(Ben Johnson) #1

This paper has some interesting stuff in it, but one thing is a particularly simple idea that’s pretty compelling. Basically they’re trying to finetune an Imagenet-trained (VGG16) for a new classification task (what kind of bird is this?).

The way I’d usually approach this is to take the last convolution layer w/ size (7, 7, 512), max/sum pool to get to (512,) and then train a 1-2 layer network w/ appropriately sized softmax on the new dataset. In this paper, instead of the max/sum pooling, they take the outer product of each of the 49 512d vectors to get a (7, 7, 512, 512) feature map, then sum/max pool to get to (512, 512), then flatten to (512 ** 2, ).

Obviously 512 ** 2 = 262,144 dimensional features are big, and the outer product is slow. But, in the paper they report that the accuracy of the model goes from 0.60 using (something like) the standard sum/max pooling to 0.81 using the “bilinear pooling” (as they call it). I was able to roughly replicate these results, though I’ve been able to get ~0.70 accuracy using standard sum/max pooling method, and ~0.77 accuracy using the bilinear pooling. [1]

The authors of the paper suggest that this is particularly beneficial for “fine grained object classification” – situations where the inter-class differences may be small compared to intra-class differences. I’d be interested in whether people have tried this method on other datasets (eg Imagenet, facial recognition, etc). Also interesting in whether people have any insight about why this works from a math perspective, and whether people have any ideas on how (something like) this could be implemented more efficiently to make it easier to plug into other models.

Here’s some Keras code for doing the bilinear pooling + some normalization that the authors suggest. It’s quite slow – I’d recommend precomputing the pooled features, rather than using this function in a network.

def bili_pooling(x):
    shp = K.shape(x)
    bsz, w, h, c = shp[0], shp[1], shp[2], shp[3]
    
    # Bilinear pooling
    b1 = K.reshape(x, (bsz * w * h, c, 1))
    b2 = K.reshape(x, (bsz * w * h, 1, c))
    d  = K.reshape(K.batch_dot(b1, b2), (bsz, w * h, c, c))
    d = K.sum(d, 1)
    
    # Flatten
    d = K.reshape(d, (bsz, c * c))
    
    # Normalize
    d = K.sqrt(K.relu(d))
    d = K.l2_normalize(d, -1)
    return d

(FWIW, because the 262,144 vectors are big and slow, I tried using only the ~20K features w/ the highest variable. This gives quite similar results to using all 262K – maybe a little accuracy loss, but I haven’t fiddled w/ it much.)

** Edit: Took a look at this again – a reasonable “end-to-end” idea is to do a 1x1 convolution before the bilinear pooling to reduce the number of channels and avoid some of the quadratic explosion. Eg. using 1x1 conv w/ 144 filters reduced the bilinearly-pooled features to ~21K dimensional w/ accuracy of ~0.75. It’s still slow, but less slow… **

[1] My baseline is stronger because they used the fc1 features while I used max-pooled conv5 features (I often try both). They do some data augmentation I haven’t bothered with yet, which I suspect explains their stronger bilinear pooling results.