After finishing lesson 7, I was really impressed by the FCN: it can create a heat map that shows where the model is looking for fish in the picture. In another section, we used bounding boxes to tell the model where it should look for fish. So I came up with an idea: what if we combine the FCN and the bounding box?
So I created an architecture: I use VGG640 to precompute the convolutional features, then build an FCN model on top of them like this:
from keras.models import Model
from keras.layers import (Input, BatchNormalization, Convolution2D,
                          Flatten, Dense, GlobalAveragePooling2D, Activation)

nf = 128  # number of filters per conv block (matches the summary below)

def get_bb_model():
    # input: the precomputed VGG640 conv features, shape (512, 22, 40)
    inp = Input(conv_layers[-1].output_shape[1:])
    x = BatchNormalization(axis=1)(inp)
    x = Convolution2D(nf, 3, 3, activation='relu', border_mode='same')(x)
    x = BatchNormalization(axis=1)(x)
    x = Convolution2D(nf, 3, 3, activation='relu', border_mode='same')(x)
    x = BatchNormalization(axis=1)(x)
    x = Convolution2D(nf, 3, 3, activation='relu', border_mode='same')(x)
    x = BatchNormalization(axis=1)(x)
    # 8 channels, one per fish class: this is also where the heat map comes from
    x = Convolution2D(8, 3, 3, border_mode='same')(x)
    # bounding-box head: regress 4 numbers from the flattened feature map
    x_bb = Flatten()(x)
    x_bb = Dense(4, name='bb')(x_bb)
    # classification head: average each channel over space, then softmax
    x_class = GlobalAveragePooling2D()(x)
    x_class = Activation('softmax')(x_class)
    return Model(inp, [x_bb, x_class])
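For context, `conv_layers` is the list of VGG640's convolutional layers (VGG16 run on 640×360 inputs), and the features are computed once up front. A minimal sketch of that precompute step; the `trn`/`val` image arrays and the batch size here are my assumptions:

from keras.models import Sequential

# stack just the convolutional part of VGG640 and push the images through once
conv_model = Sequential(conv_layers)
conv_feat = conv_model.predict(trn, batch_size=32)       # (3277, 512, 22, 40)
conv_val_feat = conv_model.predict(val, batch_size=32)   # (500, 512, 22, 40)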
Calling model.summary(), we can see the architecture clearly:
____________________________________________________________________________________________________
Layer (type)                                       Output Shape          Param #     Connected to
====================================================================================================
input_3 (InputLayer)                               (None, 512, 22, 40)   0
batchnormalization_9 (BatchNormalization)          (None, 512, 22, 40)   1024        input_3[0][0]
convolution2d_22 (Convolution2D)                   (None, 128, 22, 40)   589952      batchnormalization_9[0][0]
batchnormalization_10 (BatchNormalization)         (None, 128, 22, 40)   256         convolution2d_22[0][0]
convolution2d_23 (Convolution2D)                   (None, 128, 22, 40)   147584      batchnormalization_10[0][0]
batchnormalization_11 (BatchNormalization)         (None, 128, 22, 40)   256         convolution2d_23[0][0]
convolution2d_24 (Convolution2D)                   (None, 128, 22, 40)   147584      batchnormalization_11[0][0]
batchnormalization_12 (BatchNormalization)         (None, 128, 22, 40)   256         convolution2d_24[0][0]
convolution2d_25 (Convolution2D)                   (None, 8, 22, 40)     9224        batchnormalization_12[0][0]
flatten_3 (Flatten)                                (None, 7040)          0           convolution2d_25[0][0]
globalaveragepooling2d_3 (GlobalAveragePooling2D)  (None, 8)             0           convolution2d_25[0][0]
bb (Dense)                                         (None, 4)             28164       flatten_3[0][0]
activation_3 (Activation)                          (None, 8)             0           globalaveragepooling2d_3[0][0]
====================================================================================================
Total params: 924300
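To train the two heads together, the model needs one loss per output: regression for the box and cross-entropy for the class, with the box loss scaled down so it doesn't swamp the classifier. Here is a minimal sketch, assuming MSE for the box and a 0.001 weight on it (in the log below, 0.001 × 1274.4607 + 0.1236 ≈ 1.398 matches val_loss, so a weight around 0.001 is consistent); the target arrays `trn_bbox`/`trn_labels`/`val_bbox`/`val_labels` are my names:

from keras.optimizers import Adam

model = get_bb_model()
model.compile(Adam(lr=0.001),
              loss=['mse', 'categorical_crossentropy'],
              metrics=['accuracy'],
              loss_weights=[0.001, 1.])
# two targets per sample: a 4-number box and a one-hot class over the 8 fish types;
# I called fit once per epoch (10 epochs total), so the log below is the last call
model.fit(conv_feat, [trn_bbox, trn_labels], batch_size=32, nb_epoch=1,
          validation_data=(conv_val_feat, [val_bbox, val_labels]))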
After running 10 epochs, I got a pretty good result:
Train on 3277 samples, validate on 500 samples
Epoch 1/1
3277/3277 [==============================] - 25s - loss: 0.2828 - bb_loss: 258.3640 - activation_3_loss: 0.0244 - bb_acc: 0.8419 - activation_3_acc: 0.9997 - val_loss: 1.3981 - val_bb_loss: 1274.4607 - val_activation_3_loss: 0.1236 - val_bb_acc: 0.7800 - val_activation_3_acc: 0.9720
That's 97.2% validation accuracy on the fish classes, with a classification loss of only 0.12!
Let's visualize the result of one sample:
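Here's roughly how that plot can be produced. This is only a sketch: `val_images` (the validation images in plottable height × width × channel order), the sample index `idx`, and the (x, y, w, h) box encoding are all my assumptions:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

idx = 0                                  # any validation sample
pred_bb, pred_probs = model.predict(conv_val_feat[idx:idx + 1])
x, y, w, h = pred_bb[0]                  # assuming boxes were encoded as (x, y, w, h)

fig, ax = plt.subplots()
ax.imshow(val_images[idx])               # the original (resized) image
ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor='red', lw=2))
plt.title('predicted class: %d' % pred_probs[0].argmax())
plt.show()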
My intuition is this: the model has learned good features in the last convolutional layer, which outputs 8 channels (one per class), and those channels form a heat map telling us it was looking for fish in the pink areas. In other words, the model already focuses on the pink areas; if we also tell it, via the bounding box, that there's a fish in one particular pink area, SGD will keep adjusting its "vision" until it finds the fish.
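If you want to reproduce the heat map itself, here's a sketch: pull the 8-channel output of the last convolution out with a backend function, pick the predicted class's channel, and overlay it upsampled on the image. The 360×640 image size and the `val_images`/`idx` names are the same assumptions as above:

import scipy.misc
from keras import backend as K

# grab the final 8-channel Convolution2D layer (the one feeding Flatten and GAP)
l_conv = [l for l in model.layers if type(l).__name__ == 'Convolution2D'][-1]
get_heat = K.function([model.input, K.learning_phase()], [l_conv.output])

heat = get_heat([conv_val_feat[idx:idx + 1], 0])[0][0]   # (8, 22, 40), test mode
cls = model.predict(conv_val_feat[idx:idx + 1])[1][0].argmax()

plt.imshow(val_images[idx])
plt.imshow(scipy.misc.imresize(heat[cls], (360, 640)),   # upsample 22x40 -> 360x640
           cmap='cool', alpha=0.5)                       # the pink overlay
plt.show()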
That's a perfect cooperation between human and AI!
If you have a better idea, I hope you'll share it with us so we can improve the result together!
PS: You can download my ipynb from here.