Advice for Invasive Species Kaggle Competition

danielhavir · July 26, 2017, 9:22pm

Hi,

I’m participating in this competition: https://www.kaggle.com/c/invasive-species-monitoring and I cannot get past ~0.964 score. I ran out of ideas, I even recreated my validation set twice.

I’m using these augmentation parameters:
rotation_range=30, height_shift_range=0.05, shear_range=0.1, width_shift_range=0.05, zoom_range=0.05, horizontal_flip=True (I also tried width and height shift of 0.1 and a rotation range of 0.1)

I tried creating an augmented set 4x, 5x and 6x the size of the original training set.

I’m using the VGG16 convolutional model to obtain the bottleneck features. As for the top model, I tried finetuning the VGG16 model with BatchNorm, and I also tried simpler architectures such as 2x 256 + BatchNorm + Dropout 0.5-0.7, 2x 512 + BN + DO, and single layers of 256. In all cases the output layer stays the same.

I also tried two types of last output layer:
Dense(2, activation='softmax') and Dense(1, activation='sigmoid') even though I believe this does not affect anything and should be identical (please correct me if I’m wrong).
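On the softmax-versus-sigmoid question: the two heads can represent the same predictions, because a two-way softmax depends only on the difference between its two logits. A quick numeric check in plain Python (no Keras involved; the function names are only for illustration):

```python
import math

def softmax2(z0, z1):
    """Two-class softmax; returns (p0, p1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary logit pairs: softmax P(class 1) should match sigmoid(z1 - z0).
for z0, z1 in [(0.3, 1.7), (-2.0, 0.5), (4.0, -1.0)]:
    p1_softmax = softmax2(z0, z1)[1]
    p1_sigmoid = sigmoid(z1 - z0)
    print(f"{p1_softmax:.6f} {p1_sigmoid:.6f}")
```

The two columns agree. Training runs may still differ slightly, since Dense(2) has twice as many output weights as Dense(1) and they are initialized differently.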

As for optimizers, I tried mostly Adam, but experimented with Nesterov SGD and RMSprop as well.

I tried model ensembles.

The “best” so far was 2x 256 + output layer with Adam and learning rate of 1e-4. But it’s still not good enough…

Every time, I checked validation performance using sklearn’s roc_auc_score, but I found very little correlation between my validation ROC score and the score I achieve on the final (test) dataset.

Could you please advise on what I may be doing wrong, i.e. how can I improve the results? I’ve been fiddling with this dataset for over 6 weeks now and I’m frustrated that the results are not improving… Thank you in advance.

alexandrecc · July 27, 2017, 2:33pm

Hello Daniel,

While waiting for data access for my next project, I also tried this competition.

It is hard to say whether you are overfitting or underfitting the training data without seeing your training/validation loss and accuracy. I think that is the first question you need to ask yourself. Given all the regularization/augmentation techniques you mention, I assume you are overfitting your data. But can you confirm whether your training loss is very low?

One simple piece of advice for this competition: image resolution is very important for test performance. Once you can get a validation/test result above 98% with 224x224 pixel images, try a higher resolution for training and inference.

Personally, I didn’t use roc_auc_score. I don’t think training on a roc_auc_score is really useful (or worth the computation needed): computing it costs a lot more than a cross-entropy loss, and for this problem the correlation between the cross-entropy loss and the roc_auc_score is very high. Maybe for an extremely well performing model there could be a slight improvement from ROC-based training, since the final result is evaluated on the ROC curve rather than on log loss. But you need to understand that you should not manipulate your final predictions with non-linear tricks (if pred > 0.95: pred = 0.99) like Jeremy suggested for the dogs vs. cats competition. That trick can improve the result for a log_loss problem but not for a ROC problem, since the ROC estimate depends only on the ranking of the predictions. You need to preserve this implicit ranking in your submission. To make it clear: if you divide all your predictions by 2, the roc_auc result stays the same, since you don’t change the rank of the predictions.
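The ranking argument can be checked with a small plain-Python sketch. The labels and predictions below are made up, and roc_auc is a hand-written pairwise version written for this illustration, not sklearn’s implementation:

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tied pair counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def clip(p, low=0.05, high=0.95):
    """Clamp a prediction into [low, high], as one would for log loss."""
    return min(max(p, low), high)

# Made-up data: one negative (0.96) sits among the confident positives.
labels = [1, 1, 0, 1, 0, 0]
preds = [0.99, 0.98, 0.96, 0.90, 0.40, 0.10]

auc_raw = roc_auc(labels, preds)
auc_halved = roc_auc(labels, [p / 2 for p in preds])    # ranking unchanged
auc_clipped = roc_auc(labels, [clip(p) for p in preds])  # top scores now tied

print(auc_raw, auc_halved, auc_clipped)
```

In this example the raw and halved predictions give the same AUC, while clipping turns two correctly ordered pairs into ties and lowers it. Clipping does not always lower AUC (a tie can also hide a wrongly ordered pair), but it can only discard ranking information, never add any.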

danielhavir · July 27, 2017, 3:58pm

Hi Alexandre,

Thanks for reaching out! I trained on categorical crossentropy; I only used roc_auc after training to get an idea of the score. My training loss was ~0.12 and validation loss ~0.19. The validation accuracy usually stopped at around 0.93 - 0.94. The most I reached for training accuracy was a little over 0.98. I once tried like a zillion epochs and managed to score even higher training accuracy, but then the validation accuracy dropped below 0.9.

Truth is I clipped the final predictions at 0.93, sometimes 0.95.

alexandrecc · July 27, 2017, 6:05pm

Hi again,

With 224x224 images and VGG16 you should be able to fit the training data with a much lower loss than 0.12. You are probably underfitting the training data. Your first task should be to try to overfit the training data with this architecture, down to a loss under 0.0200. Assuming there is no bug in your code, you can add representational power and depth to your model by training the entire network (all the conv layers included), initialized with imagenet weights, instead of training only the top dense layers. Of course, keep the data augmentation for this goal. Don’t bother too much with the different optimizers or activation functions; you can probably converge/overfit with any of them. Then you can try different dropout values to get a good fit between your training and validation loss. Add some salt and pepper, and the recipe should be good!

If you can get a training loss under 0.0200 (no matter your validation loss), just tell me and I’ll give you some more hints on how to improve!

Try, ask, learn, retry, ask, learn, retry again, ask, learn, over and over.
Forward propagation, loss calculation, Back propagation (gradients), over and over.
That is the way we (humans) and them (machines) learn.

danielhavir · July 27, 2017, 7:05pm

Thanks a lot! I can overfit the data.

I also tried training the whole network (incl. the conv layers) before, but after some 8 or 9 epochs the training loss stops improving at around ~0.79. I’m not using the keras.applications.VGG16 model; I build it in much the same way Jeremy does.

alexandrecc · July 27, 2017, 7:48pm

Ok great. 2 sec per epoch is pretty fast for 7180 images! So I guess that for this example you only trained the top layers on features pre-extracted from the base model. Are you using the keras random data generator, or did you precompute the augmentation? The randomness induced by the keras data generator is usually beneficial.

I am very surprised that you can overfit the data while training only the top layers in about 20 sec and that you cannot overfit with all the layers trained. This is potentially a hint to your problem. Your problem could hide in your base model.

I assume of course that you are initializing your base model weights from an imagenet pretraining file. If you wrote your VGG16 base model from scratch, like Jeremy did, and you are initializing the weights from Jeremy’s h5 file, are you sure your model definition and inputs are exactly the same? Any minor change to the base model definition could alter your results significantly, even if nothing is flagged when loading the weights.

If you are not initializing your weights with a pretrained model, then you should. Multiple hierarchical deep conv layers cannot properly be trained with only 2500 images.

Also, did you try a submission without clipping your predictions? What is your best result without clipping?

danielhavir · July 27, 2017, 7:59pm

Well, first I create the VGG16 model and remove the last Dense layers. Then I run the images through the convolutional layers using ImageDataGenerator with data augmentation, like this:
datagen = image.ImageDataGenerator(rotation_range=30, height_shift_range=0.1, shear_range=0.2, width_shift_range=0.1, zoom_range=0.2, horizontal_flip=True)

batches = datagen.flow_from_directory('train', target_size=(224,224), class_mode='binary', shuffle=False, batch_size=64)

conv_feat = conv_model.predict_generator(batches, batches.nb_sample*3)

I simply obtain the features (The top layers’ training is then much faster as you noticed).
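One detail worth checking in a pipeline like this: with shuffle=False the generator walks the directory in the same order on every pass, so requesting three passes’ worth of samples means the matching labels should be the original label list repeated three times. A plain-Python sketch of that bookkeeping, assuming the labels come from something like batches.classes (the values and the helper name below are made up):

```python
def tile_labels(labels, passes):
    """Repeat per-image labels once per pass over a non-shuffled generator."""
    return list(labels) * passes

# Hypothetical labels in directory order, standing in for batches.classes.
base_labels = [0, 0, 1, 1, 1]
passes = 3

aug_labels = tile_labels(base_labels, passes)

# Augmented sample i came from source image i % len(base_labels),
# so its label must match that image's label.
for i, label in enumerate(aug_labels):
    assert label == base_labels[i % len(base_labels)]

print(len(aug_labels), "labels for", passes, "passes")
```

If the features and labels get out of step here, the top model trains on partly wrong targets, which would show up as a training loss that refuses to drop.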

As for the model: yes, I believe it’s exactly the same (including the preprocess function). Do you prefer keras.applications.VGG16 over recreating it?

I tried submission without clipping and I got 0.96578 which is slightly better than before, although not by much.

P.S.: I double-checked, and the weights are identical between my recreated model and the one implemented in keras.applications.

alexandrecc · July 27, 2017, 8:29pm

If you precomputed and saved the features produced by 1 epoch of the ImageDataGenerator, and then used this fixed data to train the top layers, then you are not really making full use of data augmentation. You basically created just one (or 4-6) random instances of each data variation, and you are reusing them over and over for all the following epochs. That is probably why you can overfit the top layers fairly easily while the validation loss stays high. There was a forum thread about this a while ago: Lesson 3, why can't we pre-compute when using augmantation?

Try training the entire network directly from the data generator, i.e. without precomputed features. The generated data will then be random for each epoch. You can train only the top layers first and then all the layers, to see the difference.

I still don’t exactly understand why you couldn’t overfit when training all the layers. Did you use the image data generator for that full training as well?
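A toy illustration of the difference, in plain Python with no Keras involved. Here augment is a hypothetical stand-in for a random image transform, and the "images" are just numbers:

```python
import random

def augment(x, rng):
    """Hypothetical stand-in for a random image transform: jitter a number."""
    return x + rng.uniform(-0.1, 0.1)

images = [1.0, 2.0, 3.0]
copies = 3   # augmented copies per image
epochs = 5
rng = random.Random(0)

# Precomputed: augment once up front, then reuse the same set every epoch.
precomputed = [augment(x, rng) for x in images for _ in range(copies)]
seen_precomputed = set()
for _ in range(epochs):
    seen_precomputed.update(precomputed)

# On the fly: fresh random variants are drawn in every epoch.
seen_on_the_fly = set()
for _ in range(epochs):
    seen_on_the_fly.update(augment(x, rng) for x in images for _ in range(copies))

print(len(seen_precomputed), "distinct inputs with precomputed augmentation")
print(len(seen_on_the_fly), "distinct inputs with per-epoch augmentation")
```

With precomputed augmentation the model never sees more than images x copies distinct inputs, however many epochs it runs, which makes them easy to memorize. With per-epoch generation the number of distinct inputs keeps growing with the number of epochs.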

danielhavir · July 27, 2017, 8:34pm

Yes, I did. This is the full code:

Even after more epochs, the training loss is still between 0.6 and 0.7.
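For context on these numbers: the cross-entropy of a two-class model that outputs 0.5 for every image is ln 2, about 0.693, so a training loss stuck between 0.6 and 0.7 is close to what an uninformative model would score. A small plain-Python check (the helper name and labels are only for illustration):

```python
import math

def binary_cross_entropy(labels, probs):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)

labels = [1, 0, 1, 1, 0, 0]  # made-up labels

# A model that always outputs 0.5 scores ln(2), whatever the labels are.
chance_loss = binary_cross_entropy(labels, [0.5] * len(labels))
print(f"chance-level loss: {chance_loss:.4f}")

# A mean loss L corresponds to a geometric-mean probability exp(-L)
# assigned to the correct class.
for loss in (0.69, 0.12, 0.02):
    print(f"loss {loss:.2f} -> mean confidence in the true class {math.exp(-loss):.3f}")
```

The loss alone cannot say why the network is stuck, but a value this close to ln 2 suggests that the predictions carry very little information about the labels.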

alexandrecc · July 27, 2017, 9:07pm

Fine, this code isn’t extracting the features and you are training only the top layers. The model definition looks fine. The image data generation looks fine. You could set shuffle to True in the get_batches call for the training batches. You could set a lower learning rate (0.0001 instead of the default 0.001) in the compile call. I guess the vgg_preprocess and get_batches functions are the same as Jeremy’s in vgg16.py. You should be able to overfit the training data with this setting. Try the same code with l.trainable = True.

danielhavir · July 28, 2017, 9:25pm

I did as you said with the learning rate and shuffle. Yet the training loss is still huge; it stops at about ~0.58. I even tried training the whole network, including the conv layers, and that was even worse. I’ve played with it, but nothing really helped.

alexandrecc · August 4, 2017, 2:07am

I just saw that you are doing very well: 21/424! Keep up the good work, Daniel!

danielhavir · August 4, 2017, 9:17am

Thank you so much Alexandre! Your advice helped me immensely.

dukeofyork · August 9, 2017, 8:29pm

i’ve been messing around with this competition for a while as well. i’m wondering how low a loss i should be shooting for? somewhere around 0.01? i’ve tried a bunch of stuff and now im mostly focused on trying to get the loss down on an inceptionv3 model.

the best i can get so far is down to 0.07 training loss by messing around with learning rates, data augmentation, and units in dense layers, batch norm.

this is the model im working with:

if anyone feels inclined to offer any pointers that would be greatly appreciated.

aaronwong · August 9, 2017, 8:41pm

Hey @dukeofyork. I’ve been using the built-in inceptionv3 model and have been able to get the training loss to around .02, with my validation loss bouncing back and forth between .18 and .06, mainly by following the advice given by @alexandrecc. The hint that helped me the most was using higher resolution. When I increased the target size from 224 to 450, I went from validation losses that were stuck around .20 to validation losses in my current range. Finetuning the whole model was also essential to getting the loss down.

danielhavir · August 9, 2017, 8:46pm

Hi,

My advice would be to use only 1 hidden layer instead of 2. For me, 1 hidden layer worked well enough.

Also watch out that InceptionV3 has a default input size of 299x299 (see https://keras.io/applications/#inceptionv3 )

Good luck!

dukeofyork · August 9, 2017, 8:52pm

@aaronwong

so interestingly i already tried this but with a custom built inception net (not inceptionv3). so it looks like ill have a go again but this time with incepv3.

my custom net was not very good for a number of reasons. it was more of a ‘let me try to understand this better’ but it did well enough on the LB.

@danielhavir

i’ll give it a go. usually setting my own shape w/ include_top = False.

thanks for all the feedback guys! having a lot of fun.

alexandrecc · August 10, 2017, 1:35am

@dukeofyork
Your learning rate of 0.001 looks pretty high for the potentially large batch size (64 images/batch?). Try to overfit the training set to under 0.0200 loss before adding any dropout in your top model. Be sure you have enough trainable layers to give the model sufficient representational power to overfit the training data. As I already said, also try higher resolutions. I hope it helps.

dukeofyork · August 10, 2017, 1:48pm

thanks alexandre.