Dogs vs Cats - lessons learned - share your experiences

With just two days to go before the competition finishes, I thought I’d create this thread so that we can all share our experiences with the challenge!

I’m still making minor tweaks and will publish a more detailed write-up once I see to what extent they work, but here is the prevailing thought I wanted to share:

Getting relatively okay-ish results (top 100 on the LB) has proved to have nothing to do with mathematics and nearly everything to do with engineering!!! (or, if one is intimidated by the word engineering, then tinkering :smiley: )

I will post more, as what worked and what didn’t work turned out to be quite unintuitive to me (probably due to the very little experience I had to base my intuition on).

Looking forward to hearing what approach you took and what worked / didn’t work for you!

3 Likes

I share your observations. I would also add that output clipping provided an immense boost to my early models (I went from top 57% to top 14%, and that without batch norm, dropout, or image augmentation).
I also noticed that retraining the last convolutional layer didn’t really do much; most of the gains came from fine-tuning the dense layers.
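In case anyone hasn’t tried it, here is a minimal sketch of what I mean by output clipping (the 0.05 / 0.95 bounds below are just an illustration, not necessarily the values to use):

```python
import numpy as np

def clip_predictions(preds, lo=0.05, hi=0.95):
    """Clamp P(dog) away from 0 and 1 before submitting.

    Log loss punishes confident wrong answers extremely harshly, so giving
    up a little certainty on the correct answers caps the worst-case penalty
    on the wrong ones and usually improves the leaderboard score.
    """
    return np.clip(preds, lo, hi)

# e.g. preds holds the model's P(dog) for each test image
preds = np.array([0.999, 0.001, 0.73])
print(clip_predictions(preds))  # -> [0.95 0.05 0.73]
```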

And not directly related, but setting up the AWS server and having a K80 working for me cheaply turned out to be far easier than I thought. I had some minor issues to fix (reducing the Keras progress bar update interval so that Jupyter doesn’t crash), but overall it didn’t require as much work as I expected.

I’m reading the Deep Learning (Goodfellow et al.) textbook, and I’m finding it interesting to understand the theoretical aspects of what we are doing in the course, but I do find the hands-on approach much better than the math-first approach.

1 Like

I second your observations! :slight_smile: BTW, how are you setting the update interval on the progress bar? I went with verbose=2.

1 Like

That’s what I tried first, but I want my progress bars!
So first I tried keras-tqdm, but I ended up fixing it using this PR: https://github.com/fchollet/keras/pull/5165
I set my interval to 2 seconds. It’s a bit hackish, but it works.

1 Like

Just fine-tuning the last layer of the Vgg16 model got me into the top 11%. Surprisingly, I see no further score improvement even after trying other, more advanced techniques.

Did any of you guys get better results with the other techniques mentioned in the course?

I had a similar experience: the first try got the best score. Adding some or all of the advanced techniques from lessons 2 and 3 only made the score worse. I keep hearing that DL models are more of an art, and I’m starting to believe it.

Hi Manoj, did you follow the fine-tuning in the dogs_cats_redux notebook, or the method used in the Lesson 2 IPython notebook? I’m only getting to around the top 24% and am wondering if I’m doing something wrong that’s keeping me from better results.

Cheers!

I am not sure about the differences between the two notebooks.

I just popped off the last layer and added a new Dense layer, and then fine-tuned the entire model. I tried out different learning rates and numbers of epochs. Try playing with the learning rates.
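Roughly, the idea is something like this (a minimal sketch using the stock Keras applications API rather than the course’s vgg16.py wrapper; the paths, batch size, learning rate, and epoch count are placeholders):

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator

# Take the stock ImageNet VGG16 and swap the 1000-way output
# for a 2-way dogs-vs-cats softmax.
base = VGG16(include_top=True, weights='imagenet')
x = base.layers[-2].output               # output of the last 4096-unit FC layer
out = Dense(2, activation='softmax')(x)  # new final layer
model = Model(inputs=base.input, outputs=out)

# Fine-tune the whole model with a small learning rate.
model.compile(optimizer=Adam(lr=1e-5), loss='categorical_crossentropy',
              metrics=['accuracy'])

gen = ImageDataGenerator()
train_batches = gen.flow_from_directory('data/train', target_size=(224, 224),
                                        batch_size=64)
valid_batches = gen.flow_from_directory('data/valid', target_size=(224, 224),
                                        batch_size=64)
model.fit_generator(train_batches,
                    steps_per_epoch=train_batches.samples // 64,
                    validation_data=valid_batches,
                    validation_steps=valid_batches.samples // 64,
                    epochs=3)
```

From there it was mostly experimenting with the learning rate and number of epochs.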

Also, look over the ensemble notebook - ensembling gives better results.

I agree. I think I am missing some crucial insights into how to apply these techniques well.

I reached the top 6% on the public leaderboard. I used VGG16, effectively replacing the FC layers and training them from scratch with batch normalization. I also fine-tuned all the layers.

What worked

The process that @jeremy describes is golden
This is the key - start simple, overfit, see what you can do with data augmentation, and add regularization if needed. I think this approach / skillset is what Andrew Ng is referring to in his Nuts and Bolts of Applying Deep Learning presentation. He mentions that controlling bias / variance is something that even experienced practitioners struggle with, and something that makes the real difference. To me it certainly felt like magic how effective that approach was.

Data augmentation is cheating
With the FC layers set to trainable, my model was quickly overfitting. However, I ended up using what felt to me like a lot of data augmentation, and it turned out I didn’t need additional regularization! (Maybe it could have used a little, but I didn’t have time to add it.) Experiments with small sample sizes were super cool - with 500 training and 250 validation images I was able to observe how data augmentation / regularization literally gives the model the ability to learn things that generalize.
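For reference, the kind of augmentation I mean is just Keras’s ImageDataGenerator; the parameter values below are illustrative starting points, not the ones I settled on:

```python
from keras.preprocessing.image import ImageDataGenerator

# Random, label-preserving transforms applied on the fly to every batch.
aug_gen = ImageDataGenerator(
    rotation_range=10,       # small random rotations
    width_shift_range=0.1,   # horizontal shifts (fraction of width)
    height_shift_range=0.1,  # vertical shifts (fraction of height)
    zoom_range=0.1,          # random zooming in / out
    shear_range=0.1,         # small shears
    horizontal_flip=True)    # mirrored cats and dogs are still cats and dogs

train_batches = aug_gen.flow_from_directory('data/train',
                                            target_size=(224, 224),
                                            batch_size=64)

# No augmentation on the validation set - we want to measure performance
# on the images as they actually are.
valid_batches = ImageDataGenerator().flow_from_directory(
    'data/valid', target_size=(224, 224), batch_size=64, shuffle=False)
```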

Batchnorm is a performance enhancing drug for neural nets
I don’t think I ever want to build a model without it!
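Concretely, replacing VGG16’s FC block with a batchnorm version looks roughly like this (a sketch assuming the Keras applications API; the layer sizes are placeholders for my actual head):

```python
from keras.applications.vgg16 import VGG16
from keras.layers import BatchNormalization, Dense, Flatten
from keras.models import Model

# Keep VGG16's convolutional stack, drop its FC top, and bolt on a new
# fully connected head trained from scratch, with batchnorm after each
# dense layer.
conv_base = VGG16(include_top=False, weights='imagenet',
                  input_shape=(224, 224, 3))

x = Flatten()(conv_base.output)
x = Dense(4096, activation='relu')(x)
x = BatchNormalization()(x)
x = Dense(4096, activation='relu')(x)
x = BatchNormalization()(x)
out = Dense(2, activation='softmax')(x)

model = Model(inputs=conv_base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```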

What didn’t work

Complexity everywhere
I ended up with some code doing something over here, then I had to reconstruct the model so I reran that bunch of cells, then there was this method I had created in utils.py - and soon enough I sort of knew what I was doing in the moment, but it wasn’t necessarily well structured. This approach doesn’t scale and will never get me to a good solution. It seems a big part of getting good results is engineering / tinkering, and with so many things flying around, the overhead becomes unmanageable. Same with naming saved weights and knowing what is where, etc. For the next Kaggle competition I need to work out a cleaner pipeline - maybe start using collapsible cells in Jupyter notebooks, move more things out into separate scripts, and somehow preserve what worked so it doesn’t get lost in the sea of trying things out. I think this and model ensembling would be the keys to improving my results, even more so than learning new and more powerful techniques / architectures.

Horrible zombie brain and tiny improvements
‘Ooo, I wonder what this epoch will bring… hmm, it’s only 6 minutes… let me browse those pictures of funny cats or ponder what the progress bar is doing while I wait. Ah, what the heck, let me run it for another epoch, maybe this will help.’ Horrible, horrible, horrible time wasted. What’s even worse is chasing another 0.01 decrease in validation loss by running the training overnight with a tiny learning rate… This is a highway to lost productivity and to overfitting the validation set by trying too hard, and yet it did help a bit with my public leaderboard standing… Very dissatisfying.

Pseudo-labeling not so hot with a lot of data augmentation
I think that at the point when I wanted to add pseudo-labeling, there was not a lot of spare capacity in my model. For one reason or another I was not seeing any improvements on the validation set. I think this reflects more on my particular setup at the time than on pseudo-labeling as such, and should I have more time in the future I will definitely try to apply it again - this needs more experimenting on my side.
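For anyone who hasn’t seen it, the basic recipe boils down to something like this (a sketch; the helper name, arguments, and epoch count are placeholders):

```python
import numpy as np

def pseudo_label_fit(model, x_train, y_train, x_unlabeled,
                     batch_size=64, epochs=1):
    """Mix the model's own predictions on unlabeled images into training.

    The predictions are treated as (soft) labels for the unlabeled data.
    The share of pseudo-labeled examples is usually kept modest relative
    to the real labels.
    """
    pseudo = model.predict(x_unlabeled, batch_size=batch_size)
    x_mixed = np.concatenate([x_train, x_unlabeled])
    y_mixed = np.concatenate([y_train, pseudo])
    model.fit(x_mixed, y_mixed, batch_size=batch_size, epochs=epochs)
    return model
```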

Summary

Deep learning is not what I expected. It is not so much about sitting in a corner reading a math book, but much, much more about figuring out how to configure an AWS instance, how to use the shell / Python to move files around, etc. I also feel that moving from the top 6% to the top 3% will be a lot harder, but the quickest way to get there is not by studying more math (which I would love to do anyhow) but by building models and experimenting. I would like to learn more math and know how to do it, but at this point in time I have only very little time (relatively) that I can give to deep learning related activities, so I need* to spend my time bucks where they can bring the most value :wink:

* yeah yeah @radek we all know you are only saying this but we know your mind is already scheming how to learn more math

20 Likes

As we speak I’m in the top 5% of the competition. Here are a few things I did:
a) Used Resnet50 instead of VGG16/19. VGG19 put me in the top 20% without BN, DA, or the other lesson 3 techniques. I set the target size to (400,400), as we did in lesson 7 (see the sketch after this list).
b) Data augmentation didn’t really help push my score beyond a point. In fact, my ResNet was behaving super weird when fitting on augmented data. Did anyone else face this?
c) Clipping is an art and very important. I tried a whole range of clipping values, and clipping at 0.02 and 0.98 is what led to my final score.
d) Adjusting hyper-parameters like the learning rate, dropout value, etc. also pushed me further up the leaderboard, but only after an insane number of tries.
e) I figured this competition was fairly easy because, as Jeremy mentioned, the cats and dogs occupy almost the entirety of the image, so just using pre-trained networks gave us really nice results. Fisheries will be harder, I guess.
f) I keep hearing from folks that most of the top solutions use xgboost and that it’s pretty popular with Kagglers. I’m guessing we’ll learn more about it in future classes.
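For a), the setup looks roughly like the sketch below (assuming the Keras applications API; the head, optimizer, and epoch count are placeholders rather than my exact configuration):

```python
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

# With include_top=False the pretrained ResNet50 accepts larger inputs,
# so we can feed it 400x400 images as in lesson 7 and add our own head.
base = ResNet50(include_top=False, weights='imagenet',
                input_shape=(400, 400, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(2, activation='softmax')(x)
model = Model(inputs=base.input, outputs=out)

# Freeze the pretrained body and train only the new head first.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

batches = ImageDataGenerator(preprocessing_function=preprocess_input) \
    .flow_from_directory('data/train', target_size=(400, 400), batch_size=32)
model.fit_generator(batches, steps_per_epoch=batches.samples // 32, epochs=2)
```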

7 Likes

HAHAH. I totally get it when you talk about running epochs over and over again expecting things to improve by 0.0010 or so. It helped in certain cases, but mostly it was just not helping. One of the things I learned by spending a lot of hours on this competition was, sort of, an intuitive understanding of when a model has reached its “peak” and when to move on. If I see the validation set accuracy jumping all over the place, I would rather try something new.

And yes, every 1% after reaching the top 10% seems harder, and sometimes it just baffles me when I see the gulf between positions 1 and 2 on the leaderboard. Especially in Cats vs Dogs Redux. I hope whoever is in first place puts out their code - it would greatly help folks like us.

1 Like

Haha, just when I felt good about my 8%. :slight_smile:

I used a boring method to improve my score:

I trained the same network 3x on randomly shuffled splits (0.9/0.1) of the data and averaged the predictions.

I tried this using VGG19, InceptionV3, ResNet50, and Xception (all of the built-in Keras models).

The best score was with ResNet50. I also tried averaging all of them together, which performed slightly worse than ResNet alone.
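In sketch form, the boring method is basically this (a hypothetical helper, not my exact code; the build_model factory and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def average_over_random_splits(build_model, x, y, x_test,
                               n_runs=3, val_frac=0.1):
    """Train the same architecture on several random 0.9/0.1 splits of the
    data and average the test-set predictions."""
    all_preds = []
    for seed in range(n_runs):
        x_tr, x_val, y_tr, y_val = train_test_split(
            x, y, test_size=val_frac, random_state=seed)
        model = build_model()                      # fresh model each run
        model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
                  batch_size=64, epochs=3)
        all_preds.append(model.predict(x_test, batch_size=64))
    return np.mean(all_preds, axis=0)              # simple average of the runs
```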

1 Like

Since this dataset is open, it’s always possible someone just went through and manually labeled cats and dogs.

IMO, top 8% is awesome! It shows how powerful the techniques taught in this course are.
Someone else on this forum mentioned averaging the predictions across all of these architectures and reaching the top 5% in the Fisheries competition. I would definitely like to try that out.
As for manually labelling the dataset - why anyone would do that and risk getting themselves disqualified is beyond me, but it is still possible.

1 Like

I think averaging predictions from the same model on different splits of the data was the right move, but if I had had more submissions (and had kept a holdout set), I would have wanted to try something more advanced to combine the different models’ results - maybe stacking?

It might be useful to cover ensembling in one of the lectures in part II.
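In its simplest form, stacking could look something like this (a scikit-learn sketch; the choice of meta-learner and the argument names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_predictions(holdout_preds, holdout_labels, test_preds):
    """Fit a simple meta-learner on the base models' holdout predictions.

    holdout_preds / test_preds: lists of 1-D arrays with P(dog) from each
    base model; holdout_labels: the true 0/1 labels for the holdout set.
    """
    X_holdout = np.column_stack(holdout_preds)   # one column per base model
    X_test = np.column_stack(test_preds)
    meta = LogisticRegression()
    meta.fit(X_holdout, holdout_labels)
    return meta.predict_proba(X_test)[:, 1]      # stacked P(dog) for the test set
```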

1 Like

I had a similar method and score, although now I’ve dropped down to 9% with the surge of last minute entries.

I used 9-fold splits and averaged across a VGG16 configuration with no dropout. I tried that early on, and it was the best I ended up with. I also tried ResNet50 but didn’t have much luck with that model, although I didn’t try the 400x400 size.

Like @radek I totally got caught up with the ‘just one more epoch’ attempts, and also found that I spent a lot more time fiddling with the clipping than I probably should have.

I also tried taking an earlier layer’s outputs (the first 4096-unit dense layer) and feeding them into xgboost to see if it could come up with a better model. The model it built claimed 99.54% accuracy on 5-fold cross-validation, but when I submitted, the results weren’t as good as my ensemble. I’m not sure what went wrong there.
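The general pattern I followed was roughly this (a sketch; it assumes a trained Keras model and xgboost installed, and the layer index and xgboost parameters are placeholders):

```python
import numpy as np
from keras.models import Model
from xgboost import XGBClassifier

def fit_xgb_on_dense_features(keras_model, x_train, y_train, x_test,
                              feature_layer_index=-3):
    """Use an intermediate dense layer of a trained Keras model as a
    feature extractor and fit xgboost on top.

    feature_layer_index=-3 is a placeholder for 'the first 4096-unit
    dense layer' of a VGG16-style network.
    """
    extractor = Model(inputs=keras_model.input,
                      outputs=keras_model.layers[feature_layer_index].output)
    train_feats = extractor.predict(x_train, batch_size=64)
    test_feats = extractor.predict(x_test, batch_size=64)

    clf = XGBClassifier(n_estimators=200, max_depth=4)   # placeholder params
    clf.fit(train_feats, y_train)                        # y_train: 0/1 labels
    return clf.predict_proba(test_feats)[:, 1]           # P(dog)
```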

Looking forward to taking these learnings and moving on to the fisheries competition.

1 Like

Came across this awesome post while looking up tips for making ensembles:

http://mlwave.com/kaggle-ensembling-guide/

Has a link to a really cool Hinton lecture on “Dark Knowledge” as well:

http://www.ttic.edu/dl/dark14.pdf

2 Likes

@davecg

Nice find! Here’s the dark knowledge talk to go with the notes.

2 Likes

I took a closer look at how correlated the predictions from each of the models were going into the ensemble (3x repetitions per model). There isn’t a lot of variance between repetitions, so there are probably diminishing returns there for the ensemble, but there is definitely something to gain from using multiple models.

Same approach brought me to top 10% of Fisheries. :slight_smile:
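For anyone wanting to check this on their own ensemble, it’s just a few lines of numpy (a sketch; preds_by_model is a hypothetical dict mapping each model’s name to its array of test predictions):

```python
import numpy as np

def prediction_correlations(preds_by_model):
    """Pairwise Pearson correlations between models' P(dog) predictions.

    preds_by_model: hypothetical dict of {model name: 1-D array of
    predictions for the same test images, e.g. already averaged over
    each model's 3 repetitions}.
    """
    names = sorted(preds_by_model)
    stacked = np.vstack([preds_by_model[n] for n in names])
    return names, np.corrcoef(stacked)

# Correlations near 1.0 mean the models make the same mistakes, so
# averaging them adds little; lower correlation is where an ensemble
# actually helps.
```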

4 Likes