Dogs vs Cats - lessons learned - share your experiences

Thank you guys for sharing!

I have a general question about validation - how do you guys make your validation data?

For statefarm, it requires a little more work to keep drivers separate between the training and validation sets.

For dogscats and fisheries, do you guys just make a one-time setup to make training and validation folders?
Or do you use something like Keras’ validation_split during training?

Thanks!

I just used the simplest approach possible :slight_smile: At the beginning I randomly chose 10% of the training set and used it as my validation set.

I use sklearn’s train_test_split on every category’s folder and copy from there into valid and train folders (I keep the original data untouched as train_orig); then in my for-loops I copy a fraction of the data into a sample folder with the same train/valid/test structure. I use 10% of the original data for the sample and a 75% or 85% training/validation split.

import os
import shutil
from glob import glob              # needed for the glob() calls below
import numpy as np                 # needed for the random test sample
from sklearn.model_selection import train_test_split
#path=os.path.realpath('')+'/'+path
# cats/dogs: `path` must point at the data directory, with the originals in train_orig/
pre_run=0          # set to 1 to (re)build the train/valid/sample folders
samplesize=0.1     # fraction of the data copied into the sample set
prop_train=0.75    # training share of the train/valid split


if(pre_run==1):
    # start from a clean slate, keeping train_orig untouched
    shutil.rmtree(path+'sample',ignore_errors=1)
    shutil.rmtree(path+'valid',ignore_errors=1)
    shutil.rmtree(path+'train',ignore_errors=1)
    os.mkdir(path+'sample')
    os.mkdir(path+'valid')
    os.mkdir(path+'train')
    os.mkdir(path+'sample/train')
    os.mkdir(path+'sample/test')
    os.mkdir(path+'sample/valid')
    dirs=glob(path+'train_orig/*')
    for i in dirs:
        subdir=i.split('/')[-1]
        os.mkdir(path+'sample/train/'+subdir)
        os.mkdir(path+'train/'+subdir)
        os.mkdir(path+'sample/valid/'+subdir)
        os.mkdir(path+'valid/'+subdir)
        # shuffled train/valid split of the files in this class folder
        train,valid=train_test_split(os.listdir(i),train_size=prop_train,random_state=42)
        count=0
        for j in valid:
            # the first ~10% of the (shuffled) validation files also go into the sample set
            if count<=samplesize*len(valid):
                shutil.copy(i+'/'+j,path+'sample/valid/'+subdir+'/'+j)
                count+=1
            shutil.copy(i+'/'+j,path+'valid/'+subdir+'/'+j)
        count=0
        for j in train:
            if count<=samplesize*len(train):
                shutil.copy(i+'/'+j,path+'sample/train/'+subdir+'/'+j)
                count+=1
            shutil.copy(i+'/'+j,path+'train/'+subdir+'/'+j)
    # random subset of the test images for the sample set
    test_imgs=glob(path+'test/*')
    np.random.seed(42)
    sample_test=np.random.permutation(test_imgs)[:round(len(test_imgs)*samplesize)]
    for i in sample_test:
        shutil.copy(i,path+'sample/test/'+i.split('/')[-1])

Thanks for your replies, guys.

In the case of, e.g., fisheries, do you give your validation set the same distribution of classes as the training set?

E.g. 10% of each class, or would we want a more random distribution so as not to overfit?

I am having trouble telling when/if my models are overfitting to the training/validation data.
E.g. for fisheries, using Dense(256, activation='relu') in the FC layers, I find that validation loss goes down much faster than training loss. Could it be that my model is overfitting to the validation data?

RE: Keras validation_split:

I looked at Keras’ validation_split argument for model.fit(), but the FAQ clearly states that the validation data is held out as the last X% of the data, before any shuffling (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).
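
For what it’s worth, a minimal sketch of the concern (x, y and model here are hypothetical, assumed to already exist in memory): since validation_split takes the last fraction of the data before Keras does any shuffling, you may want to shuffle the arrays yourself first if they are ordered by class.

import numpy as np

# x, y: in-memory arrays, possibly ordered by class - shuffle before splitting
idx = np.random.permutation(len(x))
x, y = x[idx], y[idx]
# the last 10% (after our manual shuffle) is held out as validation data
model.fit(x, y, validation_split=0.1)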

It is impossible to overfit the validation set while having poorer results on the training set. The reason for the loss discrepancy you describe is that the loss on the training set includes additional loss from regularization (L1/L2 penalties, loss from dropout, etc.), so having good performance on the validation set is a good thing :slight_smile:

Just to test your sanity (and mine ;)), you might want to take your model and run model.evaluate_generator on your train set. This should give you a lower loss than you see during training (I believe, at least). The data will still be the same, but the loss will be calculated the way it is calculated on your validation set during training (or at least I suppose so).
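
If it helps, here is a minimal sketch of that check, assuming a Keras 1-style API and that path and model already exist (adjust target_size to your model; in Keras 2 the second argument to evaluate_generator is the number of batches, steps, and the attribute is samples rather than nb_sample):

from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator()
train_batches = gen.flow_from_directory(path + 'train', target_size=(224, 224),
                                        batch_size=64, shuffle=False)
# evaluate the training data with the same (inference-time) loss used for validation
print(model.evaluate_generator(train_batches, train_batches.nb_sample))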

Scikit-learn has a bunch of great CV tools, including ShuffleSplit and KFold, plus stratified versions of both (i.e. keeping class distributions equal).

I did something slightly hacky to make things work without tweaking the Keras image data generator - created file lists with sklearn and then created different folders (i.e. cv_0, cv_1, etc) each containing their own respective train and validation folders.

I put links to the original files (os.symlink) and then used the Keras generator with follow_links=True.
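
Not the poster’s actual code, but a hedged sketch of what that approach might look like: build cv_0, cv_1, … folders of symlinks with sklearn’s StratifiedKFold, then point Keras’ flow_from_directory at each fold with follow_links=True. Folder names and paths here are assumptions.

import os
from glob import glob
from sklearn.model_selection import StratifiedKFold

def make_cv_folders(src_dir, dst_root, n_splits=5):
    # collect file paths and their class labels from the class subfolders of src_dir
    files, labels = [], []
    for class_dir in glob(os.path.join(src_dir, '*')):
        for f in glob(os.path.join(class_dir, '*')):
            files.append(f)
            labels.append(os.path.basename(class_dir))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(files, labels)):
        for split, idx in (('train', train_idx), ('valid', valid_idx)):
            for i in idx:
                dst = os.path.join(dst_root, 'cv_%d' % fold, split, labels[i])
                if not os.path.exists(dst):
                    os.makedirs(dst)
                os.symlink(os.path.abspath(files[i]),
                           os.path.join(dst, os.path.basename(files[i])))

Each fold can then be used like any other directory of images, e.g. gen.flow_from_directory(dst_root + '/cv_0/train', follow_links=True, ...).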

That’s a really clever approach. I thought about doing something similar but didn’t know how to do the linking. Do you mind sharing your code? That seems like a really useful technique. I really don’t like the way the keras generator acts only on directories, as it makes data manipulation for k-fold really difficult. This seems like an elegant solution.

The talk and lecture are awesome, and I love the idea, but I’m not sure I have the chops to implement this, and in digging around I can’t seem to find any examples of anyone using this method that they’ve shared online. Do you know of any examples?

I looked into maybe modifying the softmax layer of Keras to include temperature and I think I could handle that part of it, but I’m a little shaky on the later backpropagation steps and the changes you need to make to the bias.

Overall the concept seems really powerful, particularly for knowledge transfer learning, but I can’t quite figure out the modifications necessary to train the new net so that it’s optimizing the logits of the softmax layer.
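
In case it helps, here is a minimal numpy sketch of a temperature-scaled softmax (my own illustration, not code from the course or the talk): raising the temperature T softens the distribution and exposes the relative probabilities of the non-maximal classes, which is the “dark knowledge” used as soft targets.

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax_with_temperature([10.0, 5.0, 1.0], T=1))   # nearly one-hot
print(softmax_with_temperature([10.0, 5.0, 1.0], T=5))   # much softer distribution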

Well - this is my first run at deep learning - I did manage to submit to Kaggle for the Dogs and Cats redux - but I missed the end of the competition by 2 days.

Kaggle still gives me a public score, though I don’t believe it gets shown on the leaderboard. I came in at 0.10643, which I believe would have been in the top third. Woot!

This first lesson has been a ton of learning - can’t wait to keep digging in. The biggest challenges for me are figuring out the Command Line Interface (though I am starting to feel like a pro at tmux…at least in my own mind) and understanding python commands. (C++ is what I normally code in, and I am only so good at it).

But the experience has been a blast.

On a side note, there were only about 1300 participants in the Kaggle dogs vs cats redux competition. Not as many as I had expected (though I am not sure what I expected).

In any case - thanks to everyone who built this course and has contributed here - what fun!

In the course I show an end to end process for MNIST that includes both ensembling and pseudo-labeling using “dark knowledge”.

+1 would really love to see your code for doing this.

Does this allow validation data to be re-randomized when you run different experiments?

In the course I show an end to end process for MNIST that includes both ensembling and pseudo-labeling using “dark knowledge”.

Sorry @jeremy, I ran through a number of the lectures again looking for the section you’re talking about. The closest I could find was the start of lecture 6, where you talk about ensembling and pseudo-labeling, and I checked the mnist code, but it doesn’t contain what I’m referring to.

My understanding after watching Hinton’s talk on dark knowledge is that what he refers to as ‘dark knowledge’ is the vector that results from shifting a softmax layer’s outputs via a temperature so that the relationships between objects are much clearer. The vectors he shows at around 11:35 in the lecture are the idea I’m driving at. By training a new net on those soft predictions and a subset of hard targets he’s able to get some very interesting results.

I think there’s a chance we’re talking about different things unless I’m misunderstanding.

I spent about two weeks on this competition and learned a lot; my final score is 0.05051, placing 67th, close to the top 5%. The tools I used are dlib, keras and mxnet.

What I learned from this competition is:

1 : Ensembling may make your results worse
2 : Remember to record the parameters you used; an Excel-like editor is a nice tool for this
3 : Feeding pseudo-labels into the mini-batch in a naive way does not work (I should have finished lesson 4 before rushing it, even though I was running out of time)
4 : Leveraging a pretrained model is a much easier way to get good results
5 : How to use dlib, keras and mxnet
6 : Read the posts on the forums; they may give you useful info
7 : The fast.ai course is awesome, I should have watched it earlier (just finished lesson 4)

-------------Work approach--------------------

a : dlib

1 : Split the data into 5 folds with augmentation (5x). I did not figure out which augmentation tricks work best; however, vertical augmentation looks like a bad choice
2 : Extract features on the training and test data with dlib's resnet34, and store them
3 : Predict the labels with different combinations of the k-fold models
4 : Submit; the score is 0.06266
5 : Clip the values to [0.02, 0.98]; this improved the score to 0.05688 (see the sketch after this list)
6 : Validating with random crops might improve accuracy, but I had no time to try it out
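
As a side note on step 5, here is a minimal sketch of that clipping trick (my own illustration, with a hypothetical preds array): clipping predicted probabilities away from 0 and 1 limits the log-loss penalty for confidently wrong predictions.

import numpy as np

# hypothetical predictions, shape (n_samples, n_classes)
preds = np.array([[0.999, 0.001],
                  [0.300, 0.700]])
clipped = np.clip(preds, 0.02, 0.98)   # -> [[0.98, 0.02], [0.30, 0.70]]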

b : mxnet

I re-entered this competition with only 5 or 6 days left, so I was in a hurry; the solutions I tried with mxnet and keras are less sophisticated than the dlib one

1 : Fine-tune resnet34~200 on the dataset with augmentation, no k-fold cross validation; did not figure out the best way to augment the data.

2 : Ensemble all of the results of the models, including the dlib results; this improved my score to 0.05051
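
Not the poster’s actual code, but a minimal sketch of that kind of ensembling with hypothetical prediction arrays: average the per-class probabilities across models, then apply the same clipping as before.

import numpy as np

# hypothetical per-model predictions, each of shape (n_samples, n_classes)
preds_dlib  = np.array([[0.90, 0.10], [0.20, 0.80]])
preds_mxnet = np.array([[0.95, 0.05], [0.40, 0.60]])
ensemble = np.mean([preds_dlib, preds_mxnet], axis=0)   # simple average across models
ensemble = np.clip(ensemble, 0.02, 0.98)                # same clipping trick as above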

-------------Non-working approach--------------------

1 : I trained different models with dlib and ensembled them, but this gave me worse results. The steps were:

a : Extract augmented features with resnet34 and store them
b : Train k-fold models with the extracted features and different "top models"
c : Ensemble the results
d : Clip values to [0.02, 0.98]
e : Get worse results :frowning:

--------------My views on the libraries (biased)------------------

1 : keras

pros : easiest to use, lots of nice examples out there
cons : hard to extend (I want to change the way the data is fed into mini-batches); maybe it is because I am not an expert in Python yet. Learning a new language is very easy, but becoming an expert in it is another story.

2 : mxnet

pros : more pretrained models
cons : Documentation and examples are not that good; some (many) examples are outdated. I cannot yet figure out the correct way to find the number of layers or to freeze the learning rate of the base layers (I implemented these but am not sure they are correct).

3 : dlib

pros : can work as a zero-dependency lib, easy to port to different platforms; a library designed to solve real-world problems and app development rather than prototyping or academic use. Nice documentation, examples, and high-quality source code (this is what modern C++ :slight_smile: looks like).

cons : Has only one pretrained model (resnet34), a small community, and lacks lots of features found in the deep learning world. Since it is new, we can expect more features to be added in the future.

ps : I may be biased toward dlib because it is written in my favorite language, C++

Thank you for this idea @Even - works like a charm. I think it even works without follow_links = True in the generator (unless it is a default value as I didn’t have to set it).

Good point - I think of ‘dark knowledge’ as referring in general to the idea of training a neural net using the full set of predictions as the target, rather than just the predicted class. That’s what we do when we do pseudo-labeling in the lessons.

I’m not aware of shifting the layer’s outputs via a temperature being important - although I’m not sure I’ve seen a direct comparison.

Hey, I also created some code to automate the creation of test/sample folders

import os
import random
import shutil

def organize_folder(folder):
    """Move flat files like cat.1.jpg / dog.2.jpg into per-class subfolders."""
    _, _, filenames = next(os.walk(folder))
    unique_classes = {filename.split(".")[0] for filename in filenames}
    for _class in unique_classes:
        path = os.path.join(folder, _class)
        if not os.path.exists(path):
            os.makedirs(path)
        for filename in filenames:
            if filename.startswith(_class):
                shutil.move(os.path.join(folder, filename), os.path.join(path, filename))        
    
def create_sample_folder(_from, to, percentage=0.1, move=True):
    """Move (or copy) a random `percentage` of each class subfolder of `_from` into `to`."""
    if not os.path.exists(to):
        os.makedirs(to)
    _, folders, _ = next(os.walk(_from))
    for folder in folders:
        if not os.path.exists(os.path.join(to, folder)):
            os.makedirs(os.path.join(to, folder))
        _, _, files = next(os.walk(os.path.join(_from, folder)))
        sample = random.sample(files, int(len(files) * percentage))
        for filename in sample:
            if move:
                shutil.move(os.path.join(_from, folder, filename), os.path.join(to, folder, filename))
            else:
                shutil.copyfile(os.path.join(_from, folder, filename), os.path.join(to, folder, filename))

I used organize_folder to create the two class folders for the dogs and cats competition; I haven’t found a use for it in other competitions yet.

create_sample_folder was what I used to create the sample/test/validation folders; it has served me pretty well so far.
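
For anyone wanting to try it, a hypothetical usage example (the paths are assumptions, adjust them to your layout):

organize_folder('data/dogscats/train')   # cat.*/dog.* files -> cat/ and dog/ subfolders
create_sample_folder('data/dogscats/train', 'data/dogscats/valid', percentage=0.2)                      # move 20% to valid
create_sample_folder('data/dogscats/train', 'data/dogscats/sample/train', percentage=0.1, move=False)   # copy 10% into a sample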

Wow, thanks @tham for the write-up. It’s a great result; thank you for sharing your workflow.

I’ve started experimenting with Resnet50 (Keras’ builtin model). Can you talk about the optimizer you used, and what kind of learning rate, decay, and momentum you tried?

Thanks,

Jerry

Sorry for my late reply; recently I have been spending my time on the videos and lectures of fast.ai.

Yes, I did not have much time to tune the parameters; almost every keras model uses the same settings.
Because I was running out of time, I trained on the whole training data set and did not split it into training and validation sets.

optimizer = adam
learning rate = 0.0001
momentum = default value

my top model looks like

top_model = Dense(128, activation='relu')(top_model)
top_model = Dropout(0.5)(top_model)
top_model = Dense(256, activation='relu')(top_model)
top_model = Dense(classes, activation='softmax')(top_model)
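
A minimal sketch of how those settings might be plugged in (my own illustration; model here is the combined model shown further down, and the import assumes Keras’ standard optimizers module):

from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])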

However, keras did not improve my results; mxnet did

@tham thanks for replying. Resnet50 doesn’t have the 2 dense layers like VGG does; are you referring to VGG in this example?

Thanks,

Jerry

It is resnet50; what I did is slap the resnet and the dense layers together.

from keras.applications.resnet50 import ResNet50
from keras.layers import Input
from keras.models import Model

base_model = ResNet50(include_top=False, weights='imagenet', input_tensor=Input(shape=(im_dim, im_dim, 3)))
top_model = create_top_model(base_model, top_model_index=2, classes=2)  # the poster's own helper that builds the FC block shown earlier
# Slap the model and FC block together and compile
model = Model(input=base_model.input, output=top_model)