How to create the State Farm comp validation set?


Could anyone explain how they created the validation set for State Farm?
As I understand it, drivers that appear in the training set should not appear in the validation set.

Is that correct? How is that possible? I will end up removing too many pictures from my training set.

I'd appreciate it if anyone could clarify this for me.

Thank you!


I think it has a lot to do with making the validation set as close to the test set as possible. You can study how the data is organized and then write a script that moves roughly 20% of the training set into the validation set.

Check out this ipynb that could give you more ideas.


This notebook is really helpful. Thanks @karthik_k314. Quick question. This chooses 5 random drivers and moves all of their data to the validation set. Is there any benefit to this strategy over just randomly choosing 20% of the images from the entire corpus of training images to move over?

The thing is that no driver should appear in both the validation set and the training set. Making sure the same image is not in both sets is not sufficient.

The idea is that if the same driver (with a different distraction) appears in both sets, it will be “easier” for the trained model to make predictions on that driver in the validation set, even if the distraction is different. If you separate the drivers completely between the two sets, you make sure the validation score reflects how well the model predicts for a driver it has never seen before.
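
To make the driver-wise split concrete, here is a minimal sketch. It is not the notebook's code: the function name `split_by_driver` and the toy rows are made up for illustration, and it assumes each record is a `(subject, classname, img)` tuple, where `subject` is the driver ID (on the real data you would presumably read these from the competition's driver list file).

```python
import random

def split_by_driver(rows, n_val_drivers=5, seed=0):
    """Split (subject, classname, img) rows so that no driver is in both sets.

    Hypothetical helper for illustration; `subject` is assumed to be the driver ID.
    """
    rows = list(rows)
    # Unique driver IDs, sorted so the result is reproducible for a given seed.
    drivers = sorted({subject for subject, _, _ in rows})
    rng = random.Random(seed)
    # Pick whole drivers, not individual images, for the validation set.
    val_drivers = set(rng.sample(drivers, n_val_drivers))
    train = [r for r in rows if r[0] not in val_drivers]
    val = [r for r in rows if r[0] in val_drivers]
    return train, val

# Toy data: 10 made-up drivers with 3 images each (one per class c0..c2).
rows = [("p%03d" % d, "c%d" % c, "img_%d_%d.jpg" % (d, c))
        for d in range(10) for c in range(3)]

train, val = split_by_driver(rows, n_val_drivers=2)
train_drivers = {r[0] for r in train}
val_drivers = {r[0] for r in val}
print(len(train), len(val), train_drivers & val_drivers)
```

Because whole drivers are moved, the share of images that ends up in validation depends on how many images each chosen driver has, so it will only be roughly 20%, not exactly.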


Thanks @idano I appreciate your explanation.

We’re supposed to classify each image into one of 10 states (one of which is safe driving). My guess is that training on some images of a driver and validating on other images of the same driver might let the model lean on accidental regularities (e.g. the driver’s hair color) and ignore the features we really care about. However, if there were enough images of each driver (at least one per class in the training set), this might not be an issue. I’d love your thoughts.

Practically speaking I suppose you wouldn’t expect to see the same driver at training and test time.

I reread this notebook.

Jeremy mentions that you should use different drivers in the training and validation sets, per the rules on the Kaggle competition page.

Just noting this for anyone else who might be interested in the right setup.

I’ve been trying to get this script to work, but it’s giving me an error message at the point where the script should move the files.

IOError                                   Traceback (most recent call last)
<ipython-input-7-d3df63db566c> in <module>()
      8     move_to = val_dir + '/' + row['classname']
      9     print move_to
---> 10     shutil.move(to_move, move_to)

/home/ubuntu/anaconda2/lib/python2.7/shutil.pyc in move(src, dst)
    300             rmtree(src)
    301         else:
--> 302             copy2(src, real_dst)
    303             os.unlink(src)

/home/ubuntu/anaconda2/lib/python2.7/shutil.pyc in copy2(src, dst)
    128     if os.path.isdir(dst):
    129         dst = os.path.join(dst, os.path.basename(src))
--> 130     copyfile(src, dst)
    131     copystat(src, dst)

 /home/ubuntu/anaconda2/lib/python2.7/shutil.pyc in copyfile(src, dst)
     80                 raise SpecialFileError("`%s` is a named pipe" % fn)
---> 82     with open(src, 'rb') as fsrc:
     83         with open(dst, 'wb') as fdst:
     84             copyfileobj(fsrc, fdst)

IOError: [Errno 2] No such file or directory: '/home/ubuntu/nbs/data/statefarm/train/c0/img_51066.jpg'

Aha! Never mind, I figured out what was causing this error. Some of the data hadn’t copied over properly, so the script couldn’t find the image files it was supposed to move.
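
For anyone who hits the same `IOError`, a quick check before the move loop can list every source file that is missing, instead of failing on the first one. This is only a sketch: `find_missing` is a made-up helper, and the demo uses a temporary directory in place of the real train folder.

```python
import os
import tempfile

def find_missing(paths):
    """Return the paths that do not exist as regular files on disk."""
    return [p for p in paths if not os.path.isfile(p)]

# Demo with a throwaway directory standing in for the train folder.
tmp = tempfile.mkdtemp()
present = os.path.join(tmp, "img_1.jpg")
absent = os.path.join(tmp, "img_2.jpg")
open(present, "wb").close()  # create an empty placeholder file

missing = find_missing([present, absent])
print("%d missing file(s)" % len(missing))
for p in missing:
    print(p)
```

If the list is not empty, it is probably worth re-copying or re-extracting the data before moving anything.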