Should the “data” directory structure mirror that used by the Lesson 1 notebook?
How many images from the “train” folder (downloaded from Kaggle) should be used/moved into the “valid” folder?
When we move X number of images from “train” to “valid”, do we need to randomly move X number of images or can/should we move the first X number of images?
I usually use 10%-30% for the cross-validation set.
I always use a random subset. The advantage is that you can train multiple models with different splits to check that your model gives stable results. Technically the first x also works as a random set, but only if the files are in no particular order.
I’m not sure how much randomly moving X number of train/cats and train/dogs images to the valid/ subdirs really matters for this exercise, but as a general rule of thumb, random selection is the best practice, since simply taking the top x might introduce some bias depending on how the data is structured.
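For anyone who wants to do the random move by hand, here is a minimal sketch. It assumes train/ already has one subfolder per class (train/cats, train/dogs). The function name `move_random_subset` and its parameters are made up for illustration and are not from the lesson notebook.

```python
import random
import shutil
from pathlib import Path


def move_random_subset(train_dir, valid_dir, fraction=0.2, seed=None):
    """Move a random `fraction` of the files in each class folder
    from train_dir/<class>/ to valid_dir/<class>/.

    Hypothetical helper: the name and defaults are assumptions.
    Returns a dict mapping class name to the number of files moved.
    """
    rng = random.Random(seed)  # pass a seed to get a repeatable split
    moved = {}
    for class_dir in sorted(Path(train_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        # Sort first so the same seed always picks the same files.
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        n_valid = int(len(files) * fraction)
        chosen = rng.sample(files, n_valid)
        target = Path(valid_dir) / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for path in chosen:
            shutil.move(str(path), str(target / path.name))
        moved[class_dir.name] = n_valid
    return moved
```

Calling `move_random_subset("data/train", "data/valid", fraction=0.2, seed=42)` would move 20% of each class. Changing the seed gives a different split, which is handy if you want to check that results are stable across splits.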
I’ve written a Python script that will set up the Keras directory structure for you, download classes of images from ImageNet and then randomly apportion the class images into the directories Keras is expecting.
you want the lemonsorlimes.py file. if you run it as-is then it will download 2 classes of images, lemons and limes.
you can specify how many sample images (default 100), how many validation images (default 10%) and how many training images (default 60%) to use. The rest are used for testing your model.
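To show the idea behind those percentages, here is a rough sketch of a shuffle-then-slice split. This is not the code from lemonsorlimes.py. The function `apportion` and its argument names are hypothetical, with defaults picked to match the 10% / 60% figures above.

```python
import random


def apportion(items, valid_frac=0.10, train_frac=0.60, seed=None):
    """Shuffle items, then cut them into train / valid / test lists.

    Hypothetical helper for illustration. Whatever is left after the
    validation and training slices goes to the test list.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    n_train = round(len(shuffled) * train_frac)
    valid = shuffled[:n_valid]
    train = shuffled[n_valid:n_valid + n_train]
    test = shuffled[n_valid + n_train:]
    return train, valid, test
```

With 100 sample images and the defaults, this gives 60 training, 10 validation and 30 test images.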
if you just want the directory structure then call:
makekerasdirectories("some/directory/", ["lemons", "limes"])
it’s my first python script so be gentle
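For reference, Keras' directory-based image loading expects one subfolder per class inside each split folder. Below is a minimal sketch of creating that kind of layout. It is not the body of makekerasdirectories, and the exact folders that script creates may differ. The name `make_class_dirs` and the train/valid/test split names are assumptions.

```python
from pathlib import Path


def make_class_dirs(base_dir, classes, splits=("train", "valid", "test")):
    """Create base_dir/<split>/<class>/ for every split and class.

    Hypothetical helper: the split names are an assumption, not
    necessarily what the script above uses.
    """
    created = []
    for split in splits:
        for name in classes:
            path = Path(base_dir) / split / name
            path.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(path)
    return created
```

For example, `make_class_dirs("some/directory/", ["lemons", "limes"])` would create train/lemons, train/limes, valid/lemons and so on under some/directory/.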
@toby Thank you. Your script will help. I was thinking of writing a similar script.