How to split an image dataset into test and train data?


(Divyansh Rai) #1

I have a dataset of images of 5 types of flowers (tulips, sunflowers, etc.), with each type in its own folder: all tulip pictures in the tulip folder, all rose pictures in the rose folder, and so on.

Is there a way to randomly split these pictures into train (80%) and test (20%) datasets?

(To just get it running I split them manually and arbitrarily, but I want to know if there’s a way to do it in code.)
I know how to split structured data, but I can’t find any resources for images.


(Kiran Scaria) #2

You could do something along the lines of:

  1. Make the required folders (validation and the class folders). You can do this inside the script (os.makedirs()).
  2. Get the number of images in each class folder under ‘train’ (len(os.listdir())).
  3. Move 20 percent (or as much as you want) of the images, chosen at random, to the validation class folders (random.sample() over os.listdir(), then shutil.move()). Use shutil.copy() instead if you want to leave the originals in place.

Once you get the script written you can reuse it for all such examples. :slight_smile:
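
A minimal sketch of the steps above, under some assumptions that are not in the thread: the function name `split_dataset`, the `train`/`test` output folder names, and the choice to copy into a separate output folder rather than move files are all just for illustration. It assumes `src_dir` contains one subfolder per class and that `dst_dir` is a different folder outside `src_dir`.

```python
import os
import random
import shutil


def split_dataset(src_dir, dst_dir, test_fraction=0.2, seed=None):
    """Copy every class folder in src_dir into dst_dir/train and dst_dir/test.

    Returns {class_name: (n_train, n_test)} so the split can be checked.
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    counts = {}
    for class_name in sorted(os.listdir(src_dir)):
        class_dir = os.path.join(src_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files lying next to the class folders
        files = sorted(
            f for f in os.listdir(class_dir)
            if os.path.isfile(os.path.join(class_dir, f))
        )
        # Pick the test images for this class without replacement.
        n_test = round(len(files) * test_fraction)
        test_files = set(rng.sample(files, n_test))
        for subset in ("train", "test"):
            os.makedirs(os.path.join(dst_dir, subset, class_name), exist_ok=True)
        for name in files:
            subset = "test" if name in test_files else "train"
            shutil.copy2(
                os.path.join(class_dir, name),
                os.path.join(dst_dir, subset, class_name, name),
            )
        counts[class_name] = (len(files) - n_test, n_test)
    return counts
```

Calling something like `split_dataset("flowers", "flowers_split", 0.2, seed=42)` (both folder names are placeholders) would then give `flowers_split/train/<class>/` and `flowers_split/test/<class>/`. Splitting inside each class folder keeps roughly the same 80/20 ratio for every flower type; swapping `shutil.copy2` for `shutil.move` avoids duplicating the images on disk.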


(RobG) #3

The first task I perform is usually to turn the folders of images into a structured index so they can be managed more easily as a dataframe. Essentially, put all the images in a single folder, with a CSV file labelling each image with the folder it was previously in. There are many ways to do this, but here is one that has been posted on this forum. You can then split this into whatever sets you need using the structured techniques you already know.
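
As a rough illustration of the index idea (this is not the approach linked in the post), here is a small standard-library sketch. The helper names `build_index`, `write_index` and `split_rows` and the column names `path` and `label` are assumptions made for this example, and it records each image's path in place rather than moving everything into a single folder.

```python
import csv
import os
import random


def build_index(src_dir):
    """Return one {"path", "label"} row per image, labelled by its folder name."""
    rows = []
    for label in sorted(os.listdir(src_dir)):
        class_dir = os.path.join(src_dir, label)
        if not os.path.isdir(class_dir):
            continue  # ignore files that are not class folders
        for name in sorted(os.listdir(class_dir)):
            rows.append({"path": os.path.join(class_dir, name), "label": label})
    return rows


def write_index(rows, csv_path):
    """Save the index as a CSV file with a header row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)


def split_rows(rows, test_fraction=0.2, seed=None):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Once the CSV exists, it can be loaded into a dataframe and split with whatever tool you already use for structured data; for example, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps the class proportions equal across the two sets, which the plain shuffle above does not guarantee.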