Create validation set from json


(nkiruka chuka-obah) #1

Hi all,

I am trying to create a validation set for my dataset. I have a test folder with a list of images and the testing annotations in a test.json file. I have a training folder in the standard train/category[i]/sample.jpg folder arrangement with the accompanying train.json file. While I know I can use ImageClassifierData.from_paths to train the model, I would like to create a smaller dataset that I can iterate quickly on before going to my much larger training set.

So how do I set up a smaller sample that I can work with? I tried following this notebook: https://github.com/fastai/imagenet-fast/blob/master/fp16/imagenet-fp16.ipynb

To get a single csv file (since it might be easier to create validation indices that way), I did the following:

  1. Copy all the images from the subdirectories into a single folder, train1:

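import json
import shutil
from pathlib import Path

train_path = Path('/data/train')
train_json = json.load((train_path/'train.json').open())
# only the category folders; train.json also lives in train_path
categories = [p for p in train_path.iterdir() if p.is_dir()]
cats_to_files = {cat: list(cat.iterdir()) for cat in categories}

# flatten the per-category lists into one list of image paths
files = []
for k,v in cats_to_files.items():
    for file in v:
        files.append(file)

# the target folder has to exist, otherwise shutil.copy writes a file named train1
Path('/data/train1').mkdir(exist_ok=True)
for file in files:
    shutil.copy(str(file), '/data/train1')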

  2. Create the csv:

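import pandas as pd
df = pd.DataFrame(files, columns=['image'])
df_temp = pd.DataFrame(files, columns=['image'])

# image = the file name, i.e. everything after the last '/'
df['image'] = df_temp['image'].apply(lambda x: str(x)[str(x).rfind('/')+1:])
# category = the text between the last '_' and the last '/' in the path
# (assumes folder names like <prefix>_<category> and no '_' in the file names)
df['category'] = df_temp['image'].apply(lambda x: str(x)[str(x).rfind('_')+1 : str(x).rfind('/')])

TRAIN_CSV = 'train.csv'
df.to_csv(train_path/TRAIN_CSV, index=False)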

This creates a csv file listing all the training images with their corresponding labels, and puts all the training images into a single folder, train1.
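
For what it's worth, here is a minimal sketch of how a smaller sample and a validation split could be drawn from the category labels in that csv. It is only an illustration: the function name sample_and_split and its parameters (sample_frac, val_frac, seed) are made up for this example and are not part of fastai. It keeps a fraction of each category and then holds out part of what it kept for validation, returning row indices.

```python
import random
from collections import defaultdict

def sample_and_split(labels, sample_frac=0.1, val_frac=0.2, seed=42):
    """Hypothetical helper: subsample each category, then split off validation rows.

    labels is a list with one category label per row of the csv.
    Returns (train_idxs, val_idxs) as sorted lists of row indices.
    """
    rng = random.Random(seed)

    # group row indices by their category label
    by_cat = defaultdict(list)
    for i, label in enumerate(labels):
        by_cat[label].append(i)

    train_idxs, val_idxs = [], []
    for cat in sorted(by_cat):
        idxs = by_cat[cat]
        rng.shuffle(idxs)
        # keep a fraction of the category, but at least 2 rows when available
        n_keep = max(2, round(len(idxs) * sample_frac))
        keep = idxs[:n_keep]
        # hold out part of the kept rows for validation, at least 1 row
        n_val = max(1, round(len(keep) * val_frac))
        val_idxs.extend(keep[:n_val])
        train_idxs.extend(keep[n_val:])
    return sorted(train_idxs), sorted(val_idxs)
```

With the DataFrame above, the labels would come from df['category'].tolist(), and df.iloc[train_idxs + val_idxs] would give the reduced table to write out as a smaller csv. The validation rows then need to be re-indexed relative to that smaller table before being handed to whatever loader is used.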

Is there an easier way to handle the json files, or to cut the data down into a smaller sample folder that I can iterate on without using the full dataset?
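
One possible approach to the sample folder, sketched under assumptions: copy a handful of images per category into a second folder that keeps the same category[i]/sample.jpg layout, so anything that works on the full train folder should also work on the small one. The helper name make_sample_dir and the parameters n_per_class and seed are hypothetical, and this ignores the json annotations entirely since their structure is not shown here.

```python
import random
import shutil
from pathlib import Path

def make_sample_dir(src, dst, n_per_class=20, seed=42):
    """Hypothetical helper: copy up to n_per_class files per category from src to dst.

    src is expected to contain one subfolder per category; loose files in src
    (such as a json annotation file) are skipped. dst should be outside src.
    Returns the number of files copied.
    """
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)
    copied = 0
    for cat_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        files = sorted(f for f in cat_dir.iterdir() if f.is_file())
        # pick a random subset, or everything if the category is small
        picked = rng.sample(files, min(n_per_class, len(files)))
        out_dir = dst / cat_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for f in picked:
            shutil.copy(str(f), str(out_dir / f.name))
            copied += 1
    return copied
```

For example, make_sample_dir('/data/train', '/data/train_sample', n_per_class=20) would build a reduced copy next to the original data, where train_sample is just a placeholder name. Fixing the seed keeps the sample the same between runs.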