Hi all,
I am trying to create a validation set for my dataset. I have a test folder with a list of images and the testing annotations in a test.json file. I have a training folder in the standard train/category[i]/sample.jpg folder arrangement with the accompanying train.json file. While I know I can use ImageClassifierData.from_paths to train the model, I would like to create a smaller dataset that I can iterate quickly on before going to my much larger training set.
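For reference, this is roughly how I would train on the full set once I have a validation split (going from memory of the fastai 0.7 API; the architecture and image size below are just placeholders):

from fastai.conv_learner import *

PATH = '/data/'
arch, sz = resnet34, 224                 # placeholder architecture and image size
tfms = tfms_from_model(arch, sz)
# from_paths expects train/ and valid/ subfolders under PATH, which is why I need a validation split first
data = ImageClassifierData.from_paths(PATH, tfms=tfms)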
So how do I set up a smaller sample that I can work with? I tried following this notebook: https://github.com/fastai/imagenet-fast/blob/master/fp16/imagenet-fp16.ipynb
To get a single csv file (since it might be easier to create validation indices that way), I did the following:
- Copy/move all the images from all the subdirectories into a single folder, train1:
from pathlib import Path
import json
import shutil

train_path = Path('/data/train')
train_json = json.load((train_path/'train.json').open())

# only the category subdirectories, not train.json itself
categories = [p for p in train_path.iterdir() if p.is_dir()]
cats_to_files = {cat: list(cat.iterdir()) for cat in categories}

files = []
for k, v in cats_to_files.items():
    for file in v:
        files.append(file)

for file in files:
    shutil.copy(str(file), '/data/train1')
- Create csv
import pandas as pd

# df_temp keeps the full paths; df gets the derived filename and label columns
df_temp = pd.DataFrame(files, columns=['image'])
df = pd.DataFrame()
# filename only: everything after the last '/'
df['image'] = df_temp['image'].apply(lambda x: str(x)[str(x).rfind('/')+1:])
# category label: the piece of the path between the last '_' and the last '/'
df['category'] = df_temp['image'].apply(lambda x: str(x)[str(x).rfind('_')+1 : str(x).rfind('/')])
TRAIN_CSV = 'train.csv'
df.to_csv(train_path/TRAIN_CSV, index=False)
This creates a csv file of all the training images and their corresponding labels, and puts all the training images into a single folder, train1.
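From there my plan was to get the validation indices straight from the csv, roughly like this (a sketch from memory of the fastai 0.7 get_cv_idxs / ImageClassifierData.from_csv API; the 20% split and the arch/size are arbitrary):

from fastai.conv_learner import *
import pandas as pd

n = len(pd.read_csv(train_path/TRAIN_CSV))
val_idxs = get_cv_idxs(n, val_pct=0.2)   # a random 20% of the rows become the validation set
arch, sz = resnet34, 224
tfms = tfms_from_model(arch, sz)
# '/data' is the data root, 'train1' the flattened image folder, and the csv supplies the labels
data = ImageClassifierData.from_csv('/data', 'train1', train_path/TRAIN_CSV, tfms=tfms, val_idxs=val_idxs)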
Is there an easier way to handle the json files, and to cut them down into a smaller sample folder that I can iterate on without using the full dataset?
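For now the best I have come up with for a smaller sample is just taking a random slice of each category from the cats_to_files dict above and copying it into its own folder (a sketch; the /data/sample path and the 10% fraction are made up):

import random

sample_path = Path('/data/sample/train1')
sample_path.mkdir(parents=True, exist_ok=True)

sample_files = []
for cat, cat_files in cats_to_files.items():
    k = max(1, int(0.1 * len(cat_files)))   # keep roughly 10% of each category
    sample_files += random.sample(cat_files, k)

for file in sample_files:
    shutil.copy(str(file), str(sample_path))

But that still leaves the json untouched, so any pointers would be appreciated.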