What if there's no 'valid' directory?

richardreeze · August 2, 2018, 3:21am

I’m trying to work on this Kaggle competition: https://www.kaggle.com/c/plant-seedlings-classification

It’s perfect for me since I’m on Lesson 3 (and for this competition I can use the knowledge I learned in Lesson 1).

However, the library assumes that I have ‘train’, ‘valid’, and ‘test’ directories.
The competition data only has a ‘train’ and a ‘test’ directory.

I’m sure there’s an easy solution to this. But I haven’t learned it yet. Please help.

cedric · August 2, 2018, 7:17am

Correct for this particular Kaggle competition.

Partially true. I will explain later why this is so.

The fastai library was designed with ease of use in mind, so by default most of its APIs assume that your dataset has ‘train’, ‘valid’ and ‘test’ directories. Whether that holds depends on your dataset's structure and on which fastai function you call, since different functions accept different parameters and serve different purposes.

For this competition, use the ImageClassifierData.from_csv() function to create the training data for the model. Notebook/code example:

PATH = 'data/kaggle/plant_seedlings/'

!ls {PATH}

labels.csv  models/  sample_submission.csv  test/  tmp/  train/

labels_csv = f'{PATH}labels.csv'
n = len(list(open(labels_csv))) - 1
val_idxs = get_cv_idxs(n)

arch = resnet50
sz = 299
bs = 32

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_top_down, max_zoom=1.1)
data = ImageClassifierData.from_csv(path=PATH, folder='train', csv_fname=labels_csv, test_name='test', val_idxs=val_idxs, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

The ImageClassifierData class reads in images and their labels given as numpy arrays.

The ImageClassifierData.from_csv() function reads in images and their labels given as a CSV file. This method should be used when the training image labels are given in a CSV file as opposed to sub-directories with label names.

ImageClassifierData.from_csv() parameters:

path: a root path of the data (used for storing trained models, precomputed values, etc.)
folder: a name of the folder in which training images are contained.
csv_fname: a name of the CSV file which contains target labels.
tfms: transformations (for data augmentations). e.g. output of tfms_from_model
val_idxs: indices of the images to be used for validation, e.g. the output of get_cv_idxs. If None, default arguments to get_cv_idxs are used.
test_name: a name of the folder which contains test images.

As you can see, you can state up front how you want a fastai function to behave by passing in the path and the folder names of your training and test sets. That gives you a certain degree of control.

I hope that clarifies things.
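For intuition about what val_idxs holds, here is a minimal sketch of picking a random 20% of row indices as the validation set. It uses only the Python standard library rather than fastai; the name pick_val_idxs, the 20% default and the fixed seed are assumptions made for illustration, and get_cv_idxs may differ in its details.

```python
import random

def pick_val_idxs(n, val_pct=0.2, seed=42):
    """Return a random subset of indices in [0, n) to hold out for validation.

    Hypothetical helper for illustration only; not part of fastai.
    """
    rng = random.Random(seed)            # fixed seed keeps the split reproducible
    n_val = int(n * val_pct)             # number of rows to hold out
    return rng.sample(range(n), n_val)   # n_val distinct indices, no repeats

val_idxs = pick_val_idxs(1000)
print(len(val_idxs))  # 200
```

Rows whose index is in val_idxs would form the validation set and the rest would be used for training, so no separate ‘valid’ directory is needed with this approach.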

richardreeze · August 2, 2018, 5:14pm

Thanks Cedric! This was a helpful explanation.

However, the competition I’m doing also doesn’t have a .csv file to read from.

You can see it just has a ‘train’ directory and a ‘test’ directory: https://www.kaggle.com/c/plant-seedlings-classification/data

So ImageClassifierData.from_csv() also wouldn’t work, correct? The labels are given as sub directories, but there’s no ‘valid’ directory.

stephenjohnson · August 2, 2018, 7:02pm

You could write a script to move (not copy) a percentage (like 20%) of the images into your own valid directory. So 20% of train/cleavers would be moved to valid/cleavers, 20% of train/Fat Hen would be moved to valid/Fat Hen, etc. etc.
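A script along those lines could look roughly like the sketch below. It assumes the layout root/train/<class>/<image files>; the function name make_valid_split, the 20% default and the seed are hypothetical choices for illustration, not anything provided by fastai or the competition.

```python
import random
import shutil
from pathlib import Path

def make_valid_split(root, val_pct=0.2, seed=42):
    """Move val_pct of the files in each train/<class> folder to valid/<class>.

    Hypothetical helper; returns how many files were moved per class.
    """
    root = Path(root)
    rng = random.Random(seed)                      # reproducible selection
    moved = {}
    for class_dir in sorted((root / 'train').iterdir()):
        if not class_dir.is_dir():
            continue                               # skip stray files in train/
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        n_val = int(len(files) * val_pct)          # files to hold out for this class
        dest = root / 'valid' / class_dir.name     # e.g. valid/Fat Hen
        dest.mkdir(parents=True, exist_ok=True)
        for f in rng.sample(files, n_val):
            shutil.move(str(f), str(dest / f.name))  # move, not copy
        moved[class_dir.name] = n_val
    return moved

# Example call (path taken from earlier in this thread):
# make_valid_split('data/kaggle/plant_seedlings/')
```

Because the files are moved rather than copied, running the function a second time would move a further 20% of what remains in train, so it is meant to be run once per dataset.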

digitalspecialists · August 2, 2018, 8:06pm

You can use Unix commands to move a random selection of train files into a validation directory, something like the following. You should move on to CSVs as soon as practical, however, as it is then easier to create representative splits.

shuf -n 100 -e * | xargs -I {} mv {} new-folder-path
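If you later want a CSV for a dataset that only ships class sub-directories, one way to build it is sketched below. The function name write_labels_csv and the column names ‘file’ and ‘label’ are assumptions for illustration. Whether a given loader accepts this layout unchanged (relative paths containing a sub-folder, class names with spaces such as ‘Fat Hen’) is not confirmed in this thread, so treat it as a starting point.

```python
import csv
from pathlib import Path

def write_labels_csv(train_dir, csv_path):
    """Write one 'file,label' row per image found under train_dir/<label>/.

    Hypothetical helper; returns the number of data rows written.
    """
    train_dir = Path(train_dir)
    rows = []
    for class_dir in sorted(train_dir.iterdir()):
        if not class_dir.is_dir():
            continue                      # ignore anything that is not a class folder
        for f in sorted(class_dir.iterdir()):
            if f.is_file():
                # path relative to train_dir, e.g. "Fat Hen/abc.png"
                rows.append((f.relative_to(train_dir).as_posix(), class_dir.name))
    with open(csv_path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['file', 'label'])  # header row
        writer.writerows(rows)
    return len(rows)

# Example call (paths follow the layout used earlier in this thread):
# write_labels_csv('data/kaggle/plant_seedlings/train', 'data/kaggle/plant_seedlings/labels.csv')
```

With every image listed in one file, the validation set can be chosen by row index instead of by moving files around.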

cedric · August 2, 2018, 8:46pm

[Image: Screenshot from 2018-08-03 04-21-20]

Correct, unless you generate the CSV file yourself, using the directory names as class labels and the image files as filenames, but this is not a good idea.

A better solution is to follow the approach used for the ‘Dogs vs Cats’ task in the lesson 1 notebook.

As others have suggested here, first move a random selection of images from the train set into a validation directory. Then ImageClassifierData.from_paths() should work.

richardreeze · August 2, 2018, 8:49pm

I didn’t know this. Thank you!

richardreeze · August 2, 2018, 8:58pm

Am I missing something? When I downloaded the files they didn’t come with a .csv for train and test.

Even if you scroll down to ‘Data Sources’ on https://www.kaggle.com/c/plant-seedlings-classification/data
The only .csv file that appears is sample_submission.csv, which I do have.

Did I download it in a wrong way or something?

cedric · August 2, 2018, 9:04pm

No, you didn’t miss anything. The files didn’t come with a CSV for train and test.

The mistake is in the “File descriptions” and that got us all confused.