Formatting file structure for ImageDataBunch.from_csv

wcneill · March 24, 2020, 10:01pm

Hi,

I am trying to use ImageDataBunch.from_csv() on some downloaded Kaggle data. I’m going to try to use lesson 1 to create my model! I’m very excited.

The docs say that the file structure needs to look like this:

path\
  train\
  test\
  labels.csv

My question is: where do the images themselves go? As it stands, I have separated the training images into train\ and the test images into test\.

Also, Kaggle names the training labels file “train.csv”. Do I need to change that to “labels.csv” or is that just a generic file name in the docs?

This is what my directory looks like

\kaggle
  \titanic
  \foliar
     foliar_fastai.ipynb
     \data
        \test
            all my test images
            test.csv
        \train
            all my train images
        labels.csv

Will this work?

EDIT: I tried the above, and it looks like it is searching for the image files in the data\ folder when I call show_batch. So that means that’s where the images go. If that’s the case, what’s the point of having separate training and testing folders?

EDIT 2: I see that you should provide the full path to the images in labels.csv, so that means I need to edit that whole column to have train\image_0xxx.jpg. I did that, and then I tried show_batch but it only shows a single label for each image. Since there are multiple classifications, shouldn’t it show them all?
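
The column edit described in EDIT 2 could be scripted rather than done by hand. Below is a minimal sketch using only the Python standard library. The helper name `prefix_filenames`, the forward-slash separator, and the assumption that filenames sit in the first column are illustrative and not taken from the Kaggle files. Note also that ImageDataBunch.from_csv() accepts a `folder` argument, which may make rewriting the CSV unnecessary.

```python
import csv


def prefix_filenames(src_csv, dst_csv, folder="train", fn_col=0):
    """Copy src_csv to dst_csv, prepending `folder/` to the filename column.

    Hypothetical helper: assumes the CSV has a header row and that the
    filename lives in column index `fn_col`.
    """
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            # Leave rows alone if they already carry the folder prefix.
            if not row[fn_col].startswith(folder + "/"):
                row[fn_col] = f"{folder}/{row[fn_col]}"
            writer.writerow(row)
```

Writing to a new file rather than overwriting the download keeps the original Kaggle CSV intact in case the prefix turns out to be wrong.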

PranY · March 24, 2020, 10:49pm

It would really help if you could share your DataBlock API call, but let me take a guess and try to help you.

Whatever path you provide as a Path object is searched recursively for data. In general, you should provide the path to your train folder, which will probably have a sub-folder for each class. Make sure that you have the same folder structure in your test folder. To answer your question, the point of having separate folders is that you can create a data bunch from a single DataBlock API call by passing your train data and then passing your test folder to add_test_folder in the same call. You can also use a splitter in that call to carve out a validation set, which will be a small chunk of your original train folder. Now when you call show_batch, it will only pull data from the train folder, and only from the subset chosen for training; the validation set will not be touched.
For the target labels, you can simply load the CSV as a dataframe and pass the relevant parameters to label_from_df in the API call. I don’t think you need to edit the whole column, but if you choose to, a simple concatenate call over the column is all you need.
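
To make the idea of a validation split concrete, here is a minimal sketch of a seeded random percentage split in plain Python. It only illustrates the concept and is not fastai's implementation. The function name `random_split` is hypothetical, and the 0.2 default mirrors the `valid_pct=0.2` default of ImageDataBunch.from_csv().

```python
import random


def random_split(items, valid_pct=0.2, seed=None):
    """Shuffle `items` and return (train, valid) lists.

    Illustrative only: `valid_pct` of the items, rounded down, go to the
    validation list; the rest stay in the training list.
    """
    rng = random.Random(seed)  # a local RNG keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(shuffled))
    return shuffled[cut:], shuffled[:cut]
```

Passing the same seed gives the same split every time, which matters when comparing models trained in separate runs.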

I recommend reading https://docs.fast.ai/data_block.html

wcneill · March 24, 2020, 10:54pm

Hi @PranY, thank you for the response!

I am simply following lesson 1 step by step, just on a different dataset. I don’t know what a DataBlock is yet, since I’m only on lesson 1. However, I did read in the docs for ImageDataBunch.from_csv() that the label column is set to 1 by default. Since I have multiple columns, I just needed to pass label_col=['class1','class2', ...] and everything worked fine.
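
For a CSV with many label columns, the list passed as label_col could be read from the header instead of typed out. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes the filename is in the first column and that every other column is a 0/1 indicator for one class.

```python
import csv


def label_columns(csv_path, fn_col=0):
    """Return the header names of every column except the filename column."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return [name for i, name in enumerate(header) if i != fn_col]


def active_labels(row, columns):
    """Names of the label columns set to "1" in a csv.DictReader row.

    Assumes one-hot style columns holding the strings "0" or "1".
    """
    return [c for c in columns if row[c] == "1"]
```

The list returned by `label_columns` is what would be passed as label_col, and `active_labels` offers a quick way to check which classes a given row should display in show_batch.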

I’m still new to understanding documentation well, but I finally found it.

Now, I will dig into reading about data blocks! Thanks so much!