Dataset creation - ImageDataBunch vs ImageLists

shruti_01 · May 7, 2019, 12:39pm

Why do we have two different methods to create data - ImageDataBunch and ImageLists? I don’t really understand which one to use when.

dusan · May 7, 2019, 5:45pm

Hi Shruti,

When we create a DataBunch to pass to the model for training, we actually go through several steps. We use the fastai data block API, which is quite flexible: it lets us specify what type of data to get (image, text, etc.), where to get it from (a folder, a CSV file, etc.), how to split it into train/val sets, how to get the labels, whether to add a test set, and which transforms to apply. Once all of that is specified, we can create a DataBunch from the final output.

For an image dataset it might look like this:

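from fastai.vision import *   # fastai v1

# path: root folder with train/ and valid/ subfolders (plus test/ for add_test_folder)
tfms = get_transforms()       # default augmentations; swap in your own if needed

data = (ImageList.from_folder(path)   # what data to get and where from
        .split_by_folder()            # train/val split from the train/ and valid/ folders
        .label_from_folder()          # labels from the parent folder names
        .add_test_folder()            # optional unlabeled test set
        .transform(tfms, size=64)     # augmentation and resizing
        .databunch())                 # final DataBunch for training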

You can’t use an ImageList to train a model (it doesn’t have enough information specified for training - what’s the train/val split, where are the labels, etc.). You have to go through the additional steps. The resulting ImageDataBunch will have all the information required.
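To make the "not enough information" point concrete, here is a plain-Python sketch (standard library only, not fastai code) of what the split and label steps add to a bare list of files. The function names are borrowed from the data block API, but the bodies are toy stand-ins, and the cats/dogs files and the root/train|valid/class_name layout are assumptions made up for the illustration.

```python
from pathlib import Path
import tempfile

def make_toy_dataset(root):
    """Create empty files in a root/{train,valid}/{cats,dogs}/ layout (made-up data)."""
    for split in ("train", "valid"):
        for label in ("cats", "dogs"):
            folder = root / split / label
            folder.mkdir(parents=True)
            (folder / f"{label}_{split}.jpg").touch()

def get_items(root):
    """Step 1: a bare list of image files. No split and no labels yet."""
    return sorted(root.rglob("*.jpg"))

def split_by_folder(items, root, train="train", valid="valid"):
    """Step 2: decide train vs. validation from the first folder under root."""
    split = {train: [], valid: []}
    for item in items:
        top = item.relative_to(root).parts[0]
        if top in split:
            split[top].append(item)
    return split[train], split[valid]

def label_from_folder(item):
    """Step 3: the label is the name of the folder that holds the file."""
    return item.parent.name

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_toy_dataset(root)
    items = get_items(root)
    train_items, valid_items = split_by_folder(items, root)
    train_set = [(p.name, label_from_folder(p)) for p in train_items]
    valid_set = [(p.name, label_from_folder(p)) for p in valid_items]

print(len(items), "files in total")
print("train:", train_set)
print("valid:", valid_set)
```

After step 1 there is only a list of paths. Only after steps 2 and 3 does each file have a role (train or validation) and a target, which is the minimum a training loop needs.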

If you got confused because there is both ImageDataBunch.from_folder and ImageList.from_folder, know that ImageDataBunch.from_folder is just a shortcut that uses default settings for all of the steps above! Take a look at the code here. It calls ImageList.from_folder internally as the very first step. (You can always find the source code by checking out the docs for a function or class.)
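The shortcut idea can be sketched in plain Python as well. This is a toy model of the pattern, not the fastai source: ToyList and ToyBunch are hypothetical classes, and the default arguments are assumptions chosen for the example. Each step only records what it was asked to do, so the two routes can be compared directly.

```python
class ToyList:
    """Toy stand-in for an item list that is configured step by step."""

    def __init__(self, path):
        self.path = path
        self.steps = []  # record of the configuration applied so far

    @classmethod
    def from_folder(cls, path):
        return cls(path)

    def split_by_folder(self, train="train", valid="valid"):
        self.steps.append(("split_by_folder", train, valid))
        return self  # returning self is what allows chaining

    def label_from_folder(self):
        self.steps.append(("label_from_folder",))
        return self

    def transform(self, tfms=None, size=None):
        self.steps.append(("transform", tfms, size))
        return self

    def databunch(self, bs=64):
        self.steps.append(("databunch", bs))
        return ToyBunch(self.path, self.steps)


class ToyBunch:
    """Toy stand-in for the final object handed to training."""

    def __init__(self, path, steps):
        self.path, self.steps = path, steps

    @classmethod
    def from_folder(cls, path, train="train", valid="valid",
                    tfms=None, size=None, bs=64):
        # Shortcut: run every step of the chain with default settings.
        return (ToyList.from_folder(path)
                .split_by_folder(train, valid)
                .label_from_folder()
                .transform(tfms, size)
                .databunch(bs))


# The explicit chain and the shortcut end up with the same configuration.
long_way = (ToyList.from_folder("data")
            .split_by_folder()
            .label_from_folder()
            .transform(size=64)
            .databunch())
short_way = ToyBunch.from_folder("data", size=64)
same_steps = long_way.steps == short_way.steps
print(same_steps)
```

The step-by-step route is worth the extra typing only when one of the defaults does not fit your data, for example labels that come from a CSV file instead of folder names.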

shruti_01 · May 8, 2019, 2:50am

Thanks Dusan! This makes sense now.

I went through this doc, and the difference is clearer now.

kodzaks · September 16, 2019, 12:21pm

Hello,

I have a few questions about using ImageList.from_folder for predictions. First, the data used to train the model was normalized. Should the data used for predictions be normalized too? If so, how?

Second, about this part:
.split_by_folder()
.label_from_folder()
.add_test_folder()
.transform(tfms, size=64)
.databunch()
The test data sits in its own folder, Mypath/test, and it is unlabeled because we only need predictions. So why do we need the split, label, and add-test steps for the test image files at all?

Finally, when I use ImageDataBunch.from_folder to predict a single image (as shown in one of the lectures), I get much more accurate results than when I use ImageList.from_folder with an exported model file to predict on a number of files. I do not understand why. Is it because the test images were not normalized in ImageList.from_folder?