How to generate ImageDataBunch from training/validation lists?

kzuiderveld · March 20, 2019, 1:19am

Hello,

I ran into an issue that I think others have run into before, but I couldn’t find a solution in the forums…

I love the ease of use of FastAI, but in this case it doesn’t do quite what I need, and I can’t figure out how to add the functionality myself without getting in over my head fast. Still, it should be easy to do (and to add to FastAI…)

So, here’s my use case: I determine the training/validation split myself and end up with four lists containing:

the Posix filenames of my training images
the labels of the training images
the Posix filenames of my validation images
the labels of the validation images.

So I expected that it would be straightforward to create an ImageDataBunch, but I can’t figure out how! It seems that ImageDataBunch.create() contains the hook that I need, but that function expects Torch Datasets for training and validation, and it’s not clear to me how to create a Dataset using the lower-level APIs. Dataset is an abstract class, so I would need to implement my own custom class to do this, but I can’t imagine that my use case is that special.

Does anyone have suggestions or code that can help me to move forward? Thanks!

Karel

dipam7 · March 20, 2019, 5:25am

Hey @kzuiderveld, when I was new to the course I also had a lot of trouble creating data bunches. After trying a lot of things, I found that the data block API is an easier and more intuitive way to create them. In your particular case, I think you should put all the images in one folder, create a data frame with the image names and their corresponding labels, use the from_csv method to create a data bunch from it, and use split_by_fname_file('valid.txt') for the train/valid split.
Another thing to do is to read a lot of notebooks, because they help you understand the different ways in which people create data bunches. You can find some of my notebooks here.
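
The file preparation suggested above might look roughly like the following sketch, which uses only the Python standard library. The file names labels.csv and valid.txt, the column names, the helper write_split_files, and the sample paths are all assumptions made for illustration; the fastai calls that would read these files are not shown.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data standing in for the four lists from the question.
train_files = [Path("images/cat_001.jpg"), Path("images/dog_001.jpg")]
train_labels = ["cat", "dog"]
valid_files = [Path("images/cat_002.jpg")]
valid_labels = ["cat"]

def write_split_files(out_dir, train_files, train_labels, valid_files, valid_labels):
    """Write a CSV of (image_name, label) rows and a text file of validation names."""
    out_dir = Path(out_dir)
    csv_path = out_dir / "labels.csv"   # assumed file name
    valid_path = out_dir / "valid.txt"  # assumed file name

    all_files = list(train_files) + list(valid_files)
    all_labels = list(train_labels) + list(valid_labels)

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_name", "label"])
        for path, label in zip(all_files, all_labels):
            # All images are assumed to live in one folder, so only the name is stored.
            writer.writerow([path.name, label])

    # One validation file name per line.
    valid_path.write_text("\n".join(p.name for p in valid_files) + "\n")
    return csv_path, valid_path

out_dir = tempfile.mkdtemp()
csv_path, valid_path = write_split_files(
    out_dir, train_files, train_labels, valid_files, valid_labels
)
```

Storing only the bare file name works when every image sits in a single folder, as suggested; with nested folders the relative path would have to be written instead.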

kzuiderveld · March 21, 2019, 5:01am

@dipam7 and other readers,

I’m happy to report that I solved the issue. It was relatively easy to implement this once I figured out how the data_block API worked.

Create a set() containing the filenames in the validation dataset.
Make a function that returns True if the filename is in the validation set (keeping the validation filenames in a set or dict makes that function much faster than searching a list):

def isValidationImage(fileName):
    return fileName in validationSet

I could eventually string everything together as follows:
data = (ImageList.from_folder(datadir)             # the data is present in this folder
        .split_by_valid_func(isValidationImage)    # split using the set that contains all validation images
        .label_from_func(getLabel)                 # label each image with getLabel(fileName)
        .transform(tfms, size=bs)
        .databunch()
        .normalize()
        )
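
The post does not show how validationSet and getLabel are defined. One possible way to build both from the four lists in the original question is sketched below in plain Python. The sample paths, the labelByFile dictionary, and the dictionary-based getLabel are assumptions for illustration, not necessarily what was actually used.

```python
from pathlib import Path

# Hypothetical sample data standing in for the four lists from the question.
train_files = [Path("data/cat_001.jpg"), Path("data/dog_001.jpg")]
train_labels = ["cat", "dog"]
valid_files = [Path("data/cat_002.jpg")]
valid_labels = ["cat"]

# A set gives constant-time membership tests on average; a list would be scanned linearly.
validationSet = set(valid_files)

# Assumed helper structure: one dictionary mapping every file to its label.
labelByFile = dict(zip(train_files + valid_files, train_labels + valid_labels))

def isValidationImage(fileName):
    """Return True if the file belongs to the validation split."""
    return Path(fileName) in validationSet

def getLabel(fileName):
    """Return the label recorded for this file; raises KeyError for unknown files."""
    return labelByFile[Path(fileName)]
```

Both lookups only succeed if the paths passed in have exactly the same form (relative or absolute) as the paths stored in the lists, so it may be necessary to normalize them first.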