I ran into an issue that I think others have run into before, but I couldn’t find a solution in the forums…
I love how easy fastai is to use, but in this case it doesn’t do quite what I need, and I can’t figure out how to add the functionality myself without getting in over my head fast. Still, it seems like it should be easy to do (and to add to fastai…)
So, here’s my use-case: I am determining a training/validation set myself and end up with four lists containing
- the Posix filenames of my training images
- the labels of the training images
- the Posix filenames of my validation images
- the labels of the validation images.
So I expected that it would be straightforward to create an ImageDataBunch, but I can’t figure out how! It seems that ImageDataBunch.create() contains the hook that I would need, but that function expects Torch Datasets for training and validation, and it’s not clear to me how to create a Dataset using the lower-level APIs. Dataset is an abstract class, so I would need to implement my own custom class to do this, but I can’t imagine that my use-case is that special.
Does anyone have suggestions or code that can help me to move forward? Thanks!
Hey @kzuiderveld, when I was new to the course I also had a lot of trouble creating data bunches. After trying a lot of things, I found that the data block API is the easier and more intuitive way to create them. In your particular case, I think you should put all the images in one folder, create a data frame with one column for the image names (image_name) and one for their labels, use the from_csv method to create the item list from it, and use split_by_fname_file('valid.txt') for the train / valid split.
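A minimal sketch of the preparation step suggested above, using only the standard library. It assumes that the CSV lists every image (training and validation), that valid.txt names the validation images one per line, and that file names stay unique once all images sit in one folder. The function name write_split_files and the file name labels.csv are made up for illustration; they are not part of fastai.

```python
import csv
from pathlib import Path

def write_split_files(train_files, train_labels, valid_files, valid_labels, out_dir):
    """Write labels.csv (all images) and valid.txt (validation names only)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "labels.csv"    # hypothetical file name
    valid_path = out_dir / "valid.txt"   # name used in the post above

    all_files = list(train_files) + list(valid_files)
    all_labels = list(train_labels) + list(valid_labels)

    # One row per image: bare file name plus its label.
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_name", "label"])
        for path, label in zip(all_files, all_labels):
            writer.writerow([Path(path).name, label])

    # One validation file name per line.
    valid_path.write_text("\n".join(Path(p).name for p in valid_files) + "\n")
    return csv_path, valid_path
```

The two files can then be handed to the from_csv and split-by-file steps described in the post; the exact arguments those calls expect depend on the fastai version in use.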
Another thing to do is to read a lot of notebooks, because they help you understand the different ways in which people are creating data bunches. You can find some of my notebooks here.
@dipam7 and other readers,
I’m happy to report that I solved the issue. It was relatively easy to implement this once I figured out how the data_block API worked.
Create a set() containing the filenames in the validation dataset.
Make a function that returns True if the filename is in the validation set (keeping the validation filenames in a set rather than a list makes that lookup much faster):
def isValidationImage(fileName):
    return fileName in validationSet
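The two steps above can be sketched end to end as follows. The file names are made-up example data, and the Path() normalisation is an addition of this sketch, not something the post states. It assumes the split function is called with each item's path in the same form (relative or absolute) as the entries stored in the set.

```python
from pathlib import Path

# Hypothetical example data; in practice this is the list of validation filenames.
validation_files = [Path("images/img_003.png"), Path("images/img_007.png")]

# Build the set once so each lookup is a hash lookup instead of a list scan.
validationSet = set(validation_files)

def isValidationImage(fileName):
    """Return True if fileName is one of the validation images."""
    # Convert to Path so that str and Path arguments both match the set entries.
    return Path(fileName) in validationSet
```

If the set holds absolute paths while the function receives relative ones (or the other way round), every lookup silently returns False and the validation set ends up empty, so it is worth checking one path by hand first.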
Eventually I could string everything together as follows:
data = (ImageList.from_folder(datadir) # the data is present in this folder
.split_by_valid_func(isValidationImage) # split using the set that contains all validation images