Data Block API - recap for MNIST examples in "look at data"


This is not my first post about the Data Block API. I’m a bit stubborn, sorry, but I really don’t want to go further in the lessons until I understand how to create my databunches.

My question is about the first lessons with URLs.MNIST_TINY and the “look at data” examples in the documentation. All the examples use ImageDataBunch, but I am trying to understand how to do the same with the Data Block API, which many people say should be the usual way of creating databunches.

I have a folder :
Mnist_tiny :
– models
– test
– train :
— 3
— 7
– valid
– labels.csv

It would be really helpful if you could guide me on creating the databunch using different ways :

ImageItemList.from_folder :
data = (ImageItemList.from_folder(path).split_by_folder().label_from_folder().transform(tfms, size=24).databunch())

This one works OK.
Questions :
.split_by_folder : if nothing is in the brackets, does it check whether the data follows an ImageNet-style structure with valid and train folders ?
Does it go into the train folder, see the 2 folders ‘3’ and ‘7’ and split them automatically ? What would have happened (and what would need to change) if the folders were called ‘train-files’ and ‘valid-files’ ?

Other question :
How would I have built the databunch if my folders were structured like this :

Mnist_tiny :
– models
– test
– 3 :
— train
— valid
– 7 :
— train
— valid
– labels.csv

ImageItemList.from_csv :
I can’t create my databunch here with the datablock API…
data = (ImageItemList.from_csv(path, 'labels.csv')) works,
but I really don’t know how to split, label and create my databunch from the csv file.
Should I create a df from the csv with df = pd.read_csv(path/'labels.csv') or can I just make it from the csv itself ?

Please help me. I know that once I understand the datablock and what it is trying to do at each step, the rest will easily follow.
It would be really cool if the “Look at Data” docs explained the examples both with ImageDataBunch AND with ImageItemList and the datablock, for a better understanding.

Thanks a lot,

I hope you have already looked at the core data_block API documentation.
Regarding the first question, you can pass the folder names to split_by_folder() :

    def split_by_folder(self, train:str='train', valid:str='valid')->'ItemLists':
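
To make the folder questions more concrete, here is a rough sketch in plain Python of what splitting and labelling by folder amount to. It is not fastai's code: the functions below are stand-ins written with pathlib only, the file names are made up, and the behaviour (split on the top-level folder, label with the containing folder) is my understanding of the fastai v1 defaults rather than something checked against a particular version.

```python
from pathlib import PurePosixPath

def split_by_folder(rel_paths, train="train", valid="valid"):
    """Partition paths (relative to the dataset root) by their top-level folder."""
    train_items, valid_items = [], []
    for p in map(PurePosixPath, rel_paths):
        if p.parts[0] == train:
            train_items.append(p)
        elif p.parts[0] == valid:
            valid_items.append(p)
        # Anything else (e.g. 'test') ends up in neither set.
    return train_items, valid_items

def label_from_folder(p):
    """Label an item with the name of the folder that directly contains it."""
    return PurePosixPath(p).parent.name

# Hypothetical file names, using the renamed folders from the question.
files = ["train-files/3/a.png", "train-files/7/b.png",
         "valid-files/3/c.png", "valid-files/7/d.png", "test/e.png"]
train_items, valid_items = split_by_folder(
    files, train="train-files", valid="valid-files")
train_labels = [label_from_folder(p) for p in train_items]

# Class-first layout (3/train, 3/valid, 7/train, 7/valid): here the split
# comes from the containing folder and the label from the top-level folder.
class_first = ["3/train/a.png", "3/valid/c.png",
               "7/train/b.png", "7/valid/d.png"]

def is_valid(p):
    return PurePosixPath(p).parent.name == "valid"

def label_of(p):
    return PurePosixPath(p).parts[0]

cf_valid = [p for p in class_first if is_valid(p)]
cf_labels = [label_of(p) for p in class_first]
```

With the renamed folders, the fastai call would then be .split_by_folder(train='train-files', valid='valid-files'). For the class-first layout the default folder split probably won't match anything, so functions like is_valid and label_of would have to be passed to split_by_valid_func and label_from_func instead; treat that as a suggestion to try, not a tested recipe.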

For ImageItemList.from_csv() (ImageItemList is called ImageList in newer fastai versions), check the Planet example :

data = (ImageList.from_csv(planet, 'labels.csv', folder='train', suffix='.jpg')
        #Where to find the data? -> in planet 'train' folder
        .split_by_rand_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_df(label_delim=' ')
        #How to label? -> use the second column of the csv file and split the tags by ' '
        .transform(planet_tfms, size=128)
        #Data augmentation? -> use tfms with a size of 128
        .databunch())
        #Finally -> use the defaults for conversion to databunch
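
If it helps to see the steps without fastai in the way, below is a small standard-library sketch of the same idea: read (name, label) rows from a CSV, then hold out a random 20% for validation. Everything in it is an assumption for illustration: the sample rows, the two-column name,label layout, and the helper names read_items and split_by_rand_pct, which only echo the fastai method name and are not its implementation. On the df question: as far as I can tell, from_csv reads the file into a DataFrame itself, so calling pd.read_csv first should not be necessary.

```python
import csv
import io
import random

# Hypothetical rows in a two-column name,label layout.
SAMPLE_CSV = """name,label
train/3/0001.png,3
train/3/0002.png,3
train/3/0003.png,3
train/7/0004.png,7
train/7/0005.png,7
train/7/0006.png,7
valid/3/0007.png,3
valid/3/0008.png,3
valid/7/0009.png,7
valid/7/0010.png,7
"""

def read_items(csv_text):
    """Return (file name, label) pairs: first column is the item, second the label."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [(row[0], row[1]) for row in reader]

def split_by_rand_pct(items, valid_pct=0.2, seed=None):
    """Randomly hold out valid_pct of the items for validation."""
    rng = random.Random(seed)
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    valid_idx = set(idxs[:int(valid_pct * len(items))])
    train = [it for i, it in enumerate(items) if i not in valid_idx]
    valid = [it for i, it in enumerate(items) if i in valid_idx]
    return train, valid

items = read_items(SAMPLE_CSV)
train, valid = split_by_rand_pct(items, valid_pct=0.2, seed=42)
print(len(train), "training items,", len(valid), "validation items")
```

Note that a random split ignores any train/valid prefix in the file names, so for a CSV whose names already encode the split you may prefer to split on that prefix instead.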

You can check out this article which explains data block API thoroughly.

Hope this helps.