Developer chat

See the updated doc here: https://docs-dev.fast.ai/develop.html#development-editable-install

You only need to do it once. After that you just do git pull and nothing else.

If you do git pull right now you will see 1.0.16.dev0, so now you’re on the bleeding edge and all the commits should be there. Please double check that it’s the case.

You can also run:

git log --oneline

in your checkout to see the short log of everything you have. If you want it pretty:

git log --graph --decorate --pretty=oneline --abbrev-commit

* 4c11bce (HEAD -> master, origin/master, origin/HEAD) improvements:
* 3856935 document version+timeline, adjust levels
*   ffa3f50 Merge branch 'master' of github.com:fastai/fastai
|\
| * 32377a3 new dev cycle: 1.0.16.dev0
| * 0a4e629 CHANGES
| * ef15dda rename create_cnn
* | 67d2ff3 require dev install for PR, plus run-after-git-clone, and split steps
* | b56c79a require coverage for dev, needed for testing on fastai and fastai_docs
* | 2f98955 azure support links
|/
*   14c02c2 Merge branch 'master' of github.com:fastai/fastai
|\
| * 70cb432 Add maybe copy tests (#980)
* | a456e56 Remove model type
|/
* fbd6235 Learner.create_cnn
*   768d606 Merge branch 'master' of github.com:fastai/fastai
|\
| * 2d63ae4 Update CHANGES
| * c037f61 Fix pred_batch
| * 644cb64 create x in cuda for model_sizes() (#990)
* | 01aec14 Learner.create_cnn
|/
* a1ff5c2 Auto activ (#992)
* 7da5bd3 SegmentationDataset classes
* bc255a8 document the issue with missing libcuda.so.1
* 2735255 document gpustat, and nvidia-smi dmon -s u (forum tips)
* bd62fdb add jekyll templates in the package
* 186739f Ensure that plot_pdp accepts an axis name. Fixes #986. (#987)
* ab4a39b Fix saleElapsed vs YearMade interaction plot in ml1/lesson2-rf_interpretation. Fixes #988. (#989)
* 0de3384 move property
*   7d68137 recurse flag

Plus, there is CHANGES.md, where important changes like bug fixes are logged.

Just pushed 1.0.15. Main change (from CHANGES.md):

ConvLearner ctor is replaced by a function called create_cnn
If you do git pull right now you will see 1.0.16.dev0, so now you’re on the bleeding edge and all the commits should be there. Please double check that it’s the case.

Confirmed!
Bleeding again… :slight_smile:
Thanks!!


How do we go about creating a TODO/HELP-WANTED list, and invite others to contribute on tasks that need to be done?

I have one item that is up for grabs: https://docs-dev.fast.ai/git.html#hub (see HELP-WANTED there). It should be a fun little project, shouldn’t take more than a few hours to figure out. I laid out all the details, and it just needs to be coded in python to support windows users w/o bash.

@stas Perhaps a simple way would be to create a “TODO/HELP-WANTED” category in the Forum.
Each entry under the category could follow a template (eg like the bug reports template),
that describes the requirements, such as, in the above example, “Windows Platform”, etc.
Others may then state their interest and even create small groups to tackle the task together as a teaching opportunity…

New big change: we introduced the data block API. Jeremy will explain it more on Tuesday and I’ll document it tomorrow, but the basic idea is that it lets you plug together the different parts of creating a DataBunch as you want, with a lot more flexibility than the current factory methods. Specifically, you tell it:

  • where the filenames are (if applicable)
  • how to determine the label of each input (re pattern, folder names, csv file…)
  • how to create a validation set (random split, folder names, valid indexes…)
  • which Dataset function to apply (ImageDataset, ImageMultiDataset, SegmentationDataset…)
  • which transforms to apply (if applicable)
  • how to databunch it (which is where you set the batch size, the dl transforms…)

Examples are in the 104a and 104b notebooks in the dev folder, but here are a few of them:

Pets datasets from lesson 1

path = untar_data(URLs.PETS)
tfms = get_transforms()
data = (InputList.from_folder(path/'images')
        .label_from_re(r'^(.*)_\d+.jpg$')
        .random_split_by_pct(0.2)
        .datasets(ImageClassificationDataset)
        .transform(tfms, size=224)
        .databunch(bs=64))
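For intuition, label_from_re pulls the class out of each filename with the regex shown above. Here is a minimal pure-Python sketch of that extraction (the filenames are made up, and this is not the fastai internals, just the same regex at work):

```python
import re

# the same pattern used in the Pets example above
pat = r'^(.*)_\d+.jpg$'

def label_of(fname):
    # e.g. 'german_shorthaired_105.jpg' -> 'german_shorthaired'
    return re.search(pat, fname).group(1)

print(label_of('german_shorthaired_105.jpg'))  # german_shorthaired
print(label_of('Abyssinian_12.jpg'))           # Abyssinian
```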

Classic dogscats in an Imagenet style folder structure

path = Path('data/dogscats')
tfms = get_transforms()
data = (InputList.from_folder(path)
        .label_from_folder()
        .split_by_folder()
        .datasets(ImageClassificationDataset)
        .transform(tfms, size=224)
        .databunch(bs=64))

Planet dataset (multiclassification problem with labels in a csv file)

path = untar_data(URLs.PLANET_SAMPLE)
tfms = get_transforms()
data = (InputList.from_folder(path)
        .label_from_csv('labels.csv', sep=' ', suffix='.jpg', folder='train')
        .random_split_by_pct(0.2)
        .datasets(ImageMultiDataset)
        .transform(tfms, size=128)
        .databunch(bs=64))
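As a rough sketch of what label_from_csv does with sep=' ' and suffix='.jpg': each row pairs an image name with a space-separated list of tags, the suffix completes the filename, and the separator splits the cell into multiple labels. The row below is hypothetical, not actual fastai internals:

```python
# hypothetical row from labels.csv: image name and space-separated tags
row = ('train_0', 'haze primary')

fname, tags = row
fname = fname + '.jpg'    # suffix='.jpg' completes the filename
labels = tags.split(' ')  # sep=' ' turns the cell into a list of labels

print(fname)   # train_0.jpg
print(labels)  # ['haze', 'primary']
```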

Camvid (segmentation tasks with segmentation masks in another folder):

path = Path('data/camvid')
path_lbl = path/'labels'  # assuming the masks live in a 'labels' folder
get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'
codes = np.loadtxt(path/'codes.txt', dtype=str)
tfms = get_transforms()
data = (InputList.from_folder(path/'images')
        .label_from_func(get_y_fn)
        .split_by_fname_file('../valid.txt')
        .datasets(SegmentationDataset, classes=codes)
        .transform(tfms, size=128, tfm_y=True)
        .databunch(bs=64))
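The get_y_fn lambda above just maps an image filename to its mask filename in another folder. The same mapping in plain pathlib (the folder layout here is illustrative):

```python
from pathlib import Path

path_lbl = Path('data/camvid/labels')  # assumed location of the mask images

def get_y_fn(x):
    # images/0001TP_006690.png -> labels/0001TP_006690_P.png
    return path_lbl / f'{x.stem}_P{x.suffix}'

print(get_y_fn(Path('data/camvid/images/0001TP_006690.png')))
```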

Facing the same issue, after the latest pull
NameError: name 'ConvLearner' is not defined

Did you make use of create_cnn in vision.learner, then?

When I used that like:

learn = create_cnn(data, models.resnet34, metrics=error_rate)

I get the following error:

AttributeError: module 'fastai.vision.data' has no attribute 'c'


Thanks for the new camvid notebook - so elegant.

running:
train_ds = SegmentationDataset(train_fns, y_train_fns)
valid_ds = SegmentationDataset(valid_fns, y_valid_fns)

i get this:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 train_ds = SegmentationDataset(train_fns, y_train_fns)
      2 valid_ds = SegmentationDataset(valid_fns, y_valid_fns)

TypeError: __init__() missing 1 required positional argument: 'classes'

I believe it is fixed by adding codes as the classes argument?

train_ds = SegmentationDataset(train_fns, y_train_fns, codes)
valid_ds = SegmentationDataset(valid_fns, y_valid_fns, codes)

Also there is a:

learn.unfreezefreeze()

that should be changed to:

learn.unfreeze()

A small suggestion would be to use an explicit mapping from segments to classes using a dict with:

  • key as the classes pixel value in the mask
  • value as the class

You should restart your notebook and make sure you define data somewhere, as Python believes it’s the data module of fastai.vision, judging from your error message.

Thanks - note that this is a work in progress so no need to give feedback until it’s done. It’s not in a working state yet.

Thanks for your detailed explanation; I always learn a lot from the tools/tricks that you share.

For now, it feels natural to call it within a fastai repo, but I don’t see anything stopping this tool from being used outside fastai. I am not good at bash scripting, but I saw orig_user_name=fastai inside the script, so from my understanding we can simply change this line and use it in other open source projects as well. So putting it in $PATH makes perfect sense.

I will give it a few more trials and see if I come back with any questions. :slight_smile:

Thanks again.

It reminds me of Processing language.

I have the developer version installed with the latest pull. I get the following error when trying to create a cnn (from the docs):

AttributeError: type object 'Learner' has no attribute 'create_cnn'

Any ideas?

create_cnn is a function, it’s not inside Learner.

Do the docs need to be updated then? It says here:

learn = Learner.create_cnn(data, models.resnet18, metrics=accuracy)

Yes, orig_user_name can be made into a parameter, and then you could use the script with any GitHub project.
That’s why I called it fastai-make-pr-branch, as it hardwires the fastai user :wink:

The only custom thing in the script is that it runs tools/run-after-git-clone if it finds it in the repo.


Hey all,

Just sharing this issue here with the validation set random seed.
https://forums.fast.ai/t/lesson-1-pets-benchmarks/27681/55?u=jamesrequa

Please feel free to verify this on your end as well. Steps to reproduce:

  1. Set a random seed in the jupyter notebook np.random.seed(2)
  2. Create an ImageDataBunch data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=299, bs=bs, valid_pct=0.2)
  3. Save train/val x for later trn_x, val_x = data.train_ds.x, data.valid_ds.x
  4. Create a new ImageDataBunch repeating step 2
  5. Check again the train/val x for this new data instance trn_x2, val_x2 = data.train_ds.x, data.valid_ds.x. Compare with the first train/valid set and verify they are not the same.

I have already implemented the code changes which fix this issue, so if you like I can submit a PR :slight_smile: I think this is pretty important to fix right away, as it can result in validation loss/error rate results which are not reliable, and it can happen in a very innocent way, like if you just wanted to change the batch size or image size (as we saw, one student achieved 1% error rate on the pets dataset for this reason).

I don’t have that issue. The thing is, you say to repeat the creation from step 2, but you have to go back to step 1 and reset the seed to 2; then your validation set will be the same.
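The pitfall can be reproduced with Python's stdlib random module instead of numpy (a sketch of the seeding behaviour, not fastai code): sampling twice after one seed call gives two different splits, while re-seeding first reproduces the split.

```python
import random

random.seed(2)
split1 = random.sample(range(100), 20)  # pretend these are validation indexes

# creating a "new DataBunch" without re-seeding draws fresh random numbers
split2 = random.sample(range(100), 20)

random.seed(2)                          # reset the seed first...
split3 = random.sample(range(100), 20)  # ...and the split is reproduced

print(split1 == split2)  # False
print(split1 == split3)  # True
```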


Hi Sylvain, good to see you here!

Thank you for confirming this! I was not re-running the code cell np.random.seed(2) when I went to re-create ImageDataBunch. My guess is that others did the same.

I would suggest updating the notebooks so that this seed generation is done in the same cell block as the creation of the ImageDataBunch, to avoid something like this happening to others :slight_smile:

Alternatively, passing the seed value in as a parameter to ImageDataBunch would ensure this mix-up never happens (more user friendly imho), but I realize this affects the code so it’s probably less desirable.

Good idea.