DataBlock example, explained
These are some quick notes on the DataBlock API which I found a bit confusing at first.
Here is the example taken from the end of Lesson 2 (Part I v4 2020):
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.3, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))
So, let’s take this apart bit by bit.
The blocks argument specifies the types of the input and output of your data. In this example we are trying to classify images into three categories of bears (grizzly, black, teddy), so the input is an ImageBlock and the output is a CategoryBlock.
The get_items argument specifies a function which returns the data points. Here we use get_image_files, which recurses through directories and returns all files with an image extension (at the time of writing, the list of extensions considered images is quite large: ‘.png’, ‘.ppm’, ‘.ico’, ‘.tif’, ‘.rgb’, ‘.xwd’, ‘.jpg’, ‘.pnm’, ‘.ras’, ‘.xbm’, ‘.svg’, ‘.tiff’, ‘.pbm’, ‘.xpm’, ‘.ief’, ‘.gif’, ‘.pgm’, ‘.bmp’, ‘.jpe’, ‘.jpeg’).
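To make the "recurse and keep image files" idea concrete, here is a minimal sketch in plain Python. The function name find_image_files is hypothetical, and building the extension set from Python's mimetypes table is an assumption for illustration; it is not presented as the library's own code.

```python
import mimetypes
import tempfile
from pathlib import Path

# Assumption: any extension that Python's mimetypes table maps to an
# "image/..." type counts as an image. The exact set varies by Python version.
IMAGE_EXTENSIONS = {
    ext for ext, mime in mimetypes.types_map.items() if mime.startswith("image/")
}

def find_image_files(root):
    """Hypothetical stand-in for get_image_files: recurse and keep image files."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for rel in ["grizzly/a.jpg", "black/b.png", "teddy/notes.txt"]:
        f = base / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    found = [p.relative_to(base).as_posix() for p in find_image_files(base)]

print(found)
```

The text file is skipped because its extension does not map to an image type, while both image files are returned regardless of which subfolder they sit in.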
The get_y argument is also a function, indicating how to extract y (in our case the category of bear) from an item obtained by get_items. In this case, we tell it to use the function parent_label, which gets the name of the parent folder of the item. So for a file bears/grizzly/file1.jpg it will identify the folder grizzly as the parent and hence return grizzly as the label.
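The labelling rule described above fits in a couple of lines of plain Python. This is only a sketch: parent_folder_label is a hypothetical name used here for illustration, not the library function itself.

```python
from pathlib import Path

def parent_folder_label(item):
    """Hypothetical equivalent of parent_label: the name of the containing folder."""
    return Path(item).parent.name

label = parent_folder_label("bears/grizzly/file1.jpg")
print(label)
```

This is why the folder layout matters: the category is never stored anywhere except in the name of the directory that holds each image.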
The splitter argument specifies how to split the data into training and validation sets. Here we use a RandomSplitter, which just assigns entries randomly. Its parameter valid_pct=0.3 instructs DataBlock to put 30% of the items in the validation set (the seed argument allows us to control the randomness, so that we get the same sets every time we run the function).
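The effect of valid_pct and seed can be illustrated with a simplified stand-in written with Python's random module. The function random_split is hypothetical and is not how the library implements its splitter; it only shows the idea of a seeded shuffle followed by a cut.

```python
import random

def random_split(items, valid_pct=0.3, seed=None):
    """Simplified stand-in for a random splitter: returns (train_idxs, valid_idxs)."""
    rng = random.Random(seed)          # private generator, so the seed stays local
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))  # number of validation items
    return idxs[cut:], idxs[:cut]

items = [f"img_{i}.jpg" for i in range(10)]
train_idxs, valid_idxs = random_split(items, valid_pct=0.3, seed=42)

# Same seed, same split: this is what makes experiments repeatable.
again = random_split(items, valid_pct=0.3, seed=42)
print(len(train_idxs), len(valid_idxs), again == (train_idxs, valid_idxs))
```

Note that the splitter works with indices rather than the items themselves, so it never needs to know what kind of data it is splitting.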
The item_tfms argument specifies which transforms need to be applied to each item. Here we use Resize(128), which resizes each image to 128x128 pixels by the cropping method (another method can be specified).
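To see what "resize by cropping" can mean in numbers, here is a small arithmetic sketch. It assumes one common approach (scale the shorter side to the target size, then cut a centred square); the function name is hypothetical, and the library's actual transform may choose the crop position differently.

```python
def resize_then_center_crop_box(width, height, size):
    """Return ((new_w, new_h), (left, top, right, bottom)) for a square crop.

    Assumption: scale the shorter side to `size`, then cut a centred square.
    """
    scale = size / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left, top = (new_w - size) // 2, (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)

# A 400x300 photo squeezed into a 128x128 square.
scaled, box = resize_then_center_crop_box(400, 300, 128)
print(scaled, box)
```

The point to notice is that cropping keeps the aspect ratio of the bear intact but throws away part of the longer side of the image.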
- We could also have specified a get_x argument, a function to extract the x from each item. However, in this case the x is the image itself, so we don’t need to extract anything.
- The most confusing part to me was the line
blocks = (ImageBlock, CategoryBlock)
because, to my intuition, these are not really “blocks” but “types”. My brain would be much happier if this were written as
types = (ImageType, CategoryType)
but I will have to get used to it.
- The constructed DataBlock can be used to produce DataLoaders, as follows:
dls = bears.dataloaders(path)
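Note that here we are passing a path, whereas we didn’t specify a path when creating the bears object. For example, the get_image_files function takes a path argument, but we simply specified the function without that argument. The DataBlock only stores the function; the path given to dataloaders is what gets passed on to get_items at that point.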
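The "functions now, path later" pattern can be sketched with a toy class. Everything here (MiniBlock, list_fake_files, seeded_splitter) is hypothetical and written only for illustration; it is not the fastai implementation, and it returns plain lists of (x, y) pairs rather than real data loaders.

```python
import random
from pathlib import Path

class MiniBlock:
    """Toy illustration only, not the fastai implementation."""

    def __init__(self, get_items, get_y, splitter, get_x=None):
        # Only the *functions* are stored here; nothing is called yet.
        self.get_items = get_items
        self.get_y = get_y
        self.splitter = splitter
        self.get_x = get_x or (lambda item: item)  # default: the item is the x

    def datasets(self, source):
        # The source (for example a path) only arrives here, at call time.
        items = self.get_items(source)
        train_idxs, valid_idxs = self.splitter(items)
        pair = lambda i: (self.get_x(items[i]), self.get_y(items[i]))
        return [pair(i) for i in train_idxs], [pair(i) for i in valid_idxs]

def list_fake_files(source):
    # Hypothetical get_items: pretends `source` contains these files.
    return [Path(source) / kind / f"{n}.jpg"
            for kind in ("grizzly", "black", "teddy") for n in range(4)]

def seeded_splitter(items, valid_pct=0.25, seed=42):
    # Seeded shuffle of the indices, then cut off the validation share.
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]

toy = MiniBlock(get_items=list_fake_files,
                get_y=lambda item: item.parent.name,
                splitter=seeded_splitter)
train, valid = toy.datasets("bears")   # the path is supplied only now
print(len(train), len(valid), valid[0])
```

Seen this way, the object built in the first step is a recipe, and the same recipe can be applied to a different folder simply by passing a different path.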
Please let me know if something is unclear (or incorrect), and if you found this useful!