The bears DataBlock example, explained
These are some quick notes on the DataBlock API which I found a bit confusing at first.
Here is the example, taken from the end of Lesson 2 (Part I v4 2020):
    # DataBlock and the helpers used below all come from fastai's vision module
    from fastai.vision.all import *

    bears = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=RandomSplitter(valid_pct=0.3, seed=42),
        get_y=parent_label,
        item_tfms=Resize(128))
So, let’s take this apart bit by bit.
- The `blocks` argument specifies the types of the input and output of your data. In this example we are trying to classify images into three categories of bears (grizzly, black, teddy), so the input is `ImageBlock` and the output is `CategoryBlock`.
- The `get_items` argument specifies a function that returns the datapoints. Here we use `get_image_files`, which recurses through directories and returns all files with an image extension (at the time of writing, the list of extensions considered as an image is quite large: ‘.png’, ‘.ppm’, ‘.ico’, ‘.tif’, ‘.rgb’, ‘.xwd’, ‘.jpg’, ‘.pnm’, ‘.ras’, ‘.xbm’, ‘.svg’, ‘.tiff’, ‘.pbm’, ‘.xpm’, ‘.ief’, ‘.gif’, ‘.pgm’, ‘.bmp’, ‘.jpe’, ‘.jpeg’).
- The `get_y` argument is also a function, indicating how to extract y (in our case the category of bear) from an item obtained by `get_items`. In this case, we tell it to use the function `parent_label`, which gets the name of the parent folder of the item. So for a file `bears/grizzly/file1.jpg` it will identify the folder `grizzly` as the parent and hence return `'grizzly'`.
- The `splitter` argument specifies how to split the data into training and validation sets. Here we use a `RandomSplitter`, which just assigns entries randomly. Its parameter `valid_pct=0.3` instructs DataBlock to put 30% of the items in the validation set (the `seed` argument allows us to control the randomness, so that we get the same sets every time we run the function).
- The `item_tfms` argument specifies which transforms need to be applied to each item. Here we use `Resize(128)`, which resizes each image to 128x128 pixels by the cropping method (another method can be specified with the `method` argument).
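To make the `get_y` and `splitter` steps a bit more concrete, here is a rough plain-Python sketch of what they do. It does not use fastai at all: `parent_label_sketch` and `random_split_sketch` are hypothetical stand-ins written for illustration, and the file names are made up. The real `parent_label` and `RandomSplitter` may differ in details (fastai shuffles with PyTorch, for instance, so the actual indices will not match these).

```python
import random
from pathlib import Path


def parent_label_sketch(item):
    """Return the name of the folder that directly contains `item`."""
    return Path(item).parent.name


def random_split_sketch(items, valid_pct=0.3, seed=42):
    """Shuffle the indices with a fixed seed and cut off `valid_pct` of them."""
    rng = random.Random(seed)          # same seed -> same shuffle every run
    idxs = list(range(len(items)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(items))  # number of validation items
    return idxs[cut:], idxs[:cut]      # (training indices, validation indices)


# Made-up file names: 10 images in each of the three bear folders.
items = [f"bears/{kind}/file{i}.jpg"
         for kind in ("grizzly", "black", "teddy")
         for i in range(10)]

labels = [parent_label_sketch(p) for p in items]
train_idx, valid_idx = random_split_sketch(items)

print(parent_label_sketch("bears/grizzly/file1.jpg"))  # grizzly
print(len(train_idx), len(valid_idx))                   # 21 9
```

The point of the seed is visible here: calling `random_split_sketch(items)` a second time returns exactly the same two index lists.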
Notes:
- We could also have specified a `get_x` argument, a function to extract the x from each item. However, in this case the x is the image itself, so we don’t need to extract anything.
- The most confusing part to me was the line

      blocks = (ImageBlock, CategoryBlock)

  because in my intuition these are not really “blocks” but “types”. My brain would be much happier if this was written as

      types = (ImageType, CategoryType)

  but I will have to get used to it.
- The constructed `DataBlock` can be used to produce `DataLoaders`, as follows:

      dls = bears.dataloaders(path)

  Note that here we are passing a `path`, whereas we didn’t specify a path when creating the `bears` object. The `get_image_files` function, for example, takes a `path` argument, but we passed the function itself without calling it; the `DataBlock` calls it later with the path given to `dataloaders`.
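The idea of handing over a function now and supplying the path later can be sketched with a toy class. Everything here is hypothetical and only meant to illustrate the pattern: `ToyBlock`, `list_jpgs` and `folder_name` are invented names, not fastai code, and `list_jpgs` only looks for `.jpg` files.

```python
import tempfile
from pathlib import Path


class ToyBlock:
    """Hypothetical stand-in for DataBlock: stores functions, runs them later."""

    def __init__(self, get_items, get_y):
        self.get_items = get_items
        self.get_y = get_y

    def datasets(self, source):
        # Only at this point is get_items called, with the source we were given.
        items = self.get_items(source)
        return [(item, self.get_y(item)) for item in items]


def list_jpgs(path):
    """Simplified stand-in for get_image_files: recursive search, .jpg only."""
    return sorted(Path(path).rglob("*.jpg"))


def folder_name(item):
    """Simplified stand-in for parent_label."""
    return Path(item).parent.name


# No path is needed to build the template...
toy = ToyBlock(get_items=list_jpgs, get_y=folder_name)

# ...it only arrives when we ask for the data.
with tempfile.TemporaryDirectory() as root:
    for kind in ("grizzly", "teddy"):
        (Path(root) / kind).mkdir()
        (Path(root) / kind / "file1.jpg").touch()
    pairs = toy.datasets(root)
    labels = [y for _, y in pairs]

print(labels)  # ['grizzly', 'teddy']
```

Seen this way, the `bears` object is a reusable recipe: the same template could be pointed at a different folder just by passing another path.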
Please let me know if something is unclear (or incorrect), and if you found this useful!
Cheers