Core usage of DataBlock and DataLoader

yonatan365 · April 19, 2020, 10:28am

In the course, we see many really useful methods for loading data, such as ImageDataLoaders.from_name_func.

However, I often need a custom loader for my data. For example, if the file names indicate the classes and the parent folder indicates train/valid/test, I don’t know how to build a dataloader for a learner (I have also asked about this elsewhere).

It would be useful to see how the basic building blocks can be used to do, manually, some of the things the helper methods do. That way I (and others) could use these building blocks to solve the custom requirements that come up quite often.

I tried looking at the source code, but it is often not clear enough for me. I need an explanation of the “philosophy”: what is a data block, what is a data loader, how to create a data loader under various circumstances, and so on.

If anyone can help with it, it will surely help me, and maybe others.

Thanks!

yonatan365 · April 19, 2020, 12:10pm

I actually found a very useful resource for working out the basic data API functionality:
http://dev.fast.ai/tutorial.pets

machinethink · April 19, 2020, 1:51pm

Dataset and DataLoader are core PyTorch concepts.

A Dataset is just an object that returns a length (the number of examples in your dataset) and the object at a given index (from 0 to length - 1). That object can be anything. Usually it’s an image and its label, but if the examples in your dataset consist of 3 images and 7 labels, your Dataset’s __getitem__ would return a tuple of those 10 things.
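To make that concrete, here is a minimal sketch of such an object in plain Python. The class name and the placeholder "images" (just strings) are made up for illustration; in real code you would typically subclass torch.utils.data.Dataset and return tensors, but the two methods are the whole contract.

```python
class TripleImageDataset:
    """Hypothetical dataset where each example is 3 images and 7 labels.

    The "images" are placeholder strings so the sketch runs without any
    image files or libraries.
    """

    def __init__(self, n_examples):
        self.n_examples = n_examples

    def __len__(self):
        # Number of examples in the dataset.
        return self.n_examples

    def __getitem__(self, idx):
        # Valid indices run from 0 to len(self) - 1.
        if not 0 <= idx < self.n_examples:
            raise IndexError(idx)
        images = tuple(f"image_{idx}_{k}" for k in range(3))
        labels = tuple((idx + k) % 2 for k in range(7))
        # One flat tuple of 10 things: 3 images followed by 7 labels.
        return images + labels


ds = TripleImageDataset(5)
example = ds[0]
```

Here len(ds) gives the number of examples, and ds[0] is a 10-element tuple.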

A DataLoader is a built-in PyTorch object (although you can create your own) that grabs a batch of examples from a Dataset and combines them into tensors. It does this as efficiently as possible, for example by using multiple worker processes so that several batches can be loaded in parallel.

The DataLoader uses a sampler to decide which examples to pick for each batch. A DataLoader used for training typically uses a random sampler, while a DataLoader for evaluation will use a sequential sampler.

The DataLoader also uses a collate function to combine the individual examples returned by your Dataset into PyTorch tensors for the batch. You can usually use the default collate function, but if, say, your Dataset returns 3 images and 7 labels for each example, you might need a custom collate function to turn these into tensors.
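As a rough illustration of how the sampler and the collate function fit together, here is a plain-Python sketch of the idea. This is not PyTorch's implementation: all function names are made up, there are no worker processes, and the collate step just groups fields into lists where PyTorch's default would stack them into tensors.

```python
import random


def sequential_sampler(n):
    # Evaluation-style sampler: indices in their original order.
    return list(range(n))


def random_sampler(n, seed=None):
    # Training-style sampler: a random permutation of the indices.
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    return idxs


def collate(samples):
    # Turn a list of (x, y) examples into (list of x, list of y).
    # A real collate function would build tensors here instead of lists.
    return tuple(list(field) for field in zip(*samples))


def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    n = len(dataset)
    idxs = random_sampler(n, seed) if shuffle else sequential_sampler(n)
    for start in range(0, n, batch_size):
        batch_idxs = idxs[start:start + batch_size]
        yield collate([dataset[i] for i in batch_idxs])


# Anything with __len__ and __getitem__ works as the dataset, even a list.
toy = [(f"img{i}", i % 2) for i in range(5)]
batches = list(iterate_batches(toy, batch_size=2))
shuffled = list(iterate_batches(toy, batch_size=2, shuffle=True, seed=0))
```

With 5 examples and a batch size of 2, the last batch holds a single example. The shuffled run covers the same examples in a different order.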

You only have to provide an implementation for the Dataset. The DataLoader, sampler, and collate functions are built into PyTorch already (unless you want to use your own, in which case you can override them).

The DataBlock is a fastai-specific thing that makes it easy to create all the datasets and loaders that you need. It’s provided just for convenience and saves you writing a bunch of this code yourself.
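For the case in the original question (class in the file name, parent folder giving the split), a DataBlock might look roughly like the sketch below. The folder layout (data/train/cat_001.jpg, data/valid/dog_17.jpg), the folder names, and the helper functions label_from_name, is_valid and build_dataloaders are assumptions made up for this example, so adjust them to your data. The fastai imports sit inside the function only so the helpers can be tried without fastai installed.

```python
from functools import partial
from pathlib import Path


def label_from_name(path):
    # Assumed naming scheme: "<class>_<number>.<ext>", e.g. "cat_001.jpg".
    return Path(path).stem.rsplit("_", 1)[0]


def is_valid(path):
    # Assumed layout: the parent folder is "train" or "valid".
    return Path(path).parent.name == "valid"


def build_dataloaders(root, bs=64):
    # Imported here so the helpers above can be used without fastai.
    from fastai.vision.all import (
        CategoryBlock, DataBlock, FuncSplitter, ImageBlock, Resize,
        get_image_files,
    )

    dblock = DataBlock(
        blocks=(ImageBlock, CategoryBlock),  # image input, single-label target
        # Only look in train/ and valid/, so a test/ folder is left out.
        get_items=partial(get_image_files, folders=["train", "valid"]),
        splitter=FuncSplitter(is_valid),     # True means validation set
        get_y=label_from_name,
        item_tfms=Resize(224),
    )
    return dblock.dataloaders(root, bs=bs)


# Usage (needs fastai and image files on disk):
# dls = build_dataloaders("data")
# dls.show_batch()
```

Named functions are used instead of lambdas so that the resulting objects can be pickled later if needed.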

yonatan365 · April 19, 2020, 1:55pm

Thanks @machinethink,

I realize I wasn’t accurate in the title - I meant the core usage of datablock and dataloader. It’s corrected now.

But I really found what I was looking for in the link I mentioned above and also in the preceding tutorial: http://dev.fast.ai/tutorial.datablock.

Thanks again for your reply!