DataWhat? Comparing Dataset(s), DataLoader(s), DataBlock, DataFrame

What are the differences between these classes: Dataset, Datasets, DataLoader, DataLoaders, DataBlock, DataFrame? Here are some of my notes on this:

  • Dataset is a PyTorch abstract class representing a dataset. A Dataset x must provide a way to index into it (usually with an integer, e.g. x[12]) to get the item corresponding to that key. It also has a length, so you can do len(x).
  • Datasets is a class which contains a training and a validation dataset. We can pass it an items argument, and it will create a tuple from each item in items by applying all the transforms provided in the tfms argument. You can call x.dataloaders() to get a DataLoaders for it.
  • DataLoader is a class whose API is compatible with the PyTorch DataLoader, but which provides more functionality. It lets you iterate over a dataset and includes mechanisms for batching (using the argument bs or batch_size).
  • DataLoaders is a class which wraps DataLoader objects for training and validation, accessible via x.train and x.valid respectively. It lets you add transforms to both (or just one) of the contained dataloaders.
  • DataBlock is a convenient class for building Datasets and DataLoaders. It needs the types of input and output, and at least two functions: get_items and splitter (possibly more to postprocess the results of get_items). You can get a DataLoaders using x.dataloaders(source), where source could be a path, or a DataFrame, etc. The class also has a very useful x.summary method to see a rich description of the data.
  • DataFrame is the main pandas class for a table-like data structure, containing labeled axes for rows and columns. cuDF, a GPU-oriented package, has its own DataFrame class, which can be created from a pandas DataFrame.
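
To make the Dataset and DataLoader bullets a bit more concrete, here is a minimal plain-Python sketch of the two ideas: something you can index and take the length of, and something that walks over it in batches. This is not the PyTorch or fastai code, only an illustration of the protocol; the names SquaresDataset and iter_batches are made up for this example, and the real DataLoader does much more (shuffling, collating into tensors, worker processes).

```python
class SquaresDataset:
    """Hypothetical map-style dataset: item i is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Supports len(x), as in the Dataset bullet.
        return self.n

    def __getitem__(self, i):
        # Supports x[12]-style integer indexing.
        if not 0 <= i < self.n:
            raise IndexError(i)
        return (i, i * i)


def iter_batches(ds, bs):
    """Toy stand-in for a DataLoader: yield lists of up to `bs` consecutive items."""
    for start in range(0, len(ds), bs):
        stop = min(start + bs, len(ds))
        yield [ds[i] for i in range(start, stop)]


ds = SquaresDataset(10)
batches = list(iter_batches(ds, bs=4))

print(len(ds), ds[3])             # 10 (3, 9)
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is shorter because 10 is not a multiple of 4; the sketch simply keeps it rather than dropping it.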

Some video segments and posts where these classes are discussed:

Please point out any important additions or omissions (or mistakes)!

On a side note, my brain would be much happier if the classes had the following names; at any rate, that is how I tend to think of them:

Dataset → DataIndex
DataLoader → DataBatcher
DataBlock → DataDescription