What are the differences between these classes: `Dataset`, `Datasets`, `DataLoader`, `DataLoaders`, `DataBlock`, `DataFrame`? Here are some of my notes on this:
- `Dataset` is a PyTorch abstract class representing a dataset. A `Dataset` `x` must provide a way to index into it (usually with an integer, e.g. `x[12]`) to get the item corresponding to that key. It also has a length, so you can do `len(x)`.
- `Datasets` is a class which contains a training and a validation dataset. We can pass it an `items` argument, and it will create a tuple from each item in `items` by applying all the transforms provided in the `tfms` argument. You can call `x.dataloaders()` to get a `DataLoaders` for it.
- `DataLoader` is a class whose API is compatible with the PyTorch `DataLoader`, but which provides more functionality. It lets you iterate over a dataset, including mechanisms for batching (using the argument `bs` or `batch_size`).
- `DataLoaders` is a class which wraps `DataLoader` objects for training and validation, accessible via `x.train` and `x.valid` respectively. It allows adding transforms to both (or just one) of the contained dataloaders.
- `DataBlock` is a convenient class for building `Datasets` and `DataLoaders`. It needs the types of input and output, and at least two functions: `get_items` and `splitter` (possibly more to postprocess the results of `get_items`, e.g. `get_x` and `get_y`). You can get a `DataLoaders` using `x.dataloaders(source)`, where `source` could be a path, or a `DataFrame`, etc. The class also has a very useful `x.summary` method to see a rich description of the data.
- `DataFrame` is the main `pandas` class for a table-like data structure, containing labeled axes for rows and columns. Another `DataFrame` class belongs to `cuDF`, a GPU-oriented package, which can convert from a `pandas` `DataFrame`.
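
To make the first three notes a bit more concrete, here is a rough plain-Python sketch of the ideas behind them: indexing plus `len` for a dataset, building a tuple from each item with a list of transforms, and iterating in batches of size `bs`. The names `SquaresDataset`, `make_tuples` and `iter_batches` are made up for illustration and are not part of PyTorch or fastai. The real classes do much more (shuffling, collating into tensors, pipelines of transforms per tuple element, train/valid splits), so treat this as a mental model only.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


class SquaresDataset:
    """Hypothetical dataset-like object: supports x[i] and len(x)."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, i: int) -> Tuple[int, int]:
        if not 0 <= i < self.n:
            raise IndexError(i)
        # Each item is an (input, target) pair.
        return i, i * i


def make_tuples(items: Sequence, tfms: Sequence[Callable]) -> List[tuple]:
    """Simplified 'Datasets' idea: one tuple per item, one element per transform."""
    return [tuple(f(item) for f in tfms) for item in items]


def iter_batches(dataset, bs: int) -> Iterator[list]:
    """Simplified 'DataLoader' idea: yield lists of up to `bs` items, in order."""
    for start in range(0, len(dataset), bs):
        stop = min(start + bs, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


ds = SquaresDataset(5)
item = ds[3]                              # index into the dataset
batches = list(iter_batches(ds, bs=2))    # last batch is smaller
pairs = make_tuples(["a.jpg", "b.png"], [str.upper, len])
```

Here `ds[3]` and `len(ds)` are all that `iter_batches` relies on, which is why anything with `__getitem__` and `__len__` can be batched this way.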
Some video segments and posts where these classes are discussed:
- Lesson 2 - Deep Learning for Coders (2020) - YouTube (bears DataBlock)
- Lesson 3 - Deep Learning for Coders (2020) - YouTube (bears DataBlock)
- The bears DataBlock example, explained
- Lesson 4 - Deep Learning for Coders (2020) - YouTube (batches and DataLoader)
- Lesson 4 - Deep Learning for Coders (2020) - YouTube (DataLoaders class)
- Lesson 4 - Deep Learning for Coders (2020) - YouTube (check/debug DataBlock)
- Lesson 6 - Deep Learning for Coders (2020) - YouTube (recap Datasets, DataLoaders, DataBlock)
- Lesson 8 - Deep Learning for Coders (2020) - YouTube (imdb DataBlock)
Please point out any important additions or omissions (or mistakes)!
On a side note, my brain would be much happier if the classes had the following names. At any rate, that is how I tend to think of them:
- `Dataset` → `DataIndex`
- `DataLoader` → `DataBatcher`
- `DataBlock` → `DataDescription`