06_multicat Constructing a DataBlock

OrLi · February 8, 2021, 4:46pm

After loading the CSV file into “df”, we define “dsets”:
dsets = dblock.datasets(df)
How do the training set and validation set get defined at this point? What made it happen?
Thanks

ilovescience · February 8, 2021, 8:33pm

If you have a question about what fastai is doing, the best approach is to check the source code. Looking through the code for the DataBlock class and its datasets method, we see that if no splitter was passed to DataBlock, it falls back to a RandomSplitter (the line that starts with "splits ="):

https://github.com/fastai/fastai/blob/45376f13df04ddf72749be25ae8a6dff35859f68/fastai/data/block.py#L108

@classmethod
def from_columns(cls, blocks=None, getters=None, get_items=None, **kwargs):
    if getters is None: getters = L(ItemGetter(i) for i in range(2 if blocks is None else len(L(blocks))))
    get_items = _zip if get_items is None else compose(get_items, _zip)
    return cls(blocks=blocks, getters=getters, get_items=get_items, **kwargs)
def datasets(self, source, verbose=False):
    self.source = source                     ; pv(f"Collecting items from {source}", verbose)
    items = (self.get_items or noop)(source) ; pv(f"Found {len(items)} items", verbose)
    splits = (self.splitter or RandomSplitter())(items)
    pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
    return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
def dataloaders(self, source, path='.', verbose=False, **kwargs):
    dsets = self.datasets(source, verbose=verbose)
    kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
_docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",
             datasets="Create a `Datasets` object from `source`",

Note that RandomSplitter defaults to valid_pct=0.2, i.e. a random 80% train / 20% validation split, which gives the (4009, 1002) split shown in the book. Hope this helps!
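
To make the fallback concrete, here is a minimal sketch in plain Python. It is not fastai's actual code: random_splitter_sketch and make_splits are hypothetical stand-ins I wrote for illustration (the real RandomSplitter shuffles with torch rather than the random module), and 5011 is simply 4009 + 1002 from the split above. It only mimics the two things that matter here: the "splitter or default" pattern and the 20% cut.

```python
import random

def random_splitter_sketch(valid_pct=0.2, seed=None):
    """Hypothetical stand-in for fastai's RandomSplitter (not the library code)."""
    def _inner(items):
        rng = random.Random(seed)
        idxs = list(range(len(items)))
        rng.shuffle(idxs)                   # random order of item indices
        cut = int(valid_pct * len(items))   # size of the validation set
        return idxs[cut:], idxs[:cut]       # (train indices, valid indices)
    return _inner

def make_splits(items, splitter=None):
    # Same fallback pattern as `self.splitter or RandomSplitter()` in `datasets`:
    # with no splitter supplied, the random default is used.
    return (splitter or random_splitter_sketch())(items)

items = list(range(5011))                   # 5011 = 4009 + 1002
train_idx, valid_idx = make_splits(items)   # no splitter passed
print(len(train_idx), len(valid_idx))       # 4009 1002
```

If you want a different split in fastai itself, pass a splitter when building the block, for example DataBlock(..., splitter=RandomSplitter(valid_pct=0.1, seed=42)); without a seed the random split will differ from run to run.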