Fastai v2 code walk-thru 7

Use this wiki topic for walk-thru 7.
Fastai v2 daily code walk-thrus
Fastai v2 chat

Thanks to @pnvijay for these notes:

Notes from Code Walkthrough 7 - Completed

Importing all required libraries to run the notebook locally along with the lecture

from local.torch_basics import *
from local.test import *
from import *
from import *
from import *
from import *
from local.notebook.showdoc import *
from import *
from import *
from import *
from local import *

We are looking at the 06_data_core.ipynb notebook. In the last walkthrough we went through DataSource. The input to a DataSource can be a list, an L, a pandas object or a numpy array, and it uses the accelerated indexing in pandas/numpy to get the subsets. At the end of the last walkthrough there was also a refresher on the tests and on the nuances of what DataSource and TfmdDS do.

Now we will look at the databunch method of DataSource. It returns a DataBunch, which is a thin wrapper around the dataloaders passed to it, as its code shows. The code for the databunch method of DataSource and for DataBunch is shown below.

class DataSource(TfmdDS):
    "Applies a `tfm` to filtered subsets of `items`"
    @delegates(DataLoader.__init__) #(self.dl_cls.__init__)
    def databunch(self, bs=16, val_bs=None, shuffle_train=True, **kwargs):
        n = len(self.filts)-1
        bss = [bs] + [2*bs]*n if val_bs is None else [bs] + [val_bs]*n
        shuffles = [shuffle_train] + [False]*n
        return DataBunch(*[self.dl_cls(self.subset(i), bs=b, shuffle=s, drop_last=s, **kwargs)
                           for i,(b,s) in enumerate(zip(bss, shuffles))])
class DataBunch(GetAttr):
    "Basic wrapper around several `DataLoader`s."
    _xtra = 'one_batch show_batch dataset device'.split()

    def __init__(self, *dls): self.dls,self.default = dls,dls[0]
    def __getitem__(self, i): return self.dls[i]

    train_dl,valid_dl = add_props(lambda i,x: x[i])
    train_ds,valid_ds = add_props(lambda i,x: x[i].dataset)

    _docs=dict(__getitem__="Retrieve `DataLoader` at `i` (`0` is training, `1` is validation)",
              train_dl="Training `DataLoader`",
              valid_dl="Validation `DataLoader`",
              train_ds="Training `Dataset`",
              valid_ds="Validation `Dataset`")

Let’s look at the last line of the databunch code in DataSource.

return DataBunch(*[self.dl_cls(self.subset(i), bs=b, shuffle=s, drop_last=s, **kwargs)
                           for i,(b,s) in enumerate(zip(bss, shuffles))])

Note the use of self.dl_cls. It was added recently so that the class of the dataloaders can be chosen; by default they are of class TfmdDL. This can be seen in the __init__ of DataSource.

class DataSource(TfmdDS):
    "Applies a `tfm` to filtered subsets of `items`"
    def __init__(self, items, tfms=None, filts=None, do_setup=True, dl_cls = TfmdDL):

Each dataloader is then given a subset, a batch size, shuffle and drop_last. The last three depend on whether the subset is for training or for validation.
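To make the per-subset settings concrete, here is a small sketch that reuses the two list expressions from the databunch method above on plain numbers. The helper name plan_loaders is made up for illustration and is not part of the library.

```python
def plan_loaders(n_subsets, bs=16, val_bs=None, shuffle_train=True):
    "Return one (batch_size, shuffle, drop_last) triple per subset."
    n = n_subsets - 1  # number of non-training subsets
    # Same expressions as in `DataSource.databunch`: validation sets get
    # twice the training batch size unless `val_bs` is given.
    bss = [bs] + [2*bs]*n if val_bs is None else [bs] + [val_bs]*n
    shuffles = [shuffle_train] + [False]*n
    # `drop_last` follows `shuffle`, so only the training loader drops the last batch.
    return [(b, s, s) for b, s in zip(bss, shuffles)]

plan = plan_loaders(n_subsets=2, bs=16)
print(plan)  # [(16, True, True), (32, False, False)]
```

So with a training and a validation subset, only the training loader shuffles and drops its last incomplete batch.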

There was a question on the @delegates decorator used inside DataSource. Jeremy mentions that it is normally put before a class to declare that all **kwargs in the __init__ of that class are passed on to the superclass. But we can also apply it to a specific method. Here we use @delegates to declare that the **kwargs of databunch are passed on to DataLoader.__init__. So when we check the signature of .databunch() on a DataSource by pressing shift+tab inside the (), all of those keyword arguments are listed in the signature.

inp = [0,1,2,3,4]
dsrc = DataSource(inp, tfms=[None])
< at 0x1258d3b90>
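The idea behind @delegates can be sketched with the standard inspect module. This is only a simplified stand-in written for these notes, not the library's implementation: delegates_sketch, make_loader and the toy databunch function are all hypothetical names.

```python
import inspect

def delegates_sketch(to):
    "Decorator: list the keyword parameters of `to` in the signature of the decorated function."
    def _inner(f):
        sig = inspect.signature(f)
        params = dict(sig.parameters)
        params.pop('kwargs')  # assumes `f` declares a parameter named **kwargs
        # Borrow every parameter of `to` that has a default and is not already in `f`.
        extra = {name: p.replace(kind=inspect.Parameter.KEYWORD_ONLY)
                 for name, p in inspect.signature(to).parameters.items()
                 if p.default is not inspect.Parameter.empty and name not in params}
        params.update(extra)
        # Only the reported signature changes; `f` itself still accepts **kwargs.
        f.__signature__ = sig.replace(parameters=list(params.values()))
        return f
    return _inner

def make_loader(dataset, bs=1, shuffle=False, num_workers=0):
    return dict(dataset=dataset, bs=bs, shuffle=shuffle, num_workers=num_workers)

@delegates_sketch(make_loader)
def databunch(bs=16, val_bs=None, **kwargs):
    return make_loader('train', bs=bs, **kwargs)

print(inspect.signature(databunch))
# (bs=16, val_bs=None, *, shuffle=False, num_workers=0)
```

This is why shift+tab can show the delegated arguments: the tooltip reads the reported signature, which now lists them instead of a bare **kwargs.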

We go to 08_pets_tutorial.ipynb, where we have items, which are the paths of images, and the tfms that we have defined. We have a filter based on indexes. Then we create a DataSource, and we can take a subset of it. Now if we want to send the data to the GPU, we must create batches. So we create a TfmdDL, for which we can specify after_item transforms and after_batch transforms. We can create batches, check them and show a batch as well.

source = untar_data(URLs.MNIST_TINY)/'train'
items = get_image_files(source)

tfms = [[PILImage.create, ImageResizer(128), ToTensor(), ByteToFloatTensor()],
        [labeller, Categorize()]]

pets = DataSource(items, tfms, filts=split_idx)

x,y = pets.subset(1)[0]

ds_img_tfms = [ImageResizer(128), ToTensor()]
dl_tfms = [Cuda(), ByteToFloatTensor()]

trn_dl = TfmdDL(pets.train, bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)
b = trn_dl.one_batch()

bd = trn_dl.decode_batch(b)
test_eq(len(bd), 9)
test_eq(bd[0][0].shape, (3,128,128))

_,axs = plt.subplots(3,3, figsize=(9,9))

We can do this via databunch as well. Then we can look into the train dataloader and look at one batch. We can also show batch.

dbch = pets.databunch(bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)

There was a question on whether image resizing should be done as part of pre-processing or as a transform. The answer is that if the initial images are really big, they need to be resized as pre-processing. Otherwise the resize is done as part of data augmentation. Here the image is centre cropped and then has to be resized, so it is better to do it as a transform.

There was a question on whether the after_item and after_batch transforms used in TfmdDL are callbacks. The answer is that we can think of them as callbacks.

Now we go to the 01c_dataloader.ipynb notebook, which is a super fun notebook as per Jeremy. There is a FakeLoader, which was made to work around some issues with the PyTorch DataLoader. We then go to the DataLoader class and look at its __iter__ method.

class DataLoader():
    wif=before_iter=after_item=before_batch=after_batch=after_iter = noops
    _methods = 'wif before_iter create_batches sampler create_item after_item before_batch create_batch retain after_batch after_iter'.split()
    def __iter__(self):
        for b in _loaders[self.fake_l.num_workers==0](self.fake_l): yield self.after_batch(b)

So when we iterate, it calls one of the _loaders (mainly the multiprocessing one is used), which yields a batch. Behind this, the sampler is invoked and then create_batches is called. In create_batches there is usually a dataset, so we create an iterator on the dataset and then call self.do_item on each of the samples (their index values) passed in. do_item calls create_item and then after_item. create_item returns the item of the dataset at the given index if there is one; otherwise it returns the next item from the iterator.

By default after_item is noops, so it does nothing. We have the @funcs_kwargs decorator here. It takes all the names listed in _methods and looks for them in the **kwargs of the __init__ call. So we can pass, for example, after_item as a keyword argument, and it replaces the default after_item=noops mentioned earlier. The results of do_item are then used as they are if no batch size is specified. If a batch size is specified, self.do_batch is called over chunked, which groups the results into batches. In do_batch, before_batch is called first; nothing is defined for it here. Then create_batch is called, which collates the items of the batch. The result is sent to after_batch, which by default also does nothing.
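The order of the hooks described above can be sketched in a few lines of plain Python. MiniLoader below is a hypothetical, single-process toy written for these notes: it has no workers, its sampler never shuffles, and create_batch only builds a list where the real loader collates into tensors.

```python
from itertools import islice

def noop(x): return x

def chunked(it, bs):
    "Yield successive lists of up to `bs` items from iterable `it`."
    it = iter(it)
    while True:
        chunk = list(islice(it, bs))
        if not chunk: return
        yield chunk

class MiniLoader:
    "Toy loader that only mirrors the hook order of the walkthrough."
    _methods = 'after_item before_batch after_batch'.split()

    def __init__(self, dataset, bs=None, **kwargs):
        self.dataset, self.bs = dataset, bs
        # Mimic the idea of @funcs_kwargs: any hook named in `_methods`
        # can be replaced by passing a function as a keyword argument.
        for name in self._methods: setattr(self, name, kwargs.pop(name, noop))
        assert not kwargs, f"unexpected arguments: {list(kwargs)}"

    def sampler(self): return range(len(self.dataset))
    def create_item(self, idx): return self.dataset[idx]
    def create_batch(self, items): return list(items)
    def do_item(self, idx): return self.after_item(self.create_item(idx))
    def do_batch(self, items): return self.create_batch(self.before_batch(items))

    def create_batches(self):
        items = map(self.do_item, self.sampler())  # lazy: one item at a time
        return items if self.bs is None else map(self.do_batch, chunked(items, self.bs))

    def __iter__(self):
        for b in self.create_batches(): yield self.after_batch(b)

dl = MiniLoader([1, 2, 3, 4, 5], bs=2,
                after_item=lambda x: x * 10,
                after_batch=lambda b: tuple(b))
print(list(dl))  # [(10, 20), (30, 40), (50,)]
```

Here after_item sees single items, while after_batch sees each collated chunk, which is the distinction the pets tutorial relies on.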

We now go to the 05_data_core.ipynb notebook. TfmdDL is a pretty thin subclass of DataLoader. Here we look at _dl_tfms=('after_item','before_batch','after_batch'), and we create a Pipeline for each of them. TfmdDL also has decode, decode_batch and show_batch. When you call decode, it needs to know which types the decoded values should have. That is handled in the _retain_dl method, which relies on _one_pass. _one_pass runs one mini batch to see what types the batch contains and stores them. The code between _retain_dl and _one_pass is worth taking the time to understand.

Going back to 08_pets_tutorial.ipynb, where we have the after_item and after_batch transforms:

ds_img_tfms = [ImageResizer(128), ToTensor()]
dl_tfms = [Cuda(), ByteToFloatTensor()]
trn_dl = TfmdDL(pets.train, bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)

The after_item transform runs after the item has been picked from the dataset but before it is collated as a mini batch. On the other hand Cuda() and ByteToFloatTensor() will run faster if they are run on a batch. So they are after_batch transforms which run after the mini batch is collated.

Let’s now look at the data block API in 50_data_blocks.ipynb. It sits at number 50 because it has to use things from vision, text etc. Let’s look at the MNIST dataset. Here we have to specify where to get the items, how to split the items and how to get the labels. We also specify the types for x and y. This is because each type knows the transforms needed to create it. So here, for example, it will use PILImageBW.create for the images and Category.create for the labels.

class MNIST(DataBlock):
    types = PILImageBW,Category
    def get_items(self, source): return get_image_files(Path(source))
    def splitter (self, items ): return GrandparentSplitter()(items)
    def get_y (self, item  ): return parent_label(item)

Let’s look at the create method of PILImageBW. PILImageBW is a subclass of PILImage, which is a subclass of PILBase. Let’s see the code for them in 07_vision_core.ipynb.

class PILImageBW(PILImage): _show_args,_open_args = {'cmap':'Greys'},{'mode': 'L'}
class PILImage(PILBase): pass

class PILBase(Image.Image, metaclass=BypassNewMeta):
    default_dl_tfms = ByteToFloatTensor
    _show_args = {'cmap':'viridis'}
    _open_args = {'mode': 'RGB'}
    @classmethod
    def create(cls, fn, **kwargs)->None:
        "Open an `Image` from path `fn`"
        return cls(load_image(fn, **merge(cls._open_args, kwargs)))

    def show(self, ctx=None, **kwargs):
        "Show image using `merge(self._show_args, kwargs)`"
        return show_image(self, ctx=ctx, **merge(self._show_args, kwargs))
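In the excerpt above, create combines the class-level _open_args with any keyword arguments, so a subclass only has to override a dictionary. A minimal sketch of that pattern follows. BaseImage and GreyImage are hypothetical stand-ins that store the options instead of opening a file, and the dictionary unpacking is an assumption standing in for merge.

```python
class BaseImage:
    "Stand-in for a type with class-level default open arguments."
    _open_args = {'mode': 'RGB'}

    def __init__(self, fn, **opts): self.fn, self.opts = fn, opts

    @classmethod
    def create(cls, fn, **kwargs):
        # In this sketch later dicts win, so explicit kwargs override the class defaults.
        return cls(fn, **{**cls._open_args, **kwargs})

class GreyImage(BaseImage):
    # Only the defaults change; `create` is inherited unchanged.
    _open_args = {'mode': 'L'}

img = GreyImage.create('digit.png')
print(type(img).__name__, img.opts)  # GreyImage {'mode': 'L'}
```

Because create is a classmethod, cls is the subclass it was called on, so both the defaults and the returned type follow the subclass.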

As you can see in the code, PILImageBW has _open_args, which override those of PILBase and are used in the create method. PILBase also has default_dl_tfms, which is used automatically as a transform at the data loader stage. Jeremy shows the other default transforms for points, bounding boxes etc. So then we can use the datablock like this:

mnist = MNIST().datasource(untar_data(URLs.MNIST_TINY))

Alternatively, since the DataBlock class has the @funcs_kwargs decorator, we can pass these methods as keyword arguments instead.

mnist = DataBlock(ts=(PILImageBW, Category), 

A lot of the time the form above will be the easiest way to use DataBlock. Let us look at the code of DataBlock.

class DataBlock():
    "Generic container to quickly build `DataSource` and `DataBunch`"
    get_x=get_items=splitter=get_y = None
    _methods = 'get_items splitter get_y get_x'.split()
    def __init__(self, ts=None, **kwargs):
        types = L(getattr(self,'types',(float,float)) if ts is None else ts)
        self.default_type_tfms = types.mapped(
            lambda t: L(getattr(t,'create',None)) + L(getattr(t,'default_type_tfms',None)))
        self.default_ds_tfms = _merge_tfms(ToTensor, *types.attrgot('default_ds_tfms', L()))
        self.default_dl_tfms = _merge_tfms(Cuda    , *types.attrgot('default_dl_tfms', L()))

    def datasource(self, source, type_tfms=None):
        self.source = source
        items = (self.get_items or noop)(source)
        if isinstance(items,tuple):
            items = L(items).zipped()
            labellers = [itemgetter(i) for i in range_of(self.default_type_tfms)]
        else: labellers = [noop] * len(self.default_type_tfms)
        splits = (self.splitter or noop)(items)
        if self.get_x: labellers[0] = self.get_x
        if self.get_y: labellers[1] = self.get_y
        if type_tfms is None: type_tfms = [L() for t in self.default_type_tfms]
        type_tfms = L([self.default_type_tfms, type_tfms, labellers]).mapped_zip(
            lambda tt,tfm,l: L(l) + _merge_tfms(tt, tfm))
        return DataSource(items, tfms=type_tfms, filts=splits)

    def databunch(self, source, type_tfms=None, ds_tfms=None, dl_tfms=None, bs=16, **kwargs):
        dsrc = self.datasource(source, type_tfms=type_tfms)
        ds_tfms = _merge_tfms(self.default_ds_tfms, ds_tfms)
        dl_tfms = _merge_tfms(self.default_dl_tfms, dl_tfms)
        return dsrc.databunch(bs=bs, after_item=ds_tfms, after_batch=dl_tfms, **kwargs)

    _docs = dict(datasource="Create a `Datasource` from `source` with `tfms` and `tuple_tfms`",
                 databunch="Create a `DataBunch` from `source` with `tfms`")

As you can see here, the methods listed in _methods can be replaced by what we send in **kwargs to __init__. The DataBlock sets up three kinds of default transforms: type transforms, dataset transforms and dataloader transforms. They are then applied as necessary in the databunch method. We can also pass specific dataset and dataloader transforms to the databunch method. These are merged with the defaults and de-duplicated by the _merge_tfms function.

def _merge_tfms(*tfms):
    "Group the `tfms` in a single list, removing duplicates (from the same class) and instantiating"
    g = groupby(concat(*tfms), lambda o:
        o if isinstance(o, type) else o.__qualname__ if (isfunction(o) or ismethod(o)) else o.__class__)
    return L(v[-1] for k,v in g.items()).mapped(instantiate)
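To make the de-duplication concrete, here is a simplified sketch using only plain lists and dicts. merge_tfms_sketch and the two toy transform classes are hypothetical. Unlike the function above, the sketch only accepts lists (or None), not bare transforms.

```python
from inspect import isfunction, ismethod

def merge_tfms_sketch(*tfm_lists):
    "Flatten lists of transforms, keep the last one per class, and instantiate bare classes."
    groups = {}
    for tfm in (t for lst in tfm_lists for t in (lst or [])):
        if isinstance(tfm, type): key = tfm                       # a class stands for itself
        elif isfunction(tfm) or ismethod(tfm): key = tfm.__qualname__
        else: key = tfm.__class__                                 # an instance is keyed by its class
        groups[key] = tfm  # a later transform of the same class replaces an earlier one
    return [t() if isinstance(t, type) else t for t in groups.values()]

class ToTensor: pass
class Resize:
    def __init__(self, size=64): self.size = size

# Defaults first, user-supplied transforms second, and a missing list is ignored.
merged = merge_tfms_sketch([ToTensor, Resize], [Resize(128)], None)
print([type(t).__name__ for t in merged], merged[1].size)
# ['ToTensor', 'Resize'] 128
```

The user's Resize(128) replaces the default Resize because both map to the same class key, while ToTensor is instantiated from the bare class.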

An ordering key can also be passed so that the transforms end up in a sensible order. There was a question on data augmentations. The answer is that we haven’t looked at them yet, but they are transforms that we will pass to the databunch method.

The databunch method needs a DataSource, which the datasource method defined in the code creates. It takes a source and gets the items from it. If the items come back as a tuple, it converts them to an L and zips them.

The code then splits the items as required. If get_x or get_y is defined, it is used as the labeller for x or y. If type transforms are passed in, they are added to the default type transforms defined here; otherwise only the default type transforms and labellers are used. Finally, a DataSource is created from all of this.
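The tuple branch of datasource can be sketched with built-in containers. The function below is a hypothetical stand-in for these notes: plain zip and lists are used in place of L and zipped, and the file names and tags are invented.

```python
from operator import itemgetter

def items_and_labellers(items, n_types=2):
    "Mimic the tuple branch of `datasource` with plain Python containers."
    if isinstance(items, tuple):
        items = list(zip(*items))                        # columns -> one row per item
        labellers = [itemgetter(i) for i in range(n_types)]
    else:
        labellers = [lambda o: o] * n_types              # every labeller sees the raw item
    return items, labellers

# Two "columns", as a get_items function might return them: image names and tag lists.
columns = (['a.jpg', 'b.jpg'], [['clear'], ['haze', 'water']])
items, (get_x, get_y) = items_and_labellers(columns)
print(items[1], get_x(items[1]), get_y(items[1]))
# ('b.jpg', ['haze', 'water']) b.jpg ['haze', 'water']
```

After zipping, each item is a row, and the labellers simply pick out position 0 for x and position 1 for y.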

Play with the examples in the DataBlock notebook to understand it better: the MNIST example, the Pets example, multi-label classification and others. The Pets example defines data augmentation transforms, which we will see later. Let’s look at the multi-label classification example from planet. There are many ways to go about creating the datablock and databunch.

In the first way we use a pandas dataframe, create get_x and get_y, and pass a splitter. get_x says that the image name is in x[0] and get_y says that the labels are in x[1]. To databunch we pass df.values, which is a numpy array. This is not an elegant way as per Jeremy.

planet_source = untar_data(URLs.PLANET_TINY)
df = pd.read_csv(planet_source/"labels.csv")

planet = DataBlock(ts=(PILImage, MultiCategory),
                   get_x=lambda x:planet_source/"train"/f'{x[0]}.jpg',
                   splitter=RandomSplitter(),
                   get_y=lambda x:x[1].split(' '))

dbunch = planet.databunch(df.values, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))

In the second way, we pass as get_items a function that uses pandas-specific methods to retrieve the images and labels. To databunch we pass the dataframe itself.

planet_source = untar_data(URLs.PLANET_TINY)
df = pd.read_csv(planet_source/"labels.csv")

def _planet_items(x): return (
    f'{planet_source}/train/'+x.image_name+'.jpg', x.tags.str.split())

planet = DataBlock(ts=(PILImage,MultiCategory),
                   get_items = _planet_items,
                   splitter = RandomSplitter())

dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))

In the second way, the items returned are a tuple. In this case, as explained above, the items are converted to an L and zipped. There are a couple of other ways described as well. In the third way, we subclass DataBlock and use staticmethod to make RandomSplitter a static method.

class PlanetDataBlock(DataBlock):
    types = PILImage,MultiCategory
    splitter = staticmethod(RandomSplitter())
    def get_items(self, x): return (
        f'{planet_source}/train/' + x.image_name + '.jpg', x.tags.str.split())
planet = PlanetDataBlock()
dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))

The best way is the fourth way that is mentioned here.

planet = DataBlock(ts=(PILImage,MultiCategory),
                   get_x = lambda o:f'{planet_source}/train/'+o.image_name+'.jpg',
                   get_y = lambda o:o.tags.split(),
                   splitter = RandomSplitter())

dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))

There was a question on map and mapped, and whether they are used to parallelize things. mapped is an L-specific method, while map is used a lot in Python. We use map as a lazy generator. It would be good to understand both, and the DataLoader in v2 would be a good place to start. Jeremy goes on to explain map, which can be understood from the code below. We define t to be the values from 0 to 9 and map the negation function operator.neg over t. Calling map returns a map object, which is a lazy iterator. The values are computed only when we iterate over it, for example by turning it into a list.

import operator

t = range(10)
map(operator.neg, t)             # <map at 0x1258f4d50>
t2 = list(map(operator.neg, t))
t2                               # [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
t3 = map(lambda o: o + 100, t2)
t3                               # <map at 0x125907290>
list(t3)                         # [100, 99, 98, 97, 96, 95, 94, 93, 92, 91]

Jeremy says that he will not go through segmentation or object detection. But people who are interested can go through and ask any questions. In the next walkthrough, Jeremy mentions that we will go through tabular data.


For an image segmentation task, is there a way to display image labels when visualising in Datablock? (besides just going deep and modifying show method in a Image class and decode method in Opener transform?).

x - is image paths
y - masks paths
but we also have labels of the objects


Not at the moment, but I agree it would be a nice feature to add. Probably as a color coded legend. IIRC there was an example of this in one of the lessons.
