Use this wiki topic for walk-thru 7.
Fastai v2 daily code walk-thrus
Fastai v2 chat
Thanks to @pnvijay for these notes:
Notes from Code Walkthrough 7 - Completed
Import all the required libraries to run the notebook locally along with the lecture:
from local.torch_basics import *
from local.test import *
from local.data.load import *
from local.data.transform import *
from local.data.pipeline import *
from local.data.external import *
from local.notebook.showdoc import *
from local.data.all import *
from local.data.core import *
from local.vision.core import *
from local import *
We are looking at the 06_data_core.ipynb notebook. In the last lecture we went through DataSource. The input to a DataSource can be a list, an L, a pandas DataFrame, or a numpy array; it will use the accelerated indexing in pandas/numpy to get the subsets. There was a refresher on the tests and the nuances of what DataSource and TfmdDS do, which we saw in the last part of the previous walkthrough.
Now we will look at databunch in DataSource. The databunch method returns a DataBunch, which takes in dataloaders, as the code shows. The code for the databunch method of DataSource and for DataBunch is shown below.
class DataSource(TfmdDS):
    "Applies a `tfm` to filtered subsets of `items`"
    @delegates(DataLoader.__init__) #(self.dl_cls.__init__)
    def databunch(self, bs=16, val_bs=None, shuffle_train=True, **kwargs):
        n = len(self.filts)-1
        bss = [bs] + [2*bs]*n if val_bs is None else [bs] + [val_bs]*n
        shuffles = [shuffle_train] + [False]*n
        return DataBunch(*[self.dl_cls(self.subset(i), bs=b, shuffle=s, drop_last=s, **kwargs)
                           for i,(b,s) in enumerate(zip(bss, shuffles))])

class DataBunch(GetAttr):
    "Basic wrapper around several `DataLoader`s."
    _xtra = 'one_batch show_batch dataset device'.split()
    def __init__(self, *dls): self.dls,self.default = dls,dls[0]
    def __getitem__(self, i): return self.dls[i]
    train_dl,valid_dl = add_props(lambda i,x: x[i])
    train_ds,valid_ds = add_props(lambda i,x: x[i].dataset)
    _docs=dict(__getitem__="Retrieve `DataLoader` at `i` (`0` is training, `1` is validation)",
               train_dl="Training `DataLoader`",
               valid_dl="Validation `DataLoader`",
               train_ds="Training `Dataset`",
               valid_ds="Validation `Dataset`")
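Since DataBunch inherits from fastcore's GetAttr, any attribute in _xtra that is missing on the DataBunch itself is looked up on self.default, i.e. the training DataLoader. A minimal sketch of that delegation pattern, assuming current fastcore (where GetAttr lives in fastcore.basics; DataBunch additionally restricts the delegated names with _xtra):

from fastcore.basics import GetAttr

class Wrapper(GetAttr):
    _default = 'default'                  # name of the attribute to delegate to
    def __init__(self, o): self.default = o

w = Wrapper([1, 2, 2, 3])
print(w.count(2))                         # 2 -- `count` is delegated to the wrapped list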
Let’s look at the last line of the databunch code in DataSource.
return DataBunch(*[self.dl_cls(self.subset(i), bs=b, shuffle=s, drop_last=s, **kwargs)
for i,(b,s) in enumerate(zip(bss, shuffles))])
self.dl_cls was added recently to ensure that the dataloaders created are of class TfmdDL by default. This can be seen in the __init__ of DataSource.
class DataSource(TfmdDS):
    "Applies a `tfm` to filtered subsets of `items`"
    def __init__(self, items, tfms=None, filts=None, do_setup=True, dl_cls=TfmdDL):
Each dataloader then takes in a subset, a batch size, shuffle, and drop_last; these depend on whether it is for training or validation.
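To make the bss/shuffles lines concrete, here is what they evaluate to for the usual two subsets (train and valid) with the defaults, following the code above:

bs, val_bs, shuffle_train = 16, None, True   # defaults from the signature above
n = 2 - 1                                    # len(self.filts)-1 with train+valid filters
bss = [bs] + [2*bs]*n if val_bs is None else [bs] + [val_bs]*n
shuffles = [shuffle_train] + [False]*n
print(bss, shuffles)                         # [16, 32] [True, False]

So the validation loader gets double the batch size and no shuffling.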
There was a question on the @delegates decorator. Jeremy mentions that it is normally put before a class to indicate that all **kwargs in the __init__ of that class are passed on to the superclass, but it can also be applied to a specific method of the class. Here we are using @delegates on databunch so that the **kwargs are passed on to DataLoader.__init__. So when you check the signature of .databunch() on a DataSource by pressing Shift+Tab inside the parentheses, all the **kwargs are listed.
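A minimal sketch of what @delegates does to a signature, assuming fastcore's delegates (fastcore.meta in current releases); the classes here are made up for illustration:

from inspect import signature
from fastcore.meta import delegates

class A:
    def __init__(self, x=1, y=2): self.x,self.y = x,y

class B:
    @delegates(A.__init__)
    def make_a(self, z=3, **kwargs):
        return A(**kwargs)            # kwargs are forwarded to A

print(signature(B.make_a))            # **kwargs is replaced by x and y in the signature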
inp = [0,1,2,3,4]
dsrc = DataSource(inp, tfms=[None])
dsrc.databunch()
<local.data.core.DataBunch at 0x1258d3b90>
We go to the 08_pets_tutorial.ipynb notebook, where we have items which are paths to images, tfms that we have defined, and filters based on indices. From these we create a DataSource, which we can then subset into. Now if we want to send the data to the GPU, we must create batches, so we use TfmdDL. We can specify after_item and after_batch transforms, create batches, check them, and show them as well.
source = untar_data(URLs.MNIST_TINY)/'train'
items = get_image_files(source)
tfms = [[PILImage.create, ImageResizer(128), ToTensor(), ByteToFloatTensor()],
        [labeller, Categorize()]]
pets = DataSource(items, tfms, filts=split_idx)
x,y = pets.subset(1)[0]
pets.show((x,y));
ds_img_tfms = [ImageResizer(128), ToTensor()]
dl_tfms = [Cuda(), ByteToFloatTensor()]
trn_dl = TfmdDL(pets.train, bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)
b = trn_dl.one_batch()
bd = trn_dl.decode_batch(b)
test_eq(len(bd), 9)
test_eq(bd[0][0].shape, (3,128,128))
_,axs = plt.subplots(3,3, figsize=(9,9))
trn_dl.show_batch(ctxs=axs.flatten())
We can do this via databunch as well; then we can look into the train dataloader, grab one batch, and show it.
dbch = pets.databunch(bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)
dbch.train_dl.one_batch()
dbch.train_dl.show_batch()
There was a question on whether image resizing should be done as pre-processing or as a transform. The answer is that if the initial images are really big, they should be resized in pre-processing; otherwise the resize is done as part of data augmentation. Here the image is centre cropped and then resized, so it is better done as a transform.
There was a question on whether the after_item and after_batch transforms used in TfmdDL are callbacks; the answer is that we can think of them as callbacks. Now we go to the 01c_dataloader.ipynb notebook, a super fun notebook as per Jeremy. There is a FakeLoader, which has been made to work around some issues with the PyTorch DataLoader. Now we go to the DataLoader class and look at the __iter__ method.
class DataLoader():
    wif=before_iter=after_item=before_batch=after_batch=after_iter = noops
    _methods = 'wif before_iter create_batches sampler create_item after_item before_batch create_batch retain after_batch after_iter'.split()

    def __iter__(self):
        self.before_iter()
        for b in _loaders[self.fake_l.num_workers==0](self.fake_l): yield self.after_batch(b)
        self.after_iter()
So when we iterate, it indexes into _loaders (PyTorch's multiprocessing loader, or the single-process one when num_workers is 0) and yields the batches. Then we invoke the sampler and create batches. In create_batches of the DataLoader, usually a dataset has been passed, so we create an iterator over the dataset and then call self.do_item. create_batches has the samples (their index values) passed along. do_item calls create_item and then after_item. create_item indexes into the dataset with a sample's index value if there is a sample; otherwise it returns the next element of the iterator.
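That last branch can be sketched as follows (a simplified paraphrase of the behaviour described above, not the exact source):

def create_item(self, s):
    # with a sample index, index into the dataset; with no index (streaming
    # datasets), just pull the next element from the iterator
    return self.dataset[s] if s is not None else next(self.it)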
By default after_item == noops, so it does nothing. We have the @funcs_kwargs decorator here; what it does is take all the strings specified in _methods and look for them in the **kwargs of the __init__ call. So we can pass after_item, for example, in the **kwargs and thereby replace the after_item == noops mentioned earlier.
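A minimal sketch of @funcs_kwargs, assuming fastcore's version (fastcore.meta in current releases); the class and method names are made up:

from fastcore.meta import funcs_kwargs

@funcs_kwargs
class Loader:
    _methods = ['after_item']              # names that may be overridden via **kwargs
    def __init__(self, **kwargs): pass     # must accept **kwargs for the decorator
    def after_item(self): return 'noop'    # default behaviour

print(Loader().after_item())                         # 'noop'
print(Loader(after_item=lambda: 'hi').after_item())  # 'hi' -- no `self` by default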
Returning to the flow: the result of create_item is taken as-is if no batch size is specified. If a batch size is specified, then self.do_batch is called over chunked; chunked takes the items and groups them into batch-sized chunks. In do_batch, before_batch is called first (nothing is defined for it here), then create_batch is called, which collates the items into a batch. The result is then sent to after_batch, which by default also does nothing.
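The idea behind chunked can be sketched like this (a hypothetical re-implementation, not the fastai source):

from itertools import islice

def chunked(it, chunk_sz):
    "Yield successive lists of up to `chunk_sz` items from `it`"
    it = iter(it)
    while True:
        chunk = list(islice(it, chunk_sz))
        if not chunk: return
        yield chunk

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]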
We now go to the 05_data_core.ipynb notebook. TfmdDL is a pretty thin subclass of DataLoader. Here we look at _dl_tfms=('after_item','before_batch','after_batch'), and we create Pipelines for them. TfmdDL has decode, decode_batch and show_batch. When you call decode, it needs to know the types the decoded values should have; that is handled by the _retain_dl method, which works by calling _one_pass. _one_pass runs one mini-batch to see what types the batch contains and then retains them. This part of the code, between _retain_dl and _one_pass, is good to try to understand.
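The gist, as a rough sketch (a hypothetical simplification, not the actual source): pull a single batch, remember its types, and use them later to re-wrap plain tensors after decoding so that semantic types survive collation.

def one_pass(dl):
    b = dl.one_batch()                  # run one mini-batch through the pipelines
    types = [type(o) for o in b]        # e.g. [TensorImage, TensorCategory]
    return types                        # decode can later re-wrap plain tensors in these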
Going back to 08_pets_tutorial.ipynb, where we have the after_item and after_batch transforms:
ds_img_tfms = [ImageResizer(128), ToTensor()]
dl_tfms = [Cuda(), ByteToFloatTensor()]
trn_dl = TfmdDL(pets.train, bs=9, after_item=ds_img_tfms, after_batch=dl_tfms)
The after_item transforms run after an item has been picked from the dataset but before it is collated into a mini-batch. Cuda() and ByteToFloatTensor(), on the other hand, run faster on a whole batch at once, so they are after_batch transforms, which run after the mini-batch is collated.
Let's now look at data blocks, in the 50_data_blocks.ipynb notebook. It is numbered 50 because it has to use stuff from vision, text, etc. Let's look at the MNIST dataset. Here we have to specify where to get the items, how to split them, and how to get the labels. We also specify the types for x and y, because the transforms needed to produce each type hang off the types themselves: here it will use PILImageBW.create for the images and Category.create for the labels.
class MNIST(DataBlock):
    types = PILImageBW,Category
    def get_items(self, source): return get_image_files(Path(source))
    def splitter (self, items ): return GrandparentSplitter()(items)
    def get_y    (self, item  ): return parent_label(item)
Let’s look at the create method of PILImageBW. It is a subclass of PILImage, which is a subclass of PILBase. Let's see the code for them in 07_vision_core.ipynb.
class PILImageBW(PILImage): _show_args,_open_args = {'cmap':'Greys'},{'mode': 'L'}

class PILImage(PILBase): pass

class PILBase(Image.Image, metaclass=BypassNewMeta):
    default_dl_tfms = ByteToFloatTensor
    _show_args = {'cmap':'viridis'}
    _open_args = {'mode': 'RGB'}
    @classmethod
    def create(cls, fn, **kwargs)->None:
        "Open an `Image` from path `fn`"
        return cls(load_image(fn, **merge(cls._open_args, kwargs)))
    def show(self, ctx=None, **kwargs):
        "Show image using `merge(self._show_args, kwargs)`"
        return show_image(self, ctx=ctx, **merge(self._show_args, kwargs))
As you can see in the code, PILImageBW has its own _open_args, which overrides PILBase's and is used in the create method. Also, PILBase has default_dl_tfms, which is used automatically as a transform at the data loader stage. Jeremy shows other default transforms, for points, bounding boxes, etc. So then we can create the datablock like this:
mnist = MNIST().datasource(untar_data(URLs.MNIST_TINY))
Alternatively, since the DataBlock class has the @funcs_kwargs decorator, we can pass these along as keyword arguments as well.
mnist = DataBlock(ts=(PILImageBW, Category),
                  get_items=get_image_files,
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)
A lot of the time this will be the easiest way to use DataBlock. Let us look at the code of DataBlock.
@docs
@funcs_kwargs
class DataBlock():
    "Generic container to quickly build `DataSource` and `DataBunch`"
    get_x=get_items=splitter=get_y = None
    _methods = 'get_items splitter get_y get_x'.split()
    def __init__(self, ts=None, **kwargs):
        types = L(getattr(self,'types',(float,float)) if ts is None else ts)
        self.default_type_tfms = types.mapped(
            lambda t: L(getattr(t,'create',None)) + L(getattr(t,'default_type_tfms',None)))
        self.default_ds_tfms = _merge_tfms(ToTensor, *types.attrgot('default_ds_tfms', L()))
        self.default_dl_tfms = _merge_tfms(Cuda    , *types.attrgot('default_dl_tfms', L()))

    def datasource(self, source, type_tfms=None):
        self.source = source
        items = (self.get_items or noop)(source)
        if isinstance(items,tuple):
            items = L(items).zipped()
            labellers = [itemgetter(i) for i in range_of(self.default_type_tfms)]
        else: labellers = [noop] * len(self.default_type_tfms)
        splits = (self.splitter or noop)(items)
        if self.get_x: labellers[0] = self.get_x
        if self.get_y: labellers[1] = self.get_y
        if type_tfms is None: type_tfms = [L() for t in self.default_type_tfms]
        type_tfms = L([self.default_type_tfms, type_tfms, labellers]).mapped_zip(
            lambda tt,tfm,l: L(l) + _merge_tfms(tt, tfm))
        return DataSource(items, tfms=type_tfms, filts=splits)

    def databunch(self, source, type_tfms=None, ds_tfms=None, dl_tfms=None, bs=16, **kwargs):
        dsrc = self.datasource(source, type_tfms=type_tfms)
        ds_tfms = _merge_tfms(self.default_ds_tfms, ds_tfms)
        dl_tfms = _merge_tfms(self.default_dl_tfms, dl_tfms)
        return dsrc.databunch(bs=bs, after_item=ds_tfms, after_batch=dl_tfms, **kwargs)

    _docs = dict(datasource="Create a `Datasource` from `source` with `tfms` and `tuple_tfms`",
                 databunch="Create a `DataBunch` from `source` with `tfms`")
As you can see here, the _methods specified can be replaced by what we send in **kwargs to the __init__. The DataBlock works with three kinds of transforms: type transforms, dataset transforms, and data loader transforms. They are added as necessary in the databunch method. We can also pass specific dataset and data loader transforms into the databunch method; these are merged with the defaults, and duplicates are removed. This is achieved by the _merge_tfms function.
def _merge_tfms(*tfms):
    "Group the `tfms` in a single list, removing duplicates (from the same class) and instantiating"
    g = groupby(concat(*tfms), lambda o:
        o if isinstance(o, type) else o.__qualname__ if (isfunction(o) or ismethod(o)) else o.__class__)
    return L(v[-1] for k,v in g.items()).mapped(instantiate)
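Following the code above, if the same transform class appears both in the defaults and in what the user passes, only the last occurrence of each is kept and then instantiated. A hedged usage sketch (ToTensor and Cuda stand in for any transform classes):

tfms = _merge_tfms([ToTensor, Cuda], [Cuda()])
# -> one ToTensor instance and one Cuda instance: the user's Cuda() wins over
#    the default Cuda class, because v[-1] keeps the last member of each group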
Transforms carry an order key that can be used to make sure they are applied in a sensible order. There was a question on data augmentations; the answer is that we haven't looked at them yet, but they are transforms that we will pass in the databunch method. The databunch needs a datasource, and a datasource method is defined in the code: it takes a source and gets items from it. If get_items returns a tuple, the items are converted to an L and zipped.
The code then splits the items as required. If a get_x or get_y is defined, it replaces the corresponding labeller. If type transforms are passed in, they are merged with the default type transforms defined here; otherwise only the default type transforms and the labellers are used. Finally, a DataSource is created from all of this.
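A small sketch of the tuple branch, using plain zip and itemgetter to mirror what L(items).zipped() and the labellers do (the data here is made up):

from operator import itemgetter

cols = (['a.jpg', 'b.jpg'], [['cat'], ['dog']])   # what get_items might return
items = list(zip(*cols))                          # row-wise: [('a.jpg', ['cat']), ...]
get_x, get_y = itemgetter(0), itemgetter(1)       # the default labellers per column
print(get_x(items[0]), get_y(items[0]))           # a.jpg ['cat']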
Play with the examples in the DataBlock notebook to understand it better: the MNIST example, the Pets example, multi-label classification, and others. In the Pets example, data augmentation transforms are defined, which we will see later. Let's look at the multi-label classification example from Planet. There are many ways to go about building the datablock and databunch.
In the first way, we use the pandas DataFrame, create get_x and get_y, and pass a splitter. get_x specifies that the image names are in x[0] and get_y that the labels are in x[1]. To the databunch we pass df.values, which is a numpy array. This is not an elegant way, as per Jeremy.
planet_source = untar_data(URLs.PLANET_TINY)
df = pd.read_csv(planet_source/"labels.csv")
planet = DataBlock(ts=(PILImage, MultiCategory),
                   get_x=lambda x:planet_source/"train"/f'{x[0]}.jpg',
                   splitter=RandomSplitter(),
                   get_y=lambda x:x[1].split(' '))
dbunch = planet.databunch(df.values, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))
In the second way, we pass a function via get_items that uses pandas-specific methods to retrieve the image paths and labels. To the databunch we pass the DataFrame itself.
planet_source = untar_data(URLs.PLANET_TINY)
df = pd.read_csv(planet_source/"labels.csv")
def _planet_items(x): return (
    f'{planet_source}/train/'+x.image_name+'.jpg', x.tags.str.split())

planet = DataBlock(ts=(PILImage,MultiCategory),
                   get_items = _planet_items,
                   splitter = RandomSplitter())
dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))
In the second way, the items returned are a tuple; in this case, as explained above, they are converted to an L and zipped. A couple of other ways are described as well. In the third way, we see that we can use the staticmethod decorator to make RandomSplitter a static method.
class PlanetDataBlock(DataBlock):
    types = PILImage,MultiCategory
    splitter = staticmethod(RandomSplitter())
    def get_items(self, x): return (
        f'{planet_source}/train/' + x.image_name + '.jpg', x.tags.str.split())

planet = PlanetDataBlock()
dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))
The fourth way, shown here, is the best.
planet = DataBlock(ts=(PILImage,MultiCategory),
                   get_x = lambda o:f'{planet_source}/train/'+o.image_name+'.jpg',
                   get_y = lambda o:o.tags.split(),
                   splitter = RandomSplitter())
dbunch = planet.databunch(df, dl_tfms=aug_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.))
There was a question on map and mapped, and whether they are used to parallelize things. mapped is an L-specific method and map is used in Python a lot; neither is about parallelism here — map gives us a lazy generator. It would be good to understand both, and the DataLoader in v2 would be a good place to start. Jeremy goes on to explain map; it can be understood in the code below. We define t to be the values from 0 to 9 and map operator.neg over t. Calling map returns a map object, which is a generator: the values are only produced when we print it or turn it into a list.
t = range(10)
map(operator.neg,t)
<map at 0x1258f4d50>
list(map(operator.neg,t))
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
t2 = list(map(operator.neg,t))
t3 = map(lambda o: o + 100,t2)
t3
<map at 0x125907290>
list(t3)
[100, 99, 98, 97, 96, 95, 94, 93, 92, 91]
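For contrast, a hedged sketch of the eager L version (in current fastcore the method is spelled map rather than the mapped used in these notes):

import operator
from fastcore.foundation import L

t = L(range(10)).map(operator.neg)   # eager: returns a new L immediately
print(t)                             # (#10) [0,-1,-2,-3,-4,-5,-6,-7,-8,-9]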
Jeremy says that he will not go through segmentation or object detection, but people who are interested can go through them and ask questions. In the next walkthrough, Jeremy mentions, we will go through tabular data.