Fastai v2 code walk-thru 6

Use this wiki topic for walk-thru 6.
Feel free to edit this topic.
Fastai v2 daily code walk-thrus
Fastai v2 chat

Brief notes thanks to @maxim.pechyonkin:

02_data_transforms.ipynb notebook

  • How does the class pass self to a method that we assigned by monkey patching, as in the example below?
t = TypeDispatch(m_nin, m_num, m_bll)
class A: f = t
a = A()
a.f(1)
  • self.inst
  • f = types.MethodType(f, self.inst)
  • __get__() when calling a method of a class
  • self.inst = inst binds TypeDispatch object to a particular instance
  • encodes, decodes, setups
  • RAPIDS (NVIDIA) — like Pandas, but runs on the GPU

03_data_pipeline.ipynb notebook

  • creating a Pipeline() object, which is callable
  • supports .set_as_item() method
  • function composition
  • decodes on Pipeline calls decodes on each transform in the pipeline in the reverse order
  • pipe.show(t) decodes t one transform at a time until it gets to a type that is showable
  • ShowTitle class that knows how to show itself
  • test_stdout's first argument should be a lambda that calls the function whose stdout output we want to test
  • Pipeline can show single elements, also can show individual elements of tuples
  • DataSource class discussion
  • TfmdList overrides self._gets() method and applies pipeline to each element
  • TfmdDS can handle 2 pipelines, one for x and one for y. Both pipelines start from the same thing (a path to an image, in this example), but they can also be applied to tuples of items, where x and y come from different places (like images from a folder, and labels from a .csv)
  • tls stands for transformed lists
  • L has a .mapped() method that applies a function to each element of L
  • pipelines passed to TfmdDS should be a list of lists
  • Categorize transform
  • @delegates() and @docs decorator

06_data_source.ipynb notebook

  • DataSource is almost identical to TfmdDS
  • DataSource has filts parameter
  • pets.subset(1) is the same as pets.valid
  • Transform can know whether to apply to validation or test set — use filt for that functionality
    • filt=0 — only apply to train set
    • filt is just an integer, or a bool mask

I thought it might be useful to take a closer look at the bit of Python magic Jeremy references here.


When one object is stored as a class attribute of another class, accessing that inner object on the outer object calls the inner object’s __get__ (see the Python data model docs).

This is amazing and the fact that this functionality exists can be quite surprising! (I know it has been for me!)


As we are accessing the thing on an instance of AnotherThing, we can modify the thing on the fly, changing one of its attributes for instance. This is what TypeDispatch does.

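Here is a minimal sketch of that idea (Thing and AnotherThing are toy classes for illustration, not fastai code):

class Thing:
    def __get__(self, inst, owner):
        # called whenever `thing` is accessed as an attribute of AnotherThing;
        # we can tweak ourselves on the fly before handing ourselves back
        self.accessed_from = inst
        return self

class AnotherThing:
    thing = Thing()

at = AnotherThing()
assert at.thing.accessed_from is at  # __get__ ran and stored the instance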

This __get__ mechanism is only an enabling block for another bit of magic happening down the road.

Once self.inst is set on the instance of TypeDispatch(), the function to be called is bound to self and becomes a method (self.inst is the original object holding the thing attribute!):

    def __call__(self, x, *args, **kwargs):
        f = self[type(x)]
        if not f: return x
        if self.inst is not None: f = types.MethodType(f, self.inst)
        return f(x, *args, **kwargs)
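
To see what types.MethodType does on its own, here is a tiny standalone sketch (plain Python, nothing fastai-specific):

import types

def set_foo(self, x): self.foo = x   # a plain function; `self` is just its first argument

class C: pass
c = C()

m = types.MethodType(set_foo, c)     # bind the function to the instance `c`
m('a')                               # equivalent to set_foo(c, 'a')
assert c.foo == 'a'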

The m_bll example from the notebook demonstrates this quite nicely.


m_bll is just a regular function with two arguments. Through the mechanism described above it becomes a method. It is called with a single argument, False. It sets foo on a and hence the test passes.


@radek
Thanks very much! For quite some time, I had been trying to understand "The following methods only apply when an instance of the class containing the method appears in an owner class." from the Python Data Model.

Your post explains it really well! :slight_smile:

class A:
    def __get__(self, inst, owner):
        print("I'm alive!")
        
class B: inst = A()

B.inst

>> I'm alive!

Notes for the Code Walkthrough 6 - Completed

Importing all the modules to make sure that I can recreate the notebook and the things done during the walkthrough.

from local.torch_basics import *
from local.test import *
from local.data.load import *
from local.data.transform import *
from local.data.pipeline import *
from local.data.external import *
from local.notebook.showdoc import *
from local.data.all import *
from local.vision.core import *
from local import *

Jeremy starts with the TypeDispatch class. He mentions that Aman Arora has some good write-ups on it in the forums. Let’s look at the code of TypeDispatch.

class TypeDispatch:
    "Dictionary-like object; `__getitem__` matches keys of types using `issubclass`"
    def __init__(self, *funcs):
        self.funcs,self.cache = {},{}
        for f in funcs: self.add(f)
        self.inst = None

    def _reset(self):
        self.funcs = {k:self.funcs[k] for k in sorted(self.funcs, key=cmp_instance, reverse=True)}
        self.cache = {**self.funcs}

    def add(self, f):
        "Add type `t` and function `f`"
        self.funcs[_p1_anno(f) or object] = f
        self._reset()

    def returns(self, x): return anno_ret(self[type(x)])
    def returns_none(self, x):
        r = anno_ret(self[type(x)])
        return r if r == NoneType else None

    def __repr__(self): return str({getattr(k,'__name__',str(k)):v.__name__ for k,v in self.funcs.items()})

    def __call__(self, x, *args, **kwargs):
        f = self[type(x)]
        if not f: return x
        if self.inst is not None: f = types.MethodType(f, self.inst)
        return f(x, *args, **kwargs)

    def __get__(self, inst, owner):
        self.inst = inst
        return self

    def __getitem__(self, k):
        "Find first matching type that is a super-class of `k`"
        if k in self.cache: return self.cache[k]
        types = [f for f in self.funcs if issubclass(k,f)]
        res = self.funcs[types[0]] if types else None
        self.cache[k] = res
        return res

There are tests in 02_data_transforms.ipynb. Let’s look at them. We did not look at the second set of tests last time. The functions in the second set also take self. So in the test (see below), when we call a.f, how does Python know that a has to be passed as self? This is not default behaviour, and we have changed it: f is an attribute of class A, and there is nothing in particular that says a should be passed as self.

def m_nin(self, x:numbers.Integral): return x+1
def m_bll(self, x:bool): return x
def m_num(self, x:numbers.Number): return x

t = TypeDispatch(m_nin,m_num,m_bll)
class A: f = t
a = A()
test_eq(a.f(1), 2)
test_eq(a.f(1.), 1.)
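
As a side note, we can see how __getitem__ resolves these lookups: it returns the function registered for the first, i.e. most specific, type of which the query type is a subclass. A quick sketch using t from above:

# bool is a subclass of int, and int is registered via numbers.Integral,
# so the most specific annotation wins at lookup time
assert t[bool]  is m_bll
assert t[int]   is m_nin   # Integral matches before the more general Number
assert t[float] is m_num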

Please Note:
I have modified the tests below to be able to do some debug. The tests in the actual notebook 02_data_transforms.ipynb are valid and remain the same.

Let’s change the tests a bit to show that we are passing a real self. To test this we call g.d(False) and then do test_eq(g.foo,'a'). We will find that the test passes.

This succeeds because in the code for TypeDispatch, specifically in the __call__ method, we check self.inst and, if it is set, bind the function d to the instance g, turning it into a method. This is what happens in the line if self.inst is not None: f = types.MethodType(f, self.inst).

How do we know what self.inst is? self.inst has to be set to g. This is done using the __get__ method in TypeDispatch. When we execute g.d(False), Python first looks up d on the class of g and, since the attribute has a __get__, executes t_new.__get__(g, V): g is an instance of class V, and d has been set equal to t_new, which is a TypeDispatch instance.

The place to go to understand all of this is the Python Data Model. Note that neither g nor class V defines a __get__; it is defined on TypeDispatch, the class of the attribute d, which is why TypeDispatch’s __get__ is the one that runs.

def t_m_nin(self, x:numbers.Integral): return x+1
def t_m_bll(self, x:bool): self.foo = 'a'
def t_m_num(self, x:numbers.Number): return x

t_new = TypeDispatch(t_m_nin,t_m_num,t_m_bll)
class V: d = t_new
g = V()
g.d(False)

As you can see in the below cell g.__dict__ has {'foo': 'a'}.

g.__dict__
{'foo': 'a'}

If we now do the test_eq of g.foo and a, it will pass.

test_eq(g.foo,'a')

One thing that has been newly added to _tfm_methods (please see class _TfmDict for this) is setups. It already had encodes and decodes; now setups has been added. setups is called by the setup function, as in the Transform class. setups is now a TypeDispatch object.

The reason for this is that we have now started to work with RAPIDS (by NVIDIA). It provides something like Pandas that runs on the GPU: it works like Pandas but is not Pandas. We want our tabular transforms to work with RAPIDS dataframes as well as Pandas dataframes, and we are using setups to drive this behaviour.
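
As a sketch of what that enables (MeanTfm is a made-up transform, not fastai code): thanks to _TfmDict, multiple setups definitions in one class body accumulate into a single TypeDispatch, and setup picks the one matching the type of the items.

import pandas as pd

class MeanTfm(Transform):
    # both `setups` below are collected into one TypeDispatch by _TfmDict,
    # so setup() dispatches on the type of `items`
    def setups(self, items:pd.DataFrame): self.mean = items.mean()      # per-column means
    def setups(self, items:list):         self.mean = sum(items)/len(items)

A third setups annotated with a RAPIDS cudf.DataFrame could be registered the same way to cover GPU dataframes.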

Let’s now do 03_data_pipeline.ipynb and create a pipeline. It is easiest to start by looking at the tests. In the test below we create an empty Pipeline and assign it to pipe. pipe is callable, as in pipe(1). But this is an empty pipeline: whatever we pass to an empty Pipeline gets returned, so here 1 is returned as the output of pipe(1). It has the same ‘as item’ behaviour as Transform: if you call set_as_item() with a boolean (True or False), it will set as_item on all its transforms to that boolean.

So if we pass a tuple after doing pipe.set_as_item(False) we get back a tuple as you can see in the test below.

# Empty pipeline is noop
pipe = Pipeline()
test_eq(pipe(1), 1)
pipe.set_as_item(False)
test_eq(pipe((1,)), (1,))
# Check pickle works
assert pickle.loads(pickle.dumps(pipe))

Let’s take a Pipeline that has transforms in it. As you can see there are two transforms in it. The int_tfm has an encodes that returns Int (note: this Int is a fastai-specific type, not the Python int) and a decodes that returns a Float. The neg_tfm just returns the negation of the value passed to it, in both encodes and decodes.

The encodes are called in serial order whereas the decodes are called in reverse order. So for pipe = Pipeline([neg_tfm, int_tfm]), when we call pipe(start) the encodes get to work: neg_tfm is called first and int_tfm second. Whereas when we do pipe.decode(t), int_tfm is called first and neg_tfm second.

The way the Pipeline is executed when pipe(start) is called is via the __call__ defined in Pipeline. It calls the compose_tfms function, which takes in a list of functions. That list is built by Pipeline’s __init__, which takes the functions (neg_tfm and int_tfm in this case), converts them to L(functions) (note: L is a special list created in fastai), maps them to Transform if they are not already of the Transform class, and sorts them by order.

The specific code snippets that we are referring to in the Pipeline code are

def __call__(self, o): return compose_tfms(o, tfms=self.fs, filt=self.filt)

and

def __init__(self, funcs=None, as_item=False, filt=None):
    if not funcs: funcs=[noop]
    if isinstance(funcs, Pipeline): funcs = funcs.fs
    self.filt = filt
    self.fs = L(funcs).mapped(mk_transform).sorted(key='order')
    self.set_as_item(as_item)

compose_tfms takes in the variable and the transformations, and is primed to do encodes first (is_enc=True by default, with reverse=False to prevent the decodes behaviour). It goes through each function and applies it to the variable. There are tests for plain functions as well as for the Transform class.

def compose_tfms(x, tfms, is_enc=True, reverse=False, **kwargs):
    "Apply all `func_nm` attribute of `tfms` on `x`, maybe in `reverse` order"
    if reverse: tfms = reversed(tfms)
    for f in tfms:
        if not is_enc: f = f.decode
        x = f(x, **kwargs)
    return x
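
A quick sanity check of the ordering, with plain functions (no decodes involved):

def add1(x):   return x+1
def times2(x): return x*2

# encode order: add1 runs first, then times2
assert compose_tfms(3, [add1, times2]) == 8
# reverse=True walks the list backwards: times2 first, then add1
assert compose_tfms(3, [add1, times2], reverse=True) == 7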

When pipe.decode(t) is called, the same compose_tfms is called, but now with is_enc=False and reverse=True to get the decodes behaviour. The code in Pipeline that handles decode shows the same.

def decode  (self, o): return compose_tfms(o, tfms=self.fs, is_enc=False, reverse=True, filt=self.filt)

pipe.show(t) behaves like this: it takes t and keeps decoding it until it gets to a type that has a show method. In the case of this test, t is of type Int, which has a show method that it inherits from ShowTitle.

class Int(int, ShowTitle): pass
class ShowTitle:
    "Base class that adds a simple `show`"
    _show_args = {'label': 'text'}
    def show(self, ctx=None, **kwargs): return show_title(str(self), ctx=ctx, **merge(self._show_args, kwargs))

Please note that in the show function of class ShowTitle, ctx defaults to None; in that case it simply prints whatever it is passed. Through ctx you can pass along a plot (and a title) if you want.

class IntFloatTfm(Transform):
    def encodes(self, x):  return Int(x)
    def decodes(self, x):  return Float(x)

int_tfm=IntFloatTfm()

def neg(x): return -x
neg_tfm = Transform(neg, neg)
pipe = Pipeline([neg_tfm, int_tfm])

start = 2.0
t = pipe(start)
test_eq_type(t, Int(-2))
test_eq_type(pipe.decode(t), Float(start))
test_stdout(lambda:pipe.show(t), '-2')

Notice that, to test what gets printed, the code is lambda: pipe.show(t), a lambda that takes no arguments. This gets passed to test_stdout, which checks what gets printed when pipe.show(t) is run. It is a nice way to test this. If we don’t pass it this way, the test does not even get going, as shown below.

test_stdout(pipe.show(t), '-2')
-2



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-22-c1c41d39d068> in <module>
----> 1 test_stdout(pipe.show(t), '-2')


~/Documents/GitHub/fastai_dev/dev/local/test.py in test_stdout(f, exp, regex)
     74     "Test that `f` prints `exp` to stdout, optionally checking as `regex`"
     75     s = io.StringIO()
---> 76     with redirect_stdout(s): f()
     77     if regex: assert re.search(exp, s.getvalue()) is not None
     78     else: test_eq(s.getvalue(), f'{exp}\n' if len(exp) > 0 else '')


TypeError: 'int' object is not callable

When set_as_item is set to False and pipe.show is then called on a tuple, show is called on each element of the tuple.

pipe.set_as_item(False)
test_stdout(lambda:pipe.show(pipe((1,2))), '-1\n-2')

The reason for that is that in a batch we have a tuple: one element is an image and the other is a label. So each element will be shown using the show defined for its type.

The show method in Pipeline goes through all the functions in the pipeline, in reverse order. At each step it first checks whether it can show the object without decoding; if it cannot, it decodes one step further and tries again, eventually using the show method of the decoded type. This can be seen in the code below, which highlights the show part of Pipeline.

def show(self, o, ctx=None, **kwargs):
    for f in reversed(self.fs):
        res = self._show(o, ctx, **kwargs)
        if res is not None: return res
        o = f.decode(o, filt=self.filt)
    return self._show(o, ctx, **kwargs)

def _show(self, o, ctx, **kwargs):
    o1 = [o] if self.as_item or not is_listy(o) else o
    if not all(hasattr(o_, 'show') for o_ in o1): return
    for o_ in o1: ctx = o_.show(ctx=ctx, **kwargs)
    return 1 if ctx is None else ctx
r = pipe((1,2)); r
(-1, -2)
pipe.show(r)
-1
-2

Now let’s look at TfmdList. It is a subclass of TfmdBase, which is a subclass of L. It takes in items and transforms, creates a Pipeline of those transforms, and passes the list of items to L. This can be seen in its __init__ method.

class TfmdList(TfmdBase):
    "A `Pipeline` of `tfms` applied to a collection of `items`"
    def __init__(self, items, tfms, do_setup=True, as_item=True, use_list=None, filt=None):
        super().__init__(items, use_list=use_list)
        if isinstance(tfms,TfmdList): tfms = tfms.tfms
        if isinstance(tfms,Pipeline): do_setup=False
        self.tfms = Pipeline(tfms, as_item=as_item, filt=filt)
        if do_setup: self.setup()

This is a good time to learn more about L. In its __getitem__ method we have defined that if the index is an iterable it goes to self._gets, and to self._get if it is just a single index.

def __getitem__(self, idx): return L(self._gets(idx), use_list=None) if is_iter(idx) else self._get(idx)
def _get(self, i): return getattr(self.items,'iloc',self.items)[i]
def _gets(self, i):
    i = mask2idxs(i)
    return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
            else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')
            else [self.items[i_] for i_ in i])
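
A quick sketch of that indexing behaviour on a plain L:

xs = L([10, 20, 30, 40])
test_eq(xs[1], 20)                                 # single index -> _get
test_eq(xs[[0, 2]], [10, 30])                      # list of indices -> _gets
test_eq(xs[[True, False, True, False]], [10, 30])  # boolean mask, via mask2idxs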

We override the _get method in TfmdList: we still use L’s _get, but run the transforms on its result. This is what lets us treat the items of a TfmdList like a list, index into it, and get transformed items back.

def _get (self, i): return self.tfms(super()._get(i))

Here we define tl to be a TfmdList whose items are a list of floats (1., 2., 3.) and whose transforms are neg_tfm and int_tfm. We can index into tl: tl[1] applies the pipeline to the second item and returns Int(-2), and tl.decode_at(1) decodes it back to 2.

tl = TfmdList([1.,2.,3.], [neg_tfm, int_tfm])
t = tl[1]
test_eq_type(t, Int(-2))
test_eq(tl.decode_at(1), 2)
test_eq_type(tl.decode(t), Float(2.0))
test_stdout(lambda: tl.show_at(2), '-3')
tl
TfmdList: [1.0, 2.0, 3.0]
tfms - [Transform: True {'object': 'neg'} {'object': 'neg'}, IntFloatTfm: True {'object': 'encodes'} {'object': 'decodes'}]

We are not using TfmdList for datasets because it returns only one thing; from a dataset we want to get two things, like x,y, instead of just one. Let us look at TfmdDS. It looks similar to TfmdList in that it takes in items and transforms, and it is also a subclass of TfmdBase. But it creates a TfmdList for every list of transform functions it is given. So now we have set up ‘n’ pipelines instead of just one. Normally TfmdDS is used with two pipelines, one each for x and y.

class TfmdDS(TfmdBase):
    "A dataset that creates a tuple from each `tfms`, passed thru `ds_tfms`"
    def __init__(self, items, tfms=None, do_setup=True, use_list=None, filt=None):
        super().__init__(items, use_list=use_list)
        if tfms is None: tfms = [None]
        self.tls = [TfmdList(items, t, do_setup=do_setup, filt=filt, use_list=use_list) for t in L(tfms)]

This is what we used in 08_pets_tutorial, where we had two TfmdLists.

tfms = [[PILImage.create, ImageResizer(128), ToTensor(), ByteToFloatTensor()],
        [labeller, Categorize()]]
tds = TfmdDS(items, tfms)

We can use .tls to see the two TfmdLists and then .tfms to see the transforms applied to each of them.

tds.tls[0]
tds.tls[1]
tds.tls[0].tfms
tds.tls[1].tfms

When we call decode on it, it goes through each TfmdList and decodes the corresponding element using that TfmdList’s transforms. This is the decode code in TfmdDS.

def decode(self, o): return tuple(it.decode(o_) for o_,it in zip(o,self.tls))
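
A small numeric sketch of this round trip, reusing neg_tfm and int_tfm from earlier:

tds2 = TfmdDS([1., 2.], tfms=[[neg_tfm, int_tfm], [neg_tfm]])
t2 = tds2[0]
test_eq(t2, (Int(-1), -1.))                # one element per pipeline
test_eq(tds2.decode(t2), (Float(1.), 1.))  # each element decoded by its own TfmdList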

There was a question on whether both pipelines here are created from items that are paths to images. The answer is yes: the path contains the image itself and, in its name, the label of the image. Jeremy goes on to show the same in code.

We take a single item from the pets images. We assign the transforms of the first TfmdList to fx and the transforms of the second TfmdList to fy. Then we can see individually that [PILImage.create, ImageResizer(128), ToTensor(), ByteToFloatTensor()] is applied to the item via fx, returning a tensor, and that [labeller, Categorize()] is applied to it via fy, returning a label.

it = items[0]
fx = tds.tls[0].tfms
fy = tds.tls[1].tfms
fx(it) 
fy(it)

There was another question as to whether the items in TfmdDS can be tuples. The answer is yes. To illustrate this, Jeremy uses the following example.

its = L((0,1),(1,2),(3,4))
fx = itemgetter(0)
fy = itemgetter(1)

its.mapped(fx) returns (#3) [0,1,3] and its.mapped(fy) returns (#3) [1,2,4]. He goes on to create a TfmdDS with its and fx, fy. The workings of the same are shown in the code below.

its = L((0,1),(1,2),(3,4))
fx = itemgetter(0)
fy = itemgetter(1)
print(its.mapped(fx))
print(its.mapped(fy))
(#3) [0,1,3]
(#3) [1,2,4]
tds = TfmdDS(its,[[fx],[fy]])
tds[0]
(0, 1)

This also addresses another question on whether this can be used when the labels are in a separate file. As long as the items contain what is needed to get both the image and the label, we can use TfmdDS; they need not be in the same file. Jeremy explains that he and Sylvain arrived here after about 25 iterations and many weeks. It is not necessary to understand these details right away. We are in this walkthrough to get the details, but we should not be worried about getting all of it at once; it is ok if it takes a while.

As homework, play with the 08 notebook to understand all of this. We will use TfmdDS most of the time, but internally TfmdDS uses TfmdList. Both TfmdDS and TfmdList are small pieces of code. Go through the code to get an intuitive understanding of what it does. Then try to understand the tests, and specifically why those tests were added. The tests are not arbitrary: Jeremy and Sylvain added them because they think they provide the best clarity, so the tests will help you understand. The methods section also has tests; try to go through and understand them as well.

We now go to notebook 05_data_core.ipynb. Let’s now try to understand setup in Pipeline, looking at Categorize. If we don’t pass a vocab in the __init__, setups will automatically build one. Let’s look at the setups part of the code in Categorize.

class Categorize(Transform):
    def setups(self, dsrc):
        if self.vocab is None and dsrc: self.vocab = CategoryMap(getattr(dsrc,'train',dsrc), add_na=self.add_na)

The vocab is a CategoryMap, a class that simply finds the unique list of items to form the vocab; unless we pass a Pandas categorical column, in which case Pandas has already done that for us. Let’s look at the __init__ part of the code in CategoryMap to understand this.

class CategoryMap(CollBase):
    "Collection of categories with the reverse mapping in `o2i`"
    def __init__(self, col, sort=True, add_na=False):
        if is_categorical_dtype(col): items = L(col.cat.categories, use_list=True)
        else:
            # `o==o` is the generalized definition of non-NaN used by Pandas
            items = L(o for o in L(col, use_list=True).unique() if o==o)
            if sort: items = items.sorted()
        self.items = '#na#' + items if add_na else items
        self.o2i = defaultdict(int, self.items.val2idx()) if add_na else dict(self.items.val2idx())
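
A quick sketch of CategoryMap on a plain list:

cm = CategoryMap(['dog', 'cat', 'cat'])
test_eq(cm.items, ['cat', 'dog'])      # unique values, sorted
test_eq(cm.o2i, {'cat': 0, 'dog': 1})  # the reverse mapping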

We don’t have tests for the setup method in the notebooks as of now; we will have to look at DataSource to see it working. So let’s try to understand the setup of Pipeline.

def setup(self, items=None):
    self.default = self.items = items
    tfms,self.fs = self.fs,[]
    for t in tfms: self.add(t,items)

def add(self,t, items=None):
    t.setup(items)
    self.fs.append(t)

We make the list of functions self.fs an empty list, then loop through each transform in tfms and call add. The add function calls setup on the transform and then appends it to self.fs. The reason for doing it like this is that in a tfms list like [labeller, Categorize()], by the time Categorize’s setup runs, labeller has already been added to the pipeline, so the items are already labelled and the vocab can be formed. This was a very tricky thing to get right, but now it is there and it is very handy. It becomes clearer if we look at the setup code of Transform.

def setup(self, items=None): return self.setups(items)

When t.setup is called it calls self.setups, as seen above. We look at a test for Categorize(). There are three items ['cat','dog','cat'], and we run the categorize transform over them via TfmdDS. The vocab comes back, correctly, as ['cat','dog'].

cat = Category.create()
tds = TfmdDS(['cat', 'dog', 'cat'], tfms=[cat])
cat.vocab
(#2) [cat,dog]

Now we have all the pieces to understand TfmdDL. Here we have defined _dl_tfms = ('after_item','before_batch','after_batch'). Let’s look at the __init__ part of the code for TfmdDL.

def __init__(self, dataset, bs=16, shuffle=False, num_workers=None, **kwargs):
    if num_workers is None: num_workers = min(16, defaults.cpus)
    for nm in _dl_tfms:
        kwargs[nm] = Pipeline(kwargs.get(nm,None), as_item=(nm=='before_batch'))
        kwargs[nm].setup(self)
    super().__init__(dataset, bs=bs, shuffle=shuffle, num_workers=num_workers, **kwargs)

For each name in _dl_tfms, the __init__ takes the corresponding keyword argument from **kwargs (if any) and creates a Pipeline from it; it then calls setup on that Pipeline via kwargs[nm].setup(self). It passes self because setup needs to know what items are being worked on; for example, Categorize() needs all the labelled items. We can also look at the decode of TfmdDL and see what it is doing.
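
A minimal usage sketch (assuming the tds from the pets example above):

# any of after_item / before_batch / after_batch can be passed as transforms;
# each is wrapped in a Pipeline (None becomes an empty, noop Pipeline)
tdl = TfmdDL(tds, bs=4, after_item=None)
xb, yb = next(iter(tdl))  # a collated batch of 4 (x, y) pairs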

We look at Cuda(Transform). Its encodes puts things on the device and its decodes brings them back to the CPU for the purpose of display; this helps address memory leaks. Let’s now look at 06_data_source.ipynb, where we look at DataSource.

It is identical to TfmdDS except that it has an additional argument called filts, which is the filters.

tds  = TfmdDS    (items, tfms)
pets = DataSource(items, tfms, filts=split_idx)

The filters tell the DataSource how to get a subset.

pets.subset(1)[0]

This returns a new TfmdDS with the items that are in split_idx. split_idx is nothing but a list containing two sets of index numbers (one for the training set and another for the validation set) in the case of the 08_pets_tutorial.ipynb notebook. pets.subset(1) is the same as pets.valid and pets.subset(0) is the same as pets.train. All that the DataSource is doing is returning two TfmdDS, using split_idx to separate the items.
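
A tiny sketch of this behaviour with explicit filters (the items and index lists here are made up):

inp  = [0, 1, 2, 3, 4]
dsrc = DataSource(inp, tfms=[None], filts=[[0, 2], [1, 3, 4]])
test_eq(len(dsrc.filts), 2)
test_eq(dsrc.subset(0)[1], (2,))  # second item of the training subset
test_eq(dsrc.valid[0], (1,))      # dsrc.valid is just dsrc.subset(1)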

Let’s look at the code for DataSource. It fits in a single screen, and a large part of that is for databunch. It is a subclass of TfmdDS, but it has a method called subset which calls the function _mk_subset. Let’s see the specific code for subset in DataSource and the code for _mk_subset.

def subset(self, i): return _mk_subset(self, i)

def _mk_subset(self, i):
    tfms = [o.tfms for o in self.tls]
    return TfmdDS(L._gets(self, self.filts[i]), tfms=tfms, do_setup=False, filt=i)

As you can see in _mk_subset, it makes a TfmdDS with the set of items in self.filts[i]. It passes do_setup=False as there is no need to recreate the vocab again. This is basically what a DataSource is. We can pass as many filters as we want; normally it is two, but there is no limit.

assert all_disjoint(self.filts)

This code above is in the __init__ of DataSource and is there to ensure that no indices of the training set appear in the validation set and vice versa, so good data science practice is baked in. In the _mk_subset code we saw that the TfmdDS that was created had a filt argument. Let’s look at why this is there. It is passed from TfmdDS to TfmdList to Pipeline (in __call__ and decode of Pipeline) to compose_tfms, and from there to the transforms, so that the transforms have the ability to know whether they are being applied to the training set or the validation set.

So if we look at the code of __init__ of Transform we can see that we can define on which set we want this to be applied.

class Transform(metaclass=_TfmMeta):
    "Delegates (`__call__`,`decode`,`setup`) to (`encodes`,`decodes`,`setups`) if `filt` matches"
    filt,init_enc,as_item_force,as_item,order = None,False,None,True,0
    def __init__(self, enc=None, dec=None, filt=None, as_item=False):

So for data augmentation we can set filt=0 and make sure it is only applied to the training set. If we look at the _call part of the code in Transform, it ensures that we are not accidentally passing in a filter different from the one the Transform was initialized with. In such a case, where we accidentally pass along a different filter, it does nothing and just returns the item it was given.

def _call(self, fn, x, filt=None, **kwargs):
    if filt!=self.filt and self.filt is not None: return x
    f = getattr(self, fn)
    if self.use_as_item or not is_listy(x): return self._do_call(f, x, **kwargs)
    res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
    return retain_type(res, x)
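
A sketch of that behaviour, with a made-up 'augmentation' restricted to the training set via filt=0:

aug = Transform(lambda x: x + 100, filt=0)  # only runs when filt matches 0
test_eq(aug(3, filt=0), 103)  # training set: applied
test_eq(aug(3, filt=1), 3)    # different filt: passed through untouched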

There was a question on whether filt is just an integer. The answer is yes. We look at the tests for DataSource. We define a DataSource whose items are [0,1,2,3,4] and whose tfms is [None]. No filters are defined, which means there is only one filter, containing all the items.

inp = [0,1,2,3,4]
dsrc = DataSource(inp, tfms=[None])
test_eq(len(dsrc.filts), 1)

When we index a single item we get that item back, wrapped in a tuple. When we index two items we get a list of those one-element tuples. This is because TfmdDS returns a tuple per item, one element per pipeline, which is what we need in PyTorch for batches of things like (x, y).

test_eq(*dsrc[2], 2) 
test_eq(dsrc[1,2], [(1,),(2,)])

There was a question on whether retain_type should be called in _call when filt!=self.filt. The answer is no. The function retain_type takes the result of the transform applied to the input item x (res is the name of the result in the code) and makes res the same type as x. This is done in case res ends up being of a plainer superclass of x’s type as a result of the transform; for example, if res comes back of type Tensor whereas x was a TensorImage, retain_type will convert res to TensorImage. In our filt!=self.filt case no change or transform is being done, only x is returned as such, so there is no need for retain_type.
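
A quick sketch of retain_type itself:

t3  = tensor([1., 2.])
ti  = TensorImage(t3)
res = retain_type(t3, ti)  # res was a plain Tensor, the old value was a TensorImage
assert type(res) is TensorImage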

TfmdDS works with masks, ranges and DataFrames as well; with DataFrames it works in the most optimal way, using the iloc method. The filters can be tensors or masks. There was a question on how to handle a scenario where the transform of y depends on the transform of x; Jeremy says that we will get to it in a couple of days.


Thank you for the notes @pnvijay


I am trying to do a proper k-fold cross validation and for this purpose intend to split a DataSource into k splits.

When constructing a TfmdDL I want to select subset(i) as validation and all other subsets (aka ~i) for training. However, I can pass only one integer to subset. Any ideas how to do roughly as below?

def get_data_fold(validFold, bs, sz):
    vld_dl = TfmdDL(dsrc.subset(validFold), bs=bs, num_workers=nw, after_item= item_tfms, after_batch=batch_tfms)
    trn_dl = TfmdDL(dsrc.subset(~validFold), bs=bs, num_workers=nw, after_item=item_tfms, after_batch=batch_tfms)
    
    return DataBunch(trn_dl, vld_dl)

Thank you, if anyone has any input

@mrandy I have an example of k-fold validation here: https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0/blob/master/Computer%20Vision/03_Cross_Validation.ipynb. It is just for vision, but perhaps it could help you? :slight_smile:


Is the kfold/cross validation example still available somewhere?

As always, thank you very much & kind regards
Michael

Yes it is! Sorry, still trying to narrow down all the links, I switched repo names:
