Those are comments that concern legacy code. There used to be ds_tfms, but they have been removed from TfmdDS. Now they are put in after_item in TfmdDL.
Understood, thanks
This is the right approach. The user won't exactly construct the Pipeline object themselves - they'll just provide the list(s) of tfms, which will be automatically wrapped in a Pipeline by DataSource or TfmdDL or whatever.
Note also that you'll probably want to create types for the higher-level data blocks API, defined and explained in NB 50. We can talk more about that API on Monday, so let me know if you have questions. Basically, the ts/types param in DataBlock is where you can bring together any tfms that most users will want most of the time for your data type.
I'm just curious whether the order attribute on a Transform is necessary? Maybe I am not seeing something (I certainly have not looked at the higher-level APIs), but I just wanted to share my observation on some of the challenges I feel having the order attribute there introduces.
We provide the transforms to the pipeline as a list, and a list communicates that the order is important. But then, behind the curtain, the pipeline will reorder itself. That this happens can be figured out with the slightest of code reading, but it still might be a little surprising to the user.
Also, a very appealing aspect of Pipelines to me is that they are infinitely composable. A scenario where I have a Transform defined in one project that I want to move to another (or even another notebook doing something slightly different with the same data) is very plausible. But whenever I move the Transform to a different pipeline, I always need to be aware of the order attributes of the Transforms that are already in the Pipeline.
I have not used this, of course, so maybe I am talking nonsense, but just thinking about it, I would love to be able to order Transforms in a pipeline, including custom Transforms written by me, without caring about the order attribute at all.
On the other hand, I see how the order attribute can be a bit of extra documentation, telling the user in what order the Transforms that come with the library are intended to be run. But I am not sure this added complexity makes sense for custom scenarios - as a user I would probably not rely on this information anyhow, but would use the notebooks as a starting point when working on a project of a given type (which probably should already have the basic set of ordered Transforms).
Sorry, this might be completely invalid reasoning depending on what is happening in other parts of the library, but I thought I'd share.
The order is mainly there to organize the default transforms (which a user of the high-level API doesn't want to pass along) together with any added transforms. In the data block API they'll both be concatenated, but then they need to be reordered, as the sketch below shows.
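Here is a minimal sketch of that reordering idea (the class names are made up for illustration; the real library sorts Transforms by their order attribute when assembling a Pipeline):

class TfmA: order = 0
class TfmB: order = 10
class UserTfm: order = 5   # a user tfm that should run between the two defaults

tfms = sorted([TfmA(), TfmB(), UserTfm()], key=lambda t: t.order)
print([type(t).__name__ for t in tfms])
>>> ['TfmA', 'UserTfm', 'TfmB']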
Thank you very much, I think I understand now. Using the higher-level API, we can optionally pass transforms that need to be mixed somehow with the default ones - makes sense.
Thank you very much for your reply!
Understanding TypeDispatch
So I've spent a bit of time trying to understand TypeDispatch, and it's really powerful! Basically, it's a dictionary between types and functions. You can refer to the type hierarchy here. Let's dig deeper and you'll see how powerful it is!
def __init__(self, *funcs):
    self.funcs,self.cache = {},{}
    for f in funcs: self.add(f)
    self.inst = None
The __init__ takes in a list of functions and adds them to the dictionary as a type:func mapping. Inside TypeDispatch, the type is determined by the annotation of the first parameter of a function f.
Too confusing? Let's put it together.
#export
class TypeDispatch:
    "Dictionary-like object; `__getitem__` matches keys of types using `issubclass`"
    def __init__(self, *funcs):
        self.funcs,self.cache = {},{}
        for f in funcs: self.add(f)
        self.inst = None
    def _reset(self):
        self.funcs = {k:self.funcs[k] for k in sorted(self.funcs, key=cmp_instance, reverse=True)}
        self.cache = {**self.funcs}
    def add(self, f):
        "Add type `t` and function `f`"
        self.funcs[_p1_anno(f) or object] = f
        self._reset()
    def __repr__(self): return str({getattr(k,'__name__',str(k)):v.__name__ for k,v in self.funcs.items()})
That was a simpler version of TypeDispatch. Now, let's create a function:
def some_func(a:numbers.Integral, b:bool)->TensorImage: pass
and pass it to TypeDispatch:

t = TypeDispatch(some_func); t
>>> {'Integral': 'some_func'}
Voilà! TypeDispatch works…! BUT how?
Step 1: __init__ takes a bunch of functions or a single function. To start with, self.funcs and self.cache are empty, as defined by self.funcs,self.cache = {},{}.
Step 2: for f in funcs: self.add(f) loops through each function passed in and adds it to the dictionary self.funcs using add. Inside add, we check the annotation of the first parameter of function f; if it is None, we use the type object instead, and add it to self.funcs.
Thus self.funcs holds a mapping between the type of the first param of f and f itself.
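To see that object fallback in action, here's a quick hedged demo (reusing some_func from above; given the `_p1_anno(f) or object` line in add, the repr should look roughly like this, with the exact ordering coming from the reordering step described next):

def fallback(a, b): pass   # no annotation on the first param -> filed under object
t = TypeDispatch(some_func, fallback); t
>>> {'Integral': 'some_func', 'object': 'fallback'}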
Step 3: Reorder the self.funcs dictionary based on the key function cmp_instance, which sorts using Python's type hierarchy in reverse order. Thus, if you pass int and bool, the first item inside this dict will be bool.
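Here's a hedged, self-contained illustration of that reordering idea (using a plain comparison function in place of the library's cmp_instance): subclasses sort before their super-classes, so the most specific type comes first.

import numbers
from functools import cmp_to_key

def _cmp(a, b):
    "Order types so that subclasses (more specific) come first"
    if a == b: return 0
    if issubclass(a, b): return -1
    if issubclass(b, a): return 1
    return 0

print(sorted([numbers.Integral, int, bool], key=cmp_to_key(_cmp)))
>>> [<class 'bool'>, <class 'int'>, <class 'numbers.Integral'>]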
Finally, make self.cache the same as self.funcs. We use cache to look up the mapping later. Since dictionary key lookup is O(1), it's much faster.
And finally we have __repr__, which just returns the mapping self.funcs, but using f's name and the type's name.
The reason there is a getattr inside getattr(k,'__name__',str(k)) is, I think, that it's possible for a type not to have a __name__ attribute when we use metaclasses.
Hopefully, this helps everyone! Please feel free to correct me if I understood something wrong.
We do reorder, as Jeremy said in walk-thru 5, because we try to find the closest match from the Transforms. Thus, for an integer, the closest match would first be int and not numbers.Integral.
Also, from the docstring of __getitem__: "Find first matching type that is a super-class of k".
Understanding TypeDispatch - Part 2
Here's an insight!
So now we know that TypeDispatch is nothing but a pretty cool dict that looks something like:

{
    bool: some_func1,
    int: some_func2,
    numbers.Integral: some_func3
}

i.e., it is a mapping between a type and the function that needs to be called on that specific type.
This is done through __call__ inside TypeDispatch, of course!
def __call__(self, x, *args, **kwargs):
    f = self[type(x)]
    if not f: return x
    if self.inst: f = types.MethodType(f, self.inst)
    return f(x, *args, **kwargs)
f = self[type(x)] checks the type of the param being passed, looks it up in the TypeDispatch dict, and picks the function to call.
i.e., foo(2) will return type(2) as int, and then we look up int via __getitem__, which simply returns the first matching type that is a super-class of the given type.
So we look up inside self.cache, which is also a mapping like

{
    bool: some_func1,
    int: some_func2,
    numbers.Integral: some_func3
}

and we will find the function some_func2 for int. Thus, __getitem__ will return some_func2 as f.
So, f = self[type(x)] sets f to some_func2.
This is the magic! We will call the specific function, via __call__, for the specific type of the parameter being passed!!
Thus, when we pass a TensorImage, it will find the function that corresponds to TensorImage inside the dict and call it, which is as simple as return f(x, *args, **kwargs)!
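To tie the whole mechanism together, here is a hedged, self-contained toy version (MiniDispatch is a made-up name; it skips the caching, self.inst, and object-fallback details of the real class) that reproduces the closest-match dispatch described above:

import numbers
from functools import cmp_to_key

def _cmp(a, b):
    "Order types so that subclasses (more specific) come first"
    if a == b: return 0
    if issubclass(a, b): return -1
    if issubclass(b, a): return 1
    return 0

class MiniDispatch:
    def __init__(self, *funcs):
        # key: annotation of the first param -> function
        self.funcs = {next(iter(f.__annotations__.values())): f for f in funcs}
        # most specific types first, like _reset does with cmp_instance
        self.funcs = {k: self.funcs[k] for k in sorted(self.funcs, key=cmp_to_key(_cmp))}
    def __call__(self, x, *args, **kwargs):
        for t, f in self.funcs.items():
            if isinstance(x, t): return f(x, *args, **kwargs)
        return x   # no match: return the input unchanged

def f_bool(a: bool): return f"bool:{a}"
def f_int(a: int): return f"int:{a}"
def f_num(a: numbers.Integral): return f"Integral:{a}"

d = MiniDispatch(f_num, f_int, f_bool)
print(d(True), d(3), d('hi'))
>>> bool:True int:3 hi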
How Transforms make use of TypeDispatch
Okay, here's another one! I couldn't have imagined that I would ever understand this part of V2, but now that I do, it just seems surreal! This is Python at the next level! And when you come to think of it, you can understand why it's built this way. But let's discuss the thought process a little later.
First, let's understand encodes and decodes inside Transform!
So, from _TfmDict:
class _TfmDict(dict):
    def __setitem__(self,k,v):
        if k=='_': k='encodes'
        if k not in ('encodes','decodes') or not isinstance(v,Callable): return super().__setitem__(k,v)
        if k not in self: super().__setitem__(k,TypeDispatch())
        res = self[k]
        res.add(v)
As long as the key being set is not encodes or decodes, the namespace of the cls is created using dict as per normal behavior. Note that __setitem__ is responsible for setting k:v inside a dict; thus, if you override it, you can get custom behavior! So as long as the key is not encodes or decodes, we just use dict to set k:v.
BUT, when the key is encodes or decodes, then k maps to a TypeDispatch(). And as we know, TypeDispatch is nothing but a cool dict of type:function mappings!
So theoretically speaking, the namespace of this special class, whose metaclass is _TfmMeta, will look something like:

{
    ...all the usual stuff like __module__: '__main__' etc., AND
    encodes:
    {
        bool: some_func1,
        int: some_func2,
        numbers.Integral: some_func3
    },
    decodes:
    {
        bool: some_reverse_func1,
        int: some_reverse_func2,
        numbers.Integral: some_reverse_func3
    }
}
And finally! When you call encodes or decodes, it can be done for different types; the call goes through __call__ inside TypeDispatch, which then calls the specific function corresponding to the type!
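Here is a hedged, minimal re-creation of that namespace trick (CollectingDict and Meta are made-up names; the real code collects encodes/decodes into a TypeDispatch keyed by the annotation, while this toy just collects them into a list):

class CollectingDict(dict):
    "Class-body namespace that collects repeated `encodes` defs instead of overwriting"
    def __setitem__(self, k, v):
        if k == 'encodes' and callable(v):
            self.setdefault('_encodes', []).append(v)
        else:
            super().__setitem__(k, v)

class Meta(type):
    @classmethod
    def __prepare__(mcls, name, bases): return CollectingDict()
    def __new__(mcls, name, bases, ns): return super().__new__(mcls, name, bases, dict(ns))

class MyTfm(metaclass=Meta):
    def encodes(self, x:int): return x+1
    def encodes(self, x:str): return x.upper()

print([f.__annotations__['x'].__name__ for f in MyTfm._encodes])
>>> ['int', 'str']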
Demystifying __new__
So here's another insight, which I understood when I was looking into __new__ inside _TfmMeta.
class _TfmMeta(type):
    def __new__(cls, name, bases, dict):
        res = super().__new__(cls, name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res
    def __call__(cls, *args, **kwargs):
        f = args[0] if args else None
        n = getattr(f,'__name__',None)
        for nm in _tfm_methods:
            if not hasattr(cls,nm): setattr(cls, nm, TypeDispatch())
        if isinstance(f,Callable) and n in _tfm_methods:
            getattr(cls,n).add(f)
            return f
        return super().__call__(*args, **kwargs)
    @classmethod
    def __prepare__(cls, name, bases): return _TfmDict()
So let's understand: what does __new__ do? To do so, let's create a simple class Meta, which inherits from type (similar to _TfmMeta) and has the same __new__ method, and a class A whose metaclass is Meta.
class Meta(type):
    def __new__(cls, name, bases, dict):
        print("I'm alive!", super())
        res = super().__new__(cls, name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res

class A(metaclass=Meta):
    a=1;b=1
    def __init__(self, a=1, b=1):
        super().__init__()

>>> I'm alive! <super: <class 'Meta'>, <Meta object>>
Well, it's not exactly the same, but it prints out I'm alive! when called, and also spits out what super() is. Since type is a metaclass, Meta also becomes a metaclass.
And what exactly does __new__ do here? It delegates via super() to type's __new__, which actually creates the new class. This should be the same as calling type(name, bases, dict).
class Meta(type):
    def __new__(cls, name, bases, dict):
        print("I'm alive!", super())
        res = type(name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res

class A(metaclass=Meta):
    a=1;b=1
    def __init__(self, a=1, b=1):
        super().__init__()

>>> I'm alive! <super: <class 'Meta'>, <Meta object>>
As you can see, the same result! From the Python Data Model:

__new__ takes the class of which an instance was requested as its first argument

So we have to call __new__(cls, <other args>), and in this case, since we are creating a new class from Meta, the other args are name, bases, dict, which need to be passed to type to create the new class.
Therefore, res is the new class. Next, we just update its __signature__ to be the same as that class's __init__.
a = A()
a.__signature__
>>> <Signature (self, a=1, b=1)>
This is exactly what happens with Transforms too:
t = Transform()
t.__signature__
>>> <Signature (self, enc=None, dec=None, filt=None, as_item=False)>
And there we go: __new__ has been demystified!
Fantastic explanation Aman. Really helps to have a detailed explanation to save me from going down every rabbit hole myself. Thank you!
Inside 03_data_pipeline.ipynb
We test for an empty pipe:
pipe = Pipeline()
test_eq(pipe(1), 1)
pipe.set_as_item(False)
test_eq(pipe((1,)), (1,))
I don't understand why we do test_eq(pipe((1,)), (1,)) here, since it's an empty pipe with a noop.
The pipeline looks like Pipeline: (#1) [Transform: False {'object': 'noop'} {}]
Since it's a noop, even doing something like the below passes too!

pipe.set_as_item(True)
test_eq(pipe((1,)), (1,))

I'm not sure when the two tests would differ for a noop.
Also, I'm wondering whether adding functionality that could hook into a Pipeline at a specific point, to spit out the result at that point, would be helpful?
pipe = Pipeline([neg_tfm, int_tfm])
start = 2.0
t = pipe(start)
test_eq_type(t, Int(-2))
Something like test_eq_type(t.hook(1), -2.0), where hook(1) stands for the output after the 1st Transform inside the pipe? I believe this could help in debugging later on when dealing with bigger Pipelines.
*Edit: Never mind! V2 already has this covered with pipe.show, as explained by Jeremy here. Though, from what I understand currently, it's used for decodes only, and I'm not sure how it would show the outputs during encodes, or during decodes after a specific step.
Maybe Pipeline could have a debug:bool param in __call__ and decode that prints the result after each tfm…
Or it could even be debug:Callable, which passes the intermediate result to some arbitrary function, defaulting to print().
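Here's a hedged sketch of that debug:Callable idea (run_pipeline is a made-up stand-in; the real Pipeline composes Transform objects rather than plain callables):

def run_pipeline(x, tfms, debug=None):
    "Apply each tfm in turn, handing every intermediate result to `debug`"
    for i, tfm in enumerate(tfms):
        x = tfm(x)
        if debug: debug(f"after tfm {i}: {x!r}")
    return x

run_pipeline(2.0, [lambda o: -o, int], debug=print)
>>> after tfm 0: -2.0
>>> after tfm 1: -2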
I'd love to have this callable act like logging, with levels. Also, I sometimes really want something similar to control the verbosity of fastprogress. For example, when using Google Colab, the default behavior of fastprogress can consume quite a bit of network bandwidth.
There's something just so magical about V2.
After rewatching walk-thru #6, here is the intuition I got:
Transforms → Pipeline → TfmdList → TfmdDS → DataSource → "Infinite possibilities"
As we saw earlier, a Transform can encode or decode an item. Let's just keep it at that.
What if you want multiple Transforms in a series/sequence? Well, enter Pipeline. A pipe can apply multiple transforms to one item.
But wait, how is that going to help? In data science we have batches, i.e., multiple items. The solution? As expected, TfmdList! It will apply a number of transforms to each item in a list: self.tfms(super()._get(i))
Okay, great! But we have dependent and independent variables, i.e., an X and a y? Now what? Should we repeat this process every time and create two separate TfmdLists? Nah, don't be silly! This is covered in TfmdDS, like so: self.tls = [TfmdList(items, t, do_setup=do_setup, filt=filt, use_list=use_list) for t in L(tfms)]
I am already in LOVE with V2!
So this takes care of two sets of pipelines, ready to be applied to the same set of items or Ls, to get a dependent and an independent variable. We are ready to train now, aren't we?!
Yes, we are! BUT, we need a train set and a validation set to do beautiful work! Well, lo and behold - enter DataSource!
!
Pass in a list of filters or idxs, and these filters will be passed all the way back until we reach the transforms, which have the intelligence or capacity to apply tfms only to the filter we passed; otherwise nothing is done:

if filt!=self.filt and self.filt is not None: return x
Transforms → Pipeline → TfmdList → TfmdDS → DataSource → "Infinite possibilities with filters"
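Here's a hedged toy sketch of that layering (plain functions and list comprehensions standing in for the real classes), showing how one pipeline per variable over the same items yields (x, y) tuples:

items = [3.0, 1.0, 2.0]
x_tfms = [lambda o: -o]        # pipeline for the independent variable
y_tfms = [lambda o: o > 2]     # pipeline for the dependent variable

def apply(tfms, o):
    for t in tfms: o = t(o)
    return o

# TfmdList-like: one pipeline over a list of items
xs = [apply(x_tfms, o) for o in items]
# TfmdDS-like: several pipelines over the *same* items -> (x, y) tuples
ds = [(apply(x_tfms, o), apply(y_tfms, o)) for o in items]
print(ds)
>>> [(-3.0, True), (-1.0, False), (-2.0, False)]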
Beautiful
I'd be more than happy to work on this. I don't know how to add it yet, but I will get it done.
There is an aspect of DataBunch that confuses me. Usually it uses TfmdDL on top of TfmdDS to load the data, so we have two levels at which transforms can be applied: directly in the dataset, or through the loader - with the loader being more general, since we can plug transforms in at more places, like after_batch.
It would seem to me that the transforms which live in the dataset could just as well be placed in the loader's after_item with the same effect. If that's the case, why don't we use a simple transform-less dataset instead? If I'm missing something, what are examples where it's useful to have transforms on both levels?
Good question, @slawekbiel! In a DataSource, each Pipeline is independent. So your transforms in each need only worry about their own pipeline. It's very convenient to simply pass in two pipelines, for instance, and get out your independent and dependent variables.
However, in TfmdDL there's just one pipeline in after_item, so it has to be able to handle tuples. If you try to replicate, for instance, the examples in nb 08 using only TfmdDL, you'll see that it becomes much more tricky!
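A hedged illustration of that difference (the toy tfms are made up): with independent pipelines each tfm sees a single value, while a single after_item-style pipeline must handle the whole (x, y) tuple itself.

item = ("cat.jpg", 3)

# DataSource-style: one pipeline per variable; each tfm sees only its own value
x = item[0].upper()          # x-pipeline tfm
y = item[1] * 10             # y-pipeline tfm

# TfmdDL-style: one after_item pipeline, so the tfm must deal with the tuple
def tuple_tfm(t): return (t[0].upper(), t[1] * 10)

assert (x, y) == tuple_tfm(item)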
Ok, I see what you are saying: DataSource has multiple TfmdLists, each with a completely independent pipeline, and TfmdDL can only use the type hints to work on parts of the tuples.
So an alternative design, with all the transforms happening in the loaders, would probably need some way of stacking pipelines to get the same result. You'd also have to figure out how to do the pipelines' setup(), and how to connect pipelines across different callbacks. An upside of that would be if you ever needed separate flows after the data has been batched and put on the GPU.
Yes, exactly. Generally, we think the best approach is to use the dataset tfms just to set up the basic data types and create tensors of the same size, so they can be collated into a batch, and then do everything else on the GPU.
Alternatively, it may be possible to create an alternative loading mechanism using Nvidia DALI, although we haven't really done much with that yet.