Fastai v2 transforms / pipeline / data blocks

miko · September 12, 2019, 1:55pm

Understood, thanks

jeremy · September 12, 2019, 3:04pm

This is the right approach. The user won’t exactly construct the Pipeline object themselves - they’ll just provide the list(s) of tfms, that will be automatically wrapped in a Pipeline by DataSource or TfmdDL or whatever.

Note also then that you’ll probably want to create types for the higher level data blocks API, defined and explained in NB 50. We can talk more about that API on Monday so let me know if you have questions. Basically the ts/types param in DataBlock is where you can bring together any tfms that most users will want most of the time for your data type.

radek · September 13, 2019, 9:01am

I’m just curious if the order attribute on a Transform is necessary? Might be I am not seeing something (I certainly have not looked at the higher level APIs) but just wanted to share my observation on some of the challenges that I feel that having the order attribute there introduces.

We provide the transforms to the pipeline as a list, and a list communicates that the order is important. But then behind the curtain the pipeline will reorder itself. That this happens can be figured out with the slightest of code reading, but it might still be a little surprising to the user.

Also, a very appealing aspect of Pipelines to me is that they are infinitely composable. A scenario where I have a Transform defined in one project that I want to move to another (or even another notebook doing something slightly different with the same data) is very plausible. But whenever I move the Transform to a different pipeline I always need to be aware of the order attributes of Transforms that are in the Pipeline already.

I have not used this of course so maybe I am talking nonsense but just thinking about it I would love if I could order Transforms in a pipeline, including custom Transforms written by me, and not care about the order attribute at all.

On the other hand I see how the order attribute can be a bit of extra documentation to the user telling them in what order the Transforms that come with the library are intended to be run. But I am not sure if this added complexity for custom scenarios makes sense - as a user I would probably not rely on this information anyhow but would use the notebooks as starting point when working on a project of a given type (which probably should have the basic set of ordered Transforms already).

Sorry, this might be completely invalid reasoning depending on what is happening in other parts of the library, but thought I’d share.

sgugger · September 13, 2019, 1:45pm

The order is mainly there to organize the default transforms (that a user of high-level API doesn’t want to pass along) with added transforms. In the data block API they’ll be both concatenated, but they need to be reorganized.
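A minimal sketch of that idea, assuming nothing about fastai's real classes: `Step`, `merge_steps`, the step names and the `order` values below are all hypothetical. It only shows how default and user-supplied steps could be concatenated and then sorted by an `order` attribute.

```python
from dataclasses import dataclass

@dataclass
class Step:
    # Hypothetical stand-in for a transform; only the name and order matter here.
    name: str
    order: int = 0

def merge_steps(defaults, extra):
    # Concatenate both lists, then sort by `order`.
    # sorted() is stable, so steps with equal order keep their relative position.
    return sorted(list(defaults) + list(extra), key=lambda s: s.order)

defaults = [Step("to_tensor", order=10), Step("normalize", order=99)]
extra = [Step("resize", order=5), Step("flip", order=5)]
merged = merge_steps(defaults, extra)
print([s.name for s in merged])  # ['resize', 'flip', 'to_tensor', 'normalize']
```

With this kind of scheme the user's list order still matters among steps that share the same `order` value.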

radek · September 13, 2019, 2:41pm

Thank you very much, I think I understand now. Using the higher-level API we can optionally pass transforms that need to be mixed somehow with the default ones. Makes sense. Thank you very much for your reply!

arora_aman · September 16, 2019, 5:53pm

Understanding `TypeDispatch`

So I’ve spent a bit of time trying to understand TypeDispatch, and it’s really powerful! Basically, it’s a dictionary that maps types to functions.

You can refer to type hierarchy here

Let’s dig deeper and you’ll see how powerful it is!

def __init__(self, *funcs):
    self.funcs,self.cache = {},{}
    for f in funcs: self.add(f)
    self.inst = None

__init__ takes any number of functions and adds each one to a dictionary of type:func mappings. Inside TypeDispatch, the type is determined by the annotation of the first parameter of a function f.

Too confusing? Let’s put it together.

#export
class TypeDispatch:
    "Dictionary-like object; `__getitem__` matches keys of types using `issubclass`"
    def __init__(self, *funcs):
        self.funcs,self.cache = {},{}
        for f in funcs: self.add(f)
        self.inst = None

    def _reset(self):
        self.funcs = {k:self.funcs[k] for k in sorted(self.funcs, key=cmp_instance, reverse=True)}
        self.cache = {**self.funcs}

    def add(self, f):
        "Add type `t` and function `f`"
        self.funcs[_p1_anno(f) or object] = f
        self._reset()

    def __repr__(self): return str({getattr(k,'__name__',str(k)):v.__name__ for k,v in self.funcs.items()})

Let’s look at a simpler version of TypeDispatch

Now, let’s create a function:

def some_func(a:numbers.Integral, b:bool)->TensorImage: pass

and pass it to TypeDispatch

t = TypeDispatch(some_func); t

>>>{'Integral': 'some_func'}

Voilà! TypeDispatch works! But how?

Step-1: __init__ takes one or more functions. To start with, self.funcs and self.cache are both empty, as set by self.funcs,self.cache = {},{}

Step-2: for f in funcs: self.add(f) loops through each function passed in and adds it to the dictionary self.funcs using add.
Inside add, we check the annotation of the first parameter of function f; if there is none, we use the type object instead.
Thus self.funcs ends up holding a mapping from the type of the first parameter of f to f itself.

Step-3: _reset reorders the self.funcs dictionary using the key cmp_instance with reverse=True, which sorts the keys according to Python’s type hierarchy. Thus if you pass int and bool, the first item inside this dict will be bool.
Finally, self.cache is made a copy of self.funcs. We use the cache to look up mappings later. Since looking up a key in a dict is O(1), it’s much faster.

And finally we have __repr__, which just returns the mapping self.funcs, but showing the name of each type and the name of each f.
The reason there is a getattr in getattr(k,'__name__',str(k)) is, I think, that a type might not have a __name__ attribute when we use metaclasses.

Hopefully, this helps everyone! Please feel free to correct me if I understood something wrong.

We do reorder, as Jeremy said in walk-thru 5, because we try to find the closest match from Transforms. Thus, for an integer the closest match would first be int and not numbers.Integral.

Also, inside docstring of __getitem__ : "Find first matching type that is a super-class of k "
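To make the "closest match" idea concrete, here is a toy dispatcher written from scratch. It is only a sketch under my own assumptions, not fastai's TypeDispatch: `MiniDispatch` and the three `on_*` functions are hypothetical names, and instead of sorting with cmp_instance it simply picks, among all registered types that match, the one that is a subclass of the most other matches.

```python
import inspect
import numbers

class MiniDispatch:
    """Toy type -> function table (hypothetical, not fastai's implementation)."""
    def __init__(self, *funcs):
        self.funcs = {}
        for f in funcs:
            self.add(f)

    def add(self, f):
        # Key on the annotation of the first parameter, defaulting to object.
        first = next(iter(inspect.signature(f).parameters.values()))
        anno = first.annotation
        self.funcs[object if anno is inspect.Parameter.empty else anno] = f

    def __getitem__(self, typ):
        matches = [k for k in self.funcs if issubclass(typ, k)]
        if not matches:
            return None
        # The most specific match is a subclass of the most other matches.
        best = max(matches, key=lambda k: sum(issubclass(k, o) for o in matches))
        return self.funcs[best]

    def __call__(self, x, *args, **kwargs):
        f = self[type(x)]
        return x if f is None else f(x, *args, **kwargs)

def on_bool(x: bool): return "bool"
def on_int(x: int): return "int"
def on_integral(x: numbers.Integral): return "integral"

d = MiniDispatch(on_integral, on_int, on_bool)
print(d(True), d(2), d("text"))  # bool int text
```

Even though True is also an int and a numbers.Integral, the bool function wins because bool is the most specific registered type; a str has no match at all, so it is returned unchanged.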

Understanding `TypeDispatch` - Part 2

Here’s an insight!

So now that we know TypeDispatch is nothing but a pretty cool dict that looks something like:

{
    bool: some_func1,
    int: some_func2,
    numbers.Integral: some_func3
}

i.e., it is a mapping between a type and the function that needs to be called on that specific type.

This is done through __call__ inside TypeDispatch, of course!

    def __call__(self, x, *args, **kwargs):
        f = self[type(x)]
        if not f: return x
        if self.inst: f = types.MethodType(f, self.inst)
        return f(x, *args, **kwargs)

f = self[type(x)] checks the type of the argument being passed, looks it up in the TypeDispatch dict and returns the function to call.
i.e., for foo(2), type(2) is int, so we look up int. The lookup goes through __getitem__, which simply returns the first matching type that is a super-class of the type passed in.

So we lookup inside self.cache which is also a mapping like

{
    bool: some_func1,
    int: some_func2,
    numbers.Integral: some_func3
}

and we will find a function some_func2 for int . Thus, __getitem__ will return some_func2 as f .

So, f = self[type(x)] sets f as some_func2 .

This is the magic! We will call the specific function using __call__ for the specific type based on the parameter being passed!!

Thus when we pass a TensorImage, it will find the function that corresponds to TensorImage from inside dict and call it which is just as simple as return f(x, *args, **kwargs) !
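One detail in __call__ that is easy to skip over is the self.inst line, which uses types.MethodType from the standard library to bind a plain function to an instance, so that the instance is passed as self. A small illustration, where `Owner` and `times_scale` are made-up names for this example only:

```python
import types

class Owner:
    def __init__(self, scale):
        self.scale = scale

# A plain module-level function whose first parameter is meant to be an instance.
def times_scale(self, x):
    return x * self.scale

owner = Owner(scale=3)
# Bind the function to `owner`; calling `bound(2)` now means times_scale(owner, 2).
bound = types.MethodType(times_scale, owner)
result = bound(2)
print(result)  # 6
```

So when a TypeDispatch has an inst set, the function it finds gets called as if it were a method of that instance.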

How Transforms make use of TypeDispatch

Okay, here’s another one! I couldn’t have imagined that I will ever understand this part of V2, but now that I do, it just seems surreal! This is Python at a next level! And when you come to think of it, you can understand why it’s built this way.

But let’s discuss the thought process a little later.

First let’s understand encodes and decodes inside Transform !

So, from _TfmDict

class _TfmDict(dict):
    def __setitem__(self,k,v):
        if k=='_': k='encodes'
        if k not in ('encodes','decodes') or not isinstance(v,Callable): return super().__setitem__(k,v)
        if k not in self: super().__setitem__(k,TypeDispatch())
        res = self[k]
        res.add(v)

As long as the key is not encodes or decodes (or the value is not callable), the namespace of the cls is filled in using dict as per normal behavior; a key of _ is treated as encodes. Note that __setitem__ is responsible for setting k:v inside a dict, so if you override it, you can get custom behavior!

So for anything other than encodes or decodes, just use dict to set k:v.

BUT, when the key is encodes or decodes, the stored value is a TypeDispatch() and the function v is added to it.

And as we know - TypeDispatch is nothing but a cool dict of type:function mapping!

So theoretically speaking, the namespace of a class whose metaclass is _TfmMeta will look something like

{
 ...all the usual stuff like __module__: __main__ etc., AND
 encodes:
    {
     bool: some_func1,
     int: some_func2,
     numbers.Integral: some_func3
    },
 decodes:
    {
     bool: some_reverse_func1,
     int: some_reverse_func2,
     numbers.Integral: some_reverse_func3
    },
}

And finally! When you call encodes or decodes, the call goes through __call__ inside TypeDispatch, which then calls the specific function registered for the type of the argument!
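The namespace trick can be tried in isolation. The sketch below is my own stripped-down version, not fastai's code: `CollectingDict`, `CollectingMeta` and `Demo` are hypothetical names, and a plain dict stands in for TypeDispatch. It shows how __prepare__ plus a custom __setitem__ lets a class body define encodes several times without the later definitions overwriting the earlier ones.

```python
import inspect

class CollectingDict(dict):
    """Class-body namespace that gathers every `encodes` definition."""
    def __setitem__(self, key, value):
        if key != "encodes" or not callable(value):
            return super().__setitem__(key, value)
        if "encodes" not in self:
            super().__setitem__("encodes", {})
        # params[0] is `self`; params[1] is the argument we dispatch on.
        params = list(inspect.signature(value).parameters.values())
        anno = params[1].annotation
        key_type = object if anno is inspect.Parameter.empty else anno
        self["encodes"][key_type] = value

class CollectingMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases):
        # The class body is executed with this object as its namespace.
        return CollectingDict()

class Demo(metaclass=CollectingMeta):
    def encodes(self, x: int): return x + 1
    def encodes(self, x: str): return x.upper()

print(sorted(t.__name__ for t in Demo.encodes))  # ['int', 'str']
```

In a normal class the second def would replace the first; here both survive, keyed by the annotation of their argument.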

jeremy · September 17, 2019, 1:43pm

A post was split to a new topic: Leaf classification

arora_aman · September 18, 2019, 7:24am

Demystifying `__new__`

So here’s another insight which I understood when I was looking into __new__ inside _TfmMeta.

class _TfmMeta(type):
    def __new__(cls, name, bases, dict):
        res = super().__new__(cls, name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res

    def __call__(cls, *args, **kwargs):
        f = args[0] if args else None
        n = getattr(f,'__name__',None)
        for nm in _tfm_methods:
            if not hasattr(cls,nm): setattr(cls, nm, TypeDispatch())
        if isinstance(f,Callable) and n in _tfm_methods:
            getattr(cls,n).add(f)
            return f
        return super().__call__(*args, **kwargs)

    @classmethod
    def __prepare__(cls, name, bases): return _TfmDict()

So let’s understand what __new__ does.

To do so let’s create a simple class Meta which inherits from type similar to _TfmMeta with the same __new__ method and class A whose metaclass is Meta.

import inspect

class Meta(type):
    def __new__(cls, name, bases, dict):
        print("I'm alive!", super())
        res = super().__new__(cls, name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res

class A(metaclass=Meta): 
    a=1;b=1
    def __init__(self, a=1, b=1): 
        super().__init__()

>>> I'm alive! <super: <class 'Meta'>, <Meta object>>

Well, it’s not exactly the same, but it prints I'm alive! when the class A is created and also shows what super() is. Since type is a metaclass, Meta, which inherits from it, is a metaclass too.

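And what exactly does __new__ do here? It delegates via super() to type.__new__, which actually creates the new class. This is similar to calling type(name, bases, dict), with one difference: type(name, bases, dict) builds a plain class whose metaclass is type rather than Meta. For this example the printed output is the same either way.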

class Meta(type):
    def __new__(cls, name, bases, dict):
        print("I'm alive!", super())
        res = type(name, bases, dict)
        res.__signature__ = inspect.signature(res.__init__)
        return res

class A(metaclass=Meta): 
    a=1;b=1
    def __init__(self, a=1, b=1): 
        super().__init__()

>>>  I'm alive! <super: <class 'Meta'>, <Meta object>>

As you can see, the printed output is the same! From the Python Data Model,

__new__ takes the class of which an instance was requested as its first argument

So we have to call __new__(cls, <other args>). In this case we are creating a new class from Meta, so the other args are name, bases and dict, which need to be passed along to type to create the new class.

Therefore, res is the new class. Next we just set its __signature__ to be the same as the signature of that class’s __init__.
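A quick way to compare the super().__new__ and type(name, bases, dict) versions is to look at the type of the class each one produces. The names `MetaSuper`, `MetaPlain`, `A` and `B` below are just for this check:

```python
class MetaSuper(type):
    def __new__(mcls, name, bases, ns):
        # Delegates to type.__new__ but keeps `mcls` as the metaclass.
        return super().__new__(mcls, name, bases, ns)

class MetaPlain(type):
    def __new__(mcls, name, bases, ns):
        # Builds an ordinary class; the metaclass information is dropped.
        return type(name, bases, ns)

class A(metaclass=MetaSuper): pass
class B(metaclass=MetaPlain): pass

print(type(A).__name__)  # MetaSuper
print(type(B).__name__)  # type
```

Both versions create a usable class, but only the super() version leaves the metaclass attached, which matters as soon as the metaclass also defines something like __call__.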

a = A()
a.__signature__

>>> <Signature (self, a=1, b=1)>

This is exactly what happens with Transforms too:

t = Transform()
t.__signature__

>>> <Signature (self, enc=None, dec=None, filt=None, as_item=False)>

And there we go __new__ has been demystified!

MadeUpMasters · September 18, 2019, 11:09am

Fantastic explanation Aman. Really helps to have a detailed explanation to save me from going down every rabbit hole myself. Thank you!

arora_aman · September 18, 2019, 5:21pm

Inside 03_data_pipeline.ipynb

We test for empty pipe:

pipe = Pipeline()
test_eq(pipe(1), 1)
pipe.set_as_item(False)
test_eq(pipe((1,)), (1,))

I don’t understand why we do test_eq(pipe((1,)), (1,)) here, since it’s an empty pipe with a noop.

The pipeline looks like Pipeline: (#1) [Transform: False {'object': 'noop'} {}]

Since it’s a noop, even doing something like the below passes too!

pipe.set_as_item(True)
test_eq(pipe((1,)), (1,))

I’m not sure when the two tests would be different for a noop.

@jeremy @sgugger

arora_aman · September 18, 2019, 5:34pm

Also, wondering if adding a functionality that could hook into Pipeline at specific point to spit out the result at that point would be helpful?

pipe = Pipeline([neg_tfm, int_tfm])

start = 2.0
t = pipe(start)
test_eq_type(t, Int(-2))

Something like test_eq_type(t.hook(1), -2.0) where hook(1) stands for output after 1st Transform inside pipe? I believe this could help in debugging later on when dealing with bigger Pipelines

*Edit: Never mind! V2 already has this covered with pipe.show as explained by Jeremy here. Though, from what I understand currently, it’s used for decodes only, and I’m not sure how it would show the outputs during encodes or during decodes after a specific step.

jeremy · September 18, 2019, 6:34pm

Maybe Pipeline could have a debug:bool param in __call__ and decode that prints the result after each tfm…

Or it could even be debug:Callable which passes the intermediate result to some arbitrary function, which defaults to print().
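A rough sketch of what the debug:Callable variant might look like. This is not an existing fastai API: `DebugPipeline` and its `debug` parameter are hypothetical, and plain functions stand in for transforms.

```python
import operator

class DebugPipeline:
    """Toy stand-in for a pipeline; `debug` is a hypothetical parameter."""
    def __init__(self, funcs):
        self.funcs = list(funcs)

    def __call__(self, x, debug=None):
        for i, f in enumerate(self.funcs, start=1):
            x = f(x)
            # Hand each intermediate result to the callback, if one was given.
            if debug is not None:
                debug(f"after step {i} ({f.__name__}): {x!r}")
        return x

pipe = DebugPipeline([operator.neg, int])
seen = []
out = pipe(2.0, debug=seen.append)
print(out)   # -2
print(seen)  # ['after step 1 (neg): -2.0', 'after step 2 (int): -2']
```

Passing debug=print would print each step instead, and a logging function could be passed the same way.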

tmjiang · September 18, 2019, 7:20pm

I’d love to have this callable act like logging with levels. Also sometimes I really want something similar to control verbosity of fastprogress as well. For example, when using Google Colab, the default behavior of fastprogress can consume quite a bit of network bandwidth.

arora_aman · September 18, 2019, 7:26pm

There’s something just so magical about V2.

After rewatching walk thru #6, here is the intuition I got:

Transforms --> Pipeline --> TfmdList --> TfmdDS --> DataSource --> “Infinite possibilities”

As we saw earlier, a Transform can encode or decode an item. Let’s leave it at that.
What if you want multiple Transforms in a series/sequence? Well, enter Pipelines.
A pipe can apply multiple transforms to one item.

But wait, how is that going to help? In Data Science we have batches, i.e., multiple items. Solution? As expected, TfmdList! It will apply a number of transforms to each item in a list! self.tfms(super()._get(i))

Okay, great! But we have a dependent and an independent variable, i.e., an X and a y. Now what? Should we repeat this process every time and create two separate TfmdLists? Nah, don’t be silly! This is covered in TfmdDS, like so: self.tls = [TfmdList(items, t, do_setup=do_setup, filt=filt, use_list=use_list) for t in L(tfms)] I am already in LOVE with V2!

So this takes care of two sets of pipelines ready to be applied to the same set of items or Ls to get a dependent and an independent variable. We are ready to train now, aren’t we?!

Yes, we are! BUT, we need a train set and a validation set to do beautiful work! Well, lo and behold - enter DataSource!

Pass in a list of filters or idxs, and these filters will be passed all the way back until we reach the transforms. Each transform has the intelligence to apply itself only for the filter it was given; for any other filter it does nothing:
if filt!=self.filt and self.filt is not None: return x
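Here is a tiny standalone illustration of that check. `FilteredStep` is a made-up class, and the convention that filt=0 means train and filt=1 means validation is my assumption for the example:

```python
class FilteredStep:
    """Toy transform that only runs for the subset it was created for."""
    def __init__(self, func, filt=None):
        self.func, self.filt = func, filt

    def __call__(self, x, filt=None):
        # Same condition as above: skip when this step targets another subset.
        if filt != self.filt and self.filt is not None:
            return x
        return self.func(x)

augment = FilteredStep(lambda x: x + 100, filt=0)  # assumed: train only
scale = FilteredStep(lambda x: x * 2)              # filt=None: every subset

train_out = scale(augment(1, filt=0), filt=0)
valid_out = scale(augment(1, filt=1), filt=1)
print(train_out, valid_out)  # 202 2
```

A step with filt=None always runs, while a step tied to one subset passes everything else through untouched.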

Transforms <-- Pipeline <-- TfmdList <-- TfmdDS <-- DataSource <-- “Infinite possibilities with filters”

Beautiful

arora_aman · September 18, 2019, 7:33pm

I’d be more than happy to work on this

Don’t know how to add this yet, but I will get it done

slawekbiel · September 22, 2019, 2:34pm

There is an aspect of DataBunch that confuses me. Usually it uses TfmdDL on top of TfmdDS to load the data, so we have two levels at which transforms can be applied: directly in the dataset, or through the loader. The loader is the more general of the two, since we can plug transforms in at more places, like after_batch.

It would seem to me that the transforms which live in the dataset could just as well be placed in the loader’s after_item with the same effect. If that’s the case, why don’t we use a simple transformless dataset instead? If I’m missing something, what are the examples where it’s useful to have transforms on both levels?

jeremy · September 23, 2019, 1:04pm

Good question, @slawekbiel! In a DataSource each Pipeline is independent. So your transforms in each need only worry about their own pipeline. It’s very convenient to simply pass in two pipelines, for instance, and get out your independent and dependent variables.
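As a plain-Python illustration of "two independent pipelines over the same items" (no fastai involved; `pipeline`, `x_pipe`, `y_pipe`, the file names and the vocab are all invented for this sketch):

```python
def pipeline(*funcs):
    # Compose functions left to right into a single callable.
    def run(x):
        for f in funcs:
            x = f(x)
        return x
    return run

vocab = ["cat", "dog"]
# Each pipeline only knows about its own output; neither ever sees a tuple.
x_pipe = pipeline(lambda p: p.split("/")[-1], str.upper)  # stands in for "load the input"
y_pipe = pipeline(lambda p: p.split("/")[0], vocab.index)  # label name -> label index

items = ["cat/001.jpg", "dog/002.jpg"]
pairs = [(x_pipe(i), y_pipe(i)) for i in items]
print(pairs)  # [('001.JPG', 0), ('002.JPG', 1)]
```

Every function in x_pipe and y_pipe receives a single value, which is what keeps the transforms simple.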

However in TfmdDL there’s just one pipeline in after_item so it has to be able to handle tuples.

If you try to replicate, for instance, the examples in nb 08 using only TfmdDL you’ll see that it becomes much more tricky!

slawekbiel · September 23, 2019, 4:09pm

Ok, I see what you are saying, DataSource has multiple TfmdLists each with completely independent pipeline. And the TfmdDL can only use the typehints to work on parts of the tuples.

So an alternative design with all the transforms happening in the loaders would probably need some way of stacking pipelines to get the same result. You’d also have to figure out how to do the pipelines’ setup() and how to connect pipelines across different callbacks. An upside of that would be if you ever needed to have separate flows after the data has been batched and put on the GPU.

jeremy · September 23, 2019, 4:54pm

Yes exactly. Generally we think the best approach is to use the dataset tfms to just set up the basic data types and create tensors of the same size so they can be collated into a batch, and then do everything else on the GPU.

Alternatively, it may be possible to create an alternating loading mechanism using Nvidia DALI, although we haven’t really done much with that yet.

slawekbiel · October 8, 2019, 5:53pm

Alternatively, it may be possible to create an alternating loading mechanism using Nvidia DALI, although we haven’t really done much with that yet.

This got me curious and I played with it a bit. I took a simple image processing pipeline like this:

dsrc = DataSource(items, tfms)
tdl = TfmdDL(dsrc, bs=10, shuffle=True,
             after_item=[ Resize(224, method = ResizeMethod.Crop), ToTensor],
             after_batch=[Cuda, ByteToFloatTensor, Normalize(tmeans,tstds)],
             num_workers=8)

And created a DALI pipeline that does the same thing:

class DALIPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(DALIPipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root = image_dir,random_shuffle=True)        
        self.tfms = compose(
            ops.ImageDecoder(device = "mixed"),
            ops.Resize(device = "gpu", resize_shorter = 224),
            ops.CropMirrorNormalize(device = "gpu", crop = (224, 224), mean = means, std = stds)
        )
    def define_graph(self):
        images, labels = self.input(name="Reader")
        return self.tfms(images), labels

The results are encouraging performance wise, since I got 5x speed improvement (4sec vs 20sec for 20k images). There are several things that contribute to that:

Transforms implemented in C++, including their custom JPEG decoder
More work done on the GPU - here resizing and cropping of the images.
The pipeline prefetches batches, which allows the CPU and GPU work to overlap.

The downside is in flexibility: if we want to do something not included in the built-in operators, we have to either implement it in C++, build it and link it as a plugin, or use the [Torch]PythonFunction operator. However, these are limited at the moment, as they can’t be used in the exec_pipelined mode and can’t operate on the GPU at all, as far as I can tell.

This is again the same pipeline, but using Python to load files and assign labels, and DALI operators to decode, resize and normalize images.

class MixedPipeline(Pipeline):
    def __init__(self, np_items, batch_size, num_threads, device_id):
        super(MixedPipeline, self).__init__(batch_size, num_threads, device_id, exec_async=False, exec_pipelined=False)
        self.input_iter = iter(DataLoader(np_items, bs=batch_size, create_batch=noop, shuffle=True, num_workers = num_threads))
        self.path_input = ops.ExternalSource()
        self.y_tfms = compose(
            ops.PythonFunction(extract_label), 
            ops.PythonFunction(categorize)
        )
        self.x_tfms = compose(
            ops.PythonFunction(read_file),
            ops.ImageDecoder(device = "mixed", output_type = types.RGB),
            ops.Resize(device = "gpu", resize_shorter = 224),
            ops.CropMirrorNormalize(device = "gpu", crop = (224, 224), mean = means, std = stds)
        )
    def define_graph(self):
        self.paths = self.path_input()
        return (self.x_tfms(self.paths), self.y_tfms(self.paths))

    def iter_setup(self):
        self.feed_input(self.paths, next(self.input_iter))

This gave me a smaller, but still respectable 2x speed increase.
The whole notebook is here:
