fastcore.Pipeline setups method

tcapelle · October 21, 2020, 1:30pm

I am trying to make a Pipeline for image processing with some transforms of my own. It is not necessarily for deep learning, nor for fastai.
I am having trouble setting up the Pipeline class to work as intended. There is very little documentation on how the setups method gets called.

During the setup, the Pipeline starts with no transform and adds them one at a time, so that during its setup, each transform gets the items processed up to its point and not after.

I am doing the following:

from fastcore.transform import Transform, Pipeline

class A(Transform):
    def setups(self, x):
        print(f'A: {x}')
    def encodes(self, x): return 'a'
    
class B(Transform):
    def setups(self, x):
        print(f'B: {x}')
    def encodes(self, x): return 'b'

pipe = Pipeline([A,B])

pipe.setup(1)
>> A: 1
   B: 1

I would expect the output of transform A to go to the setup of transform B, but it does not work that way.
Looking at the code of the fastcore.transform.Pipeline class, it looks as if the same items are passed unchanged to every transform.


    def setup(self, items=None, train_setup=False):
        tfms = self.fs[:]
        self.fs.clear()
        for t in tfms: self.add(t,items, train_setup)

    def add(self,t, items=None, train_setup=False):
        t.setup(items, train_setup)
        self.fs.append(t)

I would expect something like items = t(items) after each transform is added.

versus · October 25, 2020, 6:45pm

(Are you sure it is def setups and not def setup in the Transforms?)

I agree, I have the same confusion.
I have a basic language processing pipeline: Tokenize → Numericalize
When I run setup on this pipeline, items arrive at Numericalize unchanged. I would expect them to arrive tokenized.

florianl · October 25, 2020, 9:08pm

What are you trying to achieve with your pipeline and transforms?

From the Transform docs https://fastcore.fast.ai/transform.html#Transform:

* **Preprocessing** - The setup method can be used to perform any one-time calculations to be later used by the transform, for example generating a vocabulary to encode categorical data.

So setup is called once when setting up the pipeline, and it is not supposed to return anything (check, e.g., the Categorize transform, which uses setups()).

The items will be passed through the encodes methods when calling the pipeline (o = pipe(items)).
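
To make the split between setup and encoding concrete, here is a minimal plain-Python sketch. `VocabEncoder` is a hypothetical stand-in, not fastcore's `Categorize`; it only mimics the pattern of a `setups` method that stores state once and an `encodes` method that uses that state on every call.

```python
class VocabEncoder:
    """Hypothetical stand-in for a Categorize-like transform (not fastcore code)."""

    def setups(self, items):
        # One-time calculation: build a sorted vocabulary from the setup items.
        self.vocab = sorted(set(items))
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
        # Nothing is returned: setup only stores state on the transform.

    def encodes(self, x):
        # Per-item work, run each time the transform is called.
        return self.o2i[x]

    def __call__(self, x):
        return self.encodes(x)


enc = VocabEncoder()
setup_result = enc.setups(["dog", "cat", "dog", "bird"])
encoded = [enc(o) for o in ["cat", "bird", "dog"]]
print(enc.vocab, encoded)
```

The setup call returns `None`; only the later calls produce encoded values.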

versus · October 26, 2020, 1:05am

In my example I do indeed want to build the vocabulary of a Numericalize transform. But I can only do this if the preceding transform has split all strings into tokens.

Tokenize --> Numericalize

However, when I run setup on the Pipeline, the Numericalize setup receives the unchanged, untokenized strings.
So, as a workaround, I need to execute the preceding Transforms manually (in the example above, just Tokenize) and pass the result to the Numericalize setup myself.

My expectation from the Pipeline was that it actually does this automatically.
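
The manual workaround described above can be sketched roughly like this. `Tokenize` and `Numericalize` here are hypothetical plain-Python stand-ins written for illustration, not the fastai transforms of the same name; the only assumption is that each has a `setups(items)` method and can be called on a single item.

```python
class Tokenize:
    """Hypothetical whitespace tokenizer; needs no setup."""
    def setups(self, items): pass
    def encodes(self, x): return x.split()
    def __call__(self, x): return self.encodes(x)


class Numericalize:
    """Hypothetical numericalizer; builds its vocab from tokenized items."""
    def setups(self, items):
        # items must already be lists of tokens, not raw strings
        tokens = sorted({tok for toks in items for tok in toks})
        self.o2i = {tok: i for i, tok in enumerate(tokens)}
    def encodes(self, toks): return [self.o2i[t] for t in toks]
    def __call__(self, toks): return self.encodes(toks)


texts = ["the cat sat", "the dog sat"]

tok, num = Tokenize(), Numericalize()
tok.setups(texts)                      # first transform sees the raw strings
tokenized = [tok(t) for t in texts]    # run the preceding transform by hand
num.setups(tokenized)                  # second transform sees tokenized items

ids = [num(tok(t)) for t in texts]
print(ids)
```

If `num.setups(texts)` were called with the raw strings instead, the vocabulary would be built from single characters, which is the failure mode in question.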

tcapelle · October 26, 2020, 9:49am

I am proposing a fix right now. This fixes my issue:

#export
@patch
def setup(self:Pipeline, items=None, train_setup=False):
    tfms = self.fs[:]
    self.fs.clear()
    for t in tfms:
        self.add(t, items, train_setup)
        # push the items through t so that the next setup sees its output
        if items is not None: items = [t(i) for i in items]

It breaks fastai… TfmDL. Something is not right in all this, but I am having a hard time finding exactly what.

tcapelle · October 26, 2020, 2:09pm

Yes, it is def setups; setup calls setups.
It is not that easy to fix this issue, because it is used all over fastai (they create a lot of pipelines with only one transform). I think I will need to summon @jeremy over here. Maybe the prof @muellerzr can also weigh in.

jeremy · October 26, 2020, 4:54pm

Transforms don’t change the value of inputs. Rather, they’re lazily evaluated.

TfmdLists shows a way to make this all work. Frankly, TfmdLists/FilteredBase/Pipeline is an over-complex mess and needs to be re-done in a way that’s much more clear and simple (which is entirely my fault).

tcapelle · October 26, 2020, 5:03pm

I wouldn’t mind helping with this; it is very complex indeed.
So for my particular issue, Pipeline would not do the trick. I still find it weird that the Pipeline setup expects the same input for all transforms.
My example is the following:

ToTensor: transforms the PILImage to a TensorImage, no setup
Tf1: does some cv2 magic, needs one image for setup
Tf2: does other cv2 magic, needs setup
So, if I construct

pipe = Pipeline([ToTensor, Tf1, Tf2])

and I call pipe.setup(pil_image), it does not work, as Tf1 and Tf2 receive the PILImage for their setup instead of the TensorImage.
Most setup of tfms in fastai happens at the Datasets level, so setup receives a different object than the transform itself does.

tcapelle · October 26, 2020, 9:09pm

I know, but setup should evaluate them and propagate the inputs through the Pipeline.
Or maybe not…
I don’t really know what the right solution is. What I see is that setup mostly requires a collection of inputs to take action (Normalize, Categorify, etc.), and it is reasonable that they take a Dataset in fastai. Maybe Pipeline.setup should not exist, and one should initialize the transforms one by one. As it is right now, all setups in the Pipeline take the same input. So if one tfm requires the output of the preceding one, it cannot be initialized by setup.
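
The idea of initializing the transforms one by one, each on the output of the previous ones, could look roughly like the sketch below. `setup_in_sequence`, `Scale` and `Center` are hypothetical names made up for illustration and are not part of fastcore; the helper only assumes that every transform has a `setups(items)` method and is callable on a single item.

```python
def setup_in_sequence(tfms, items):
    """Set up each transform on the output of the ones before it.

    Hypothetical helper, not part of fastcore.
    """
    items = list(items)
    for t in tfms:
        t.setups(items)                  # setup sees the items processed so far
        items = [t(o) for o in items]    # then push the items through t
    return items


class Scale:
    """Toy transform: divides by the maximum seen during setup."""
    def setups(self, items): self.max = max(items)
    def __call__(self, x): return x / self.max


class Center:
    """Toy transform: subtracts the mean seen during setup."""
    def setups(self, items): self.mean = sum(items) / len(items)
    def __call__(self, x): return x - self.mean


scale, center = Scale(), Center()
out = setup_in_sequence([scale, center], [2, 4, 6, 8])
print(scale.max, center.mean, out)
```

Here `Center` computes its mean on the scaled values rather than on the raw ones. Note that this evaluates every item eagerly during setup, which may be costly for a large dataset.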

versus · October 27, 2020, 3:38pm

It is fairly difficult to imagine a scenario in which a Transform needs setup on items that have not been pre-processed. So, I would second this: for the purpose of setup, the Transform object must receive items that are pre-processed by the preceding Transforms in a Pipeline.

versus · October 27, 2020, 5:18pm

Overall, it would be interesting to know opinions on differences between Transforms/Pipelines in FastAI vs. sklearn TransformerMixin and Pipeline.

FastAI’s take on it has a lot of magic that is not well documented, for example the handling of tuples vs. lists. But I am probably unaware of some apparent benefits of FastAI, and it would be great to discuss them.

tcapelle · October 28, 2020, 9:44am

This notebook needs way more documentation. Also, I think that the Pipeline should be even more generic (without train attributes, for instance), and fastai should probably subclass it to add the features it needs.

I will document this notebook.

jeremy · November 1, 2020, 11:18pm

You might want to wait until I fix the issues with it though…

jzast · August 1, 2024, 11:03pm

Hi, any update on propagating data through the setups?

A setup in a transform that sits in the middle of a pipeline very often depends on intermediate data.

I don’t see a way to get that intermediate data to the setup call of an intermediate transform using a Pipeline.

For example, three transforms that each need a setup call on pre-processed data:

    Raw Data -> Tokenize setup
    Tokenized Data -> Embedding setup
    Embedded Data -> Classifier setup

If we have a pipeline:

pipeline = Pipeline([TokTf, EmbTf, ClTf])

it seems natural to call setup in the way fit_transform operates with sklearn Pipelines:

pipeline.setup(items)

The way the code works now, the classifier would be set up with raw data rather than the intermediate embedded data. So this would fail.

Is there a workaround with Pipeline? Or is the design intent that each transform works on the same raw data, without any dependencies on prior transformations?