I am trying to build a Pipeline for image processing with some transforms of my own. It is not necessarily for deep learning, nor for fastai.
I am having trouble setting up the Pipeline class to work as intended. There is very little documentation on how the setups method is called.
During the setup, the Pipeline starts with no transform and adds them one at a time, so that during its setup, each transform gets the items processed up to its point and not after.
I would expect the output of transform A to go into the setup of transform B, but it does not work that way.
Looking at the code of the fastcore.transform.Pipeline class, it looks as if the items are fed just once:
def setup(self, items=None, train_setup=False):
    tfms = self.fs[:]
    self.fs.clear()
    for t in tfms: self.add(t, items, train_setup)

def add(self, t, items=None, train_setup=False):
    t.setup(items, train_setup)
    self.fs.append(t)
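To make the behavior concrete, here is a plain-Python mimic of the quoted setup/add logic (toy classes, no fastcore required), showing that every transform's setup receives the same raw items rather than the output of the transforms before it:

```python
# Plain-Python mimic of the quoted Pipeline.setup/add logic (toy classes,
# not fastcore itself), demonstrating that every transform's setup
# receives the same, unprocessed items.
class Upper:
    def setup(self, items): pass
    def __call__(self, s): return s.upper()

class Recorder:
    def setup(self, items): self.seen = list(items)  # record what setup receives
    def __call__(self, s): return s

class MiniPipeline:
    def __init__(self, fs): self.fs = list(fs)
    def setup(self, items=None):
        tfms = self.fs[:]
        self.fs.clear()
        for t in tfms: self.add(t, items)
    def add(self, t, items=None):
        t.setup(items)
        self.fs.append(t)

pipe = MiniPipeline([Upper(), Recorder()])
pipe.setup(['a', 'b'])
print(pipe.fs[1].seen)  # → ['a', 'b'] — the raw items, not Upper's output
```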
(Are you sure it is def setups and not def setup in the Transforms?)
I agree, I have the same confusion.
I have a basic language processing pipeline: Tokenize → Numericalize
When I run setup on this pipeline, the items arrive at Numericalize unchanged. I would expect them to arrive tokenized.
* **Preprocessing** - The setup method can be used to perform any one-time calculations to be later used by the transform, for example generating a vocabulary to encode categorical data.
So setup is called once when setting up the pipeline, and it is not supposed to return anything (check e.g. the Categorize Transform, which uses setups()).
The items will be passed through the encode methods when calling the pipeline (o = pipe(items))
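As a sketch of that pattern (hypothetical minimal class, not fastai's actual Categorize), a transform whose setups does the one-time vocabulary build and whose encodes is only applied later, when the pipeline is called:

```python
# Toy illustration of the "setups builds state once, encodes uses it lazily"
# pattern. Hypothetical class, not fastai's actual Numericalize/Categorize.
class ToyNumericalize:
    def setups(self, items):
        # one-time work: build a vocab from already-tokenized items
        self.vocab = sorted({tok for toks in items for tok in toks})
        self.o2i = {tok: i for i, tok in enumerate(self.vocab)}
    def encodes(self, toks):
        # called when the pipeline actually runs: o = pipe(items)
        return [self.o2i[t] for t in toks]

num = ToyNumericalize()
num.setups([['hello', 'world'], ['hello', 'there']])
print(num.encodes(['hello', 'world']))  # → [0, 2] (sorted vocab: hello, there, world)
```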
In my example I indeed want to build the vocabulary for a Numericalize transform. But I can only do this if the preceding transform has split all strings into tokens.
Tokenize --> Numericalize
However, when I run setup on the Pipeline, the Numericalize setup receives the unchanged, untokenized strings.
So, as a workaround, I need to execute the preceding Transforms manually (in the example above, just Tokenize) and manually pass the outcome to the Numericalize setup.
My expectation from the Pipeline was that it actually does this automatically.
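The workaround described above can be sketched like this, with toy stand-ins for Tokenize and Numericalize (hypothetical minimal classes, not fastai's): run the preceding transform by hand, then feed its output to the next transform's setup.

```python
# Workaround sketch: apply the preceding transform manually before
# calling the next transform's setup. Toy classes, not fastai's.
class ToyTokenize:
    def __call__(self, s): return s.split()

class ToyNumericalize:
    def setup(self, items):
        self.vocab = sorted({t for toks in items for t in toks})
        self.o2i = {t: i for i, t in enumerate(self.vocab)}
    def __call__(self, toks): return [self.o2i[t] for t in toks]

raw = ['hello world', 'hello there']
tok, num = ToyTokenize(), ToyNumericalize()
num.setup([tok(s) for s in raw])   # setup now sees tokens, not raw strings
print(num(tok('hello world')))     # → [0, 2]
```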
I am proposing a fix right now. This fixes my issue:
#export
@patch
def setup(self:Pipeline, items=None, train_setup=False):
    tfms = self.fs[:]
    self.fs.clear()
    for t in tfms:
        self.add(t, items, train_setup)
        items = [t(i) for i in items]
It breaks fastai… TfmdDL. Something is not right in all this, but I am having a hard time finding exactly what.
Yes, it is def setups, setup will call setups.
It is not that easy to fix this issue, because it is used all over fastai (they create a lot of pipelines with only one transform). I think I will need to summon @jeremy over here. Maybe the prof @muellerzr can also weigh in.
Transforms don’t change the value of inputs. Rather, they’re lazily evaluated.
TfmdLists shows a way to make this all work. Frankly, TfmdLists/FilteredBase/Pipeline is an over-complex mess and needs to be re-done in a way that’s much more clear and simple (which is entirely my fault).
I wouldn’t mind helping with this; it is very complex indeed.
So for my particular issue, Pipeline would not do the trick. I still find it weird that the Pipeline setup passes the same input to all transforms.
My example is the following:
ToTensor: transforms the PILImage to a TensorImage, no setup
Tf1: does some cv2 magic, needs one image for setup
Tf2: does other cv2 magic, needs setup.
So, if I construct
pipe = Pipeline([ToTensor, Tf1, Tf2])
and I do pipe.setup(pil_image), it does not work, as Tf1 and Tf2 receive the PILImage instead of the TensorImage in their setup.
Most fastai transform setups occur at the Datasets level, so setup receives a different object than the one the transform itself operates on.
I know, but setup should evaluate them and propagate the inputs through the Pipeline.
Or maybe not…
I don’t really know what the right solution is. What I see is that setup mostly requires a collection of inputs to act on (Normalize, Categorify, etc.), and it is reasonable that they take a Datasets in fastai. Maybe Pipeline.setup should not exist, and one should initialize the transforms one by one. As it is right now, all setups in the Pipeline take the same input. So if one tfm requires the output of the preceding one, it cannot be initialized by setup.
It is fairly difficult to imagine a scenario where a Transform needs setup without pre-processed items. So I would second this: for the purpose of setup, a Transform should receive items that have been pre-processed by the preceding Transforms in the Pipeline.
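Initializing the transforms one by one, with propagation, could be sketched like this (a hypothetical helper, not part of fastcore): each setup sees the items as already processed by the transforms before it, which is what the proposed patch above does inside Pipeline.setup.

```python
# Hypothetical helper (not part of fastcore): set up transforms one by one,
# feeding each setup the output of the transforms that precede it.
def setup_chained(tfms, items):
    for t in tfms:
        setup = getattr(t, 'setup', None)
        if setup: setup(items)          # setup sees already-processed items
        items = [t(o) for o in items]   # propagate outputs to the next transform
    return items

# Toy transforms to exercise the helper:
class Split:
    def __call__(self, s): return s.split()

class Truncate:
    def setup(self, items): self.max_len = max(len(x) for x in items)
    def __call__(self, toks): return toks[:self.max_len]

out = setup_chained([Split(), Truncate()], ['a b c', 'd e'])
print(out)  # → [['a', 'b', 'c'], ['d', 'e']]
```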
Overall, it would be interesting to hear opinions on the differences between Transforms/Pipelines in fastai vs. sklearn's TransformerMixin and Pipeline.
fastai's take on it involves a lot of magic that is not well documented, for example the handling of tuples vs. lists. But I am probably unaware of some of fastai's apparent benefits, and it would be great to discuss them.
This notebook needs much more documentation. Also, I think the Pipeline should be even more generic (without train attributes, for instance), and fastai should probably subclass it to add the features it needs.