Fastai v2 transforms / pipeline / data blocks

You should use a TfmdList, not a DataSource, as your transform already returns a tuple. A TfmdList can be converted to a DataBunch.


Is it possible that FilteredBase.databunch should also pass after_item?

Based on my last example.

dsrc = TfmdList(items, tfms=pets)[0])
> displays correctly an image

> displays correctly an image

db = dsrc.databunch()
batch = db.show_batch()
> AttributeError: 'Tensor' object has no attribute 'show'

> <bound method Pipeline.decode of Pipeline: (#1) [Transform: False (object,object) -> noop ]>

I feel like the method data.core._decode_batch is supposed to decode the input (probably from my original Transform PetTfm) except that TfmdDL.after_item.decode is a noop.

Would you have a similar example of going from a custom TfmdList (based on one transform returning both inputs and outputs) to a DataBunch?

You can pass after_item, after_batch and before_batch to your call to .databunch.

The issue is that the decoder needs to be called only for displaying the data.
It is called properly with TfmdList but not when I create a DataBunch.
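To make the symptom concrete, here is a toy sketch in plain Python. ToyTransform and ToyPipeline are hypothetical stand-ins, not fastai classes; they only mimic the encode/decode split. The sketch shows that a pipeline whose only transform is a noop hands the encoded value back untouched on decode, which would be consistent with the "'Tensor' object has no attribute 'show'" error above if the DataBunch's after_item does not contain the custom transform.

```python
# Toy stand-ins, not fastai classes: they only illustrate the encode/decode split.
class ToyTransform:
    def __init__(self, enc, dec=None):
        self.enc, self.dec = enc, dec

    def __call__(self, x):
        return self.enc(x)

    def decode(self, x):
        # With no decoder defined, decode is a no-op and returns x unchanged.
        return x if self.dec is None else self.dec(x)


class ToyPipeline:
    def __init__(self, tfms):
        self.tfms = list(tfms)

    def __call__(self, x):
        for t in self.tfms:
            x = t(x)
        return x

    def decode(self, x):
        # Decoding walks the transforms in reverse order.
        for t in reversed(self.tfms):
            x = t.decode(x)
        return x


vocab = ["cat", "dog"]
categorize = ToyTransform(vocab.index, lambda i: vocab[i])
with_decoder = ToyPipeline([categorize])
noop_only = ToyPipeline([ToyTransform(lambda x: x)])

encoded = with_decoder("dog")
shown = with_decoder.decode(encoded)    # back to a displayable label
not_shown = noop_only.decode(encoded)   # still the raw encoded value
```

So the decoder only matters at display time, but it has to live in the pipeline that the display code actually calls.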

I documented how I tested it in this notebook.

If I can make it work I’ll be happy to add it in the Pet Tutorial for future reference.


So I have been wondering how to go about running fastai2 on video data, or on data made of multiple 2D image slices of variable length. That is, x is a set of 2D slices composing a 3D volume, and the number of slices may vary between two distinct x’s (e.g. one video may have more frames than another because it’s a longer shot).

It seemed that the middle-level API is the right place to start. I successfully got a pipeline working but am having issues creating a dataset. It’s my first time working with the API, so it might be something obvious I’m missing.

As a toy example, I artificially aggregated image paths into bags, comparable to video-frame paths saved on disk, and gave each bag a binary label: True if the bag contains more 3s than 7s.
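For readers following along, the bagging and labelling described here could be sketched in plain Python roughly as follows. This is not the original code from the post: make_bags, digit_of and bag_label are hypothetical helper names, and the sketch assumes an MNIST-style folder layout where the parent folder name is the digit.

```python
import random

def make_bags(paths, min_len=2, max_len=5, seed=0):
    "Group `paths` into consecutive bags of varying length (toy stand-in for video frames)."
    rng = random.Random(seed)
    bags, i = [], 0
    while i < len(paths):
        n = rng.randint(min_len, max_len)
        bags.append(paths[i:i + n])   # the last bag may be shorter than min_len
        i += n
    return bags

def digit_of(path):
    # Assumes an MNIST-style layout where the parent folder is the digit, e.g. "train/3/0001.png".
    return path.split("/")[-2]

def bag_label(bag):
    "True if the bag holds more 3s than 7s."
    digits = [digit_of(p) for p in bag]
    return digits.count("3") > digits.count("7")

paths = [f"train/{d}/{i:04d}.png" for i, d in enumerate("3373777333")]
bags = make_bags(paths)
labels = [bag_label(b) for b in bags]
```

Each bag is a variable-length list of paths, so whatever opens the images has to accept a list rather than a single path.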

Dynamic images bags:

When I run the pipe, the indexing works. However, when I try to create the dataset, the i variable is for some reason a path. You can see this from the printed values of i.

Any help would be much appreciated.

How can I get a Dataset from the SiamesePair pipeline example in

I tried:

OpenAndResize = TupleTransform(resized_image)
labeller = RegexLabeller(pat = r'/([^/]+)_\d+.jpg$')
sp = SiamesePair(items,
pipe = Pipeline([sp, OpenAndResize], as_item=True)
dsets = Datasets(items, pipe)
t = dsets[0]

getting error:

TypeError                                 Traceback (most recent call last)
<ipython-input-66-cab6cdc85da8> in <module>
      4 pipe = Pipeline([sp, OpenAndResize], as_item=True)
      5 dsets = Datasets(items, pipe)
----> 6 t = dsets[0]
      7 type(t[0]),type(t[1])

~/Dev/fastai2/fastai2/data/ in __getitem__(self, it)
    256     def __getitem__(self, it):
--> 257         res = tuple([tl[it] for tl in self.tls])
    258         return res if is_indexer(it) else list(zip(*res))

~/Dev/fastai2/fastai2/data/ in <listcomp>(.0)
    256     def __getitem__(self, it):
--> 257         res = tuple([tl[it] for tl in self.tls])
    258         return res if is_indexer(it) else list(zip(*res))

~/Dev/fastai2/fastai2/data/ in __getitem__(self, idx)
    232         res = super().__getitem__(idx)
    233         if self._after_item is None: return res
--> 234         return self._after_item(res) if is_indexer(idx) else
    236 # Cell

~/Dev/fastai2/fastai2/data/ in _after_item(self, o)
    196     def _new(self, items, **kwargs): return super()._new(items, tfms=self.tfms, do_setup=False, types=self.types, **kwargs)
    197     def subset(self, i): return self._new(self._get(self.splits[i]), split_idx=i)
--> 198     def _after_item(self, o): return self.tfms(o)
    199     def __repr__(self): return f"{self.__class__.__name__}: {self.items}\ntfms - {self.tfms.fs}"
    200     def __iter__(self): return (self[i] for i in range(len(self)))

~/Dev/fastcore/fastcore/ in __call__(self, o)
    186         self.fs.append(t)
--> 188     def __call__(self, o): return compose_tfms(o, tfms=self.fs, split_idx=self.split_idx)
    189     def __repr__(self): return f"Pipeline: {' -> '.join([ for f in self.fs if != 'noop'])}"
    190     def __getitem__(self,i): return self.fs[i]

~/Dev/fastcore/fastcore/ in compose_tfms(x, tfms, is_enc, reverse, **kwargs)
    134     for f in tfms:
    135         if not is_enc: f = f.decode
--> 136         x = f(x, **kwargs)
    137     return x

~/Dev/fastcore/fastcore/ in __call__(self, x, **kwargs)
     69     @property
     70     def name(self): return getattr(self, '_name', _get_name(self))
---> 71     def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
     72     def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
     73     def __repr__(self): return f'{}: {self.use_as_item} {self.encodes} {self.decodes}'

~/Dev/fastcore/fastcore/ in _call(self, fn, x, split_idx, **kwargs)
     80         if split_idx!=self.split_idx and self.split_idx is not None: return x
     81         f = getattr(self, fn)
---> 82         if self.use_as_item or not is_listy(x): return self._do_call(f, x, **kwargs)
     83         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
     84         return retain_type(res, x)

~/Dev/fastcore/fastcore/ in _do_call(self, f, x, **kwargs)
     86     def _do_call(self, f, x, **kwargs):
---> 87         return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
     89 add_docs(Transform, decode="Delegate to `decodes` to undo transform", setup="Delegate to `setups` to set up transform")

~/Dev/fastcore/fastcore/ in __call__(self, *args, **kwargs)
     96         if not f: return args[0]
     97         if self.inst is not None: f = MethodType(f, self.inst)
---> 98         return f(*args, **kwargs)
    100     def __get__(self, inst, owner):

<ipython-input-63-605ff57d4e17> in encodes(self, i)
     11         othercls = self.clsmap[self.labels[i]] if random.random()>0.5 else self.idxs
     12         otherit = random.choice(othercls)
---> 13         return SiameseImage(self.items[i], self.items[otherit], self.labels[otherit]==self.labels[i])

~/Dev/fastcore/fastcore/ in __getitem__(self, idx)
    314     def _xtra(self): return None
    315     def _new(self, items, *args, **kwargs): return type(self)(items, *args, use_list=None, **kwargs)
--> 316     def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
    317     def copy(self): return self._new(self.items.copy())

~/Dev/fastcore/fastcore/ in _get(self, i)
    319     def _get(self, i):
    320         if is_indexer(i) or isinstance(i,slice): return getattr(self.items,'iloc',self.items)[i]
--> 321         i = mask2idxs(i)
    322         return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
    323                 else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')

~/Dev/fastcore/fastcore/ in mask2idxs(mask)
    253     "Convert bool mask or index list to index `L`"
    254     if isinstance(mask,slice): return mask
--> 255     mask = list(mask)
    256     if len(mask)==0: return []
    257     it = mask[0]

TypeError: 'PosixPath' object is not iterable

I just tried again and realized I might not have initialized the tfms correctly, but I’m still getting an error:

tfms = [[sp, OpenAndResize], [labeller, Categorize]]
dsets = Datasets(items, tfms, verbose=True)
t = dsets[0]
x,y = dsets.decode(t)

What’s the right way to get a siamese dataset following the tutorial notebook on pets?
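Reading the traceback above from the bottom up: self.items[i] inside encodes ends in mask2idxs, which calls list(mask) on a PosixPath. That suggests the i reaching SiamesePair.encodes is an item (a path), not an integer position. The plain-Python sketch below reproduces that failure mode; pair_from_index is a hypothetical stand-in for the transform and the file names are made up. If the transform indexes into its own items, it probably wants to be fed indices (for example a range over the items) rather than the items themselves, though that is an assumption about the intended usage.

```python
from pathlib import Path

items = [Path("images/beagle_1.jpg"), Path("images/pug_7.jpg"), Path("images/beagle_3.jpg")]

def pair_from_index(i):
    # Mirrors the idea of SiamesePair.encodes(i): `i` must be an integer position.
    return items[i], items[(i + 1) % len(items)]

# Feeding the items themselves reproduces the failure mode: a Path is not a valid index.
try:
    pair_from_index(items[0])
    failed = False
except TypeError:
    failed = True

# Feeding positions works.
pairs = [pair_from_index(i) for i in range(len(items))]
```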

I’m trying to replicate some code I have in Fastai V1, in which images are composed of 4 channels (R,G,B & Y). These images come from Kaggle’s Protein Atlas challenge. In the data directory there are 4 PNG images, one for each channel. Given the name of the image, I want to load each of these and form a single 4-channel image.

I’m just getting started with V2 and am struggling to get a Dataset working for this. What I have so far is shown below. Here ‘open_4_channel’ takes a data record and uses its first item as the image name; it then forms the paths for each of the 4 channel images and loads them, before finally returning a TensorImage of shape [4,512,512].

‘protein_labels’ takes the second item of the data record, which contains a string of space-separated numbers representing the multi-label categories.

def open_4_channel(x):                
    fname = data_path/'train'/f'{x[0]}'
    fname = str(fname)
    colors = ['red','green','blue','yellow']
    flags = cv2.IMREAD_GRAYSCALE          
    img = [cv2.imread(fname+'_'+color+'.png', flags).astype(np.float32)/255 for color in colors]    
    x = np.stack(img, axis=-1)           
    return TensorImage(pil2tensor(x, np.float32).float())

def protein_labels(x):
    y = x[1].split(' ')

I then use these to form the transforms and create a data set from these, supplying the DataFrame ‘train_df’:

tfms = [[open_4_channel],[protein_labels]]
dsets = Datasets(train_df, tfms)
show_at(dsets.train, 0)

When I call ‘show_at’, as shown above, everything works fine and the first image from the data set is displayed. However, if I then try and create a data loader from this I get an error:

dls = dsets.dataloaders(bs=4)

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'NoneType'>

I presume I’m doing something basic wrong (for example, is it OK to just use plain functions like this in the transforms list?), but I haven’t found a way to load these 4-channel images, either with Datasets or with DataBlocks. If anyone could point me in the right direction it would be much appreciated.
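One thing that would be consistent with the NoneType message: as posted, protein_labels builds y but has no return statement, so every target would come out as None. Assuming the snippet was not simply cut off when pasted, the difference looks like this in plain Python (the two function names and the sample record are hypothetical):

```python
def labels_missing_return(row):
    y = row[1].split(' ')      # computed, but never returned

def labels_with_return(row):
    # Parse the space-separated ids into a list of ints and hand it back.
    return [int(t) for t in row[1].split(' ')]

row = ("some-image-id", "16 0 25")   # hypothetical record: (image name, label string)

broken = labels_missing_return(row)   # None
fixed = labels_with_return(row)       # a list of int label ids
```

Displaying a single item can still work with a None target, whereas collating a batch cannot, which may explain why show_at succeeds and dataloaders fails.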

I’m confused about applying a Transform to tuples (which I use in after_batch for dataloaders).

Sometimes I just add as_item=False, sometimes I use TupleTransform and sometimes I need both.

Here is a confusing example with IntToFloatTensor:

x = (TensorImage(1),TensorImage(2))

with call

Just use as_item=False

>> (TensorImage(1), TensorImage(2))

>> (TensorImage(0.0039), TensorImage(0.0078))

with encodes

Use TupleTransform(IntToFloatTensor(as_item=False))

>> (TensorImage(1), TensorImage(2))

>> (TensorImage(1), TensorImage(2))

>> (TensorImage(1), TensorImage(2))

>>  (TensorImage(0.0039), TensorImage(0.0078))

Note that encodes is not supposed to be called by the user, so the inconsistent behavior there is not something we will fix. You’re supposed to call __call__ or encode.
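For intuition, the _call frame in the traceback earlier in this thread shows the branching: with use_as_item set (or a non-list input) the function sees the whole object, otherwise it is mapped over the elements of the tuple. Below is a toy re-creation in plain Python, not fastcore's implementation; toy_call and int_to_float are hypothetical names, and plain numbers stand in for TensorImage.

```python
def toy_call(f, x, as_item):
    # Same branching idea as the `_call` frame in the traceback earlier in the thread:
    # treat x as one item, or map f over the elements of a tuple/list.
    if as_item or not isinstance(x, (tuple, list)):
        return f(x)
    return tuple(f(x_) for x_ in x)

def int_to_float(o):
    # Stand-in for IntToFloatTensor on plain numbers: divide by 255, leave anything else alone.
    return o / 255 if isinstance(o, (int, float)) else o

x = (1, 2)
whole = toy_call(int_to_float, x, as_item=True)    # f sees the tuple itself and leaves it alone
each = toy_call(int_to_float, x, as_item=False)    # f sees 1, then 2
```

Rounded to four decimals, the per-element result matches the 0.0039 and 0.0078 values shown above, while the as-item call returns the tuple unchanged.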

I don’t actually call these methods directly. I was only using them to debug my dataloader, which works only when I pass after_batch=[TupleTransform(IntToFloatTensor(as_item=False))]

@sgugger I created a minimal example to explain better my difficulty.

# 2 items with 2 tensors each
items = (TensorImage(1),TensorImage(2)), (TensorImage(3),TensorImage(4))

# create a dataset
dsrc = Datasets(items, tfms=[[None], [lambda x:TensorCategory(0)]])

# create a dataloader
dls = dsrc.dataloaders(bs=1)

The first issue is that this returns plain tensors instead of TensorImage, because internally retain_type is used only to preserve the tuple type (not the type of the tuple’s contents).

This is solved with the following “hack”:

class myTuple(Tuple):
    def __new__(cls, x=None, *rest):
        x = TensorImage(x[0]), TensorImage(x[1])
        return super().__new__(cls, x)
class keepType(Transform):
    def encodes(self, x): return myTuple(x)

Then I can use my Transform to preserve the correct types.

# use keepType to retain tuple content type
dsrc = Datasets(items, tfms=[[keepType], [lambda x:TensorCategory(0)]])

# create a dataloader
dls = dsrc.dataloaders(bs=1, after_batch=[TupleTransform(IntToFloatTensor(as_item=False))])

My main confusion is on the second issue and the fact that I have to do TupleTransform(IntToFloatTensor(as_item=False)) to make the transform work.

Here is an alternative method. I can add a method for myTuple

def encodes(self, o:myTuple):
    return [self.encodes(t) for t in o]

Both methods work, but both look very “hacky” to me, so I’m concerned they would become unsupported. The first one looks cleaner, but I don’t understand why I have to use both TupleTransform and as_item=False.

Yes, the latter is way too much magic: if you want your transform to work at the tuple level, it will preserve the type at the tuple level, not inside the tuple.

I’ll look at why you need both when I have some time. You should not (note that as_item is probably superseded by the Pipeline setup methods, so you might need force_as_item).


If we already have the training/validate data split into separate DataFrames … how do we load it using the Datasets/DataBlock API?

For example:

train_df = pd.read_csv(LM_PATH/'train.csv', low_memory=False)
valid_df = pd.read_csv(LM_PATH/'test.csv', low_memory=False)

tfms = [attrgetter(*corpus_cols), Tokenizer.from_df(corpus_cols), Numericalize()]

# this does not work as expected
dsets = Datasets([train_df, valid_df], [tfms], splits=None, dl_type=LMDataLoader)

I’m sure there is an easy way to do it … just can’t find it :slight_smile:

What I did for images was merge the two together, so perhaps concatenate your dataframes into a single one :slight_smile: (I did this for KFold validation)

Yah that was my plan if this isn’t possible … merge them, add a column for determining which dataset they should go in, go from there.
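A rough sketch of that merge-then-split plan, in plain Python so it runs without pandas or fastai. The lists of dicts stand in for the two DataFrames and is_valid is a hypothetical column name. The resulting pair of index lists is the kind of thing one would then hand to the splits argument seen in the Datasets call above, assuming splits accepts a pair of index lists.

```python
train_rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]   # stand-in for train_df
valid_rows = [{"text": "d"}, {"text": "e"}]                  # stand-in for valid_df

# Tag each row with its origin, then merge into one collection.
all_rows = ([dict(r, is_valid=False) for r in train_rows] +
            [dict(r, is_valid=True) for r in valid_rows])

# Two index lists: positions of the training rows, positions of the validation rows.
train_idxs = [i for i, r in enumerate(all_rows) if not r["is_valid"]]
valid_idxs = [i for i, r in enumerate(all_rows) if r["is_valid"]]
splits = (train_idxs, valid_idxs)
```

With real DataFrames the same idea would be a concatenation with a reset index, followed by reading the positions off the marker column.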

Btw, do you ever sleep and actually make your classes :slight_smile: You’re everywhere. I literally saw the notification pop up and before I even looked I knew who it would be.

Balancing sleep, school, research, gym, and the class itself is a challenge but we’re doing it somehow :slight_smile:


Any ideas why Tokenizer.from_df doesn’t like me passing in rules?

tfms = [
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules), 

lm_dsets = Datasets(items=df,

len(lm_dsets.train), len(lm_dsets.valid)

throws an exception (it doesn’t like that I’m passing rules in at all … even if I set it to None):

TypeError                                 Traceback (most recent call last)
<ipython-input-66-6ac119f523c0> in <module>
      1 tfms = [
      2     attrgetter('text'),
----> 3     Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules),
      4     Numericalize()
      5 ]

~/development/_training/ml/nlp-playground/fastai2/fastai2/text/ in from_df(cls, text_cols, tok_func, **kwargs)
    256     @delegates(tokenize_df, keep=True)
    257     def from_df(cls, text_cols, tok_func=SpacyTokenizer, **kwargs):
--> 258         res = cls(get_tokenizer(tok_func, **kwargs), mode='df')
    259         res.text_cols,res.kwargs,res.train_setup = text_cols,merge({'tok_func': tok_func}, kwargs),False
    260         return res

~/development/_training/ml/nlp-playground/fastai2/fastai2/text/ in get_tokenizer(tok_func, **kwargs)
    243     sign = inspect.signature(tok_func)
    244     for k in kwargs.keys():
--> 245         if k not in sign: kwargs.pop(k)
    246     return tok_func(**kwargs)

TypeError: argument of type 'Signature' is not iterable

Changing rules is not yet supported except through the init. I will fix that tomorrow if I have time.
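For anyone hitting the same traceback in the meantime: the failing line tests k not in sign, but an inspect.Signature object does not support in; its parameters mapping does. The same loop also pops from kwargs while iterating over its keys, which would raise on its own once a key is actually removed. Here is a minimal sketch of filtering keyword arguments by signature. It is not the library's eventual fix, and filter_kwargs and toy_tokenizer are hypothetical names.

```python
import inspect

def filter_kwargs(func, kwargs):
    "Keep only the keyword arguments that `func` actually accepts."
    params = inspect.signature(func).parameters   # a mapping, so `in` works on it
    # Build a new dict instead of popping while iterating over kwargs.keys().
    return {k: v for k, v in kwargs.items() if k in params}

def toy_tokenizer(lang="en", special_toks=None):   # hypothetical stand-in for tok_func
    return lang, special_toks

kept = filter_kwargs(toy_tokenizer, {"lang": "fr", "rules": []})
result = toy_tokenizer(**kept)
```

Note that a filter like this silently drops rules, so it avoids the exception without making custom rules take effect.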