Fastai v2 transforms / pipeline / data blocks

I don't actually call these methods directly; this is just to debug my dataloader, which only works when I pass after_batch=[TupleTransform(IntToFloatTensor(as_item=False))].

@sgugger I created a minimal example to better explain my difficulty.

# 2 items with 2 tensors each
items = (TensorImage(1),TensorImage(2)), (TensorImage(3),TensorImage(4))

# create a dataset
dsrc = Datasets(items, tfms=[[None], [lambda x:TensorCategory(0)]])

# create a dataloader
dls = dsrc.dataloaders(bs=1)

The first issue is that this returns plain tensors instead of TensorImage, since internally retain_type is only used to preserve the tuple type (not the types of the tuple's contents).
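One quick way to see this (a sketch against the dls above; the exact types printed are my assumption):

# inspect one batch: the tuple survives, but its contents come back as plain Tensor
x, y = dls.one_batch()
type(x), type(x[0])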

This is solved with the following “hack”:

class myTuple(Tuple):
    def __new__(cls, x=None, *rest):
        # re-wrap both elements so they keep their TensorImage type
        x = TensorImage(x[0]), TensorImage(x[1])
        return super().__new__(cls, x)

class keepType(Transform):
    def encodes(self, x): return myTuple(x)

Then I can use my Transform to preserve the correct types.

# use keepType to retain tuple content type
dsrc = Datasets(items, tfms=[[keepType], [lambda x:TensorCategory(0)]])

# create a dataloader
dls = dsrc.dataloaders(bs=1, after_batch=[TupleTransform(IntToFloatTensor(as_item=False))])

My main confusion is with the second issue: the fact that I have to write TupleTransform(IntToFloatTensor(as_item=False)) to make the transform work.

Here is an alternative method: I can register an encodes method on IntToFloatTensor for myTuple.

@IntToFloatTensor
def encodes(self, o:myTuple):
    return [self.encodes(t) for t in o]

Both methods work, but both look very "hacky" to me, so I'm concerned they would become unsupported. The first one looks cleaner, but I don't understand why I have to use both TupleTransform and as_item=False.

Yes, the latter is way too much magic: if you want your transform to work at the tuple level, it will preserve the type at the tuple level, not inside the tuple.

I'll look at why you need both when I have some time. You should not need to (note that as_item is probably superseded by the Pipeline setup methods, so you might need force_as_item).


If we already have the training/validation data split into separate DataFrames … how do we load it using the Datasets/DataBlock API?

For example:

train_df = pd.read_csv(LM_PATH/'train.csv', low_memory=False)
valid_df = pd.read_csv(LM_PATH/'test.csv', low_memory=False)

tfms = [attrgetter(*corpus_cols), Tokenizer.from_df(corpus_cols), Numericalize()]

# this does not work as expected
dsets = Datasets([train_df, valid_df], [tfms], splits=None, dl_type=LMDataLoader)

I’m sure there is an easy way to do it … just can’t find it :slight_smile:

What I did for images was merge the two together, so perhaps concatenate your DataFrames into one whole one :slight_smile: (I did this for k-fold validation)

Yeah, that was my plan if this isn't possible … merge them, add a column to determine which split each row should go in, and go from there.
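A minimal sketch of that plan, reusing train_df, valid_df, and tfms from above (is_valid is the column name ColSplitter expects by default):

# tag each frame with its split, then concatenate
train_df['is_valid'] = False
valid_df['is_valid'] = True
df = pd.concat([train_df, valid_df], ignore_index=True)

# ColSplitter turns the boolean column back into train/valid index splits
dsets = Datasets(df, [tfms], splits=ColSplitter(col='is_valid')(df), dl_type=LMDataLoader)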

Btw, do you ever sleep and actually make your classes :slight_smile: You’re everywhere. I literally saw the notification pop up and before I even looked I knew who it would be.

Balancing sleep, school, research, gym, and the class itself is a challenge but we’re doing it somehow :slight_smile:


Any ideas why Tokenizer.from_df doesn’t like me passing in rules?

tfms = [
    attrgetter('text'), 
    Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules), 
    Numericalize()
]

lm_dsets = Datasets(items=df,
                    tfms=[tfms], 
                    splits=ColSplitter(col='is_valid')(df), 
                    dl_type=LMDataLoader)

len(lm_dsets.train), len(lm_dsets.valid)

throws an exception (it doesn’t like that I’m passing rules in at all … even if I set it to None):

TypeError                                 Traceback (most recent call last)
<ipython-input-66-6ac119f523c0> in <module>
      1 tfms = [
      2     attrgetter('text'),
----> 3     Tokenizer.from_df(text_cols=corpus_cols, rules=custom_tok_rules),
      4     Numericalize()
      5 ]

~/development/_training/ml/nlp-playground/fastai2/fastai2/text/core.py in from_df(cls, text_cols, tok_func, **kwargs)
    256     @delegates(tokenize_df, keep=True)
    257     def from_df(cls, text_cols, tok_func=SpacyTokenizer, **kwargs):
--> 258         res = cls(get_tokenizer(tok_func, **kwargs), mode='df')
    259         res.text_cols,res.kwargs,res.train_setup = text_cols,merge({'tok_func': tok_func}, kwargs),False
    260         return res

~/development/_training/ml/nlp-playground/fastai2/fastai2/text/core.py in get_tokenizer(tok_func, **kwargs)
    243     sign = inspect.signature(tok_func)
    244     for k in kwargs.keys():
--> 245         if k not in sign: kwargs.pop(k)
    246     return tok_func(**kwargs)
    247 

TypeError: argument of type 'Signature' is not iterable
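For what it's worth, the crash comes from membership-testing the inspect.Signature object itself. A hypothetical fix (my sketch, not the actual fastai patch) would test against its .parameters instead:

import inspect

def get_tokenizer(tok_func, **kwargs):
    sign = inspect.signature(tok_func)
    # a Signature isn't iterable; its .parameters dict holds the argument names.
    # Also copy the keys so popping from kwargs while iterating is safe.
    for k in list(kwargs):
        if k not in sign.parameters: kwargs.pop(k)
    return tok_func(**kwargs)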

Changing rules isn't supported yet, other than via the init. Will fix that tomorrow if I have time.


FYI: Same goes for arguments like mark_fields, add_bos, add_eos, and chunksize … probably others.

… and btw, if you want help please let me know. Still getting my feel for v2, but I can figure things out if you're strapped for time.

Hi all,
Finally I have some time to start exploring v2.
I'm playing with the Kannada-MNIST Kaggle dataset, where all the training images come as a single CSV file (each line represents one image).
I transformed it into a tensor of shape (# images, 1 channel, 28, 28), but I'm not sure whether it's possible to create an image Dataset/DataLoader directly from this tensor, or whether I first need to create an image file for each image (which sounds wrong to me…)?

I’m sure it’s possible with the lower level API but I don’t know where to start looking…
Thanks!
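Not a definitive answer, but one low-level sketch that avoids writing image files: treat row indices as items and let the transforms index into your pre-built tensor (here x is your (N, 1, 28, 28) tensor and y a label tensor; both names and the splitter are my assumptions):

def get_x(i): return TensorImage(x[i])     # one (1, 28, 28) slice as an image
def get_y(i): return TensorCategory(y[i])  # matching label

splits = RandomSplitter()(range_of(x))
dsets = Datasets(range_of(x), tfms=[[get_x], [get_y]], splits=splits)
dls = dsets.dataloaders(bs=64, after_batch=[IntToFloatTensor])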

Is there any good guide/tutorial on how to add your own transformation into the fastai v2 transforms/pipelines? I've been trying to add a Gaussian noise data aug without much success.

Here’s what I have working so far:

class AddNoise(Transform):
    def __init__(self, mean=0., std=1., **kwargs):
        super().__init__(**kwargs)
        self.std = std
        self.mean = mean
        print("Mean/Std: {}/{}.".format(self.mean, self.std))

    def encodes(self, x:TensorImage):
        self.fudge = torch.randn(x.size()).cuda() * self.std + self.mean
        return x + self.fudge

This is my batch_tfms:

[Dihedral -- {'p': 1.0, 'size': 192, 'mode': 'bilinear', 'pad_mode': 'reflection', 'mode_mask': 'nearest', 'align_corners': True}:
 encodes: (TensorBBox,object) -> encodes
 (TensorPoint,object) -> encodes
 (TensorImage,object) -> encodes
 (TensorMask,object) -> encodes
 decodes: ,
 Brightness -- {'p': 1.0, 'max_lighting': 0.2}:
 encodes: (TensorImage,object) -> encodes
 decodes: ,
 AddNoise:
 encodes: (TensorImage,object) -> encodes
 decodes: ,
 Normalize -- {'mean': tensor([[[[0.4850]],
 
          [[0.4560]],
 
          [[0.4060]]]], device='cuda:0'), 'std': tensor([[[[0.2290]],
 
          [[0.2240]],
 
          [[0.2250]]]], device='cuda:0'), 'axes': (0, 2, 3)}:
 encodes: (TensorImage,object) -> encodes
 decodes: (TensorImage,object) -> decodes]

Questions:

  1. Do I need to setup a decode function?
  2. Does this mean that every image in the batch gets the same noise statistics? I would prefer each image to have a different noise statistic.
  3. This means that AddNoise was applied before normalization, right? So I should specify the mean and std in the original RGB (0-255) space?
  1. encodes() applies the transform and decodes() is supposed to undo it. As far as I know, decodes is only used when displaying the items / batches, so IMHO you don't have to implement a decodes (and since you're adding noise, I can't think of a way to remove it anyway).

  2. I don't think so. I checked AddNoise on the MNIST dataset (which always has 0.0 in the top-left corner) and every item has different noise added.

  3. To see the exact order of the transforms, run dblock.summary(). And you're right: currently it's executed before IntToFloatTensor (which divides by 255) and thus adds noise in the 0-255 range. You could add order = 20 to your transform to make sure it's executed after:

class AddNoise(Transform):
    order = 20 # <--- add this

    def __init__(self, mean=0., std=1., **kwargs):
        super().__init__(**kwargs)
        self.std = std
        self.mean = mean
        print("Mean/Std: {}/{}.".format(self.mean, self.std))

    def encodes(self, x:TensorImage):
        self.fudge = torch.randn(x.size()).cuda() * self.std + self.mean
        return x + self.fudge

A good way to understand transforms is to build a pipeline and play around with it:

# split_idx=0 to make sure that RandTransforms are being executed
p = Pipeline([PILImage.create,ToTensor,IntToFloatTensor,AddNoise], split_idx=0)

# get one item / image
i = get_image_files(path)[0]

# put it through the pipeline
o = p(i)

# check what happened to your item
type(o)
o.float().mean(), o.float().max(), o.float().min()
o

Florian

Thanks. Is there any way to make the fastai data pipeline compatible with standard torchvision and kornia augmentations? I'm running into issues because fastai needs this "TensorImage" object, but torchvision/kornia doesn't know what these are…

@ai_padawan in fastai, transforms are applied via type dispatch. So if I wanted to use torchvision's RandomResizedCrop I could do the following:

import torchvision.transforms as tv

class TVRRC(ItemTransform):
  def __init__(self, size=448):
    self.tfm = tv.RandomResizedCrop(size)
  def encodes(self, x:(Image.Image, TensorImage)):
    return self.tfm(x)
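Used standalone it behaves like any other transform (a sketch; path pointing at some image files is my assumption):

img = PILImage.create(get_image_files(path)[0])
cropped = TVRRC(224)(img)  # dispatches on Image.Image, returns a cropped PIL image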

However, I will add that a TensorImage is just a Tensor subclass, so I can do this and it's still valid:

tfm2 = tv.RandomResizedCrop(224)
batch = IntToFloatTensor()(TensorImage(item)).cuda()

tfm_batch = tfm2(batch)

Note also that torchvision expects tensors for most transforms (some can work with PIL images; you'd need to check the specific transform you want). Tensors only exist after the ToTensor transform, so on your item transform make sure to give it an order property larger than 5:

class TVRRC(ItemTransform):
  order = 6
  def __init__(self, size=448):

Thanks. For color jitter I have the code running with kornia (not crashing), but the image looks wrong when I display it (I only get a black image?).

class AddJitter(Transform):
    def __init__(self, color_jitter=1):
        self._op = kornia.augmentation.ColorJitter(0.8*color_jitter, 0.8*color_jitter,
                                                   0.8*color_jitter, 0.2*color_jitter)

    def encodes(self, x:TensorImage):
        print(type(x), x.shape)
        return self._op(IntToFloatTensor()(x).cuda())

item_tfms = [Resize(224, method='squish')]

batch_tfms = []
batch_tfms.append(Dihedral(p=1))
batch_tfms.append(Brightness(p=1, max_lighting=0.2))
batch_tfms.append(AddJitter(color_jitter=0.1))
batch_tfms.append(Normalize.from_stats(*imagenet_stats))

When I pass item_tfms and batch_tfms to the DataBlock API, show_batch gives me black images. When I remove the AddJitter() transform, everything appears normal?

Ok… there's something else fastai v2 is doing under the hood. Using DataBlock I pass NO item_tfms or batch_tfms, and after pulling out the TensorImage, it seems to be normalized to (0, 1)!!! Where is this coming from? It should be (0, 255).

My augs already do the normalization, so with this hidden normalization all my values get reduced to ~1e-7; no wonder I get a black image!

EDIT: Using the datablock.summary() command I was able to identify the hidden normalization. It is added automatically. Is there a way to not add this IntToFloatTensor operation? It looks like it's even hardcoded for RGB images, and it's really a bad idea to do this if you're working with non-standard images.

IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}

EDIT2: Found a solution. DataBlock adds the IntToFloatTensor operation by default as the first one, so just pop it out (it's a list, so just remove the first element). This matters if your transformations do their own normalization: if you don't remove the initial IntToFloatTensor, you'll be normalizing twice.
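A hedged sketch of that removal (after_batch is the batch-level Pipeline on each DataLoader; its .fs transform list is a fastcore implementation detail, so treat this as an assumption):

dls = dblock.dataloaders(path)

# filter IntToFloatTensor out of each batch pipeline
for dl in dls.loaders:
    dl.after_batch.fs = L(t for t in dl.after_batch.fs if not isinstance(t, IntToFloatTensor))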

Ok, there's something still messed up with the default IntToFloatTensor operation that DataBlock adds. Does anyone know what exactly is going on and how the default fastai transformations relate to it? Initially I thought it was dividing all values by 255 to get values in (0…1), but that's not the case.

I get OK results if I remove the first IntToFloatTensor operation and use my data augs, which apply an IntToFloatTensor internally. But when I add it back in, I can see that my data aug, which is called AFTER the first operation, still receives input data in (0…255). What's going on? And as I track the data further, it then gets normalized twice.

Isn't the output of the IntToFloatTensor operation supposed to be in (0…1)? Why does the downstream function still get (0…255)?

My workaround was to remove the original IntToFloatTensor operation and then move my data aug operation to the first position instead of the last.

I did some further digging, and it looks like the order of batch_tfms, however I manipulate it, is not reflected in the order in which the transformations are actually called, as evidenced by the pipeline printout!

How do you manipulate the order of transformations if you’re using the datablock API?

EDIT: Found the answer. DataBlock API and item_tfms order of execution?
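For the record, the resolution is the same order mechanism shown earlier in the thread: a Pipeline sorts its transforms by their order attribute, not by list position. A sketch of AddJitter rewritten accordingly (the value 11 is my assumption, chosen to land just after IntToFloatTensor's order of 10):

class AddJitter(Transform):
    order = 11  # run after IntToFloatTensor, so inputs already arrive in (0..1)

    def __init__(self, color_jitter=1):
        self._op = kornia.augmentation.ColorJitter(0.8*color_jitter, 0.8*color_jitter,
                                                   0.8*color_jitter, 0.2*color_jitter)

    def encodes(self, x:TensorImage):
        # no manual IntToFloatTensor here: the pipeline has already divided by 255
        return self._op(x)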