Yes, the latter is way too much magic: if you want your transform to work at the tuple level, it will preserve the type at the tuple level, not inside the tuple.
I’ll look at why you need both when I have some time. You should not need both (note that as_item is probably superseded by the Pipeline setup methods, so you might need force_as_item).
Yah that was my plan if this isn’t possible … merge them, add a column for determining which dataset they should go in, go from there.
Btw, do you ever sleep and actually make your classes? You’re everywhere. I literally saw the notification pop up, and before I even looked I knew who it would be.
Hi all,
Finally I have some time to start exploring v2.
I’m playing with the Kannada-MNIST Kaggle dataset, where all the training images come as a single csv file (each line represents one image).
I transformed it into a tensor of shape (# images, 1 channel, 28, 28), but I’m not sure whether it’s possible to create an image Dataset/DataLoader directly from this tensor, or whether I first need to create an image file for each image (sounds wrong to me…)?
I’m sure it’s possible with the lower level API but I don’t know where to start looking…
Thanks!
Is there any good guide/tutorial on how to add your own transformation to the fastai v2 transforms/pipelines? I’ve been trying to add a Gaussian noise data aug without much success.
encodes() applies the transform and decodes() is supposed to reverse it. As far as I know, decodes is only used when displaying the items/batches, so IMHO you don’t have to implement a decodes (since you are adding noise, I can’t think of a way to remove it anyway).
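The encodes/decodes pairing can be sketched in plain Python (these are stand-in classes to illustrate the pattern, not fastai's actual Transform API): encodes applies the change when data flows forward, decodes undoes it purely so items can be displayed in their original units.

```python
# Plain-Python stand-in for the encodes/decodes pattern (not fastai's
# real Transform class): encodes applies the change, decodes reverses
# it so items can be shown in their original units.
class DivideBy255:
    def encodes(self, x):
        return x / 255.0   # forward: scale a pixel into 0-1
    def decodes(self, x):
        return x * 255.0   # reverse: only needed for display

tfm = DivideBy255()
enc = tfm.encodes(128)
print(enc)                 # ~0.502
print(tfm.decodes(enc))    # back to ~128.0
```

A noise transform has no sensible decodes because the original values are gone; a scaling transform like this one does.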
I don’t think so. I checked AddNoise on the MNIST dataset (which always has 0.0 in the top-left corner) and every item has different noise added.
To see the exact order of the transforms, run dblock.summary(). And you are right: currently it’s executed before IntToFloatTensor (which divides by 255) and thus adds noise in the 0–255 range. You could also add order = 20 to your transform to make sure it’s executed afterwards.
class AddNoise(Transform):
    order = 20  # <-- add this so it runs after IntToFloatTensor
    def __init__(self, mean=0., std=1., **kwargs):
        super().__init__(**kwargs)
        self.std = std
        self.mean = mean
        print("Mean/Std: {}/{}.".format(self.std, self.mean))
    def encodes(self, x:TensorImage):
        # draw fresh noise on the same device as the input
        noise = torch.randn(x.size(), device=x.device) * self.std + self.mean
        return x + noise
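The effect of the order attribute can be sketched with plain Python (stand-in classes, not the real fastai ones; fastai's Pipeline does this sorting internally): transforms run sorted by order, not by the position in which you listed them.

```python
# Plain-Python sketch: a pipeline sorts transforms by their `order`
# attribute, not by their position in the list (stand-in classes).
class ToTensor:         order = 5
class IntToFloatTensor: order = 10
class AddNoise:         order = 20  # bumped so it runs after the /255

listed = [AddNoise, IntToFloatTensor, ToTensor]  # declared in any order
pipeline = sorted(listed, key=lambda t: t.order)
print([t.__name__ for t in pipeline])
# ['ToTensor', 'IntToFloatTensor', 'AddNoise']
```

With order = 20, AddNoise sees input already divided by 255 and adds noise on the 0–1 scale, as intended.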
A good way to understand transforms is to build a pipeline and play around with it:
# split_idx=0 to make sure that RandTransforms are being executed
p = Pipeline([PILImage.create,ToTensor,IntToFloatTensor,AddNoise], split_idx=0)
# get one item / image
i = get_image_files(path)[0]
# put it through the pipeline
o = p(i)
# check what happened to your item
type(o)
o.float().mean(), o.float().max(), o.float().min()
o
Thanks. Is there any way to make the fastai data pipeline compatible with standard torchvision and kornia augmentations? I’m running into issues because fastai needs this “TensorImage” object, but torchvision/kornia doesn’t know what it is…
Note also that torchvision expects tensors for most transforms (some work with PIL images; check the specific transform you want). Tensors only exist after the ToTensor transform, so make sure to give your item transform an order property larger than 5:
class TVRRC(ItemTransform):
    order = 6  # after ToTensor (order 5), so the input is already a tensor
    def __init__(self, size=448):
        self.size = size
Thanks. For color jitter I have the code running with kornia (not crashing out), but the image looks wrong when I display it (I only get a black image?)
When I pass item_tfms and batch_tfms to the DataBlock API, show_batch gives me black images. When I remove the AddJitter() transform, it appears normal?
Ok… there’s something else fastai v2 is doing under the hood. Using the DataBlock I pass NO item_tfms or batch_tfms, and after pulling out the TensorImage, it turns out to be normalized to (0, 1)!!! Where is this coming from? It should be (0, 255).
My augs are already doing the normalization, and so with this hidden normalization it’s reducing all my values to 1e-7, no wonder I get a black image!
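The black image follows from plain arithmetic: if IntToFloatTensor has already divided by 255 and a custom aug divides (or normalizes) again, every pixel collapses toward zero, which renders as black.

```python
# Double normalization in plain arithmetic: dividing a uint8 pixel by
# 255 twice collapses it to nearly zero, which displays as black.
px = 200             # a bright pixel on the 0-255 scale
once = px / 255      # after IntToFloatTensor: ~0.784, displays fine
twice = once / 255   # after a second, redundant division: ~0.003
print(once, twice)
```

Normalizing with a mean/std on top of the extra division pushes values even smaller, which matches the ~1e-7 magnitudes described above.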
EDIT: Using the datablock.summary() command I was able to identify the hidden normalization. It is added automatically. Is there a way to not add this final IntToFloatTensor operation? It even looks hardcoded for RGB images, and it’s really a bad idea if you’re working with non-standard images.
IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
EDIT2: Found a solution: DataBlock adds the IntToFloatTensor operation by default as the first one, so just pop it out (it’s a list, so just remove the first element). This is important if your transformations already do some sort of normalization; if you don’t remove the initial IntToFloatTensor, you will be normalizing twice.
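The pop described above, sketched on a plain list (the names are stand-ins; the real object is the DataBlock's list-like collection of batch transforms, where pop(0) works the same way):

```python
# Stand-in sketch: removing a default first transform from a list of
# batch transforms. The real fastai collection is list-like, so the
# same pop(0) call applies.
batch_tfms = ["IntToFloatTensor", "MyNormalizingAug"]
removed = batch_tfms.pop(0)  # drop the default IntToFloatTensor
print(removed, batch_tfms)
# IntToFloatTensor ['MyNormalizingAug']
```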
Ok, there’s something still messed up with the default IntToFloatTensor operation that DataBlock adds. Does anyone know exactly what is going on and how the default fastai transformations relate to it? Initially I thought it divided all values by 255 to get them into (0…1), but that’s not what I observe.
I get ok results if I remove the first IntToFloatTensor operation and rely on my data augs, which apply an IntToFloatTensor operation internally. But when I add the default back in, the data aug that is called AFTER it still receives input in (0…255). What’s going on? Then, as I track the data further, it gets normalized twice.
Isn’t the output of the IntToFloatTensor operation supposed to be in (0…1)? Why does the downstream function still see (0…255)?
My workaround was to remove the original IntToFloatTensor operation and move my data-aug operation to be the first one instead of the last.
I did further digging, and it looks like the order of batch_tfms, no matter how I rearrange the list, is not reflected in the order in which the transformations are called, as evidenced by the pipeline printout!
How do you manipulate the order of transformations if you’re using the datablock API?