Fastai v2 transforms / pipeline / data blocks

Interesting posts:

Note: 2nd level headings are for modules, 3rd level headings are for functions/classes.

data.core

get_files

get_files returns an L (fastai's enhanced list) of all the non-hidden files in path, optionally filtered by extensions, optionally recursing into subdirectories, and restricted to an optional list of folders if given.

Example:

source = untar_data(URLs.MNIST_TINY)
all_files       = get_files(source)
train_files     = get_files(source, folders='train')
valid_img_files = get_files(source, folders='valid', extensions='.png')
labels          = get_files(source, recurse=False)

FileGetter

Creates and returns a partial of get_files that searches in the path suffix suf (i.e. path/suf) and passes along any remaining arguments.

Example:

source = untar_data(URLs.MNIST_TINY)
get_train = FileGetter(suf='train')
get_valid = FileGetter(suf='valid')

train_files = get_train(source)
valid_imgs = get_valid(source, extensions='.png')

get_image_files

Returns an L of all the image files (by extension) in path, recursing into subdirectories by default, and restricted to an optional list of folders if given.

Example:

source = untar_data(URLs.MNIST_TINY)
train_imgs = get_image_files(source, folders='train')

RandomSplitter

RandomSplitter is used to split a dataset into training and validation sets. It creates two sets of shuffled indexes, one for train and one for valid.

RandomSplitter returns a function which takes a list of objects (e.g. filenames). Say the list has 1000 items and we want 20% of it as the validation set: the returned function gives back two lists of shuffled indexes, one for train (800 indexes) and one for valid (200 indexes).

Example:

source = untar_data(URLs.PETS)/"images"
items = get_image_files(source)[:1000]
split_idx = RandomSplitter(valid_pct=0.2)(items)
len(split_idx),len(split_idx[0]),len(split_idx[1])
Output: (2, 800, 200)

Categorize

Categorize converts label strings to vocab ids and back.

Example:

tcat = Categorize(vocab=['cat','dog'])
lbl = tcat('cat'); lbl
Output : 0

#For reversing/decoding
tcat.decode(0)
Output : 'cat'

data.transform

Transform

Transform uses the metaclass _TfmMeta. A transform defines two methods, encodes and decodes. Whenever you call an instance of the class (a function-style call with (), not indexing with [ ]), encodes is run automatically via the _call() method defined in the class. decodes is not run automatically; it has to be invoked explicitly, usually via the decode() method defined in the class.
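As a minimal sketch of that behaviour (the transform below is a made-up example; the import path is the one used in current releases, where Transform lives in fastcore, and differs from the fastai_dev layout discussed in this thread):

from fastcore.transform import Transform

class IntToFloat(Transform):
    def encodes(self, o:int): return float(o)   # dispatched automatically when the instance is called
    def decodes(self, o:float): return int(o)   # only run via tfm.decode(...)

tfm = IntToFloat()
tfm(3)          # 3.0 -- encodes is called via __call__
tfm.decode(3.0) # 3   -- decode() calls decodes() internally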

In a pipeline where a list of transforms is composed

pipe = Pipeline([f2,f3,f1])

pipe.decode() calls decode() on each of the transforms in the reverse of the order in which they were applied: here f1.decode(), then f3.decode(), then f2.decode(). Each transform's decode() calls its decodes() internally.
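A small sketch of that decoding behaviour, with two made-up transforms standing in for f1/f2/f3 (again using the current fastcore import path):

from fastcore.transform import Transform, Pipeline

class AddOne(Transform):
    def encodes(self, x): return x+1
    def decodes(self, x): return x-1

class Double(Transform):
    def encodes(self, x): return x*2
    def decodes(self, x): return x/2

pipe = Pipeline([AddOne(), Double()])
y = pipe(3)     # encodes applied in order: (3+1)*2 = 8
pipe.decode(y)  # decodes applied in reverse order: 8/2 - 1 = 3.0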

TupleTransform

Per Sylvain in the GitHub issue "Tuple transform is in docs, but not in code" (fastai/fastai2 #266):
It used to exist, but it's removed now since all transforms have this behavior (applying over a tuple) unless they are ItemTransform.

TupleTransform is a subclass of Transform that sets as_item_force = False, so encodes is applied over the elements of a tuple rather than to the tuple as a whole. This lets an encodes be applied selectively, only to the items within the tuple whose type matches its criteria.
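A tiny sketch of the tuple behaviour described above, using a made-up transform with current fastcore (the int annotation means only the integer element of the tuple is transformed; the string passes through unchanged):

from fastcore.transform import Transform

class Negate(Transform):
    def encodes(self, o:int): return -o   # only dispatched on ints

tfm = Negate()
tfm((3, 'cat'))   # (-3, 'cat') -- applied element-wise over the tuple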

Thanks @pnvijay! I’ve moved your notes into this wiki thread, and replaced the copied code with links to github.

Thanks Jeremy! I did not know that we could do the links to GitHub in this way :slight_smile:

If you click on the line number in github, it gives you a page with the correct hyperlink, FYI.

@pnvijay FYI you were using “instance of” incorrectly a couple of times. An instance specifically refers to an instantiated object of a class. I’ve fixed them now.

So in get_files the include option has the sense “include only”?

Yes that’s right. That might be a better name for the param.
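For example, a rough illustration of that semantics, assuming include takes a list of folder names (as the folders param suggested below later does):

source = untar_data(URLs.MNIST_TINY)
train_only = get_files(source, include=['train'])   # "include only": files outside train/ are skipped
all_files  = get_files(source)
len(train_only) < len(all_files)                     # True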

Shall I prepare a PR changing parameter include -> incl_only?
Changes will be for the following:

  • get_files
  • FileGetter
  • get_image_files
  • ImageGetter

and corresponding tests.
I will also update the doc strings:

“Get all the files in path with optional extensions, optionally with recurse, only in incl_only, if specified.”

Perhaps a better param name would be folders.

PR: https://github.com/fastai/fastai_dev/pull/173

Hi,

Just curious. When I had to use symbolic links of folders, I noticed that both v1 and v2 don’t allow followlinks for the internal os.walk. Is it by design for a simple API and/or for speed?

I sometimes used symlinks for switching between the local storage of Google Colab and a Google Drive mounted to it. Another use was to keep several versions of different data augmentation strategies. None of these uses was absolutely necessary.

Thank you.

We could definitely add that functionality.

It was by design, since sometimes symlinks can be to places that make things take a long time and not work, but it would be a fine idea to add a followlinks param to get_files to optionally enable it.
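Roughly, the change amounts to threading a flag through to os.walk. A simplified sketch (not the actual fastai implementation, which also handles hidden files, the folders filter, and the non-recursive path):

import os
from fastcore.foundation import L

def get_files_sketch(path, extensions=None, recurse=True, followlinks=False):
    res = []
    for p,d,f in os.walk(path, followlinks=followlinks):   # pass the flag straight through
        res += [os.path.join(p,fn) for fn in f
                if extensions is None or os.path.splitext(fn)[1].lower() in extensions]
        if not recurse: break
    return L(res)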

Thank you @sgugger and @jeremy. I understand and share the reasoning behind the API.
I will see what I can do about it. :slight_smile:

I hope this is the right place. In notebook 08 (pets tutorial) I read the following in the section "Using TfmdDS":

You can additionally add some ds_tfms to be applied to the tuple created.

The doc string of TfmdDS reads

"A dataset that creates a tuple from each tfms, passed thru ds_tfms"

But I cannot see anywhere in the code where ds_tfms are supposed to be.

Is this something that was never implemented in the code, are the docs outdated, or am I missing something obvious?

Good catch, @miko. I believe that should read:
You can additionally add some tfms to be applied to the tuple created.

How would that work though? Say that I do something like

tfms = [[tfmx1, tfmx2], [tfmy1, tfmy2]]
tfm_for_the_tuple = MyAwesomeTFM()

I can create a DS with

TfmdDS(someItems, tfms=tfms)

But how can I add my tfm_for_the_tuple to be applied? I might be missing something about how the DS works, though: I have only been looking at it for a brief while.

tfms is a list of lists of transforms: the first list, [tfmx1, tfmx2] in your case, is applied to the independent variables, while the second, [tfmy1, tfmy2] in your case, is applied to the dependent variables. Thus you can add your tfm_for_the_tuple to either of the sublists.
Did I get you right?

I think I am not understanding something.

Say that I add tfm_for_the_tuple to the first sublist. Wouldn’t this transformation only be applied to the independent variables?

What I am looking for is a way of applying a transformation to the tuple resulting from both sublists having been applied, which the notebook suggests should be done via ds_tfms (or so I understand).

You can additionally add some ds_tfms to be applied to the tuple created.

In particular, I am trying to understand how to add a show method that takes the resulting tuples into account, instead of the individual pipelines.

I'll try to explain with an example: imagine I want to create a neural net that creates an image out of an audio file. I know how to create a pipeline of transforms that opens the audio files and turns them into tensors. Same thing for the images. The TfmdDS will create tuples of audio/image tensors, but how can I create a show method that takes into account the results of both pipelines (for example displaying the image and using the IPython Audio display)?
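Concretely, what I'd like such a show to end up doing is something like this hypothetical helper (the names, the sample rate, and the tensor layout are all just assumptions for illustration):

from IPython.display import Audio, display
import matplotlib.pyplot as plt

def show_audio_image(item, sample_rate=16000):
    audio, img = item                   # one (audio_tensor, image_tensor) tuple from the dataset
    plt.imshow(img.permute(1,2,0))      # assumes a CxHxW image tensor
    plt.axis('off'); plt.show()
    display(Audio(audio.numpy(), rate=sample_rate))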

To what extent is Pipeline intended to be used by standard users of the library vs internally by developers? We are working on refactoring audio to match v2, and in the original we have a Config object where the user sets preferences for things like silence removal, what type of spectrogram, and whether to append delta/accelerate, and then that is all handled automatically for the user. In my first implementation for v2 I'm generating a config and then still end up passing it to most transforms in the pipeline, which is repetitive/poor design.

What would be more compatible with your overall vision for fastai v2?

  • A config object that is passed off to a function that builds a Pipeline behind the scenes?
  • Having the user construct the pipeline themselves instead of having a config, and just passing the config params directly to the various transforms?

Thanks!
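To make the two options concrete, here is a rough sketch; every transform, parameter, and config field below is hypothetical, just to show the shape of each approach (imports use the current fastcore path):

from fastcore.transform import Transform, Pipeline

class RemoveSilence(Transform):
    def __init__(self, threshold): self.threshold = threshold
    def encodes(self, x): return x      # real silence-removal logic omitted

class Spectrogram(Transform):
    def __init__(self, n_fft): self.n_fft = n_fft
    def encodes(self, x): return x      # real spectrogram logic omitted

# Option 1: a config object handed to a builder that assembles the Pipeline behind the scenes
def audio_pipeline(cfg):
    return Pipeline([RemoveSilence(cfg['silence_threshold']), Spectrogram(cfg['n_fft'])])
pipe = audio_pipeline({'silence_threshold': 20, 'n_fft': 1024})

# Option 2: no config object; the user builds the Pipeline directly from configured transforms
pipe = Pipeline([RemoveSilence(threshold=20), Spectrogram(n_fft=1024)])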