Fastai_v1, adding features

Thank you for the wonderful courses, Jeremy, Rachel, and Sylvain!

I am very interested in applying what I have learned from the fast.ai courses to medical imaging. In such applications, I often face inputs with an arbitrary number of channels: for example, 1-channel grayscale images, N-channel 3D data sets, M-channel multi-parametric maps, K-channel multi-contrast images, or L-channel complex and real parts of data acquired from multiple receivers, etc. (where N, M, K, L, … can be any integer, up to 128 or even 256).

I was wondering if the fast.ai library could support input data with an arbitrary number of channels?

Thanks a lot!


Sure. As you try it out, let us know if you find places where it’s making assumptions about # channels. (Other than model definitions, of course).
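To make the model-definition caveat concrete, here is a minimal plain-PyTorch sketch (not fastai API; the function name and channel count are just illustrative) of swapping the stock 3-channel stem of a resnet for an N-channel one:

import torch.nn as nn
from torchvision import models

def resnet_for_n_channels(n_channels):
    model = models.resnet34(pretrained=False)
    # The stock stem expects 3 channels; replace it with one matching our input.
    model.conv1 = nn.Conv2d(n_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return model

model = resnet_for_n_channels(16)  # e.g. 16-channel multi-contrast input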


Yes, I will, Jeremy.

What were your initial thoughts on how to implement tta for segmentation?

To implement TTA for segmentation, you need to keep the transformation params and a way of applying the inverse of the transformation. There may be transformations that are not invertible.

So that would entail adding a separate set of only invertible transforms for the test dl instead of using the validation ones, modifying the test dl to capture the transform parameters, and then having the tta function apply the inversion to the prediction?

Here was another thread on the topic:

And also, not all of them would need to be inverted, only those that change pixel locations. Crops probably wouldn’t work.
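To make the idea concrete, here is a minimal sketch assuming a segmentation model that returns per-pixel logits; a horizontal flip is its own inverse, so the flipped prediction can be flipped back and averaged with the plain one (names here are illustrative, not a fastai API):

import torch

def tta_segmentation(model, x):
    model.eval()
    with torch.no_grad():
        pred = model(x)                               # prediction on the original batch
        pred_flip = model(torch.flip(x, dims=[-1]))   # prediction on the flipped batch
        pred_flip = torch.flip(pred_flip, dims=[-1])  # invert the transform on the prediction
    return (pred + pred_flip) / 2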

Maybe there is an easier way, but that was what I thought was needed when I had this problem. If you want, we could work together on trying to solve this.

Here is another discussion on the subject: https://github.com/fastai/fastai/issues/646

I’d be up for that. Shoot me an email with a good time to connect live. cakeb@calebeverett.io.

This is all sounding about right to me. @sgugger is likely to be working on some related stuff next week BTW so he might have some useful updates. The basic pieces are already started here: https://github.com/fastai/fastai_docs/blob/master/dev_nb/x_006a_coords.ipynb

Add automatic validation set creation. I am currently working with a dataset where I have a train set and a test set, so I will have to split out a validation set on my own. It would be cool if you could just load a training set and say what percentage you wanted to be validation.

Edit: Looking into the source, it looks like some of this functionality is built in, or at least being built. The ImageClassificationDataset.from_folder method has a valid_pct argument that it uses to split the data. It isn’t extended to the image_data_from_folder function yet, but at least the structure is there for it!
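For what I need right now, a manual split along these lines should work (a rough sketch of the kind of thing I mean; names are illustrative, not fastai API):

import random

def split_by_pct(items, valid_pct=0.2, seed=42):
    # Shuffle the items, then hold out the first valid_pct of them for validation.
    random.seed(seed)
    items = list(items)
    random.shuffle(items)
    cut = int(len(items) * valid_pct)
    return items[cut:], items[:cut]  # train, valid

train_items, valid_items = split_by_pct([f'img_{i}.png' for i in range(100)], valid_pct=0.2)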

I may be wrong, but I think I saw exactly that in the docs.

You can just call that yourself. See the image_data_from_folder source for an example. It’ll be something like:

train_ds,valid_ds = ImageClassificationDataset.from_folder(path, valid_pct=0.2)
data = DataBunch(train_ds,valid_ds)

https://github.com/fastai/fastai_docs/blob/master/dev_nb/x_006a_coords.ipynb got renamed to https://github.com/fastai/fastai_docs/blob/master/dev_nb/102_tfm_coords.ipynb


Yup, that’s the rework.
You’ll find pieces to do the inverse of an affine transform or a coord transform there, and most of the transforms implemented at an affine or coord level, so all in all, everything needed to do TTA for segmentation.
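(For intuition only, and not the notebook’s actual code: an affine transform acts on coordinates as a matrix, so undoing it just means applying the inverse matrix.)

import math
import torch

theta = math.pi / 6  # a 30-degree rotation as a 3x3 affine matrix on homogeneous coords
rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.],
                    [math.sin(theta),  math.cos(theta), 0.],
                    [0.,               0.,              1.]])
rot_inv = torch.inverse(rot)  # applying this undoes the rotation on the coordinates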


Requested feature: A progress indicator when tokenizing a TextDataset.

Why: Tokenizing can take a long time when you have a large corpus, so it would be nice to have some kind of indication of how many texts have been processed.

Background: I’m currently building a “base” Dutch language model from a Dutch Wikipedia dump, but after ~10 hours I have no idea if it’s still happily tokenizing or if something is wrong.

I tried to find out if I could add this feature myself, but I’m still getting familiar with the details of the v1 library.


It’s already implemented. If you are creating your TextDataset from a csv, it automatically does it in chunks of 10,000, and you get a progress bar.
If you create it from a dataframe, be sure to load it with chunksize=something to see the same progress bar.
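(The pattern itself is just chunked reading; here is a minimal illustration of the idea, not the fastai internals, using a hypothetical texts.csv:)

import pandas as pd

reader = pd.read_csv('texts.csv', header=None, chunksize=10000)
for i, chunk in enumerate(reader):
    # Stand-in for the real tokenization of the first column of each chunk.
    tokens = [str(t).split() for t in chunk[0]]
    print(f'processed chunk {i}: {len(tokens)} texts')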

I thought chunksize was removed from the from_df method as indicated here.

Thank you for your reply. I was using text_data_from_folder, because the corpus consists of many separate text files. But I’ll make a csv of the wiki dump instead.

That is weird. Your method creates a csv file that is then automatically opened with a chunksize of 10,000 by default (and you can specify the chunksize you want in the kwargs).
Do you have a csv file in the tmp directory where you were working? Maybe it got stuck while creating it and not during the tokenization.

It was removed from the arguments of the from_df method, but you can pass a dataframe that has been loaded with a given chunksize, like:

import pandas as pd

df = pd.read_csv('my_file.csv', header=None, chunksize=10000)

Oh ok. What I’ve been doing is just pickling my train/test/val dataframes and loading them back in. That’s why there is really no chunksize there. Thanks for clearing that up for me.