DataLoaders in fastai2, Tutorial and Discussion

muellerzr · May 23, 2020, 9:36pm

Hi Everyone! Today I made a tutorial walkthrough on how to make a simple NumPy DataLoader (based on the same one from the article I wrote a month ago). In it, I show how to preprocess with TabularPandas, then build a custom dataset and use the fastai DataLoader to speed up and simplify grabbing the data!

Video and Notebook:

Video
Notebook

Now your first question may be if this worked, why not a PR? This is more advanced and presumes you know what you’re expecting to do with your data, we take the training wheels off!

Along with this, I decided to make this a general discussion thread on preparing data at the DataLoader level specifically, to give us all a place to discuss ideas!

Finally, if you all enjoyed this and found it useful, I can look into possibly doing something similar for a TfmdDL for Vision or Text, let me know!

Richard-Wang · May 24, 2020, 5:33am

Hi @muellerzr,
I had a quick look at your video and notebook. Thanks for the great finding and the detailed explanation, it’s exciting!

At the same time, I have some questions about this.

The source of the speedup
Does it come from NumPy, from loading a whole batch from the dataset, or partly from each? (If they are independent, what is the speedup from each?)
Why drop_last=shuffle

super().__init__(dataset, bs=bs, num_workers=num_workers, shuffle=shuffle, 
                         device=device, drop_last=shuffle, **kwargs)

I think we should be able to shuffle without dropping the last batch for the train dl.
And FilteredBase.dataloaders will take care of not shuffling and not dropping the last batch on the valid dataloader for us.

dls = [dl] + [dl.new(self.subset(i), bs=(bs if val_bs is None else val_bs), shuffle=False, drop_last=False,
                             n=None, **dl_kwargs[i]) for i in range(1, self.n_subsets)]

show_batch?
Can we use show_batch? It would be perfect if we could still call show_batch to verify our own data loading process.
In what case can’t we use this?

Now your first question may be if this worked, why not a PR? This is more advanced and presumes you know what you’re expecting to do with your data, we take the training wheels off!

Richard-Wang · May 24, 2020, 5:56am

Yes, yes, I am very interested in whether the same thing can be applied to text, please do it.

I have no idea whether it helps, but if you want to take a look, I also created a TextDataloader which is as fast as or faster than SortedDL/LMDataLoader, with more features.

muellerzr · May 24, 2020, 12:34pm

Hi Richard!

Internally fastai2 has some overhead in how it gets the data: it uses pandas to grab it, and it grabs each item individually. We index into the NumPy array and grab the entire batch at once. Both contribute to the speedup, along with changing the shuffle function to shuffle the internal dataset instead.
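A minimal sketch of that idea in plain NumPy, not the tutorial's actual code: the class name BatchIndexedDataset and its methods are hypothetical, and the arrays stand in for data already pre-processed elsewhere. Indexing returns a whole batch with one slice per array, and shuffling permutes the stored arrays themselves.

```python
import numpy as np

class BatchIndexedDataset:
    """Hypothetical stand-in for a pre-processed tabular dataset held in NumPy arrays."""
    def __init__(self, cats, conts, ys, bs):
        self.cats, self.conts, self.ys, self.bs = cats, conts, ys, bs

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return -(-len(self.ys) // self.bs)

    def __getitem__(self, i):
        # One slice per array returns the whole batch at once,
        # instead of fetching bs rows one by one and collating them.
        sl = slice(i * self.bs, (i + 1) * self.bs)
        return self.cats[sl], self.conts[sl], self.ys[sl]

    def shuffle(self, rng):
        # Apply one shared permutation so rows stay aligned across arrays.
        perm = rng.permutation(len(self.ys))
        self.cats, self.conts, self.ys = self.cats[perm], self.conts[perm], self.ys[perm]

rng = np.random.default_rng(0)
n = 10
cats = np.arange(n).reshape(n, 1)                           # toy categorical codes
conts = np.arange(n, dtype=np.float32).reshape(n, 1) * 0.5  # toy continuous column
ys = np.arange(n)                                           # toy targets

ds = BatchIndexedDataset(cats, conts, ys, bs=4)
ds.shuffle(rng)
batches = [ds[i] for i in range(len(ds))]
```

This only works when the whole pre-processed dataset fits in memory as arrays, which is the situation described for tabular data here.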

Richard-Wang:

Why drop_last=shuffle
super().__init__(dataset, bs=bs, num_workers=num_workers, shuffle=shuffle, 
                         device=device, drop_last=shuffle, **kwargs)
I think we should be able to shuffle but not drop last for train dl.

You do both 99% of the time (in fastai2 this is what we do: we drop the last partial batch), so I kept them the same, just as a quick example.
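To make the effect of drop_last concrete, here is a small sketch of the batch sizes it produces. The helper batch_sizes is hypothetical and written only for illustration; it is not part of fastai.

```python
def batch_sizes(n_items, bs, drop_last):
    """Sizes of the batches produced from n_items samples."""
    full, rest = divmod(n_items, bs)
    sizes = [bs] * full
    if rest and not drop_last:
        sizes.append(rest)  # keep the final partial batch
    return sizes

train_sizes = batch_sizes(10, 4, drop_last=True)
valid_sizes = batch_sizes(10, 4, drop_last=False)
print(train_sizes)  # [4, 4]
print(valid_sizes)  # [4, 4, 2]
```

With drop_last tied to shuffle, a shuffled training loader skips the two leftover samples each epoch, while an unshuffled validation loader still sees every sample.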

Nope! It’s a quick and dirty tutorial and show_batch needs to be ironed out; all you will see here is our encoded values, not the decoded ones.

See my comment above. We got rid of the decode ability with this change, and we’d need a lot of modification to bring it back. The case where you can’t use this is if you still want some of the fastai decode/encode functionality. However, if you’re fine doing that yourself in post-processing, you’re good to go.

So in general, if you’re fine with raw tensor values and looking at those (and know what they are), use them.

muellerzr · May 24, 2020, 3:08pm

Also, one more comment on this. A case where we can’t exactly use this dataloader is if we can’t pre-process everything. In such a case we’d need to worry about how the transforms are being applied, etc. We don’t have this issue with tabular because we can do all the pre-processing on the dataset separately, and no special transforms are needed. This wouldn’t be the case for text because we need to pad the sequences, adjust them, etc.
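A rough illustration of why text does not reduce to one fixed array up front: the padded width depends on which sequences land in the same batch. The helper pad_batch, the pad id 0, and right-padding are all assumptions made for this sketch, not fastai's implementation.

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Right-pad token-id lists to the longest sequence in this batch."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for row, s in enumerate(seqs):
        out[row, :len(s)] = s
    return out

seqs = [[5, 6, 7], [8], [9, 10], [11, 12, 13, 14, 15]]
batch_a = pad_batch(seqs[:2])  # longest sequence here has 3 tokens
batch_b = pad_batch(seqs[2:])  # longest sequence here has 5 tokens
```

The two batches come out with different widths, so the padding step has to run after the batch is chosen, which is exactly the kind of per-batch transform the NumPy approach skips.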

Just like how we can’t do this on images either, because it’s already efficient to do all of our transforms as item and batch transforms. We’d want to explore the TfmdDL instead to look at those.

Richard-Wang · May 25, 2020, 12:33am

Wow, this explanation makes it clear to me.

But what if we do all transforms and cache the created batches first?

We can’t change bs
We shuffle batches instead of samples

Do you think it is worth sacrificing those to load a batch at a time?
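The two trade-offs listed above can be seen in a toy sketch using only the standard library. The names cached and epoch are hypothetical and the integers stand in for fully transformed samples.

```python
import random

samples = list(range(8))
bs = 2

# Build the batches once; from here on bs is fixed.
cached = [tuple(samples[i:i + bs]) for i in range(0, len(samples), bs)]

def epoch(cached_batches, seed):
    """Return the cached batches in a shuffled order for one epoch."""
    order = list(cached_batches)
    random.Random(seed).shuffle(order)
    return order

epoch_1 = epoch(cached, seed=1)
epoch_2 = epoch(cached, seed=2)
```

Only the order of the batches changes between epochs. The samples that share a batch stay together for the whole run, which is the loss of randomness being weighed against the faster loading.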

I’m looking forward to seeing how you hack TfmdDL.

muellerzr · May 25, 2020, 12:59am

This is where running from an internal Dataset that you shuffle, similar to my approach, may be better.

Is it worth it? If you can do it, sure, but depending on the problem this could add up very quickly

muellerzr · May 25, 2020, 1:02am

Also, I don’t think we actually can hack the TfmdDL, as it’s all (at least for vision, as far as I can tell) built very efficiently. I think the key reason this speedup can take place is that preprocessing is quick on something where zero augmentation or batch-optimization is necessary, such as our tabular data, in which all we do is simply one-hot and make into numbers/normalize.

Richard-Wang · May 25, 2020, 1:19am

Ok, so my understanding of what you said is that your method is suitable for data that needs no preprocessing (other than numericalizing), such as tabular data, and if we want to apply it to text / image, it needs some extra work on the internal dataset.

And since you said we can’t hack TfmdDL, what is your plan to apply your method to Text/Vision ?

It occurs to me that maybe DataLoader/TfmdDL could have an option for the user to load all batches and cache them in memory or on disk, during the first epoch or before training, and then reuse them for the rest of the epochs or for other training runs.
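A minimal sketch of what such an option could look like, in memory only. Both classes are hypothetical and written for illustration; this is not an existing fastai feature. CachedBatches wraps any iterable loader, records its batches during the first full pass, and replays them afterwards.

```python
class CachedBatches:
    """Hypothetical wrapper: runs the wrapped loader once, then replays its batches."""
    def __init__(self, loader):
        self.loader = loader
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            cache = []
            for batch in self.loader:
                cache.append(batch)
                yield batch
            self._cache = cache  # only stored after a complete first pass
        else:
            yield from self._cache

class CountingLoader:
    """Toy loader that counts how many times it is iterated."""
    def __init__(self):
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        for i in range(3):
            yield [i, i * 10]

loader = CountingLoader()
dl = CachedBatches(loader)
first = list(dl)   # runs the underlying loader
second = list(dl)  # served from the cache
```

One consequence of this design is that any random augmentation applied inside the wrapped loader is frozen at whatever it produced during the first pass.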

muellerzr · May 25, 2020, 1:34am

I don’t have one. The fastai pipeline is very good IMO for what they’re attempting to do, and the only speed hacks you can get away with are to speed up the libraries it builds on, or to adapt the transforms themselves to find more efficient methods.

One such example is a Pillow fork called Pillow-SIMD, which speeds up the Pillow library. It’s a simple drop-in, but it can speed up anything Pillow does by 4-6x. This would include any and all item transforms on images. We even have documentation from fastai v1 discussing this: https://docs.fast.ai/performance.html#pillow-simd

But Pillow-SIMD is only supported on x86, which is why we don’t use it by default.
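Because Pillow-SIMD installs under the same PIL import name, it is not obvious which one is active. A rough heuristic, assuming the convention that Pillow-SIMD releases carry a ".postN" suffix in their version string; the helper name looks_like_pillow_simd is hypothetical.

```python
def looks_like_pillow_simd(version: str) -> bool:
    """Heuristic only: assumes Pillow-SIMD versions end in a '.postN' suffix."""
    return ".post" in version

print(looks_like_pillow_simd("7.0.0.post3"))  # True
print(looks_like_pillow_simd("7.1.2"))        # False
```

In a real environment the string to pass in would be PIL.__version__.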

Another option for vision is to use DALI, but I haven’t looked into it yet

bwarner · May 25, 2020, 4:24pm

Using your NumPy DataLoader increased training speed by at least a third with the dataset I’m currently working on. It’s such a relatively easy win if a dataset can safely fit into memory.

I’m thinking a NumPy TabularPandas, or a more generic version of TabularPandas that can use pandas, NumPy, Dask, cuDF, CuPy, or Dask-cuDF as the backend, should be on the fastai2 development wish list.

Which means I should get around to watching the fastai2 code walkthroughs

muellerzr · May 25, 2020, 4:29pm

I’d certainly be happy to try to get started on this with some help from people more familiar with cuDF. The main reason they don’t want to support cuDF ATM is that they were having issues getting it to install properly; Conda would sit there for hours on end.

There need to be many improvements. In the end we’d want it to be as functional as native TP, but if everything functions the same way overall, all we’d need to do is figure out how to make a generic Dataset to make it work, I think?

bwarner · May 25, 2020, 4:32pm

Yeah. My first thought when I saw the NumPy speed improvement was: can I do the same thing with cuDF and make it even faster? But the Rapids.ai cuDF Colab example notebook wouldn’t properly install cuDF on Colab, so I decided to stick with NumPy for now.

muellerzr · May 25, 2020, 4:34pm

I’ll see if I can manage something, and I’ll make a repo for those interested to centralize work in and post it here in a moment. In the meantime, over the next few days I’ll put together an example showing a full integration of the NumPy approach, with decodes etc., so we can properly pull back and add data such as a test_dl.

However, I think those interested should either fork the repo or build their own, and we eventually merge our ideas and approaches; the audio sub-library saw success with this approach.

muellerzr · May 25, 2020, 7:22pm

For those interested, I’ve started the GitHub here: https://github.com/muellerzr/fastai2_tabular_hybrid

Along with an open issue for projects people can take on. (cc @bwarner, as I’m sure you’d be interested)