Hi Everyone! Today I made a tutorial walkthrough on how to build a simple NumPy DataLoader (based on the one from the article I wrote a month ago). In it, I show how to preprocess with TabularPandas, build a custom dataset from the result, and use the fastai DataLoader to speed up and simplify grabbing the data!
Internally, fastai2 has some overhead in how it gets the data: it uses pandas to grab it, and it grabs each item individually. We instead index into the NumPy array, grabbing the entire batch at once. Both changes give a speed-up, along with changing the shuffle function to shuffle the internal dataset instead.
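To make the idea concrete, here's a minimal sketch (plain NumPy, not fastai's actual internals; the names `cats`, `conts`, `ys` are illustrative): export the preprocessed data to arrays once, shuffle an index array, then grab whole batches with fancy indexing instead of pulling rows one at a time.

```python
import numpy as np

# Illustrative stand-ins for what TabularPandas would produce after
# preprocessing: integer category codes, normalized continuous columns, targets.
rng = np.random.default_rng(42)
cats  = rng.integers(0, 10, size=(1000, 3))   # categorical codes
conts = rng.normal(size=(1000, 4))            # normalized continuous cols
ys    = rng.integers(0, 2, size=1000)

bs = 64
idxs = np.arange(len(ys))
rng.shuffle(idxs)                  # shuffle the index array once per epoch,
                                   # instead of shuffling sample-by-sample

batch_idx = idxs[:bs]              # indices for the first batch
xb_cat  = cats[batch_idx]          # one fancy-indexing call grabs the
xb_cont = conts[batch_idx]         # whole batch at once
yb      = ys[batch_idx]
print(xb_cat.shape, xb_cont.shape, yb.shape)   # (64, 3) (64, 4) (64,)
```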
You do both 99% of the time (this is what we do in fastai2: we drop the last partial batch), hence I kept them the same, just as a quick example.
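A quick arithmetic illustration of what dropping the last partial batch means for the number of batches per epoch:

```python
import math

n, bs = 1000, 64
n_batches_drop_last = n // bs           # drop the final partial batch of 40
n_batches_keep_last = math.ceil(n / bs) # keep it as a smaller batch
print(n_batches_drop_last, n_batches_keep_last)  # 15 16
```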
Nope! It’s a quick-and-dirty tutorial; show_batch still needs to be ironed out, so all you’ll see here are our encoded values, not the decoded ones.
See my comment above. We got rid of the decode ability with this change, and we’d need a lot of modification to bring it back. The case where you can’t use this is if you still want some of the fastai decode/encode functionality. However, if you’re fine doing that yourself as a post-processing step, you’re good to go.
So in general, if you’re fine with looking at raw tensor values (and know what they mean), use them.
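If you do want readable values back, decoding by hand post-process is pretty painless, provided you saved the vocab and normalization stats the preprocessing used. A hypothetical sketch (the vocab and stats here are made up, not fastai's API):

```python
import numpy as np

# Saved from preprocessing: the category vocab and the normalization stats.
vocab = np.array(['#na#', 'low', 'medium', 'high'])  # illustrative vocab
mean, std = 12.5, 3.2                                # illustrative stats

# Decoding categoricals: raw integer codes -> one vocab lookup.
codes = np.array([1, 3, 2, 2, 1])
decoded = vocab[codes]
print(decoded.tolist())   # ['low', 'high', 'medium', 'medium', 'low']

# Decoding continuous columns: undo the normalization.
norm_vals = np.array([0.0, 1.0, -1.0])
orig_vals = norm_vals * std + mean
```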
Also, one more comment on this. A case where we can’t use this DataLoader directly is when we can’t pre-process everything up front. In that situation we’d need to worry about how the transforms are being applied, etc. We don’t have this issue with tabular, because we can do all the pre-processing on the dataset separately, and there are no special transforms needed. This wouldn’t be the case for text, because we need to pad the sequences, adjust them, etc.
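To show why text resists this trick: sequences have different lengths, so each batch has to be padded to its own max length at collation time, which is a batch-level transform you can't precompute once up front. A minimal pad-to-batch sketch (illustrative, not fastai's padding code):

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    "Pad a batch of variable-length token-id sequences to the batch max length."
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s
    return out

batch = [[5, 8, 2], [7, 1], [3, 9, 4, 6]]
padded = pad_batch(batch)
print(padded.shape)   # (3, 4) -- depends on THIS batch's longest sequence
```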
Just like how we can’t do this on images either, because it’s more efficient to do our transforms as batch and item transforms. We’d want to explore the TfmdDL instead to look at those.
Also, I don’t think we can actually hack the TfmdDL, as it’s all built very efficiently (at least for vision, as far as I can tell). I think the key reason this speed-up can take place is that pre-processing is quick when zero augmentation or batch-level optimization is necessary, such as with our tabular data, where all we do is encode the categories as numbers and normalize the continuous columns.
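That whole preprocessing step really is just two cheap, one-time array operations. A rough sketch of the idea in plain NumPy (column names are made up; fastai's Categorify/Normalize do more bookkeeping than this):

```python
import numpy as np

# "Make into numbers": map category strings to integer codes, keeping the vocab.
colors = np.array(['red', 'blue', 'red', 'green'])
vocab, codes = np.unique(colors, return_inverse=True)

# "Normalize": standardize a continuous column with its own mean/std.
ages = np.array([22.0, 35.0, 58.0, 41.0])
ages_norm = (ages - ages.mean()) / ages.std()

print(vocab.tolist(), codes.tolist())   # ['blue', 'green', 'red'] [2, 0, 2, 1]
```

Because both steps run once over the whole dataset before training, the per-batch work at train time is nothing but the array indexing shown earlier.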
OK, so my understanding of what you said is that your method is suitable for data that needs no preprocessing (beyond numericalizing), such as tabular data, and if we want to apply it to text/images, it needs some extra work on the internal dataset.
And since you said we can’t hack TfmdDL, what is your plan for applying your method to text/vision?
It occurs to me that maybe we could give DataLoader/TfmdDL an option for the user to load all the batches and cache them in memory/on disk during the first epoch (or before training), and then reuse them for the rest of the epochs / other training runs.
I don’t have one. The fastai pipeline is very good, IMO, for what they’re attempting to do, and the only speed hacks you can get away with are either in the sub-libraries it builds on, to speed the process up further, or in adapting the transforms themselves to find more efficient methods.
One such example: there is a drop-in replacement for Pillow called Pillow-SIMD, which can speed up anything Pillow does by 4-6x. This would include any and all item transforms on images. We even have documentation from fastai v1 discussing this: https://docs.fast.ai/performance.html#pillow-simd
But Pillow-SIMD is only supported on x86, which is why we don’t use it as standard.
Another option for vision is to use DALI, but I haven’t looked into it yet
I’d certainly be happy to try to get started on this with some help from people more familiar with cuDF. The main reason they don’t want to support cuDF at the moment is that they were having issues getting it to install properly; conda would sit there for hours on end.
Many improvements are needed; in the end we’d want it to be similarly functional to native TabularPandas. But if everything functions the same way overall, all we’d need to do is figure out how to make a generic Dataset to make it work, I think?
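One possible shape for that generic Dataset (purely a sketch of the interface, under the assumption that cuDF columns can be made to support NumPy-style fancy indexing, which we'd need to verify): wrap any group of array-likes behind a single batch-indexing `__getitem__`.

```python
import numpy as np

class ArrayDataset:
    "Wrap array-likes that support `arr[idx_array]` fancy indexing into batches."
    def __init__(self, *arrays):
        assert all(len(a) == len(arrays[0]) for a in arrays)
        self.arrays = arrays
    def __len__(self):
        return len(self.arrays[0])
    def __getitem__(self, idxs):          # idxs: an array of row indices
        return tuple(a[idxs] for a in self.arrays)

# Works with NumPy today; the hope is the same interface could sit over cuDF.
xs = np.arange(10).reshape(5, 2)
ys = np.arange(5)
ds = ArrayDataset(xs, ys)
xb, yb = ds[np.array([0, 2, 4])]
print(xb.shape, yb.tolist())   # (3, 2) [0, 2, 4]
```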
Yeah. My first thought when I saw the NumPy speed improvement was: can I do the same thing with cuDF and make it even faster? But the Rapids.ai cuDF Colab example notebook wouldn’t properly install cuDF on Colab, so I decided to stick with NumPy for now.
I’ll see if I can manage something, and I’ll make a repo for those interested to centralize our work in, and post it here in a moment. In the meantime, over the next few days I’ll put together an example showing a full integration of the NumPy approach, with decodes etc., so we can properly pull data back out and add things such as a test_dl.
However, I think those interested should either fork the repo or build their own that we eventually merge our ideas and approaches into; the audio sub-library saw success with this approach.