In general it’s very similar to pandas. With Dask support, GPU memory is no longer a limitation, so you can iterate over datasets of arbitrary size, which is amazing. It works like magic.
Finally, I’ve got a team working on a project, https://github.com/NVIDIA/NVTabular, that covers dataloaders for PyTorch and TensorFlow. We’ll have a new release shortly. This sprint we’ll be working on a fastai2 integration.
I’m doing research on NLP with transformer models and fastai v2. Since training time is a key concern when the training dataset spans many GB of data, I’m very interested in using RAPIDS.ai (my GPU is an NVIDIA V100 32 GB).
We’ve made a lot of progress on the dataloader, and have a new version that’s available on our repo. Our 0.3 release will make it official but everything is working now if you pull from main.
It’s split up a bit strangely, but we were trying to show all three in the same notebook. The v2 code is much simpler to integrate now: we just instantiate our own dataloaders for the train and validation sets and then use the TabularDataLoaders wrapper to turn them into a databunch.
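A minimal sketch of the wrapping pattern described above, using stand-in classes so it runs anywhere. The real code would use NVTabular's PyTorch dataloader and fastai's `TabularDataLoaders`; the class names `BatchLoader` and `LoaderBundle` here are illustrative only, not the actual API:

```python
# Stand-in for a custom dataloader (e.g. an NVTabular PyTorch loader);
# in practice this would stream GPU-resident batches, not Python lists.
class BatchLoader:
    def __init__(self, data, batch_size=2):
        self.data, self.batch_size = data, batch_size

    def __iter__(self):
        # Yield the data in fixed-size batches.
        for i in range(0, len(self.data), self.batch_size):
            yield self.data[i:i + self.batch_size]

# Stand-in for the TabularDataLoaders wrapper: it simply bundles a
# train loader and a validation loader into one object a learner can use.
class LoaderBundle:
    def __init__(self, train, valid):
        self.train, self.valid = train, valid

# Instantiate our own dataloaders for the train and validation sets,
# then wrap them together -- analogous to TabularDataLoaders(train_dl, valid_dl).
train_dl = BatchLoader(list(range(10)))
valid_dl = BatchLoader(list(range(4)))
dls = LoaderBundle(train_dl, valid_dl)
```

The point of the pattern is that the wrapper doesn't care where the batches come from, so a GPU-native loader can be dropped in without changing the training loop.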
We’ll have a blog post on this coming out in a few weeks when we release, but let us know if you have any questions or issues.