Fastai.tabular with Dask for big tabular datasets

Hello everyone. I searched the forums to see if anyone had done this yet, but didn’t find anything. I’m happy to be pointed in the right direction if I’m busy reinventing something that’s already been done.

I started using fastai relatively recently for a computer vision project and was pretty blown away by how much faster I could iterate than before (with pure PyTorch). I then decided to use fastai.tabular for a time-series model. The problem was that I was working with text files of 50 GB+. At first I just created a TorchData iterable datapipe, which worked fine with the fastai dataloaders and learner classes. However, I really missed the quick Pandas data transformations that fastai.tabular offers for testing hypotheses, etc.
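For context, the core idea of the datapipe was just to stream rows instead of loading the whole file into memory. A minimal stdlib-only sketch of that pattern (the real version used TorchData's iterable datapipes wrapped for the fastai dataloaders; `iter_rows` here is a made-up name for illustration):

```python
import csv
import io

def iter_rows(open_file):
    """Stream rows from a large delimited text file one at a time,
    without ever materialising the whole file in memory."""
    reader = csv.DictReader(open_file)
    for row in reader:
        yield row

# Simulate a big file with an in-memory buffer.
buf = io.StringIO("x,y\n1,2\n3,4\n")
rows = list(iter_rows(buf))
```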

I then began to replicate the fastai.tabular transformations for Dask dataframes as needed. It was very hacky at first but worked quite well, so I tidied the code up a bit and tried to replicate the fastai.tabular API closely, just for Dask dataframes. Some bits are still a little hacky (mostly where I didn’t quite understand how something worked in fastai), but it works for my current needs. The result is here: GitHub - stefan027/bigtabular: Extension of tabular for bigger-than-memory datasets with Dask. Docs: bigtabular - BigTabular

Some of the changes needed to make fastai.tabular work with Dask are really small and could quite easily be integrated into the fastai library. In some cases it’s really just a Dask compute() call here and there, or wrapping a function in Dask’s map_partitions(). (For example, compare FillMissing with my DaskFillMissing, or the date helper functions add_datepart and dask_add_datepart.)
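To illustrate the pattern (a simplified sketch, not the actual bigtabular code): a fill-missing transform written against a plain pandas DataFrame can be reused on a Dask dataframe, because map_partitions() hands each partition to the function as a pandas frame; the only extra step is computing the fill values eagerly up front.

```python
import pandas as pd

def fill_missing(df, fill_vals):
    """Per-partition work: takes and returns a plain pandas DataFrame,
    so with Dask it can be applied via ddf.map_partitions(fill_missing, fill_vals)."""
    return df.fillna(fill_vals)

# With pandas, the fill values come straight from the frame:
df = pd.DataFrame({"a": [1.0, None, 3.0]})
fill_vals = {"a": df["a"].median()}
out = fill_missing(df, fill_vals)

# With Dask, the aggregate is lazy, so it needs an explicit compute() first, e.g.:
#   fill_vals = {"a": ddf["a"].quantile(0.5).compute()}
#   ddf = ddf.map_partitions(fill_missing, fill_vals)
```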

I created a class called TabularDask, which is analogous to TabularPandas. The biggest difference is that it is set up as an iterable dataset. It is possible to make it work as a map-style dataset with __getitem__, but in my experience that is slow. For my application it made sense to do one expensive shuffle operation and then just iterate over the rows very quickly. (I am thinking about some sort of hybrid approach, where the dataloader iterates over the Dask dataframe’s partitions but then creates batches map-style within each partition.) I’m working on an example/tutorial with a sizeable dataset to better demonstrate the library.
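That hybrid idea could look roughly like this (a hypothetical sketch using in-memory pandas frames as stand-ins for Dask partitions; iter_batches is a made-up name, not part of bigtabular):

```python
import numpy as np
import pandas as pd

def iter_batches(partitions, batch_size, rng):
    """Hybrid loading: iterate partitions sequentially (cheap for Dask),
    but index rows map-style *within* each in-memory partition."""
    for part in partitions:  # with Dask: loop over ddf.partitions and .compute() each
        idxs = rng.permutation(len(part))  # local shuffle inside the partition
        for start in range(0, len(part), batch_size):
            yield part.iloc[idxs[start:start + batch_size]]

parts = [pd.DataFrame({"x": range(4)}), pd.DataFrame({"x": range(4, 6)})]
batches = list(iter_batches(parts, batch_size=2, rng=np.random.default_rng(0)))
```

The trade-off is that shuffling is only local to each partition, which is why a single global shuffle up front still helps.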

I’m new to contributing to open-source libraries - and new as a contributor to this forum - so I might be way off base here. But I was wondering: would there be any interest in integrating the Dask compatibility into the fastai library?