I have tabular data that is far too large to fit in memory (240+ million rows). The data is in numerous parquet files of ~400k rows each. Is there a way to set up a data loader to, say, train on the data from one parquet file in each step?
I’m coming from PyTorch, so I’ll describe what I would do there: I would give a DataLoader a list of parquet files, and the collate function would read one file with pandas, convert the values and target to tensors, and return them. Is there analogous functionality in FastAI? All I could find in the docs indicates that my data would have to be either a single dataframe in memory or a single CSV file.
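For concreteness, here is a minimal sketch of the PyTorch version I have in mind (the `target` column name and the file list are placeholders for my actual data):

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

class ParquetFileDataset(Dataset):
    """Each item is the path to one parquet file (~400k rows)."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.paths[idx]

def collate_parquet(paths):
    # batch_size=1, so `paths` contains a single file path
    df = pd.read_parquet(paths[0])
    y = torch.tensor(df["target"].values, dtype=torch.float32)  # "target" is a placeholder column name
    x = torch.tensor(df.drop(columns="target").values, dtype=torch.float32)
    return x, y

files = [...]  # list of parquet file paths
dl = DataLoader(ParquetFileDataset(files), batch_size=1, shuffle=True,
                collate_fn=collate_parquet, num_workers=4)
```

So each "batch" is one parquet file read from disk on the fly, and the full 240M rows never have to be in memory at once.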