DataLoader performance for categorical data


I was just watching the part about recreating DataLoaders from scratch, and it reminded me of an issue I ran into while working on a StarSpace model implemented in PyTorch. StarSpace is rather generic, but to illustrate the problem, think of it as mimicking word2vec and taking 5 token ids + a label id as a single training example.

Now, the problem is the overhead of loading the data through PyTorch’s DataLoader abstraction. The way DataLoader (and Dataset) are designed, you have to “touch” every training example separately (because __getitem__ takes a single index as an argument), and that ends up killing data-loading performance. In my specific scenario, I was training on around 1 billion examples and the training was taking over 5 days. GPU utilization was in the low single digits, though. When I looked more closely, I realized that training was entirely CPU-bound because of Python overhead. As in, even if you served constant data (from an array), you would still get low performance due to Python’s overhead.
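To make the per-example overhead concrete, here is a minimal sketch that loads the same synthetic tensor two ways: through a standard DataLoader (one __getitem__ call per example, then collation) and by slicing whole batches directly. The data shape (5 token ids + 1 label id), the sizes, and the function names are illustrative assumptions, not the original StarSpace code. Actual timings depend on the machine, so none are quoted here.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real data: 5 token ids + 1 label id per example.
N_EXAMPLES = 100_000
BATCH_SIZE = 1024
data = torch.randint(0, 1000, (N_EXAMPLES, 6))


def load_per_example():
    """Standard path: DataLoader calls __getitem__ once per example."""
    loader = DataLoader(TensorDataset(data), batch_size=BATCH_SIZE)
    start = time.perf_counter()
    seen = 0
    for (batch,) in loader:  # TensorDataset yields one-element batches
        seen += batch.shape[0]
    return seen, time.perf_counter() - start


def load_by_slicing():
    """Batched path: one tensor slice per batch, no per-example Python calls."""
    start = time.perf_counter()
    seen = 0
    for begin in range(0, N_EXAMPLES, BATCH_SIZE):
        batch = data[begin:begin + BATCH_SIZE]
        seen += batch.shape[0]
    return seen, time.perf_counter() - start


per_example_count, per_example_secs = load_per_example()
sliced_count, sliced_secs = load_by_slicing()
print(f"per-example DataLoader: {per_example_secs:.3f}s")
print(f"direct batch slicing:   {sliced_secs:.3f}s")
```

Both loops visit exactly the same examples; the only difference is how many Python-level calls happen per batch.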

I wonder if other people have run into an issue like this? I believe it naturally arises any time your examples are small (which is often the case for categorical data).

PS. I ended up fixing this problem and got training down to 37 hours.

What did you do to fix the issue?

I ended up smuggling a batch of indices through the DataLoader/Sampler/Dataset abstractions to achieve batch loading end-to-end. From PyTorch’s point of view, the Dataset returns a single value, but it’s really a wrapper around a batch of values that’s opaque to PyTorch. I have a custom collate function that unwraps the batch and passes it downstream to DataLoader’s logic. It’s a little bit like pulling a rabbit out of a hat, though.

It’s really hacky but was worth the speedup I got. I feel that this points at a deficiency of current PyTorch abstractions.
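For readers who want a rough picture of the idea, here is a minimal sketch of one way to smuggle batches through the standard abstractions. It is an illustration under assumptions, not the production code from this thread: the class names BatchedIndexSampler and BatchedDataset, the function unwrap_collate, and the synthetic data are all hypothetical. The sampler yields a whole tensor of indices as if it were one index, the Dataset indexes with it in a single call, and DataLoader runs with batch_size=1 so the collate function only has to unwrap a one-element list.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class BatchedIndexSampler(Sampler):
    """Yields a tensor of indices per step, posing as a single 'index'."""

    def __init__(self, n_examples, batch_size):
        self.n_examples = n_examples
        self.batch_size = batch_size

    def __iter__(self):
        perm = torch.randperm(self.n_examples)  # shuffle once per epoch
        for begin in range(0, self.n_examples, self.batch_size):
            yield perm[begin:begin + self.batch_size]

    def __len__(self):
        # Number of batches, rounding up for the final partial batch.
        return (self.n_examples + self.batch_size - 1) // self.batch_size


class BatchedDataset(Dataset):
    """__getitem__ receives a tensor of indices and returns a whole batch."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, indices):
        return self.data[indices]  # one vectorized gather per batch


def unwrap_collate(samples):
    # With batch_size=1, DataLoader passes a list holding one pre-built batch.
    (batch,) = samples
    return batch


# Synthetic data: 5 token ids + 1 label id per example (an assumption).
data = torch.randint(0, 1000, (10_000, 6))
loader = DataLoader(
    BatchedDataset(data),
    batch_size=1,  # PyTorch sees "one item" per step
    sampler=BatchedIndexSampler(data.shape[0], batch_size=1024),
    collate_fn=unwrap_collate,
)

batches = list(loader)
print(len(batches), batches[0].shape)
```

The effective batch size lives in the sampler, not in the DataLoader, which is what makes the batch opaque to PyTorch.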


Got it! Do you think you could share a code snippet? I’ve got my own tabular problem with a very large dataset and I’m hoping to speed up the ~20 min per epoch I am at currently.

It’s tangled with some work code but I’ll see what I can do.

Are you sure you have the same problem? The easy way to check is to compare CPU and GPU utilization (using htop and nvtop, for example) while training. If you see low GPU utilization (for me it was below 10%) and high CPU utilization, then you’re onto something.

How many examples do you have in your training set?


I will look into that later today and get back to you. My dataset has ~4 million data points, nowhere near your billion.

I also deal with quite massive datasets, on the order of hundreds of millions of examples, and would be interested as well in how you handle 1B+.



I’m defrosting this old thread because I finally got around to posting my code for end-to-end batch data loading via smuggling batches through the existing DataLoader/Sampler/Dataset abstractions.

See it here:

This code is used in production and helped me get the almost 4x speed improvement that I mentioned earlier.


Did you use it with fastai v1?

No, it was a custom implementation of the StarSpace model that wasn’t using fastai v1. However, the problem from this thread is universal: if you’re using standard dataloaders (fastai does) for tabular data then you’ll run into performance issues.

Can you share a link to your implementation of fastAI?

Not sure what you mean. Are you asking if I integrated the end-to-end batch data loading with fastai? If so, no, I haven’t. However, it should be easy. The code I posted in PyTorch’s issue tracker is self-contained.

I was asking about your StarSpace integration with fastai.

It was pointed out in the PyTorch issue gkk raised that with current PyTorch you can achieve similar batched loading by using batch_size=None. That’s not to detract from gkk’s work, which may well predate this handling, and you should definitely also look at gkk’s code for other tricks (or easier integration with fastai).
I think batch_size=None should be usable with a lot of the standard fastai code, but it may need some modification. The linked post suggests the standard collate function should handle this, and that looks like the fastai default. You might need to adapt the fastai TabularItemList to produce batches for this, but as ItemList.__getitem__ already supports numpy-style indexing with a list of indices, it may actually be enough to just set bs=None on your dataloader, which would be nice.
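As a rough illustration of the batch_size=None route in plain PyTorch (not fastai), here is a minimal sketch. Passing batch_size=None turns off automatic batching, and a BatchSampler given as the sampler then hands a whole list of indices to __getitem__ in one call. The ArrayDataset class and the synthetic data are assumptions made for the example; the fastai integration discussed above is not shown and is untested here.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, RandomSampler


class ArrayDataset(Dataset):
    """Hypothetical dataset whose __getitem__ accepts a list of indices."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, indices):
        # `indices` is a list of ints produced by BatchSampler.
        return self.data[indices]


# Synthetic data: 5 token ids + 1 label id per example (an assumption).
data = torch.randint(0, 1000, (10_000, 6))
dataset = ArrayDataset(data)

loader = DataLoader(
    dataset,
    batch_size=None,  # disable automatic batching and per-example collation
    sampler=BatchSampler(
        RandomSampler(dataset), batch_size=1024, drop_last=False
    ),
)

total_examples = 0
first_batch_shape = None
for batch in loader:
    if first_batch_shape is None:
        first_batch_shape = tuple(batch.shape)
    total_examples += batch.shape[0]

print(first_batch_shape, total_examples)
```

Compared with wrapping batches and unwrapping them in a custom collate function, this needs no custom sampler or collate code; the Dataset only has to support indexing with a list.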

Also, for categorical data you may want to check out fastai v2 (still in development but quickly taking shape). It’s going to support the RAPIDS cuDF GPU-accelerated dataframes, so it should be a pretty big improvement here. I haven’t looked at the tabular stuff, but apparently it’s coming along, with collaboration between the fastai devs and RAPIDS. It’s in the fastai-dev repo.

TomB, thanks for chiming in! Yes, that’s right: I worked on this issue before PyTorch added support for batch_size=None + yielding a batch of indices from a regular sampler.

I’m planning to test PyTorch’s implementation and I expect it to supersede my hacky solution.