Fastai2 dataloader is slower than PyTorch dataloader

I am trying to build a top-speed repository on top of the fastai2 core.

I am testing fastai2's new dataloader mechanism and comparing it to a high-performance PyTorch dataloader (with prefetching and fast collation).

In order to compare apples to apples, I am using exactly the same CPU augmentations.

With the PyTorch dataloader, I am getting a maximal speed of ~2800 img/sec (input resolution of 128).
With the fastai2 dataloader, I am getting significantly lower speeds, ~1700 img/sec.
I have tried adding prefetching, enabling/disabling memory pinning, and changing the number of workers in the fastai2 dataloader. Still, I could not get near the speed of the PyTorch high-performance dataloader.
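For reference, the stress test is just a plain iteration loop over each loader with no model attached. A minimal sketch of such a throughput measurement (the loader itself is a stand-in for the real pipelines being compared):

```python
import time

def measure_throughput(loader, n_batches=100):
    """Iterate over a dataloader and report images/sec, with no model involved."""
    it = iter(loader)
    # Warm up one batch so worker/startup cost is not counted.
    next(it)
    start = time.perf_counter()
    n_images = 0
    for _ in range(n_batches):
        batch = next(it)
        n_images += len(batch[0])  # batch[0] is the image tensor of the (x, y) pair
    elapsed = time.perf_counter() - start
    return n_images / elapsed
```

Running this same function over both loaders with identical augmentations keeps the comparison apples to apples.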

I would appreciate any feedback and advice.



Best would be to use Python’s profiler to see where the time is being spent. fastai2 is using the same underlying PyTorch classes behind the scenes.
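One minimal way to do that with the standard-library profiler (a sketch; the loader here stands in for whichever dataloader you are measuring):

```python
import cProfile
import pstats

def profile_loader(loader, n_batches=50):
    """Profile one pass over a dataloader and print the hottest functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    for i, batch in enumerate(loader):
        if i >= n_batches:
            break
    profiler.disable()
    # Sort by cumulative time to surface where the batches are spending time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```

Running this on both loaders and diffing the top entries usually points at the slow stage (decoding, augmentation, collation, or worker handoff).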

Thanks for the answer.

The test I described above is the cleanest possible comparison I could think of:
just a stress test of the two dataloaders, without actual model training. The results are significant and consistent.

We of course encountered the problem first in real training:
on 8xV100 ImageNet training at input resolution 128 with standard augmentations,
with the PyTorch high-performance dataloader we reached 9500 img/sec;
with the fastai2 dataloader (all other elements exactly the same) we reached only 6800 img/sec.

We spent two days trying to enhance the fastai2 dataloader with prefetching and tuning its
inner parameters. Still, we could not match the PyTorch dataloader's throughput.

Since the dataloader is a multi-threaded CPU-GPU task, profiling it correctly with generic profilers is quite hard, especially on the cloud.


If you set num_workers to 0 then you can make it single-process. The easiest way to track down a perf difference like this, I think, is to find where it lies. So try removing the GPU and multiprocessing entirely. If you still see the perf diff, then profiling will work fine. If you don't see it, add the GPU back - profiling should still largely work OK.

You can also try setting the env var CUDA_LAUNCH_BLOCKING=1.

If single-process with GPU has the same perf, then the issue must be something to do with multiprocessing - which we would need to debug separately…
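Concretely, the isolation setup could look like this (a sketch; the commented-out loader construction is illustrative, not the exact API call):

```python
import os

# Make CUDA kernel launches synchronous so profiler timings are attributable
# to the right Python call. Must be set before the first CUDA call (or export
# it in the shell before launching Python instead).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Hypothetical: rebuild the fastai2 loader single-process to rule out
# multiprocessing as the cause of the perf gap.
# dls = ImageDataLoaders.from_folder(path, num_workers=0, pin_memory=False)
```

With num_workers=0 both loaders run in the main process, so any remaining gap shows up directly in an ordinary cProfile trace.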