I am trying to build a top-speed training repository on top of the fastai2 core.
I am testing fastai2's new dataloader mechanism and comparing it to a high-performance PyTorch dataloader (with prefetching and fast collation).
In order to compare apples to apples, I am using exactly the same CPU augmentations.
With the PyTorch dataloader, I am getting a maximum speed of ~2800 img/sec (input resolution of 128).
With the fastai2 dataloader, I am getting significantly lower speeds, ~1700 img/sec.
I have tried adding prefetching, enabling/disabling memory pinning, and changing the number of workers in the fastai2 dataloader. Still, I could not get near the speed of the PyTorch high-performance dataloader.
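For reference, the PyTorch side of the comparison is configured roughly like this. This is a minimal sketch with a synthetic stand-in dataset; the batch size, worker count, and the `prefetch_factor`/`persistent_workers` options (which need PyTorch >= 1.7) are illustrative, not the exact values from our runs:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Synthetic stand-in for the real ImageNet-style dataset (placeholder)."""
    def __init__(self, n=10_000, size=128):
        self.n, self.size = n, size
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        # In the real benchmark this step is JPEG decode + the shared CPU augmentations.
        return torch.randn(3, self.size, self.size), 0

train_loader = DataLoader(
    RandomImages(),
    batch_size=256,
    shuffle=True,
    num_workers=8,            # tuned per machine
    pin_memory=True,          # page-locked host memory for faster host-to-GPU copies
    drop_last=True,
    prefetch_factor=4,        # batches prefetched per worker (PyTorch >= 1.7)
    persistent_workers=True,  # keep workers alive across epochs (PyTorch >= 1.7)
)
```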
The test I described above is the cleanest possible comparison I could think of:
just a stress test of the two dataloaders, without any actual model training. The results are significant and consistent.
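Concretely, the stress test is just a timed loop that drains a loader without touching any model. A minimal sketch, assuming the loader yields (images, labels) batches (e.g. the `train_loader` above):

```python
import time

def measure_throughput(loader, n_batches=100):
    """Drain the dataloader with no model attached and report images/sec."""
    n_images, start = 0, None
    for i, (xb, yb) in enumerate(loader):
        if i == 0:
            start = time.perf_counter()  # skip worker start-up in the timing
            continue
        n_images += xb.shape[0]
        if i >= n_batches:
            break
    return n_images / (time.perf_counter() - start)

if __name__ == "__main__":
    # Works for either loader as long as it yields (images, labels) batches.
    print(f"{measure_throughput(train_loader):.0f} img/sec")
```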
We of course encountered the problem first in real training:
on 8xV100 ImageNet training at an input resolution of 128 with standard augmentations,
with the PyTorch high-performance dataloader we reached 9500 img/sec,
while with the fastai2 dataloader (all other elements exactly the same) we reached only 6800 img/sec.
We spent two days trying to enhance the fastai2 dataloader with prefetching and playing with its
internal parameters. Still, we could not match the PyTorch dataloader's throughput.
Since the dataloader is a multi-threaded CPU-GPU task, profiling it correctly with generic profilers is quite hard, especially on the cloud.
If you set num_workers to 0 then you can make it single process. Easiest way to track down a perf difference like this, I think, is to find where it lies. So try removing the GPU and multi-threading entirely. If you still see the perf diff, then profiling will work fine. If you don’t see it, add GPU - profiling should still largely work ok. You can also try this:
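Purely as an illustration of the single-process approach (this is not the snippet referred to above), with num_workers=0 the whole pipeline runs in one process, so an ordinary profiler such as cProfile can attribute the time correctly. This sketch reuses the synthetic `RandomImages` dataset from the first example:

```python
import cProfile
import pstats

from torch.utils.data import DataLoader

# num_workers=0 keeps decoding and augmentation in the main process,
# so cProfile sees all of the per-sample work.
single_proc_loader = DataLoader(RandomImages(), batch_size=256, num_workers=0)

def drain(loader, n_batches=50):
    for i, _ in enumerate(loader):
        if i >= n_batches:
            break

cProfile.run("drain(single_proc_loader)", "loader_profile.prof")
pstats.Stats("loader_profile.prof").sort_stats("cumulative").print_stats(20)
```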