CPU bottleneck

After doing lesson 1, I wanted to try it on my own dataset. I tried it with around 160k labelled images I have from work. The images are around 1440x1080 pixels on disk (sometime larger sometime smaller)…

It is training, but rather slowly. CPU seems to be my bottleneck even though torch.cuda.is_available() returns true. So I assume that pytorch is using my GPU. I am on windows. The CPU is at 100% and the GPU is barely used:

I was wondering if there’s anything I can do to speed training? Maybe pre-process my images on disk in some way before training?

Any help would be welcome,



I don’t know how to help you but there’s this page about performance in the documentation, maybe you’ll find something there for you :slight_smile:

Have you tried:
to check if torch have cuda and cudnn support?

This often helps, even just resizing to the required input size.

1 Like

Reducing the image size to be around 300px on disk really helped.

Anyone got Pillow-SIMD working on windows?

1 Like

From my understanding image transforms are done on the CPU right now? @sgugger

I was wondering if doing the image transformation on the graphics card per batch before feeding the network would actually help?

On windows, even with images scalled to be 300px, the CPU seems the bottleneck.


I’m not sure doing transforms on the GPU is possible, at least I’m not aware of anyone doing that. GPUs are good at doing basic operations on matrices (and now also tensors with the new tensor cores), but I think most transforms can’t be reduced to simple matrix or tensor operation.

If that’s right, the only thing you can do is apply the tfms before and separately from the training with apply_tfms and then train on those transformed images.

Well, if you apply transforms in advance, that will reduce the randomness of the transformation unless there is no randomness involved in your transform, like resizing or padding.

I have the same issue here, coz for computer vision, image augmentation is inevitable and doing it with a CPU is always my bottleneck for training. In this case, a DL rig with 4 GPUs won’t make any sense. I really want to know how are people solving such problems!

My idea comes from the fact that most games are rendered using the GPU, textures are simply images. Those textures are applied to polygons, so they are zoomed, rotated etc etc. This is why I was thinking we could simply do something similar before feeding a batch in the training loop.

The big problem with doing data augmentation on the GPU is that you need to already have images of the same size to batch them together, which is very often not the case. Other than that, most the data augmentation can be adapted to be done on the GPU since in fastai it’s done on pytorch tensors.


So if we pre-processed our images on disk to be already all the same size. You could theoretically do everything on the GPU? This would be very very nice.

Oh good to know. So given images of the same size, what would it take to process the transforms on the GPU ?

@sgugger, are you saying that we can just “simply” put the “transforms pytorch tensors” on gpu, since fastai is just doing transforms on pytorch tensors on cpu? Preprocessing the image data into the same size is not a big issue at all…

You would need to adapt the current pipeline, but everything that is done could be done on a batch of images on the GPU, yes.


NVIDIA seems to have a tool that does image augmentation on the GPU:

1 Like

As you can see, I am facing the same problem.
In the Task Manager, you can change the metrics to show how much of Cuda power you using, as I did.
And it easy to see that my GPU working as well, but I have a CPU bottleneck.

@sgugger @PierreO have either of you succeeded in using GPU for augmentations with fastai v1?

Augmentation on the GPU is one of the main new features of v2.


Sounds good. However, I am working on a project using fastai v1 and am interested in using GPU augmentations to decrease training time. How would I go about doing that? To me, it isn’t clear how fastai is applying the augmentations. Is it happening through the DeviceDataLoader?

@sgugger I dug through the fastai source code to better figure out and I think I am misunderstanding something because to me it seems fastai should already be using GPU for transforms. Here’s what I mean:

Let’s start with a Learner object and I train it with the Learner.fit function. Now this goes to a separate fit function. We can see for the training that it iterates through learn.data.train_dl. learn.data.train_dl comes from the DataBunch that was originally passed into the Learner object. We can see that DataBunch.train_dl is created as a DeviceDataLoader object with device=self.deviceand self.device is set to defaults.device which is CUDA. Looking at DeviceDataLoader we see that iterating through the dataloader calls self.proc_batch, which puts the image onto the device (supposed to be CUDA), and applies the transforms.

So this means the transforms should be applied on the GPU, no? Did I miss something?