Help in reading nvidia-smi output


I wonder if I should add something to my code to make better or fuller use of my GPU (if that is possible).

This is a screenshot of nvidia-smi and nvidia-smi dmon while running learner.get_preds from a notebook on a bunch of 70K images. Aren’t there too many “0” values? It looks like the script is not using the full power of the GPU. Or is it?

Any feedback would be welcome :slight_smile:

Thank you


What is your batch size? It looks like you have plenty of room to increase it, since you have so much GPU memory available.

Also try looking at top to see how much CPU you are using.

And does your instance have an SSD or spinning disk?

Batch size is 64.
I can push it to 256 (I don’t have a memory issue), but it does not make get_preds run any faster.

My instance is on Google Cloud Platform:
n1-highmem-4 (4 vCPUs, 26 GB memory)
GPU 1 x NVIDIA Tesla P4
500 GB Standard persistent disk Google managed

Here is the screenshot of the top command.

Looks like you’re CPU-bound (all 4 cores are near 100% in that screenshot).


Doesn’t that mean my get_preds is running on the CPU instead of the GPU?

Probably not. Only certain parts of the code can run on the GPU.

Think of it like a funnel. Each layer of the funnel is only so wide, and the CPU feeds the GPU. If the CPU’s layer (its speed) is narrower than the GPU’s, then it doesn’t matter how big the GPU is: it won’t increase the flow of results coming out of the bottom, because your system is only as fast as its slowest part.

Hopefully this makes sense: (the blue is supposed to be your data)

If GPU usage is low and CPU usage is high, you’re probably CPU-bound. If GPU usage is high and CPU usage is low, you’re probably GPU-bound. If both are low, you’re probably disk-bound. And if both are high, you’re probably just right.
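The rule of thumb above can be sketched as a tiny throughput model: a steady-state pipeline runs at the rate of its slowest stage, so comparing per-stage rates tells you which resource is the bottleneck. The stage names and images/sec figures below are purely illustrative, not measurements from this thread.

```python
def bottleneck(stage_rates):
    """Given per-stage throughputs (items/sec) for a producer/consumer
    pipeline, return the slowest stage and the effective pipeline rate.
    The pipeline can never run faster than its narrowest stage."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]

# Illustrative numbers only (images/sec), not real measurements:
stages = {"disk_read": 900, "cpu_decode_and_augment": 250, "gpu_forward": 1400}
stage, rate = bottleneck(stages)
# CPU-bound: the effective rate is 250 images/sec and the GPU idles
```

Upgrading any stage other than the slowest one (e.g. a bigger GPU here) would not change the effective rate at all.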


Thank you very much for the help. Very clear explanation.

I changed the instance to 20 vCPUs, 130 GB memory.
And indeed it’s getting much better.



Is there a theoretical “perfect fit” configuration regarding the number of vCPUs and amount of CPU memory to pair with an NVIDIA Tesla P4 GPU?

Or should I just keep this 20 vCPU, 130 GB memory configuration?

This is interesting. When you were using 4 vCPUs, your memory utilization was about 8 GB; after upping the CPUs to 20, it has actually come down to ~4 GB. And while 16 out of 20 vCPUs are pegged, your GPU is still not being utilized all of the time.

I’d be interested in knowing what kind of processing can create this type of usage pattern. It seems to me that your CPUs are still working hard while the GPU isn’t doing much most of the time.

So either the way the work is being parceled out to the GPU isn’t giving it much to work with, or the bottleneck is somewhere else, i.e. in the pathway between the CPU and the GPU.

I’m not sure if there is a way to look for queueing on the PCIe lanes to see if there’s a backlog there.

What image size are you using? Looking at your numbers, I suspect the image resizing and augmentation operations are bottlenecking your CPUs. Generally you should not need a 20-thread process to get the most out of your GPU.

If you are using large raw images, one solution is to resize your images to the input size of the CNN (or slightly larger) in a separate preprocessing step.
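As a sketch of that preprocessing step (assuming Pillow is installed; src_dir, dst_dir, and the 128-pixel target are placeholders, and squashing to a square rather than center-cropping is just one possible choice):

```python
from pathlib import Path

from PIL import Image


def preresize(src_dir, dst_dir, size=128):
    """One-off preprocessing: shrink every image in src_dir to size x size
    and save it under dst_dir, so the training-time DataLoader only has to
    decode small files instead of large raw originals."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for p in sorted(Path(src_dir).iterdir()):
        if p.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        img = Image.open(p).convert("RGB")
        img = img.resize((size, size), Image.BILINEAR)
        img.save(dst / p.name)
```

You run this once over the dataset and then point your data loading at dst_dir, so the per-epoch CPU cost of decoding and resizing huge files disappears.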


If you tell me where to look for it I can give you any log you’d like :wink:

I am not sure anymore exactly which notebook I was running, but the image size was either 68 or 128,
with the default tfms = get_transforms().

Original image size: 1024x750.

You recommend resizing the 200K images of the dataset to 128 and saving them at 128 on disk?
(I wouldn’t have thought of that. Is that a known practice to optimize GPU usage? It seems weird.)

You should absolutely do this. This was a standard part of the fastai course in past versions.

To test this out, try resizing a few thousand images to 128, then see if there is an improvement in speed and GPU utilization.


Ah OK.
I will run that overnight then. Thank you :space_invader: