Training performance sometimes slow/jittery

Hi,

I just started going through the course and I’ve been finding that sometimes when I train, it doesn’t progress through the training data smoothly, but instead gets stuck for maybe a second every several examples (right now it seems to pause mostly every 8, i.e. at 3101, 3109, etc.).

Interestingly, when I watch nvidia-smi, the GPU utilization seems to drop to 0% in some readings as well, whereas when things are running smoothly (they aren’t right now) I think it stays consistently in the high 90s.

I’m using the Google Cloud recommended Standard Compute + Storage option, and otherwise things seem to work well.

Any idea what is causing this?

I should add that the other day when this was happening, I seemed to solve it by lowering the image size from 256 to 224. But I’m now continuing to train the same model/data/sizes as I was that day and getting jittery training again…

Sounds like there might be some other bottlenecks in the system, preventing the GPU from being fully utilized.

Since the training jitters every 8th batch, my guess is that the system you’re running has a quad-core CPU (8 threads). Try setting num_workers=6 (or anything less than 8) for your dataloader and see if that doesn’t improve things. I’ve experienced similar jittering to what you describe, and reducing the number of dataloader processes seems to work well for me.
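For reference, here’s a minimal sketch of what that looks like with a plain PyTorch DataLoader (the dataset below is just a stand-in for whatever Dataset/DataBlock you’re already using; with fastai I believe you can pass num_workers straight through to the .dataloaders(...) call):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset, only here to make the example runnable;
# substitute the Dataset / DataBlock you already train with.
train_ds = TensorDataset(torch.randn(1024, 3, 224, 224),
                         torch.randint(0, 10, (1024,)))

train_dl = DataLoader(
    train_ds,
    batch_size=64,
    shuffle=True,
    num_workers=6,    # fewer worker processes than the CPU has threads (8 here)
    pin_memory=True,  # slightly faster host-to-GPU copies
)
```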

I don’t know exactly why this is the case, but my guess is that some other system processes are running on one of the cores, causing that core to take longer to load the data, so the other dataloader processes have to wait for that one core to finish before the next set of batches can be loaded.


I have seen this too with my local GPU, CPU not saturated, minibatches taking variable amounts of wall time. Could it be caused by garbage collection in the GPU?

Another possible bottleneck could be the hard drive, as reading from a hard drive is relatively slow (compared to e.g. reading from memory). Reading the data from an SSD, or even better an NVMe drive, might help.

If you’re on Linux, you could check whether this is the issue by running something like sudo iotop in the terminal.
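If iotop is awkward to read, a rough alternative (assuming the psutil package is installed) is to watch the raw disk-read throughput from a second terminal while training runs, and see whether the pauses line up with changes in read speed:

```python
import time
import psutil  # assumption: installed via `pip install psutil`

# Print system-wide disk-read throughput once per second (stop with Ctrl-C).
prev = psutil.disk_io_counters()
while True:
    time.sleep(1)
    cur = psutil.disk_io_counters()
    read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
    print(f"disk reads: {read_mb:6.1f} MB/s")
    prev = cur
```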

So interestingly, setting num_workers=6 does not solve the problem but does make it jitter every 6 instead of every 8. So it’s definitely involved somehow.

What am I looking for in iotop?
IO% for all 6 processes seems to stay consistently around 70%, with Disk Read at about 400-500 K/s for each.

Might be worth trying num_workers=1 too.

After some more research, the results from iotop seem a bit difficult to interpret. If you have a faster disk available, the simplest check might be to read the data from the faster disk and see whether training speed increases.
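Another way to separate data loading from GPU work is to time a pass over the dataloader by itself (a sketch, assuming train_dl is the same dataloader you train with). If the per-batch times show the same periodic spikes with no GPU involved, the bottleneck is in loading/augmentation or the disk rather than the model:

```python
import time

# Time each batch coming out of the dataloader, without touching the GPU.
# 'train_dl' is assumed to be the DataLoader you normally train with.
times = []
start = time.perf_counter()
for i, batch in enumerate(train_dl):
    now = time.perf_counter()
    times.append(now - start)
    start = now
    if i >= 50:  # ~50 batches is usually enough to see the pattern
        break

times.sort()
print("median batch time:", times[len(times) // 2])
print("slowest 5 batches:", times[-5:])
```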

However, the IO column in iotop shows “the percentage of time the thread spent waiting on I/O”. So if this number is high for the python processes shown in the list, they’re spending a lot of time waiting for data to be read from the disk. As for what counts as high, and how much it would affect training speed, I can’t tell.