Unexplained long delay between batches with little/no CPU, GPU, or Disk activity

I’m training on a dataset of about 100,000 images and each batch is taking about 20 minutes on my machine. However, maybe only 3 of those 20 minutes involve heavy CPU, GPU, and/or disk activity. The rest of the time, the training progress bars are not moving and there’s little or no GPU, CPU, or disk activity, so I have no idea where the bottleneck is or what fastai is waiting on. I do, however, see slightly more network activity than usual during the delay, roughly 50–100 Kbps in both directions.

Oh, and I’m using Windows. It seems other Windows users are having this problem. I’ve also tried running it on a smaller subset of the data, around 1,000 images, and I still get an unexplained delay that’s maybe slightly shorter than on the full 100,000 images (I didn’t time it).

My machine has 2 GPUs: a very low-end 940M with 2 GB and a 1080 with 11 GB. Because of these unexplained delays, each epoch on the 1080 still takes maybe 80–95% as long as on the 940M, despite the 1080 being light years ahead of the 940M.

Can anyone move me one step closer to a resolution?


Have you tried setting num_workers=0 when creating the DataBunch? ImageDataBunch defaults to num_workers = number of CPUs. I know PyTorch has a lot of problems with multiprocessing on Windows, so it might be worth switching that off by setting it to zero. (With images, that probably means the loading/transforming on the CPU becomes the bottleneck instead, but at least that would be explainable :wink: )
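For reference, a minimal sketch of what that looks like in fastai v1 — the dataset path, folder names, and image size here are placeholders, not from the thread:

```python
from fastai.vision import ImageDataBunch, get_transforms

# num_workers=0 keeps all data loading in the main process, bypassing
# PyTorch's multiprocessing DataLoader workers (problematic on Windows).
data = ImageDataBunch.from_folder(
    'data/images',           # hypothetical dataset path
    train='train', valid='valid',
    ds_tfms=get_transforms(), size=224,
    num_workers=0,           # the workaround suggested above
)
```

The trade-off is that image decoding and augmentation now run serially in the training process, so CPU-side preprocessing may become the new bottleneck — but a visible, explainable one.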


Amazing! This increased my training speed like 5-10x (now training normally)!!

Update: this seems to be fixed for me for images and tabular data, but I’m still having the same problem on Windows when initializing a TextLMDataBunch.from_csv, even if I add num_workers=0.
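One possibility worth trying (an assumption on my part, not a fix confirmed in this thread): in fastai v1 the text tokenization step is parallelized separately from the DataLoader, via defaults.cpus, so num_workers=0 alone doesn’t make it single-process. A sketch, with a hypothetical folder and CSV name:

```python
from fastai.core import defaults
from fastai.text import TextLMDataBunch

# Tokenization in fastai v1 runs in its own pool of worker processes
# sized by defaults.cpus, independent of the DataLoader's num_workers.
# Forcing it to a single process as well may help on Windows:
defaults.cpus = 1

data_lm = TextLMDataBunch.from_csv(
    'data',        # hypothetical project folder
    'texts.csv',   # hypothetical CSV file
    num_workers=0,
)
```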

I’m using fastai 1.0.52 and PyTorch 1.1.0 (GPU build) for Windows on a GTX 1080.

Did you solve the problem? I have the same issue.

Unfortunately, no. However, I tried using Keras on the same computer and it, too, was suffering from similar unexplained delays. I also tried upgrading to CUDA 10; still no dice.

This is on my work computer which is running Windows 10 with a 1080.

I also have this issue on my personal laptop with a 960M, but it doesn’t seem as severe. Also running Windows 10.

I have the same issue with almost exactly the same configuration as you.
I am currently on fastai 1.5.0, CUDA 10, a 1080 Ti, Windows 10.
The pattern for me seems to be a long waiting period between training and validation (around 1 minute or more in the case of around 1,000 images), during which only one CPU core is pegged at 100%; then during the actual training all CPU cores go to 100%.
Adding num_workers=0 improved the training speed for all three DataBunch methods, and it is now performing normally, the same as identical hardware on Linux.
Judging from the performance, I suspect that some part of the data loading in fastai or PyTorch cannot automatically run in parallel worker processes on Windows, but I am not sure exactly.

Still hangs in 2020 …
No CPU core is busy, nor is the GPU.
No idea what it is waiting on or doing in the background.

The delay was gone when I switched from Windows 10 to Ubuntu 20.04.

Another reason why I use Google Colab or dual-boot into Linux. But unfortunately, I’m stuck with Windows on my work PC.