Training time explodes when using smaller images and increasing batch size

Hi,
I have been playing around with pretraining a generator super-resolution model (to later use in a GAN) that takes images of resolution 64x64 and upscales them by a factor of 8 to 512x512.

Training on the DIV2K dataset was taking about 3.5 minutes per epoch, where my inputs to the network were the low-resolution (LR) images downscaled from the HR images.

I then tried reducing the input images (patches) to 8x8, i.e. shrinking the input spatial size by a factor of 64 (note that these patches are all precomputed on my SSD drive), and increased the batch_size by a factor of 64 to keep my GPU memory fully utilized.
However, the training time has exploded to 2.5 hours per epoch. Why?

I was expecting some overhead from loading many more 8x8 images from the SSD, whereas before each sample was a single 64x64 image, but shouldn't the multiple workers of the DataLoader hide this latency by prefetching the data?
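
For reference, this is roughly the kind of setup I mean (a simplified sketch, not my exact code: the dataset class, sizes and batch size here are just illustrative):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PatchDataset(Dataset):
    """Stand-in for my dataset of precomputed 8x8 LR / 64x64 HR patches stored on the SSD."""
    def __init__(self, n=100_000):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        # In reality each item is read from disk; random tensors just make the sketch runnable.
        return torch.rand(3, 8, 8), torch.rand(3, 64, 64)

orig_batch_size = 16  # illustrative value, not my real one

train_dl = DataLoader(
    PatchDataset(),
    batch_size=orig_batch_size * 64,  # scaled up by the same factor the patch area shrank
    shuffle=True,
    num_workers=8,        # background worker processes that load and collate batches
    pin_memory=True,      # page-locked memory for faster host-to-GPU copies
    prefetch_factor=2,    # batches each worker keeps ready ahead of the training loop
)
```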

Thanks!
Peppe

Unfortunately, I cannot tell what’s going on from your description.

You could try profiling the 3.5 min epoch run followed by the 2.5 hour epoch run with the SimpleProfilerCallback I created to see where the bottleneck is. (Perhaps pair it with the ShortEpochCallback so the latter doesn't run too long.)

If that doesn’t give enough information, there’s the PyTorch Profiler, which produces much more detailed output.
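
Something like this is usually enough to show whether the time is going into the data pipeline or the GPU work (a minimal, self-contained sketch; the model and DataLoader here are stand-ins, swap in your own):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.profiler import profile, ProfilerActivity

# Stand-ins so the sketch runs on its own -- replace with your real model and DataLoader.
model = nn.Sequential(nn.Upsample(scale_factor=8), nn.Conv2d(3, 3, 3, padding=1)).cuda()
train_dl = DataLoader(
    TensorDataset(torch.rand(2048, 3, 8, 8), torch.rand(2048, 3, 64, 64)),
    batch_size=256, num_workers=4,
)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (xb, yb) in enumerate(train_dl):
        if step >= 20:  # a short run is enough to spot a data-loading bottleneck
            break
        xb, yb = xb.cuda(non_blocking=True), yb.cuda(non_blocking=True)
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()

# If the DataLoader is the bottleneck, most of the time shows up in the data-fetching
# frames (enumerate/DataLoader) rather than in the CUDA kernels.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```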

Thanks, I’ll give it a try!