I am training two resnet34 models. Both use datasets loaded with
size = 224 and
bs = 64. I can't understand why the model trained on
data_two takes three times as long per epoch as the model trained on
data_one. I am using Google Cloud Platform with the default configuration (as described here).
Below are the batch_stats outputs for the two models.
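Since size and bs are the same for both models, the GPU work per batch should be nearly identical, so one thing worth checking is whether the difference comes from data loading (image decoding/resizing) rather than the GPU. Here is a minimal, hedged sketch of how you could time the first few batches of each dataloader; `data_one.train_dl` and `data_two.train_dl` are assumptions based on the fastai objects mentioned in the thread:

```python
import time

def time_batches(dl, n_batches=50):
    """Iterate over the first n_batches of a dataloader and return the
    average wall-clock seconds per batch. This captures image decode,
    resize, and collation time done by the loader's iterator."""
    start = time.perf_counter()
    count = 0
    for _batch in dl:
        count += 1
        if count >= n_batches:
            break
    elapsed = time.perf_counter() - start
    # Guard against an empty loader so we never divide by zero.
    return elapsed / max(count, 1)

# Hypothetical usage with the two datasets from this thread
# (requires fastai; attribute names assumed):
# print(time_batches(data_one.train_dl))
# print(time_batches(data_two.train_dl))
```

If data_two's loader is much slower per batch, the bottleneck is likely on the CPU/disk side (e.g. larger source images taking longer to decode), not the model itself.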
There's not much you can do about it. I think it's mostly GPU kernel overhead. It's possible that the test set in data_two is causing it; you could try removing it completely during training. You could also try running the models on data_two and data_one in two different notebooks.
@PalaashAgrawal Thanks. I tried that. Removing the test data does not make any difference, and I am already running them in two different notebooks. I forgot to mention that the images come from different domains - one is a cat-vs-dog image set and the other is a set of skin images.
I don't think the fact that one of them is a cat-vs-dog dataset and the other is a skin image dataset would affect training time much, if at all. But again, there's not much you can do, really! Try training the model on the CPU and see if it's any faster (by calling
learn.model.cpu()); sometimes, when there is a lot of GPU overhead, the CPU can actually be faster. Other than that, I really am out of ideas.