You’re 100% right about the overhead. If you narrow the transforms down to just the ones that would be applied to the validation set, you can usually save time there too, since they’re applied on the fly.
I proved this here in my most recent find. With a GPU I was able to get a resnet18 close to real time (~40 ms), and this can very easily be modified for CPU as well.
(For some other numbers: with a resnet18 on CPU and a single image, I got 74.7 ms using this method.)
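
For anyone who wants to try the idea, here is a minimal sketch in Python. It is not the code behind the numbers above. It assumes the image is already resized to the model's input size, that the model expects ImageNet-normalized input (the usual case for a pretrained resnet18), and it uses a hypothetical stand-in `model` so it runs with only NumPy installed. The names `preprocess`, `predict` and `median_latency_ms` are made up for illustration.

```python
import statistics
import time

import numpy as np

# ImageNet channel statistics, commonly used with pretrained ResNets.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image):
    """Validation-style preprocessing only: scale, normalize, reorder axes.

    Assumption: `image` is a uint8 array of shape (H, W, 3) that has
    already been resized, so no augmentation or resizing happens here.
    """
    x = image.astype(np.float32) / 255.0      # uint8 [0, 255] -> float [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # per-channel normalization
    x = np.transpose(x, (2, 0, 1))            # HWC -> CHW
    return x[np.newaxis, ...]                 # add a batch dimension of 1


def predict(image, model=lambda batch: float(batch.mean())):
    """Preprocess one image and run it through `model`.

    The default `model` is a hypothetical placeholder, not a real network;
    swap in your own forward pass.
    """
    return model(preprocess(image))


def median_latency_ms(fn, arg, warmup=3, runs=20):
    """Median wall-clock time of fn(arg) in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn(arg)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


if __name__ == "__main__":
    dummy = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
    print(f"{median_latency_ms(predict, dummy):.2f} ms per image")
```

If you time a real PyTorch model on a GPU this way, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously and the timing would otherwise be too optimistic.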