Hi, I’m curious about the usage of these new callbacks as well as their interaction with PeakMemMetric.
I first tried just calling learn.to_parallel() on an ml.p3.8xlarge EC2 instance (4 V100s). I assumed I should multiply my batch size by the number of GPUs, but that resulted in a CUDA OOM crash (the original bs worked fine on 1 GPU). I then tested with and without to_parallel at the same batch size and got identical epoch times and PeakMemMetric results. I’m running this through SageMaker, so it’s a bit awkward to query the GPUs with nvidia-smi, but it looks like only one GPU may actually be active. Anybody have thoughts or suggestions?
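For reference, here’s a stripped-down sketch of what I’m running (the dataset, architecture, and batch size are stand-ins for my actual SageMaker setup):

```python
from fastai.vision import *
from fastai.distributed import *                 # patches Learner with to_parallel()
from fastai.callbacks.mem import PeakMemMetric

path = untar_data(URLs.MNIST_SAMPLE)             # stand-in dataset for illustration
data = ImageDataBunch.from_folder(path, bs=64)   # original bs; bs=64*4 is what OOM'd
learn = cnn_learner(data, models.resnet34, metrics=accuracy,
                    callback_fns=PeakMemMetric)  # per-epoch CPU/GPU memory tracing
learn.to_parallel()                              # expecting this to spread the work over all 4 V100s
learn.fit_one_cycle(1)
```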
Also, what’s the expected behavior of PeakMemMetric when using parallel or distributed training?
I also came across an issue with to_distributed and unet_learner: given arbitrary input sizes for the images, the computation takes much longer than when a fixed image size is provided. Is this a common behaviour for the module? A sketch of the two setups I compared is below.
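Roughly what the fixed-size variant looks like (paths, codes, the mask-lookup function, the size, and the batch size are all placeholders for my real data; launched with `python -m torch.distributed.launch --nproc_per_node=4 train_seg.py`, which passes `--local_rank` to the script):

```python
import argparse
import torch
from fastai.vision import *
from fastai.distributed import *                 # patches Learner with to_distributed()

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path_img = Path('data/images')                   # placeholder: image folder
path_lbl = Path('data/labels')                   # placeholder: mask folder
codes    = ['background', 'object']              # placeholder class codes
get_y_fn = lambda x: path_lbl/f'{x.stem}_mask{x.suffix}'   # placeholder mask lookup

src = (SegmentationItemList.from_folder(path_img)
       .split_by_rand_pct(0.2)
       .label_from_func(get_y_fn, classes=codes))

# Fixed size: every batch has the same shape. The "arbitrary size" run simply
# omitted the size argument and kept the original image dimensions.
data = (src.transform(get_transforms(), size=256, tfm_y=True)
        .databunch(bs=8)
        .normalize(imagenet_stats))

learn = unet_learner(data, models.resnet34)
learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1)
```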