Hi, I’m curious about the usage of these new callbacks as well as their interaction with PeakMemMetric.
I first tried just calling learn.to_parallel() on an ml.p3.8xlarge EC2 instance (4 V100s). I assumed I should multiply my batch size by the number of GPUs, but that resulted in a CUDA OOM crash (the original batch size worked fine on 1 GPU). I then tested with and without to_parallel at the same batch size and got identical epoch times and PeakMemMetric results. I’m running this through SageMaker, so it’s a bit difficult to query the GPUs with nvidia-smi, but it seems that only 1 might be active. Anybody have thoughts/suggestions?
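Since nvidia-smi is awkward to reach from inside a SageMaker job, one workaround is to check GPU visibility and peak memory from within the training script itself using PyTorch's own CUDA API. This is just a minimal sketch (the function name report_gpus is my own, not part of fastai); it shows whether PyTorch actually sees all 4 devices and whether more than one has ever allocated memory:

```python
import torch

def report_gpus():
    """Print how many GPUs PyTorch can see and each device's peak
    allocated memory. If only cuda:0 shows nonzero peak memory after
    a training run, DataParallel is likely not spreading the work."""
    n = torch.cuda.device_count()
    print(f"visible GPUs: {n}")
    for i in range(n):
        peak_mb = torch.cuda.max_memory_allocated(i) / 1024 ** 2
        print(f"  cuda:{i} peak allocated: {peak_mb:.0f} MB")
    return n

if __name__ == "__main__":
    report_gpus()
```

Calling this after fit() (or from a callback) would at least confirm whether the other three V100s are ever touched, without needing shell access to the instance.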
Also, what’s the expected behavior of PeakMemMetric when using parallel or distributed training?
Edit: btw, testing with a